GSOC 2019 – M. Stoeckl’s website

GSOC 2019 – M. Stoeckl’s website

2019-06-20: The rate of additions to the code has started to slow, because waypipe already has most of the features it would ever need. The most significant change for the preceding week was making most build dependencies optional: only a core of libffi, libwayland, and wayland-protocols are required. (librt and pthreads are already dependencies of libwayland.) There has been a slight speedup to the damage merging algorithm, but on further reflection the whole “extended interval” construction may limit the performance of the buffer diff construction and application procedures. Instead, the horizontally banded damage tracking data structure underlying pixman may make it easier to ensure that buffers are scanned monotonically — or it may prove to be unavoidably slow to construct, in the worst case. The next most notable change was the introduction of a headless test, checking that when run with “headless” weston that applications do not crash when run indirectly using waypipe. All the remaining changes are essentially bug fixes and small expansions of the previous multithreading and video encoding work.

Performance testing, on the other hand, has had a few interesting results. Using perf to trace both the USDT probes mentioned earlier, and scheduler context switches, makes it easy to find out why a given data transfer takes the time that it does. I’ve written a small script that generates timeline plots from the perf script output. The following image shows a (zoomed out) 120 second long plot, revealing a number of different interaction patterns in a sample program:

Zooming in a bit shows a brief 0.3 second chart of waypipe operations. The 5 rows, from bottom to top, are a waypipe server instance, followed by three associated worker threads, and with a waypipe client instance (connected via ssh) at the top. Overlapping time intervals are nested, so that they remain visibie. Gray intervals indicate the range of times during which the program is scheduled to run; orange, the time range needed for the waypipe client to read out the full transfer contents; green, the time range needed for the waypipe server to send the data transfer; and red, the time spent by worker threads compressing the buffer changed records needed by the following data transfer operation.

Most of the time is spent reading and writing the data transfers to the ssh connection. While these are ongoing, no other work is performed. A possible optimization would be to switch to a “streaming” data transfer model, in which the write operation is run in parallel with buffer diff compression, sending data as soon as it is compressed. The other side of the connection would perform streaming decompression, probably still with a single core. While this application only updates a single image buffer at a time, for applications which maintain multiple windows, transfer latency can be reduced by writing each window’s buffer change transcript as soon as it is available, instead of bundling everything into a single large transfer.

The red intervals corresponding to the compress_buffer set of USDT probes often differ significantly in duration; dynamically adjusting the workload between threads to be more even may offer significant latency reduction.

Finally, looking at microsecond-scale details, we can see that during a transfer between matching waypipe instances, the receiving program makes a large number of context switches to periodically check for (and one time, handle) new events from either pipes or a connected Wayland compositor or application, and then resume the nonblocking read operation. Given the number of context switches, it may be more efficient — even on single-core computers — to create separate threads that only blocking read/write to the ssh connection, and notify the main thread on completion; but on the other hand, this may interfere with other optimizations, and would introduce a minimum of two context additional switches, and possibly cross-cache data migration as well. Sadly aio is oriented around file operations.

2019-06-13: The last week has been rather busy, introducing minimal support for:

  • DMABUFs with auxiliary plane data, like the XRGB8888 format with the i915 modifier Y_TILED_CCS. Unfortunately, this format is relatively new and CPU access via the GBM API is rather slow.
  • Compression of large data transfers with either ZStd or LZ4. Compression ratios for the two vary from 30x for some text and whitespace heavy images, to only 1.5x for mildly noisy image data.
  • Lossy transfer of linear XRGB8888 dmabufs with h264, a video format with very fast encoders and decoders which can be configured to run with a limited bandwidth. While there are visible artifacts, even factor 50 compression is often usable. It should be noted that the image quality is, at the moment, slightly reduced since each image buffer has its own attached video stream, and the application alternates between buffers.
  • The wlr-export-dmabuf protocol, which is used by a screen recording tool.
  • USDT (static tracepoints) to permit efficient profiling of key operations, using perf or a number of other profiling tools.

More recently, waypipe has gained a more detailed damage tracking mechanism for shared-memory images (and potentially dmabufs with a linear layout.)

On the left, the above picture shows a collection of damage rectangles superimposed on a grid which contains the pixels (or bytes, or uint32_t‘s) in a shared memory buffer, ordered from left to right and then top to bottom. Because the diff construction routine used to determine which bytes have actually changed, and then copy out the changed bytes, is optimized to scan specific intervals, waypipe converts the set of damage rectangles to a set of disjoint intervals in the linear memory space. Because converting a 400 by 600 pixel rectangle into 600 distinct intervals would be a waste of memory, the conversion routine stores such rectangular units as “extended intervals”, collections of regular intervals of the same width, with starting points separated by multiples of the image stride. Finally, to limit the total number of rectangles produced, and to avoid the overhead from tracking intervals of changed bytes from exceeding the size of the actual buffer contents, intervals are merged together if the minimum gap between the two — in the linear memory space — is less than or equal to a prespecified constant margin. In some cases, this heuristic can require merging rectangles which are on opposite sides of the image, but nevertheless close in memory. waypipe currently implements this transformation with a naive O(n^2) approach; much lower asymptotic runtimes are possible in theory, and may be mentioned here in the future.

2019-05-29: waypipe is mostly usable now – it works with reasonably low overhead for mostly-static GUI applications like kwrite and libreoffice, although it will probably crash and leak memory every now and then.

I’ve implemented a Wayland protocol parser for waypipe, and now use it to track the ownership and lifetime of wl_shm_pool buffers. An especially annoying detail of the Wayland protocol is that one cannot determine which passed file descriptor corresponds to which message without parsing the message. (This also makes proper handling of inert objects more complicated.) At the very least, the protocol makes it possible to identify message boundaries in the byte stream by including a 16-bit length field. Since protocol messages are effectively limited to <4096 bytes by libwayland fiat, and there are ≪1024 requests or events per interface in practice, a simple way to fix this issue would be to partition the second word of a wayland protocol header into a 12 bit byte-stream length field, 10 bit message id field, and 10 bit file-descriptor-stream length field. (12/16/4 also works, and can be made backwards-compatible.)

Another very useful change has been to make waypipe‘s connections to the compositor or application and to the matching waypipe instance nonblocking. As long as waypipe itself is not stuck with CPU-intensive computations, messages from the wayland client to the wayland server do not interfere/synchronize with messages moving in the other direction. This fixes a key repeat issue with waypipe from earlier, in which key repeat events were delayed by large screen updates. As networks can still be unreliable, it may still be useful to have waypipe modify messages to disable key repeat.

A useful trick when testing waypipe is to artificially adjust network parameters. This can be done with Linux traffic control tools, such as NetEm. For example, sudo tc qdisc add dev lo root netem rate 1000kbit will throttle bandwidth through the loopback interface, and sudo tc qdisc add dev lo root netem delay 100ms will add 100ms of latency.

When running games over waypipe, FPS often drops by a factor of 2 or more, as frames are delayed by the round trip time for the frame display callback. A workaround is to run the game nested inside weston; the compositor updates the game at e.g. 60 fps, and sends over a subset of the rendered frames. Depending on how the game loop is written, this trades a very slow game for one in which frames are dropped.

It turns out that most toolkits and applications already work when restricted to protocols whose underlying file descriptors waypipe can successfully translate. I’ve modified waypipe to filter out the compositor messages advertising the availability of zwp_linux_dmabuf_v1 and wl_drm; see the protocol-aware branch.

I’ve tested the following programs (with ssh compression enabled, and the appropriate environment variables set to use wayland instead of X11):

  • Alacritty (winit), kitty (GLFW), Termite (GTK3), weston-terminal (GTK3), Konsole (Qt5)
  • Firefox (mostly GTK3), eventually crashed waypipe, fix in progress
  • SuperTux (SDL2), SuperTuxKart (no toolkit)
  • VLC (mostly Qt5), eventually crashed sway by committing an xdg_surface with no role

Note that SuperTuxKart uses OpenGL 3.3, albeit via LLVMpipe, on the remote system. Due to current bandwidth and computational limitations, for a 1024×768 window, it only runs at about 2 FPS.

It is always useful to consider what the best possible performance would be. My test system, over WiFi, manages a 1MB/s transfer rate. A 2560×1440 screen, with 24 bits per pixel, and 60Hz refresh rate, displays 663MB/s of data. However, if the underlying data is for a 3d game, or some program with a scrolling viewport, then video bandwidth estimates should apply; one might only need 3 MB/s instead, which most modern networks, computers and hardware video decoders can handle. (For comparison, with a text-heavy website, losslessly compressing each frame as a PNG produces a transfer rate of about 25MB/s.) Of course, even if enough bandwidth is available, CPU usage and processing delay is still an issue. Furthermore, by default, the wl_surface interface uses a frame request to set up callbacks so that typical applications only provide a new frame sometime after the old frame has arrived at the compositor. This helps avoid computing unused frames, but introduces a round-trip latency between successive frames that is increased by any delays in waypipe itself.

The minor issues mentioned last time have been fixed. The crash with weston-terminal was caused by a pty file descriptor from forkpty, which was both readable and writable, unlike pipe ends on Linux, which are only one of the two. A simple flag-setting signal handler for SIGINT now controls main loop termination for waypipe client and server processes, and permits cleanup of the Unix domain socket address files. A no-op handler for SIGCHLD is now set, and the flag SA_RESTART ensures that only the poll system calls are interrupted when child processes end. The ssh reverse tunnel closing issue was caused by the main client process not closeing the socket connection file descriptors.

At the moment, the performance of waypipe can be significantly improved by adding the -C option to enable compression to ssh. For example, try waypipe ssh -C [email protected] weston-stacking, and fullscreen the window. On one of my systems, over a slow connection, enabling compression reduces the time to draw the updated frame from about 10 seconds to 1.

Still working on protocol-agnostic improvements, I have added translation of pipe operations to the prototype. These are required for wl_data_offer and primary_selection, which are used by copy/paste operations. An additional complication is that both protocols rely on the write end of the shared pipe being closed, to signify the completion of a data transfer. Fortunately, poll provides a relatively clean way to detect pipe closures, with the POLLHUP event flag. (At the moment, weston-terminal crashes on paste thanks to a non-S_ISFIFO fd; termite appears to work.)

There’s not much more that can be done with a proxy that doesn’t parse/modify protocol messages, so the first main work period (2019-05-27+) will probably be dedicated to that. Also, it looks like direct forwarding of e.g. wl_drm will not be possible, because the protocol sends a file descriptor for a DRM device, and it does not seem as though one cannot respond to the ioctl calls used to manipulate the device from userspace. It may also be possible (but very complicated) to translate protocol between wl_shm locally and e.g. linux_dmabuf+linux_explicit_synchronization remotely, since, to the extent that the compositor needs, the products of both are abstracted through wl_buffer.

I have also updated the command line interface for waypipe, because manually setting socket paths is inconvenient. If waypipe is correctly installed (visible in $PATH) on both a local and a remote system, then prefixing e.g. ssh -C [email protected] with waypipe to produce waypipe ssh -C [email protected] will automatically launch a waypipe client, add reverse tunneling flags, force pseudo-terminal allocation (-t), remotely run a waypipe server which runs either the default $SHELL or whatever program was requested, and close the client when ssh completes.

There are a few minor annoyances with the interface at the moment: ssh doesn’t always exit cleanly, because it appears the reverse tunnel fails to close cleanly; waypipe litters socket bind paths in /tmp; and it usually takes a second to close a waypipe ssh session, because I’ve yet to set up signal handling and currently poll waitpid. (Unfortunately, signalfd is Linux-only, and pdfork is FreeBSD-only.)

It lives, it moves!

I wrote a short proof of concept implementation of waypipe, which only supports shared memory buffers, and has about as unoptimized a change-detection and transfer mechanism as I can manage. Change-detection is performed by maintaining a twin copy; for transfer, waypipe sends over the entire buffer if any part changed. The program works, but lags quite horribly when I run weston-terminal over a WiFi network. For the code, see the protocol-agnostic branch, but do beware that it is research-quality (written with no particular care for clarity/maintenance).

Using wf-recorder, I recorded a short video of the proof of concept in action. Startup takes a while, and there is visible latency tracking mouse hover over even a small popup. When typing, with standard key repeat settings on the compositor side, one must be very careful to time keypresses so as to queue up the meaningful keystrokes when the underlying channel is hanging, lest non-idempotent actions be repeated.

In other news, by running buggy test clients, I’ve managed to crash sway often enough that I’ll try to resolve the underlying issue sometime this week. On that note, I had been wondering why inert objects produce no errors, until reading this document.

Task: implement a tool (henceforth called waypipe) which can be used to rely both messages and data between any Wayland client and compositor over a single transport channel. This should enable workflows similar to those using ssh -X.

The main difficulty in producing such a tool is that Wayland protocol messages primarily include control information, and the large data transfers of graphical applications are implemented through shared memory. waypipe must then identify and serialize changes to the shared memory buffers into messages to be transferred over a socket.

The above project will be funded via Google Summer of Code 2019.

Home RSS