Sunday, 1 June 2014

Getting my dedicated AMD HD 7730M on Linux up to par with Windows


When I bought my laptop, which has an Intel HD 4000 and a dedicated AMD HD 7730M, the simplest way to get good battery life while still being able to use the dedicated card occasionally was to install Catalyst.

While on Windows the two cards mix well together (you have a tool to configure which card should be used for each application, and the dedicated card shuts itself down when it is not used), that was not the case on Linux. With Catalyst you had the choice between the HD 4000 only or the HD 7730M only, and if you changed the setting you had to restart X for the change to take effect.

The main issues were that Catalyst required specific parameters for the Intel DDX (no SNA), that Catalyst is difficult to install without problems, and that it added 30 seconds to my startup time. Also, when using the AMD card, tearing was everywhere and destroyed the experience.

My interpretation of these issues is that Catalyst probably uses some hacks to get access to the screen framebuffer of the Intel DDX, and copies what needs to be updated directly into it (without caring about the screen refresh, thus causing the tearing).

Not very happy with this situation, when I heard that the open source drivers were gaining the ability to power the dedicated card up and down on demand, and that DPM was working, I became optimistic that the open source drivers could give me a better experience.

Helping the open source drivers

Last year I had some free time to spend, and decided to help the open source stack get what I wanted. Currently, with DRI2 DRI_PRIME support, you can use the dedicated card for a specific application (and when the application closes, the dedicated card shuts down), but you still get tearing.
That's because the implementation is basically the same as Catalyst's: copying to the screen framebuffer without caring about the screen refresh.

To get rid of the tearing, I thought at the time that the best solution was to implement DRI_PRIME for Wayland, since Wayland compositors are supposed to be tear-free.
It took several months to correctly understand what needed to be done and how.
I finally wrote some patches, and even gave a talk at Fosdem! Unfortunately these patches didn't attract much review. The main reasons, I think, are the lack of testers and the fact that they need to be reviewed by Kristian Høgsberg to get merged into that part of Mesa, and he is very busy with other things (I also suspect he has to prioritize work from Intel employees, and the fact that Intel doesn't need DRI_PRIME is a factor. But those are only suppositions).

I then decided to see what could be done for GLX DRI3. After all, I had worked on XWayland in the meantime, and understood the technology behind DRI3.
When developing the patches (both for Wayland EGL and GLX DRI3), I made the assumption that we would eventually have dma-buf fences to help synchronize the reads and writes of the two cards.
Thus I chose a different approach than DRI2: instead of always writing to the same buffer, we write to several different buffers (as is done for DRI3 and Wayland by default).

The problem I anticipated was this: the buffer is cleared, then we write to it, and in the meantime the other card reads the buffer before it has finished being written. With dma-buf fences, the second card would simply wait. Until those exist, I thought we would use a hack like glFinish (i.e. waiting until rendering is done) before sharing the buffer, but that decreases performance. Alternatively, we could have used a nested Weston compositor to add some latency.

Fortunately this is not a big problem, because you don't clear the buffer: you write over its previous content. Imagine you use three buffers. You render into one, while the other ones may or may not be in use by the server (X or a Wayland compositor). When you 'send' a buffer (actually you are just telling the server you have finished rendering into it), the server 'releases' an older one (actually it just tells you it is done with it), and you reuse that 'old' buffer. When you write into it, it still contains the old content, which is one to three frames old. So if ever the server's card reads before you have finished writing, you see old content in the image: not as bad as black content!
In practice everything is fine for most applications, but for some of them the stale content can be a problem. We'll have to wait for dma-buf fences to make it 100% correct.
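The buffer rotation described above can be sketched in a few lines of Python. This is only a toy model with names of my own (not actual Mesa or Wayland code): it tracks how stale the content of a reused buffer is when the client starts writing into it. With a ring of three buffers and a server that holds on to buffers as long as it can, the reused content is three frames old, the worst case of the one-to-three-frame range.

```python
from collections import deque

def simulate_buffer_age(num_buffers=3, frames=10):
    """Toy model of DRI3/Wayland-style buffer rotation (not real driver code).

    The client renders into a free buffer, 'sends' it to the server,
    and the server 'releases' the oldest in-flight buffer back to the
    client.  Returns, for each frame, how many frames old the content
    of the buffer being reused was."""
    free = deque(range(num_buffers))          # buffers the client may write to
    in_flight = deque()                       # buffers the server may still read
    last_rendered = {b: None for b in range(num_buffers)}
    ages = []
    for frame in range(frames):
        buf = free.popleft()                  # reuse a released buffer
        if last_rendered[buf] is not None:
            ages.append(frame - last_rendered[buf])  # staleness of old content
        last_rendered[buf] = frame
        in_flight.append(buf)                 # 'send' it to the server
        if len(in_flight) == num_buffers:     # server releases the oldest one
            free.append(in_flight.popleft())
    return ages

print(simulate_buffer_age())  # prints [3, 3, 3, 3, 3, 3, 3]
```

In a real compositor the server can release buffers earlier, which makes the reused content fresher (one or two frames old), but it is never cleared to black.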

Now another problem: the integrated card can't read buffers in the dedicated card's memory, and it doesn't understand its 'tiling mode' (pixels are reordered in a special way for better performance), so you have to make the content available in RAM, in linear layout. In practice you have a tiled buffer in VRAM and a linear buffer in RAM, and you ask the dedicated card to copy the first buffer into the second (the card is clever and waits until rendering into the first buffer has finished before doing the copy). Between VRAM and RAM you have a PCI Express link.
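To make the tiling idea concrete, here is a simplified model (real GPU tiling modes are more elaborate, and this is my own illustration, not driver code): pixels are stored tile by tile in VRAM, and the copy produces the plain row-by-row (linear) layout that the integrated card understands.

```python
def tiled_to_linear(tiled, width, height, tile=2):
    """Convert a buffer stored tile-by-tile into linear (row-major) order.

    'tiled' lists the pixels tile after tile, each tile row-major
    inside itself: a simplified stand-in for what the dedicated card's
    copy engine does when resolving a tiled VRAM buffer into a linear
    buffer in RAM.  Width and height must be multiples of the tile size."""
    linear = [None] * (width * height)
    i = 0
    for ty in range(0, height, tile):          # walk tiles top to bottom,
        for tx in range(0, width, tile):       # then left to right
            for y in range(tile):              # pixels inside one tile
                for x in range(tile):
                    linear[(ty + y) * width + (tx + x)] = tiled[i]
                    i += 1
    return linear

# A 4x2 image cut into two 2x2 tiles, stored tile-first in 'VRAM' order:
tiled = ['a', 'b', 'e', 'f',   # left tile (rows: a b / e f)
         'c', 'd', 'g', 'h']   # right tile (rows: c d / g h)
print(tiled_to_linear(tiled, width=4, height=2))
```

The integrated card can then read the linear buffer directly; the real conversion is of course done by the GPU (compute units or, as discussed below, the DMA engine), not by the CPU.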

The naive way to do the copy is to feed the pixels through the card's compute units and ask them to execute the identity operation into the other buffer. Unfortunately, that is what the open source driver does on my card. Nouveau uses the 2D engine or the DMA engine, which does it faster. Thus I added some bits to Mesa to make it use the DMA engine of my radeonsi card (DMA = direct memory access: manipulating memory without the help of the CPU). My benchmarks show a performance boost for CPU-limited apps! With it, I get better performance than with DRI2 DRI_PRIME.

While writing the DRI_PRIME patches, I also wanted to add new abilities: being able, as on Windows, to tell the driver to use the radeonsi card for a specific app without setting the DRI_PRIME environment variable on the command line, and not having to run a command line tool (xrandr) at every boot to set it up.
I advise you to look at the patches for more details.
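The kind of per-application configuration I have in mind can be emulated today with a small wrapper script. This is purely my own illustration (the app list and function names are invented, not part of the patches): it keeps a set of applications that should run on the dedicated card and sets DRI_PRIME=1 in their environment before launching them.

```python
import os
import subprocess

# Invented per-user configuration: apps that should use the dedicated card.
DEDICATED_APPS = {"heaven_x64", "dolphin-emu"}

def build_env(app_name, base_env=None):
    """Return the environment to launch the app with.

    Setting DRI_PRIME=1 makes Mesa render the application on the
    dedicated GPU instead of the integrated one."""
    env = dict(base_env if base_env is not None else os.environ)
    if app_name in DEDICATED_APPS:
        env["DRI_PRIME"] = "1"
    return env

def launch(cmd):
    """Launch 'cmd' (an argv list) with the right card selected."""
    return subprocess.Popen(cmd, env=build_env(os.path.basename(cmd[0])))

print(build_env("heaven_x64", base_env={}))  # prints {'DRI_PRIME': '1'}
```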

In the meantime, I also figured out that if you run a composited environment, you shouldn't get tearing with DRI2 DRI_PRIME either, so that is not an added value of my patches. However, a lot of apps show black content with DRI2 DRI_PRIME: that is entirely solved by my patches!


Now, what is remaining to have something as good as on Windows? Better performance!

I have looked at what radeonsi is missing, and it looks like part of it is compiler optimizations and better Hyper-Z (some of its features are unsupported). I had the chance to test some new compiler optimizations coming to radeonsi. Here are some benchmarks:
"hyperz" refers to R600_DEBUG=hyperz.
"dma" refers to using the DMA engine for the copy from the buffer in VRAM to the buffer in RAM.
"si_scheduler" refers to the compiler optimizations I am referring to.

These apps don't benefit that much from the compiler optimizations, but benefit a lot from using the DMA engine for the copy.

It's the opposite for Unigine Heaven 4.0 (which seems particularly insensitive to CPU speed): I get a score of 203 with nothing, 206 with hyperz, 206 with hyperz+dma, and 216 with hyperz+dma+si_scheduler.
We are not far from Catalyst's score of 231! Perhaps with the PTE patches, which are not in the kernel I use, we could beat it?

The Dolphin emulator also has a lot of stuttering with radeonsi, but when I use the compiler optimizations, most of it is gone.

So overall I'm very happy. With all the recent improvements, my laptop on Linux should finally be up to par with Windows.

You can find the compiler optimizations here:

Sunday, 20 April 2014

Correcting some Wayland misconceptions

Wayland seems to have a lot of hype, and among the comments I see some misconceptions about it. The purpose of this blog post is to correct some of them, and to help better match expectations with what Wayland has to offer.

Some common misconceptions

  • X applications and games will be faster under Wayland

A fast explanation of how XWayland works:

The Wayland compositor handles inputs (mouse, etc.) and outputs (displays, etc.). It forwards the inputs of X apps to XWayland, which then delivers them to the client. When a client has to draw, it does so as it would under plain X, and XWayland shares the result with the Wayland compositor.

Thus there is no way an optimized XWayland can beat an optimized plain X on performance. However, thanks to compositing, you can have something tear-free.

To do efficient compositing, the buffer_age extension needs to be supported. It tells the compositor how long ago the buffer it is about to render into was last rendered, which helps the compositor determine the parts of the screen it doesn't need to repaint, since they weren't updated. Under X DRI2 this extension isn't supported, so the compositor either has to do full-screen repaints or has to guess which parts must be redrawn. Neither is optimal.

X DRI3+PRESENT should solve this. For Wayland, the buffer_age extension is already supported.
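The bookkeeping that buffer_age enables can be sketched as follows. This is a toy model with invented names (not code from any real compositor, and damage is a set of labels instead of rectangles): the compositor remembers the damage of recent frames, and a buffer that comes back with age N only needs the damage of the current frame plus the N-1 previous ones repainted; an unknown age forces a full repaint.

```python
from collections import deque

FULL_REPAINT = frozenset({"whole screen"})

class DamageTracker:
    """Toy model of buffer_age-based repainting (invented, simplified)."""

    def __init__(self, history=8):
        self.history = deque(maxlen=history)   # per-frame damage, oldest first

    def repaint_region(self, buffer_age, new_damage):
        """Region to repaint when reusing a buffer of the given age.

        Age 0 means the buffer content is unknown (e.g. a brand new
        buffer), so everything must be repainted."""
        if buffer_age == 0 or buffer_age - 1 > len(self.history):
            region = FULL_REPAINT
        else:
            region = set(new_damage)
            # The buffer missed everything drawn in the last age-1 frames.
            missed = list(self.history)[len(self.history) - (buffer_age - 1):]
            for past in missed:
                region |= past
        self.history.append(frozenset(new_damage))
        return region

tracker = DamageTracker()
tracker.repaint_region(0, {"clock"})                  # new buffer: full repaint
print(sorted(tracker.repaint_region(2, {"cursor"})))  # prints ['clock', 'cursor']
```

An X DRI2 compositor effectively has to treat every buffer as age 0, which is the full-repaint path above.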

As for fullscreen games, for which you likely want compositing to be bypassed, the CPU overhead of using XWayland+Wayland compared to just X (and similarly of X versus a native Wayland game) should have a negligible impact on your fps.

  • Boot time will be faster

Try launching X alone in a tty. You'll find that it starts pretty fast (about one second for me). The boot time difference from the move to Wayland should be negligible. (Also, you'll have to launch XWayland if you use X apps; XWayland launches faster, however.)

So what will Wayland bring you, then?
Let's investigate:

What you could expect

  • Better security

I'm not a security expert, so I won't explain the topic in depth. Inputs are more secure, and thanks to the wide use of dma-buf sharing, your outputs should be secure as well.

  • Better battery life

This point will highly depend on the compositor. With Wayland, the compositor has a lot of power in its hands. Most Wayland applications listen to the frame callback, a Wayland event that tells the client it's a good time to redraw if it wants to update something. The compositor could use that to reduce power consumption, by sending this event less often in clever situations. It could also reduce the cost of compositing with some tricks, etc. I don't expect very optimized behaviors to appear soon, but in the long term I expect a Wayland environment to have better battery life than an X environment.
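To make the idea concrete, here is a toy model (invented names, nothing from the real libwayland API): the client never redraws on its own, it waits for the compositor's frame event, so the compositor's policy directly controls how often the client burns cycles.

```python
def frames_rendered(refreshes, event_policy):
    """Count client redraws over a number of display refreshes.

    The client redraws only when the compositor fires the frame
    callback; 'event_policy(tick)' is the compositor deciding, at
    each refresh, whether to wake the client up."""
    return sum(1 for tick in range(refreshes) if event_policy(tick))

full_rate = frames_rendered(60, lambda t: True)        # event at every refresh
throttled = frames_rendered(60, lambda t: t % 4 == 0)  # 1/4 rate to save power
print(full_rate, throttled)  # prints 60 15
```

A timer-driven X client would render all 60 frames regardless; here the same client, unmodified, renders only as often as the compositor finds useful.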

  • Performance corner cases better handled

Some operations are known to be faster on the CPU, others faster on the GPU.

X has a rendering API which abstracts what you are rendering on. Some of these rendering operations are faster on a GPU, while others are better done on the CPU (and it highly depends on how the DDX was written). This can lead to strange situations.

Let's take OpenTTD, one of my favorite games. It renders the whole game with the CPU, then lets SDL (1.2) put the result on the screen, indicating which parts of the screen were updated.

If your DDX uses the GPU to render, you are much better off uploading the whole buffer instead of each small updated part; however, if it uses the CPU, you want only the small updated parts to be uploaded.

SDL2 has some detection code to figure out what you are running on, and uploads the whole buffer on the systems where that is better. SDL 1.2 doesn't make that distinction: it uses XShmPutImage on each damaged part to upload the result.

An XShmPutImage operation consists of creating a pixmap, uploading the data to it, and copying the pixmap into a subpart of the destination. With Glamor it is really, really slow: for each XShmPutImage, it converts the whole buffer into a GL texture. At the next XShmPutImage, it cannot know whether this CPU buffer was updated in the meantime, so it has to convert the whole buffer into a texture again. That makes OpenTTD unplayable with Glamor. With anything other than Glamor, XShmPutImage is much faster. This illustrates the problem: writing an efficient DDX is very complex, since you have to support operations that are not very fast for your device.
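A back-of-the-envelope model shows why this hurts so much (the numbers are illustrative, not measured Glamor code paths): if the DDX reconverts the full buffer on every damaged rectangle, the pixels touched per frame explode compared to uploading only the damage.

```python
def pixels_touched(screen_w, screen_h, damage_rects, reconvert_whole_buffer):
    """Pixels processed to get one frame's damage on screen.

    damage_rects: list of (width, height) damaged rectangles.
    reconvert_whole_buffer=True models a DDX that converts the full
    buffer on every XShmPutImage call (the Glamor behavior described
    above); False models uploading only each damaged rectangle."""
    if reconvert_whole_buffer:
        return screen_w * screen_h * len(damage_rects)
    return sum(w * h for w, h in damage_rects)

# An OpenTTD-like frame: a 1920x1080 screen with three small dirty rectangles.
damage = [(64, 64), (128, 32), (256, 16)]
print(pixels_touched(1920, 1080, damage, reconvert_whole_buffer=True))   # 6220800
print(pixels_touched(1920, 1080, damage, reconvert_whole_buffer=False))  # 12288
```

In this example the full-reconversion path touches roughly 500 times more pixels per frame, every frame, which matches the "unplayable" experience described above.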

The same situation on Wayland is simpler: you differentiate buffers on the GPU (you can use EGL to get/send them, or VA-API will create one, etc.) from buffers on the CPU (shm buffers). They are abstracted into a wl_buffer, which is a buffer the compositor is aware of. The API handles wl_buffers.

Since you have rendered everything on the CPU, you use an shm buffer to send the result to the compositor (and you indicate to it which parts were updated). Whatever the Wayland compositor uses (CPU or GPU), it will choose the best way to get the result on screen.

Wayland also has a subsurface mechanism, which lets you mix different wl_buffers in your app: you can have video content accelerated with GL, while the GUI is rendered with the CPU.

  • Fewer glitches

Have you ever seen an application cause problems or glitches for the whole desktop? For example, an application setting itself fullscreen at a low resolution, then crashing? This should hopefully go away with Wayland, since applications do not control the resolution: they just kindly tell the compositor that they would prefer a specific resolution. If the client dies, the compositor can automatically go back to the previous resolution. This is one example among others.

  • Better user experience

This depends a lot on what the desktop environments do, but potentially the user experience can be improved with more user control, better visuals, etc.

There are many other topics that could be discussed.
I don't claim to hold the truth, and I can't see the future.
I hope this discussion has helped you better understand what Wayland can improve for you. I wish Wayland the best of success.