Sunday, 1 June 2014

Getting my dedicated AMD HD 7730M on Linux up to par with Windows

Motivations


When I bought my laptop, which has an Intel HD 4000 and a dedicated AMD HD 7730M, the simplest way to get good battery life while still being able to use the dedicated card occasionally was to install Catalyst.

On Windows the two cards work well together: a tool lets you choose which card should be used for recently launched applications, and the dedicated card shuts itself down when it is not in use. This was not the case on Linux. With Catalyst you had the choice between the HD 4000 only or the HD 7730M only, and if you changed the setting you had to restart X for it to take effect.

The main issues were that Catalyst required specific parameters for the Intel DDX (no SNA), was difficult to install without problems, and added 30 seconds to my startup time. Also, when using the AMD card, tearing was everywhere and ruined the experience.

My interpretation of these issues is that Catalyst probably uses some hacks to get access to the screen framebuffer from the Intel DDX and copies whatever needs to be updated directly into it, without caring about the screen refresh, which causes the tearing.

Not very happy with this situation, when I heard that the open source drivers were gaining the ability to power the dedicated card up and down on demand, and that DPM was working, I became optimistic that I could use the open source driver for a better experience.

Helping the open source driver


Last year I had some free time to spend and decided to help the open source drivers reach what I wanted. Currently, with DRI2 DRI_PRIME support, you can use the dedicated card for a specific application (and when the application closes, the dedicated card shuts down), but you still get tearing.
That's because of the way it is implemented: basically it is the same as Catalyst, copying to the screen framebuffer without caring about the screen refresh.
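
For reference, here is how an application ends up on the dedicated card today: DRI_PRIME just has to be in the environment before the GL context is created. It is usually set in the shell when launching the program; setting it from C as in the sketch below is only an illustration, assuming Mesa reads the variable during initialization (which its loader does).

    #include <stdlib.h>

    int main(void)
    {
        /* Ask Mesa's loader to use the dedicated GPU for this process.
         * Equivalent to launching the program with "DRI_PRIME=1 ./app".
         * Must be set before any GLX/EGL initialization. */
        setenv("DRI_PRIME", "1", 1);

        /* ... create the GL context and render as usual ... */
        return 0;
    }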

To get rid of the tearing, I thought at the time that the best solution was to implement DRI_PRIME for Wayland, since Wayland compositors are supposed to be tear-free.
It took several months to understand correctly what needed to be done and how.
Finally I wrote some patches and even gave a talk at FOSDEM! Unfortunately these patches didn't attract many reviews. The main reasons, I think, are the lack of testers and the fact that they need to be reviewed by Kristian Høgsberg to be merged into that part of Mesa, and he is very busy with other things (I suspect he has to prioritize Intel employees, and the fact that Intel doesn't need DRI_PRIME is probably also a factor, but those are only suppositions).

I then decided to see what could be done for GLX DRI3. After all, I had worked on XWayland in the meantime and understood the technology behind DRI3.
When developing the patches (both for Wayland EGL and GLX DRI3), I made the assumption that we would have dma-buf fences to help synchronize the reads and writes of the two cards.
Thus I chose a different approach than DRI2: instead of always writing to the same buffer, we write to different buffers (as is done for DRI3 and Wayland by default).

The problem I anticipated was: if the buffer is cleared, then we write to it, and meanwhile the other card reads the buffer before the write has finished. With dma-buf fences, the second card would simply wait. I thought that until then we would use a hack, like glFinish (i.e. waiting until rendering is done) before sharing the buffer, but that decreases performance. Alternatively, we could have used a nested Weston compositor to add some latency.
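
To illustrate the hack, this is roughly what it amounts to on the GL side. This is a simplified sketch of my own, not code from the patches; render_and_share_frame is a made-up name.

    #include <GL/gl.h>

    /* Made-up helper: render one frame, then stall until the GPU has
     * really finished before the buffer is handed over to the other card. */
    void render_and_share_frame(void)
    {
        /* ... issue the draw calls for this frame ... */

        /* Without dma-buf fences, the integrated card cannot know when
         * the dedicated card has finished rendering, so we block here.
         * This is the hack that costs performance. */
        glFinish();

        /* ... now the buffer can be shared and presented safely ... */
    }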

Fortunately this is not a big problem, because you don't clear the buffer! You write over its previous content. Imagine you use 3 buffers. You render into one, while the others may or may not be in use by the server (X or the Wayland compositor). When you 'send' a buffer (actually you just tell the server you have rendered into it), the server 'releases' an older one (actually it just tells you it is done with it), and you reuse this 'old' buffer. So when you write into it, it still contains the old content, which is 1 to 3 frames old. So if ever the server's card reads it before you have finished writing, you see old content in the image: not as bad as black content!
In practice everything is fine for most applications, but for some the old content may be a problem. We'll have to wait for dma-buf fences to make it 100% correct.
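
To make the buffer reuse concrete, here is a small toy model of the idea (my own simplification, not the actual Mesa code): a ring of three buffers where the buffer the server just released gets rendered into again, so just before the new frame lands in it, it still holds content that is a few frames old rather than black.

    #include <stdio.h>

    #define NUM_BUFFERS 3

    /* Toy model of the swap chain: buffers[i] stores the number of the
     * last frame rendered into buffer i (-1 means never rendered). */
    int main(void)
    {
        int buffers[NUM_BUFFERS] = { -1, -1, -1 };

        for (int frame = 0; frame < 10; frame++) {
            int idx = frame % NUM_BUFFERS; /* buffer released by the server */

            /* Before we overwrite it, the buffer still holds its old
             * content, rendered NUM_BUFFERS frames ago (if any). */
            if (buffers[idx] >= 0)
                printf("frame %d: buffer %d still shows frame %d (%d frames old)\n",
                       frame, idx, buffers[idx], frame - buffers[idx]);

            buffers[idx] = frame; /* 'render' the new frame into it */
            /* ... tell the server we rendered; it will release another one ... */
        }
        return 0;
    }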

Now another problem: the integrated card can't read buffers located in the dedicated card's memory, and it doesn't understand its 'tiling mode' (pixels are ordered in a special way for better performance), so you have to make the content available in RAM in linear mode. In practice you have a tiled buffer in VRAM and an untiled buffer in RAM, and you ask the dedicated card to copy the first buffer into the second one (the card is clever and waits until the rendering of the first buffer is finished before doing the copy). Between VRAM and RAM there is a PCI Express link.
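
As an illustration of the tiling problem, here is a deliberately simplified toy showing what 'tiled versus linear' means and what the copy has to do. The 4x4 tile size and the layout below are made up for the example; real radeonsi tiling modes are much more involved.

    #include <stdint.h>
    #include <stdio.h>

    #define TILE   4
    #define WIDTH  8
    #define HEIGHT 8

    /* Offset of pixel (x, y) in our toy tiled layout: the image is cut
     * into TILE x TILE blocks stored one after another, and pixels are
     * linear inside each block. */
    static size_t tiled_offset(size_t x, size_t y)
    {
        size_t tiles_per_row = WIDTH / TILE;
        size_t tile_index = (y / TILE) * tiles_per_row + (x / TILE);
        return tile_index * TILE * TILE + (y % TILE) * TILE + (x % TILE);
    }

    int main(void)
    {
        uint8_t vram_tiled[WIDTH * HEIGHT];  /* stands in for the VRAM buffer */
        uint8_t ram_linear[WIDTH * HEIGHT];  /* stands in for the RAM buffer  */

        /* Pretend the dedicated card rendered something into the tiled buffer. */
        for (size_t y = 0; y < HEIGHT; y++)
            for (size_t x = 0; x < WIDTH; x++)
                vram_tiled[tiled_offset(x, y)] = (uint8_t)(x + y);

        /* The copy the integrated card needs: de-tile into linear order.
         * On real hardware the dedicated card performs this copy,
         * ideally through its DMA engine rather than its shaders. */
        for (size_t y = 0; y < HEIGHT; y++)
            for (size_t x = 0; x < WIDTH; x++)
                ram_linear[y * WIDTH + x] = vram_tiled[tiled_offset(x, y)];

        printf("pixel (5, 5): tiled offset %zu, linear offset %zu\n",
               tiled_offset(5, 5), (size_t)(5 * WIDTH + 5));
        return 0;
    }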

The not-so-clever way to do the copy is to feed the pixels into the card's compute units and ask them to execute the identity operation to fill the other buffer. Unfortunately, that is what is done on my card with the open source driver. Nouveau uses the 2D engine or the DMA engine, which is faster. So I added some bits to Mesa to make it use the DMA engine of my radeonsi card (DMA = direct memory access: it moves memory around without the help of the CPU). My benchmarks show a performance boost for CPU-limited apps! With it, I get better performance than with DRI2 DRI_PRIME.
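
Schematically, the change boils down to preferring the asynchronous copy engine over a shader blit when the shared buffer has to be copied. The sketch below uses made-up names and trivial stubs; the real logic lives in Mesa's radeonsi driver and looks quite different.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy sketch of the copy-path decision, with made-up names. */
    struct buffer {
        bool tiled; /* tiled VRAM buffer vs. linear RAM buffer */
    };

    /* Stand-in for asking the hardware whether its DMA engine can handle
     * this (source layout, destination layout) combination. */
    static bool dma_engine_supports(const struct buffer *src,
                                    const struct buffer *dst)
    {
        (void)src; (void)dst;
        return true; /* assume yes in this toy */
    }

    static void dma_engine_copy(void)  { puts("copy via DMA engine (async, no shaders)"); }
    static void shader_blit_copy(void) { puts("copy via identity blit on the compute units"); }

    /* Prefer the DMA engine when possible, fall back to a shader blit. */
    static void copy_shared_buffer(const struct buffer *src, const struct buffer *dst)
    {
        if (dma_engine_supports(src, dst))
            dma_engine_copy();
        else
            shader_blit_copy();
    }

    int main(void)
    {
        struct buffer vram = { .tiled = true };
        struct buffer ram  = { .tiled = false };
        copy_shared_buffer(&vram, &ram);
        return 0;
    }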

Also, when writing the DRI_PRIME patches, I wanted to add new abilities: being able, like on Windows, to tell the driver to use the radeonsi card for a specific application without using the DRI_PRIME environment variable on the command line, and not having to run a command line tool (xrandr) at every reboot to configure it.
I advise you to look at http://lists.freedesktop.org/archives/mesa-dev/2014-May/060131.html for more details.

In the meantime, I also figured out that if you run a composited environment, you shouldn't get tearing with DRI2 DRI_PRIME, so that is not an added value of my patches. However, a lot of apps show black content with DRI2 DRI_PRIME, and that is entirely solved by my patches!

Performance


Now, what remains to have something as good as on Windows? Well, better performance!

I have looked at what radeonsi is missing, and it looks like part of it is compiler optimizations and better Hyper-Z (some features are unsupported). I had the chance to test some new compiler optimizations coming to radeonsi. Here are some benchmarks:
"hyperz" refers to R600_DEBUG=hyperz.
"dma" refers to using the DMA engine for the copy buffer in VRAM -> buffer in RAM
"si_scheduler" refers to the compilations optimizations I'm refering to.







These apps don't benefit that much from the compiler optimizations, but benefit a lot from using the DMA engine for the copy.

It's the opposite for Unigine Heaven 4.0 (which seems particularly indifferent to CPU speed): I get a score of 203 with nothing, 206 with hyperz, 206 with hyperz+dma, and 216 with hyperz+dma+si_scheduler.
We are not far from Catalyst's score of 231! Perhaps with the PTE patches, which are not in the kernel I use, we could beat it?

Also, the Dolphin emulator stutters a lot with radeonsi, but when I use the compiler optimizations, most of the stuttering is gone.

So overall I'm very happy. With all the recent improvements, my laptop on Linux should finally be up to par with Windows.

You can find the compiler optimizations here: http://cgit.freedesktop.org/~tstellar/llvm/?h=si-scheduler-v2