Thursday, 30 July 2015

A new SI scheduler

For quite some time I have wanted to contribute to the llvm project by improving the SI backend (the compiler used for the shaders of GCN cards in the Mesa project). In fact I have been looking at it seriously for more than a year.

Over the last two years, Tom Stellard made several attempts to improve the backend by enabling the llvm scheduler. At first there were bugs, because the backend wasn't ready, but his work was finally merged and performance improved greatly (llvm >= 3.7).

However, if you look at the compiler output and read the hardware documentation, you have reasons to say the output is far from optimal. There are several reasons for that.

Time for some explanation of what is going on.

How does the hardware work?

The various pieces of open documentation describe in detail how it works, but here is a summary of my understanding.

. The hardware is divided into Compute Units (CU), each having a Scalar Unit (SU) and four Vector Units (VU). A VU can operate on 16 elements at a time (per cycle), while an SU operates on one at a time (one/cycle). In addition the SU handles some other operations, like branching.

Every VU has some private memory where it stores its registers. There are two types of registers: scalar registers (s0, s1, etc), which each hold one 4-byte number, and vector registers (v0, v1, etc), which each hold a set of 64 4-byte numbers. When you operate on the vector registers, every operation is done in parallel on the 64 numbers. Every time you do some work with them, you actually work on 64 inputs. For example you work on 64 different pixels at a time (for each of them the inputs are slightly different, and thus you get a slightly different color at the end).

Every VU has room for 512 scalar registers and 256 vector registers.

However shaders don't need that many registers! On the other hand, shaders will sometimes load data that is far away in memory, and that will take some time to arrive.
To solve that, AMD assigns several wavefronts to each VU (a wavefront = the work for 64 different inputs). The hardware distributes the registers among the different wavefronts, and when one wavefront is waiting on a memory result, it switches to working on another one.
The fewer registers the shader uses, the more wavefronts can be assigned to the VU. Be aware that more wavefronts doesn't mean things go several times faster. It just means that in the end latencies are hidden better, you wait less, and thus execute faster.
Wavefronts are assigned per VU; VUs do not exchange wavefronts. At most 10 wavefronts can be assigned per VU (thus 40 per CU). If you look on the internet for images from CodeXL or other AMD tools, you'll find tables relating the number of SGPRs (s registers) and VGPRs (v registers) to the number of wavefronts, but basically for SGPRs it is min(104, 512/numwavefronts) and for VGPRs it is 256/numwavefronts. I don't know about the 104, I guess it is a hw limitation.
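
To make that relation concrete, here is a tiny sketch (my own illustration, based only on the formulas above; real hardware rounds register counts to allocation granularities, which is ignored here) of how many wavefronts fit for a given register usage:

    #include <algorithm>
    #include <cstdio>

    // Each extra wavefront on a VU shares the same register files, so the more
    // registers a shader needs, the fewer wavefronts fit. Assumed limits:
    // 512 SGPRs and 256 VGPRs per VU, at most 10 wavefronts, and a single
    // wavefront cannot use more than 104 SGPRs.
    static int maxWavefrontsForRegisterUsage(int sgprsUsed, int vgprsUsed)
    {
        if (sgprsUsed > 104)
            return 0;                       // would not even fit one wavefront
        int waves = 10;                     // hardware maximum per VU
        if (sgprsUsed > 0)
            waves = std::min(waves, 512 / sgprsUsed);
        if (vgprsUsed > 0)
            waves = std::min(waves, 256 / vgprsUsed);
        return waves;
    }

    int main()
    {
        // e.g. a shader using 40 scalar and 36 vector registers
        std::printf("%d wavefronts\n", maxWavefrontsForRegisterUsage(40, 36)); // 7
    }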

You may have noticed that there are 64 elements per v register, whereas a VU can execute work for only 16 elements at a time. In fact VU operations (except for a few which take longer) execute in 4 cycles.
Every cycle the CU takes the next VU and executes work associated with its wavefronts: it uses the SU for one cycle if the operation needs it, it can begin to load memory, ask the VU to execute some operations, etc. The documentation suggests that the CU could pick from different wavefronts: execute one on the SU (if it only needs the SU), another on the VU (a wavefront that only needs the VU), load memory for yet another wavefront, etc.

Given that architecture, it is equivalent (and simpler) to think of the CU as having only one SU and one VU, with VU operations taking one cycle, just like the SU.

The AMD docs refer to the hw system that chooses which wavefronts the VU executes as a scheduler. Keep that in mind to differentiate it from the compiler scheduler we'll talk about later.

So what is the job of the compiler in all that?
There are three types of memory accesses with different latencies. We will simplify by dividing them into two cases: SGPR loads (s registers) and VGPR loads (v registers). SGPR loads are typically for shader constants filled in by the program at the last moment (after compilation), memory addresses for textures, etc. These are constants needed by all 64 elements of the wavefront we work on. Some types of constants sit in a special fast-access cache, and moreover there isn't much to load: a scalar register is only 4 bytes! Thus we will consider these loads as low latency loads. The second type is the VGPR load. The location of the memory accessed is harder to predict, and the buffers accessed are bigger, so the access may have to go to a cache that is far away, or to memory. Moreover there is a lot of data to load: 64 x 4 bytes! We will consider these loads as high latency loads.

When a load instruction is executed, the hardware doesn't directly wait for the result. We have instructions to wait for the result, and until we have waited for a result, we are not allowed to read the registers that will contain it. For SGPR loads, the instruction s_waitcnt lgkmcnt(0) waits for all previous SGPR loads (of the wavefront) to have completed. For VGPR loads, s_waitcnt vmcnt(n) waits for all except the n last VGPR loads (in order) to have completed.
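
To make the two wait mechanisms a bit more concrete, here is a toy model (purely my own illustration, not hardware or compiler code) of how the two counters behave:

    #include <cassert>

    // Toy model of the two wait counters described above. Each outstanding
    // scalar (low latency) load bumps lgkm, each outstanding vector (high
    // latency) load bumps vm; vector loads complete in issue order here.
    struct WaitCounters {
        int lgkm = 0; // outstanding low latency loads
        int vm = 0;   // outstanding high latency loads

        void issueScalarLoad() { ++lgkm; }
        void issueVectorLoad() { ++vm; }

        // s_waitcnt lgkmcnt(0): block until every scalar load has landed.
        void waitLgkmcnt0() { lgkm = 0; }

        // s_waitcnt vmcnt(n): block until at most n vector loads are still in
        // flight, i.e. all but the n most recent ones have landed.
        void waitVmcnt(int n) { if (vm > n) vm = n; }
    };

    int main()
    {
        WaitCounters c;
        c.issueVectorLoad(); // load A
        c.issueVectorLoad(); // load B
        c.issueVectorLoad(); // load C
        c.waitVmcnt(1);      // A and B are now guaranteed complete, C maybe not
        assert(c.vm == 1);
    }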

To have good performance, we want to hide latencies as much as possible. So we either want the result to already be there when we reach an s_waitcnt instruction, so that the wait does nothing, or we want enough wavefronts assigned to the VU (thus few registers used) to always have one wavefront ready to execute. Both cases require that we schedule load instructions sufficiently far away from their users.

However, since we have two very different types of loads - low latency, with an instruction that waits for all previous loads, and high latency, with a more refined instruction that can wait for all but the last few loads - we'd like to handle the two cases differently.

For low latency loads, the ideal pattern would be to load some constants we'd need in the near future, then do a few unrelated instructions, then wait for the results, issue other loads we'd need later, and then execute the instructions that needed the first loads. Of course we'd also want to optimize so that when constants are neighbors in memory, we load them at the same time (there are instructions to load 2, 4 or 8 neighbors).

For high latency loads, we want to place the load and the instructions needing it such that there are as many instructions as possible between the moment we load the data and the moment we need it.

It is interesting to notice that a high latency load usually needs an address, which itself requires a low latency load.

One way to solve the problem is to place all loads at the beginning, but that is very bad for register usage, since low latency loads increase the number of s registers used, and high latency loads increase the number of v registers used. We want to avoid register spilling (which happens when you run out of registers) at all costs, and we want to reduce register usage so that the hardware can assign several wavefronts to the VU to better hide latency.

Imagine you have a load l1, an instruction i1, and an instruction i2 needing the result of l1.
If you have 10 wavefronts assigned to the VU (max 48 s registers and 24 v registers), then between the moment you ask for the memory location and the moment you execute i2, you can execute 10 i1 operations (one per wavefront). Thus you wait less for the memory result to arrive. That's why we want to use as few registers as possible.

In a given wavefront, the instructions are executed in-order.
Thus the goal of the scheduler (the compiler step that determines instruction order) is to hide the latencies as well as it can.

We have:
. Latencies are hidden differently depending on register usage
. Two very different types of latencies (with special and different wait-for-the-result behaviours)

Neither of these is handled well by the default llvm scheduler.
To simplify a lot, the default llvm scheduler starts from the top, maintains a list of instructions it can schedule given what has already been scheduled, and picks one instruction from the list. It chooses the instruction with some heuristics: it tries to reduce register usage if too many registers are used, otherwise it tries to reduce latencies (the backend tells it the latency of each instruction).
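
Very roughly, the idea looks like this (an extremely simplified sketch of a top-down list scheduler, not the actual llvm MachineScheduler code; all names are made up):

    #include <cstddef>
    #include <vector>

    struct Instr {
        std::vector<Instr*> succs;   // instructions depending on this one
        int unscheduledPreds = 0;    // dependencies not yet scheduled
        int latency = 0;             // latency reported by the backend
        int regPressureDelta = 0;    // estimated change in live registers
    };

    std::vector<Instr*> scheduleTopDown(std::vector<Instr*> ready,
                                        int regPressureLimit)
    {
        std::vector<Instr*> order;
        int pressure = 0;
        while (!ready.empty()) {
            // Heuristic pick: if register pressure is too high, take the
            // candidate that increases it the least; otherwise take the one
            // with the biggest latency, hoping to start it early.
            std::size_t best = 0;
            for (std::size_t i = 1; i < ready.size(); ++i) {
                bool better = pressure > regPressureLimit
                    ? ready[i]->regPressureDelta < ready[best]->regPressureDelta
                    : ready[i]->latency > ready[best]->latency;
                if (better)
                    best = i;
            }
            Instr *picked = ready[best];
            ready.erase(ready.begin() + best);
            order.push_back(picked);
            pressure += picked->regPressureDelta;
            // Successors whose dependencies are now all scheduled become ready.
            for (Instr *s : picked->succs)
                if (--s->unscheduledPreds == 0)
                    ready.push_back(s);
        }
        return order;
    }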

This scheduler likely behaves well for cpus, but not really for gpus.

What I have seen go wrong with this scheduler is:
. It cannot foresee precisely whether the instruction it schedules will really decrease register usage. When it tries to reduce register usage and no instruction actually reduces it, it will take the instruction that increases register usage the least. This is often not the right choice for SI, and it generates very bad results with a lot of registers used.
. Latencies just do not fit the 'this instruction is going to take XXX cycles' model at all. That model is not compatible with the low latency instructions and their 'wait for all previous loads to complete' behaviour, and high latency instructions simply have latencies too big for it, so the scheduler is not clever about hiding them.

That's why I decided to start from something entirely different for a new SI scheduler.

My ideas are the following:
. A lot of instructions can be seen as having the common goal of computing a few results from some inputs, while needing quite a few intermediate results. For example sampling a texture (which is a high latency instruction) needs a texture address and a location, thus we could group together "load the address - compute the location - sample the texture". That would be better for register usage.
. Sometimes there aren't that many intermediate results, but it would still be better to schedule first all the instructions needing one set of intermediate results, then all the instructions needing another set. Again that would be better for register usage.
. If we work by blocks of instructions, it can be easier and faster to try different orders to hide latencies.

Thus I decided to schedule blocks of instructions:
First I build blocks of instructions, such that these blocks make sense and reduce register usage. I decided to put each high latency instruction alone in its own block, and to add to that block the instructions that only it needs (loading the address, computing some inputs, etc). The other blocks group instructions according to which high latency instructions they depend on. Then I add to the blocks the standalone instructions that the instructions in the blocks rely on (constant loading for example). A rough sketch of this grouping idea is given after the steps below.
Second I schedule the instructions inside the blocks: low latency instructions first, then instructions not needing the results of the low latencies, then the other instructions. I try to keep register usage as low as possible, but in practice inside a block any order gives about the same register usage.
Third I choose the order of the blocks. There my first priority is hiding the high latencies.

Fourth there is an optional step where I try some minor order changes to optimize the score '(mean number of instructions hiding high latencies + min number of instructions hiding high latencies) * number of wavefronts'.

Fifth, I move the low latencies outside of their block, so that we issue the loads just after the point where we wait for the low latencies of the previous block.

An optional sixth step, which I haven't done, would be to further optimize low latencies by moving some of them even more, so that we load consecutive locations at the same time.
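
Here is a rough sketch of the block-creation rule and of the score used by the optional fourth step (a conceptual illustration only, not the actual pass; all names are made up):

    #include <algorithm>
    #include <map>
    #include <set>
    #include <vector>

    struct Inst {
        bool highLatency = false;
        std::vector<Inst*> preds;   // instructions this one depends on
    };

    // For each instruction, the set of high latency instructions it
    // (transitively) depends on. Instructions sharing the same set go into the
    // same block, while each high latency instruction gets a block of its own.
    using HighLatencySet = std::set<const Inst*>;

    HighLatencySet highLatencyDeps(const Inst *inst,
                                   std::map<const Inst*, HighLatencySet> &cache)
    {
        auto it = cache.find(inst);
        if (it != cache.end())
            return it->second;
        HighLatencySet deps;
        for (const Inst *p : inst->preds) {
            if (p->highLatency)
                deps.insert(p);
            HighLatencySet sub = highLatencyDeps(p, cache);
            deps.insert(sub.begin(), sub.end());
        }
        cache[inst] = deps;
        return deps;
    }

    // Score used by the optional fourth step: for each high latency load we
    // count how many instructions sit between it and its first user, then
    // compute (mean + min) * number of wavefronts.
    double scheduleScore(const std::vector<int> &instrsHidingEachHighLatency,
                         int numWavefronts)
    {
        if (instrsHidingEachHighLatency.empty())
            return 0.0;
        double sum = 0.0;
        int minHide = instrsHidingEachHighLatency.front();
        for (int n : instrsHidingEachHighLatency) {
            sum += n;
            minHide = std::min(minHide, n);
        }
        double mean = sum / instrsHidingEachHighLatency.size();
        return (mean + minHide) * numWavefronts;
    }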

In other words, we had three parameters in our problem: high latencies, register usage, low latencies. My scheduling is done such that we only care about one at a time (we don't need to care much about register usage since the block structure keeps it low - low latencies are handled inside a block - high latencies are handled by choosing the block order), then I try to optimize the relation between high latencies and register usage more globally, and finally I optimize the low latencies separately once more.

Some issues with my algorithm:
. Register usage is sensitive to the algorithm used to create the blocks. I haven't found the best way yet.
. The relationship between high latencies and register usage isn't perfectly handled. I'm trying different algorithms to choose the order of the blocks, and then I run the optimization step on the results. We'll see which algorithm works best on different programs.
. It isn't as simple as the default scheduler.

However it is a significant improvement over the default scheduler.

I haven't entirely finalized the scheduler. I want to implement some more variants for the block order phase, and I need to implement a few things to make the block order optimization phase faster. I also need to improve the block creation phase, because some shaders have big blocks with too many constants to load, and the current rule then makes them use too many scalar registers.

It is also interesting to notice that llvm has a second scheduler (not currently used by the SI backend) which runs after register allocation, and which focuses even more on hiding latencies than the first one. We do not need it at all with my scheduler, since mine is already entirely aware of register usage.

More to come...

Sunday, 1 June 2014

Getting my dedicated Amd hd7730M on Linux up to par with Windows

Motivations


When I bought my laptop (one with an Intel hd4000 and a dedicated Amd hd7730m card), the simplest way to have good battery life, while still being able to use the dedicated card sometimes, was to install Catalyst.

While on Windows the two cards mix well together - you have a tool to configure which card should be used for recently launched applications, and the dedicated card shuts itself down when it is not used - this was not the case on Linux. With Catalyst you had the choice between hd4000 only or hd7730m only, and if you changed the setting you had to restart X for the change to take effect.

The main issues with that were that Catalyst required specific parameters for the intel DDX (no sna), was difficult to install without problems, and added 30s to my startup time. Also, when using the Amd card, tearing was everywhere and destroyed the experience.

My interpretation of these issues is that Catalyst probably uses some hacks to get access to the screen framebuffer from the intel DDX, and directly copies whatever needs to be updated to it (without caring about the screen refresh, thus causing the tearing).

Not very happy with this situation, when I heard that the open source drivers were getting the ability to power the dedicated card up or down when needed, and that DPM was working, I became optimistic that I could use the open source drivers to get a better experience.

Helping the open source driver


Last year I had some free time to spend, and decided to help open source get what I wanted. Currently, with DRI2 DRI_PRIME support, you can use the dedicated card for a specific application (and when the application closes, the dedicated card will shut down), but you still get tearing.
That's because the way it's implemented is basically the same as for Catalyst: copying to the screen framebuffer without caring about the screen refresh.

To get rid of the tearing, I thought at that time that the best solution was to implement DRI_PRIME for Wayland, since Wayland compositors are supposed to be tear-free.
It took several months to correctly understand what needed to be done and how.
Finally I wrote some patches and even gave a talk at Fosdem! Unfortunately these patches didn't have much success in review; the main reasons, I think, are the lack of testers and the need for Kristian Høgsberg to review them to get them merged into that part of Mesa, and he is very busy with other things (I also think he has to give priority to Intel work, and the fact that Intel doesn't need DRI_PRIME is a factor too. But those are only suppositions).

I then decided to see what could be done for GLX DRI3. After all, I had worked on XWayland in the meantime, and understood the technology behind DRI3.
When developing the patches (both for Wayland EGL and GLX DRI3), I made the assumption that we would have dma-buf fences to help synchronize the reads and writes of the two cards.
Thus I chose a different way than DRI2: instead of always writing to the same buffer, we write to different buffers (as is done for DRI3 and Wayland by default).

The problem I thought of was: the buffer is cleared, then we write to it, and meanwhile the other card reads the buffer before it has been completely written. With dma-buf fences, the second card would wait. I thought that in the meantime we would use a hack, like glFinish (i.e. waiting for the rendering to be done) before sharing the buffer, but that decreases performance. Alternatively, we could have used a nested Weston compositor to add some latency.

Fortunately this is not a big problem, because you don't clear the buffer! You write over its previous content. Imagine you use 3 buffers. You render to one, while the other ones may or may not be in use by the server (X or Wayland compositor). When you 'send' the buffer (actually you're just telling the server you have rendered), the server 'releases' an older one (actually it just tells you it has finished with it), and you reuse this 'old' buffer. So when you write to it, it still holds the old content, which is 1-3 frames old. So if ever the server card reads before you have finished writing, you see old content in the image: not as bad as black content!
In practice everything is ok for most applications, but for some the old content may be a problem. We'll have to wait for dma-buf fences to have it 100% right.
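
Here is a toy model of that buffer rotation (purely an illustration of the idea described above, not the actual DRI3/Wayland code; a real client would wait for a release event instead of assuming a free buffer is always available):

    #include <deque>
    #include <vector>

    // A buffer never gets cleared: when reused, it still holds what was drawn
    // into it 1-3 frames earlier.
    struct Buffer { int lastRenderedFrame = -1; };

    struct Swapchain {
        std::vector<Buffer> buffers;
        std::deque<Buffer*> freeList;   // buffers the server has released to us
        std::deque<Buffer*> inFlight;   // buffers the server may still display

        explicit Swapchain(int count = 3) : buffers(count) {
            for (Buffer &b : buffers)
                freeList.push_back(&b);
        }

        Buffer *acquire() {             // pick a released buffer to render into
            Buffer *b = freeList.front();
            freeList.pop_front();
            return b;                   // the old content is still in there
        }

        void present(Buffer *b, int frame) { // "send": tell the server we rendered
            b->lastRenderedFrame = frame;
            inFlight.push_back(b);
            if (inFlight.size() > 2) {       // the server "releases" an older one
                freeList.push_back(inFlight.front());
                inFlight.pop_front();
            }
        }
    };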

Now another problem: the integrated card can't read buffers located in the dedicated card's memory, and it doesn't understand its 'tiling mode' (pixels are ordered in a special way for better performance), so you have to make the content available in RAM in linear mode. In practice you have a buffer in VRAM with tiling and a buffer in RAM without tiling, and you ask the card to copy the first buffer into the second one (the card is clever and waits until the rendering of the first buffer is finished before doing the copy). Between VRAM and RAM there is a pci-express link.

The not-so-clever way to do the copy is to feed the pixels through the card's compute units and ask them to execute the identity operation to fill the other buffer. Unfortunately that's what is done for my card with the open source driver. Nouveau uses the 2D engine or the DMA engine, which does it faster. Thus I added some bits to Mesa to make it use the DMA engine of my radeonsi card (DMA = direct memory access: manipulating memory without the help of the cpu). My benchmarks show a performance boost for cpu-limited apps! With it, I get better performance than under DRI2 DRI_PRIME.

Also, when writing the DRI_PRIME patches, I wanted to add new abilities: to be able, like on Windows, to tell the driver to use the radeonsi card for a specific app without having to set the DRI_PRIME env var on the command line, and also not having to run a command line tool (xrandr) at every reboot to configure it.
I advise you to look at http://lists.freedesktop.org/archives/mesa-dev/2014-May/060131.html for more details.

In the meantime, I also figured out that if you run a composited environment, you shouldn't have tearing with DRI2 DRI_PRIME, so that is not an added value of my patches. However a lot of apps show black content with DRI2 DRI_PRIME: this is entirely solved by my patches!

Performance


Now what's remaining to have something as good as on Windows? Well, better performance!

I have looked at what radeonsi is missing, and it looks like part of it is compilation optimizations, and better hyperz (some features are unsupported). I had the chance to test some new compilation optimizations coming to radeonsi. Here are some benchmarks:
"hyperz" refers to R600_DEBUG=hyperz.
"dma" refers to using the DMA engine for the copy buffer in VRAM -> buffer in RAM
"si_scheduler" refers to the compilations optimizations I'm refering to.

[benchmark charts]

These apps don't benefit that much from the compilation optimizations, but benefit a lot from using the DMA engine for the copy.

It's the opposite for Unigine Heaven 4.0 (which seems particularly indifferent to cpu speed): I get a score of 203 with nothing, 206 with hyperz, 206 with hyperz+dma, and 216 with hyperz+dma+si_scheduler.
We are not far from Catalyst's score of 231! Perhaps with the PTE patches, which are not in the kernel I use, we could beat it?

Also, the dolphin emulator has a lot of stuttering with radeonsi, but when I use the compilation optimizations, most of it is gone.

So overall I'm very happy. With all the recent improvements, my laptop on Linux should finally be up to par with Windows.

You can find the compilation optimizations here: http://cgit.freedesktop.org/~tstellar/llvm/?h=si-scheduler-v2

Sunday, 20 April 2014

Correct some Wayland misconceptions

Wayland seems to have a lot of hype, and among the comments I see some misconceptions about it. The purpose of this blog post is to correct some of these misconceptions and to help better match expectations with what Wayland has to offer.

Some common Misconceptions


  • X applications and Games will be faster under Wayland

A quick explanation of how XWayland works:

The Wayland compositor handles inputs (mouse, etc) and outputs (display, etc). It redirects the inputs of X apps to XWayland, which then forwards them to the client. When a client has to draw something, it does as it would under plain X, and XWayland shares the result with the Wayland compositor.

Thus there is no way an optimised XWayland can beat an optimized plain X on performance. However, thanks to compositing, you can have something tear-free.

To do efficient compositing, you need a buffer_age extension to be supported. It tells the compositor how long ago the buffer it'll render to was last rendered, which helps it determine the parts of the screen it doesn't need to repaint, since they weren't updated. Under X DRI2 this extension isn't supported, so the compositor either has to do fullscreen repaints or guess which parts must be redrawn. Neither is optimal.
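
For illustration, here is a minimal sketch of how a compositor can use buffer age through the EGL_EXT_buffer_age extension (assuming the extension is exposed; the damage-tracking helpers are hypothetical placeholders and there is no error handling):

    #include <EGL/egl.h>
    #include <EGL/eglext.h>

    // Hypothetical damage-tracking helpers, just for the sketch.
    struct Region { /* damaged rectangles */ };
    static Region accumulateDamageOfLastFrames(int /*age*/) { return {}; }
    static void repaintRegion(const Region &) { /* redraw only the damage */ }
    static void repaintEverything() { /* full-screen redraw */ }

    static void repaintFrame(EGLDisplay dpy, EGLSurface surface)
    {
        EGLint age = 0;
        eglQuerySurface(dpy, surface, EGL_BUFFER_AGE_EXT, &age);

        if (age == 0) {
            // Buffer content is undefined: we have to repaint everything.
            repaintEverything();
        } else {
            // The buffer holds what we drew 'age' frames ago, so repainting
            // the damage accumulated over the last 'age' frames is enough.
            repaintRegion(accumulateDamageOfLastFrames(age));
        }
        eglSwapBuffers(dpy, surface);
    }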

X DRI3+PRESENT should solve this. For Wayland, the buffer_age extension is already supported.

As for fullscreen games, for which you likely want compositing to be bypassed, the cpu overhead of using XWayland+Wayland compared to just X (similar for X vs native Wayland game) should have negligible impact on your fps.

  • Boot time will be faster

Try to launch X alone in a tty. You'll find that it starts pretty fast (~1 second for me). The boot time difference with the move to Wayland should be negligible. (Also you'll have to launch XWayland if you use X apps. XWayland launches faster however.)


You may now be wondering what Wayland will bring you, then?
Let's investigate:

What you could expect


  • Better security

I'm not a security expert, so I won't explain the topic much. Inputs are more secure, and thanks to the wide use of dma-buf sharing, your outputs should be secure too.

  • Better Battery life

This point will depend highly on the compositor. With Wayland, the compositor has a lot of power in its hands. Most Wayland applications listen to the frame callback, a Wayland event telling the client that now is a good time to redraw if it wants to update something. The compositor could use that to reduce power consumption by sending this event less often in clever situations. It could also reduce the cost of compositing with some tricks, etc. I don't expect very optimized behaviors to appear soon, but in the long term I expect a Wayland environment to have better battery life than an X environment.
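
As an illustration, here is a minimal client-side sketch of the frame callback loop (real Wayland client API, but heavily simplified: no error handling, and drawToBuffer() is a hypothetical placeholder for the actual rendering and buffer attach):

    #include <cstdint>
    #include <wayland-client.h>

    // Placeholder for rendering + wl_surface_attach of a new buffer.
    static void drawToBuffer(wl_surface * /*surface*/) {}

    static void frameDone(void *data, wl_callback *callback, uint32_t time);
    static const wl_callback_listener frameListener = { frameDone };

    static void scheduleRedraw(wl_surface *surface)
    {
        // Ask the compositor to tell us when it is a good time to draw again.
        // A hidden or throttled surface simply gets the event later (or never),
        // and the client stays idle in the meantime instead of burning power.
        wl_callback *cb = wl_surface_frame(surface);
        wl_callback_add_listener(cb, &frameListener, surface);
        wl_surface_commit(surface);
    }

    static void frameDone(void *data, wl_callback *callback, uint32_t /*time*/)
    {
        wl_surface *surface = static_cast<wl_surface *>(data);
        wl_callback_destroy(callback);
        drawToBuffer(surface);       // redraw only because the compositor asked
        scheduleRedraw(surface);     // and request the next frame event
    }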

  • Performance corner cases better handled

Some operations are known to be faster on CPU, some others faster on GPU.

X has a rendering API which abstracts what you are rendering on. Some of these rendering operations are faster on a GPU, while some are better on CPU (and it highly depends on how the DDX was written). This can lead to strange situations.

Let's take OpenTTD, one of my favorite games. It renders the whole game with the cpu, then lets SDL (1.2) put the result on the screen, indicating which parts of the screen were updated.

If your DDX uses the GPU to render, you are much better off uploading the whole buffer instead of each small updated part. However, if it uses the CPU, you want only the small updated parts to be uploaded.

SDL2 has some detection code to figure out what you are running on, and uploads the whole buffer on systems where that is better. SDL 1.2 doesn't make that distinction: it will use XShmPutImage on each damaged part to upload the result.

An XShmPutImage operation consists of: create a pixmap, upload the result to it, copy the pixmap into a subpart of the destination. For Glamor it is really, really slow: for each XShmPutImage it converts the whole buffer into a gl texture, and at the next XShmPutImage it cannot know whether this CPU buffer was updated, so it has to convert the whole buffer into a texture again. That makes OpenTTD unplayable with Glamor. When using something other than Glamor, XShmPutImage is much faster. This illustrates the problem: writing an efficient DDX is very complex, since you have to support operations that are not very fast for your device.

The same situation on Wayland is simpler: you differentiate buffers on the GPU (you can use EGL to get/send them, or vaapi will create one, etc) from buffers on the CPU (shm buffers). They are abstracted into a wl_buffer, which is a buffer the compositor is aware of. The API handles wl_buffers.

Since you have rendered everything on the CPU, you use a shm buffer to send the result to Wayland (and you tell Wayland which parts were updated). Whatever the Wayland compositor uses (CPU or GPU), it'll choose the best way to get the result on screen.
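
For illustration, here is a minimal sketch of that path with the core protocol calls (Linux-specific shared memory setup, no error handling; a real client would also mmap the fd and draw the pixels before committing):

    #include <sys/mman.h>    // memfd_create (Linux, glibc >= 2.27)
    #include <unistd.h>      // ftruncate
    #include <wayland-client.h>

    static int createSharedMemoryFd(int size)
    {
        int fd = memfd_create("cpu-buffer", MFD_CLOEXEC);
        ftruncate(fd, size);
        return fd;
    }

    static wl_buffer *createShmBuffer(wl_shm *shm, int width, int height)
    {
        const int stride = width * 4;          // XRGB8888: 4 bytes per pixel
        const int size = stride * height;
        int fd = createSharedMemoryFd(size);   // the client would also mmap this
        wl_shm_pool *pool = wl_shm_create_pool(shm, fd, size);
        wl_buffer *buffer = wl_shm_pool_create_buffer(pool, 0, width, height,
                                                      stride,
                                                      WL_SHM_FORMAT_XRGB8888);
        wl_shm_pool_destroy(pool);
        return buffer;
    }

    static void presentCpuFrame(wl_surface *surface, wl_buffer *buffer,
                                int dirtyX, int dirtyY, int dirtyW, int dirtyH)
    {
        wl_surface_attach(surface, buffer, 0, 0);
        // Only the updated rectangle is declared as damage; the compositor then
        // decides the best way (CPU or GPU) to get it on screen.
        wl_surface_damage(surface, dirtyX, dirtyY, dirtyW, dirtyH);
        wl_surface_commit(surface);
    }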

Also, on Wayland you have a subsurface mechanism, which enables you to mix different wl_buffers for your app: you can have video content accelerated with GL, while the GUI is rendered with the CPU.

  • Fewer glitches

Have you ever experienced some applications causing problems or glitches for the whole desktop? For example an application setting itself fullscreen at a low resolution, then crashing? Hopefully this will go away with Wayland, since applications do not control the resolution; they just kindly tell the compositor that they would prefer a specific resolution. If the client dies, the compositor can go back to the previous resolution automatically. This is one example among others.

  • Better user experience

This depends a lot on what the DEs do, but potentially the user experience can be improved with more user control, better visuals, etc.


There are many other topics that could be discussed.
Of course, I do not hold the absolute truth, nor can I see the future.
I hope this discussion has helped you better understand what Wayland can improve for you, and I wish Wayland the best of success.