Thursday, 30 July 2015

A new SI scheduler

For quite some time I have wanted to contribute to the llvm project by improving the SI backend (the compiler used for the shaders of GCN cards in the Mesa project). In fact I have been looking at it seriously for more than a year.

During the last two years, Tom Stellard made several attempts to improve the backend by enabling the llvm scheduler. At first there were bugs, because the backend wasn't ready, but his work was finally merged and performance improved greatly (llvm >= 3.7).

However if you look at the compiler output and read the hardware documentation, you have good reasons to say the output is far from optimal. There are several reasons for that.

Time for some explanations about what is going on.

How does the hardware work?

The various open documents describe in detail how it works, but here is a summary of my understanding.

The hardware is divided into Compute Units (CU), each having one Scalar Unit (SU) and four Vector Units (VU). A VU can operate on 16 elements at a time (per cycle), while an SU operates on one at a time (one/cycle). In addition the SU handles some other operations, like branching.

Every VU has some private memory where it stores its registers. There are two types of registers: scalar registers (s0, s1, etc), which each hold one 4-byte number, and vector registers (v0, v1, etc), which each hold a set of 64 4-byte numbers. When you operate on the vector registers, every operation is done in parallel on the 64 numbers. Every time you do some work with them, you actually work on 64 inputs. For example you work on 64 different pixels at a time (for each of them the inputs are slightly different, and thus you get a slightly different color at the end).

Every VU has room for 512 scalar registers and 256 vector registers.

However shaders don't need that many registers! On the other hand, shaders will sometimes load data that is far away in memory, and that will take some time to arrive.
To solve that, AMD assigns several wavefronts to each VU (a wavefront = the work for 64 different inputs). The hardware distributes the registers to the different wavefronts, and when one wavefront is waiting on some memory result, it decides to work on another one.
The fewer registers the shader uses, the more wavefronts can be assigned to the VU. Be aware that more wavefronts doesn't mean things go several times faster: it just means that in the end you'll have hidden latencies better, waited less, and thus executed faster.
Wavefronts are assigned per VU; VUs do not exchange wavefronts. At most 10 wavefronts can be assigned per VU (thus 40 per CU). If you look on the internet for images from CodeXL or other AMD tools, you'll find tables relating the number of SGPRs (s registers) and VGPRs (v registers) to the number of wavefronts, but basically the limit for SGPRs is min(104, 512/numwavefronts) and for VGPRs 256/numwavefronts. I don't know about the 104; I guess it is a hw limitation.
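
To make those numbers concrete, here is a small C++ sketch (not actual driver or llvm code; the function name is mine) that applies the two formulas above to find how many wavefronts a VU can be assigned for a given register usage:

    #include <algorithm>

    // Follows the formulas above: for N wavefronts per VU the limits are
    // min(104, 512/N) s registers and 256/N v registers, with at most 10
    // wavefronts per VU. Hardware allocation granularity is ignored.
    int maxWavefrontsPerVU(int sgprsUsed, int vgprsUsed) {
        for (int waves = 10; waves > 1; --waves) {
            int sgprLimit = std::min(104, 512 / waves);
            int vgprLimit = 256 / waves;
            if (sgprsUsed <= sgprLimit && vgprsUsed <= vgprLimit)
                return waves;
        }
        return 1; // a shader always gets at least one wavefront
    }

For example a shader using 40 v registers would get 6 wavefronts (256/6 = 42 is enough, 256/7 = 36 is not), while one using 48 s registers and 24 v registers would still get the maximum of 10.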

You may have noticed there are 64 elements per v register, whereas a VU can execute work for only 16 elements at a time. In fact VU operations (except for some which take longer) execute in 4 cycles.
Every cycle the CU takes the next VU and executes work associated with its wavefronts: it uses the SU for one cycle if the operation needs it, it can begin to load memory, ask the VU to execute some operations, etc. The documentation suggests that the CU could pick from different wavefronts: execute one on the SU (if it only needs the SU), another on the VU (if that wavefront only needs the VU), load memory for yet another wavefront, etc.

Given that architecture, it is equivalent (and simpler) to think of the CU as having only one SU and one VU, with VU operations taking one cycle, just like SU operations.

The AMD docs refer to the hw system that chooses which wavefronts the VU executes as a scheduler. Keep that in mind to differentiate it from the compiler scheduler we'll talk about later.

So what is the job of the compiler in all that?
There are three types of memory accesses with different latencies. We will simplify by subdividing them into two cases: SGPR loads (s registers) and VGPR loads (v registers). SGPR loads are typically for shader constants filled in by the program at the last moment (after compilation), memory addresses for textures, etc. These are constants needed by all 64 elements of the wavefront we work on. Some types of constants are in a special fast-access cache; moreover there isn't much to load: a scalar register is only 4 bytes! Thus we will consider these loads as low latency loads. The second type we will consider is VGPR loads. The location of the memory accessed is harder to predict, and the buffers accessed are bigger, thus we can end up hitting caches that are far away, etc. Moreover there is a lot of data to load: 64 x 4 bytes! We will consider these loads as high latency loads.

When a load instruction is executed, the hardware doesn't directly wait for the result. We have instructions to wait for the result, and until we have waited for a result, we are not allowed to read the registers that will contain it. For SGPR loads, the instruction s_waitcnt lgkmcnt(0) will wait for all previous SGPR loads (of the wavefront) to have completed. For VGPR loads, s_waitcnt vmcnt(n) will wait for all except the last n VGPR loads (in issue order) to have completed.
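
As an illustration of those two behaviours, here is a toy model I wrote for this post (this is neither hardware nor llvm code, just a way to picture the counters):

    // Toy model of the two wait counters described above.
    struct WaitCounters {
        int lgkmcnt = 0; // SGPR (low latency) loads still in flight
        int vmcnt = 0;   // VGPR (high latency) loads still in flight

        void issueSgprLoad() { ++lgkmcnt; }
        void issueVgprLoad() { ++vmcnt; }

        // s_waitcnt lgkmcnt(0): stall until every previous SGPR load is done.
        void waitLgkmcnt0() { lgkmcnt = 0; }

        // s_waitcnt vmcnt(n): stall until at most n VGPR loads are still in
        // flight; they complete in issue order, so the oldest finish first.
        void waitVmcnt(int n) { if (vmcnt > n) vmcnt = n; }
    };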

To get good performance, we want to hide latencies as much as possible. Thus we either want the result to already be there when we reach the s_waitcnt instruction, so the wait does nothing, or we want enough wavefronts assigned to the VU (thus few registers used) to always have one wavefront ready to execute. Both cases require us to schedule load instructions sufficiently far away from their users.

However, since we have two very different types of loads (low latency, with an instruction that waits for all previous loads; high latency, with a more refined instruction that can wait for all but the last few loads), we'd like to handle the two cases differently.

For low latency loads, the ideal pattern would be to load some constants we'd need in the near future, then do a few unrelated instructions, then wait for the results, execute other loads we'd need later, then execute the instructions that needed the first loads. Of course we'd also want to optimize that such that when constants are neighbors in memory, we load them at the same time (there are instructions to load 2, 4 or 8 neighbors).

For high latency loads, we want to place the load and the instructions needing it such that there are as many instructions as possible between the moment we load the data and the moment we need it.

It is interesting to notice that for high latency loads, you usually need an address, which itself requires a low latency load.

One way to solve the problem is to place all loads at the beginning. However, it is very bad for register usage, since low latency loads will increase the number of s registers used, and high latency loads will increase the number of v registers used. We want to avoid register spilling (which happens when you run out of registers) at all costs, and we want to reduce register usage to enable the hardware to assign several wavefronts to the VU to better hide latency.

Imagine you have a load l1, an instruction i1, and an instruction i2 needing the result of l1.
If you have 10 wavefronts assigned to the VU (max 48 s registers and 24 v registers), then between the moment you ask for the memory and the moment you execute i2, you can execute 10 i1 operations (one per wavefront). Thus you will wait less for the memory result to arrive. That's why we want to use as few registers as possible.

In a given wavefront, the instructions are executed in-order.
Thus the goal of the scheduler (the compiler step that determines instruction order) is to hide the latencies as well as it can.

We have:
. Latencies are hidden differently depending on register usage
. Two very different types of latencies (with special and different wait-for-the-result behaviours)

Neither of these can be handled well by the default llvm scheduler.
To simplify a lot: the default llvm scheduler starts from the top, maintains a list of instructions it can schedule given what has already been scheduled, and picks one instruction from that list. It chooses the instruction with some heuristics: it will try to reduce register usage if too many registers are in use, else it tries to reduce latencies (the backend tells it the latency of each instruction).
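
A very rough sketch of such a top-down list scheduler looks like this (simplified C++, not the actual llvm MachineScheduler code; Node and pickBest are invented for the example):

    #include <algorithm>
    #include <vector>

    struct Node {
        std::vector<Node*> succs;   // instructions that depend on this one
        int unscheduledPreds = 0;   // dependencies not scheduled yet
    };

    // Placeholder heuristic: the real one looks at register pressure first
    // (when it is too high) and at instruction latencies otherwise.
    Node *pickBest(const std::vector<Node*> &ready) { return ready.front(); }

    std::vector<Node*> listSchedule(std::vector<Node> &dag) {
        std::vector<Node*> ready, sched;
        for (Node &n : dag)
            if (n.unscheduledPreds == 0)
                ready.push_back(&n);          // can be scheduled right away

        while (!ready.empty()) {
            Node *n = pickBest(ready);        // heuristic choice
            ready.erase(std::find(ready.begin(), ready.end(), n));
            sched.push_back(n);               // emit the instruction
            for (Node *s : n->succs)          // release its successors
                if (--s->unscheduledPreds == 0)
                    ready.push_back(s);
        }
        return sched;
    }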

This scheduler likely behaves well for cpus, but not really for gpus.

What I have seen go wrong with this scheduler is:
. It cannot foresee precisely whether the instruction it schedules will really decrease register usage. When it tries to reduce register usage and no instruction actually does, it takes the instruction that increases register usage the least. This is often not the right choice for SI and generates very bad results with a lot of registers used.
. Latencies just do not fit the 'this instruction is going to take XXX cycles' model at all. It is not compatible with the low latency instructions and their 'wait for all previous loads to complete' behaviour, and high latency instructions simply have latencies too big for it, so the scheduler is not clever about how to hide them.

That's why I decided to start from something entirely different for a new SI scheduler.

My ideas are the following:
. A lot of instructions can be seen as having the common goal of computing a few results from some inputs, while needing quite a few intermediate results. For example sampling a texture (which is a high latency instruction) needs a texture address and a location, thus we could group together "load the address - compute the location - sample the texture". That would be better for register usage.
. Sometimes there aren't that many intermediate results, but it would still be better to first schedule all the instructions needing one set of intermediate results, then all the instructions needing another set. Similarly, that would be better for register usage.
. If we work by blocks of instructions, it can be easier and faster to try different orders to hide latencies.

Thus I decided to schedule blocks of instructions:
First, I must build blocks of instructions, such that these blocks make sense and reduce register usage. I decided to put each high latency instruction alone in a block, and to add to its block the instructions that only it needs (load the address, compute some results, etc). For the other blocks, I decided to group instructions depending on which high latency instructions they depend on. Then I add to the blocks the instructions that are left alone and on which the instructions of the blocks rely (constant loading for example).
Second, I must schedule the instructions inside the blocks. I schedule low latency instructions first, then the instructions not needing the results of the low latencies, then the other instructions. I schedule trying to keep register usage as low as possible, but in practice inside a block any order gives about the same register usage.
Third, I choose the order of the blocks. There my first priority is hiding the high latencies.

Fourth, I do an optional step: I try some minor order changes to optimize the score '(mean number of instructions hiding high latencies + min number of instructions hiding high latencies) * number of wavefronts'.
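
In code that score is just something like the following (simplified, the names are mine):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // instrsHiding[i] = number of instructions available to hide high latency i.
    double scheduleScore(const std::vector<int> &instrsHiding, int numWavefronts) {
        if (instrsHiding.empty())
            return 0.0;
        double mean = std::accumulate(instrsHiding.begin(), instrsHiding.end(), 0.0)
                      / instrsHiding.size();
        int minHiding = *std::min_element(instrsHiding.begin(), instrsHiding.end());
        return (mean + minHiding) * numWavefronts;
    }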

Fifth, I move the low latencies outside of their block, so that we load the memory just after the moment we wait for the low latencies of the previous block.

An optional sixth step, which I haven't done, would be to further optimize low latencies by moving some of them even more, so that we load consecutive locations at the same time.

In other words, we had three parameters in our problem: high latencies, register usage and low latencies. My scheduling is done such that we only care about one at a time (we don't need to care much about register usage since the block structure helps keep it low, low latencies are handled inside a block, and high latencies are handled by choosing the block order); then I try to optimize the relation between high latencies and register usage more globally, and I optimize the low latencies again separately.
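
To give an idea of what the block creation step looks like, here is a much simplified C++ sketch. The real pass works on the llvm scheduling DAG and the types below are invented; the sketch only shows grouping the ordinary instructions by the high latency instructions they depend on, and leaves out pulling each high latency instruction's exclusive dependencies into its own block:

    #include <map>
    #include <set>
    #include <vector>

    struct Instr {
        bool highLatency = false;
        std::vector<Instr*> preds; // instructions this one depends on
    };

    // High latency instructions this instruction (transitively) depends on.
    static std::set<Instr*> highLatencyDeps(Instr *i,
                                            std::map<Instr*, std::set<Instr*>> &cache) {
        auto it = cache.find(i);
        if (it != cache.end())
            return it->second;
        std::set<Instr*> deps;
        for (Instr *p : i->preds) {
            if (p->highLatency)
                deps.insert(p);
            std::set<Instr*> sub = highLatencyDeps(p, cache);
            deps.insert(sub.begin(), sub.end());
        }
        cache[i] = deps;
        return deps;
    }

    // Each high latency instruction gets its own block; the other instructions
    // are grouped by the set of high latency instructions they depend on.
    std::vector<std::vector<Instr*>> buildBlocks(const std::vector<Instr*> &instrs) {
        std::vector<std::vector<Instr*>> blocks;
        std::map<std::set<Instr*>, std::vector<Instr*>> byDeps;
        std::map<Instr*, std::set<Instr*>> cache;
        for (Instr *i : instrs) {
            if (i->highLatency)
                blocks.push_back({i});
            else
                byDeps[highLatencyDeps(i, cache)].push_back(i);
        }
        for (auto &group : byDeps)
            blocks.push_back(group.second);
        return blocks;
    }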

Some issues with my algorithm:
. Register usage is sensitive to the algorithm used to create the blocks. I haven't found the best way yet.
. The relationship between high latencies and register usage isn't perfectly handled. I'm trying different algorithms to choose the order of the blocks, and then I run the optimization step on the results. We'll see which algorithm works best on different programs.
. It isn't as simple as the default scheduler.

However it is a significant improvement over the default scheduler.

I haven't entirely finalized the scheduler. I want to implement some more variants for the block order phase, and I need to implement a few things to make the block order optimization phase faster. I also need to improve the block creation phase, because some shaders have big blocks and can have too many constants to load, and thus the current rule makes them use too many scalar registers.

It is also interesting to notice that llvm has a second scheduler (not currently used by the SI backend) which runs after register allocation, and which focuses even more on hiding latencies than the first scheduler. We do not need it at all with my scheduler, since mine is already entirely aware of register usage.

More to come...
