The Missing Guide to Modern Graphics APIs – 2. PSOs

Today, we’ll be looking at a fairly simple but fundamental concept in modern APIs, and using it as a springboard to talk about some different GPU architectures: the “PSO”, or “Pipeline State Object”. Motivating it is one of the main design considerations of modern graphics APIs: predictable performance. Basically, the goal is that for every function call in the API, you can reliably group it into one of two categories: either it returns pretty much instantly, or it can take a long time to complete.

Previous APIs, like Direct3D 11 and OpenGL, made no such guarantees. Every call had the potential to stall, and oftentimes, unless you had experience on console platforms or were blessed with insider knowledge mostly limited to triple-A game developers, you simply had no idea when these stalls would happen. And even then, you could be in for a surprise: the stalls might change across GPU vendors, graphics driver versions, and even based on your game itself. I vividly remember debugging a gnarly graphics bug that magically disappeared when I renamed my executable from “ShooterGame.exe” to something else. As a game developer myself, I want my user experience to be as fluid as possible, only having these stalls when the user is at a loading screen, but oftentimes it would be difficult to find a reliable set of API calls to guarantee the expensive stalls happen then. And if you’ve ever played a video game, you’ve certainly had the experience of the first 10 seconds of wandering around a map being very rough, until everything is loaded in and smooths out again.

While the old APIs and driver vendors are not fully to blame for this problem, the lack of guidance and the amount of guesswork mean it’s incredibly difficult for a graphics programmer to get it right, especially when trying to ship a single executable that runs smoothly across three different GPU vendors, each with their own independent architectures and drivers.

So, what changed? If you are familiar with OpenGL, and have rendered transparent objects before, you have likely written code like the following:

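// Bind the shader program.
glUseProgram(myShaderProgram);

// Configure the vertex data.
GLint posLocation = glGetAttribLocation(myShaderProgram, "Position");
glEnableVertexAttribArray(posLocation);
glVertexAttribPointer(posLocation, 3, GL_FLOAT, GL_FALSE, sizeof(struct vertex), (const void *)offsetof(struct vertex, pos));
GLint clrLocation = glGetAttribLocation(myShaderProgram, "Color");
glEnableVertexAttribArray(clrLocation);
glVertexAttribPointer(clrLocation, 4, GL_FLOAT, GL_FALSE, sizeof(struct vertex), (const void *)offsetof(struct vertex, color));

// First, render opaque parts of the object.
glDisable(GL_BLEND); // Turn off any latent blend state.
glDepthMask(GL_TRUE); // We want to write to the depth buffer.
// glDrawElements takes an index count; 16-bit indices are assumed here.
glDrawElements(GL_TRIANGLES, opaqueTriangleCount * 3, GL_UNSIGNED_SHORT, (const void *)0);

// Now, render the transparent parts of the object.
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA); // Standard alpha blending.
glDepthMask(GL_FALSE); // Transparent surfaces should not write depth.
// The last argument is a byte offset into the bound index buffer.
glDrawElements(GL_TRIANGLES, transparentTriangleCount * 3, GL_UNSIGNED_SHORT, (const void *)(uintptr_t)transparentTriangleOffset);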

It’s important to note that while GPUs have slowly become more programmable, large portions of the pipeline still appear to be fixed-function, including things like how vertex data gets unpacked into the shader attributes, the blending modes, or whether to write to the depth buffer. I say “appear”, because different GPUs have different implementations, and some of them might be programmable. For instance, on most iOS devices, all blending is compiled into the shader. Enabling or disabling the GL_BLEND flag might require the driver to take your shader and recompile it with some extra instructions at the end to blend against the framebuffer. This can’t be done immediately inside the glEnable call, though: a later glBlendFunc call affects the shader math used to blend values against the framebuffer, so most drivers will defer this work all the way until the glDrawElements.

One might be tempted to just include blending in the shader going forward, but remember that this is specifically an iOS thing; not all devices do full blending in the shader, so moving it there isn’t universally possible. The difficulty comes not from the fact that some state changes are expensive, but that since different GPU vendors have different GPU architectures, it’s not possible to portably know what’s expensive and what’s not. In fact, while I believe Apple used to publish an official guide about what would be an expensive state change in their OpenGL ES driver, they have since taken it down, and I am not aware of any authoritative source for this information now, despite it being semi-widely known. However, we can infer that it is expensive given how they have crafted the equivalent Metal APIs, as we’ll see later.

As another example, on some Mali GPUs, the vertex attributes are fetched entirely within the shader itself, so adjusting the vertex inputs might cause a shader to recompile. However, this only applies to older Mali models. On the newest, Bifrost-based architectures, there is a native hardware block for vertex attribute fetching, and the driver doesn’t need to recompile shaders for different vertex attribute layouts; instead, it can push a few commands to the GPU to tell it to reconfigure its fixed-function hardware. Basically, almost any state might or might not affect whether the driver needs to recompile a shader, and there’s effectively no way to know up-front whether a given glDrawElements call might stall.
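To make the deferral concrete, here is a minimal toy model in C of a GL-style driver on hardware that bakes blending into the shader. Everything in it (ToyDriver, toy_draw, and friends) is hypothetical and invented for illustration; it is not how any real driver is structured, only a sketch of why the stall lands on the draw call rather than on the state setters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: none of these names are real OpenGL or driver
   entry points. The imaginary GPU compiles blend state into the shader. */

enum { MAX_VARIANTS = 16 };

typedef struct {
    int  program;       /* which user shader program is bound */
    bool blend_enabled; /* state this imaginary GPU bakes into the shader */
    int  blend_func;    /* stand-in for the glBlendFunc arguments */
} VariantKey;

typedef struct {
    VariantKey current;                /* state recorded by the setters */
    VariantKey compiled[MAX_VARIANTS]; /* shader variants built so far */
    size_t     compiled_count;
    int        compile_count;          /* number of expensive compiles */
} ToyDriver;

/* The setters only record state. No compile can happen here, because the
   final blend equation is not known until the draw call arrives. */
static void toy_use_program(ToyDriver *d, int program) { d->current.program = program; }
static void toy_set_blend(ToyDriver *d, bool enabled)  { d->current.blend_enabled = enabled; }
static void toy_set_blend_func(ToyDriver *d, int func) { d->current.blend_func = func; }

static bool key_equal(VariantKey a, VariantKey b) {
    if (a.program != b.program || a.blend_enabled != b.blend_enabled)
        return false;
    /* The blend function is irrelevant while blending is disabled. */
    return !a.blend_enabled || a.blend_func == b.blend_func;
}

/* Returns true if this draw had to compile a new shader variant,
   which is the moment the application would see a stall. */
static bool toy_draw(ToyDriver *d) {
    for (size_t i = 0; i < d->compiled_count; i++) {
        if (key_equal(d->compiled[i], d->current))
            return false; /* variant already exists: cheap draw */
    }
    assert(d->compiled_count < MAX_VARIANTS);
    d->compiled[d->compiled_count++] = d->current; /* "compile" it */
    d->compile_count++;
    return true;
}
```

In this model, the first draw with a given combination of program and blend state pays for a compile, and later draws with the same combination do not. The application only ever called cheap-looking setters, so nothing in its own code hints at where the cost will land.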

But not all is lost. The hope is that once you’ve drawn something with a certain set of states, future calls will be quicker. This is why most best-practice guides tell you to “pre-warm” your shaders and states by submitting some draw calls. If you are lucky, the developer tools or the driver’s debug output might inform you of such recompilations happening, but it’s still finicky and often difficult to track down.

Direct3D 11 appears to fare a tiny bit better, by offering blocks of state that can be created and cached. In theory, creating one of these objects could be where the expensive work takes place; however, the expense tends to come from the combinatorics of the shaders and state blocks together. Since these state blocks could still be bound at any time before a draw call, it was only marginally better in practice.

What do modern APIs offer in contrast? Instead of the mix-and-match states of previous APIs, the modern graphics APIs offer “Pipeline State Objects”, or “PSOs”, which contain all possible states that might be expensive to change at runtime, or any state that a shader compiler’s optimizer might be able to take advantage of. Calling the “CreatePipelineStateObject” function is expected to be slow, but once that’s done, all the work is done. No more driver optimizations, no background threads*, no draw-time work. The pipeline state is usable once “CreatePipelineStateObject” returns, and it doesn’t need to be touched again.
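As a rough illustration of that contract, here is an API-agnostic sketch in C. The names PipelineDesc, create_pipeline, and draw_with_pipeline are hypothetical stand-ins, not real Vulkan, D3D12, or Metal types, and the counters exist only so the behavior can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, API-agnostic sketch of the PSO contract. */

typedef struct {
    int  shader;        /* stand-in for shader bytecode */
    bool blend_enabled;
    int  blend_func;
    bool depth_write;
    int  vertex_layout; /* stand-in for attribute formats and strides */
} PipelineDesc;

typedef struct {
    PipelineDesc desc;  /* immutable snapshot of all the state */
    bool         ready;
} Pipeline;

static int g_expensive_compiles = 0; /* work done at creation time */
static int g_draws = 0;              /* draws submitted */

/* Slow by contract: every piece of state is known here, so any shader
   compilation or optimization is assumed to happen before returning. */
static Pipeline create_pipeline(const PipelineDesc *desc) {
    Pipeline p;
    p.desc = *desc;
    g_expensive_compiles++;
    p.ready = true;
    return p;
}

/* Fast by contract: the pipeline is already complete, so drawing only
   has to bind it and record the draw. */
static void draw_with_pipeline(const Pipeline *p, int index_count) {
    assert(p->ready);
    (void)index_count;
    g_draws++;
}
```

With this shape, the opaque and transparent passes from the OpenGL example would become two pipelines created at load time, and switching between them at draw time would never trigger more compilation in this model.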

You can see such structures with varying amounts of functionality in Vulkan’s VkGraphicsPipelineCreateInfo, D3D12’s D3D12_GRAPHICS_PIPELINE_STATE_DESC, and Metal’s MTLRenderPipelineDescriptor. Metal has a fairly conservative view of pipeline state, with only a handful of fields that suggest a truly expensive recompilation stage. Since Apple pretty tightly controls their hardware, and the sets of GPUs that can run Metal, it can have much more direct control over the drivers and the APIs they expose to developers. Direct3D 12 is somewhere in the middle, with a moderate number of fields.

Vulkan, owing to its ambitious vision for deployment across mobile and desktop, has the largest amount of state. Now, some of it can be dynamic, by passing certain VkDynamicState flags, but given that it’s an option, it’s perhaps assumed that it might be more expensive than baking it into the pipeline, on some platforms. The downside to this approach is that since it’s no longer possible to mix and match state blocks, and pipelines must be created up-front, more pipelines must be created by applications, and they might take longer to compile. In practice, the hope is that if a driver does not need to compile something into the shader, it will only cache the parts that are expensive, and just record a few commands to reconfigure the hardware for the rest. But you, the application developer, need to assume that every PSO creation might be expensive. Even if it might not cause draw-time stalls, the driver might like to use this opportunity to apply extra optimizations now, given that it has all the state up-front, and can more likely do expensive things now, rather than at draw-call time, where any compilation must happen quickly.

To help alleviate pipeline creation costs, modern APIs allow serializing created pipelines to binary blobs that can be saved to disk, along with loading them back in. This means that once a pipeline has been created, the application can save it off and reuse it in the future. In theory, anyway. In practice, the details can be messy.
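One source of that messiness is that a saved blob is only meaningful to the driver that produced it. Below is a minimal sketch, in C, of an application-side wrapper that refuses to reuse a blob after a GPU or driver change. The BlobHeader layout, the magic value, and both function names are assumptions invented for this example, not part of any real API, and a real application would also have to decide what else to check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header an application might prepend to a driver blob. */
typedef struct {
    uint32_t magic;
    uint32_t vendor_id;
    uint32_t device_id;
    uint32_t driver_version;
    uint32_t payload_size;
} BlobHeader;

#define BLOB_MAGIC 0x50534F31u /* arbitrary marker chosen for this sketch */

/* Writes header plus payload into out. Returns bytes written, or 0 if
   the output buffer is too small. */
static size_t blob_write(uint8_t *out, size_t out_cap, const BlobHeader *device,
                         const uint8_t *payload, uint32_t payload_size) {
    BlobHeader h = *device;
    h.magic = BLOB_MAGIC;
    h.payload_size = payload_size;
    if (out_cap < sizeof h + payload_size)
        return 0;
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, payload, payload_size);
    return sizeof h + payload_size;
}

/* Returns a pointer to the payload if the blob matches the current
   device and driver, or NULL if the caller should recompile instead. */
static const uint8_t *blob_payload_if_usable(const uint8_t *file, size_t file_size,
                                             const BlobHeader *current,
                                             uint32_t *payload_size_out) {
    BlobHeader h;
    if (file_size < sizeof h)
        return NULL; /* truncated file */
    memcpy(&h, file, sizeof h);
    if (h.magic != BLOB_MAGIC)
        return NULL; /* not one of our files */
    if (h.vendor_id != current->vendor_id ||
        h.device_id != current->device_id ||
        h.driver_version != current->driver_version)
        return NULL; /* different GPU or updated driver */
    if (h.payload_size != file_size - sizeof h)
        return NULL; /* size mismatch, likely corruption */
    *payload_size_out = h.payload_size;
    return file + sizeof h;
}
```

The fallback path matters as much as the fast path here: whenever validation fails, the application still has to create every pipeline from scratch, so the first run after a driver update can be as slow as the very first launch.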

Additionally, if one wants to just eat the cost and do all of the PSO creation at draw time, that’s possible too: just write your application code so that you call the function to create your PSO right before the draw code. Don’t feel too bad about it, either. A lot of real-world games do work this way, or they might do a hybrid approach where a certain number of “core” PSOs are compiled at load time, with the rest lazily created. Or maybe you skip the draw if the PSO isn’t finished at draw time, but send it to compile on a background thread (oh yeah! Since all of this state is self-contained and passed in directly to the API, instead of hidden behind global state, it’s a lot easier to multi-thread!). The important thing is not that it’s always fast, but that stalls are predictable and understandable: the engine is guaranteed to know when each stall will happen.
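The hybrid approach can be sketched as a small state machine per pipeline. The C below is a single-threaded simulation with hypothetical names (PsoCache, pso_try_draw, and so on); a real engine would hand the compile to an actual worker thread and would need synchronization that this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_PIPELINES = 64 };

typedef enum { PSO_MISSING, PSO_COMPILING, PSO_READY } PsoStatus;

typedef struct {
    PsoStatus status[MAX_PIPELINES]; /* indexed by a hypothetical pipeline key */
    int submitted_draws;
    int skipped_draws;
} PsoCache;

static void pso_cache_init(PsoCache *c) {
    for (size_t i = 0; i < MAX_PIPELINES; i++)
        c->status[i] = PSO_MISSING;
    c->submitted_draws = 0;
    c->skipped_draws = 0;
}

/* Load-time path: "core" pipelines are compiled behind the loading
   screen, where a stall is acceptable. */
static void pso_precompile(PsoCache *c, size_t key) {
    assert(key < MAX_PIPELINES);
    c->status[key] = PSO_READY;
}

/* Draw-time path: never blocks. If the pipeline is not ready, request a
   background compile (once) and skip this draw. */
static bool pso_try_draw(PsoCache *c, size_t key) {
    assert(key < MAX_PIPELINES);
    if (c->status[key] == PSO_READY) {
        c->submitted_draws++;
        return true;
    }
    if (c->status[key] == PSO_MISSING)
        c->status[key] = PSO_COMPILING; /* stand-in for queueing to a worker */
    c->skipped_draws++;
    return false;
}

/* Stand-in for the worker thread finishing everything in its queue. */
static void pso_finish_background_work(PsoCache *c) {
    for (size_t i = 0; i < MAX_PIPELINES; i++) {
        if (c->status[i] == PSO_COMPILING)
            c->status[i] = PSO_READY;
    }
}
```

The trade-off in this design is visual rather than temporal: an object whose pipeline is still compiling is missing for a few frames instead of the whole frame hitching. Whether that is acceptable depends on the game, which is exactly the kind of decision the engine can now make for itself.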

One thing that might seem well-intentioned, but is ultimately pointless, is Vulkan’s Pipeline Derivatives. The goal is to make certain aspects of pipeline creation cheaper by suggesting that pipelines can derive from a “template”, and change only a few state bits. However, since the application is unable to know what can be created cheaply without platform-specific knowledge, it is unable to know what state should go in the template, and what should go in the derivative. To give another motivating example, say I have a series of pipelines, with four different blend modes, and four different vertex input states. Should I create the templates with the same blend mode and change the vertex input states in my derived pipelines? Or should I create the templates with the same vertex input states and change the blend modes? Knowing what we know now about iPhone and Mali GPUs, they both have different concepts of “expensive”, and to answer the question, we need to know what kind of hardware we’re running on, which really negates the whole point. Rather than pipeline derivatives, the better option is for the driver to cache things that can be cached independently. On devices where blend modes can change cheaply, if I ask the driver to compile two PSOs that only differ in their blend mode, it should be able to notice that the shaders have not changed, and share 99% of the work.
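The four-by-four example can be worked through with a toy driver-side cache. In the C sketch below, ToyGpu, baked_key, and count_compiles are hypothetical names, and the two boolean flags are a heavy simplification of the iOS-style and older-Mali-style behavior described earlier; the point is only that the driver, not the application, knows which state belongs in the cache key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool blend_in_shader;        /* like the iOS-style GPUs described earlier */
    bool vertex_fetch_in_shader; /* like the older Mali GPUs described earlier */
} ToyGpu;

typedef struct { int shader, blend_mode, vertex_layout; } PsoDesc;

enum { MAX_COMPILED = 64 };

typedef struct {
    ToyGpu  gpu;
    PsoDesc compiled[MAX_COMPILED]; /* keys of shader variants built so far */
    size_t  compiled_count;
} ToyDriverCache;

/* Reduce a description to only the state this GPU bakes into shader
   code. State handled by fixed-function hardware is zeroed out. */
static PsoDesc baked_key(ToyGpu gpu, PsoDesc d) {
    PsoDesc k;
    k.shader = d.shader;
    k.blend_mode = gpu.blend_in_shader ? d.blend_mode : 0;
    k.vertex_layout = gpu.vertex_fetch_in_shader ? d.vertex_layout : 0;
    return k;
}

/* Returns true if creating this PSO needed a fresh shader compile. */
static bool toy_create_pso(ToyDriverCache *c, PsoDesc d) {
    PsoDesc k = baked_key(c->gpu, d);
    for (size_t i = 0; i < c->compiled_count; i++) {
        PsoDesc e = c->compiled[i];
        if (e.shader == k.shader && e.blend_mode == k.blend_mode &&
            e.vertex_layout == k.vertex_layout)
            return false; /* shares an existing compiled shader */
    }
    assert(c->compiled_count < MAX_COMPILED);
    c->compiled[c->compiled_count++] = k;
    return true;
}

/* Create all 4 blend modes x 4 vertex layouts for one shader and count
   how many of the 16 PSOs required a real compile. */
static int count_compiles(ToyGpu gpu) {
    ToyDriverCache c;
    int compiles = 0;
    c.gpu = gpu;
    c.compiled_count = 0;
    for (int blend = 0; blend < 4; blend++) {
        for (int layout = 0; layout < 4; layout++) {
            PsoDesc d = { 1, blend, layout };
            if (toy_create_pso(&c, d))
                compiles++;
        }
    }
    return compiles;
}
```

In this model, a GPU that bakes in only blending, or only vertex fetch, compiles 4 shaders for the 16 pipelines, and a GPU that bakes in neither compiles just 1. The application code is identical in every case, which is what a template-and-derivative split chosen by the application cannot achieve.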

Also, to address that asterisk from earlier, while the intention is “no more background threads”, some graphics vendors might want to do additional optimization based on how the PSO is used at runtime… shaders can be complicated and featureful enough that there might still be enough “dead code” weighing the whole thing down. For instance, a driver might realize that a texture you’re loading is, in most cases, a solid color, usually white or black, and will build a different version of the same shader that skips the texture fetch, but only if the solid color texture is bound. While Direct3D 12 originally started with a desire of “no driver background threads”, it now readily admits that this is the case in practice. Thankfully, it supplies an API to try to curtail such background threads, at least during debugging.

Anyway, that’s about it for PSOs. Next time, we’ll finally be back to discussing GPU architecture, with a look at “render passes”, along with discussing a popular trend in the mobile space, the “tiler”.