The Ultimate Guide to Matrix Multiplication and Ordering

Matrix multiplication in graphics APIs is ridiculously confusing. People constantly get tripped up by the right order to multiply their matrices in, and by row-major, column-major, pre-multiplication, post-multiplication, row vectors, column vectors, and transposing.

I plan for this to be the last resource you’ll ever need to check.

Why do we use matrices?

If matrices are so confusing, why do we even use them in the first place? Our goal in graphics is to use matrices to transform objects in space, or to transform space itself.

If we imagine a cube with some X, Y, and Z coordinates, we can translate that cube along the X axis by 5 units by adding a constant 5:

x' = x + 5

If we wanted to scale the cube by 2, we can multiply by 2:

x' = 2x

These are both very simple forms of transformations. So why go more complicated and use matrices? Well, let’s introduce rotation. Scaling and translating happen along a single axis, while rotation happens within a plane. Rotating a point around the Y axis changes the point’s values on the X and Z axis.

That means that to transform a point in a way that lets us scale, rotate, and translate it in a generic way, we need a formula that looks something like this:

x' = 1x + 7y + 9z + 1
y' = 4x + 1y + 6z + 0
z' = 3x + 8y + 2z + 3

If we assume that all of our equations will always be of the form Ax + By + Cz + D, then we can eliminate all of the “fluff” there and end up with a bag of 12 numbers:
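1 7 9 1
4 1 6 0
3 8 2 3

(Each row holds the A, B, C, D coefficients of one of the three equations above.)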

As you might imagine, this “bag of numbers” is called a matrix.

Combining Transformations

One interesting fact that isn’t at all obvious is that transforms compose differently depending on the order. Let’s go back to our simple example of scaling and translation using simple equations:

x' = 2x // Scale x by 2
x' = x + 5 // Translate x by 5

If we want to combine the two, the answer changes depending on whether we scale before translating or scale after translating. Scaling before translating means that we substitute the scaling equation into the translation one:

x' = 2x + 5 // Scaling by 2, then translating by 5

While scaling after translating means that we substitute the translation equation into the scaling one:

x' = 2(x + 5) // Translating by 5, then scaling by 2

And we can further simplify that "translating, then scaling" equation by expanding terms to help see that this result is indeed very different.

x' = 2x + 10 // Translating by 5, then scaling by 2

As long as we reduce our formula to a standard linear expression like Ax + B, we can take those coefficients and put them directly into a matrix. In fact, doing this "substitute and simplify" gives us the exact same coefficients as if we multiplied the two matrices together.

Matrix Fact #1: The standard algorithm for matrix multiplication is nothing more than a “compressed” form of writing two systems of linear equations, substituting one into the other, and simplifying all the way down.

Matrix multiplication being non-commutative is just a representation of this fact, that the order that you substitute equations into each other matters.

If it helps clear things up, you can think of matrix multiplication as something more along the lines of function application, rather than the multiplication of two scalars. If we imagine “translate by 5” as being function F and our “scale by 2” as some function G, then as we’ve just shown, F(G(x)) is not the same thing as G(F(x)).
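To make that function-application analogy concrete, here is a minimal C++ sketch (the struct and function names are my own, purely for illustration) that represents a 1-D affine transform x' = a*x + b and composes two of them by doing the substitution for us:

#include <cstdio>

// A 1-D affine transform: applies x' = a*x + b.
struct Affine1D {
    float a; // scale coefficient
    float b; // translation coefficient
};

// Substitute `inner` into `outer`: result(x) = outer(inner(x)).
// outer.a * (inner.a * x + inner.b) + outer.b simplifies to
// (outer.a * inner.a) * x + (outer.a * inner.b + outer.b).
Affine1D compose(Affine1D outer, Affine1D inner) {
    return { outer.a * inner.a, outer.a * inner.b + outer.b };
}

int main() {
    Affine1D scale2     = { 2.0f, 0.0f }; // x' = 2x
    Affine1D translate5 = { 1.0f, 5.0f }; // x' = x + 5

    Affine1D scaleThenTranslate = compose(translate5, scale2); // 2x + 5
    Affine1D translateThenScale = compose(scale2, translate5); // 2x + 10

    std::printf("scale then translate: x' = %gx + %g\n", scaleThenTranslate.a, scaleThenTranslate.b);
    std::printf("translate then scale: x' = %gx + %g\n", translateThenScale.a, translateThenScale.b);
}

The compose function is doing exactly the "substitute and simplify" step from above; full matrix multiplication is the same bookkeeping scaled up to more coefficients.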

With a full set of x, y, and z equations, you can still write it out, substitute, and simplify; however, it quickly becomes tedious for even just a handful of equations:

x' = 1x + 7y + 9z + 1
y' = 4x + 1y + 6z + 0
z' = 3x + 8y + 2z + 3

x'' = 2x' + 1y' + 3z' + 10
y'' = 1x' + 9y' + 4z' + 12
z'' = 3x' + 9y' + 6z' + 3

x'' = 2(1x + 7y + 9z + 1) + 1(4x + 1y + 6z + 0) + 3(3x + 8y + 2z + 3) + 10
y'' = 1(1x + 7y + 9z + 1) + 9(4x + 1y + 6z + 0) + 4(3x + 8y + 2z + 3) + 12
z'' = 3(1x + 7y + 9z + 1) + 9(4x + 1y + 6z + 0) + 6(3x + 8y + 2z + 3) + 3

If you took this and simplified it all the way down, we’d end up with the exact same set of coefficients that we would have gotten had we done matrix multiplication instead. But as you can imagine, as equations get more and more complex, and as we stack more and more transforms on top of each other, matrix multiplication is a lot easier than writing equations out and substituting.

If you were working with a set of transformations for an extended period of time, you would have invented matrices too!

Matrix/Matrix Multiplication

Let’s cover the easy case first, from a mathematical perspective, before we start talking about code and memory layouts. It’s well-established that Matrix/Matrix multiplication is not commutative, that is, matrix A times matrix B does not result in the same thing as matrix B times matrix A. However, Matrix/Matrix multiplication has exactly one formula, and it’s fairly easy to remember: it’s always Across times down. What do I mean by that?

Let’s multiply two 4×4 matrices together. On the left is A, and on the right is B. To get the result at any individual slot in the result, we take elements going across on the left, multiply them together with elements going down on the right, and then add those all up. In other words, each element of the result is a dot product of a row from matrix A with a column from matrix B.

Whether you’re using Vulkan or DirectX, GLSL or HLSL, row-major or column-major matrix math always respects this formula. Across times down. If there’s one thing to remember from this post, it’s that. Everything else is a corollary of that simple fact.

Matrix Fact #2: All matrices are multiplied as across times down no matter what shading language or graphics API you use.
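If it helps to see that spelled out in code, here is a minimal C++ sketch of the standard algorithm (a plain function over 4×4 arrays, not any particular library's API), where each output element is one row of A dotted with one column of B:

// C[row][col] is the dot product of row `row` of A (across)
// with column `col` of B (down).
void mat4_mul(const float A[4][4], const float B[4][4], float C[4][4]) {
    for (int row = 0; row < 4; row++) {
        for (int col = 0; col < 4; col++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) {
                sum += A[row][k] * B[k][col]; // across (A) times down (B)
            }
            C[row][col] = sum;
        }
    }
}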

What if the dimensions are different?

Let’s try to multiply a 2×4 matrix against a 4×3 matrix. Note that in standard mathematical lingo, a 2×4 matrix is a matrix with 2 rows and 4 columns.

Matrix Fact #3: the convention in mathematics is that an MxN matrix is made up of M rows and N columns.

Nonetheless, we can multiply a 2×4 matrix against a 4×3 matrix, resulting in a 2×3 matrix. Across times down.

However, if we try and multiply a 4×2 matrix against a 3×4 matrix, we run into an issue. We cannot do across times down without running off the end of one of the matrices.

This is an error. It is only possible to multiply matrices together if their “inner” dimensions match: when multiplying matrices A and B, the number of columns in A must match the number of rows in B.

Matrix/Vector Multiplication

Let’s stretch the limit here. What happens when we try to multiply a 4×4 matrix against a 4×1 matrix?

Across times down works just fine! Shading languages like HLSL and GLSL both use this formula when multiplying a matrix against a vector — they turn the vector on its side, becoming a “column vector”, or a matrix of 4×1, so that they can multiply together.

What happens when we multiply a vector against a matrix? You might expect this to be an error, as it would be illegal to multiply a 4×1 matrix against a 4×4 matrix. However, shading languages like HLSL and GLSL do something somewhat unexpected here. When multiplying a vector times a matrix, they will arrange that same vector as a 1×4 "row vector" matrix instead!

Note that vectors by themselves don’t have an inherent direction, it’s only by “arranging” them into 1xN or Nx1 matrices that they grow into “column vectors” or “row vectors”.

Matrix Fact #4:

  • Multiplying a matrix M against a vector v is equivalent to multiplying M against a new Nx1 matrix made up of v.
  • Multiplying a vector v against a matrix M is equivalent to multiplying a new 1xN matrix made up of v, against our matrix M.
  • An Nx1 matrix is called a “column vector” since it’s tall and skinny, while a 1xN matrix is called a “row vector” since it’s short and wide.

Multiplying a matrix by a vector or a vector by a matrix is sometimes called pre-multiplication or post-multiplication, but these are unclear phrases; sometimes pre-multiplication refers to the vector being on the left-hand side, and sometimes it refers to the matrix being on the left-hand side.

For clarity, whenever the vector is on the left-hand side of a matrix/vector multiplication, I prefer to call it “row-vector multiplication”, and whenever it’s on the right-hand side, I prefer to call it “column-vector multiplication”.

Useful Identities

We can take any matrix and transpose it. This effectively swaps rows and columns; what was a row is now a column, and vice versa.

Swapping a matrix like this is helpful for any number of reasons, but an important fact you can remember is that multiplying a matrix A against a matrix B is equivalent to transposing the two matrices, swapping the multiplication order, and transposing the result. Note that transposing them swaps their dimensions, so the transpose of e.g. a 3×4 matrix becomes a 4×3 matrix. Combined with the across times down rule, it should be easy to verify yourself that the calculations will be the same.

Matrix Fact #5: Given matrices A and B, A * B is the same as transpose(transpose(B) * transpose(A)).

Since vectors can become either row-vectors or column-vectors based on their usage, this means that they “automatically transpose” themselves in shading languages. So effectively, A * v simplifies down to be the same as v * transpose(A), and v * A can be written as transpose(A) * v.
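If you want to convince yourself of that numerically, here is a tiny C++ sketch (2×2 to keep it short; the helper names are mine) that computes A * v with v as a column vector and v * transpose(A) with v as a row vector, and prints the same two numbers for both:

#include <cstdio>

// Column-vector multiplication: r = A * v, with v treated as a 2x1 matrix.
void mul_mat_vec(const float A[2][2], const float v[2], float r[2]) {
    for (int row = 0; row < 2; row++)
        r[row] = A[row][0] * v[0] + A[row][1] * v[1];
}

// Row-vector multiplication: r = v * B, with v treated as a 1x2 matrix.
void mul_vec_mat(const float v[2], const float B[2][2], float r[2]) {
    for (int col = 0; col < 2; col++)
        r[col] = v[0] * B[0][col] + v[1] * B[1][col];
}

int main() {
    float A[2][2]  = { { 1, 2 }, { 3, 4 } };
    float At[2][2] = { { 1, 3 }, { 2, 4 } }; // transpose(A)
    float v[2] = { 5, 6 };

    float a[2], b[2];
    mul_mat_vec(A, v, a);  // A * v
    mul_vec_mat(v, At, b); // v * transpose(A)

    std::printf("A * v            = (%g, %g)\n", a[0], a[1]);
    std::printf("v * transpose(A) = (%g, %g)\n", b[0], b[1]);
}

Both print (17, 39): the same dot products are being computed, just arranged differently.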

Row-Major and Column-Major

So far we’ve only talked about things from the mathematical perspective. Let’s talk about the computers side of things. A 3×4 matrix consists of 12 numbers, which we’ve stored in an array. How can we unpack this flat array of numbers into a matrix? We have two options: row-major matrix packing, and column-major matrix packing.

Row-major matrix packing is probably what seems the most obvious to a programmer: we unpack each number first from left to right, then from top to bottom. One way to imagine this is that we’ve divided up the 12 numbers from our array into 3 “row vectors” stacked on top of each other. This is why it’s called “row-major”.

Conversely, column-major matrix packing runs first from top to bottom, then from left to right. We can visualize this packing order as being equivalent to four consecutive "column vectors" packed left to right, giving us the name "column-major". Note that row-major and column-major matrix packing both give us a matrix with the same shape and dimensions; the choice only changes how the array of data is interpreted.

Matrix Fact #6: Row-major and column-major are not properties of matrices by themselves and do not affect matrix multiplication; that is always across times down. Row-major and column-major are simply about the memory storage order of a matrix when loading from buffers and storing back to buffers.
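To make the distinction concrete, here is a small C++ sketch (storing a mathematical 3×4 matrix as m[row][col]; the function names are mine) showing the two ways of unpacking the same 12-element array:

// Row-major packing: consecutive array elements fill one row at a time.
void unpack_row_major(const float data[12], float m[3][4]) {
    for (int row = 0; row < 3; row++)
        for (int col = 0; col < 4; col++)
            m[row][col] = data[row * 4 + col];
}

// Column-major packing: consecutive array elements fill one column at a time.
void unpack_column_major(const float data[12], float m[3][4]) {
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 3; row++)
            m[row][col] = data[col * 3 + row];
}

Either way the result is a 3×4 matrix that multiplies exactly the same; only the mapping between array index and (row, column) position differs.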

Shading languages can control the memory packing of matrices through different means.

In HLSL, there are two different ways to affect the packing order of matrices:

  • The #pragma pack_matrix(row_major) and #pragma pack_matrix(column_major) directives in the shader source.
  • The /Zpr and /Zpc compiler command line arguments.

In GLSL, you can use the row_major and column_major layout qualifiers on uniform/buffer blocks or on individual members.

In both HLSL and GLSL, column-major packing is the default, if no other overrides here are specified.

Also note that since transposing a matrix swaps the rows and columns, it’s another way to handle the packing differences; this is what the transpose parameter to the glUniformMatrixNxMv family of functions does. For our mathematical 3×4 matrix (a GLSL mat4x3), calling glUniformMatrix4x3fv with the transpose parameter set to false will treat your 12 numbers as 4 column-vectors, while with the transpose parameter set to true it will treat your 12 numbers as 3 row-vectors.
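For example, uploading the 3×4 coefficient matrix from earlier to a GLSL mat4x3 uniform could look like this (a sketch assuming desktop OpenGL, a valid context, and a location obtained via glGetUniformLocation; OpenGL ES requires the transpose parameter to be GL_FALSE):

// The 12 coefficients packed as 3 row-vectors of 4 (row-major order).
const GLfloat row_major_data[12] = {
    1, 7, 9, 1,
    4, 1, 6, 0,
    3, 8, 2, 3,
};
// transpose = GL_TRUE tells GL the array is row-major.
glUniformMatrix4x3fv(location, 1, GL_TRUE, row_major_data);

// The same matrix packed as 4 column-vectors of 3 (column-major order).
const GLfloat column_major_data[12] = {
    1, 4, 3,
    7, 1, 8,
    9, 6, 2,
    1, 0, 3,
};
// transpose = GL_FALSE means the array is already column-major.
glUniformMatrix4x3fv(location, 1, GL_FALSE, column_major_data);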

Matrix Fact #7: Transposing a matrix is effectively equivalent to changing its packing order (I’m being a bit wishy-washy here).

Shading Language Differences

While GLSL and HLSL are similar in how they work with matrices, there are still a few things we need to cover.

The first is matrix type naming. To declare a 3×4 matrix, one with 3 rows and 4 columns:

  • In HLSL, the type is float3x4. HLSL chooses to name it as floatRxC, as is traditional.
  • In GLSL, the type is mat4x3. GLSL unfortunately chooses to name it as matCxR which clashes with existing mathematical practice.

Next up is the syntax for matrix multiplication and element-wise multiplication, sometimes called the Hadamard product. The Hadamard product only exists for two matrices of the same dimension, and is just the individual elements in each “slot” multiplied together.

  • HLSL uses mul(A, B) for standard matrix multiplication, and A * B for the Hadamard product.
  • GLSL uses A * B for standard matrix multiplication, and matrixCompMult for the Hadamard product.

When constructing matrices inside a shader, HLSL and GLSL act differently.

  • HLSL’s matrix constructor works by being supplied full row vectors at a time. For a 3×4 matrix, if you pass 12 numbers, it will first construct 3 row vectors out of each consecutive set of 4 numbers, and then stack them up on top of each other. You can also pass 3 float4 row vectors directly. This happens regardless of whatever pack_matrix pragma is set, or any compile arguments.
  • GLSL’s matrix constructor works by being supplied full column vectors at a time. For a 3×4 matrix, if you pass 12 numbers, it will first construct 4 column vectors out of each consecutive set of 3 numbers, and then line them up side by side, left to right. You can also pass 4 vec3 column vectors directly. This happens regardless of whether the matrix is tagged as row_major or column_major layout.

When indexing into matrices inside a shader, HLSL and GLSL act differently.

  • In HLSL, indexing into a matrix with matrix[0][2] will return the value in the 0th row and 2nd column. matrix[3] will return the 3rd row as a vector. As above, this happens independently of any pack settings and command line arguments.
  • In GLSL, indexing into a matrix with matrix[0][2] will return the value in the 0th column and 2nd row. matrix[3] will return the 3rd column as a vector. As above, this happens independently of any layout settings on the matrix.

These three differences above, plus other more historical artifacts, have led many to believe that HLSL (and DirectX) are row-major, while GLSL (and OpenGL) are column-major. While the two languages have different preferences in the more advanced cases of indexing and constructors, matrix multiplication works exactly the same between HLSL and GLSL, and both support either mode for packing and unpacking.

Matrix Fact #8: There are some subtle differences that affect the behavior of matrix types between HLSL and GLSL, but they don’t change how multiplication works.

Space Transformations and Associativity

Our goal with matrices in computer graphics is to transform objects between different spaces. If we have a point in world space, and we would like to have a version of that point in view space, we transform the point by what’s commonly called a view matrix. I’m going to use column-vector multiplication here for example’s sake:

vec3 view_space_P = view_matrix * world_space_P;

The matrix is in charge of taking our point and transforming it into a new space. In computer graphics, we often wrangle many spaces, and want to build large transformation chains. For instance, a common viewing transformation looks something like this:

vec3 clip_space_P = projection_matrix * view_matrix * model_matrix * model_space_P;

Here, we are chaining together a set of transformations to our point P: the model-matrix transforms from model-space to world-space, the view-matrix transforms from world-space to view-space, and the projection matrix transforms from view-space to clip-space.

(And for anyone following along, yes, putting a vec3 into a projection_matrix is not what a standard transformation looks like. I simply didn’t wish to distract this discussion with homogeneous coordinates and vec4 for the purposes of this article.)

One interesting fact is that while matrix multiplication is not commutative, it is associative: that is, (A * B) * C is the same as A * (B * C). That means that we can think of the transformation in two different, identical ways:

  • We transform our vector model_space_P by the model_matrix to get a new vector in world-space, then transform that vector by the view_matrix to get a new vector in view-space, then transform that vector by the projection_matrix to get a new vector in clip-space. That is, projection_matrix * (view_matrix * (model_matrix * model_space_P))

  • We multiply the matrices together: projection_matrix * view_matrix results in a new matrix that transforms from world-space directly to clip-space, and projection_matrix * view_matrix * model_matrix results in a new matrix that transforms from model-space directly to clip-space. That is, ((projection_matrix * view_matrix) * model_matrix) * P

One interesting fact is that when multiplying matrices together, we never end up growing the size of the matrix: when multiplying two 4×4 matrices together, the end result is always a new 4×4 matrix. This means that no matter how many space transforms we end up wanting to do, whether we want to transform between 3 spaces or between 500 spaces, we can always collapse that space transformation into a single 4×4 matrix.

But note that this new matrix now depends on which order you’re intending to do the resulting multiplication! That is, a new matrix M = projection_matrix * view_matrix * model_matrix needs to be used as M * v. Your composition order needs to match your usage order.

Matrix Fact #9: Multiplying two matrices together builds a new matrix that combines the relevant space transformations. As long as you keep the order consistent, you can combine as many matrices as you want together.

Matrix Multiplication Ordering in Practice

Enough theory. Let’s talk some practice. Like above, we have a standard transformation sequence, and we would like to transform a given position vector P. We now have two choices:

  • Column-vector multiplication, where P is on the right-hand side
vec3 clip_space_P = projection_matrix * view_matrix * model_matrix * model_space_P;
  • Row-vector multiplication, where P is on the left-hand side
vec3 clip_space_P = model_space_P * model_matrix * view_matrix * projection_matrix;

Now, remember that A * v is the same as v * transpose(A), because v changes whether it is a row-vector or column-vector based on usage. This means that for these two to represent the same calculation, the matrices must be transposed between the top and bottom lines. And since swapping the packing order is roughly equivalent to transposing, that means that we can make the top and bottom lines work by changing the packing order.

This is where the other half of the confusion comes from; historical convention is that DirectX and HLSL codebases tend to use row-vector multiplication, while OpenGL and GLSL codebases tend to use column-vector multiplication. This is sometimes expressed online as “DirectX is row-major” and “OpenGL is column-major”, however, there is nothing inherent about this in the graphics APIs or shading languages, just set through years of inertia of sample code and existing codebases.

In practice, the combination of column-vector/row-vector multiplication, column-major/row-major packing, and the existence of transpose means that you can often make two mistakes that cancel each other out, or just fiddle around with flipping multiplication order and inserting transposes until things work.

However, there are some ways to make your life easier. When dealing with matrices, try to consider the space your data starts in, and the space transformations you wish to make. The general rule is that when using column-vector multiplication with the vector on the right-hand side, we wish to have a series of space-from-space matrices where the starting space is on the right-hand side, and the resulting space is on the left-hand side. Some even prefer to name their matrices this way.

With P on the right-hand side (column-vector multiplication), you want to phrase it as a chain of “A-from-B” transformations:

vec3 clip_space_P = clip_from_view_space * view_from_world_space * world_from_model_space * model_space_P;

And with P on the left-hand side (row-vector multiplication), you want to phrase it as a chain of “B-to-A” transformations:

vec3 clip_space_P = model_space_P * model_to_world_space * world_to_view_space * view_to_clip_space;

As long as you know your spaces, and pick a convention and stick to it, you should be good to go.

Inverses

OK, we’ve figured out how to transform space one way, but what if we wanted to go backwards? What if we have a point in world-space, and we want it in model-space? That’s what an inverse matrix is for. If we have a matrix that takes us from model-space to world-space, its inverse will take us from model-space back to world-space.

When using an inverse matrix, you should still multiply in the same order, but the space transformations will be backwards.

vec3 world_space_P = world_from_model_space * model_space_P;
vec3 model_space_P = inverse(world_from_model_space) * world_space_P;

Matrix Fact #10: The inverse matrix applies the opposite space transform, but doesn’t change the multiplication order at all.

Which one should I use?

Between column-vector and row-vector multiplication, and column-major and row-major packing, we find ourselves with four choices and no obvious preference. At some level, this is a choice similar to spaces or tabs, or picking your favorite up vector. That said, my preference is column-vector multiplication and row-major packing, and my rationale is as follows:

  • Column-vector multiplication, the convention with P being a column-vector on the right-hand side, is the more traditional convention that you will see spelled out in mathematics papers. I prefer it for this reason; if you only have space in your brain for one convention, I think it simplifies the load to only consider this one. That said, given my experience working in the games industry, it is certainly the less standard convention there. Additionally, it maps more naturally to the function-application mental model: it’s easier to imagine A * B * C * v as A(B(C(v))).

  • Row-major packing, I prefer because it allows you to pack affine transform matrices more efficiently; for our model and view matrices, the last row of our matrix will be (0, 0, 0, 1). By removing this, we can fit the remaining 3 rows into three float4’s, which are a natural thing for a GPU to pack. For instance, with GLSL’s (admittedly outdated) std140 packing mode, we save 16 bytes of storage packing a 3×4 matrix as row-major over packing a 3×4 matrix as column-major.
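To make that size argument concrete, here is a small C++ sketch of what the two layouts occupy; the structs below are just CPU-side mirrors of the std140 rules for illustration, not anything GLSL requires you to write:

// Row-major 3x4: three rows, each a full float4 -> 3 * 16 = 48 bytes.
struct RowMajorMat3x4 {
    float rows[3][4];
};

// Column-major 3x4 under std140: four vec3 columns, each padded out to
// 16 bytes -> 4 * 16 = 64 bytes.
struct ColumnMajorMat3x4 {
    struct Column { float x, y, z, pad; } cols[4];
};

static_assert(sizeof(RowMajorMat3x4) == 48, "3 rows of float4");
static_assert(sizeof(ColumnMajorMat3x4) == 64, "4 padded vec3 columns");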

However, more than anything, please try to be consistent, and please document your conventions somewhere. Nothing makes a graphics programmer happier than having your matrix conventions consistently obeyed and clearly documented.

Quick Reference Table

| | HLSL | GLSL | Notes |
| --- | --- | --- | --- |
| Multiplication Order | Across times down | Across times down | |
| Matrix Type Name | floatRxC | matCxR | |
| Matrix/Matrix Multiplication | mul(A, B) | A * B | |
| Matrix/Vector Multiplication | mul(A, v) | A * v | Treats v as a column vector. |
| Vector/Matrix Multiplication | mul(v, A) | v * A | Treats v as a row vector. |
| Hadamard Multiplication | A * B | matrixCompMult(A, B) | Only works for two matrices of the exact same size. |
| Default Matrix Packing | Column-major packing | Column-major packing | |
| Changing Matrix Packing | #pragma pack_matrix, or the /Zpr and /Zpc command line arguments | layout(row_major) or layout(column_major) | |
| Constructors | float2x2(row0, row1) | mat2x2(col0, col1) | The element-wise constructor acts like you took consecutive numbers and "bunched" them up into vectors, then called the vector constructor. |
| Indexing | m[row_index][col_index] | m[col_index][row_index] | This is sometimes called "column-major" or "row-major", but that is a misconception. Majority is only about the packing order; it does not change the indices. |

How to write a renderer for modern graphics APIs

Tell me if this sounds familiar. You want to write a renderer. OpenGL seems beginner-friendly, and there’s lots of tutorials for you to get started, but you also know it’s deprecated, kind of dead, and mocked; not for “real graphics programmers”, whatever that means. Its replacement, Vulkan, seems daunting and scary to learn, tutorials seem to spin for hours without seeing a single pixel on the screen, and by the time you’re done, you’re no closer to making that low-latency twitchy multiplayer FPS game you’ve always wanted to make.

You might have heard things like “the boilerplate is mostly front-loaded. Sure, it takes 5,000 lines of code to draw one triangle, but then it only takes 5,100 lines of code to draw a giant city full of animated NPCs with flying cars and HDR and bloom and god rays and …”

Except, if you’ve tried it, especially having learned from a tutorial, that’s really not quite true. Technically, there might only still be 5,100 lines of code, but you need to change them in scary and brittle ways.

Now, I’ll be clear: as a graphics programmer, I like the newer graphics APIs like Vulkan (and Metal and Direct3D 12). But they are definitely not easy to learn. Some of this is just flat out missing documentation about how to use them — the authoritative sources of graphics API documentation are sterile, descriptive info-dumps about what the implementation is supposed to do. But the biggest missing piece, in my mind, is that there’s a level of structural planning above just the graphics APIs — ways to think about rendering and graphics that are bigger than just glEnable() and vkCmdCopyBuffer(). The biggest difference between something like OpenGL and a modern API like Vulkan is that OpenGL is forgiving enough to let you stumble around until you figure out how to write a renderer, while Vulkan makes you have a blueprint from the start. The messy boilerplate of modern APIs like Vulkan only becomes manageable when you come in with that plan from the start.

You might be wondering, if this boilerplate is going to be the same every time, why not put it in the API. And that’s a great question; future updates to Vulkan and Direct3D 12 are still trying to figure out exactly where to split this line. Programmers might say they want low-level control for the best performance, but be scared and overwhelmed by what that looks like in practice. That said, while the rough shape of this plan might look similar, the details of how everything is implemented change. There will be 5,000 lines of boilerplate, but the exact nature and structure of that boilerplate will depend massively on the specific details of your project and what goals you want to meet.

Traditionally, building up this intuition comes with a lot of practice, and often times as well, a lot of institutional knowledge, the kind you get by working on renderers built by people before you. I’ve worked on many renderers for shipping games, and I’m hoping this post will give you some insight into how I think about building a renderer, and what the different pieces are. While I might link specifics of the individual APIs, this post is intended to more be a combination of high-level structural ideas you can use across any APIs, modern or not.

At the end of the day, there are really only a handful of fundamental things any graphics platform API has to do: you need to be able to give some data to the GPU, you need to be able to render out to various render targets, and you need to make some draw calls. Let’s tackle these in reverse order.

Pedants Note: There are really only so many words to go around, and like many fields, we sometimes use generic terms in multiple different contexts. Be aware of that, and know that when I talk about a “render pass”, it might have nothing to do with some other renderer’s idea of what a “pass” might be.

Draw Calls

Let’s start with maybe the simplest thing here, but it might be unobvious if you’re new to graphics programming: the importance of draw calls. OpenGL is commonly described as a giant state machine, and there are elements of its API design that are like that, but one pretty big key is that the state only applies when you make a draw call*. In modern OpenGL, a draw call is an API call like either glDrawArrays/glDrawElements, or some of the fancier or more absurd variants. Also, for the purposes of this post, compute dispatches fall under the “draw call” banner, they just have smaller amounts of state.

* For OpenGL users, I recommend checking out the footnote at the bottom of this post.

Binding textures, vertex arrays, uniforms, and shaders, and setting things like blend state: the only way any of this eventually has an effect is when you go to make a draw call. As such, one of the first goals is to think about "what draw calls are we making, and what state would we like them to have". For instance, if I know I want to have a cube with a transparent texture on it, I know I’ll need to make a draw call with this state bound:

  • Vertex bindings to the cube mesh
  • Some bindings to the transparent texture
  • A shader that samples the texture and returns that as the color
  • Blend state to turn the object transparent
  • Depth-stencil state to make sure the object doesn’t write to the depth buffer

By thinking about what state the draw call needs to be in, we can separate out the concerns of it independently of any sort of surrounding code for the renderer or the passes. There’s a few different ways this can manifest into code, but one common way is to just literally build a structure for a draw call. This approach was taken when Epic Games rewrote their drawing pipeline for Unreal Engine 4.22 from “a complicated, template mess” (their words) to something they felt was far more flexible. Different engines might have different levels of granularity for what they put in their draw call structure, and there are plenty of tradeoffs here. For instance, do you pre-bake your PSOs, or remain flexible and cache them in the backend? How granular are your uniform updates? While bgfx exposes an imperative-like API, internally it’s implemented with a similar-looking “draw call struct”, though it definitely has different answers to these questions.
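To give a flavor of what that can look like, here is a hedged C++ sketch of a draw call struct; every type and field name below is invented for illustration, and real engines pick very different granularities (pre-baked PSOs vs. loose state, how uniforms are referenced, and so on):

#include <cstdint>

// Opaque handles into the renderer's resource tables (illustrative).
struct MeshHandle    { uint32_t index; };
struct TextureHandle { uint32_t index; };
struct ShaderHandle  { uint32_t index; };
struct BufferRange   { uint32_t buffer; uint32_t offset; uint32_t size; };

enum class BlendMode : uint8_t { Opaque, AlphaBlend, Additive };

// Everything one draw needs, with no pointer back to gameplay objects.
struct DrawCall {
    MeshHandle    mesh;         // vertex/index bindings
    ShaderHandle  shader;       // which program to run
    TextureHandle textures[4];  // material textures
    BufferRange   uniforms;     // per-draw constants (e.g. the model matrix)
    BlendMode     blend;        // blend state
    bool          depthWrite;   // depth-stencil state
    float         sortDepth;    // distance from the camera, used for sorting
};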

That said, you don’t have to directly implement draw call structs, and many do not. But at the very least, there should be a goal to separate out the gameplay-level structures from the rendering structures; the two might not even have a 1:1 mapping. A very common problem we need to solve in graphics is transparency sorting. If I have two transparent cubes on screen, the way it works is that the draw call that’s made later is the one that renders on top; this means that we need to look at both cubes and sort them by distance from the camera in order for the result to be roughly correct. It might be tempting to sort the list of Cube objects directly, but then, what if we want to place something inside the Cube — maybe the Cube is actually some sort of floating crystal with our princess inside? We now have a tree where the Cube has a Princess inside it, and trying to sort this tree quickly becomes a pain. A flat list of things to render that you can sort tends to be a better idea.

Not to mention, it’s common, and expected, for a single “object” to require multiple draw calls in the same frame. For instance, we might have an object for a single window, but that window might have its frame made out of a wood material going into the opaque pass, while the glass pane goes into the transparent pass. Both of these are going to want different textures, and different states for blend and depth/stencil, and possibly different shaders, too, so it should be obvious that this kind of a mesh is going to require at least two draw calls. To add in shadows, you’ll also need to render our little window object to the shadow map as well, totalling up three draw calls.

A very easy beginner’s mistake that I see is an architecture I call “void Player::render() renderers”, as they tend to have render methods directly attached to gameplay-level objects that set up some state, maybe bind a shader and some uniforms, and make their draw calls inline. When trying to extend this to support multiple draw calls, they tend to fall down, because now you need to call the render() method once when drawing it in opaque mode, and then again when drawing it in transparent mode, and then again as well when adding more advanced rendering features.

To avoid this, try to design for it from the start: build an architecture where an object like Player can output many different draw calls, and where those draw calls can be "routed" to different places, too. So, perhaps instead of one giant list of draw calls, you might make many different lists for different purposes. You probably want to sort your opaque objects differently than your transparent objects, and you almost certainly want lists for other things like the different shadow map passes.
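As a sketch of that routing (continuing with the invented names from above; only sortDepth matters here), objects append their draw calls into per-purpose lists, and each list is sorted with its own rules before submission:

#include <algorithm>
#include <vector>

struct DrawCall { float sortDepth; /* ...plus all the state from the earlier sketch */ };

// One list per purpose; a single object may append to several of these per frame.
struct FrameDrawLists {
    std::vector<DrawCall> shadow;        // often one list per shadow map
    std::vector<DrawCall> depthPrepass;
    std::vector<DrawCall> opaque;
    std::vector<DrawCall> transparent;
};

void sort_for_submission(FrameDrawLists& lists) {
    // Transparent draws go back-to-front so that later draws land on top.
    std::sort(lists.transparent.begin(), lists.transparent.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return a.sortDepth > b.sortDepth;
              });
    // Opaque draws are usually sorted by state or front-to-back instead.
}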

But now that we’ve made our draw calls, what do we do with them?

Render Passes

So far, we’ve been talking about drawing some triangles to the screen with some textures and shaders. And that kind of renderer can get you quite far. But you might have seen that in recent years, rendering contains a lot more stuff than just rendering triangles. This coincided with the rise of techniques like deferred shading and advanced post-processing on consumer GPUs.

OpenGL hides this fact from you a little bit, but when you render onto "the screen", you’re just rendering onto a texture. There’s no magic there, it’s just a normal old texture, and your desktop environment is in charge of taking that texture, and combining it with the textures for all the other windows on screen to form the user interface. The little window previews in Alt-Tab are using the same exact kinds of textures that are used on the transparent cube, but the textures are just from other windows instead. Due to an unfortunate API limitation of OpenGL, you can’t actually access this texture, but it’s still there, and it’s just like any other texture.

The API design of Direct3D 11 makes this a little bit more obvious, by not giving the user a way to render directly to the screen without first creating and binding a render target texture to draw to. And a render target texture is just a texture with a special flag attached.

Additionally, there’s a feature on modern GPUs called "multiple render targets", or MRT for short. Shaders and draw calls don’t have to output to only one render target texture at a time; they can actually output to up to 8 different render target textures at once. The shader has syntax to output one color to render target 0, a different color to render target 1, and so on, all the way up to render target 7. Most games don’t need that many render target textures, though; 4 is probably the most you’ll see in practice.

With this in mind, at the lowest level, a render pass is a collection of draw calls all writing to the same set of render target textures, that all execute around roughly the same time. For instance, to implement post-processing, you might first have a pass that renders your game scene to a render target texture, and then bind that texture as an input to a different draw call which applies a shader to it. In simple cases, these shaders can be things like color correction, vignettes, or film grain.

There’s one more restriction though, which is that we often need to feed the results of one pass into another. It’s generally not possible to render into a render target texture at the same time we have it bound for reading in the shader, as this would create a feedback loop, so either we switch to a different render target entirely, or we need to make a copy for ourselves.

A more complex example would be an effect like bloom. Bloom effects often work by repeatedly scaling down the input image to blur it, and then adding the results back together to achieve a Gaussian-like filter. Since each "step" in the scaling down here renders to a different render target (as the render target is getting smaller), each step would be its own pass.

Deferred shading pipelines become even more complicated here. In this case, the render target textures outputted by each pass are not just colors to be tweaked by a post-process pipeline, but specially crafted series of textures that allow future passes in the renderer to apply lighting and shading, shadows, decals, and all sorts of different core rendering techniques. If this is unfamiliar to you, my video on deferred shading in Breath of the Wild might be instructive here, but you can also find many different graphics breakdowns on games using deferred shading in the wild, one example being this breakdown of the indie hit Slime Rancher.

As more and more deferred-like approaches take over rendering, render passes and render pass management have started to become one of the central things to think about when designing a renderer. When all your render passes are put together, they form a graph — one pass creates the render target that’s consumed by a different pass. When I think about render passes, I think about the dataflow across the entire frame; passes produce and consume render targets, and inside of each pass is a list of draw calls. And as renderers grow to hundreds of passes, keeping track of passes manually can be tricky. By using the insight that the dataflow in our frame is an acyclic graph, we start to end up with new tools like Frostbite’s FrameGraph, which uses graph theory to try to simplify the pain of writing a lot of code to set up different passes and hook them together. This idea has been so popular that it’s actually quite standard and commonplace to see across many different kinds of renderers now, and there are several implementations of the concept, like AMD’s Render Pipeline Shaders, which even adds a little DSL language for you to specify all the passes in your frame.
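Here is a heavily simplified C++ sketch of the core idea (all names are mine; real systems like FrameGraph or Render Pipeline Shaders do far more, such as culling unused passes and aliasing render target memory): each pass declares what it reads and writes, and the graph is built from those declarations.

#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct RenderTargetHandle { uint32_t id; };

// One node in the frame's dataflow graph.
struct PassDesc {
    std::string name;
    std::vector<RenderTargetHandle> reads;   // render targets sampled by this pass
    std::vector<RenderTargetHandle> writes;  // render targets produced by this pass
    std::function<void()> recordDrawCalls;   // fills in the pass's draw lists
};

struct FrameGraphSketch {
    std::vector<PassDesc> passes;
    uint32_t nextId = 0;

    RenderTargetHandle declareTarget() { return RenderTargetHandle{ nextId++ }; }
    void addPass(PassDesc pass) { passes.push_back(std::move(pass)); }

    // A real implementation would topologically sort passes by their
    // read/write dependencies and place barriers; here we only keep
    // the declarations.
};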

In older APIs like Direct3D 11 and OpenGL, passes were implicit — you could switch render targets whenever you wanted, and the driver was expected to juggle things for you. But for this new breed of APIs, like Direct3D 12, Vulkan, and Metal, render passes became a first-class concept. Direct3D 12 has the simplest API design here, with BeginRenderPass just taking a number of render targets to render to, though Direct3D 12’s render passes are actually optional. Vulkan has a whole complicated way of specifying render passes, but it also has a newer extension that basically implements the Direct3D 12 API instead.

The main impetus for adding support for render passes directly to modern graphics APIs was better support for mobile GPUs, because mobile GPUs are built around something called “tile memory”, and use a very different algorithm to render compared to desktop GPUs. I’d rather not go into a big info-dump here on mobile GPUs, as I want to try to keep this post about the high-level concepts, but I will recommend ARM’s Mali GPU Training series if you would like to learn more about mobile GPU architecture, and in particular, the Frame Construction video is fairly relevant for what we’re talking about here. I especially love their use of graphviz to visualize the passes in a frame.

But even besides areas like post-processing, multiple render passes are also helpful when drawing our 3D scene as well. Much like with draw calls, it’s also common and expected for a single object to end up in multiple passes in the same frame as well. A common technique in rendering is what’s called a "depth prepass" or "Z-prepass". The idea is that if rendering a full material for the object is expensive, we can prevent a lot of overdraw and wasted pixels by first rendering out a much simpler version of the object that will only write to the depth buffer, although this is only for the parts of the object that are opaque. As you increase the number of objects you want to support, you might want to have an object render into both the depth prepass and the main color pass.

Not to mention, the opaque parts of your object might also want to go into other special passes, like our aforementioned shadow map passes. And even just for the case of our window from before, while the opaque and transparent draw calls do both use the same render targets, it’s pretty common to split these into different passes as well, since we probably want to run other post-processing passes like ambient occlusion or depth-of-field after we render all the opaque draw calls, but before we render the transparent draw calls.

As games keep becoming bigger and bigger, it’s also common to see different parts of the scene render at different resolutions as a performance tactic. You might want to render opaque objects at full-resolution, render transparent objects at half-resolution, and then render UI back at full-resolution again. The order and dataflow of passes is something you really want to keep flexible; it’s common for this order to change over the course of development, or even sometimes dynamically at runtime.

Proper render pass management is one of the biggest differences between a renderer written by someone who’s just starting out, and a renderer designed with years of experience. A big stumbling block I’ve often seen in people who are starting out is trying to retrofit shadow map passes into their renderer — shadow map passes are special, because the camera is in a different spot, you’re not writing to any color textures, you might want to use a different shader, and usually some of the rasterizer state of the draw call has to change to add on things like depth biases. A void Player::render()-style renderer is going to struggle with this. It has no holistic or central concept of different passes, since it’s only thinking about things at the object level. It’s pretty common to accidentally couple your rendering code to the assumptions of the opaque pass, and when you’re trying to add in shadow maps, this is your first exposure with multiple passes, and you probably aren’t thinking about them yet, nor how passes might connect together.

For one extra hint here, most engines that want to draw the same object into multiple passes usually attach some metadata to the concept of the pass, like the location and settings of the camera, what kind of pass it is (shadow map vs. depth prepass vs. opaque vs. transparent), how and where it should submit its draw calls, and other miscellaneous state, like any special state it should apply. This gives the code making the draw calls the right context to know whether it should draw anything in this pass, and how it should do it, and which shaders it should use.
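A sketch of what that per-pass metadata might look like (the struct and its fields are illustrative, not from any particular engine):

#include <cstdint>

enum class PassKind : uint8_t { ShadowMap, DepthPrepass, Opaque, Transparent };

// Handed to the code that generates draw calls, so it knows what kind of
// pass it is emitting into and how to configure each draw.
struct PassContext {
    PassKind kind;
    float    view[16];        // camera view matrix for this particular pass
    float    projection[16];  // camera projection for this particular pass
    float    depthBias;       // e.g. nonzero only for shadow map passes
    bool     wantsColor;      // false for shadow/depth-only passes
    // ...plus where the generated DrawCalls should be appended
};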

Render Passes and Synchronization

GPUs are designed to go fast. Really fast. A lot of different tricks are used to make GPUs go fast, but the main one is parallelism; it is possible for GPUs to render more than one triangle at once. They can also render more than one draw call at once, and in some very extreme cases, GPUs can render multiple passes in parallel, too. However, that causes a problem for us, if one pass is used to feed into another. How does the GPU know that it’s allowed to run two passes in parallel? In older APIs like Direct3D 11 and OpenGL, the graphics driver was responsible for noticing that one pass’s render target becomes the next pass’s texture, and told the GPU not to run them in parallel. Similarly, we have the same problem the other way, too: if the next pass after that starts using that texture again as a render target, then we also need to make sure not to overlap there, too. This requires tracking every single texture and render target used in every single draw call during a pass, and comparing them with previous passes, a process known as "automatic hazard tracking". As with any automagic process, it can and does go wrong — false positives are not uncommon, meaning that two passes that could overlap in theory, don’t in practice, because the driver was too cautious during its hazard tracking.

As an analogy here, imagine some sort of “automatic” parallelizing system on the CPU that guaranteed that any two jobs touching the same chunk of memory wouldn’t overlap. Tracking the “current owners” of every single possible byte address is clearly wasteful, so we have to pick a bigger granularity. The granularity here determines the probability of false positives — picking something like 8MB chunks, then any two jobs that just so happen to access memory in the same 8MB chunk would be treated as hazardous, even if their memory accesses weren’t overlapping. This is the kind of false dependency that we would like to prevent.

Not to mention, newer features like multi-threaded rendering, multi-queue, indirect drawing, and bindless resource access make it much more difficult to do automatic hazard tracking. These are increasingly becoming the building blocks of a modern renderer as we move to a GPU-driven world, and as a result, Direct3D 12 and Vulkan decided to remove automatic hazard tracking in favor of making you spell out the dependencies yourself. It’s worth noting that Metal actually kept a limited form of automatic hazard tracking which you can use if you want; however, it does limit your ability to use these newer rendering features, and also brings back the potential false positives.

Rather than directly specifying that certain passes can or can’t overlap, the design of Vulkan, Direct3D 12 and Metal all use a system known as “barriers” on individual resources. Before one can use a texture as a render target, you first must place a barrier telling it “I want to use this as a render target now”; and when you want to change it back to being used as a texture, you must place another barrier saying “use me as a texture”. This design also solves some other problems — another way that GPUs go fast is by having a lot of caches, and in order to “push” a texture along to the next stage, you really want to clear such caches. Another related problem is known as “image layouts”; rendering into a texture might want to lay out the texture data differently in memory than reading from it later inside a shader, as the memory access patterns can be different. So Vulkan has the concept of “image layouts”, which tell the driver what rough mode it should be in, with options like COLOR_ATTACHMENT_OPTIMAL (color attachment being the Vulkan name for “render target”), SHADER_READ_ONLY_OPTIMAL (for when reading from it in a shader), and so on.
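As a rough illustration, here is what one such barrier can look like in Vulkan (a minimal sketch with hard-coded stages and layouts; a real renderer would derive these from its pass graph rather than writing them by hand): transitioning a texture we just rendered into so that a later pass can sample it.

#include <vulkan/vulkan.h>

// Transition `image` from "being rendered to" to "being sampled in a shader".
void transition_to_shader_read(VkCommandBuffer cmd, VkImage image) {
    VkImageMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT; // writes from the previous pass
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;            // reads in the next pass
    barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = image;
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // wait for color writes...
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // ...before fragment shader reads
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}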

Specifying the proper barriers can be tricky to do, and some naive implementations might “over-barrier”, which can cause performance issues. Since render passes and synchronization are so directly intertwined, it’s extremely common to integrate this kind of tracking directly into the design of the render graph. In fact, one of the original motivations for Frostbite’s FrameGraph mechanism was to make these barriers a lot simpler and less error-prone. The team behind AMD’s Render Pipeline Shaders mentioned that switching games to their framework reduced over-barrier errors, and improved GPU performance overall. For some very intricate details on barriers and what they’re doing at the hardware level, I recommend checking out MJP’s series on Breaking Down Barriers.

Data, Uploads, and Synchronization

The last major thing that I think about when I think about a renderer is data management. Part of this is “boring” stuff like allocation: in order for a texture to exist on the GPU, it needs to be somewhere in the GPU’s memory. Previously, this sort of data allocation was very simple — in OpenGL or Direct3D 11, you would ask the driver for a texture, tell it how big you wanted it, and what kinds of pixels were inside, and it would give you back a resource, which you could then supply with data. This same process happens with all kinds of data: not just textures, but also vertex buffers for storing vertex data, index buffers, and so on.

This seems simple, but the details matter greatly here. In the old APIs, you could upload new data between draw calls, and it was all just supposed to work. You could upload a mesh to a vertex buffer, make a draw call, and then upload different data to the same vertex buffer, and everything would render just fine. Now, GPUs are pretty impressive pieces of hardware, but they’re not magic. If the draw calls are supposed to be running at the same time, how can we overwrite the data in a buffer while it’s in use? In practice, there are three possible answers:

  1. The draws, actually, do not overlap. The GPU will wait for the first draw call to be done, then it updates the buffer, then runs the second draw call.
  2. The driver makes a copy of the buffer, and then uploads your new data to it. The first draw call gets the original buffer, the second draw call gets the copy.
  3. There’s some other wacky hardware magic in there to "version" the buffer data across different draw calls.

The third answer does sound nice, and it’s indeed what some older graphics hardware used to do, but those days are long gone — as the number of vertices and polygons and textures and amount of data increased, it wasn’t worth it. In practice, drivers use some combination of the first two solutions, using heuristics to decide which to use at any given time. In OpenGL, when updating your buffer data with glBufferData, the idea was that GL_STATIC_DRAW would get you option #1 and stall the GPU, while using GL_DYNAMIC_DRAW would get you the silent copy behavior, a technique known as "buffer renaming". In practice, I think a lot of OpenGL drivers just stalled between draw calls, leading to the somewhat superstitious practice of "buffer orphaning", or replacing the buffer entirely between draw calls, effectively forcing a buffer rename. This optimization stretches back a long way, though. As early as Direct3D 8, buffer renaming was an officially supported optimization for updating vertex buffers.

This problem gets even thornier when considering OpenGL’s uniforms (known as constants in Direct3D). While it’s a bit of an odd case to update a single vertex buffer between draws, these uniforms often contain details like the position and rotation of the object you’re drawing, and are expected to change with every single draw call. Under the hood, they’re actually implemented as giant buffers of memory on the GPU, just like vertex buffers, a feature which later versions of OpenGL expose more directly. On the other hand, Direct3D 11 switched away from exposing the individual values, making the user upload constant buffers. For games with a lot of draw calls, these small buffer changes can add up quickly, and to implement this efficiently, buffer renaming is a must. So, when updating constant buffers, if the driver detects the constant buffer was already in use, they’d allocate a new chunk of memory, copy the old buffer into the new buffer, and then apply your updates.

Ultimately, while conceptually simple, the model requires considerable effort to use optimally in practice, and many flags and special paths had to be added to both OpenGL and Direct3D 11 to keep things running smoothly. Despite the API seeming simple, a large amount of heuristics and effort were needed behind the scenes, both of which were often unpredictable and unreliable for game developers.

Following the pattern so far, you might be able to guess what the more modern graphics APIs chose to clean up this mess. You are now in charge of allocating your own buffers, uploading to them, and crucially, you aren’t allowed to update them between draw calls anymore. If you have 100 draws that all use a different uniform value, you are in charge of allocating the 100 copies of that buffer that are needed and uploading to them yourself; though there is a small 128-byte area of constants that can be swapped out ("push constants" in Vulkan, "root constants" in Direct3D 12), usually just as a way to supply an index or two that points into a bigger chunk of buffer parameters.

In fact, not only are you not allowed to upload data between draw calls, you also need to make sure that if the buffer is currently in use by the GPU, then you shouldn’t touch it, or you risk bad things happening. Knowing when something is in use by the GPU or not might seem a bit involved, but in practice, there’s a very simple solution. The GPU can write an integer to a special location in memory every time it finishes some work, called a "fence" (Direct3D 12) or "timeline semaphore" (Vulkan), and the CPU can read this integer back. So, every frame, the CPU increments its "current frame" index, and after it submits all the passes to the GPU, it tells the GPU to write that frame index to the fence once the submitted work is done. If you know that the last time you used this buffer was on frame 63, then once the GPU writes a fence value greater than or equal to 63, you know it’s done with the buffer, and you can finally reuse it.
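Here is a sketch of that bookkeeping; the two gpu_* functions below are stand-ins I made up for the real API calls (for example a Direct3D 12 fence's Signal/GetCompletedValue pair, or a Vulkan timeline semaphore):

#include <cstdint>

// Stand-ins for the real fence API (assumed, for illustration only).
void     gpu_signal_fence(uint64_t value);  // GPU writes `value` once prior work finishes
uint64_t gpu_last_completed_value();        // last value the GPU has written so far

struct TrackedBuffer {
    uint64_t lastUsedFrame = 0;  // frame index of the last submission that used this buffer
};

struct FrameTracker {
    uint64_t currentFrame = 1;

    void end_frame() {
        // After submitting all of the frame's passes, ask the GPU to write
        // the frame index when it gets there.
        gpu_signal_fence(currentFrame);
        currentFrame++;
    }

    bool safe_to_reuse(const TrackedBuffer& buf) const {
        // The GPU is done with the buffer once the fence value has passed
        // the last frame that referenced it.
        return gpu_last_completed_value() >= buf.lastUsedFrame;
    }
};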

For some more details of the tradeoffs involved in fast, efficient, buffer management, I recommend reading through MJP’s article on GPU Memory Pools, which also links to some other great articles and resources. Additionally, if you want a Direct3D 11-style experience, AMD has fantastic open-source memory allocation libraries for both Direct3D 12 and Vulkan.

My one recommendation to help get a grip on this is that you try to architect your renderer in a way that you can do pretty much all of your data uploads at the beginning of the frame. For textures and vertex data, this isn’t too tricky, but uniform data can be trickier, since it often comes glued together with the draw calls. One common way to help with this is to build out a linear allocator — effectively, this is just a growable array of memory. When your renderer is walking the list of objects to generate draw calls for, it receives a pointer to the linear allocator too, and the code for rendering out each object allocates a chunk of memory, saves off the offset, and writes the uniform data it wants into the buffer. When it comes time to submit to the GPU, it is relatively easy to upload the whole buffer first thing, and use dynamic uniform buffer offsets to send the right region of data for each draw call; Direct3D 12 and Metal have similar offsets in their APIs when you bind them to the draw call.
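A sketch of such a linear allocator (names are mine; the mapped pointer would come from a persistently mapped upload buffer, and 256 is a common minimum uniform-buffer offset alignment rather than a universal rule):

#include <cassert>
#include <cstdint>
#include <cstring>

// Bump allocator over one big per-frame uniform buffer.
struct LinearAllocator {
    uint8_t* mapped = nullptr;  // CPU-visible pointer to the buffer's memory
    uint32_t capacity = 0;
    uint32_t head = 0;
    uint32_t alignment = 256;   // minimum offset alignment required by the API

    // Copies `size` bytes in and returns the offset to bind for this draw.
    uint32_t push(const void* data, uint32_t size) {
        uint32_t offset = (head + alignment - 1) & ~(alignment - 1);
        assert(offset + size <= capacity);
        std::memcpy(mapped + offset, data, size);
        head = offset + size;
        return offset;
    }

    void reset() { head = 0; }  // called at the start of each frame
};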

Other stuff I didn’t cover

Modern graphics is quite complicated; and I only scratched the surface of how all these parts interact. “Platform” work like this is often a full-time job (actually, often multiple these days!), and it’s not uncommon to rewrite a graphics platform layer over time. But I hope this gave you some high-level overview for how I think about rendering, and how I map it to all the different primitives provided by both old-school graphics APIs like Direct3D 11 and OpenGL, and also modern APIs like Vulkan, Direct3D 12, and Metal.

I didn’t really cover anything about the high-level techniques, like how to do lighting and shading at a large scale, or how to practically write deferred rendering or post-processing, or how to build a material system. Partly because this is something that should carry over without too much modification between OpenGL and Vulkan, but also because I think there’s a wealth of other fantastic resources out there for that sort of thing. But I’ll consider writing some more if there’s enough interest…

I also didn’t cover anything about PSOs because I feel I already covered that well enough in my last article, but I’ll give one more practical tip: don’t be too ashamed of doing things at the last minute. For instance, if you use draw call structs, you might choose to loosely put state like blend state, rasterizer state, and which shader you’re using as loose-leaf objects in your draw call struct. Most engines I am aware of do something I’ve seen called "hash-n-cache", where you then use that to build a PSO description, and use that to look it up in a hash-map. If you already have a copy of it, you can use it directly, but if not, you can create one on the fly, and then cache it for later uses. The idea is that the states will not change too much from frame to frame, so once you’re past the initial few frames of creating PSOs, pretty much everything from there should be on the hot path. The hash-n-cache technique, and caching in general, can get you surprisingly far towards getting something working right now. That said, in the extreme case, it can run into scalability issues, requiring more detailed solutions.
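Here is a sketch of that hash-n-cache pattern (every type here is invented, create_pso stands in for whatever API-specific pipeline creation you use, and the hash is deliberately simplistic):

#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct PsoDesc {
    uint32_t shader;
    uint32_t blendState;
    uint32_t rasterState;
    uint32_t renderTargetFormat;

    bool operator==(const PsoDesc& o) const {
        return shader == o.shader && blendState == o.blendState &&
               rasterState == o.rasterState && renderTargetFormat == o.renderTargetFormat;
    }
};

struct PsoDescHash {
    std::size_t operator()(const PsoDesc& d) const {
        std::size_t h = d.shader;        // simplistic mixing; use a real hash in practice
        h = h * 31 + d.blendState;
        h = h * 31 + d.rasterState;
        h = h * 31 + d.renderTargetFormat;
        return h;
    }
};

struct Pso;                             // the API-specific pipeline object
Pso* create_pso(const PsoDesc& desc);   // slow path: actually compiles a pipeline

struct PsoCache {
    std::unordered_map<PsoDesc, Pso*, PsoDescHash> cache;

    Pso* get(const PsoDesc& desc) {
        auto it = cache.find(desc);
        if (it != cache.end())
            return it->second;          // hot path: we already built this one
        Pso* pso = create_pso(desc);    // cold path: build it now and remember it
        cache[desc] = pso;
        return pso;
    }
};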

* A quick rant about OpenGL’s bind-to-modify system

Above, I said “the state only applies when you make a draw call”. This is not strictly true in OpenGL. Due to 30 years of legacy and unfortunate API design, doing anything with most objects in OpenGL first requires binding them. Want to upload new texture data? glBindTexture. Want to upload new vertex data? glBindBuffer. Some really roundabout API design also resulted in the whole glBindBuffer, glVertexAttribPointer dance, where the currently bound array buffer is “locked in” when you call glVertexAttribPointer. To be clear; this is not a constraint because of how GPUs work, this is a constraint because OpenGL consistently lacked future foresight when designing its original APIs. The original OpenGL only had support for a single texture at a time, and glBindTexture was added when it was realized that having more than one would be a good idea. Similarly, glVertexAttribPointer originally took a pointer into the CPU’s memory space (that’s why the last argument has to awkwardly be cast to a void*!), and glBindBuffer was added later once it was realized that having a handle to a chunk of GPU memory was useful.

So, in OpenGL, binding is really used for two purposes — the first is binding the object so you can modify, query, or just manipulate it in some way, and the second is to actually bind it to be used in the draw call. The second is all we should be using binding for. In modern desktop GL, there is a feature called Direct State Access, or DSA, which bit the bullet and introduced a large number of new entrypoints to eliminate the dependency on bind-to-modify style APIs, but this feature never made it to OpenGL ES or WebGL, and there are also platforms in wide deployment that still only support up to OpenGL 3.1.

The bind-to-modify API design ends up confusing a lot of people about how graphics and GPUs work, and also leads to a lot of cases where state is correctly set up for a draw call, but “accidentally” changed at the last minute — for instance, I’ve had difficult-to-find bugs in my OpenGL applications because I called glBindBuffer to reupload some data while I had a VAO bound at the time. If you are using OpenGL, I highly suggest using Direct State Access if you can. While not perfect, it helps eliminate a large class of errors. If it’s not available on your platform, I recommend writing your own state tracking layer on top that gives you “DSA-like” functionality, and helps track and look out for any ambiently bound state that it knows certain kinds of modify operations might disturb. It doesn’t have to be a huge heavyweight layer, it just has to watch out for and untangle some of the trickier state-related hazards.
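To give a concrete feel for the difference, here is a small, hedged comparison, assuming a GL 4.5 context (or ARB_direct_state_access) and purely illustrative variable names:

// Assumes a GL 4.5 context (or ARB_direct_state_access) and a loader such as glad.
void UploadData(GLuint myBuffer, GLuint myTexture,
                const void *vertexData, GLsizeiptr vertexSize,
                const void *pixels, GLsizei width, GLsizei height) {
    // Bind-to-modify: we touch the binding point just to do an upload.
    glBindBuffer(GL_ARRAY_BUFFER, myBuffer);
    glBufferSubData(GL_ARRAY_BUFFER, 0, vertexSize, vertexData);
    // GL_ARRAY_BUFFER is now left pointing at myBuffer, which later code
    // (or, for element buffers, a currently bound VAO) might not expect.

    // Direct State Access: name the object explicitly, touch no bindings.
    glNamedBufferSubData(myBuffer, 0, vertexSize, vertexData);

    // The same applies to textures.
    glTextureSubImage2D(myTexture, 0, 0, 0, width, height,
                        GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}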

The Missing Guide to Modern Graphics APIs – 2. PSOs

Today, we’ll be looking at a fairly simple but fundamental concept in modern APIs, and using it as a springboard to talk about some different GPU architectures: the “PSO”, or “Pipeline State Object”. Motivating this necessity is one of the main design considerations of modern graphics APIs: predictable performance. Basically, the goal is that for every function call in the API, you can reliably group it into one of two categories: either it returns pretty instantly, or it can take a long time to complete.

Previous APIs, like Direct3D 11 and OpenGL, made no such guarantees. Every call had the potential to stall, and often times, unless you had experience on console platforms or were blessed with insider knowledge mostly limited to triple-A game developers, you simply had no idea when these would happen. And even then, you could be in for a surprise: the stalls might change based on GPU vendor, graphics driver version, and even your game itself. I vividly remember having the experience of debugging a gnarly graphics bug that magically disappeared when I renamed my executable from “ShooterGame.exe” to something else. As a game developer myself, I want my user experience to be as fluid as possible, only having these stalls when the user is at a loading screen, but often times it would be difficult to find a reliable set of API calls to guarantee the expensive stalls happen then. And if you’ve ever played a video game, you’ve certainly had the experience of having the first 10 seconds of wandering around a map be a very rough experience, until everything is loaded in and smooths out again.

While the old APIs and driver vendors are not fully to blame for this problem, the lack of guidance and the amount of guesswork mean it’s incredibly difficult for a graphics programmer to get it right, especially when trying to ship a single executable that runs smoothly across three different GPU vendors, each with their own independent architectures and drivers.

So, what changed? If you are familiar with OpenGL, and have rendered transparent objects before, you have likely written code like the following:

// Bind the shader program.
glUseProgram(myShaderProgram);

// Configure the vertex data.
GLint posLocation = glGetAttribLocation(myShaderProgram, "Position");
glEnableVertexAttribArray(posLocation);
glVertexAttribPointer(posLocation, 3, GL_FLOAT, GL_FALSE, sizeof(struct vertex), (void *) offsetof(struct vertex, pos));
GLint clrLocation = glGetAttribLocation(myShaderProgram, "Color");
glEnableVertexAttribArray(clrLocation);
glVertexAttribPointer(clrLocation, 4, GL_FLOAT, GL_FALSE, sizeof(struct vertex), (void *) offsetof(struct vertex, color));

// First, render opaque parts of the object.
glDisable(GL_BLEND); // Turn off any latent blend state.
glDepthMask(true); // We want to write to the depth buffer.
glDrawElements(GL_TRIANGLES, opaqueTriangleCount, GL_UNSIGNED_SHORT, 0);

// Now, render the transparent parts of the object.
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glDepthMask(false);
glDrawElements(GL_TRIANGLES, transparentTriangleCount, GL_UNSIGNED_SHORT, (const void *) transparentTriangleOffset);

It’s important to note that while GPUs have slowly started to become more programmable, large portions of the pipeline appear to be fixed-function, including things like how vertex data gets unpacked into the shader attributes, the blending modes, or whether to write to the depth buffer. I say “appear”, because different GPUs have different implementations, and some of them might be programmable. For instance, on most iOS devices, all blending is compiled into the shader. Enabling and disabling the GL_BLEND flag might require the driver to take your shader and recompile it with some bits at the end to blend against the framebuffer. Though this can’t be done immediately inside the glEnable call; the glBlendFunc call that follows affects the shader math used to blend values against the framebuffer, so most drivers will defer this work all the way until the glDrawElements call.

One might be tempted to just include blending in the shader going forward, but remember that this is specifically an iOS thing; not all devices necessarily do full blending in the shader, so it isn’t universally possible to move blending into the shader. The difficulty comes not from the fact that some state changes are expensive, but that since different GPU vendors have different GPU architectures, it’s not possible to portably know what’s expensive and what’s not. In fact, while I believe Apple used to publish an official guide about what would be an expensive state change in their OpenGL ES driver, they have since taken it down, and I am not aware of any authoritative source for this information now, despite it being semi-widely known. However, we can infer that it is expensive given how they have crafted the equivalent Metal APIs, as we’ll see later.

As another example, on some Mali GPUs, the vertex attributes are fetched entirely within the shader itself. Adjusting the number of vertex inputs might cause a shader to recompile. However, this only applies to older Mali models; on the newest, Bifrost-based architectures, there is a native hardware block for vertex attribute fetching, and the driver doesn’t need to recompile shaders for different vertex attribute layouts. Instead, it can push a few commands to the GPU to tell it to reconfigure its fixed-function hardware. Basically, almost any state might or might not affect whether the driver needs to recompile a shader, and there’s effectively no way to know up-front whether a given glDrawElements call might stall.

But not all is lost. The hope is that once you’ve drawn something once with a certain set of states, future calls will be quicker. This is why most best-practice guides tell you to “pre-warm” your shaders and states by submitting some draw calls. If you are lucky, the developer tools or the driver’s debug output might inform you of such recompilations happening, but it’s still finicky and often difficult to track down.
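If it helps, a pre-warming pass can be as blunt as the sketch below: during loading, run through every program and state combination your scene will actually use, and issue a throwaway draw into a small offscreen framebuffer so that any deferred compilation happens now rather than mid-game. All of the names here are hypothetical, and whether this actually warms anything up depends entirely on the driver.

// fbo1x1 is a tiny offscreen framebuffer, dummyVAO an empty vertex array;
// programs/blendSrc/blendDst list the combinations the scene will use.
void PrewarmPipelines(GLuint fbo1x1, GLuint dummyVAO,
                      const GLuint *programs, int numPrograms,
                      const GLenum *blendSrc, const GLenum *blendDst, int numBlends) {
    glBindFramebuffer(GL_FRAMEBUFFER, fbo1x1);
    glBindVertexArray(dummyVAO);
    for (int p = 0; p < numPrograms; p++) {
        glUseProgram(programs[p]);
        for (int b = 0; b < numBlends; b++) {
            glEnable(GL_BLEND);
            glBlendFunc(blendSrc[b], blendDst[b]);
            // A throwaway draw is usually enough to make the driver finish its work.
            glDrawArrays(GL_TRIANGLES, 0, 3);
        }
    }
    glFinish(); // Make sure the work happens during the load screen, not later.
}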

Direct3D 11 appears to fare a tiny bit better, by offering blocks of state to be created and cached. In theory, creating one of these objects could be where the expensive work takes place, however the expense tends to come from the combinatorics of the shaders and state blocks together. Since these state blocks could still be bound at any time before a draw call, it was only marginally better in practice.

What do modern APIs offer in contrast? Instead of the mix-and-match states of previous APIs, the modern graphics APIs offer “Pipeline State Objects”, or “PSOs”, which contain all possible states that might be expensive to change at runtime, or any states that a compiled optimizer might be able to use as influences. Calling the “CreatePipelineStateObject” function is expected to be slow, but once that’s done, all the work is done. No more driver optimizations, no background threads*, no draw-time work. The pipeline state is usable once “CreatePipelineStateObject” returns, and it doesn’t need to be touched again.

You can see such structures with varying amounts of functionality in Vulkan’s VkGraphicsPipelineCreateInfo, D3D12’s D3D12_GRAPHICS_PIPELINE_STATE_DESC, and Metal’s MTLRenderPipelineDescriptor. Metal has a fairly conservative view of pipeline state, with only a handful of fields that suggest a truly expensive recompilation stage. Since Apple pretty tightly controls their hardware, and the set of GPUs that can run Metal, it can have much more direct control over the drivers and the APIs they expose to developers. Direct3D 12 is somewhere in the middle, with a moderate number of fields.

Vulkan, owing to its ambitious vision for deployment across mobile and desktop, has the largest amount of state. Now, some of it can be dynamic, by passing certain VkDynamicState flags, but given that it’s an option, it’s perhaps assumed that it might be more expensive than baking it into the pipeline, on some platforms. The downside to this approach is that since it’s no longer possible to mix and match state blocks, and pipelines must be created up-front, more pipelines must be created by applications, and they might take longer to compile. In practice, the hope is that if a driver does not need to compile something into the shader, it will only cache the parts that are expensive, and just record a few commands to reconfigure the hardware for the rest. But you, the application developer, need to assume that every PSO creation might be expensive. Even if it might not cause draw-time stalls, the driver might like to use this opportunity to apply extra optimizations now, given that it has all the state up-front, and can more likely do expensive things now, rather than at draw-call time, where any compilation must happen quickly.

To help alleviate and save on pipeline creation costs, modern APIs allow serializing created pipelines to binary blobs that can be saved to disk, along with loading them back in. This means that once a pipeline is created once, the application can save it off and reuse it in the future. In theory, anyway. In practice, the details can be messy.
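In Vulkan, for instance, this takes the form of VkPipelineCache. A minimal sketch of the save/load plumbing, with error handling omitted and `device` assumed to be a valid VkDevice, might look like this:

#include <vulkan/vulkan.h>
#include <stdio.h>
#include <stdlib.h>

VkPipelineCache LoadPipelineCache(VkDevice device, const char *path) {
    void *initialData = NULL;
    size_t initialSize = 0;
    FILE *f = fopen(path, "rb");
    if (f) {
        fseek(f, 0, SEEK_END);
        initialSize = (size_t) ftell(f);
        fseek(f, 0, SEEK_SET);
        initialData = malloc(initialSize);
        fread(initialData, 1, initialSize, f);
        fclose(f);
    }
    VkPipelineCacheCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
        .initialDataSize = initialSize,
        .pInitialData = initialData,
    };
    VkPipelineCache cache;
    vkCreatePipelineCache(device, &info, NULL, &cache);
    free(initialData);
    return cache;
}

void SavePipelineCache(VkDevice device, VkPipelineCache cache, const char *path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, NULL);
    void *data = malloc(size);
    vkGetPipelineCacheData(device, cache, &size, data);
    FILE *f = fopen(path, "wb");
    fwrite(data, 1, size, f);
    fclose(f);
    free(data);
}

The cache handle then gets passed as the pipelineCache argument to vkCreateGraphicsPipelines, and drivers will typically validate the blob’s header and quietly fall back to an empty cache if it came from a different device or driver version; the messiness alluded to above is mostly about how often that invalidation happens out in the wild.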

Additionally, if one wants to just eat the cost, and do all of the PSO creation at draw time, that’s possible too — just write your application code so that you call the function to create your PSO right before the draw code. Don’t feel too bad about it, either. A lot of real-world games do work this way, or they might do a hybrid approach where a certain number of “core” PSOs are compiled at load time, with the rest lazily created. Or maybe you skip the draw if the PSO isn’t finished at draw time, but send it to compile on a background thread (oh yeah! Since all of this state is self-contained and passed in directly to the API, instead of hidden behind global state, it’s a lot easier to multi-thread!). The important thing is not that it’s always fast, but that stalls are predictable and understandable — the engine is guaranteed to know when each stall will happen.

One thing that might seem well-intentioned, but is ultimately pointless, is Vulkan’s Pipeline Derivatives. The goal is to make certain aspects of pipeline creation cheaper by suggesting that pipelines can derive from a “template”, and change only a few state bits. However, since the application is unable to know what can be created cheaply without platform-specific knowledge, it is unable to know what state should go in the template, and what should go in the derivative. To give another motivating example, say I have a series of pipelines, with four different blend modes, and four different vertex input states. Should I create the templates with the same blend mode and change the vertex input states in my derived pipelines? Or should I create the templates with the same vertex input states and change the blend modes? Knowing what we know now about iPhone and Mali GPUs, they both have different concepts of “expensive”, and answering the question requires knowing what kind of hardware we’re running on, which really negates the whole point. Rather than pipeline derivatives, the better option is for the driver to cache things that can be cached independently. On devices where blend modes can change cheaply, if I ask the driver to compile two PSOs that only differ in their blend mode, it should be able to notice that the shaders have not changed, and share 99% of the work.

Also, to address that asterisk from earlier, while the intention is “no more background threads”, some graphics vendors might want to do additional optimization based on how the PSO is used at runtime… shaders can be complicated and featureful enough that there might still be enough “dead code” weighing the whole thing down. For instance, a driver might realize that a texture you’re loading is, in most cases, a solid color, usually white or black, and will build a different version of the same shader that skips the texture fetch, but only if the solid color texture is bound. While Direct3D 12 originally started with a desire of “no driver background threads”, it now readily admits that drivers do run background threads in practice. Thankfully, it supplies an API to try to curtail such background threads, at least during debugging.

Anyway, that’s about it for PSOs. Next time, we’ll finally be back to discussing GPU architecture, with a look at “render passes”, along with discussing a popular trend in the mobile space, the “tiler”.

The Missing Guide to Modern Graphics APIs – 1. Intro

There’s a certain opinion that graphics programming is a lot more difficult than it used to be. The so-called “Modern APIs”, that is, Direct3D 12, Metal, and Vulkan, introduce a new graphics programming paradigm that can be difficult and opaque to someone new to the field. Bizarrely, they can be even more difficult for someone who grew up on older APIs like “classic” OpenGL. Compared to what you might be used to, it seems surprisingly heavy, and the amount of “boilerplate” is very high.

There’s no end of “Vulkan Tutorials” out there on the web. I am not aiming to create another one. What makes these APIs hard to use is the lack of a true reference on the problem space: the conceptual understanding of what a modern GPU is, what it looks like, and what it’s best at doing. This goes hand in hand with architecting your graphics engine, as well.

When left to our own devices, fed tiny nibbles of information here and there, we’ll use our imagination and come up with our own mental model of what a GPU is and how it works. Unfortunately, the universe is against us here; a combination of decades of inaccurate marketing to a gaming populace most interested in just comparing raw specs, legitimate differences in GPU architecture between vendors, and the sheer mass of code tucked away in drivers to win benchmarks has made it incredibly difficult to build that mental model effectively. How are we supposed to understand modern APIs that promise to expose the whole GPU to us, but don’t tell us what our hardware even is? Every senior graphics programmer I’ve talked to had to go through several difficult sessions of “unlearning”, of tearing down mental models made with incomplete or outdated information.

My goal in this series is to take a certain kind of graphics programmer, one who might be familiar with old-school OpenGL, or Direct3D 11, and give them “the talk”: How and why graphics has been shifting towards these newer modern APIs, the conceptual models that senior graphics programmers think in terms of these days, and of course, what your GPU is actually doing. What, actually, is a fence in Vulkan, and how does that differ from a barrier? What’s a command queue? Can someone explain D3D12 root signatures to me? Are search engines smart enough yet to notice that this paragraph is really just a blatant attempt at exploiting their algorithms?

My ambition is for this series to become a companion piece to Fabien Giesen’s “A Trip through the Graphics Pipeline, 2011”, but focused on modern API design and newer industry developments, instead of just desktop GPU design. While not strictly a prerequisite, I consider it to be vital reading for any working graphics engineer. At the very least, I will assume you have a rough understanding of the graphics pipeline, as established by classic OpenGL.


Everyone is familiar with the general knowledge that GPUs are complex, incredibly powerful pieces of equipment, and one of their biggest benefits is that they process data in parallel. GPU programming is incredibly tricky, and quickly becoming its own specialization in the industry, powering machine learning, data science, and other important things. With large frameworks and engines like Unreal, or TensorFlow, written by massive companies with thousands of employees, I think this leads to the impression that GPUs work something like this:

A large amount of code goes into a black box, and a video game magically appears on the other side.

Let’s be clear: graphics programming is a complex, difficult field with incredible challenges to solve, active research, confusing overloaded lingo, and frustrating alienating nights as you hopelessly flip signs and transpose matrices with reckless abandon, feeling like an impostor… much like any other discipline. It will all seem like magic until something “clicks”, and often times that’s a missing principle or piece of information that nobody bothered to explain, because it seems so obvious in retrospect. And as innovation continues to march on, as the stack becomes larger and larger, we build everything new on top of the burial grounds of ideas past, and it becomes even more difficult to see the pieces for what they are. So let’s go back to the beginning… and I can think of no better place to start, than with OpenGL. OpenGL is probably the most well-known graphics API out there, and, conveniently for us, has a strict stance on backwards compatibility, letting us walk back and forward through a 20-year legacy of graphics development just by observing the different entry points in order. So let’s start at the very, very beginning. You can still pretty easily open up a window and draw the famous triangle with the following:

glBegin(GL_TRIANGLES);

glColor4f(1.0, 0.0, 0.0, 1.0);
glVertex3f(0.0, -1.0, 0.0);

glColor4f(0.0, 1.0, 0.0, 1.0);
glVertex3f(-1.0, 1.0, 0.0);

glColor4f(0.0, 0.0, 1.0, 1.0);
glVertex3f(1.0, 1.0, 0.0);

glEnd();

So, here we are, programming an infamously complex piece of equipment, with just a few lines of code. This raises a genuine question: What’s wrong with this approach? What has changed in the 20 years since, that we’ve needed to add so many more things? There are many, many answers to this line of inquiry:

  1. What if I want to draw my triangle in a different way? Like, say I want it to rotate, how do I do that? What about if I want to give it a texture? Or two?
  2. My game is rendering upwards of 1,000,000 triangles per frame. Does this scale to that amount? If not, what’s slow?
  3. What is going on under the hood? How is the GPU coming into play, other than “triangle happen”? How can I use this to inform my code’s architecture, and make reasonable guesses about what will be slow, and what will be fast?

Much like everything in engineering, these questions revolve around tradeoffs. These three lines of questioning are ones of features, performance, and transparency, respectively, weighted against the tradeoff of simplicity. If you only need to draw a handful of triangles, and maybe a few colors and textures, and you are comfortable in that environment, then maybe OpenGL is still the right option for you! For others though, there are roadblocks in the way of higher performance, and modern APIs offer alternative solutions to these problems.

The OpenGL API was invented in a time before the term “GPU” was even coined, and certainly before it was imagined that every modern computer would have one, pushing large amounts of triangles to the screen. It was designed for industrial use, on specialty servers in a closet providing real graphics horsepower, with programs like AutoCAD connected to it. But it has been kept alive over the years, with new versions coming out, each one expanding the amount of flexibility and scalability provided.

It would be far too long a post if I went over the full history of ideas tried and eventually abandoned in OpenGL, but one of the first major inventions to come along with the “GPU” was the idea of a bit of memory on the GPU itself. This is RAM, just like the RAM in your regular computer. The idea is that uploading 1,000,000 triangles from the CPU to the GPU is infeasible to do every frame, so maybe instead, we can upload the data once, and then just refer back to it later. This saves a large amount of memory bandwidth, and became the idea of “vertex buffers”. Don’t worry too much about the exact details of the code below, just note the rough steps:

// Prepare our vertex data.
struct vertex { float pos[3]; float color[4]; };
const struct vertex vertexData[] = {
    { { 0.0, -1.0, 0.0, },
    { 1.0, 0.0, 0.0, 1.0, }, },

    { { -1.0, 1.0, 0.0, },
    { 0.0, 1.0, 0.0, 1.0, }, },

    { { 1.0, 1.0, 0.0, },
    { 0.0, 0.0, 1.0, 1.0, }, },
};

// Now upload it to the GPU.
GLuint buffer;
glGenBuffers(1, &buffer);
glBindBuffer(GL_ARRAY_BUFFER, buffer);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertexData),
             vertexData, GL_STATIC_DRAW);

// Tell OpenGL how to interpret the raw data.
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, sizeof(struct vertex),
                (const void *) offsetof(struct vertex, pos));
glEnableClientState(GL_COLOR_ARRAY);
glColorPointer(4, GL_FLOAT, sizeof(struct vertex),
               (const void *) offsetof(struct vertex, color));

// Now draw our triangle. We're drawing a total of 3 vertices,
// starting at the first (0th) vertex in the data.
glDrawArrays(GL_TRIANGLES, 0, 3);

This lets us lay our data out in a consistent format up-front, and then instead of needing to feed each vertex to OpenGL separately, we can first upload a giant bundle of data, and it does the expensive transfer once, presumably during a loading screen. Whenever the GPU needs to pull a vertex out to render, it already has it in its RAM.

An important thing to note, however, is that this tradeoff only makes sense if you have some reason to believe that uploading to the GPU is expensive. We’ll come to see why this is true in future installments of the series, and the particular nuances and details of what a statement like this means, but for now, I’ll ask you to treat it as a declaration from on high: that it is true.

Now, we’ve traded off simplicity for some alternative answers to the above questions. The code has become a lot more complicated for the same exact triangle, so it might seem like a wash, but our scalability has improved, since we can draw the same triangle again just by referring to the same bit of GPU memory, and our transparency has improved; now we understand the concept of “GPU memory”. Given this addition to our mental model, we can now imagine how some piece of software might emulate the glBegin() set of APIs on top of a “memory-ful API”, by automatically allocating and filling in vertex buffers under the hood, uploading to them when necessary, and submitting the proper draw calls. One of the goals of a “graphics driver” is to be this bridge between the API that the user is programming against, and the dirty details of what the GPU is doing. We still don’t yet have a complete mental model of how the GPU accomplishes tasks, but we at least understand that it involves “memory”, if nothing else than to store our triangle’s vertices.
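As a rough illustration of that last point, a compatibility layer might emulate the immediate-mode calls with something like the following. Everything here is illustrative, and nothing like how a production driver is actually structured:

#include <GL/gl.h>   // or your GL loader of choice
#include <stddef.h>
#include <string.h>

struct ImmVertex { float pos[3]; float color[4]; };

static struct ImmVertex g_immVerts[65536];
static int    g_immCount;
static GLenum g_immMode;
static float  g_curColor[4] = { 1.0f, 1.0f, 1.0f, 1.0f };
static GLuint g_immBuffer;

void emuBegin(GLenum mode) { g_immMode = mode; g_immCount = 0; }

void emuColor4f(float r, float g, float b, float a) {
    g_curColor[0] = r; g_curColor[1] = g; g_curColor[2] = b; g_curColor[3] = a;
}

void emuVertex3f(float x, float y, float z) {
    // Capture the current color along with the position, just like glVertex3f.
    struct ImmVertex *v = &g_immVerts[g_immCount++];
    v->pos[0] = x; v->pos[1] = y; v->pos[2] = z;
    memcpy(v->color, g_curColor, sizeof(g_curColor));
}

void emuEnd(void) {
    // Dump everything we accumulated into a GPU buffer, and draw from it.
    if (!g_immBuffer)
        glGenBuffers(1, &g_immBuffer);
    glBindBuffer(GL_ARRAY_BUFFER, g_immBuffer);
    glBufferData(GL_ARRAY_BUFFER, g_immCount * sizeof(struct ImmVertex),
                 g_immVerts, GL_STREAM_DRAW);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, sizeof(struct ImmVertex),
                    (const void *) offsetof(struct ImmVertex, pos));
    glEnableClientState(GL_COLOR_ARRAY);
    glColorPointer(4, GL_FLOAT, sizeof(struct ImmVertex),
                   (const void *) offsetof(struct ImmVertex, color));
    glDrawArrays(g_immMode, 0, g_immCount);
}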

As you can imagine, the amount of code required to emulate older APIs and ideas can be quite large, and that invites bugs. If you’ve ever wondered where your triangle went after writing a lot of graphics code, you’ll come to appreciate how much detail and cruft is in APIs like this, and how many opportunities a driver has to get it wrong, or to accidentally invoke undefined behavior. As the graphics landscape has evolved, we’ve continually added additional flexibility, scalability, and transparency, and the new “modern APIs” are scorched-earth realizations of these ideals. Moving more of this complex management code from the driver to the application increases transparency, in that we now have more control over the GPU, at the expense of us application writers having to do more of what the driver was doing before.

Once you understand what the code in the driver was doing for you previously, understanding modern APIs becomes a lot easier, and it will also help you architect your graphics code in a way that lets you get the most performance out of your GPU, even on older APIs. Modern APIs run on pretty much the exact same hardware as the older ones, so it’s more about codifying and designing a new set of APIs around best practices that have been scattered around the industry for decades… albeit with some of them being more “secret knowledge”, passed from hardware vendor to game developer under some form of NDA, and spreading semi-publicly from there.

Next time, we’ll be exploring this idea of “best practices” further, by looking at the driving motivations behind one of the modern API’s newest inventions, the “PSO”. See you then.

Why are 2D vector graphics so much harder than 3D?

There’s a lot of fantastic research into 2D graphics rendering these days. Petr Kobalicek and Fabian Yzerman have been working on Blend2D, one of the fastest and most accurate CPU rasterizers on the market, with a novel JIT approach. Patrick Walton of Mozilla has explored not just one, but three separate approaches in Pathfinder, culminating in now Pathfinder V3. Raph Levien has built a compute-based pipeline based on Gan et al’s ahead-of-its-time 2014 paper on vector textures. Signed distance fields seem to be getting some further development from both Adam Simmons and Sarah Frisken independently.

One might wonder: why is there so much commotion about 2D? It seriously can’t be that much harder than 3D, right? 3D is a whole other dimension! Real-time raytracing is around the corner, with accurate lighting, and yet we can’t manage dinky 2D graphics with solid colors?

To those not well-versed in the details of the modern GPU, it’s a very surprising conclusion! But 2D graphics has plenty of unique constraints that make it a difficult problem to solve, and one that doesn’t lend itself well to parallel approaches. Let’s take a stroll down history lane and trace the path that led us here in the first place, shall we?

The rise of PostScript

In the beginning, there was the plotter. The first graphics devices to interact with computers were “plotters”, which had one or multiple pens and an arm that could move over the paper. Things were drawn by submitting a “pen-down” command, moving the arm in some unique way, possibly curved, and then submitting “pen-up”. HP, manufacturer of some of the earliest plotter printers, used a variant of BASIC called “AGL” on the host computer, which would then send commands to the plotter peripheral itself in another language like HP-GL. During the 1970s, we saw the rise of affordable graphics terminals, starting with the Tektronix 4010. It has a CRT for its display, but don’t be fooled: it’s not a pixel display. Tektronix came from the analog oscilloscope industry, and these machines work by driving the electron beam in a certain path, not in a grid-like order. As such, the Tektronix 4010 didn’t have pixel output. Instead, you sent commands to it with a simple graphing mode that could draw lines but, again, in a pen-up pen-down fashion.

Like a lot of other inventions, this all changed at Xerox PARC. Researchers there were starting to develop a new kind of printer, one that was more computationally expressive than what was seen in plotters. This new printer was based on a small, stack-based, Turing-complete language similar to Forth, and they named it… the Interpress! Xerox, obviously, was unable to sell it, so the inventors jumped ship and founded a small, scrappy startup named “Adobe”. They took Interpress with them and tweaked it until it was no longer recognizable as Interpress, and they renamed it PostScript. Besides the cute, Turing-complete stack language it comes with to calculate its shapes, the original PostScript Language Reference lays out an Imaging Model in Chapter 4, near-identical to the APIs we widely see today. Example 4.1 of the manual has a code example which can be translated to HTML5 <canvas> nearly line-by-line.

/box {                  function box() {
    newpath                 ctx.beginPath();
    0 0 moveto              ctx.moveTo(0, 0);
    0 1 lineto              ctx.lineTo(0, 1);
    1 1 lineto              ctx.lineTo(1, 1);
    1 0 lineto              ctx.lineTo(1, 0);
    closepath               ctx.closePath();
} def                   }
                        
gsave                   ctx.save();
72 72 scale             ctx.scale(72, 72);
box fill                box(); ctx.fill();
2 2 translate           ctx.translate(2, 2);
box fill                box(); ctx.fill();
grestore                ctx.restore();

This is not a coincidence.

Apple’s Steve Jobs had met the Interpress engineers on his visit to PARC. Jobs thought that the printing business would be lucrative, and tried to simply buy Adobe at birth. Instead, Adobe countered and eventually sold a five-year license for PostScript to Apple. The third pillar in Jobs’s printing plan was funding a small startup, Aldus, which was making a WYSIWYG app to create PostScript documents, “PageMaker”. In early 1985, Apple released the first PostScript-compliant printer, the Apple LaserWriter. The combination of the point-and-click Macintosh, PageMaker, and the LaserWriter singlehandedly turned the printing industry on its head, giving way to “desktop publishing” and solidifying PostScript’s place in history. The main competition, Hewlett-Packard, would eventually license PostScript for its competing LaserJet series of printers in 1991, after consumer pressure.

PostScript slowly moved from being a printer control language to a file format in and of itself. Clever programmers noticed the underlying PostScript sent to the printers, and started writing PostScript documents by hand, introducing charts and graphs and art to their documents, with the PostScript evaluated for on-screen display. Demand sprung up for graphics outside of the printer! Adobe noticed, and quickly rushed out the Encapsulated PostScript format, which was nothing more than a few specially-formatted PostScript comments to give metadata about the size of the image, and restrictions about using printer-centric commands like “page feed”. That same year, 1985, Adobe started development on “Illustrator”, an application for artists to draw Encapsulated PostScript files through a point-and-click interface. These files could then be placed into Word Processors, which then created… PostScript documents which were sent to PostScript printers. The whole world was PostScript, and Adobe couldn’t be happier. Microsoft, while working on Windows 1.0, wanted to create its own graphics API for developers, and a primary goal was making it compatible with existing printers so the graphics could be sent to printers as easily as a screen. This API was eventually released as GDI, a core component used by every engineer during Windows’s meteoric rise to popularity in the 90s. Generations of programmers developing for the Windows platform started to unknowingly equate “2D vector graphics” with the PostScript imaging model, cementing its status as the 2D imaging model.

The only major problem with PostScript was its Turing-completeness — viewing page 86 of a document means first running the script for pages 1-85. And that could be slow. Adobe caught wind of this user complaint, and decided to create a new document format that didn’t have these restrictions, called the “Portable Document Format”, or “PDF” for short. It threw out the programming language — but the graphics technology stayed the same. A quote from the PDF specification, Chapter 2.1, “Imaging Model”:

“At the heart of PDF is its ability to describe the appearance of sophisticated graphics and typography. This ability is achieved through the use of the Adobe imaging model, the same high-level, device-independent representation used in the PostScript page description language.”

By the time the W3C wanted to develop a 2D graphics markup language for the web, Adobe championed the XML-based PGML, which had the PostScript graphics model front and center:

“PGML should encompass the PDF/PostScript imaging model to guarantee a 2D scalable graphics capability that satisfies the needs of both casual users and graphics professionals.”

Microsoft’s competing format, VML, was based on GDI, which as we know was based on PostScript. The two competing proposals, both still effectively PostScript, were combined to make up W3C’s “Scalable Vector Graphics” (“SVG”) technology we know and love today.

Even though it’s old, let’s not pretend that the innovations PostScript brought to the world are anything less than a technological marvel. Apple’s PostScript printer, the LaserWriter, had a CPU twice as powerful as the Macintosh that was controlling it, just to interpret the PostScript and rasterize the vector paths to points on paper. That might seem excessive, but if you were already buying a fancy printer with a laser in it, the expensive CPU on the side doesn’t seem so expensive in comparison. In its first incarnation, PostScript invented a fairly sophisticated imaging model, with all the features that we take for granted today. But the most powerful, wowing feature? Fonts. Fonts were, at the time, drawn by hand with ruler and protractor, and cast onto film, to be printed photochemically. In 1977, Donald Knuth showed the world what could be with his METAFONT system, introduced together with his typesetting application TeX, but it didn’t catch on. It required the user to describe fonts mathematically, using brushes and curves, which wasn’t a skill that most type designers really wanted to learn. And the fancy curves turned into mush at small sizes: the printers of the time did not have dots small enough, so they tended to bleed and blur into each other. Adobe’s PostScript proposed a novel solution to this: an algorithm to “snap” these paths to the coarser grids that printers had. This is known as “grid-fitting”. To prevent the geometry from getting too distorted, they allowed fonts to specify “hints” about what parts of the geometry were the most important, and how much should be preserved.

Adobe’s original business model was to sell this font technology to people that make printers, and sell special recreations of fonts, with added hints, to publishers, which is why Adobe, to this day, sells their versions of Times and Futura. Adobe can do this, by the way, because fonts, or, more formally, “typefaces”, are one of five things explicitly excluded by US Copyright Law, since they were originally designated as “too plain or utilitarian to be creative works”. What is sold and copyrighted instead is the digital program that reproduces the font on the screen. So, to prevent people from copying Adobe’s fonts and adding their own, the Type 1 Font format was originally proprietary to Adobe and contained “font encryption” code. Only Adobe’s PostScript could interpret a Type 1 Font, and only Adobe’s Type 1 Fonts had the custom hinting technology allowing them to be visible at small sizes.

Grid fitting, by the way, was so universally popular that when Microsoft and Apple were tired of paying licensing fees to Adobe, they invented an alternate method for their competing font file format, TrueType. Instead of specifying declarative “hints”, TrueType gives the font author a complete Turing-complete stack language so that the author can control every part of grid-fitting (coincidentally avoiding Adobe’s patents on declarative “hints”). For years, the wars between Adobe-backed Type 1 and TrueType raged on, with font foundries stuck in the middle, having to provide both formats to their users. Eventually, the industry reached a compromise: OpenType. But rather than actually decide a winner, they simply plopped both specifications into one file format: Adobe, now in the business of selling Photoshop and Illustrator rather than Type 1 Fonts, removed the encryption bits, gave the format a small amount of spit polish, and released CFF / Type 2 fonts, which were grafted into OpenType wholesale as the cff table. TrueType, on the other hand, got shoved in as glyf and other tables. OpenType, while ugly, seemed to get the job done for users, mostly by a war of endurance: just require that all software supports both kinds of fonts, because OpenType requires you to support both kinds of fonts.

Of course, we’re forced to ask: if PostScript didn’t become popular, what might have happened instead? It’s worth looking at some other alternatives. The previously mentioned METAFONT didn’t use filled paths. Instead, Knuth, in typical Knuth fashion, rigorously defines in his paper Mathematical Typography the concept of a curve that is “most pleasing”. You specify a number of points, and some algorithm finds the one correct “most pleasing” curve through them. You can stack these paths on top of each other: define such a path as a “pen”, and then “drag the pen” through some other path. Knuth, a computer scientist at heart, managed to introduce recursion to path stroking. Knuth’s thesis student, John Hobby, designed and implemented algorithms for calculating the “most pleasing curve”, the “flattening” of the nesting of paths, and rasterizing such curves. For more on METAFONT, curves, and the history of font technology in general, I highly recommend the detailed reference of Fonts & Encodings, and the papers of John D. Hobby.

Thankfully, the renewed interest in 2D graphics research means that Knuth and Hobby’s splines are not entirely forgotten. While definitely arcane and non-traditional, they recently made their way into Apple’s iWork Suite where they are now the default spline type.

The rise of triangles

Without getting too into the math weeds, at a high-level, we call approaches like Bezier curves and Hobby splines implicit curves, because they are specified as a mathematical function which generates the curve. They are smooth functions which look good at any resolution and zoom level, which happen to be good traits for a 2D image designed to be scalable.

2D graphics started and maintained forward momentum around these implicit curves, by near necessity in their use in modelling human letterforms and glyphs. The hardware and software to compute these paths in real-time was expensive, but since the big industry push for vector graphics came from the printing industry, most of the rest of the existing industrial equipment was already plenty more expensive than the laser printer with the fancy CPU.

3D graphics, however, had a very different route. From the very beginning, the near-universal approach was to use straight-edged polygons, often times marked up and entered into the computer by hand. Not all approaches used polygons, though. The 3D equivalent of an implicit curve is an implicit surface, made up of basic geometric primitives like spheres, cylinders and boxes. A perfect sphere with infinite resolution can be represented with a simple equation, so for organic geometry, it was a clear winner over the polygon look of early 3D. MAGI was one of a few companies pushing the limits of implicit surfaces, and combined with some clever artistic use of procedural textures, they won the contract with Disney to design the lightcycle sequences for the 1982 film Tron. Unfortunately, though, that approach quickly fell by the wayside. The number of triangles you could render in a scene was skyrocketing due to research into problems like “hidden surface removal” and faster CPUs, and for complex shapes, it was a lot easier for artists to think about polygons and vertices they could click and drag, rather than use combinations of boxes and cylinders to get the look they wanted.

This is not to say that implicit surfaces weren’t used in the modelling process. Tools like Catmull-Clark subdivision were a ubiquitous industry standard by the early 80s, allowing artists to put a smooth, organic look on otherwise simple geometry. Though Catmull-Clark wasn’t even framed as an “implicit surface” that can be computed with an equation until the early 2000s. Back then, it was seen as an iterative algorithm: a way to subdivide polygons into even more polygons.

Triangles reigned supreme, and so followed the tools used to make 3D content. Up-and-coming artists for video games and CGI films were trained exclusively on polygon mesh modellers like Maya, 3DS Max and Softimage. As the “3D graphics accelerator” came onto the scene in the late ’80s, it was designed to accelerate the existing content out there: triangles. While some early GPU designs like the NVIDIA NV1 had some limited hardware-accelerated curve support, it was buggy and quickly dropped from the product line.

This culture mostly extends into what we see today. The dominant 2D imaging model, PostScript, started with a product that could render curves in “real-time”, while the 3D industry ignored curves as they were difficult to make work, relying on offline solutions to pre-transform curved surfaces into triangles.

Implicit surfaces rise from the dead

But why were implicit curves able to be done in real-time in 2D on a printer in the 80s, while 3D implicit curves were still too buggy in the early ’00s? Well, one answer is that Catmull-Clark is a lot more complicated than a Bezier curve. Bezier curves do exist in 3D, where they are known as B-splines, and they are computable, but they have the drawback that they limit the ways you can connect your mesh together. Surfaces like Catmull-Clark and NURBS allow for arbitrarily connected meshes to empower artists, but this can lead to polynomials greater than the fourth degree, which tend to have no closed-form solution. Instead, what you get are approximations based on subdividing polygons, like what happens in Pixar’s OpenSubdiv. If someone ever finds an analytic closed-form solution to root-finding for either Catmull-Clark or NURBS, Autodesk will pay a lot of money for it, for certain. Compared to these, triangles seem a lot nicer: simply compute three linear plane equations and you have yourself an easy test.

… but what if we don’t need an exact solution? That’s exactly what graphics developer of incredible renown Íñigo Quílez asked when doing research into implicit surfaces again. The solution? Signed distance fields. Instead of telling you the exact intersection point of the surface, it tells you how far away you are from a surface. Analogous to an analytically computed integral vs. Euler integration, if you have the distance to the closest object, you can “march” through the scene, asking how far away you are at any given point and stepping that distance. Such surfaces have seen a brand new life through the demoscene and places like Shadertoy. A twist on the old MAGI approach to modelling brings us incredible gems like Quílez’s Surfer Boy, calculated in infinite precision like an implicit surface would be. You don’t need to find the algebraic roots of Surfer Boy, you just feel it out as you march through.
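To make the “march through the scene” idea concrete, here is a tiny CPU-side sketch of sphere tracing against a single sphere. In practice this logic lives in a fragment shader and the distance function describes a whole composed scene, but the loop is the same:

#include <math.h>
#include <stdbool.h>

// Distance from point (px, py, pz) to the surface of a sphere of radius r at
// the origin. Negative inside, positive outside, hence "signed" distance.
static float sdSphere(float px, float py, float pz, float r) {
    return sqrtf(px * px + py * py + pz * pz) - r;
}

// March from `origin` along `dir` (assumed normalized). Because the SDF tells
// us the distance to the *closest* surface, we can always safely step that far.
static bool raymarch(const float origin[3], const float dir[3], float *tHit) {
    float t = 0.0f;
    for (int i = 0; i < 128; i++) {
        float px = origin[0] + dir[0] * t;
        float py = origin[1] + dir[1] * t;
        float pz = origin[2] + dir[2] * t;
        float d = sdSphere(px, py, pz, 1.0f);
        if (d < 0.001f) { *tHit = t; return true; } // Close enough: call it a hit.
        t += d;                                     // Step by the safe distance.
        if (t > 100.0f) break;                      // Flew past the whole scene.
    }
    return false;
}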

The difficulty, of course, is that only a legitimate genius like Quílez can create Surfer Boy. There’s no existing tooling for signed-distance-field geometry; it’s all code. That said, given the exciting resurgence of implicit surfaces for their organic, curved look, there’s now plenty of interest in the technique. MediaMolecule’s PS4 game Dreams is a content-creation kit built around combining implicit surfaces, requiring them to tear down and reinvent most of traditional graphics in the process. It’s a promising approach, and the tools are intuitive and fun. Oculus Medium and unbound.io are also putting good research into the problem. It’s definitely a promising look into what the future of 3D graphics and next-generation tools might look like.

But some of these approaches are less adaptable to 2D than you might think. Common 3D game scenes tend to have lush materials and textures but low geometry counts, as many critics and snake-oil salesmen are quick to point out. This means that we can rely on smaller amounts of anti-aliasing, since silhouettes are not as important. Approaches like 4x MSAA might cut the mustard for a lot of games, but for small fonts with solid colors, instead of 16 fixed sample locations, you would much rather compute the exact area under the curve for each pixel, giving you as much resolution as you want.

Rotating the viewport around in a 3D game causes something similar to saccadic masking as your brain re-adjusts to the new view. For a lot of games, this can help hide artifacts in post-processing effects like temporal antialiasing, which Dreams and unbound.io heavily lean on to get good performance out of their scenes. Conversely, in a typical 2D scene, we don’t have this luxury of perspective, so attempting to use those techniques will make our glyphs and shapes boil and jitter with those artifacts in full glory. 2D is viewed differently, and the expectations are higher. Stability is important as you zoom, pan, and scroll.

None of these effects are impossible to implement on a GPU, but they do show a radical departure from “3D” content, with different priorities. Ultimately, 2D graphics rendering is hard because it’s about shapes — accurate letterforms and glyphs — not materials and lighting, which is mostly a solid color. GPUs, through a consequence of history, chose not to focus on real-time implicit geometry like curves, but instead on everything that goes inside them. Maybe in a world where PostScript didn’t win, we would have a 2D imaging model that didn’t have Bezier as a core realtime requirement. And maybe in a world where triangles were replaced with better geometry representations sooner, we would see content creation tools focus on 3D splines, and GPUs that have realtime curves built right into the hardware. It’s always fun to imagine, after all.

Six Years of noclip.website

I’ve always had a love for the art in video games. Sure, all mediums have some ability to craft worlds from nothing, but none are so realized, so there, as the worlds in video games. It’s also an example of the interplay between “pure art” and computer technology. And in the case of the GameCube/Wii, technology from 1999, even! Yes, smart engineers played a large role in building game engines and tooling, but artists are very rarely appreciated, even though they are the ones responsible for what you see, what you hear, and often how you feel during a specific section. Not only do they model and texture everything, they control where the postprocessing goes, where the focus is pulled in each shot, and tell the programmers how to tweak the lighting on the hair. A good artist is a “problem solver” in much the same way that a good engineer is. They are the ones who decide how the game looks, from shot and scene composition to materials and lighting. The careful eye of a good art director is what turns a world from a pile of models into something truly special. Their work has touched the lives of so many over the past 30 years of video game history.

noclip.website, my side project, is a celebration of the incredible work that video game artists have created. It lets you explore some video game maps with a free camera and see things from a new perspective. It’s blown up to moderate popularity now, and as it’s now reaching 6 years of development from me in some ways, I figured I should say a few words about its story and some of the experiences I’ve had making it.

My first commit to an ancient noclip predecessor, bmdview.js, was made on April 11th, 2013. I was inspired by amnoid’s bmdview tool and my own experiments with it. I had never written any OpenGL or graphics code before, but I managed to stumble through it with trial and error and persistence.

This is the first screenshot I can find of bmdview.js with something recognizable being drawn. It’s the Starship Mario from Super Mario Galaxy 2.

After this resounding initial success, I… put it down, apparently. I had a lot of side projects at the time: learning about GIF/JPEG compression, about the X Window System, and doing some game reverse engineering of my own. I revisited it a few times over the years, but a “complete emulation” always seemed out of reach. There were too many bugs and I really barely knew anything about the GameCube’s GPU.

A few years or so later, I was inspired enough to create an Ocarina of Time viewer, based off the work that xdaniel did for OZMAV. I can’t remember too many details; it’s still live and you can go visit it, but I did have to un-bit-rot it for this blog post (JS API breakage; sigh). I really enjoyed this, so I continued with a few more tools, until I had the idea to combine them all. First came Super Mario 64 DS in October of 2016, and then zelview.js was added 20 days after. I don’t have time to recount all of the incredible people whose work I built on or who have contributed to the project directly; the credits page should list most of them.

Keeping Momentum

Keeping a side-project going for this long requires momentum. When I started the project, I decided that I would try as hard as possible to prevent refactors. Refactoring code kills momentum. You do not want to write more code that will be churned by the refactor, so you stop all your progress while you get it done, and of course, as you change the code to fit this new idea you have, you encounter resistance and difficulty, and that code has to change too. Or maybe you find your new idea doesn’t fit as well in all cases, so the resulting code is still as ugly as when you started. And you get discouraged and it’s not fun any more, so you decide not to work on it for a bit. And then “a bit” slowly becomes a year, and you get less guilty about not touching it ever again. It’s a story that’s happened to me before. *cough*.

In response, I optimized the project around me, and my goals. First up, it’s exciting, it’s thrilling to get a game up on screen. There’s nothing like seeing those initial hints of a model come through, even if it’s all contorted. It’s a drive to keep pushing forward, to know that you’re part of the way there. Arranging the code so I can get that first hint of a game up on the screen ASAP is a psychological trick, but it’s an effective one. That rush has never gotten old, even after the 20 different games I’ve added.

Second, I’m probably the biggest user of my own site. Exploring the nooks and crannies of some nostalgic game is something I still do every day, despite having done it for 6 years. Whether it’s trying to figure out how they did a specific effect, or getting a new perspective on past nostalgia gone by, I’m always impressed with the ingenuity and inventiveness of game artists. So when I get things working, it’s an opportunity to explore and play. This is fun to work on not just because other people like it, but because I like it. So, when people ask me if I will add so-and-so game, the answer is: probably not, unless you step up to contribute. If I haven’t played it, it makes it harder for me to connect with it, and it’s less fun for me to work on.

Third, I decided that refactors were off-limits. To help with this, I wanted as few “abstractions” and “frameworks” as possible. I should share code when it makes sense, but never be forced to share code when I do not want to. For this, I took an approach inspired by the Linux kernel and built what they call “helpers” — reusable bits and bobs here and there that help cut down on boilerplate and common tasks, but are used on an as-needed basis. If you need custom code, you outgrow the training wheels, from the helpers to your own thing, perhaps by copy/pasting the helper and then customizing it. Both tef and Sandi Metz have explored this idea before: code is easier to write than it is to change. When I need to change an idea that did not work out well, I should be able to do it incrementally — port small pieces of the codebase over to the new idea, and keep the old one around. Or just delete ideas that did not work out, without large changes to the rest of the code.

As I get older, I realize that the common object-oriented languages make it difficult to share code in better ways, and too easily lock you into the constraints of your own API. When an API requires a class, it takes less effort to squeeze your own code into extending a base class than to create a new file for an interface, put what you need in there, make your class implement the interface, and then try to share code and build the helpers. TypeScript’s structurally typed interfaces, where any class can automatically match the interface, make it easy to define new ones. And simply by not requiring a separate file for an interface, TypeScript makes them a lot easier to create, and so you end up with more of them. Small changes to how we perceive something as “light” or “heavy” make a difference for which one we reach for to solve a problem.

I am not saying that these opinions are the absolute right way to build code. There are tradeoffs involved: the amount of copy/paste code is something I can keep in my head, but someone else trying to maintain it might be frustrated by the amount of different places they might have to fix a bug. What I am saying is that this was originally intended as my side-project site, and so the choices I made were designed to keep me happy in my spare time. A team or a business? Those have different goals, and might choose different tradeoffs for their organization and structure. I do think I have had tremendous success with them.

The Web as a Platform

My hobby is breaking the web. I try to push it to its limits, and build experiences that people thought should never have been possible. This is easier than you might think, because there’s a tremendous amount of low-hanging fruit in web applications. There’s a lot that’s been said on performance culture, premature optimization, and new frameworks that emphasize “programmer velocity” at the expense of performance. That said, my experience as a game developer does cloud my judgement a bit. We’re one of the few industries where we are publicly scrutinized by an incessant and very often rude customer base, over numbers and metrics. Anything less than 30fps and you will be lambasted for “poor optimization”. Compare that to business software, where I’m not even sure my GMail client can reach 10fps. I recently discovered that Slack was taking up 10GB of memory.

A poor performance culture is baked into the web platform itself, through not just the frameworks, but the APIs and new language features as well. Garbage collection exists, but its cost is modelled as free. Pauses caused by the garbage collector are the number one cause of performance issues that I’ve seen. Objects like Promises can cause a lot of GC pressure, but in most JavaScript libraries found on npm they are created regularly, sometimes multiple times per API call. The iterator protocol is an absurd notion that creates a new object on every iteration because they did not want to add a hasNext method (too Java-y?) or use Python’s StopIteration approach. Its performance impact can be even worse — I’ve measured up to a 10fps drop just because I used a for…of to iterate over 1,000 objects per frame. We need to hold platform designers accountable for decisions that affect performance, and build APIs that take the garbage collector into account. Why can’t I reuse a Promise object in an object pool? Does this interface need a special dictionary, or can I get away without it?

Writing performant code is not hard. Yes, it takes care and attention and creative problem solving, but aren’t those the things that inspired you to the craft of computer science in the first place? You don’t need more layers, you don’t need more dependencies, you just need to sit down, write the code, and profile it.

Oh, by the way. I recommend TypeScript. It’s one of the few things I truly love about modern web development. Understand where it’s polyfilling for you, and definitely don’t always trust the compiler output without verifying it yourself, but it has earned my trust enough over time for it to be worth the somewhat minor pains it adds to the process.

WebGL as a Platform

OpenGL is dreadful. I say this as a former full-time Linux engineer who still has a heart for open-source. It’s antiquated, full of footguns and traps, and you will fall into them. As of v4.4, OpenGL gives you enough ways to avoid all of the bad ideas, but you still need to know which ideas are bad and why. I might do a follow-up post on those eventually. OpenGL ES 3.0, the specification that WebGL2 is based on, unfortunately has very few of those fixes. It’s an API very poorly suited for underpowered mobile devices, which is where it ended up being deployed.

OpenGL optimization, if you do not understand GPUs, can be a lot like reading the tea leaves. But the long and short of it is “don’t do anything that would cause a driver stall, do your memory uploads as close together as you can, and change state as little as possible”. Checking whether a shader had any errors compiling? Well you’ve just killed any threading the driver had. glUniform? Well, that’s data that’s destined for the GPU — you probably want to use a uniform buffer object instead so you can group all the parameters into one giant upload. glEnable(GL_BLEND)? You just recompiled your whole shader on Apple iDevices.
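To make the glUniform point concrete, here’s a hedged sketch of the uniform-buffer version of that advice in WebGL 2, assuming you already have a gl context and a linked program (the block name “MaterialParams”, the binding point, and the sizes are made up for illustration):

// One-time setup: tie the shader's uniform block to binding point 0, and
// create a buffer big enough to back it (256 bytes = 64 floats here).
const blockIndex = gl.getUniformBlockIndex(program, "MaterialParams");
gl.uniformBlockBinding(program, blockIndex, 0);

const ubo = gl.createBuffer();
gl.bindBuffer(gl.UNIFORM_BUFFER, ubo);
gl.bufferData(gl.UNIFORM_BUFFER, 256, gl.DYNAMIC_DRAW);
gl.bindBufferBase(gl.UNIFORM_BUFFER, 0, ubo);

// Per draw (or per frame): pack every parameter into one typed array and
// upload it in a single call, instead of many scattered uniform calls.
const params = new Float32Array(64);
// ... fill params with colors, matrices, and so on ...
gl.bindBuffer(gl.UNIFORM_BUFFER, ubo);
gl.bufferSubData(gl.UNIFORM_BUFFER, 0, params);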

And, of course, this is to say nothing of the quirks that each driver has. OpenGL on Mac has a large number of issues and bugs, and, yes, those extend to the web too. ANGLE also has its own set of bugs, which, yes, I also have a workaround for.

The hope here is for modern graphics APIs to make their way to the web. Thankfully, this is happening with WebGPU. The goal is to cut down on WebGL overhead and provide something closer to the modern APIs like Vulkan and Metal. But this requires building your application in a way that’s more suitable to modern graphics APIs: thinking about things in terms of buffers, pipelines and draw calls. The one large refactor I allowed myself in the project was porting the whole platform to use my WebGPU-esque API, which I currently run on top of WebGL 2. As an example of how exhausting that refactor was: even though I designed it so that I could work on all the games at separate times, with both the legacy and modern codepaths living on the site for months, it still took me over 5 months to port all the different games, and zelview.js had to be removed as a casualty. These things are exhausting, and I didn’t have the energy to push that last piece through. I eventually was able to build on my N64 experience when it came time to build Banjo-Kazooie, and made something much better the second time around.

I ended up with a fairly unique renderer design I’m happy about — it’s similar to recent refactors like the one that showed up in Unreal 4.22. Perhaps in the future, I can make a post about modern game engine renderers, the different passes in play, and how they relate to the underlying platform layer, for those interested in writing your own.

The future

My main side project a few years ago was Xplain, an interactive series about the X Window System and its role in graphics history. I’m no longer working on Xplain. Some say I have a talent for clear, explanatory writing, but it takes me a long time to craft a story, lay out all the pieces, and structure it in a way that builds intuition by layering pieces. I tried to reboot it a few years ago by reframing it as no longer being about X11, hoping that would get me in the mood to write again, but no. It’s attached to a legacy I’m not particularly interested in exploring or documenting any further, and the site’s style and framework are not useful. I’m still interested in explanatory writing, but it will happen here, like my post on Super Mario Sunshine’s water.

noclip is not my day job, it is my side project. It is a labor of love for me. It’s a website I have enjoyed working on for the past 6 years of my life. I don’t know how much longer it’ll continue beyond that — I’m feeling a bit burned out because of the large amount of tech support, bills to pay, and general… expectations? From people? They want all the games. I get it, I really do. I am ecstatic to hear that people like my work and my tool enough to know that they want their game in there too. But at some point, I’m going to need a bit of a break. To help continue and foster research, I decided to fund it myself for a game I’ve long wanted on the site while I take it a bit easier. If this is successful, I plan to do more of it. I hope to build a community of contributors to the project. Having a UI designer or frontend developer would be nice. If you want to contribute, please join the Discord! Or, at least show your appreciation in the comments. For more interesting video game snippets, I post pretty frequently on my Twitter now. If you made it all the way to the bottom here, thank you. I hope it was at least interesting.

Deconstructing the water effect in Super Mario Sunshine

Note: The demos below require WebGL2 support. If you are running a browser without WebGL 2 support, user “petercooper” on Hacker News has helpfully recorded a video and GIFs for me.

One of my hobbies is writing model viewers and graphics toys for games. It’s a good mix of my interests in graphics and rendering, in reverse engineering complex engines, and nostalgia for old video games.

I recently extended my WebGL-based game model viewer to add support for some of Nintendo’s GameCube games, including The Legend of Zelda: The Wind Waker and Super Mario Sunshine. The GameCube, for those unaware, had a novel, almost-programmable, but fixed-function GPU. Instead of developers writing shaders, they programmed in a set of texture combiners similar to the methods used in glTexEnv pipelines, but taken to the extreme. For those used to modern programmable GPUs, it can be quite the mindbending experience to think that complex effects can be done with this thing. And yet, 2002 saw the release of Super Mario Sunshine with some really good looking water for its time. Replicated in WebGL below:

This water effect is loaded into your browser directly from the original game files for Delfino Plaza and placed onto a plane. Let’s take a deeper dive into how this was done, shall we?

Texturing the plane

Believe it or not, the effect actually starts out like this:

The effect can be seen as a fairly complex variant on something super old: “texture scrolling”. It’s a bit more complicated than what’s displayed here, but the fundamentals remain the same. Our plane starts life as this scrolling wave-y texture, which provides us some interesting noise to work with. This is then combined with a second layer of the same texture, but this time only scrolling in one dimension.
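To make the idea concrete, here’s a hedged sketch of what the per-frame scrolling amounts to (the speeds and offsets are made up; the real values come from the game’s material data):

// Hypothetical per-frame update for the two wave layers.
// Layer 0 scrolls diagonally; layer 1 scrolls along one axis only.
const layer0 = { u: 0, v: 0, speedU: 0.02, speedV: 0.01 };
const layer1 = { u: 0, v: 0, speedU: 0.015, speedV: 0 };

function updateWaveLayers(deltaTimeSeconds) {
    layer0.u = (layer0.u + layer0.speedU * deltaTimeSeconds) % 1;
    layer0.v = (layer0.v + layer0.speedV * deltaTimeSeconds) % 1;
    layer1.u = (layer1.u + layer1.speedU * deltaTimeSeconds) % 1;
    // The shader (or TEV stage) then samples the same 64x64 texture twice,
    // once with each offset, and combines the two samples.
}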

This gives us an interesting moire pattern which is the basis for how the water appears to bubble and shift around so naturally. You might even notice some “ghostly”-looking alignment when the two textures meet up. This artifact is visible in the original material, but it appears more like an intentional sunbeam or a ray of light scanning across the water. Hiding artifacts like this with clever material design is a large part of game graphics techniques.

Obviously, the texture isn’t black. Instead of the colors being black and white, they’re blended in with the background, giving us something that looks more transparent.

Now we’re getting somewhere. At this point, the second texture layer is also added in twice as much as the first, which makes it look especially bright, almost “blooming”. This feature will come in handy later to define highlights in our water.

Going back to the original material, it’s a lot more “dynamic”. That is, as we move the camera around, zoom in and out, the texture seems to morph with it. It’s clear when it’s near us, and also fades out in the distance. Now, in a traditional fixed-function pipeline, this sort of effect is impossible. There’s no possible way this material can know the distance from the camera! However, Nintendo uses a clever abuse of a more traditional feature to implement this sort of thing. Let’s talk about what I like to call the “mip trick”.

Building a square mip out of a round hole

Mip-mapping is a traditional graphics optimization. You see, when GPUs apply textures, they want the resulting image to be as smooth as possible, and they want to be as fast as possible. The texture we’re sampling from here is actually only 64×64 pixels in size (yes, it’s true!), and our browser windows tend to be a lot bigger than that. If you zoom in, especially in our last demo, you can “see the pixels”, and also how they blend together and fade in and out, but keep in mind that GPUs have to compute that for every pixel in the resulting image. Looking from above, the texture is magnified, but when looking at it at an angle, the plane becomes more squashed by perspective in the distance, and the texture on the screen drops to less than 64×64 in size.

When this happens, the texture is said to be “minified”, and the GPU has to read a lot more pixels in our texture to make the resulting image smooth. This is expensive — the GPU wants to read as few pixels as possible. For this reason, we invented “mip-maps”, which are precomputed smaller versions of each image. The GPU can use these images instead when the texture is minified. So, we have 32×32 versions of our texture, and 16×16 versions of our texture, and the GPU can select which one it wants, and even blend across two versions to get the best image quality. Mipmaps are an excellent example of a time/space tradeoff, and an example of build-time content optimizations.
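As a quick back-of-the-envelope check on that space tradeoff, here’s the full chain for a 64×64 texture:

// Sum the texel counts of every mip level of a 64x64 texture.
let size = 64;
let texels = 0;
while (size >= 1) {
    texels += size * size;
    size = Math.floor(size / 2);
}
console.log(texels); // 5461 texels: only about a third more than the 4096-texel base level.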

However, you might have noticed, “as the texture becomes minified”. That happens when it becomes smaller on the screen, which… tends to happen when the texture is… farther away. Are you picking up on the hint here? It’s a way to pick out distance from the camera!

What if, instead of using smaller versions of the same texture, we instead use different textures? Nintendo had the same idea. This is what I call the “mip trick”. The wave texture I showed you above isn’t the full story. In practice, here’s the full wave texture, with all of its mipmap levels shown.

In the largest mipmap level (where the texture is closest to the camera), we don’t have any pixels. This basically removes the water effect in a small radius around the camera — letting the water be clear. This both prevents the water material from getting too repetitive, and also helps during gameplay by showing the player the stuff underwater that is closest to them. Clever! The second mipmap level is actually the texture I’ve been using in the demo up until now, and is “medium-strength”.

The third mipmap level is the brightest, which corresponds to that “band” of bright shininess in the middle. This band, I believe, is a clever way of faking environment reflections. At that camera distance, looking into the water at a 20 degree angle, you can imagine we’d mostly be seeing the reflection of our skybox, like our clouds. In Sirena Beach, this band is tinted yellow to give the level a beautiful yellow glow that matches the evening sunset.

Let’s try uploading all of these mipmaps now into our demo.

That’s getting us a lot closer! We’re almost there.

As a quick aside, since the algorithm for choosing which mipmap level to use is hardcoded into the GPU, the trick isn’t necessarily portable. The GameCube renders at a resolution of 640×548, and the mipmaps here are designed for that size. The Dolphin developers noticed this as well — since Dolphin can render in higher resolutions than what the GameCube can handle, this trick can break unless you are careful about it. Thankfully, modern graphics APIs have ways of applying a bias to the mipmap selection. Using your screen resolution and the knowledge of the original 640×548 design of the GameCube, we can calculate this bias and use it while sampling.
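Something along these lines, if I have the direction of the bias right (treat this as a sketch of the idea, not the exact math Dolphin or my viewer uses):

// If we render taller than the GameCube's 548-pixel-high framebuffer, the
// GPU naturally picks sharper mip levels, so nudge it back toward the
// original selection with a positive bias.
const gcHeight = 548;
const mipBias = Math.log2(gl.drawingBufferHeight / gcHeight);
// In a GLSL ES 3.00 shader, this value can be passed as the optional bias
// argument to texture(): texture(u_waveTex, v_uv, u_mipBias);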

With that out of the way, it’s time for the final touch. Again, believe it or not, there’s only one thing left to turn our last demo into the final product. A simple function (known as the alpha test) tests “how bright” the resulting pixel is, and if it falls between two thresholds, the pixel is kicked out entirely. In our case, any pixels between 0.13 and 0.92 are simply dropped on the floor.

This gives us the unique “saran wrap” look for the outer bands of the effect, and in the middle, the water is mostly composed of these brighter pixels, and so the higher threshold lets only the really bright pixels shine through, giving us that empty band and those wonderful highlights!
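Sketched out in JavaScript (the real test runs per-pixel on the GPU, but the logic is just this):

// Keep only the very dark and very bright pixels; drop the middle band.
function alphaTestPasses(brightness) {
    return brightness <= 0.13 || brightness >= 0.92;
}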

Forgotten Lore

In today’s days of programmable shaders, PBR pipelines, and increased art budgets, these tricks are becoming more and more like forgotten knowledge. Nintendo’s GameCube-era games have, in my admittedly-biased mind, some of the best artwork done in this era. Even though I mentioned “GameCube”, the Wii was effectively the same hardware, and so these same tricks can be found in the Mario Galaxy games, Super Smash Bros. Brawl, and even The Legend of Zelda: Skyward Sword. It’s impressive that GPU technology from 2001 carried Nintendo all the way through 2012, when the Wii U was released.

Good art direction, a liberal amount of creative design, and intricate knowledge of the hardware can make for some fantastic effects under such constraints. For more fun, try figuring out the glass pane effects in Delfino Hotel or the improvements upon the technique used in Super Mario Galaxy.

The code used for every one of these demos is open-source and available on GitHub. Obviously, all credits for the original artwork go to the incredibly talented artists at Nintendo. Huge special thanks to all of my friends working on the Dolphin team.

Web DRM

This post is different from my usual material. Despite the name, I’m not going to talk about actual coding all that much. This post might be classified under “lament”, or maybe “rant”. I talk about problems, reflect on them, and ultimately offer no solutions. As always, opinions are entirely my own, but are definitely influenced by my employer, my friends, my social status, and whatever ad campaign I saw last week, because that’s how opinions work. Please enjoy.

In May of 2016, a small section of the internet was chasing after a mystery. Someone noticed a mysterious symbol had appeared in two different games: an eye inside a hand. Both had been there, lying in plain sight, for half a year. This is what’s known as an “Alternate Reality Game”, or “ARG”: a sort of invented Da Vinci Code, where mysteries and puzzles unlock clues, blurring the lines between fiction and reality. The game usually ends in a marketing message for something else, nothing more than a “be sure to drink your Ovaltine”. The allure of a random symbol being placed in a bunch of games, in secret, and going unnoticed for so long is really cool. So, once it was discovered, off the “game detectives” went, cracking the codes and solving the puzzles that lay before them.

Most games were beaten quickly, by simply cracking open the .exe files and the game’s data, often long before the “proper” method of solving it was done. With the exception of one game. It was the earliest of these symbols placed, in fact. The online game Kingdom of Loathing added the symbol in late 2014. It was the last puzzle in the ARG to be solved. Nobody could crack the code, through datamining or otherwise. The correct answer involved noticing certain items in the game could spell out a secret code: “nlry9htdotgif”. It referred to a file on their servers.

Before the community managed to figure it out, the developers hinted at the solution through their podcast. Their choice of words was, to my ears at the time, interesting.

The other games that are involved in this ARG, almost all of them, the thing that people were looking for, just got datamined out of them because they were just Steam games, and we had the advantage of like, well, “this is a web game, so we have always online DRM!” that makes it so you actually have to solve the puzzle.


I don’t use Spotify. I download MP3 files and buy the albums. I wasn’t always like this. I was super excited when Spotify first came to America. I signed up, explored a lot of music, and found an artist I really enjoyed. The next week, the artist was wiped from the service. I canceled my subscription. Paradoxically, as Netflix and Spotify and Steam grow in popularity, there’s less and less content on them. Artist rates are declining, and everybody wants the 30% cut that the platform owners take. Everybody’s launching their own streaming service, and so, this month, It’s Always Sunny in Philadelphia is leaving Netflix. FOX doesn’t need Netflix anymore, since they have Hulu, and they want your money through Hulu Plus. Want to watch Game of Thrones? HBO NOW will cost you $14.99. Crunchyroll, $11.95. YouTube Red, $9.99. Twitch Prime, $10.99. The dangers of a la carte cable TV seem very real.

Several of my more tech-savvy friends are with me. The guys that waited in line for the first iPhone, and were using Netflix back when it sent you DVDs through the mail. There’s a gap in their Blu-Ray collections, starting around 2008. But as of last year, they’ve started buying things again. It’s nice to actually own media that won’t expire. Yes, it has DRM — the shitty, encrypted kind. But it doesn’t have web DRM. The disc won’t physically expire because the servers don’t want to send you the file anymore. Programmers can always crack the encryption keys with enough exerted effort. While everybody was afraid of Encrypted Media Extensions in the web browser, Netflix and Spotify were off building something far more ridiculous. Cracking an RSA key feels a lot less intimidating to me now.

Netflix is choosing to continue House of Cards without Kevin Spacey. However, it feels entirely plausible that after the massive wave of recent sexual assault scandals in Hollywood, Netflix might reverse its course and delete the show from their servers forever. It’s now forever “out of print.” After all, the Cosby 77 special was never released. This isn’t a new problem: a lot of TV shows have never seen the light of day after their original broadcast date, except maybe on giant tape reels in old storage rooms somewhere. Every old TV show famously has “the lost episode”. Plenty of old movies are missing forever. But those feel to me like matters of negligent archiving. Netflix scorching an entire show, perhaps even because of public pressure from us, the people, feels a lot more deliberate. And maybe you’re OK with that. Separation of the artist and the work is something that’s becoming more and more difficult to grapple with in today’s society, and perhaps we should just light everything by Bill Cosby and Kevin Spacey up in flames. But the only place left to find anything lost through that will be on the hard drives of people that torrented it.

It feels entirely plausible that after sexual assault allegations about Kevin Spacey, House of Cards might just disappear from the world entirely. Netflix pulls the video files from their app, and that’s that.

And of course I can’t write about this without mentioning subscription software. As we transition from desktop software to web services, it’s very rare to find a “pay-once” kind of deal like you used to. Adobe’s Creative Cloud started that trend by pushing their entire suite of apps, including Photoshop, to a monthly subscription, and it was quickly followed up by Autodesk and QuickBooks. If you cancel your subscription, you lose the ability to use the apps entirely. Web DRM was so successful that we’re now using it for standard industry tools.

Gadgets are having the same issues. Companies releasing internet-enabled devices rarely think about the longevity of any of it. Logitech had no empathy for bricking customers’ devices until they were called out. And Sony TVs from five years ago can’t run the YouTube app; Google broke their devices. YouTube doesn’t need Sony. It’s more effective for them to move fast and break things, leaving a pile of consumer angst in their wake.

There’s a common saying: “nothing ever gets lost on the internet”. Digital culture is supposed to be the prime time for extremely nitpicky nerds. Everything is recorded, analyzed, copied. As storage, hosting, bandwidth costs go down, more and more things are supposed to be preserved. But this couldn’t be further from the truth. The fundamental idea of the web is that anything can link to anything — people can explore and share and copy with nothing but a URL. But the average “half-life” of a link is two years. This post has 49 links. If you’re reading this in 2019, it’s likely only around 24 of them will actually point where I wanted them to point.

“How much knowledge has been lost because it only exists in a now-reaped imageshack upload embedded in a forum post?”. By 2019, I expect this user’s Twitter profile to have gone private, or deleted entirely, or Twitter changing their URL structure and breaking links everywhere.


Publishing a movie on YouTube is no longer as expensive as publishing a DVD in your local FYE. Costs have gone down. This has enabled an explosive level of amazing creativity, and made possible so many projects and endeavors that weren’t before. Being a musician doesn’t require signing to a label. Upload anything to SoundCloud, YouTube, and Bandcamp and you’re now a musician. Web 2.0, as corny as the term is, is primarily about so-called “user-generated content”.

As a creator, this can be a blessing and a curse. I probably wouldn’t have had a voice 30 years ago, since I barely have anything interesting or original to say. Today, I have a voice, but so do 20,000 other people. Some say we’re in an attention economy: that there’s so much being created, that people are overwhelmed. Yes, there’s now 20,000 more musicians, but the number of people listening stays the same. Your struggle isn’t necessarily to be heard, it’s to be heard for more than five seconds. Google Analytics tells me that the average time reading any one of my posts, the so-called “time on page”, is 37 seconds. 90% of my readers have clicked Back in their browser long before reading this sentence.


I don’t believe in Idiocracy. The population isn’t getting dumber. The population’s IQ (whatever you think about it as a metric for measuring intelligence), has been going up. Plenty of people are still reading and learning — Wikipedia is the fifth most popular site in the world, after all.

What I believe is happening is that our reading is getting less expensive. All of the links I’ve posted here are to free sources, except for one. Do you have a Wall Street Journal account? I don’t. I used one weird trick to bypass it. It’s horrible, and I don’t like that I did it. As a society, we’re not paying for the things we used to. Stuff we totally should be paying for. Prices for entertainment, for news, for media, have nosedived in the past 20 years. Why pay for the Wall Street Journal when someone from Bloomberg or the Huffington Post will summarize the article for me, for free?

Some people are disappointed by the fact BuzzFeed now has a seat at the White House. But perhaps BuzzFeed’s more attention-grabby parts are simply the price we pay to fund its Pulitzer-prize winning journalism.

30 years ago, this article might have been published as an article in a newspaper, its grammar and style thoroughly edited by someone whose job it was to do nothing but that, and we’d both get paid for it. Today, this blog costs me money to host and I don’t make any money from it. Music albums that used to cost $20 now cost $5.99. But in terms of large-scale productions, they cost more than ever. TV shows take millions more than they once did to make: as expectations and fidelity go up, so do production costs. Sets, props, and visual effects need to be crafted more carefully than ever to appeal to high definition TV screens. Gamers seeking thrills demand higher frame rates, bigger polygons, and more pixels. YouTube beats this by offering lower-budget productions. iOS beats this by offering cheaper, “indie” titles.

I now work for a company that makes mobile applications. The price of a mobile application is $0.99. And you can still expect 90% of Android users to pirate it. This is, to say the least, unsustainable. Mobile games need to make money not from app sales, but from in-app purchases fueled by psychology.


Nintendo, the top dog of “triple-A” video game studios, was recently skewered by investors for daring to release a mobile game featuring Mario… for $10. It did not meet their sales predictions. Their newest mobile game, which is free-to-play and features in-app purchases, seems to be faring a bit better.

On closer inspection though, there’s something funky about those numbers.

Atul Goyal, a senior analyst at Jefferies, told CNBC’s “Squawkbox” that he expected 500 million downloads of the Super Mario Run app on the Apple app store by March 2017.

But according to analyst Tom Long of BMO Capital Markets, there are 715 million iPhones in use. That leaves two possibilities: either Tom Long is wrong, or Atul Goyal is. Two out of every three iPhone users downloading the same Nintendo game is an unreasonable target.

A total of 1 billion downloads of the app are expected across operating systems, he added.

I don’t claim to be a senior analyst. But I also don’t claim that 13% of the world’s population will have downloaded a Mario game. This feels to me like an unrealistic growth target. As people pay less and less individually for games, you need to make things up in volume.


The low cost of production, the low cost of consumption, the attention economy, web DRM: none of these are new ideas or new problems. We’re going to need to find a way out of this. Cracked.com published a fairly influential article (warning: might be unsuitable for work) on this subject back in 2010. David Wong’s term is “Forced ARTificial Scarcity” (“FARTS” for short. Har har. The article did come out in 2010, after all). His main argument is that we’ve switched mediums: things that were previously paid for by the cost of shipping a physical disc or pieces of paper are now effectively free. Business models built on ratios of supply and demand failed to take into account what would happen when supply became effectively infinite.

But there’s a crucial mistake hiding in there.

Remember the debut of Sony’s futuristic Matrix-style virtual world, PlayStation Home? There was a striking moment when the guys at Penny Arcade logged in and found themselves in a virtual bowling alley… standing in line. Waiting for a lane to open up. In a virtual world where the bowling alley didn’t actually exist. It’s all just ones and zeros on a server–the bowling lanes should be effectively infinite, but where there should have been thousands of lanes for anybody who wanted one, there was only FARTS.

Servers aren’t free, David. They’re physical things, hooked into a physical wire. They only have so much power and so much capacity. They go down, they overheat, they break, just like any other machine. There’s electricity to pay for. This scarcity might look forced, but it probably isn’t. Left to their own devices, people will hack and cheat. A badly programmed server might let you bowl on someone else’s lane. The same ingenuity that cracks open DRM also shatters fair play. Fixing bugs and applying security updates all take programmers, and money.

The servers go down when the money coming in doesn’t match the money going out.

People tend to think the internet is free and fair, but it’s anything but. I’m not talking simply about net neutrality rules, which do worry me, but about peering and transit. In 2014, this culminated in a public explosion between Netflix, Cogent, and Verizon, and the details are a lot more interesting and subtle than originally meet the eye. Bandwidth is expensive and there are unwritten, long-standing de facto rules about who pays for it. Fiber optic cable is expensive and fragile, costing upwards of $80,000 per mile. The hacker community can dream of a free internet, but unless someone eats that cost it’s not happening.


The Right to Read feels more and more realistic every day. It’s troubling. But I think the reason it feels realistic is because of everything I just described. When free digital copying upends 200 years of economic ideas and stability, the first impulse would be to stop it, or delay it until we can figure out what all of this means. DRM, to me, is an evil, but it’s a necessary and hopefully temporary one. It feels like there’s a growing deluge of water held back by a rickety dam. The people with the money go and rebuild it every 5 years, but it’s not going to hold that much longer. The pressure keeps building until the DRM can’t sustain the raw torrent of mayhem that will break it open. You’re now flooded and half the world’s underwater. Better hope you have a boat.

No, I don’t know what the boat is in this metaphor either.


People look to crowdfunding as a way to solve these problems, but I think people massively underestimate how much money at a raw level it takes to build an actual production. Kickstarter’s own list of the most funded projects includes three campaigns for the Pebble watch (a company that got bought out by Fitbit this year after running out of money), the COOLEST COOLER (which appears to have gone south), and the OUYA (a games console which is probably best described by a link to the Crappy Games Wiki). OUYA, Inc. was later bought out by RAZER after, well, running out of money. Even the $8 million raised through Kickstarter had to be followed up with $25 million more in private investor money.

$8 million might seem like a lot of money, but it quickly dries up when running an actual production. Next time you see a movie, or play a game, stare closely at the credits. Think about each one of those people there, their salary, and how much they worked on the final product. And then think about the countless uncredited cast and crew, and subcontractors of subcontractors who barely get so much as a Special Thanks.


Upload anything to SoundCloud, YouTube, and Bandcamp and you’re now a musician.

Funny story, that. SoundCloud takes servers and electricity, too. SoundCloud almost went out of business this year, but it was kept alive by investors trying to save the company. In two years, SoundCloud will likely die, because it couldn’t make money to keep the servers running. Or maybe it will get bought by Google as part of an “acqui-hire”. Your prize: your songs, your followers, and your playlists all go away, replaced with an email thanking you for taking part in their incredible journey.

Apple’s iTunes Music Store, according to rumors, likely won’t be a music store in the near future. Even Spotify… let me repeat that, Spotify, everyone’s darling music service, can’t figure out how to make money. Hell, YouTube still isn’t profitable, but Google runs it at a loss anyway. The hope is eventually it will pay off.

Bandcamp, which offers premium album downloads and DRM-free content, is profitable.

Perhaps Web DRM isn’t as lucrative as we thought.

URG

If you asked software engineers to name some of their “least hated” things, you’d likely hear both UTF-8 and TCP. TCP, despite being 35 years old, is rock-solid, stable infrastructure that we take for granted today; it’s sometimes hard to realize that TCP was man-made, given how well it’s served us. But within every single TCP packet lies a widely misunderstood, esoteric secret.

Look at any diagram or breakdown of the TCP segment header and you’ll notice a 16-bit field called the “Urgent Pointer”. These 16 bits exist in every TCP packet ever sent, but as far as I’m aware, no piece of software understands them correctly.

This widely misunderstood field has caused security issues in multiple products. As far as I’m aware, there is no fully correct documentation on what this field is actually supposed to do. The original RFC 793 actually contradicts itself on the field’s exact value. RFC 1011 and RFC 1122 try to correct the record, but from my reading of the specifications, they seem to also describe the field incorrectly.

What is, exactly, the TCP URG flag? First, let’s try to refer to what RFC 793, the document describing TCP, actually says.

… TCP also provides a means to communicate to the receiver of data that at some point further along in the data stream than the receiver is currently reading there is urgent data. TCP does not attempt to define what the user specifically does upon being notified of pending urgent data, but the general notion is that the receiving process will take action to process the urgent data quickly.

The objective of the TCP urgent mechanism is to allow the sending user to stimulate the receiving user to accept some urgent data and to permit the receiving TCP to indicate to the receiving user when all the currently known urgent data has been received by the user.

From this description, it seems like the idea behind the urgent flag is to send some message, some set of bytes as “urgent data”, and allow the application to know “hey, someone has sent you urgent data”. Perhaps, you might even imagine, it makes sense for the application to read this “urgent data packet” first, as an out-of-band message.

But! TCP is designed to give you two continuous streams of bytes between computers. TCP, at the application layer, has no concept of datagrams or packetized messages in that stream. If there’s no “end of message”, it doesn’t make sense to define the URG packet to be different. This is what the 16-bit Urgent Pointer is used for. The 16-bit Urgent Pointer specifies a future location in the stream where the urgent data ends:

This mechanism permits a point in the data stream to be designated as the end of urgent information.

Wait. Where the urgent data ends? Then where does it begin? Most early operating systems assumed this implied that there was one byte of urgent data located at the Urgent Pointer, and allowed clients to read it independently of the actual stream of data. This is the history and rationale behind the flag MSG_OOB, part of the Berkeley Sockets API. When sending data through a TCP socket, the MSG_OOB flag sets the URG flag and points the Urgent Pointer at the last byte in the buffer. When a packet is received with the URG flag, the kernel buffers and stores the byte at that location. It also signals the receiving process that there is urgent data available with SIGURG. When receiving data with recv(), you can pass MSG_OOB to receive this single byte of otherwise inaccessible out-of-band data. During a normal recv(), this byte is effectively removed from the stream.

This interpretation, despite being used by glibc and even Wikipedia, is wrong based on my reading of the TCP spec. When taking into account the “neverending streams” nature of TCP, a more careful, subtle, and intentional meaning behind these paragraphs is revealed. One made clearer by the next sentence:

Whenever this point is in advance of the receive sequence number (RCV.NXT) at the receiving TCP, that TCP must tell the user to go into “urgent mode”; when the receive sequence number catches up to the urgent pointer, the TCP must tell user to go into “normal mode”…

Confusing vocabulary choices such as “urgent data” implies that there is actual data explicitly tagged as urgent, but this isn’t the case. When a TCP packet is received with an URG flag, all data currently in the socket is now “urgent data”, up until the end pointer. The urgent data waiting for you up ahead isn’t marked explicitly and available out-of-band, it’s just somewhere up ahead and if you parse all the data in the stream super quickly you’ll eventually find it. If you want an explicit marker for what the urgent data actually is, you have to put it in the stream yourself — the notification is just telling you there’s something waiting up ahead.

Put another way, urgency is an attribute of the TCP socket itself, not of a piece of data within that stream.
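If it helps, here’s a toy model of that interpretation (purely illustrative JavaScript; none of these names correspond to a real socket API, and it ignores segment reordering and the off-by-one dispute between the RFCs):

// Urgency is tracked per connection, not per byte.
const conn = { rcvNext: 0, urgentUntil: -1 };

function onSegmentReceived(segment) {
    if (segment.urgFlag) {
        // The Urgent Pointer names a point further along in the stream.
        const urgentEnd = segment.seq + segment.urgentPointer;
        conn.urgentUntil = Math.max(conn.urgentUntil, urgentEnd);
    }
    conn.rcvNext = segment.seq + segment.data.length;
    // The connection stays in "urgent mode" until the application has read
    // up to that point; no individual byte is ever tagged as urgent.
    return conn.rcvNext < conn.urgentUntil;
}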

Unfortunately, several foundational internet protocols, like Telnet, are fooled by this misunderstanding. In Telnet, the idea is that if you have a large amount of data waiting in the buffer for a “runaway process”, it’s hard for your commands to make it through. From the Telnet specification:

To counter this problem, the TELNET “Synch” mechanism is introduced. A Synch signal consists of a TCP Urgent notification, coupled with the TELNET command DATA MARK. The Urgent notification, which is not subject to the flow control pertaining to the TELNET connection, is used to invoke special handling of the data stream by the process which receives it…

… The Synch is sent via the TCP send operation with the Urgent flag set and the [Data Mark] as the last (or only) data octet.

In a TCP world, this idea of course makes no sense. There’s no “last data octet” in a TCP stream, because the stream is continuous and goes on forever.

How did everyone get confused and start misunderstanding the TCP urgent mechanism? My best guess is that the broken behavior is actually more useful than the one suggested by TCP. Even a single octet of out-of-band data can signal quite a lot, and it can be more helpful than some “turbo mode” suggestion. Additionally, despite the availability of POSIX functionality like SO_OOBINLINE and sockatmark, there remains no way to reliably test whether a TCP socket is in “urgent mode”, as far as I’m aware. The Berkeley sockets API started this misunderstanding and provides no easy way to get the correct behavior.

It’s incredible to think that 35 years of rock-solid protocol has had such an amazing mistake baked into it. You can probably count the total number of TCP packets ever sent in the trillions, if not more, yet every one of them dedicates 16 bits to a field that barely a handful of programs has ever meaningfully set.

I don’t know who the Web Audio API is designed for

WebGL is, all things considered, a pretty decent API. It’s not a great API, but that’s just because OpenGL is also not a great API. It gives you raw access to the GPU and is pretty low-level. For those intimidated by something so low-level, there are quite a few higher-level engines like three.js and Unity which are easier to work with. It’s a good API with a tremendous amount of power, and it’s the best portable abstraction we have for a good way to work with the GPU on the web.

HTML5 Canvas is, all things considered, a pretty decent API. It has plenty of warts: lack of colorspace support, you can’t directly draw DOM elements to a canvas without awkwardly round-tripping them through an SVG, blurs are strangely hidden away inside a “shadows” API, and a few other things. But it’s honestly a good abstraction for drawing 2D shapes.

Web Audio, conversely, is an API I do not understand. The scope of Web Audio is hopelessly huge, with features I can’t imagine anybody using, core abstractions that are hopelessly expensive, and basic functionality basically missing. To quote the specification itself: “It is a goal of this specification to include the capabilities found in modern game audio engines as well as some of the mixing, processing, and filtering tasks that are found in modern desktop audio production applications.”

I can’t imagine any game engine or music production app that would want to use any of the advanced features of Web Audio. Something like the DynamicsCompressorNode is practically a joke: essential features from a real compressor are simply missing, and the behavior that is there is underspecified enough that I can’t even trust it to sound the same between browsers. More than likely, such filters would be written in asm.js or WebAssembly, or run in Web Workers, given the rather stateless, input/output nature of DSPs. Math and tight loops like this aren’t hard, and they aren’t rocket science. It’s the only way to ensure correct behavior.

As for the people who do want to do such things, namely computing their own audio samples and playing them back, the APIs make it nearly impossible to do in any performant way.

For those new to audio programming: with a traditional sound API, you have a buffer full of samples. The hardware speaker runs through these samples. When the API thinks it is about to run out, it goes to the program and asks for more. This is normally done through a data structure called a “ring buffer”, where the speakers “chase” the samples the app is writing into the buffer. The gap between the “read pointer” and the “write pointer” is important: too small and the speakers will run out if the system is overloaded, causing crackles and other artifacts; too large and there’s a noticeable lag in the audio.
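A minimal sketch of that structure, assuming 16-bit samples (this is illustrative; it isn’t any real browser API):

class RingBuffer {
    constructor(capacity) {
        this.samples = new Int16Array(capacity);
        this.readPos = 0;  // advanced by the audio hardware
        this.writePos = 0; // advanced by the application
    }
    // How many queued samples the speakers haven't consumed yet.
    queued() {
        return (this.writePos - this.readPos + this.samples.length) % this.samples.length;
    }
    // The application pushes samples in...
    write(chunk) {
        for (let i = 0; i < chunk.length; i++) {
            this.samples[this.writePos] = chunk[i];
            this.writePos = (this.writePos + 1) % this.samples.length;
        }
    }
    // ...and the speakers chase after them, one sample at a time.
    readOne() {
        const sample = this.samples[this.readPos];
        this.readPos = (this.readPos + 1) % this.samples.length;
        return sample;
    }
}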

There are also some details like how many of these samples we have per second, known as the “sample rate”. These days, there are two commonly used sample rates: 48000Hz, in use by most systems these days, and 44100Hz, which, while a bit of a strange number, rose in popularity due to its use in CD Audio (why 44100Hz for CDDA? Because Sony, one of the organizations involved with the CD, cribbed CDDA from an earlier digital audio project it had lying around, the U-matic tape). It’s common to see the operating system have to convert to a different sample rate, or “resample” audio, at runtime.
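Resampling itself isn’t magic, either. Here’s a naive linear-interpolation version as a sketch (real resamplers filter properly to avoid aliasing artifacts):

// Convert a buffer of samples from one sample rate to another by
// interpolating between neighboring input samples.
function resample(input, inputRate, outputRate) {
    const outputLength = Math.floor(input.length * outputRate / inputRate);
    const output = new Float32Array(outputLength);
    for (let i = 0; i < outputLength; i++) {
        const srcPos = i * inputRate / outputRate;
        const i0 = Math.floor(srcPos);
        const i1 = Math.min(i0 + 1, input.length - 1);
        const frac = srcPos - i0;
        output[i] = input[i0] * (1 - frac) + input[i1] * frac;
    }
    return output;
}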

Here’s an example of a theoretical, non-Web Audio API, to compute and play a 440Hz sine wave.

const frequency = 440; // 440Hz A note.
// 1 channel (mono), 44100Hz sample rate
const stream = window.audio.newStream(1, 44100);
stream.onfillsamples = function(samples) {
    // The stream needs more samples!
    const startTime = stream.currentTime; // Time in seconds.
    for (var i = 0; i < samples.length; i++) {
        const t = startTime + (i / stream.sampleRate);
        // samples is an Int16Array
        samples[i] = Math.sin(t * frequency * Math.PI * 2) * 0x7FFF;
    }
};
stream.play();

The above, however, is nearly impossible in the Web Audio API. Here is the closest equivalent I can make.

const frequency = 440;
const ctx = new AudioContext();
// Buffer size of 4096, 0 input channels, 1 output channel.
const scriptProcessorNode = ctx.createScriptProcessor(4096, 0, 1);
scriptProcessorNode.onaudioprocess = function(event) {
    const startTime = ctx.currentTime;
    const samples = event.outputBuffer.getChannelData(0);
    for (var i = 0; i < 4096; i++) {
        const t = startTime + (i / ctx.sampleRate);
        // samples is a Float32Array
        samples[i] = Math.sin(t * frequency * Math.PI * 2);
    }
};
// Route it to the main output.
scriptProcessorNode.connect(ctx.destination);

Seems similar enough, but there are some important distinctions. First, well, this is deprecated. Yep. ScriptProcessorNode has been deprecated in favor of Audio Workers since 2014. Audio Workers, by the way, don’t exist. Before they were ever implemented in any browser, they were replaced by the AudioWorklet API, which doesn’t have any implementation in browsers.

Second, the sample rate is global for the entire context. There is no way to get the browser to resample dynamically generated audio. Despite the browser necessarily having fast resampling code in C++ internally, none of it is exposed to the user of ScriptProcessorNode. The sample rate of an AudioContext isn’t defined to be 44100Hz or 48000Hz either, by the way. It’s dependent on not just the browser, but also the operating system and hardware of the device. Connecting Bluetooth headphones can cause the sample rate of an AudioContext to change, without warning.

So ScriptProcessorNode is a no-go. There is, however, an API that lets us provide a differently-sampled buffer and have the Web Audio API play it. This isn’t a “pull” approach where the browser fetches samples every once in a while; it’s instead a “push” approach where we play a new buffer of audio every so often. This is known as BufferSourceNode, and it’s what emscripten’s SDL port uses to play audio. (They used to use ScriptProcessorNode, but removed it because it didn’t work well consistently.)

Let’s try using BufferSourceNode to play our sine wave:

const frequency = 440;
const ctx = new AudioContext();
let playTime = ctx.currentTime;
function pumpAudio() {
    // The rough idea here is that we buffer audio roughly a
    // second ahead of schedule and rely on AudioContext's
    // internal timekeeping to keep it gapless. playTime is
    // the time in seconds that our stream is currently
    // buffered to.

    // Buffer up audio for roughly a second in advance.
    while (playTime - ctx.currentTime < 1) {
        // 1 channel, buffer size of 4096, at
        // a 48KHz sampling rate.
        const buffer = ctx.createBuffer(1, 4096, 48000);
        const samples = buffer.getChannelData(0);
        for (let i = 0; i < 4096; i++) {
            const t = playTime + (i / 48000);
            samples[i] = Math.sin(t * frequency * Math.PI * 2);
        }

        // Play the buffer at some time in the future.
        const bsn = ctx.createBufferSource();
        bsn.buffer = buffer;
        bsn.connect(ctx.destination);
        // When a buffer is done playing, try to queue up
        // some more audio.
        bsn.onended = function() {
            pumpAudio();
        };
        bsn.start(playTime);
        // Advance our expected time.
        // (samples) / (samples per second) = seconds
        playTime += 4096 / 48000;
    }
}
pumpAudio();

There’s a few… unfortunate things here. First, we’re basically relying on floating point timekeeping in seconds to keep our playback times consistent and gapless. There is no way to reset an AudioContext’s currentTime short of constructing a new one, so if someone wanted to build a professional Digital Audio Workstation that was alive for days, precision loss from floating point would become a big issue.

Second, and this was also an issue with ScriptProcessorNode, the samples array is full of floats. This is a minor point, but forcing everybody to work with floats is going to be slow. 16 bits is enough for everybody and for an output format it’s more than enough. Integer Arithmetic Units are very fast workers and there’s no huge reason to shun them out of the equation. You can always have code convert from a float to an int16 for the final output, but once something’s in a float, it’s going to be slow forever.
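For reference, that final conversion is tiny (a sketch; how you clamp out-of-range values is up to you):

// Clamp to [-1, 1], scale to the int16 range, and truncate.
function floatToInt16(sample) {
    const clamped = Math.max(-1, Math.min(1, sample));
    return (clamped * 0x7FFF) | 0;
}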

Third, and most importantly, we’re allocating two new objects per audio buffer! Each buffer is roughly 85 milliseconds long, so every 85 milliseconds we are allocating two new GC’d objects. This could be mitigated if we could use an existing, large ArrayBuffer that we slice, but we can’t provide our own ArrayBuffer: createBuffer creates one for us, for each channel we request. You might imagine you could createBuffer with a very large size and play only small slices of it in the BufferSourceNode, but there’s no way to slice an AudioBuffer object, nor is there any way to specify an offset into the corresponding buffer with an AudioBufferSourceNode.

You might imagine the best solution is to simply keep a pool of BufferSourceNode objects and recycle them after they are finished playing, but BufferSourceNode is designed to be a one-time-use-only, fire-and-forget API. The documentation helpfully states that they are “cheap to create” and they “will automatically be garbage-collected at an appropriate time”.

I know I’m fighting an uphill battle here, but a GC is not what we need during realtime audio playback.

Keeping a pool of AudioBuffers seems to work, though in my own test app I still see slow growth to 12MB over time before a major GC wipes it away, according to the Chrome profiler.
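Roughly, the pooling looks like this (a sketch of the approach, not the exact code from my test app):

const bufferPool = [];

function acquireBuffer(ctx) {
    // Reuse a free AudioBuffer if we have one, otherwise allocate a new one.
    return bufferPool.pop() || ctx.createBuffer(1, 4096, 48000);
}

function playPooledBuffer(ctx, buffer, when) {
    // The AudioBufferSourceNode itself is fire-and-forget and can't be
    // reused, but the AudioBuffer behind it can go back into the pool.
    const bsn = ctx.createBufferSource();
    bsn.buffer = buffer;
    bsn.connect(ctx.destination);
    bsn.onended = function() {
        bufferPool.push(buffer);
    };
    bsn.start(when);
}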

What makes this so much more ironic is that a very similar API was proposed by Mozilla, called the Audio Data API. It’s three functions: setup(), currentSampleOffset(), and writeAudio(). It’s still a push API, not a pull API, but it’s very simple to use, supports resampling at runtime, doesn’t require you to break things up into GC’d buffers, and doesn’t have any of the problems I’ve described above.

Specifications and libraries can’t be created in a vacuum. If we had instead gotten the simplest possible interface out there and let people play with it, and then taken some of the slower bits people were implementing in JavaScript (resampling, FFT) and put them in C++, I’m sure we’d see a lot more growth and usage than we do today. And we’d have actual users for this API, and real-world feedback from people using it in production. But instead, the biggest user of Web Audio right now appears to be emscripten, which obviously won’t care much for any of the graph routing nonsense, and already attempts to work around the horrible API itself.

Can the ridiculous overeagerness of Web Audio be reversed? Can we bring back a simple “play audio” API and bring back the performance gains once we see what happens in the wild? I don’t know, I’m not on these committees, I don’t even work in web development other than fooling around on nights and weekends, and I certainly don’t have the time or patience to follow something like this through.

But I would really, really like to see it happen.