What is the most commonly used bytecode language in the world? Java (JVM Bytecode)? .NET (CLI)? Flash (AVM1/AVM2)? Nope. There’s a few that you use every day, simply by turning on your computer, or tablet, or even phone. You don’t even have to start an application or visit a webpage.
ACPI
The most obvious is the large, gargantuan specification known as “ACPI”. The “Advanced Configuration and Power Interface” specification lives up to its name, with the most recent specification being a mammoth document that weighs in at almost 1000 pages. And yes, operating systems are expected to implement this. The entire thing. The bytecode part is hidden deep, but it’s seen in chapter 20, under “ACPI Machine Language”, describing a semi-register VM with all the usuals: Add, Subtract, Multiply, Divide, standard inequalities and equalities, but then throws in other fun things like ToHexString and Mid (substring). Look even further and you’ll see a full object model, system properties, as well as an asynchronous signal mechanism so that devices are notified about when those system properties change.
Most devices, of course, have a requirement of nothing less than a full implementation of ACPI, so of course all this code is implemented in your kernel, running at early boot. It parallels the complexity of a full JavaScript environment with its type system and system bindings, with the program code supplied directly over the wire from any device you plug in. Because the specification is so complex, an OS-independent reference implementation was created by Intel, and this is the implementation that’s used in the Linux kernel, the BSDs (including Mac OS X), and the fun toy ReactOS, HaikuOS kernels. I don’t know if it’s used by Windows or not. Since the specification’s got Microsoft’s name on it, I assume their implementation was created long before ACPICA.
Fonts
After that, want to have a graphical boot loader? Simply rendering an OpenType font (well, only OpenType fonts with CFF glyphs, but the complexities of the OpenType font format is a subject for another day) requires parsing the Type 2 Glyph Format, which indeed involves a custom bytecode format to establish glyphs. This one’s even more interesting: it’s a real stack-based interpreter, and it even has a “random” opcode to make random glyphs at runtime. I can’t imagine this ever be useful, but it’s there, and it’s implemented by FreeType, so I can only assume it’s used by some fonts from in the real world. This bytecode interpreter also contained at one time a stack overflow vulnerability which was what jailbroke the iPhone in JailbreakMe.com v2.0, with the OTF file being loaded by Apple’s custom PDF viewer.
This glyph language is based on and is a stripped down version of PostScript. Actual PostScript involves a full turing-complete register/stack-based hybrid virtual machine based on Forth. The drawbacks of this system (looping forever, interpreting the entire script to draw a specific page because of complex state) were the major motivations for the PDF format — while based on PostScript, it doesn’t have much shared document state, and doesn’t allow any arbitrary flow control operations. In this model, someone (even an automated program) could easily verify that a graphic was encapsulated, not doing different things depending on input, and that it terminated at some point.
And, of course, since fonts are complicated, and OpenType is complicated, OpenType also includes all of TrueType, which includes a bytecode-based hinting model to ensure that your fonts look correct at all resolutions. I won’t ramble on about it, but here’s the FreeType implementation. I don’t know of anything interesting happening to this. Seems there was a CVE for it at one time.
To get this article to display on screen, it’s very likely that thousands of these tiny little microprograms ran, once for each glyph shape in each font.
Packet filtering
Further on, if you want to capture a network packet with tcpdump or libpcap (or one of its users like Wireshark), it’s being filtered through the Berkeley Packet Filter, a custom register-based bytecode. The performance impact of this at one time was too large for people debugging network issues, so a simple JIT compiler was put into the kernel, under an experimental sysctl flag.
As a piece of historical interest, an earlier version of the BPF code was part of the code claimed to be infringing part of the SCO lawsuits (page 15), but was actually part of BSD4.3 code that was copied to the Linux kernel. The original BSD code was eventually replaced with the current architecture, known as the Linux Socket Filter, in Linux 2.2 (which I can’t easily link to, as there’s no public repository of the Linux kernel code with pre-git history, as far as I know).
What about it?
The popularity of bytecode as a general and flexible solution to problems is alluring, but it’s not without its complexities and faults, with such major security implications (an entire iPhone jailbreak from incorrect stack overflow checking!) and insane implementation requirements (so much that we only have one major implementation of ACPI used across all OSes that we can check).
The four examples also bring out something interesting: the wildly different approaches that can be taken to a bytecode language. In the case of ACPI, it’s an interesting take on what I can only imagine is scope creep on an originally declarative table specification, bringing it to the mess today. The Type 1 Glyph and TrueType Hinting languages are basic stack-based interpreters, showing their PostScript heritage. And BPF is a register-based interpreter, which ends up with a relatively odd register-based language that can really only do simple operations.
Note, though, that all of these implementations above have had security issues in their implementations, with numerous CVEs for each one, because bytecode interpreter implementations are hard to get right. So, to other hackers: do you know of any other low-level, esoteric custom bytecode specifications like these? And to spec writers: did you really need that flexibility?