Bytecode

What is the most commonly used bytecode language in the world? Java (JVM Bytecode)? .NET (CLI)? Flash (AVM1/AVM2)? Nope. There’s a few that you use every day, simply by turning on your computer, or tablet, or even phone. You don’t even have to start an application or visit a webpage.

ACPI

The most obvious is the large, gargantuan specification known as “ACPI”. The “Advanced Configuration and Power Interface” specification lives up to its name, with the most recent specification being a mammoth document that weighs in at almost 1000 pages. And yes, operating systems are expected to implement this. The entire thing. The bytecode part is hidden deep, but it’s seen in chapter 20, under “ACPI Machine Language”, describing a semi-register VM with all the usuals: Add, Subtract, Multiply, Divide, standard inequalities and equalities, but then throws in other fun things like ToHexString and Mid (substring). Look even further and you’ll see a full object model, system properties, as well as an asynchronous signal mechanism so that devices are notified about when those system properties change.

Most devices, of course, have a requirement of nothing less than a full implementation of ACPI, so of course all this code is implemented in your kernel, running at early boot. It parallels the complexity of a full JavaScript environment with its type system and system bindings, with the program code supplied directly over the wire from any device you plug in. Because the specification is so complex, an OS-independent reference implementation was created by Intel, and this is the implementation that’s used in the Linux kernel, the BSDs (including Mac OS X), and the fun toy ReactOS, HaikuOS kernels. I don’t know if it’s used by Windows or not. Since the specification’s got Microsoft’s name on it, I assume their implementation was created long before ACPICA.

Fonts

After that, want to have a graphical boot loader? Simply rendering an OpenType font (well, only OpenType fonts with CFF glyphs, but the complexities of the OpenType font format is a subject for another day) requires parsing the Type 2 Glyph Format, which indeed involves a custom bytecode format to establish glyphs. This one’s even more interesting: it’s a real stack-based interpreter, and it even has a “random” opcode to make random glyphs at runtime. I can’t imagine this ever be useful, but it’s there, and it’s implemented by FreeType, so I can only assume it’s used by some fonts from in the real world. This bytecode interpreter also contained at one time a stack overflow vulnerability which was what jailbroke the iPhone in JailbreakMe.com v2.0, with the OTF file being loaded by Apple’s custom PDF viewer.

This glyph language is based on and is a stripped down version of PostScript. Actual PostScript involves a full turing-complete register/stack-based hybrid virtual machine based on Forth. The drawbacks of this system (looping forever, interpreting the entire script to draw a specific page because of complex state) were the major motivations for the PDF format — while based on PostScript, it doesn’t have much shared document state, and doesn’t allow any arbitrary flow control operations. In this model, someone (even an automated program) could easily verify that a graphic was encapsulated, not doing different things depending on input, and that it terminated at some point.

And, of course, since fonts are complicated, and OpenType is complicated, OpenType also includes all of TrueType, which includes a bytecode-based hinting model to ensure that your fonts look correct at all resolutions. I won’t ramble on about it, but here’s the FreeType implementation. I don’t know of anything interesting happening to this. Seems there was a CVE for it at one time.

To get this article to display on screen, it’s very likely that thousands of these tiny little microprograms ran, once for each glyph shape in each font.

Packet filtering

Further on, if you want to capture a network packet with tcpdump or libpcap (or one of its users like Wireshark), it’s being filtered through the Berkeley Packet Filter, a custom register-based bytecode. The performance impact of this at one time was too large for people debugging network issues, so a simple JIT compiler was put into the kernel, under an experimental sysctl flag.

As a piece of historical interest, an earlier version of the BPF code was part of the code claimed to be infringing part of the SCO lawsuits (page 15), but was actually part of BSD4.3 code that was copied to the Linux kernel. The original BSD code was eventually replaced with the current architecture, known as the Linux Socket Filter, in Linux 2.2 (which I can’t easily link to, as there’s no public repository of the Linux kernel code with pre-git history, as far as I know).

What about it?

The popularity of bytecode as a general and flexible solution to problems is alluring, but it’s not without its complexities and faults, with such major security implications (an entire iPhone jailbreak from incorrect stack overflow checking!) and insane implementation requirements (so much that we only have one major implementation of ACPI used across all OSes that we can check).

The four examples also bring out something interesting: the wildly different approaches that can be taken to a bytecode language. In the case of ACPI, it’s an interesting take on what I can only imagine is scope creep on an originally declarative table specification, bringing it to the mess today. The Type 1 Glyph and TrueType Hinting languages are basic stack-based interpreters, showing their PostScript heritage. And BPF is a register-based interpreter, which ends up with a relatively odd register-based language that can really only do simple operations.

Note, though, that all of these implementations above have had security issues in their implementations, with numerous CVEs for each one, because bytecode interpreter implementations are hard to get right. So, to other hackers: do you know of any other low-level, esoteric custom bytecode specifications like these? And to spec writers: did you really need that flexibility?

20 thoughts on “Bytecode

  1. Nice posts, I really like thinkt about where we use bytecode every day :) While many virtual machines uses some sort of bytecode internally, many are not exposed to outside, for example browsers executing JavaScript.

    The UEFI specification defines a own bytecode EBC (“EFI ByteCode”) for having a portable code between different architecture for some piece of hardware (e.g. graphic cards). See chapter 20: http://www.uefi.org/specs/
    Since Apple switched from PowerPC to x86 (or a other perspective, OpenFirmware to EFI; or f-code to EBC ;-)), some vendors are actually shipping devices with EBC code in their ROM. Unfortunately most vendors still sell devices with x86 only code on it, but I think it will change with the ARM wave coming with Windows8 stuff.

    Anyways, about EBC: The source language is usally C, so all those “dirty” feature like direct memory access are mapped to the bytecode, so it doesn’t provide any safety guarantee like Java bytecode does. The cool thing about EBC is, that it supports those memory access on 32-bit and 64-bit architectures, although using the same blob of EBC. For example it is possible to use a graphic card driver for UEFI on a 32-bit ARM device and also on a 64-bit itanium server. In order to support this, they use “natural indexing”, i.e. the shift the computation of pointer offset from the compiler to the virtual machine. The compiler just encodes the “natural index” into the bytecode instruction and the interpreter has to compute it at run-time.

    A open-source reference implementation of the VM is included in TianoCore for x86, x86_64 and Itanium: http://tianocore.git.sourceforge.net/git/gitweb.cgi?p=tianocore/edk2;a=tree;f=MdeModulePkg/Universal/EbcDxe

  2. Great post.

    Fun thing is that bytecode isn’t what it used to be any more. I talked to Sun JVM folks back in the day, and they said it was often easier to analyse/optimise starting from the source, with the semantic information intact.

  3. The Infocom games all used a simple virtual machine called the Z-machine (for Zork). It’s been reverse-engineered and is now available publicly, and a thriving interactive fiction world exists because of it. Also check out TADS (especially TADS 3) for a “competing” VM for IF.

    Another game that uses a virtual machine is Core Wars. No registers, not even an accumulator, and no stack, just one big heap for several programs to coexist in and try to overrun.

    None of these are commonly used, at least not like ACPI, but they’re interesting in their own right.

    • If we’re talking about VMs in games we have to mention SCUMM (Script Creation Utility for Maniac Mansion) and it’s open source implementation ScummVM. Don’t know more about it, though. Not even if it’s register or stack based.

  4. Xorg includes a 16-bit x86 emulator in case it needs to run BIOS code. If you’re running X with the VESA driver (“safe mode”), you’re using it. I think the other common drivers mostly avoid it these days, with kernel modesetting and all.

  5. IIUC the DWARF debug info format which is usually used under Linux uses an own bytecode format, to specify where to find variable contents or saved register contents on stack. I think usually this bytecode is handled by an interpreter built into GDB (but there are probably other implementations out there, hidden in other debugging tools). So any time a backtrace for a compiled executable appears under Linux, most likely this bytecode has been executed.

    Even better, the DWARF bytecode format (or a subset thereof?) is also used by GCC to implement Exception Handling for C++ – see http://gcc.gnu.org/wiki/Dwarf2EHNewbiesHowto . Basically whenever an exception is thrown, the application needs to interpret this bytecode to find locations of certain values on stack. So running a C++ application compiled with GCC can already cause lots of bytecode to be interpreted.

    I think libunwind library provides a common code base to interpret DWARF debug info and GCC Exception Handling tables.

  6. Pingback: Bytecode | Enjoying The Moment

  7. Interesting article. I knew about ACPI (needed to dive into it once to geta netbook working on linux), but had no idea that fonts were a mini-language too.

    As for esoteric: Bitcoin contains an actual mini-bytecode-language for transactions: https://en.bitcoin.it/wiki/Script
    It is a stack-based language. However it is purposefully not turning complete, with no loops. It turns out that most of the flexibility has not been needed (although there are some theoretical uses such as smart contracts), thus most of the opcodes have been disabled at the moment.

Leave a Reply to oliver Cancel reply

Your email address will not be published. Required fields are marked *