Elfhex

Tuesday 9 June 2020

Although I undertook the project several months ago, I have not yet written about my Elfhex machine code “assembler” on this blog. Elfhex is a small language and “assembler” that accepts source files containing machine code (or really, any sequence of bytes, represented by their hex values) and prepends an ELF header (hence the name) so that the resulting binary is executable on an OS that uses the ELF format (e.g. Linux). To make this a bit less tedious, the language also provides a number of utilities, such as labels, references, and fragments (macros), that make constructing a program easier. Nevertheless, it does not support mnemonics or other typical “assembly language” constructs, and in that sense could be said to be architecture-agnostic as well.

Why did I make Elfhex? Well, I thought it would be interesting to use.

In general, the CPUs in computers execute machine code, which is simply a sequence of bytes that represent the series of operations that the CPU should perform. Typically, this machine code is stored on the computer’s secondary storage in executable binary files, such as ELF files on Linux, and then loaded into memory by the operating system so that the CPU can access it.

Of course, we don’t typically write programs directly in machine code: we have higher-level languages such as C that are more convenient to use, and which have compilers that convert the programs we write into machine code. Even assembly languages usually do not map directly to the underlying machine code, as one mnemonic can often represent multiple opcodes (for example, in x86 assembly, the add mnemonic can be converted to a variety of opcodes depending on what type of addition is desired). In addition, assembly languages usually contain a number of features that abstract away some of the more tedious parts of the machine code, such as encoding the argument bytes, which can be quite complex in x86 due to the ModR/M system. As a result, after a program written in assembly language has been assembled, the resulting binary can still be quite difficult to compare with its source, especially if the assembler adds additional data to the binary, such as libraries, metadata such as the section table, or debug symbols.
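To make the point about opcodes and ModR/M concrete, here is a minimal Python sketch of how two forms of the x86 add mnemonic end up as different bytes. The helper names (modrm, add_reg_reg, add_reg_imm8) are my own for illustration and have nothing to do with Elfhex; the sketch assumes 32-bit operands and register-to-register addressing only (mod = 11).

```python
# Register numbers used in the reg and rm fields of a ModR/M byte (32-bit registers).
REGS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm(mod, reg, rm):
    """Pack the three ModR/M fields: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm


def add_reg_reg(dst, src):
    """add dst, src using opcode 0x01 (ADD r/m32, r32): dst goes in rm, src in reg."""
    return bytes([0x01, modrm(0b11, REGS[src], REGS[dst])])


def add_reg_imm8(dst, imm):
    """add dst, imm8 using opcode 0x83 /0: the reg field holds the opcode extension 0."""
    return bytes([0x83, modrm(0b11, 0, REGS[dst]), imm & 0xFF])


print(add_reg_reg("eax", "ebx").hex(" "))   # add eax, ebx
print(add_reg_imm8("ecx", 5).hex(" "))      # add ecx, 5
```

The same mnemonic yields opcode 01 in one case and 83 in the other, and in both cases the operands are folded into a single packed byte that is not obvious from the source.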

I therefore thought it would be interesting to be able to write in machine code itself, with a minimal transformation process that packages the code into an actual executable binary in which the source would be easily recognizable (when viewed with, for example, a hex editor). Ideally, all that such a system would add is a minimal binary header, leaving the rest of the source intact.

In theory, this could all be done in a standard hex editor. There are some inconveniences in doing so, however: primarily, as bytes are added and removed, everything after them shifts, making it difficult to determine the appropriate values for references (e.g., when writing a jump instruction).

Such a goal could also be achieved using a common assembler, such as NASM for x86, which provides pseudo-instructions such as db that allow arbitrary bytes to be inserted directly into the assembled file. Repeating these pseudo-instructions to form a complete binary, however, would be somewhat tedious, and overall we would be working against the design of an assembler like NASM, which is to assemble programs written in assembly language. It may also insert unnecessary bytes into the binary by default, such as a section table. These can be removed with various flags, or after assembly; however, they detract from our desire to see our source code directly reflected in the output binary.

Therefore, I created Elfhex. As the documentation on GitHub describes, it is a simple “assembly” language in which bytes (represented by pairs of hex digits, e.g. a1 9a ff) are the basic unit. In the most fundamental sense, the “assembler” for the language simply takes these bytes from the source file, prepends a minimal ELF header consisting of as few bytes as possible, and outputs the result. As long as the bytes in the source represent valid machine code, the output can be executed on an ELF-supporting platform such as Linux.
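As a rough illustration of what “prepending an ELF header” involves, the following Python sketch wraps raw machine code in a 32-bit x86 ELF file header plus a single program header. This is not Elfhex's actual implementation or output layout: the function name wrap_in_elf32, the load address 0x08048000, and the choice of a full 84-byte header are all assumptions made for the example.

```python
import struct

BASE = 0x08048000   # assumed load address (a conventional choice on 32-bit x86 Linux)
EHDR_SIZE = 52      # size of an ELF32 file header
PHDR_SIZE = 32      # size of an ELF32 program header


def wrap_in_elf32(code: bytes) -> bytes:
    """Prepend an ELF32 file header and one PT_LOAD program header to raw code."""
    total = EHDR_SIZE + PHDR_SIZE + len(code)
    # Magic, then class = 32-bit, data = little-endian, version = 1, rest zero.
    e_ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
    ehdr = e_ident + struct.pack(
        "<HHIIIIIHHHHHH",
        2,                             # e_type: ET_EXEC
        3,                             # e_machine: EM_386
        1,                             # e_version
        BASE + EHDR_SIZE + PHDR_SIZE,  # e_entry: first byte after the headers
        EHDR_SIZE,                     # e_phoff: program header follows the file header
        0,                             # e_shoff: no section table
        0,                             # e_flags
        EHDR_SIZE,                     # e_ehsize
        PHDR_SIZE,                     # e_phentsize
        1,                             # e_phnum
        0,                             # e_shentsize
        0,                             # e_shnum
        0,                             # e_shstrndx
    )
    phdr = struct.pack(
        "<IIIIIIII",
        1,       # p_type: PT_LOAD
        0,       # p_offset: map the file from its start
        BASE,    # p_vaddr
        BASE,    # p_paddr
        total,   # p_filesz
        total,   # p_memsz
        5,       # p_flags: readable and executable
        0x1000,  # p_align
    )
    return ehdr + phdr + code


# mov eax, 1 ; mov ebx, 42 ; int 0x80  (the exit system call on 32-bit x86 Linux)
EXIT_42 = bytes.fromhex("b801000000 bb2a000000 cd80")
image = wrap_in_elf32(EXIT_42)
print(len(image), image[:4])
```

Written to a file and marked executable, the result should run on x86 Linux and exit with status 42. Viewed in a hex editor, the original twelve bytes appear unchanged at the end of the file, which is the property I was after.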

Along with the bytes, however, Elfhex supports some features that make these programs easier to write. One of the main such features is support for labels and references, the latter being replaced by the position of their referenced label at compile time. This allows for source code that does not depend on hard-coded positions and is robust against the addition or removal of bytes in other parts of the file. Other features include fragment (macro) support, padded literals in various bases, and strings. There is also support for extensions, which can be invoked from the source code and provide additional functionality. As an example, the main program contains an extension that allows the expression of x86 ModR/M bytes in Intel syntax.
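The idea behind labels and references can be sketched in a few lines of Python. The token syntax here (name: to define a label, <name> for a one-byte relative reference) is a toy notation invented for this example and is not Elfhex's real syntax; the function name assemble is likewise hypothetical.

```python
def assemble(tokens):
    """Resolve labels and one-byte relative references in a toy token list."""
    labels = {}
    offset = 0
    # Pass 1: record the byte offset at which each label is defined.
    for tok in tokens:
        if tok.endswith(":"):
            labels[tok[:-1]] = offset
        else:
            offset += 1  # hex bytes and references each occupy one byte

    # Pass 2: emit bytes, replacing each reference with a signed relative offset.
    out = bytearray()
    for tok in tokens:
        if tok.endswith(":"):
            continue
        if tok.startswith("<") and tok.endswith(">"):
            # Measured from the byte after the reference, as an x86 short jump expects.
            rel = labels[tok[1:-1]] - (len(out) + 1)
            out += rel.to_bytes(1, "little", signed=True)
        else:
            out.append(int(tok, 16))
    return bytes(out)


# nop, then a short jump (opcode eb) back to the nop.
print(assemble(["loop:", "90", "eb", "<loop>"]).hex(" "))
# A byte inserted before the label does not require editing the reference.
print(assemble(["90", "loop:", "90", "eb", "<loop>"]).hex(" "))
# Forward references work too, since labels are collected before any byte is emitted.
print(assemble(["eb", "<end>", "90", "90", "end:", "c3"]).hex(" "))
```

In a hex editor, a reference whose target lies on the other side of the edit has to be recomputed by hand after every insertion or deletion; here the second pass does that arithmetic each time.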

As a result, Elfhex is a language that is somewhat more convenient to use than writing programs directly in a hex editor, while still maintaining the supremacy of the byte as the fundamental unit of programming, without the need for mnemonics or other abstractions. While not very practical for larger applications, writing programs in Elfhex, I find, can grant a greater appreciation of the complexities of machine code.

As mentioned, Elfhex can be found on GitHub, or installed directly from PyPI using pip install elfhex. Give it a try!