Tumble Forth: From Dusk Till C

We now know the secret ingredients that make Forths alive, but we won’t go all the way to a full Forth because it would be much less interesting than the first magical moments of life. Moreover, I already have one for you: Dusk OS.

Dusk OS is a 32-bit STC Forth that broadly follows Starting Forth’s conventions, with a few caveats:

  1. Lowercase words.
  2. Filesystem-based rather than Block-based.
  3. There is no HEX/DEC mode. Hexadecimal literals are prefixed with $, like in $1234abcd. There is also support for character literals like ‘X’. Printing a literal in hexadecimal is done with the .x word.
  4. No support for decimals.
  5. DO..LOOP changes to for2..next because Dusk has fancy support for iterators.
  6. VARIABLE changes to value with what Dusk calls “to” semantics.
  7. The text editor is completely different.

You can of course read Dusk’s documentation which will tell you all you need to know, but you don’t have to. Your knowledge from having read Starting Forth will be enough, I’ll fill you in as needed.

Dawn of a new world

With all this ruckus about building a Forth and then discovering a whole new one, where were we again? Ah yes, we want to compile this:

int foo(int a, int b) {
    return a + b;
}

int main() {
    return foo(42, 12);
}

Compiling this, alright, but what to? Under UNIX, this compiled into an ELF executable that would be called and yield the result through its program return code. Under a Forth, we would rather want to compile this to a simple word.

The main function is a crutch that only UNIX needs. What we’d actually want to do with our compiled word is this:

42 12 foo . \ prints 54

This means we can ignore the main function. So, just like that, half our work is done!

Of course, we could compile this with Dusk’s existing C compiler, but we wouldn’t be learning much. The rest of this story arc will be devoted to write the code necessary to minimally compile the foo function’s source code into an executable word under i386 Dusk.

We will first build an assembler for the few instructions we need, then we’ll build ourselves a C tokenizer, and finally write the code that parses these incoming tokens and generate the corresponding i386 code through the assembler.

Pitching your tent at Dusk

Let’s prepare our stuff. First, make sure that you can run the i386 version of Dusk. Running make pcrun should result in QEMU launching with a Dusk OS prompt.

In addition to that, you should install mtools. Dusk's PC image builds and runs fine without it, but it won't pick up extra files in fs/ root, as you're about to do. With mtools installed, it will.

For this story arc, we’ll make the filesystem root our workspace. We’ll write our work in files at the root filesystem and then load them in Dusk. Let’s test that this works by creating a file called myasm.fs in the fs/ subfolder in Dusk’s source code, with this content:

." Hello Dusk!\n"

Then, run make pcrun again and at prompt, type ”f<< myasm.fs”. You get the message? Good! We’re ready to build our barebone i386 assembler.

Assembler scope

As mentioned above, we’ll go for a minimal solution for the presented problem, which means a minimal assembler. To determine the scope of that assembler, let’s try to see what foo would look like in NASM. You should already have an idea since it’s functionally equivalent to the + word we’ve already implemented in our baby Forth, with these differences:

  1. We’re in 32-bit mode, so register names have a E (for “extended”) prefix.
  2. All 16-bit references are upgraded to 32-bit. Whatever was 2 bytes in our baby Forth is now 4 bytes.
  3. Dusk’s register assigned to PS is ESI rather than BP.
  4. Dusk keeps PS Top Of Stack element in the EAX register.

The last point needs a special mention. Brad Rodriguez, in the same excellent “Moving Forth” article I’ve already linked previously, speaks of the subject in the section named “Top-Of-Stack in Register”. Dusk OS does this. Not only for performance reasons, but also because of the HAL, which is a vast subject we’ll not talk about now.

This peculiarity changes the logic for our word a lot. What was:

  1. Pop PS in a register.
  2. Add register to PS top.
  3. Write result to PS top.
  4. Return.

Becomes:

  1. PS top is already in EAX.
  2. Add [ESI] to EAX.
  3. Shrink ESI stack by 4 (that is, add 4). Result in EAX, which is still our PS top.
  4. Return.

In NASM, this would look like:

BITS 32
add eax, [esi]
add esi, 4
ret

This would mean that to compile foo, our assembler would only need two instructions, add and ret. Should be easy!

i386 encoding

Alright, let’s get started. First, we need a reference. “Intel 80386 Programmer’s Reference Manual” will do. It’s a big document, but I’ll walk you to the important parts, so there’s no need to read it beyond the pages I specifically mention, but you have to read those pages to follow this article. You might not understand the totality of what you read there, but that’s fine, my goal is to explain these pages to you.

Should we begin with the easy instruction? Let’s encode ret! The reference PDF has a Table of Contents with each instruction listed there. Through this TOC, you’ll easily find ret at p. 378. We see on that page that there are multiple forms of this instruction, but we will not bother with the fancy versions and use only the regular one, the “near” version. The listing on that page tells us that this particular instruction has a simple opcode, C3, with no parameter. Too easy! We can thus add our first word in myasm.fs:

: ret, $c3 c, ;

Why ”ret,” instead of ”ret”? To indicate that the effect of this word is to “write” to “here”. It’s just better form, Forth wise.

That assembler can now create its first word:

f<< myasm.fs
code nothing ret,

You can now call nothing and see that it does nothing. To be sure, you could try inspecting your stacks before and after with the .S utility word.

Starting Forth doesn’t talk about code, but it’s a frequent word in Forths that support assemblers. “code xxx” is the equivalent to “: xxx” except that it doesn’t go in compilation mode. This means that words we call afterwards are interpreted. The result of the snippet above is thus to create an empty “nothing” dictionary entry, followed by C3. You can verify this with “‘ nothing dump”.

i386 modr/m

So, that’s if for ret. Does this mean that half the job is done? Obi-Wan would answer “yes, from a certain point of view”. But you know better than to take the words of ghosts at face value right?

add is a much more complex instruction (p. 261) and we need to cover two of its forms:

  1. “direct register + indirect register”, that is, reference the 4 bytes (dword in x86-speak) where ESI points to and combine it to register EAX.
  2. “direct register + immediate”, that is, reference the 4 bytes constant encoded in the instruction itself and combine it to register ESI.

How should we encode them? When you look at the i386 reference, it’s hard to understand what you read and find the encoding you’re supposed to use. You easily see the “EAX, imm32” form, which should fit one of the forms if it targeted the right register, but the rest of the forms are gibberish. You can guess that “r32” means a reference to a 32-bit register, but what’s “r/m32”? How do we handle indirect references? The answer is that these addressing forms are encoded in x86’s famous “modr/m” encoding (p. 241).

As we’ve already seen before, the x86 family has a wide range of addressing modes for its instructions. Those addressing modes are encoded in a second byte following the main instruction opcode. Except for the “shortcut forms” of some instructions (such at the “EAX, imm32” one we see for add), most instruction forms involving a register or a memory location will have the modr/m byte. There’s at most one per instruction, and it has this bit structure:

  1. b7:6 mod
  2. b5:3 reg
  3. b2:0 r/m (register or memory)

The naming is confusing, but to Intel’s defense, I don’t see what other names I’d use. Those fields are too flexible for a more specific naming scheme.

The mod field selects one of the 4 general addressing modes1 documented at p. 244:

In other words, whenever the modr/m byte fits the C0 mask, it means that both operands are “direct”, otherwise, one of the operands is “indirect”.

The next two fields select the operands for the instructions using register IDs also documented in p. 2442. The reg field is generally used for the "direct register" operand and is not affected by mod. In the “add eax, [esi]” instance, this field would contain a reference to EAX, thus 0. This field is not always the destination. For example, in “add [esi], eax”, the reg field would also contain EAX. It’s the instruction opcode that determines operands order (01 vs 03)34.

The last field, r/m, is the ID of the register that will be affected by mod. For example, under a mod 00, we would set r/m to ESI (6) to express [ESI].

With this knowledge, we can encode “add eax, [esi]” in our head. First, we need to select the right add form, which is the “r32, r/m32” one. This means a 03 instruction opcode. Then, we need a modr/m byte with mod=00 reg=EAX r/m=ESI, which means 06. This gives us a final encoded instruction: 03 06. Is this what NASM gives us? Great, we agree!

Assembler API

First of all, let’s have useful constants:

0 const eax
1 const ecx
2 const edx
3 const ebx
4 const esp
5 const ebp
6 const esi
7 const edi

We’ll use these constants as arguments to our API. Now, the tricky part is to elegantly express the “use EAX as a direct destination and use ESI as an indirect source” command to the add instruction. Dusk’s i386 assembler has such an API, but the scope of this subject is too wide for this story arc. We’ll defer this to another story arc and go with a brutally simple, albeit ugly, API: specifying the form in the word name.

For example, for the form we need, we will add a word named “addr[], ( dst src -- )” which means “write the add instruction in its “r32, r/m32” form with mod=00 dst=r32 and src=r/m32”. With what we already know of i386 encoding, the implementation of this word is trivial:

: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ;

This allows us to create a new word:

code foo eax esi addr[], ret,
42 12 foo . \ prints 54

So, that’s it? Success? From a certain point of view, yes, but this word has a problem: it leaks an element to PS. Instead of the signature “a b -- n” that we want, it has the signature “a b -- a n”. To drop the a, we need to “nip” it from the stack with “add esi, 4”.

The instruction form that allows this is the “r/m32, imm32”5 form. This one also has a modr/m byte, but also a 32-bit immediate that will follow it. The modr/m byte in this case is special: it only has one operand. In these cases, only the mod and r/m fields are used, reg stays empty. However, to pack as many instructions as possible in as few bytes as possible, i386 encoding scheme use these 3 precious bits to expand the tight 8-bit base opcode space. Therefore, for the “r/m32, imm32” form, add shares the 81 opcode with other instructions such as sub, adc,etc. So, you can’t put anything in the reg field, you have to put what Intel tells you to put. In the documentation, it’s the number after the slash. In our case /0.

Wrapping all this up, this allows us to add our new word:

: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ;

You guessed what the ”$c0 or” was for? To select mod=03! With this new word, we can complete our foo word:

code foo
  eax esi addr[],
  esi 4 addri,
  ret,
42 12 foo . \ prints 54

You can now see, with .S, that we don’t leak to PS anymore! This completes our extremely minimal i386 assembler. We’ll be able to use it in our C compiler, when comes the time of generating the code.

The source code for our toy assembler can be downloaded here

Going further

We won't go further, assembler-wise, in this story arc. If you wish to explore this subject further, I recommend you a good tool to help you visualize x86 encoding. Then, you can of course take a look at Dusk's i386 assembler.

Up next

In the next article, we’ll set aside our shiny new assembler for a while as we tackle the first part of a C compiler by building a tokenizer.

Next: Feeding the beast


  1. I ignore special cases for simplicity reasons and because we won’t use them in this story arc. 

  2. EAX=0 ECX=1, etc. 

  3. It’s also this opcode that determines that the operation is a 32-bit one. If we wanted to do the same operation with the same operands in 8-bit, we’d change the opcode to 00 or 02. What about 16-bit? It’s complicated and out of scope of this article. 

  4. It’s a bit confusing, but one realization about i386 will help you grok the logic. In i386, it’s impossible to have an instruction with two “mod” operands. For example, “add [esi], [eax]” is impossible. Therefore, one of the operands is always “direct”. reg is the one. 

  5. Let’s ignore the imm8 optimization for now.