Tumble Forth: From Dusk Till C

We now know the secret ingredients that make Forths alive, but we won’t go all the way to a full Forth because it would be much less interesting than the first magical moments of life. Moreover, I already have one for you: Dusk OS.

Dusk OS is a 32-bit STC Forth that broadly follows Starting Forth’s conventions, with a few caveats:

  1. Lowercase words.
  2. Filesystem-based rather than Block-based.
  3. There is no HEX/DEC mode. Hexadecimal literals are prefixed with $, like in $1234abcd. There is also support for character literals like ‘X’. Printing a literal in hexadecimal is done with the .x word.
  4. The text editor is completely different.

You can of course read Dusk’s documentation which will tell you all you need to know, but you don’t have to. Your knowledge from having read Starting Forth will be enough, I’ll fill you in as needed.

Dawn of a new world

With all this ruckus about building a Forth and then discovering a whole new one, where were we again? Ah yes, we want to compile this:

int foo(int a, int b) {
    return a + b;
}

int main() {
    return foo(42, 12);
}

Compiling this, alright, but what to? Under UNIX, this compiled into an ELF executable that would be called and yield the result through its program return code. Under a Forth, we would rather want to compile this to a simple word.

The main function is a crutch that only UNIX needs. What we’d actually want to do with our compiled word is this:

42 12 foo . \ prints 54

This means we can ignore the main function. So, just like that, half our work is done!

Of course, we could compile this with Dusk’s existing C compiler, but we wouldn’t be learning much. The rest of this story arc will be devoted to write the code necessary to minimally compile the foo function’s source code into an executable word under i386 Dusk.

We will first build an assembler for the few instructions we need, then we’ll build ourselves a C tokenizer, and finally write the code that parses these incoming tokens and generate the corresponding i386 code through the assembler.

Pitching your tent at Dusk

Let’s prepare our stuff. First of all, this part of the story arc was written with Dusk v19 as a reference. If you use another version, snippets might not work.

Then, make sure that you can run the x86 version of Dusk. If you're on an Intel CPU on a host OS that allows usermode Dusk to run, it's the best option. However, building it is a bit tricky because you have the force the i386 mode, which involves complexity. Refer to the README.

Otherwise, your other option would be emulating a PC with QEMU. This can be done easily through the PC deployment package.

For this to work, you'll need QEMU. Then, go in the deploy/pc subfolder and run make clean emul.

For this story arc, we'll need to work on a file that is ran by Dusk. You can create it now. Let's call it mycc.fs, in which we'll put this test contents:

."Hello Dusk!\n"

In the following articles, we'll refer to "loading" mycc.fs. Depending on whether you're on Usermode or QEMU, this means two different things.

Usermode

Loading a file in usermode is convenient. You use the -f flag. For example, if you run:

./dusk -f mycc.fs -p

You'll get a Dusk OS prompt, but before that prompt, you'll see "Hello Dusk!". This means that Dusk has loaded the specified file, then continued on with the prompt (without the -p flag, it quits).

If you type a few commands in the prompt, you'll notice that your commands are echoed twice. To avoid this, you need to set your TTY in raw mode:

stty -icanon -echo min 0

But this makes your shell prompt not echo anything. What you need to do is to wrap your Dusk invocation like this:

stty -icanon -echo min 0; ./dusk ; stty icanon echo

... or you can live with the double echoing. That works too...

PC deployment under QEMU

Loading mycc.fs isn't as convenient under QEMU as it is under Usermode. The brutal way of doing so is to copy mycc.fs in the fs/ directory and then run make clean emul in deploy/pc. Loading the file can then be done by typing:

f<< mycc.fs

However, that means that every change in your file requires a full rebuild of the PC image. That's a bit time consuming. What you can do instead is to use mtools to copy mycc.fs in the image:

mcopy -i pc.img mycc.fs ::

Then, run make emul (without the clean so there's no rebuild) and run f<< mycc.fs.

So, you see that "Hello Dusk!"? Good! We’re ready to build our barebone i386 assembler.

Assembler scope

As mentioned above, we’ll go for a minimal solution for the presented problem, which means a minimal assembler. To determine the scope of that assembler, let’s try to see what foo would look like in NASM. You should already have an idea since it’s functionally equivalent to the + word we’ve already implemented in our baby Forth, with these differences:

  1. We’re in 32-bit mode, so register names have a E (for “extended”) prefix.
  2. All 16-bit references are upgraded to 32-bit. Whatever was 2 bytes in our baby Forth is now 4 bytes.
  3. Dusk’s register assigned to PS is ESI rather than BP.
  4. Dusk keeps PS Top Of Stack element in the EAX register.

The last point needs a special mention. Brad Rodriguez, in the same excellent “Moving Forth” article I’ve already linked previously, speaks of the subject in the section named “Top-Of-Stack in Register”. Dusk OS does this. Not only for performance reasons, but also because of the HAL, which is a vast subject we’ll not talk about now.

This peculiarity changes the logic for our word a lot. What was:

  1. Pop PS in a register.
  2. Add register to PS top.
  3. Write result to PS top.
  4. Return.

Becomes:

  1. PS top is already in EAX.
  2. Add [ESI] to EAX.
  3. Shrink ESI stack by 4 (that is, add 4). Result in EAX, which is still our PS top.
  4. Return.

In NASM, this would look like:

BITS 32
add eax, [esi]
add esi, 4
ret

This would mean that to compile foo, our assembler would only need two instructions, add and ret. Should be easy!

i386 encoding

Alright, let’s get started. First, we need a reference. “Intel 80386 Programmer’s Reference Manual” will do. It’s a big document, but I’ll walk you to the important parts, so there’s no need to read it beyond the pages I specifically mention, but you have to read those pages to follow this article. You might not understand the totality of what you read there, but that’s fine, my goal is to explain these pages to you.

Should we begin with the easy instruction? Let’s encode ret! The reference PDF has a Table of Contents with each instruction listed there. Through this TOC, you’ll easily find ret at p. 378. We see on that page that there are multiple forms of this instruction, but we will not bother with the fancy versions and use only the regular one, the “near” version. The listing on that page tells us that this particular instruction has a simple opcode, C3, with no parameter. Too easy! We can thus add our first word in myasm.fs:

: ret, $c3 c, ;

Why ”ret,” instead of ”ret”? To indicate that the effect of this word is to “write” to “here”. It’s just better form, Forth wise.

That assembler can now create its first word:

( load mycc.fs )
code nothing ret,

You can now call nothing and see that it does nothing. To be sure, you could try inspecting your stacks before and after with the .S utility word.

Starting Forth doesn’t talk about code, but it’s a frequent word in Forths that support assemblers. “code xxx” is the equivalent to “: xxx” except that it doesn’t go in compilation mode. This means that words we call afterwards are interpreted. The result of the snippet above is thus to create an empty “nothing” dictionary entry, followed by C3. You can verify this with “‘ nothing dump”.

i386 modr/m

So, that’s if for ret. Does this mean that half the job is done? Obi-Wan would answer “yes, from a certain point of view”. But you know better than to take the words of ghosts at face value right?

add is a much more complex instruction (p. 261) and we need to cover two of its forms:

  1. “direct register + indirect register”, that is, reference the 4 bytes (dword in x86-speak) where ESI points to and combine it to register EAX.
  2. “direct register + immediate”, that is, reference the 4 bytes constant encoded in the instruction itself and combine it to register ESI.

How should we encode them? When you look at the i386 reference, it’s hard to understand what you read and find the encoding you’re supposed to use. You easily see the “EAX, imm32” form, which should fit one of the forms if it targeted the right register, but the rest of the forms are gibberish. You can guess that “r32” means a reference to a 32-bit register, but what’s “r/m32”? How do we handle indirect references? The answer is that these addressing forms are encoded in x86’s famous “modr/m” encoding (p. 241).

As we’ve already seen before, the x86 family has a wide range of addressing modes for its instructions. Those addressing modes are encoded in a second byte following the main instruction opcode. Except for the “shortcut forms” of some instructions (such at the “EAX, imm32” one we see for add), most instruction forms involving a register or a memory location will have the modr/m byte. There’s at most one per instruction, and it has this bit structure:

  1. b7:6 mod
  2. b5:3 reg
  3. b2:0 r/m (register or memory)

The naming is confusing, but to Intel’s defense, I don’t see what other names I’d use. Those fields are too flexible for a more specific naming scheme.

The mod field selects one of the 4 general addressing modes1 documented at p. 244:

In other words, whenever the modr/m byte fits the C0 mask, it means that both operands are “direct”, otherwise, one of the operands is “indirect”.

The next two fields select the operands for the instructions using register IDs also documented in p. 2442. The reg field is generally used for the "direct register" operand and is not affected by mod. In the “add eax, [esi]” instance, this field would contain a reference to EAX, thus 0. This field is not always the destination. For example, in “add [esi], eax”, the reg field would also contain EAX. It’s the instruction opcode that determines operands order (01 vs 03)34.

The last field, r/m, is the ID of the register that will be affected by mod. For example, under a mod 00, we would set r/m to ESI (6) to express [ESI].

With this knowledge, we can encode “add eax, [esi]” in our head. First, we need to select the right add form, which is the “r32, r/m32” one. This means a 03 instruction opcode. Then, we need a modr/m byte with mod=00 reg=EAX r/m=ESI, which means 06. This gives us a final encoded instruction: 03 06. Is this what NASM gives us? Great, we agree!

Assembler API

First of all, let’s have useful constants:

0 const eax
1 const ecx
2 const edx
3 const ebx
4 const esp
5 const ebp
6 const esi
7 const edi

We’ll use these constants as arguments to our API. Now, the tricky part is to elegantly express the “use EAX as a direct destination and use ESI as an indirect source” command to the add instruction. Dusk’s i386 assembler has such an API, but the scope of this subject is too wide for this story arc. We’ll defer this to another story arc and go with a brutally simple, albeit ugly, API: specifying the form in the word name.

For example, for the form we need, we will add a word named “addr[], ( dst src -- )” which means “write the add instruction in its “r32, r/m32” form with mod=00 dst=r32 and src=r/m32”. With what we already know of i386 encoding, the implementation of this word is trivial:

: addr[], ( dst src -- ) $03 c, swap 3 lshift or c, ;

This allows us to create a new word:

code foo eax esi addr[], ret,
42 12 foo . \ prints 54

So, that’s it? Success? From a certain point of view, yes, but this word has a problem: it leaks an element to PS. Instead of the signature “a b -- n” that we want, it has the signature “a b -- a n”. To drop the a, we need to “nip” it from the stack with “add esi, 4”.

The instruction form that allows this is the “r/m32, imm32”5 form. This one also has a modr/m byte, but also a 32-bit immediate that will follow it. The modr/m byte in this case is special: it only has one operand. In these cases, only the mod and r/m fields are used, reg stays empty. However, to pack as many instructions as possible in as few bytes as possible, i386 encoding scheme use these 3 precious bits to expand the tight 8-bit base opcode space. Therefore, for the “r/m32, imm32” form, add shares the 81 opcode with other instructions such as sub, adc,etc. So, you can’t put anything in the reg field, you have to put what Intel tells you to put. In the documentation, it’s the number after the slash. In our case /0.

Wrapping all this up, this allows us to add our new word:

: addri, ( reg imm -- ) $81 c, swap $c0 or c, , ;

You guessed what the ”$c0 or” was for? To select mod=03! With this new word, we can complete our foo word:

code foo
  eax esi addr[],
  esi 4 addri,
  ret,
42 12 foo . \ prints 54

You can now see, with .S, that we don’t leak to PS anymore! This completes our extremely minimal i386 assembler. We’ll be able to use it in our C compiler, when comes the time of generating the code.

The source code for our toy assembler can be downloaded here

Going further

We won't go further, assembler-wise, in this story arc. If you wish to explore this subject further, I recommend you a good tool to help you visualize x86 encoding. Then, you can of course take a look at Dusk's i386 assembler.

Up next

In the next article, we’ll set aside our shiny new assembler for a while as we tackle the first part of a C compiler by building a tokenizer.

Next: Feeding the beast


  1. I ignore special cases for simplicity reasons and because we won’t use them in this story arc. 

  2. EAX=0 ECX=1, etc. 

  3. It’s also this opcode that determines that the operation is a 32-bit one. If we wanted to do the same operation with the same operands in 8-bit, we’d change the opcode to 00 or 02. What about 16-bit? It’s complicated and out of scope of this article. 

  4. It’s a bit confusing, but one realization about i386 will help you grok the logic. In i386, it’s impossible to have an instruction with two “mod” operands. For example, “add [esi], [eax]” is impossible. Therefore, one of the operands is always “direct”. reg is the one. 

  5. Let’s ignore the imm8 optimization for now.