Tumble Forth: In the Eye of the Compiler

Now that we have an assembler and a tokenizer, we can proceed with the final step of our compiler: the parse+generate[1] one!

Because our compiler aims to minimally compile the foo() function, it's actually going to be quite simple. The parsing of the function can be divided into four main parts:

  1. return type
  2. name
  3. arguments
  4. body

Return type

Because our function involves only one type, this part of the parsing is actually a no-op. It has no purpose because our generation code is going to hardcode the int type. Therefore, all we need to do is consume the token and assert that it's int.

This task is of course trivial, but what is less trivial is to design the API for our C compiler. My proposal is this, which you can put in a new unit called mycc.fs:

f<< myasm.fs
f<< tok.fs
: err abort" CC error" ;
: expect ( tok str -- ) s= not if err then ;
: parseType ( -- ) nextt "int" expect ;
: cc parseType tokstype ;
: cc<< ['] cc exec<< ;

This allows you to do:

f<< mycc.fs
cc<< foo.c

As you can see, the main entry point of our C compiler is the word cc, which expects the C source to compile to come from the input stream; cc<< simply runs cc with the specified file as that input.

At this stage, all cc does is check that the function's return type is correct, then spit out the rest of the file as unparsed tokens for debugging.

If you modify foo.c to give foo another return type, you'll see that cc will correctly raise an error.
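
For example (a hypothetical edit; the "CC error" message comes from our err word):

\ after changing foo's return type in foo.c from int to char:
cc<< foo.c \ aborts with "CC error"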

s= ( str1 str2 -- f ), which you haven't seen yet, compares two strings and returns whether they have the same contents and length.
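
For example, at the prompt (a quick sketch; as with sfind further down, I'm assuming a true flag prints as 1):

"int" "int" s= . \ prints 1
"int" "char" s= . \ prints 0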

Name

The next token coming up is the function name. Handling this one is also simple: all we need to do is create a new dictionary entry with the same name. It can be done thus:

: parseName ( -- ) sysdict nextt entry ;
: cc parseType parseName tokstype ;

After you've amended mycc.fs, you can run it again and see that it consumed the foo token and created a foo word (which you can verify with ' foo). This word is of course broken because it's empty; don't try to call it.

And that's all there is to it. If we wanted our CC to support function calls, we'd want to bind this name to a signature structure somehow, and if we wanted to support the static keyword, we'd want to conditionally create this entry. But since we don't want all this, we can afford to stay simple.

Code wise, there are two words you haven't seen yet, entry and sysdict.

entry ( 'dict str -- ) is the low level word to create a dictionary entry. Instead of reading its name from the input stream like code and :, it gets it from a string supplied from PS. It also has the ability to create the entry in any dictionary, not just the system one. In this case, what we're after is the former ability, not the latter, because we want to create our word in the system dictionary.

Which brings us to sysdict, which is a constant that yields the address of a pointer to the latest entry in the system dictionary. It's the exact same thing as the dictionary variable in our baby Forth. The entry word updates that pointer when it's finished creating the entry.
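
To get a feel for the pair, here's a hypothetical snippet (the name bar is made up) that does by hand exactly what parseName does:

sysdict "bar" entry \ create an empty entry named bar in the system dictionary
' bar . \ the entry exists and ' finds it; don't call it, it's empty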

Arguments

Our parser will handle an arbitrary number of arguments, all ints, and keep that list somewhere in memory so that we can map an identifier name to an ESI offset or EAX. The code looks like this:

$100 const ARGSLEN
create args ARGSLEN allot
: parseArgs ( -- )
  args nextt "(" expect begin ( a )
    parseType nextt over strmove s) nextt dup "," s= while ( a tok )
    drop repeat ( a tok )
  ")" expect 0 swap c! ;
: cc parseType parseName parseArgs tokstype ;

With these additions, our cc word will consume the arguments and create a structure defined by Dusk's "lib/str" library, the String List. That list is very simple: it's a list of strings that follow each other in memory, ended by a null string (a string with a 0 length field). You can visualize that list by dumping the contents of args after having run cc:

args dump
:0000d299 0161 0162 0000 0000 0000 0000 0000 0000 .a.b............
...

Such a string list can easily be iterated upon and thus fulfills our requirement to map names to indexes. The lib/str library has a word for this, sfind ( str list -- idx? f ) which returns the index of the specified string in the list. It has the same kind of dynamic stack effect as cidx presented in the previous article:

"a" args sfind . . \ prints 10
"b" args sfind . . \ prints 11
"hello" args sfind . \ prints 0

Memory-wise, you can see that I chose a static buffer for simplicity, with the tradeoff that it can only handle ARGSLEN bytes in its list. Parsing a function that busts this limit will result in memory corruption.

Code-wise, we have once again two new words, s) and strmove.

s) ( str -- a ) simply jumps over the specified string and returns the address directly following its last character. This allows easy iteration of a string list.

strmove ( str dst -- ) is a utility wrapper around cmove and copies the contents of str, including its length byte, to address dst.
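
To tie strmove and s) together, here's a rough sketch that builds a two-entry string list by hand, the same way parseArgs does (mylist and its size are made up for the example):

create mylist $20 allot
"x" mylist strmove \ copy "x", length byte included, at the start of the buffer
"y" mylist s) strmove \ copy "y" right after "x"
0 mylist s) s) c! \ a zero length byte terminates the list
"y" mylist sfind . . \ prints 1 1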

Body

We've now reached the crucial point of our process where we can finally answer the question at the root of this tumbling down adventure, that is, how is the C source of our foo() function compiled and then run?

: assert ( f -- ) not if err then ;
: parseExpression ( -- ) 
  nextt args sfind assert nextt "+" expect
  nextt args sfind assert ( idx1 idx2 )
  ?swap 1 = assert 0 = assert ( )
  eax esi addr[], ;
: argscnt ( -- cnt ) 0 args begin dup c@ while swap 1+ swap s) repeat drop ;
: parseStatement ( -- )
  nextt "return" expect parseExpression nextt ";" expect
  esi argscnt 1- 4 * addri, ret, ;
: parseBody nextt "{" expect parseStatement nextt "}" expect ;
: cc parseType parseName parseArgs parseBody ;

If you've been using the POSIX VM, now is the time to switch back to i386 Usermode or QEMU because the POSIX VM can't run i386 code:

f<< mycc.fs
cc<< foo.c
42 54 foo . \ prints 96
.S \ shows that there is no PS leak

Code-wise, there's only one word you haven't seen yet: ?swap ( a b -- lo hi ), which sorts the top two items of PS. This allows us to easily check if our expression contains both argument indexes 0 and 1 regardless of the order.
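
A quick illustration at the prompt:

0 1 ?swap . . \ prints 1 0
1 0 ?swap . . \ also prints 1 0: the initial order doesn't matter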

The rest of the code is usage of the i386 assembler you've written yourself.
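
To make that concrete: assuming foo.c is still the two-int-argument a+b function we've been working with, the code cc emits for it boils down to three assembler calls (the x86 mnemonics in the comments are my reading of those words):

eax esi addr[], \ add eax, [esi]: add one argument (on PS) to the other (TOS, in eax)
esi 4 addri, \ add esi, 4: drop the consumed argument, (argscnt-1)*4 = 4
ret, \ ret: return with the sum as the new TOS in eax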

So this is it! It's done! "Mission Accomplished!", as they say in America, with much fanfare.

Of course, this parsing code only works for a tiny subset of the C language. It can't:

  1. Have more than one statement in its body.
  2. Have a statement other than return.
  3. Have an empty return statement.
  4. Have a return type or argument types other than int.
  5. Parse an expression that isn't a + binary operation with argument references as its two terms.

However, it's not a complete sham and does have a bit of leeway:

  1. It checks that the identifiers in the expression are correct. When they're not, the code is going to raise an error instead of generating the wrong code.
  2. The function can have more than two arguments. You can't use them in the expression, but the return handler will still properly consume those arguments from PS.
  3. The function can have any name.
  4. b + a will work too.

But what is more important is that we can already see how we'd go about improving the compiler.

We'd want to use a third argument in the expression? Add displacement support to the assembler (to allow references like [esi+4]) and conditionally use this capability in parseExpression.

We'd want support for constant numbers? Try parse ( str -- n f ) on the token and if it succeeds, conditionally use the addri, variant using that number instead of addr[],.

We'd want support for -? Instead of expecting a "+" token, add a condition that maps + to add and - to sub. This of course requires an improved assembler.

... and so on. But these improvements and the ones that follow are deep subjects and need separate story arcs.

Conclusion

We've reached the end of this story arc by answering the initial question, that is, how can this simple piece of C code be compiled and run? We've also managed to generate code that's faster than GCC's or clang's because we're freed from UNIX's wasteful calling conventions. We also managed to answer the "running" part of the question by using a system that is so much simpler than UNIX that it becomes "just call the address!".

Of course, the challenge of keeping up with modern C compilers in terms of performance quickly becomes much harder as the compiled code becomes larger, but isn't it refreshing to see that, at the heart of it, the task of compiling C code in a sound manner isn't impossibly complex?

I hope that this pilot story arc helped you demystify the subjects of low-level computing, assembly and compilation, and motivated you to dig deeper into these subjects, which we'll do in further story arcs.


  1. Some compilers, most even, have separate steps for parsing and code generation, using an Abstract Syntax Tree representation of the code in between. We don't (DuskCC doesn't either). We generate code directly as we parse. It comes with a few drawbacks, but results in much simpler code.