We have an assembler and a tokenizer, we can proceed with the final step of our compiler, the parse+generate1 one!
Because our compiler aims to minimally compile the foo()
function, it's
actually going to be quite simple. The parsing of the function can be divided in
four main parts:
Because our function involves only one type, this part of the parsing is
actually a no-op. It has no purpose because our generation code is going to
hardcode for the int
type. Therefore, all we need to do is to consume the
token and assert that it's int
.
This task is of course trivial, but what is less trivial is to design the API
for our C compiler. My proposition is this, which you can put in a
new unit called mycc.fs
:
f<< myasm.fs
f<< tok.fs
: err abort"CC error" ;
: expect ( tok str -- ) s= not if err then ;
: parseType ( -- ) nextt "int" expect ;
: cc parseType tokstype ;
: cc<< ['] cc exec<< ;
This allows you to do:
f<< mycc.fs
cc<< foo.c
As you can see, the main entry point of our C compiler would be the word cc
,
which expects the file
const to contain the C source to compile.
At this stage, what cc
would do is to check that the function return type is
correct, then spit the rest of the file as unparsed tokens for debugging.
If you modify foo.c
to give foo
another return type, you'll see that cc
will correctly raise an error.
s= ( str1 str2 -- f )
which you haven't seen yet, compares two strings and
return whether they have the same contents and length.
The next token coming up is the function name. Handling this one is also simple as all we need to do is to create a new dictionary entry with the same name. It can be done thus:
: parseName ( -- ) sysdict nextt entry ;
: cc parseType parseName tokstype ;
After you've amended mycc.fs
, you can run it again and see that it consumed
the foo
token and created a foo
word (which you can verify with ' foo
).
This word is of course broken because it's empty, don't try to call it.
And that's all there is to it. If we wanted our CC to support function call,
we'd want to bind this name to a signature structure somehow, and if we wanted
to support the static
keyword, we'd want to conditionally create this entry.
But since we don't want all this, we can afford to stay simple.
Code wise, there are two words you haven't seen yet, entry
and sysdict
.
entry ( 'dict str -- )
is the low level word to create a dictionary entry.
Instead of reading its name from the input stream like code
and :
, it gets
it from a string supplied from PS
. It also has the ability to create the entry
in any dictionary, not just the system one. In this case, what we're after is
the former ability, not the latter, because we want to create our word in the
system dictionary.
Which brings us to sysdict
, which is a constant that yields the address of a
pointer to the lastest entry to the system dictionary. It's the exact same thing
as the dictionary
variable in our baby Forth. The entry
word updates that
pointer when it's finished creating the entry.
Our parser will handle an arbitrary number of arguments, all int
s, and keep
that list somewhere in memory so that we can map an identifier name to a ESI
offset or EAX
. The code looks like this:
$100 const ARGSLEN
create args ARGSLEN allot
: parseArgs ( -- )
args nextt "(" expect begin ( a )
parseType nextt over strmove s) nextt dup "," s= while ( a tok )
drop repeat ( a tok )
")" expect 0 swap c! ;
: cc parseType parseName parseArgs tokstype ;
With these additions, our cc
word will consume the arguments and create a
structure defined by Dusk's "lib/str" library, the String List. That
list is very simple, it's a list of strings that follow each other in memory,
ended by a null string (a string with a 0 length field). You can vizualize that
list by dumping the contents of args
after having run cc
:
args dump
:0000d299 0161 0162 0000 0000 0000 0000 0000 0000 .a.b............
...
Such a string list can easily be iterated upon and thus fulfills our
requirement to map names to indexes. The lib/str
library has a word for this,
sfind ( str list -- idx? f )
which returns the index of the specified string
in the list. It has the same kind of dynamic stack effect as cidx
presented
in the previous article:
"a" args sfind . . \ prints 10
"b" args sfind . . \ prints 11
"hello" args sfind . \ prints 0
Memory-wise, you can see that I chose a static buffer for simplicity, with the
tradeoff that it can only handle ARGSLEN
bytes in its list. Parsing a
function that busts this limit will result in memory corruption.
Code-wise, we have once again two new words, s)
and strmove
.
s) ( str -- a )
simply jumps over the specified string and returns the
address directly following its last character. This allows easy iteration of a
string list.
strmove ( str dst -- )
is a utility wrapper around cmove
and copies the
contents of str
, including its length byte, to address dst
.
We've now reached the crucial point of our process where we can finally answer
the question at the root of this tumbling down adventure, that is, how is the
C source of our foo()
function compiled and then ran?
: assert ( f -- ) not if err then ;
: parseExpression ( -- )
nextt args sfind assert nextt "+" expect
nextt args sfind assert ( idx1 idx2 )
?swap 1 = assert 0 = assert ( )
eax esi addr[], ;
: argscnt ( -- cnt ) 0 args begin dup c@ while swap 1+ swap s) repeat drop ;
: parseStatement ( -- )
nextt "return" expect parseExpression nextt ";" expect
esi argscnt 1- 4 * addri, ret, ;
: parseBody nextt "{" expect parseStatement nextt "}" expect ;
: cc parseType parseName parseArgs parseBody ;
If you've been using the POSIX VM, now is the time to switch back to i386 Usermode or QEMU because the POSIX VM can't run i386 code:
f<< mycc.fs
cc<< foo.c
42 54 foo . \ prints 96
.S \ shows that there is no PS leak
Code-wise, there's only one word you haven't seen yet, ?swap ( a b -- lo hi )
which sorts the top two items of PS
. This allows us to easily check if our
expression contains both argument indexes 0 and 1 regardless of the order.
The rest of the code is usage of the i386 assembler you've written yourself.
So this is it! it's done! Mission Accomplished! as they say in America with much fanfare.
Of course, this parsing code only works for a tiny subset of the C language. It can't:
return
.return
statement.int
.+
binary operations with argument
references as its two terms.However, it's not a complete sham and does have a bit of leeway:
return
handler will still properly consume those
arguments from PS
.b + a
will work too.But what is more important is that we can already see how we'd go around improving the compiler.
We'd want to use a third argument in the expression? Add displacement support
to the assembler (to allow references like [esi+4]
) and conditionally use this
capability in parseExpression
.
We'd want support for constant numbers? Try parse ( str -- n f )
on the token
and if it succeeds, conditionally use the addri,
variant using that number
instead of addr[],
.
We want to support -
? Instead of expecting a "+" token, add a condition that
maps +
to add
and -
to sub
. This of course requires an improved
assembler.
... and so on. But these improvements and the ones that follow are deep subjects and need separate story arcs.
We've reached the end of this story arc by answering the initial question, that is, how can this simple piece of C code be compiled and ran? We've also managed to generate code that's faster than GCC's or clang's because we're freed from UNIX's wasteful calling conventions. We also manage to answer the "running" part of the question by using a system that is so much simpler than UNIX that it becomes "just call the address!".
Of course, the challenge of keeping up with modern C compilers in terms of performance quickly becomes much harder as the compiled code becomes larger, but isn't it refreshing to see that, at the heart of it, the task of compiling C code in a sound manner isn't impossibly complex?
I hope that this pilot story arc helped you to demystify the subjects of low level computing, assembly and compilation and motivated you to dig these subjects further, which we'll do in further story arcs.
Some compilers, most even, have separate steps for parsing and code generation, using an Abstract Syntax Tree representation of the code in between. We don't (DuskCC doesn't either). We generate code directly as we parse. It comes with a few drawbacks, but results in a much simpler code. ↩