Now that we have our assembler, we can begin tackling C directly. You'll remember that this is the code we want to compile:
int foo(int a, int b) {
  return a + b;
}
The first step in the compilation process is to separate this stream into a series of tokens, which will then be fed to the parser.
Tokenization in C is slightly more complex than tokenization in Forth: whitespaces and symbols are token boundaries. For example, "return a+b;" yields tokens "return", "a", "+", "b" and ";".
You'll see in this article that the actual logic for doing so is almost trivial, hardly enough to be worth a whole article. However, to build this API, we'll need to leverage mechanisms in Dusk that are not trivial and this article will walk you through them.
The logic we build in this article is CPU independent. Unlike the previous article, it's not required to run Dusk with make pcrun; make run will also work. It's a bit more convenient to use under POSIX because it's faster to build and you can paste text in the terminal. It will pick up files in the fs subfolder in the same way as make pcrun.
Our tokenizer will yield tokens as strings. Strings are a series of contiguous bytes in memory preceded by a "length" field, which is one byte in size. When we refer to a string, we refer to the address of the "length" field.
The "
word allows you to create a string literal. In most Forths, this word
is limited by Forth tokenization rules, which means that it must always be
followed by a whitespace, which will not be included in the string. Dusk is
fancy and doesn't have this limitation, so string contents can directly follow
the "
. For example, "hello"
yields an address in memory that will look
like this:
  A+0  A+1  A+2  A+3  A+4  A+5
+-----------------------------+
| 05 | 68 | 65 | 6C | 6C | 6F |
+-----------------------------+
Dusk doesn't bother with implementing strlen, as it's exactly the same as c@. It's also frequent to process strings with c@+ ( addr -- addr+1 c ). Example:
: foo ( str -- ) c@+ for ( a ) c@+ emit next ( a ) drop ;
"hello" foo \ prints "hello", exactly like the "stype" word
The snippet above also includes a frequent pattern in Dusk code. First, words are documented with a signature comment. This one indicates that it expects a string to live on PS top and that the word will consume it.
We also see "stack status" comments, to help follow along with the code. For example, after c@+ is called, PS becomes ( a length ), with a being the address of the first character of the string. Because the for consumes length, the loop body ends up with a stack status of ( a ).
To avoid making words too heavy, we don't bother commenting the stack status at each step. However, Dusk has the habit of documenting stack status at particular points: in begin..while..repeat loops, exit points can have different stack status!

Our tokenizer needs to feed itself from some kind of stream. You can feed yourself directly from the main stream using the in< ( -- c ) word. Here's an example usage:
: foo in< emit ;
foo A \ the A character is not interpreted by Forth, but by "foo"
But getting your C code from the input stream is a bit inconvenient, right? It's much better to get it from a file. It turns out that the input stream in Dusk is very flexible, so this can easily be done: exec<< can be used to call a word with its input stream "hijacked" with the contents of a specified file. For example:

: spit begin in< dup -1 <> while emit repeat drop ;
' spit exec<< foo.c

This will spit the entire contents of foo.c to the console.
What we'll want to build is a word that, when called, yields the next token as a string. If we've reached the end of input, we yield zero. Let's call this word nextt ( -- str-or-0 ) (for "next token").
That will work a bit like word accumulation in our baby Forth. First, what's a whitespace? Let's start a new tok.fs unit with this utility:

: isWS? ( c -- f ) SPC <= ;

SPC is a system constant for $20. In signatures, f means "flag" (0 or 1).
Then, all we need is a buffer and a loop:
$40 const MAXTOKSZ \ maximum size for tokens
create buf MAXTOKSZ 1+ allot
: tonws ( -- c ) begin ( ) in< dup isWS? while drop repeat ;
: nextt ( -- str-or-0 )
  buf 1+ tonws begin ( a c )
    dup isWS? not over -1 <> and while ( a c ) \ while not WS or EOF
    swap c!+ in< repeat ( a c )
  drop ( a ) buf - 1- ( length )
  dup buf c! ( length ) if buf else 0 then ;
: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;
You can now print all tokens from foo.c with:
f<< tok.fs
' tokstype exec<< foo.c
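At this point we only split on whitespace, so if foo.c contains exactly the three lines shown at the top of this article, the output should look like this:

int
foo(int
a,
int
b)
{
return
a
+
b;
}

Notice "foo(int", "a," and "b;": whitespace alone isn't enough, which is exactly what we'll fix further down.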
As with our baby Forth word accumulator, we have two loops in this code. The first one, tonws[1], simply consumes input until a non-whitespace is encountered. Because this loop requires us to read a non-whitespace character, we need the word to yield it so it can be used later.
This loop is not a begin..until, as covered in Starting Forth, but rather a begin..while..repeat one, which isn't covered (but is frequent among Forth implementations). It works in a similar manner, but allows the loop condition to be in the middle of the body, often simplifying it. For example, the same word with begin..until would look like this:
: tonws ( -- c ) begin in< dup isWS? if drop 0 else 1 then until ;
The second loop, within nextt itself, is the main accumulating loop. We see that the loop begins with buf+1 as a starting address, the +1 being there to leave space for the string count byte, which we'll set later.
The while condition is more complex than in tonws because we explicitly have to check for EOF[2]. To be able to "and" two conditions together, we have to juggle with PS a little bit.
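Here's that condition again, one step at a time, starting from ( a c ):

dup isWS? not ( a c f1 )    \ f1: c is not a whitespace
over          ( a c f1 c )
-1 <>         ( a c f1 f2 ) \ f2: c is not EOF
and           ( a c f )     \ f: both at once, consumed by while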
Then comes the accumulation part, which is straightforward when we use c!+ ( c a -- a+1 ).
When the loop exits, c will be either a whitespace or -1, for which we have no use and drop. At that point, a points to the empty space following our last character, allowing us to easily compute the length of our accumulated string, which is the last thing we need to do.
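To make that computation concrete, suppose buf lives at $1000 (an arbitrary address for illustration) and we've just accumulated "foo":

\ 'f' went to $1001, 'o' to $1002, 'o' to $1003 (buf itself is the length byte)
\ a = $1004, the empty space following the last character
\ a buf - 1-  ->  $1004 $1000 - 1-  ->  3 ( length )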
Finally, we need to check if we've accumulated something at all. If our result length is zero, this means that we've read zero or more whitespaces followed directly by EOF, so we have nothing to return. If it's not, we write the length to the first byte of buf and return it as the result.
Our last word, tokstype, repeatedly calls nextt and prints the result until the input is exhausted, so that we can test our tokenizer.
To have a proper tokenizer, we need to not only split by whitespace, but also by symbols, which unlike whitespaces are also tokens. This presents new challenges to us:

1. We need a way to determine whether a character is a symbol.
2. When a symbol ends a token, we've already read it from the stream, and we can't simply drop it: it's the beginning of the next token.
3. A token can be a symbol itself, when it directly follows a whitespace.
For the first challenge, let me show you the word cidx ( c a u -- ?i f ):
create symbols ,"+(){};,"
: isSym? ( c -- f ) symbols 7 cidx dup if nip then ;
The word cidx tries to find the index of character c in the memory range starting at address a with a length of u. We've created such a memory range in symbols above, which allows us to easily determine whether a given character is in that range.
This cidx signature is a bit strange because of the ?i. This is a word with a dynamic stack signature. If the character is found, f=1 and i is the found index. Otherwise, f=0 and i is not present. Because all we care about in isSym? is f, we conditionally drop the i.
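For example, '+' ($2B) lives at index 0 of symbols, while 'a' ($61) isn't in it at all:

$2B symbols 7 cidx \ PS: ( 0 1 ) -- found at index 0
$61 symbols 7 cidx \ PS: ( 0 )   -- not found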
For the second challenge, we could save the last character into a variable instead of dropping it and, on the following nextt call, pick it up instead of calling in<, but that's a bit heavy. We have a much better alternative: stepback ( -- ). It "cancels" the previous in< call.
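stepback effectively gives us one character of lookahead for free. For example, a hypothetical peek word (my own, not part of Dusk) could be written as:

: peek ( -- c ) in< stepback ; \ yield the next char, but leave it in the stream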
For the third challenge, we need to make a symbol check before the loop so that, when the first character is itself a symbol, we don't enter the loop at all.
This could give us a nextt that looks like this:
: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
: nextt ( -- str-or-0 )
  buf 1+ tonws ( a c )
  dup isSym? if swap c! 1 else begin ( a c )
    dup boundary? not while ( a c )
    swap c!+ in< repeat ( a c )
  drop stepback ( a )
  buf - 1- then ( length )
  dup buf c! ( length ) if buf else 0 then ;
With this new code, tokstype will spit the right tokens.
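For our foo.c, that means:

int
foo
(
int
a
,
int
b
)
{
return
a
+
b
;
}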
Does this mean we're finished? Not yet, because our tokenizer has one big problem:

f<< tok.fs
nextt foo
nextt bar
stype stype

What? bar twice? Yup. That's because we use a static buffer. Calling nextt always overwrites the previous result.
The parser will often have to keep a handful of tokens in memory at once and juggle with them a little bit, so this won't do.
To solve this problem, we're going to use the "string pool" of lib/str. That pool is a rolling buffer. Adding stuff to the pool increases its internal pointer and, when it reaches its end, it rolls back to its beginning. It has the advantage of being simple to use because we never have to free what we allocate. The drawback is that these memory areas can only be used as temporary values. If you need to hold onto them, you need to copy them into a more permanent area of memory.
All we need to do to use the string pool is to change the final buf reference in nextt to buf str>pool, that is, a word that allocates enough space for the supplied string, copies it into the pool, and returns the address of the newly allocated area.
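Concretely, only the last line of nextt changes (this assumes lib/str has been loaded):

: nextt ( -- str-or-0 )
  buf 1+ tonws ( a c )
  dup isSym? if swap c! 1 else begin ( a c )
    dup boundary? not while ( a c )
    swap c!+ in< repeat ( a c )
  drop stepback ( a )
  buf - 1- then ( length )
  dup buf c! ( length ) if buf str>pool else 0 then ;

With this version, the nextt foo nextt bar experiment above yields two distinct strings, and stype stype prints "bar", then "foo".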
That's it! We have a functional tokenizer that will be good enough for foo.c!
[1] Did you notice that we never check for buf overflow? How would you go about implementing such a check?

[2] None of the symbols in foo.c are more than a single character. But if we had >>= in there, the tokenization logic would change significantly. Wanna try?

In the next and last chapter of this story arc, we build a parser and, with the help of the assembler and tokenizer, will be able to compile the foo function to an executable i386 word!