Tumble Forth: Feeding the beast

Now that we have our assembler, we can begin tackling C directly. You'll remember that this is the code we want to compile:

int foo(int a, int b) {
    return a + b;
}

The first step in the compilation process is to separate this stream into a series of tokens, which will then be fed to the parser.

Tokenization in C is slightly more complex than in Forth: not only whitespace, but also symbols act as token boundaries. For example, "return a+b;" yields the tokens "return", "a", "+", "b" and ";".

You'll see in this article that the actual tokenizing logic is almost trivial, hardly enough to be worth a whole article. However, to build this API, we'll need to leverage mechanisms in Dusk that are not trivial, and this article will walk you through them.

Strings

Our tokenizer will yield tokens as strings (see doc/usage/lit.txt). Strings are a series of contiguous bytes in memory preceded by a "length" field, which is one byte in size. When we refer to a string, we refer to the address of the "length" field.

The S" word allows you to create a string literal. It's limited by Forth tokenization rules, which means that it must always be followed by a whitespace, which will not be included in the string. For example, S" hello" yields an address in memory that will look like this:

 A+0  A+1  A+2  A+3  A+4  A+5
+-----------------------------+
| 05 | 68 | 65 | 6C | 6C | 6F |
+-----------------------------+

Dusk doesn't bother implementing strlen, as it's exactly the same as c@. It's also common to process strings with c@+ ( addr -- addr+1 c ). Example:

: foo ( str -- ) c@+ for ( a ) c@+ emit next ( a ) drop ;
S" hello" foo \ prints "hello", exactly like the "stype" word

Stack comments

The snippet above also includes a frequent pattern in Dusk code. First, words are documented with a signature comment. This one indicates that it expects a string to live on PS top and that the word will consume it.

We also see "stack status" comments, which help to follow along in the code. For example, after c@+ is called, PS becomes ( a length ), with a being the address of the first character of the string. Because the for consumes length, the loop body ends up with a stack status of ( a ).

To avoid making words too heavy, we don't bother commenting the stack status at each step. However, Dusk has the habit of documenting stack status at particular points:

  1. Right after the beginning of a loop.
  2. Sometimes at the end of a loop. Especially in cases where there is more than one exit point for the loop, for example in begin..while..repeat, in which exit points can have different stack status!
  3. At tricky points in big complicated words (of which there should be as few as possible).
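
To make these conventions concrete, here's the foo word from earlier with a stack status after every word. Real Dusk code wouldn't be this verbose; this is purely illustrative:

: foo ( str -- )
  c@+ ( a length )  \ read the count byte; a points to the first char
  for ( a )         \ "for" consumes length
    c@+ ( a c )     \ read one char, advance a
    emit ( a )      \ print it
  next ( a )
  drop ( ) ;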

Crossing the streams

Our tokenizer needs to feed itself from some kind of stream. You could feed it directly from the main input stream using the in< ( -- c ) word. Here's an example usage:

: foo in< emit ;
foo A \ the A character is not interpreted by Forth, but by "foo"

This could work, but you'll spend too much time copy/pasting C code around. A more convenient way would be to use Dusk's File API (doc/sys/file.txt), which itself rests upon the IO API (doc/sys/io.txt).

To try it, write your C file in Dusk's fs/ directory under foo.c and then do this:

f" foo.c"
console :self file :spit \ Spits the contents of the file to the console

console is a global structbind[1] (doc/usage/ns.txt) to the IO struct and file is a global structbind to the File struct.

console is not of particular interest to us at the moment, so we'll set it aside, but file is. file represents Dusk's "current work file". There's only one at a time, and we can make it open a particular path with the f" word, as shown above. Whenever we open a new file, the old one is closed[2].

file obeys Dusk's I/O semantics, which allow us to seamlessly carry streams of bytes left and right. Any method from the IO struct can be called on file. Our tokenizer will consume the stream character by character, so the I/O method we'll want to use is :getc ( hdl -- c ). hdl, a reference to the I/O struct, is provided automatically by structbinds, so for us the effective word signature is ( -- c ).

The :getc word yields the character at the current stream position and then advances this position by 1. If the end of the stream has been reached, -1 is returned. Therefore, we can reimplement (badly) the :spit method above thus:

: myspit ( -- )
  begin ( ) file :getc dup -1 <> while ( c ) emit repeat ( c ) drop ;
f" foo.c"
myspit

You'll notice that if you call myspit twice, the second time doesn't print anything. That's because the file's position is at the end of it. To rewind it, you can do 0 file :seek.
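
For example:

f" foo.c"
myspit        \ prints the file
myspit        \ prints nothing: position is at the end
0 file :seek  \ rewind to the beginning
myspit        \ prints the file again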

Tokenizer API

What we'll want to build is a word that, when called, yields the next token, as a string, from file. If we've reached the end of the file, we yield zero. Let's call this word nextt ( -- str-or-0 ) (for "next token").

Let's start roughly and yield our stream line by line, leveraging IO :readline:

: nextt ( -- str-or-0 ) file :readline ;
f" foo.c"
nextt stype \ prints "int foo(int a, int b) {"
nextt stype \ prints "    return a + b;"
nextt stype \ prints "}"
nextt .     \ prints 0

Now, all we have to do is refine the process until our goal is reached.

Splitting on whitespace

:readline is a nice little trick to start out, but we'll need to do like we did in our baby Forth and accumulate characters into a buffer. We could imagine some kind of algorithm that yields substrings from the line yielded by :readline, but we'd need to hold onto that line in between nextt calls and find a way to insert token lengths in there. That's much more complicated than simply accumulating characters, so let's do that. First, what's a whitespace? Let's start a new tok.fs unit with this utility:

: isWS? ( c -- f ) SPC <= ;

SPC is a system constant for $20. In signatures, f means "flag" (0 or 1).
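
A quick sanity check at the prompt, with character codes written as hex literals:

$20 isWS? . \ prints 1: $20 (space) is a whitespace
$0A isWS? . \ prints 1: so is a newline
$69 isWS? . \ prints 0: "i" isn't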

Then, all we need is a buffer and a loop:

$40 const MAXTOKSZ \ maximum size for tokens
create buf MAXTOKSZ 1+ allot

: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
: nextt ( -- str-or-0 )
  buf 1+ tonws begin ( a c )
    dup isWS? not over -1 <> and while ( a c ) \ while not WS or EOF
    swap c!+ file :getc repeat ( a c )
  drop ( a ) buf - 1- ( length )
  dup buf c! ( length ) if buf else 0 then ;
: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;

You can now print all tokens from foo.c with f<< tok.fs f" foo.c" tokstype.

As with our baby Forth word accumulator, we have two loops in this code.

The first one, tonws[3], simply consumes the file until a non-whitespace character is encountered. Because this loop requires us to read a non-whitespace character, we need the word to yield it so it can be used later.

This loop is not a begin..until, as covered in Starting Forth, but rather a begin..while..repeat one, which isn't covered there (but is common among Forth implementations). It works in a similar manner, but allows the loop condition to sit in the middle of the body, which often simplifies it. For example, the same word with begin..until would look like this:

: tonws ( -- c ) begin file :getc dup isWS? if drop 0 else 1 then until ;

The second loop, within nextt itself, is the main accumulating loop. We see that the loop begins with buf+1 as the starting address, the +1 being there to leave space for the string count byte, which we'll set later.

The while condition is more complex than in tonws because we explicitly have to check for EOF[4]. To "and" the two conditions together, we have to juggle with PS a little bit.
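
Here's that condition again, one step at a time:

dup isWS? not ( a c f1 )    \ f1: c is not a whitespace
over -1 <>    ( a c f1 f2 ) \ f2: c is not EOF
and           ( a c f )     \ f: keep accumulating?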

Then comes the accumulation part, which is straightforward when we use c!+ ( c a -- a+1 ).

When the loop exits, c is either a whitespace or -1, for which we have no use, so we drop it. At that point, a points to the empty space following our last character, allowing us to easily compute the length of our accumulated string, which is the last thing we need to do.
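
A concrete example, with a made-up address: suppose buf lives at $1000 and we've just accumulated "int".

\ chars were written at $1001, $1002 and $1003, so a = $1004
\ buf - 1-  computes  $1004 $1000 - 1-  =  3, our length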

Finally, we need to check if we've accumulated something at all. If our result length is zero, this means that we've read zero or more whitespaces followed directly by EOF, so we have nothing to return. If it's not, we write the length to the first byte of buf and return it as the result.

Our last word, tokstype, repeatedly calls nextt and prints the result until the tokens are exhausted, so that we can test our tokenizer.
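
At this point, we only split on whitespace, so the output for foo.c should look like this:

int
foo(int
a,
int
b)
{
return
a
+
b;
}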

Tokenizing symbols

To have a proper tokenizer, we need to split not only on whitespace, but also on symbols, which, unlike whitespace, are themselves tokens. This presents new challenges:

  1. Stop accumulating a token when we encounter a symbol character[5].
  2. Don't drop that character like we do with the whitespace.
  3. When the first character we meet is a symbol, stop accumulating and return it as a token.

For the first challenge, let me show you the word cidx ( c a u -- i? f ):

create symbols ," +(){};,"
: isSym? ( c -- f ) symbols 7 cidx dup if nip then ;

The word cidx tries to find the index of character c in the memory range starting at address a with length u. We've created such a memory range in symbols above, which allows us to easily determine whether a given character is in that range.

This cidx signature is a bit strange because of the i?. cidx is a word with a dynamic stack signature: if the character is found, f=1 and i is the found index; otherwise, f=0 and i is absent. Because all we care about in isSym? is f, we conditionally drop the i.
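
Again, a quick check with hex character codes:

$2B isSym? . \ prints 1: $2B is "+"
$69 isSym? . \ prints 0: "i" is not a symbol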

For the second challenge, we could save the last character into a variable instead of dropping it and, on the following nextt call, pick it up instead of calling :getc. But since we're dealing with a file, we can also rewind the file's position by 1 character, which has the same effect. This is what we do with file pos 1- file :seek.

For the third challenge, we need to make a symbol check before the loop to handle this case and not enter the loop when we encounter it.

This could give us a nextt that looks like this:

: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
: nextt ( -- str-or-0 )
  buf 1+ tonws ( a c )
  dup isSym? if swap c! 1 else begin ( a c )
      dup boundary? not while ( a c )
      swap c!+ file :getc repeat ( a c )
    drop file pos 1- file :seek ( a )
    buf - 1- then ( length )
  dup buf c! ( length ) if buf else 0 then ;

With this new code, tokstype will spit the right tokens.
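
Running f" foo.c" tokstype again should now print each token on its own line:

int
foo
(
int
a
,
int
b
)
{
return
a
+
b
;
}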

Scratchpads

Does this mean we're finished? Not yet, because our tokenizer has one big problem:

f<< tok.fs
f" foo.c"
nextt nextt stype stype

What? The second token twice? Yup. That's because we use a static buffer: calling nextt always overwrites the previous result.

The parser will often have to keep a handful of tokens in memory at once and juggle with them a little bit, so this won't do.

To solve this problem, we're going to use one of Dusk's dynamic memory allocators (doc/mem/alloc.txt), the scratchpad (doc/mem/scratch.txt). A scratchpad is a rolling buffer: we allocate stuff to it and when it reaches the end of its buffer, it goes back to the beginning. It has the advantage of being simple to use because we never have to free what we allocate. The drawback is that these memory areas can only be used as temporary values. If you need to hold onto them, you need to copy them into a more permanent area of memory.

You can create your own scratchpad buffer, but for small and very localized usage, there's a global scratchpad called syspad[6], which is more convenient to use. In our case, it will do fine.

All we need to do is change the final buf reference in nextt to buf syspad :s,. The :s, method allocates enough space for the supplied string, copies the string into the scratchpad, and returns the address of the newly allocated area.
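
The last line of nextt thus becomes:

  dup buf c! ( length ) if buf syspad :s, else 0 then ;

With this change, each token gets its own little area in the scratchpad, and the nextt nextt stype stype snippet above prints two distinct tokens.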

That's it! We have a functional tokenizer that will be good enough for foo.c!

Exercises

  1. Did you notice that we don't check for buf overflow? How would you go about implementing such a check?
  2. We're lucky because none of the symbols we use in foo.c are more than a single character. But if we had >>= in there, the tokenization logic would change significantly. Wanna try?
  3. Try handling comments.

Up next

In the next and last chapter of this story arc, we'll build a parser and, with the help of the assembler and tokenizer, compile the foo function to an executable i386 word!


  1. I won't be explaining the struct system in detail in this article. You're welcome to read the docs about it, of course, but it's not critical. Suffice it to say that when we call a structbind word, we "activate" its struct namespace, so the next word is a kind of "method" call.

  2. This doesn't mean that Dusk can only open one file at a time. It can open more than one, just not through the file global structbind.

  3. Short for "to non-whitespace".

  4. EOF (-1) is not a whitespace because all numbers in Dusk are unsigned, so -1 SPC <= is false.

  5. I know, some symbols in C contain more than one character, but not in foo.c. Because our tokenizer is minimal, we can afford to stay simple. 

  6. a structbind to the Scratchpad struct.