Tumble Forth: Feeding the beast

Now that we have our assembler, we can begin tackling C directly. You'll remember that this is the code we want to compile:

int foo(int a, int b) {
    return a + b;
}

The first step in the compilation process is to separate this stream into a series of tokens, which will then be fed to the parser.

Tokenization in C is slightly more complex than in Forth: not only whitespace, but also symbols act as token boundaries. For example, "return a+b;" yields the tokens "return", "a", "+", "b" and ";".

You'll see in this article that the actual tokenizing logic is almost trivial, hardly enough to be worth a whole article. However, to build this API, we'll need to leverage mechanisms in Dusk that are not trivial, and this article will walk you through them.

Strings

Our tokenizer will yield tokens as strings (see doc/usage/lit.txt). Strings are a series of contiguous bytes in memory preceded by a "length" field, which is one byte in size. When we refer to a string, we refer to the address of the "length" field.

The S" word allows you to create a string literal. It's limited by Forth tokenization rules, which means that it must always be followed by a whitespace, which will not be included in the string. For example, S" hello" yields an address in memory that will look like this:

 A+0  A+1  A+2  A+3  A+4  A+5
+-----------------------------+
| 05 | 68 | 65 | 6C | 6C | 6F |
+-----------------------------+

Dusk doesn't bother implementing strlen, as it's exactly the same as c@. It's also common to process strings with c@+ ( addr -- addr+1 c ). Example:

: foo ( str -- ) c@+ for ( a ) c@+ emit next ( a ) drop ;
S" hello" foo \ prints "hello", exactly like the "stype" word

Stack comments

The snippet above also includes a frequent pattern in Dusk code. First, words are documented with a signature comment. This one indicates that it expects a string to live on PS top and that the word will consume it.

We also see "stack status" comments, which help to follow along in the code. For example, after c@+ is called, PS becomes ( a length ), with a being the address of the first character of the string. Because the for consumes length, the loop body ends up with a stack status of ( a ).

To avoid making words too heavy, we don't bother commenting the stack status at each step. However, Dusk has the habit of documenting stack status at particular points:

  1. Right after the beginning of a loop.
  2. Sometimes at the end of a loop. Especially in cases where there is more than one exit point for the loop, for example in begin..while..repeat, in which exit points can have different stack status!
  3. At tricky points in big complicated words (of which there should be as few as possible).
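
To make these conventions concrete, here's the foo word from earlier with a stack status after every word. Real Dusk code wouldn't be this verbose; this is purely illustrative:

: foo ( str -- )
  c@+ ( a length )  \ read the count byte; a points to the first char
  for ( a )         \ "for" consumes length
    c@+ ( a c )     \ read one char, advance a
    emit ( a )      \ print it
  next ( a )
  drop ( ) ;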

Crossing the streams

Our tokenizer needs to feed itself from some kind of stream. You could feed it directly from the main input stream using the in< ( -- c ) word. Here's an example usage:

: foo in< emit ;
foo A \ the A character is not interpreted by Forth, but by "foo"

This could work, but you'll spend too much time copy/pasting C code around. A more convenient way would be to use Dusk's File API (doc/sys/file.txt), which itself rests upon the IO API (doc/sys/io.txt).

To try it, write your C file in Dusk's fs/ directory under foo.c and then do this:

f" foo.c"
console :self file :spit \ Spits the contents of the file to the console

console is a global structbind[1] (doc/usage/ns.txt) to the IO struct and file is a global structbind to the File struct.

console is not of particular interest to us at the moment, so we'll set it aside, but file is. file represents Dusk's "current work file". There's only one at a time, and we can make it open a particular path with the f" word, as shown above. Whenever we open a new file, the old one is closed[2].

file obeys Dusk's I/O semantics, which allow us to seamlessly carry streams of bytes left and right. Any method from the IO struct can be called on file. Our tokenizer will consume the stream character by character, so the I/O method we'll want to use is :getc ( hdl -- c ). hdl, a reference to the I/O struct, is provided automatically by structbinds, so for us the effective word signature is ( -- c ).

The :getc word yields the character at the current stream position and then advances this position by 1. If the end of the stream has been reached, -1 is returned. Therefore, we can reimplement (badly) the :spit method above thus:

: myspit ( -- )
  begin ( ) file :getc dup -1 <> while ( c ) emit repeat ( c ) drop ;
f" foo.c"
myspit

You'll notice that if you call myspit twice, the second time doesn't print anything. That's because the file's position is at the end of it. To rewind it, you can do 0 file :seek.
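
For example:

f" foo.c"
myspit        \ prints the file
myspit        \ prints nothing: position is at the end
0 file :seek  \ rewind to the beginning
myspit        \ prints the file again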

Tokenizer API

What we'll want to build is a word that, when called, yields the next token, as a string, from file. If we've reached the end of the file, we yield zero. Let's call this word nextt ( -- str-or-0 ) (for "next token").

Let's start roughly and yield our stream line by line, leveraging IO :readline:

: nextt ( -- str-or-0 ) file :readline ;
f" foo.c"
nextt stype \ prints "int foo(int a, int b) {"
nextt stype \ prints "    return a + b;"
nextt stype \ prints "}"
nextt .     \ prints 0

Now, all we have to do is refine the process until our goal is reached.

Splitting on whitespace

:readline is a nice little trick to start out, but we'll need to do like we did in our baby Forth and accumulate characters into a buffer. We could imagine some kind of algorithm that yields substrings from the line yielded by :readline, but we'd need to hold onto that line in between nextt calls and find a way to insert token lengths in there. That's much more complicated than simply accumulating characters, so let's do that. First, what's a whitespace? Let's start a new tok.fs unit with this utility:

: isWS? ( c -- f ) SPC <= ;

SPC is a system constant for $20. In signatures, f means "flag" (0 or 1).
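
A quick sanity check at the prompt, with character codes written as hex literals:

$20 isWS? . \ prints 1: $20 (space) is a whitespace
$0A isWS? . \ prints 1: so is a newline
$69 isWS? . \ prints 0: "i" isn't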

Then, all we need is a buffer and a loop:

$40 const MAXTOKSZ \ maximum size for tokens
create buf MAXTOKSZ 1+ allot

: tonws ( -- c ) begin ( ) file :getc dup isWS? while drop repeat ;
: nextt ( -- str-or-0 )
  buf 1+ tonws begin ( a c )
    dup isWS? not over -1 <> and while ( a c ) \ while not WS or EOF
    swap c!+ file :getc repeat ( a c )
  drop ( a ) buf - 1- ( length )
  dup buf c! ( length ) if buf else 0 then ;
: tokstype ( -- ) begin nextt ?dup while stype nl> repeat ;

You can now print all tokens from foo.c with f<< tok.fs f" foo.c" tokstype.

As with our baby Forth word accumulator, we have two loops in this code.

The first one, tonws[3], simply consumes the file until a non-whitespace character is encountered. Because this loop requires us to read a non-whitespace character, we need the word to yield it so it can be used later.

This loop is not a begin..until, as covered in Starting Forth, but rather a begin..while..repeat one, which isn't covered there (but is common among Forth implementations). It works in a similar manner, but allows the loop condition to sit in the middle of the body, which often simplifies it. For example, the same word with begin..until would look like this:

: tonws ( -- c ) begin file :getc dup isWS? if drop 0 else 1 then until ;

The second loop, within nextt itself, is the main accumulating loop. We see that the loop begins with buf+1 as the starting address, the +1 being there to leave space for the string count byte, which we'll set later.

The while condition is more complex than in tonws because we explicitly have to check for EOF[4]. To "and" the two conditions together, we have to juggle with PS a little bit.
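
Here's that condition again, one step at a time:

dup isWS? not ( a c f1 )    \ f1: c is not a whitespace
over -1 <>    ( a c f1 f2 ) \ f2: c is not EOF
and           ( a c f )     \ f: keep accumulating?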

Then comes the accumulation part, which is straightforward when we use c!+ ( c a -- a+1 ).

When the loop exits, c is either a whitespace or -1, for which we have no use, so we drop it. At that point, a points to the empty space following our last character, allowing us to easily compute the length of our accumulated string, which is the last thing we need to do.
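
A concrete example, with a made-up address: suppose buf lives at $1000 and we've just accumulated "int".

\ chars were written at $1001, $1002 and $1003, so a = $1004
\ buf - 1-  computes  $1004 $1000 - 1-  =  3, our length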

Finally, we need to check if we've accumulated something at all. If our result length is zero, this means that we've read zero or more whitespaces followed directly by EOF, so we have nothing to return. If it's not, we write the length to the first byte of buf and return it as the result.

Our last word, tokstype, repeatedly calls nextt and prints the result until the tokens are exhausted, so that we can test our tokenizer.
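
At this point, we only split on whitespace, so the output for foo.c should look like this:

int
foo(int
a,
int
b)
{
return
a
+
b;
}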

Tokenizing symbols

To have a proper tokenizer, we need to split not only on whitespace, but also on symbols, which, unlike whitespace, are themselves tokens. This presents new challenges:

  1. Stop accumulating a token when we encounter a symbol character[5].
  2. Don't drop that character like we do with the whitespace.
  3. When the first character we meet is a symbol, stop accumulating and return it as a token.

For the first challenge, let me show you the word cidx ( c a u -- i? f ):

create symbols ," +(){};,"
: isSym? ( c -- f ) symbols 7 cidx dup if nip then ;

The word cidx tries to find the index of character c in the memory range starting at address a with length u. We've created such a memory range in symbols above, which allows us to easily determine whether a given character is in that range.

This cidx signature is a bit strange because of the i?. cidx is a word with a dynamic stack signature: if the character is found, f=1 and i is the found index; otherwise, f=0 and i is absent. Because all we care about in isSym? is f, we conditionally drop the i.
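
Again, a quick check with hex character codes:

$2B isSym? . \ prints 1: $2B is "+"
$69 isSym? . \ prints 0: "i" is not a symbol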

For the second challenge, we could save the last character into a variable instead of dropping it and, on the following nextt call, pick it up instead of calling :getc. But since we're dealing with a file, we can also rewind the file's position by 1 character, which has the same effect. This is what we do with file pos 1- file :seek.

For the third challenge, we need to make a symbol check before the loop to handle this case and not enter the loop when we encounter it.

This could give us a nextt that looks like this:

: boundary? ( c -- f ) dup isWS? over -1 = or swap isSym? or ;
: nextt ( -- str-or-0 )
  buf 1+ tonws ( a c )
  dup isSym? if swap c! 1 else begin ( a c )
      dup boundary? not while ( a c )
      swap c!+ file :getc repeat ( a c )
    drop file pos 1- file :seek ( a )
    buf - 1- then ( length )
  dup buf c! ( length ) if buf else 0 then ;

With this new code, tokstype will spit the right tokens.
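
Running f" foo.c" tokstype again should now print each token on its own line:

int
foo
(
int
a
,
int
b
)
{
return
a
+
b
;
}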

Scratchpads

Does this mean we're finished? Not yet, because our tokenizer has one big problem:

f<< tok.fs
f" foo.c"
nextt nextt stype stype

What? The second token twice? Yup. That's because we use a static buffer: calling nextt always overwrites the previous result.

The parser will often have to keep a handful of tokens in memory at once and juggle with them a little bit, so this won't do.

To solve this problem, we're going to use one of Dusk's dynamic memory allocators (doc/mem/alloc.txt), the scratchpad (doc/mem/scratch.txt). A scratchpad is a rolling buffer: we allocate stuff to it and when it reaches the end of its buffer, it goes back to the beginning. It has the advantage of being simple to use because we never have to free what we allocate. The drawback is that these memory areas can only be used as temporary values. If you need to hold onto them, you need to copy them into a more permanent area of memory.

You can create your own scratchpad buffer, but for small and very localized usage, there's a global scratchpad called syspad[6], which is more convenient to use. In our case, it will do fine.

All we need to do is change the final buf reference in nextt to buf syspad :s,. The :s, method allocates enough space for the supplied string, copies the string into the scratchpad, and returns the address of the newly allocated area.
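
The last line of nextt thus becomes:

  dup buf c! ( length ) if buf syspad :s, else 0 then ;

With this change, each token gets its own little area in the scratchpad, and the nextt nextt stype stype snippet above prints two distinct tokens.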

That's it! We have a functional tokenizer that will be good enough for foo.c!

Exercises

  1. Did you notice that we don't check for buf overflow? How would you go about implementing such a check?
  2. We're lucky because none of the symbols we use in foo.c are more than a single character. But if we had >>= in there, the tokenization logic would change significantly. Wanna try?
  3. Try handling comments.

Up next

In the next and last chapter of this story arc, we'll build a parser and, with the help of the assembler and tokenizer, compile the foo function to an executable i386 word!


  1. I won't be explaining the struct system in detail in this article. You're welcome to read the docs about it, of course, but it's not critical. Suffice it to say that when we call a structbind word, we "activate" its struct namespace, so the next word is a kind of "method" call.

  2. This doesn't mean that Dusk can only open one file at a time. It can open more than one, just not through the file global structbind.

  3. Short for "to non-whitespace".

  4. EOF (-1) is not a whitespace because all numbers in Dusk are unsigned, so -1 SPC <= is false.

  5. I know, some symbols in C contain more than one character, but not in foo.c. Because our tokenizer is minimal, we can afford to stay simple. 

  6. a structbind to the Scratchpad struct.