For my latest project, Norch-PHP-Client, I had to create a simple parser for a user-supplied string.

I'll try to describe the process I followed to create the parser here. It was a struggle to get to a minimal parser, but it certainly was an educational experience.

The parser is built in two pieces, a lexer and a compiler. This is something I remembered from how C compilers and friends are built.

The lexer

The lexer converts the string to tokens. It knows and validates the syntax of the 'language', and returns some stream of tokens representing keywords, strings, integers, ...

In my case, all tokens actually are strings, but they have a different meaning. The lexer derives the meaning from special characters in the neighborhood of the token. Usually the character is right in front, or right behind the token.

For example: title is just a string, but title: is a field name. So we associate the tokens T_STRING and T_FIELD_NAME with them.

You may want to have a look at the code while reading the next paragraphs, as my explanation makes a lot references to it.

To create the lexer, we loop over the string one character at a time. I prefer to do that with a while-loop instead of a for-loop, so I can choose whether to increment the loop variable or not.

Within the loop, we fetch a character of the string. Then a huge switch-statement executes the right code for the character we just read.

  • '\': Escape character. Just append the next character as-is to the current token.
  • <space>: the end of a token. Put it in the array and create a new one
  • ':': If there is nothing in front (there is a space right in front): Syntax error. If there is something in front of it that has not yet a token id assigned, it is a T_FIELD_NAME, else syntax error. Put it in the array and create a new token T_FIELD_VALUE (that's what comes after a colon)
  • '^': If there is nothing in front: Syntax error. If the thing in front has no token yet, it's T_FIELD_NAME, else it's an error. Put it in the array, and create a new token T_FIELD_WEIGHT. Read an integer into the token. Put it in the array, and if a colon follows, put the T_FIELD_NAME up again. (This simplifies the compiler a lot)
  • '@': If there is something in front: Syntax error. Else, the current token is a T_SEARCH_FIELD.
  • '"': Quoted string. It must have a space in front (or, the current token must not contain any text yet). Read an encapsulated string (up to the next quote).
  • Anything else: Just append it to the current token.

All characters consumed? Return an array with the tokens for the compiler to use.

The compiler

The compiler converts the tokens it receives to actual code.

Basically, it's just another big switch-statement in a loop. But now, we loop over the tokens.

Using a foreach-loop is not possible, because we sometimes need to consume two tokens at a time.

So: If the token is a T_STRING: Append it to the search query.

If it's a T_FIELD_NAME: Take the next token, and make sure there is one.

  • It's a T_FIELD_VALUE: call the addFilter() function with the field name and the field value.
  • It's a T_FIELD_WEIGHT: call the addWeight() function with the field name and the field weight
  • It's something else: Compile error: unexpected token

It's a T_FIELD_SEARCH: call addSearchField() with the field.

It's something else: Our lexer messed up. Compile error.


Simple parsers are quite easy to write manually, and are a lot faster than generated regex-based ones.

However, if it's the first parser you write, things can take a while figuring out. You'll have to bite the bullet, and start experimenting.

And don't forget to write tests! If you break a parser, and you don't figure it out immediately, you're gonna have a hard time fixing it.

If you don't write tests, you're gonna have a bad time.