So you want to create a programming language? Awesome!
Should you do it? Definitely not. Better yet, go ahead, but don’t take it lightly.
When I created my first programming (scripting? is there even a difference?) language I was about 17 – a lovely templating language that through a series of regexes was transformed into PHP code. Had everything from variables to functions and loops. Wonderful.
My next foray into language creation was about two years ago. Older and wiser, I knew I wanted to create “a lisp without parentheses”. Cool huh?
It failed as soon as I realized I didn’t know how to parse “if this then if this then that else that”.
Remember, no parentheses.
Building a real compiler
This semester I jumped at the chance to take a compilers class – we built a compiler for a stripped down version of Pascal. Practically from scratch.
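Turns out that “if this then if this then that else that” is the classic dangling else problem: the obvious grammar is ambiguous, because the else could belong to either if. You need an “elif” construct, parentheses or some other explicit terminator, or a rule that ties each else to the nearest if.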
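To make the problem concrete, here is a minimal sketch of the “nearest if wins” rule as a hand-written recursive descent parser. The toy grammar and the name parse_stmt are made up for illustration; this is not the grammar or the tooling from the class.

```python
# Hypothetical toy grammar (not the Pascal subset from the class):
#   stmt := "if" WORD "then" stmt ["else" stmt] | WORD

def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "then":
            raise SyntaxError("expected 'then' at token %d" % (pos + 2))
        then_branch, pos = parse_stmt(tokens, pos + 3)
        else_branch = None
        # Greedy rule: an "else" seen here belongs to the innermost open "if".
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    # Anything else is treated as a one-word statement.
    return tokens[pos], pos + 1

tree, end = parse_stmt("if a then if b then x else y".split())
print(tree)
```

The else ends up attached to the inner if, and the outer if gets no else branch at all. Whether that is what the programmer meant is exactly the ambiguity.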
Writing a compiler is fun! And by fun I mean it makes you feel like driving a metal rod through your brain. It’s fun in that rewarding “Holy crap, did I just survive that!? I survived _that_? Damn.” kind of way.
The complexity is immense. The difficulty of discovering there’s a problem at all … even immenser.
A compiler works in several stages:
- Lexical analysis – strips out comments and whitespace and turns the source text into a list of lexemes (tokens); you use JFlex or something
- Syntactical analysis – checks that the syntax is correct and builds the Abstract Syntax Tree (using a context-free grammar with a parser generator like java_cup)
- Semantic analysis – takes care of the semantics of the language (you only call things that are functions, you supply the correct parameters etc. – name and type checking)
- Frames – essentially memory management. Give functions some breathing space, pointers to their memory and so on.
- Intermediate code generation – this stage turns the AST into a tree of assembler-like instructions
- Code linearization – next step is to change that tree into a linear set of instructions, make sure registers are used well and so on. At this point you can run an interpreter.
- There are a few more stages before reaching machine code; luckily we stopped here.
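A heavily simplified sketch of how such stages chain together, for arithmetic expressions only. Everything here is an assumption made for illustration: the function names lex, parse, linearize and run are invented, the parser is hand-written instead of generated by JFlex and java_cup, and semantic analysis and frames are skipped because there are no names or functions to check.

```python
import re

def lex(source):
    """Lexical analysis: drop whitespace and { comments }, return tokens."""
    source = re.sub(r"\{[^}]*\}", " ", source)    # Pascal-style comments
    # Note: characters that match nothing are silently skipped in this sketch.
    return re.findall(r"\d+|[-+*()]", source)

def parse(tokens):
    """Syntax analysis: build an AST of nested tuples ("*" binds tighter)."""
    pos = 0

    def factor():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()
            pos += 1                              # skip the closing ")"
            return node
        return ("num", int(tok))

    def term():
        nonlocal pos
        node = factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = ("*", node, factor())
        return node

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (op, node, term())
        return node

    return expr()

def linearize(ast):
    """Code linearization: flatten the tree into stack-machine instructions."""
    if ast[0] == "num":
        return [("push", ast[1])]
    op, left, right = ast
    return linearize(left) + linearize(right) + [(op,)]

def run(code):
    """Interpreter for the linear instruction list."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            right, left = stack.pop(), stack.pop()
            if instr[0] == "+":
                stack.append(left + right)
            elif instr[0] == "-":
                stack.append(left - right)
            else:
                stack.append(left * right)
    return stack[0]

code = linearize(parse(lex("2 + 3 * (4 - 1) { a comment }")))
for instr in code:
    print(instr)
print(run(code))
```

Printing the instruction list before running it is the cheap version of inspecting the output of each stage, which is the only way to tell which stage a wrong answer came from.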
The really fun part is that, given a random issue, any of those stages can be the problem. Even though separately they all look like they’re working perfectly.
The debugging … oh god the debugging. This relatively simple compiler is beyond a doubt the toughest little bastard I have ever had the pleasure of fixing.
For starters, you don’t even know if there is or isn’t a bug. Your only chance at debugging (and finding the bugs in the first place) is to write code in the language you’re compiling and hope it breaks something.
- Compile the compiler, see that Java devours it and all is well
- Run the compiler, there are no runtime errors
- Write some code in the language you’re compiling
- Compile+run with your compiler/interpreter
One of two things will happen. The code will run smoothly and output the correct result.
Or there will be a syntax error. Or a semantic error. Or the result will be simply wrong.
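The loop above can be automated as a small regression harness: a list of sample programs with the output you expect from each. This is only a sketch. The check function is invented for illustration, and toy_compile_and_run is a stand-in with a deliberately planted bug, not a real compiler.

```python
def check(compile_and_run, cases):
    """Run every (source, expected) pair and collect the failures."""
    failures = []
    for source, expected in cases:
        try:
            actual = compile_and_run(source)
        except Exception as err:           # syntax or semantic error
            failures.append((source, expected, "error: %s" % err))
            continue
        if actual != expected:             # ran fine, wrong result
            failures.append((source, expected, actual))
    return failures

# Stand-in "compiler" with a deliberate bug: subtraction swaps its operands.
def toy_compile_and_run(source):
    left, op, right = source.split()
    if op == "+":
        return int(left) + int(right)
    if op == "-":
        return int(right) - int(left)      # bug planted on purpose
    raise SyntaxError("unknown operator " + op)

cases = [("1 + 2", 3), ("5 - 5", 0), ("5 - 3", 2), ("2 * 2", 4)]
failures = check(toy_compile_and_run, cases)
for failure in failures:
    print(failure)
```

Note that “5 - 5” passes even though subtraction is broken. A test only finds the bugs it happens to step on.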
You now have to carefully look through the example code and decide that it is in fact correct, written properly and should work. Remember, you cannot test it anywhere else, because you are creating the compiler. In a class setting, your mates can help with their compilers (which are also buggy); if you’re creating a new language, you’re on your own.
Once you’ve decided the example code is correct it’s time to look through your compiler.
In the case of syntax/semantic errors the task is simple – look at the output of the appropriate stage and decide that after several months of everything working, hey your grammar is actually wrong. Or hey, your type checker is actually doing that one thing wrong. Or maybe your name checker is being silly … whatever.
The really nasty buggers are those logical errors – the code didn’t come up with the right result. There is no real symptom to look at. Your only hope of success is carefully inspecting the intermediate code and seeing if anything looks wrong.
Even once you’ve found the problem, there’s still the issue of what’s actually causing it.
For instance: I was chasing a bug for days. Arrays were overwriting their neighbours in a record … turns out my sample code wasn’t properly reserving memory and shouldn’t be working anyway. That was fun.
And keep in mind that finding the bugs in the first place is really hard. The professor gave my very buggy compiler a 100%. Simply because every program he ran worked.
That’s why it can take decades to discover a bug in a compiler used by millions of people. And how many buggy compilers are out there when people just assume their code is the problem and change it?
Seriously, the people out there who make compilers and languages used by millions of people are superheroes. I can’t imagine doing that and keeping even a semblance of my fragile sanity.