After months of struggling to generate LLVM IR in Go, the only thing I know for sure is how much I don’t understand.
This question puzzled me because I didn’t feel like I was making much progress at all. I had been stuck on LLVM for what felt like forever. And unlike my forays into BNF grammars and type systems, even once I finished the conversion of my AST to valid LLVM IR, I still felt like I understood very little about it.
The Dark Arts of Designing Programming Languages
If acronyms like AST and IR are foreign to you, let me bring you quickly up to speed: a programming language implementation is really just a pipeline of applications that convert text files into something machine executable. The first application is the lexer, which matches substrings of the source and maps each to a predefined token (basically a label) so that later stages can easily figure out what to do with them. Then comes the parser, which takes those substrings and tokens and arranges them into a tree structure (called an Abstract Syntax Tree, or AST) that defines the flow of execution. After that, any number of passes can be run to modify, adjust, and add data to that tree. The most common and relevant of these is the type checker, which tries to limit the number of edge cases the program can reach at runtime by enforcing rules about what you can do with its variables.
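To make the lexer step concrete, here is a toy sketch in Go. The `Token` type and the token names (`INT`, `PLUS`, `STAR`) are hypothetical, invented for illustration rather than taken from any real tool; the point is only that the lexer turns raw text into (label, substring) pairs.

```go
package main

import (
	"fmt"
	"unicode"
)

// Token pairs a predefined label with the substring it matched.
type Token struct {
	Kind string // "INT", "PLUS", "STAR" (hypothetical names)
	Text string
}

// lex walks the source, skipping whitespace and mapping each
// recognized substring to a token, so later stages never have to
// touch the raw text again.
func lex(src string) []Token {
	var toks []Token
	for i := 0; i < len(src); {
		c := rune(src[i])
		switch {
		case unicode.IsSpace(c):
			i++
		case c == '+':
			toks = append(toks, Token{"PLUS", "+"})
			i++
		case c == '*':
			toks = append(toks, Token{"STAR", "*"})
			i++
		case unicode.IsDigit(c):
			// Greedily consume a run of digits as one INT token.
			j := i
			for j < len(src) && unicode.IsDigit(rune(src[j])) {
				j++
			}
			toks = append(toks, Token{"INT", src[i:j]})
			i = j
		default:
			panic(fmt.Sprintf("unexpected character %q", c))
		}
	}
	return toks
}

func main() {
	for _, t := range lex("1 + 23 * 4") {
		fmt.Printf("%s(%s) ", t.Kind, t.Text)
	}
	fmt.Println()
}
```

Running it on `1 + 23 * 4` prints `INT(1) PLUS(+) INT(23) STAR(*) INT(4)` — exactly the flat token stream a parser would then arrange into an AST.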
Some of these passes will fundamentally transform the AST into an intermediate representation (IR), against which further optimizations can be run. Eventually the last application transforms the IR into machine code — or another programming language, in the case of a transpiler — that can be executed.
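The lowering step can be sketched in a few lines of Go. This is a deliberately minimal, hypothetical example — a two-variant AST node and a `gen` function that emits LLVM-flavored textual instructions for integer addition — not how a real backend (or LLVM itself) is structured, but it shows the shape of the AST-to-IR transformation.

```go
package main

import "fmt"

// Node is a toy AST: either an integer literal ("int") or a
// binary addition ("add"). Both the type and the ops are
// illustrative inventions.
type Node struct {
	Op          string
	Val         int
	Left, Right *Node
}

// gen lowers a Node to LLVM-style textual IR, appending one
// instruction per addition and returning the operand (a literal
// or an SSA register like %1) that holds the node's value.
func gen(n *Node, out *[]string, tmp *int) string {
	switch n.Op {
	case "int":
		return fmt.Sprintf("%d", n.Val)
	case "add":
		l := gen(n.Left, out, tmp)
		r := gen(n.Right, out, tmp)
		*tmp++
		reg := fmt.Sprintf("%%%d", *tmp)
		*out = append(*out, fmt.Sprintf("%s = add i32 %s, %s", reg, l, r))
		return reg
	}
	panic("unknown op " + n.Op)
}

func main() {
	// The AST for (1 + 2) + 3.
	ast := &Node{Op: "add",
		Left: &Node{Op: "add",
			Left:  &Node{Op: "int", Val: 1},
			Right: &Node{Op: "int", Val: 2}},
		Right: &Node{Op: "int", Val: 3},
	}
	var ir []string
	tmp := 0
	gen(ast, &ir, &tmp)
	for _, line := range ir {
		fmt.Println(line)
	}
}
```

For `(1 + 2) + 3` this emits `%1 = add i32 1, 2` followed by `%2 = add i32 %1, 3` — a flat, single-assignment instruction list that an optimizer can analyze far more easily than a tree.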
Since people have been designing programming languages for close to a century, there’s a lot of collective wisdom on how to do each one of these steps well. Each of the major steps has tools available that will do 90% of the work for you. On the lexer/parser side there are ANTLR4, bison, yacc, and flex. On the code generation and optimization side there’s LLVM.