ℑ∈∫∫ : math → metal

appendix

This appendix is a condensed summary of the infrastructure stack for software 1.0, where humans perform computations by orchestrating electrons. The problem the computer solves is bridging the semantic chasm between logic and physics, and this communication mismatch is decomposed into many layers of abstraction. Other devices can map logic to physics in one jump: take the compass, for instance.

                --------------------- <-- Human
                | Problem           |
                --------------------- <-- API
                | Algorithm         |
                --------------------- <-- Syntax
                | Language          |
                --------------------- <-- ISA
                | µarch             |
                --------------------- <-- HDL
                | Gates             |
                --------------------- <-- CMOS
                | Circuits          |
                --------------------- <-- Physics
                | Electrons         |
                ---------------------
        

For the past fifty years, machines were designed for general-purpose computing, largely enabled by 1. transistor scaling and 2. frequency scaling. However, it's important to realize that any piece of the system, whether it be language, runtime, or microarchitecture, can implement a capability. This is where it's helpful to know what's at the bottom of the stack (link to Feynman), because then you know what's possible with gates (polynomial instructions?). Hardware is essentially an interpreter, and the difference between an interpreter and a compiler is a tradeoff; hence hardware/software codesign. The focus here is on compiler implementation (I might build a RISC simulator, we'll see). Food for thought: Proebsting's law (link). The semantic gap becomes more obvious when talking about parallel software models and their hardware implementations.

compilers

Lexical analysis:
- alphabet (input): characters
- rules (output): tokens
- language: regular
  - spec: regular expressions (RE)
  - impl: deterministic finite automata (DFA)

Syntactic analysis:
- alphabet (input): tokens
- rules (output): tree
- language: context-free
  - spec: Backus-Naur Form (BNF)
  - impl: pushdown automata (PA)

Semantic analysis:
- alphabet (input): tree
- rules (output): attributed tree
- language: context-sensitive
  - spec: ??
  - impl: ??
Compiler parsing has strong ties to computational theory and linguistics. The Chomsky hierarchy models the different levels of *expressivity* in grammars. Languages which RE/FA are too weak to recognize (for instance, unbounded balanced parentheses) motivate the jump to BNF/PA. There are a lot of interesting foundations which can be useful for language designers, but nothing practical with respect to compiler construction for pre-existing languages.
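
To make the jump concrete, here is a hypothetical sketch (not din code): a regular rule like `[0-9]+` needs only finite state, while balanced parentheses need unbounded memory, which is exactly what the pushdown automaton's stack provides.

// What a DFA can do: recognize [0-9]+ with finite state
// (two states: "seen nothing" and "seen at least one digit").
fn is_literal_int(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

// What a DFA cannot do: balanced parentheses need unbounded memory.
// A counter (a stack of identical symbols) is the minimal fix, which
// is exactly the jump from finite automata to pushdown automata.
fn is_balanced(s: &str) -> bool {
    let mut depth: u64 = 0;
    for c in s.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                if depth == 0 { return false; } // ')' with no opener
                depth -= 1;
            }
            _ => return false, // alphabet is only ( and )
        }
    }
    depth == 0 // every opener was closed
}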

1. Lexical Analysis: Lexing

Grammar:
// introductions
LITERAL_INT      ::= [0-9]+
ID               ::= [a-zA-Z][a-zA-Z0-9]*

// keywords
KEYWORD_INT      ::= int
KEYWORD_VOID     ::= void
KEYWORD_MAIN     ::= main
KEYWORD_RETURN   ::= return
KEYWORD_IF       ::= if

// eliminations
PLUS             ::= +
MINUS            ::= -
STAR             ::= *
SLASH            ::= /

// punctuation
PUNC_LEFTPAREN   ::= (
PUNC_RIGHTPAREN  ::= )
PUNC_LEFTBRACE   ::= {
PUNC_RIGHTBRACE  ::= }
PUNC_SEMICOLON   ::= ;
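
The spec above is the what; a hand-written lexer is the how. Below is a minimal sketch (assumed shapes, not din's actual lexer) of the maximal-munch loop over characters, producing the `Token { typ, lexeme }` values the parsers below consume.

#[derive(Clone, Copy, PartialEq)]
enum TokenType {
    LiteralInt, Id,
    KeywordInt, KeywordVoid, KeywordMain, KeywordReturn, KeywordIf,
    Plus, Minus, Star, Slash,
    LeftParen, RightParen, LeftBrace, RightBrace, Semicolon,
}

struct Token { typ: TokenType, lexeme: String }

fn lex(input: &str) -> Vec<Token> {
    let chars: Vec<char> = input.chars().collect();
    let (mut i, mut tokens) = (0, Vec::new());
    while i < chars.len() {
        let start = i;
        let typ = match chars[i] {
            c if c.is_ascii_whitespace() => { i += 1; continue; }
            // LITERAL_INT ::= [0-9]+ (maximal munch: take the longest run)
            '0'..='9' => {
                while i < chars.len() && chars[i].is_ascii_digit() { i += 1; }
                TokenType::LiteralInt
            }
            // ID ::= [a-zA-Z][a-zA-Z0-9]*, then reclassify keywords
            c if c.is_ascii_alphabetic() => {
                while i < chars.len() && chars[i].is_ascii_alphanumeric() { i += 1; }
                match chars[start..i].iter().collect::<String>().as_str() {
                    "int" => TokenType::KeywordInt,
                    "void" => TokenType::KeywordVoid,
                    "main" => TokenType::KeywordMain,
                    "return" => TokenType::KeywordReturn,
                    "if" => TokenType::KeywordIf,
                    _ => TokenType::Id,
                }
            }
            '+' => { i += 1; TokenType::Plus }
            '-' => { i += 1; TokenType::Minus }
            '*' => { i += 1; TokenType::Star }
            '/' => { i += 1; TokenType::Slash }
            '(' => { i += 1; TokenType::LeftParen }
            ')' => { i += 1; TokenType::RightParen }
            '{' => { i += 1; TokenType::LeftBrace }
            '}' => { i += 1; TokenType::RightBrace }
            ';' => { i += 1; TokenType::Semicolon }
            c => panic!("lexical error: unexpected character {c}"),
        };
        tokens.push(Token { typ, lexeme: chars[start..i].iter().collect() });
    }
    tokens
}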
---

2. Syntactic Analysis: Parsing

**Syntactic Grammar**
<program>        ::= <function>
<function>       ::= KEYWORD_INT KEYWORD_MAIN PUNC_LEFTPAREN KEYWORD_VOID
                     PUNC_RIGHTPAREN PUNC_LEFTBRACE <statement> PUNC_RIGHTBRACE

<statement>      ::= KEYWORD_RETURN <expr> PUNC_SEMICOLON
<expr>           ::= LITERAL_INT
                   | <expr> <binop> <expr>

<binop>          ::= PLUS
                   | MINUS
                   | STAR
                   | SLASH


The syntactic grammar specified above via BNF is a subset of the C language. It only recognizes and parses programs that return arithmetic expressions. The `<program>` production refers to `<function>`, which, in turn, refers to `<statement>`, but these can be easily rewritten as one single RE. The one rule which REs cannot specify is the `<expr>` production; it refers to itself an arbitrary number of times.

**Recursive Descent**

Recursive descent parsers parse tokens the same way you lex characters into tokens. They *descend* down the grammar's spec, and because of that, are also called top-down parsers. They are *predictive* parsers because they recognize and produce the correct rule based on a single token of lookahead, without any backtracking. Recursive descent lends itself well to a hand-written implementation, since you have a single function per non-terminal. Whether you recurse with the host language's stack frames or an explicit stack data structure, you just have to make sure to avoid the non-halting scenario that comes from incorrect base/inductive case ordering.
<expr> ::= <expr> <binop> <expr>
         | LITERAL_INT
*Problem 1: Left recursion does not halt*

Given a simple BNF grammar for a calculator (just expressions) like the one above, the first line is problematic if we directly translate it into a recursive implementation: `parse_expr()` immediately calls `parse_expr()` without consuming a single token. We also can't just reorder the alternatives so that `LITERAL_INT` is tried before `<expr> <binop> <expr>`, because then the parser would stop after the first literal. As [matklad explains](https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing), academia points out that left-recursive grammars are the Achilles heel of recursive descent parsers, which, theoretically, motivates LR parsing techniques. However, practically, you can just stick with the LL parsing pattern by using a loop.
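
To see the non-halting scenario concretely, here is the naive transliteration (a hypothetical sketch, reusing the `parse_binop` helper assumed below):

// Naive transliteration of <expr> ::= <expr> <binop> <expr>.
// The first thing parse_expr does is call itself on the same input,
// consuming nothing: infinite recursion, then stack overflow.
fn parse_expr(tokens: &[Token]) -> Result<(Expr, &[Token]), io::Error> {
    let (l, r) = parse_expr(tokens)?; // left recursion: does not halt
    let (op, r) = parse_binop(r)?;
    let (rhs, r) = parse_expr(r)?;
    Ok((Expr::Binary { op, l: Box::new(l), r: Box::new(rhs) }, r))
}

Looping instead of recursing on the left: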

fn parse_expr(tokens: &[Token]) -> Result<(Expr, &[Token]), io::Error> {
    match tokens {
        [] => panic!(),
        [f, r @ ..] => match f.typ {
            TokenType::LiteralInt => {
                // 1. create root with left initialized; if an op follows,
                //    the right child is a placeholder to be filled in later
                let mut root = if let Ok((op, _)) = parse_binop(r) {
                    Expr::Binary {
                        op,
                        l: Box::new(Expr::Num(f.lexeme.parse().unwrap())),
                        r: Box::new(Expr::Num(-1)), // placeholder
                    }
                } else {
                    Expr::Num(f.lexeme.parse().unwrap())
                };

                // 2. initialize a cursor into root, and r_tokens,
                //    both continually updated by the loop
                let mut cur_node = &mut root;
                let mut r_tokens = r;

                // 3. while ops remain in the input, fill in right children
                //    (sketch for readability: the &mut cursor dance below
                //    needs massaging to satisfy the borrow checker)
                while let Ok((_, rest)) = parse_binop(r_tokens) {
                    let (lit, rest) = parse_lit(rest, TokenType::LiteralInt)?;

                    // same as step 1: peek for another op to decide the shape
                    let next = if let Ok((op, _)) = parse_binop(rest) {
                        Expr::Binary {
                            op,
                            l: Box::new(Expr::Num(lit)),
                            r: Box::new(Expr::Num(-1)),
                        }
                    } else {
                        Expr::Num(lit)
                    };

                    // descend: replace the placeholder right child
                    if let Expr::Binary { r, .. } = cur_node {
                        **r = next;
                        cur_node = &mut **r;
                    }

                    r_tokens = rest;
                }

                // 4. return
                Ok((root, r_tokens))
            },
            _ => panic!()
        }
    }
}

The implementation now halts, and for chains of a single associative operator it even produces the correct value:
- `9 + 10 + 11`
- `9 * 10 * 11`

*Problem 2: Associativity and precedence*

While the parser now halts for all inputs, it does not produce the correct answer for all of them, because the loop builds right-leaning trees. This breaks the conventional semantics of arithmetic in two ways. First, `-` and `/` are left-associative: `9 - 10 - 11` must parse as `(9 - 10) - 11`, not `9 - (10 - 11)`. Second, precedence *across* operators is ignored: multiplications and divisions must happen before additions and subtractions, without any overriding parentheses.

Associativity
                                 -
                                / \
  9 - 10 - 11 -> |parser| ->   9   -
                                  / \
                                 10 11
Precedence
                                 *
                                / \
  9 * 10 + 11 -> |parser| ->   9   +
                                  / \
                                 10 11

                                 /
                                / \
  9 / 10 - 11 -> |parser| ->   9   -
                                  / \
                                 10 11
If we were writing an interpreter, which would lend itself naturally to a recursive solution, we would visit the tree in a depth-first fashion. This depth-first order also applies to `din`, because the code generator is implemented with the same recursive pattern. Because of this, we want the *tighter* operations to be at the bottom of the tree. There are two solutions:
1. Modify (obfuscate) the grammar by stratifying its rules into precedence levels.
2. Modify the implementation with Pratt parsing, which treats precedence and associativity as first-class citizens.

**Sol 1: Recursive Descent (Stratified)**
<expr>   ::= <term>
<term>   ::= <factor>
           | <term> (PLUS|MINUS) <factor>
<factor> ::= <atom>
           | <factor> (STAR|SLASH) <atom>
<atom>   ::= LITERAL_INT
           | PUNC_LEFTPAREN <expr> PUNC_RIGHTPAREN
The functions for `<term>` and `<factor>` will recognize and parse a left subtree that sits at a tighter precedence level than the current level of recursion: `parse_term()` recurs on `parse_factor()` first, and `parse_factor()` recurs on `parse_atom()` first, for their left subtrees. You can see how the grammar and implementation get obfuscated. In order to parse an expression with just terms, like `9 + 10`, `parse_term()` needs to call `parse_factor()` as a proxy just to reach `parse_atom()`. The implementation becomes a waterfall of parse functions.

fn parse_expr(tokens: &[Token]) -> Result<(Expr, &[Token]), io::Error> {
    parse_term(tokens)
}

fn parse_term(tokens: &[Token]) -> Result<(Expr, &[Token]), io::Error> {
    // parse the tighter precedence level first: it becomes the left subtree
    let (mut root, mut r_tokens) = parse_factor(tokens)?;

    // left-associative fold: each op wraps the current root as its left child
    while let Ok((op, r)) = parse_term_op(r_tokens) {
        let (right, r) = parse_factor(r)?;

        root = Expr::Binary {
            op,
            l: Box::new(root),
            r: Box::new(right),
        };

        r_tokens = r;
    }

    Ok((root, r_tokens))
}

fn parse_factor(tokens: &[Token]) -> Result<(Expr, &[Token]), io::Error> {
    let (mut root, mut r_tokens) = parse_atom(tokens)?;

    while let Ok((op, r)) = parse_factor_op(r_tokens) {
        let (right, r) = parse_atom(r)?;

        root = Expr::Binary {
            op,
            l: Box::new(root),
            r: Box::new(right),
        };

        r_tokens = r;
    }

    Ok((root, r_tokens))
}

fn parse_atom(tokens: &[Token]) -> Result<(Expr, &[Token]), io::Error> {
    match tokens {
        [] => Err(io::Error::new(io::ErrorKind::Other, "expected atom, found end of input")),
        [f, r @ ..] => match f.typ {
            // <atom> ::= LITERAL_INT
            TokenType::LiteralInt => Ok((Expr::Num(f.lexeme.parse().unwrap()), r)),
            // <atom> ::= PUNC_LEFTPAREN <expr> PUNC_RIGHTPAREN
            // (paren variant names assumed; see the lexer sketch above)
            TokenType::LeftParen => {
                let (expr, r) = parse_expr(r)?;
                match r {
                    [f, r @ ..] if matches!(f.typ, TokenType::RightParen) => Ok((expr, r)),
                    _ => Err(io::Error::new(io::ErrorKind::Other, "expected )")),
                }
            }
            _ => Err(io::Error::new(io::ErrorKind::Other, "expected literal or (")),
        },
    }
}

fn parse_term_op(tokens: &[Token]) -> Result<(Op, &[Token]), io::Error> {
    match tokens {
        // end of input must be an Err, not a panic: the while-let
        // loops above rely on it to stop folding
        [] => Err(io::Error::new(io::ErrorKind::Other, "end of input")),
        [f, r @ ..] => match f.typ {
            TokenType::Plus => Ok((Op::Add, r)),
            TokenType::Minus => Ok((Op::Sub, r)),
            _ => Err(io::Error::new(io::ErrorKind::Other, "expected + or -")),
        },
    }
}

fn parse_factor_op(tokens: &[Token]) -> Result<(Op, &[Token]), io::Error> {
    match tokens {
        [] => Err(io::Error::new(io::ErrorKind::Other, "end of input")),
        [f, r @ ..] => match f.typ {
            TokenType::Star => Ok((Op::Mult, r)),
            TokenType::Slash => Ok((Op::Div, r)),
            _ => Err(io::Error::new(io::ErrorKind::Other, "expected * or /")),
        },
    }
}
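
A quick sanity check against the earlier failure case, using the hypothetical `lex` sketched above:

fn main() -> Result<(), io::Error> {
    // 9 * 10 + 11 should now parse as (9 * 10) + 11
    let tokens = lex("9 * 10 + 11");
    let (expr, rest) = parse_expr(&tokens)?;
    assert!(rest.is_empty());

    match expr {
        // PLUS is the root; STAR bound tighter and sank to the bottom
        Expr::Binary { op: Op::Add, .. } => Ok(()),
        _ => panic!("precedence still broken"),
    }
}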

**Sol 2: Pratt Parsing**
- https://www.youtube.com/watch?v=2l1Si4gSb9A
- https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing#Recursive-descent-and-left-recursion
- https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing
- https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
- https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#RDP
- https://www.oilshell.org/blog/2017/03/31.html
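
For flavor, a minimal sketch of the core Pratt loop, following matklad's formulation (the binding-power table is an illustrative assumption, not din's):

// Each infix operator gets a (left, right) binding-power pair.
// left < right encodes left-associativity; higher pairs bind tighter.
fn infix_binding_power(typ: TokenType) -> Option<(u8, u8, Op)> {
    match typ {
        TokenType::Plus => Some((1, 2, Op::Add)),
        TokenType::Minus => Some((1, 2, Op::Sub)),
        TokenType::Star => Some((3, 4, Op::Mult)),
        TokenType::Slash => Some((3, 4, Op::Div)),
        _ => None,
    }
}

fn parse_expr_bp(tokens: &[Token], min_bp: u8) -> Result<(Expr, &[Token]), io::Error> {
    // the left-hand side always starts with an atom
    let (mut lhs, mut rest) = parse_atom(tokens)?;

    while let [f, r @ ..] = rest {
        let Some((l_bp, r_bp, op)) = infix_binding_power(f.typ) else { break };

        // an operator binding less tightly than min_bp belongs to our caller
        if l_bp < min_bp { break; }

        // recurse for the right-hand side with the operator's right power
        let (rhs, r) = parse_expr_bp(r, r_bp)?;
        lhs = Expr::Binary { op, l: Box::new(lhs), r: Box::new(rhs) };
        rest = r;
    }

    Ok((lhs, rest))
}

---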

3. Semantic Analysis: Type Checking

Lexical analysis passes judgement on ill-formed tokens, while syntactic analysis passes judgement on ill-formed syntax. Now we reach the semantic analysis phase, which passes judgement on ill-typed programs. Once semantic analysis is complete and successful, the program must be a legal program in the programming language. No further errors in the program should be reported by the compiler.
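
For this C subset the checking is almost vacuous (every expression is an `int`), but the shape of the pass, a recursive walk that attributes each tree node with a type, is the same as in a real checker. A hypothetical sketch:

#[derive(Clone, Copy, PartialEq, Debug)]
enum Type { Int }

// Walk the tree, synthesizing a type for each node and rejecting
// mismatches. In this subset everything is Int, so the interesting
// part is the structure of the recursion, not the rules themselves.
fn check_expr(expr: &Expr) -> Result<Type, io::Error> {
    match expr {
        Expr::Num(_) => Ok(Type::Int),
        Expr::Binary { l, r, .. } => {
            let (lt, rt) = (check_expr(l)?, check_expr(r)?);
            if lt == rt {
                Ok(Type::Int) // binops on ints produce ints
            } else {
                Err(io::Error::new(io::ErrorKind::Other, "operand type mismatch"))
            }
        }
    }
}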

chips

References