# Parser Subsystem The parser transforms a flat token stream from the lexer into an abstract syntax tree (AST) representing a complete C translation unit. It implements a hand-written recursive descent parser for the C language, covering C99/C11 semantics plus a broad set of GCC and Clang extensions required for compatibility with real-world system headers and codebases. --- ## Table of Contents 1. [Why Recursive Descent](#why-recursive-descent) 2. [Module Organization](#module-organization) 3. [The Parser Struct](#the-parser-struct) 4. [Expression Parsing and Precedence Climbing](#expression-parsing-and-precedence-climbing) 5. [The Typedef/Identifier Ambiguity](#the-typedefidentifier-ambiguity) 6. [Declarator Parsing and the Inside-Out Rule](#declarator-parsing-and-the-inside-out-rule) 7. [Type Specifier Parsing](#type-specifier-parsing) 8. [Declaration Parsing](#declaration-parsing) 9. [Statement Parsing](#statement-parsing) 10. [Inline Assembly (GCC Extended Syntax)](#inline-assembly-gcc-extended-syntax) 11. [AST Node Types](#ast-node-types) 12. [Packed Bitfield Attributes](#packed-bitfield-attributes) 13. [Error Recovery Strategy](#error-recovery-strategy) 14. [Known Limitations](#known-limitations) --- ## Why Recursive Descent The parser is written entirely by hand rather than generated by a tool like yacc, bison, or LALR table generators. This is a deliberate choice driven by several properties of the C grammar: - **Context sensitivity.** C's grammar is not context-free. The meaning of an identifier depends on whether it has been previously declared as a `typedef` name. A hand-written parser can consult a live typedef table during parsing, which is awkward or impossible to express in a declarative grammar file. - **Ambiguity resolution.** Many C constructs are syntactically ambiguous. A `(` token could begin a cast expression, a compound literal, a parenthesized expression, or a parenthesized declarator. The parser resolves these with speculative parsing (save position, try one interpretation, backtrack on failure) -- a technique that maps naturally to imperative Rust code. - **GCC extension coverage.** Real-world C code (particularly Linux kernel headers and glibc) uses many GCC extensions: `__attribute__`, statement expressions, computed gotos, inline assembly, `typeof`, `__auto_type`, case ranges, label addresses, and more. These extensions are easier to add incrementally to a hand-written parser than to graft onto a generated grammar. - **Error reporting.** A recursive descent parser knows exactly what construct it was attempting to parse when something goes wrong, enabling precise diagnostic messages ("expected `;` after return statement") with fix-it hints and notes pointing back to matching delimiters. --- ## Module Organization The `Parser` struct is defined once in `parse.rs`, but its methods are spread across six files via separate `impl Parser` blocks. Each file is a private module that adds domain-specific parsing methods: ``` parser/ mod.rs -- Module declarations; re-exports Parser parse.rs -- Parser struct, constructor, token helpers, entry point expressions.rs -- Expression parsing (precedence climbing) types.rs -- Type specifier collection and resolution declarations.rs -- External and local declarations, initializers, function defs declarators.rs -- C declarator syntax (pointers, arrays, function pointers) statements.rs -- All statement types + inline assembly ast.rs -- AST type definitions (pure data, no parsing logic) ``` All parsing methods are `pub(super)` so they can call each other across module boundaries within the parser crate, while remaining invisible outside it. The public API consists of `Parser::new()`, `Parser::parse()`, `Parser::set_diagnostics()`, `Parser::take_diagnostics()`, the `error_count` field, and the AST types exported from `ast.rs`. This split keeps each file focused on one aspect of the grammar. The type specifier parser (`types.rs`) handles the combinatorial complexity of C's type keywords; the declarator parser (`declarators.rs`) handles the recursive inside-out syntax; the expression parser (`expressions.rs`) handles operator precedence; and so on. Despite spanning six files, there is a single `Parser` struct with a single token stream and a single parse position -- no separate sub-parsers or intermediate representations. --- ## The Parser Struct The core state of the parser is: ```rust pub struct Parser { tokens: Vec, // Full token stream from the lexer pos: usize, // Current read position typedefs: FxHashSet, // Known typedef names shadowed_typedefs: FxHashSet, // Typedefs shadowed by local vars attrs: ParsedDeclAttrs, // Accumulated decl attributes (reset per decl) pragma_pack_stack: Vec>, // #pragma pack push/pop stack pragma_pack_align: Option, // Current #pragma pack alignment pragma_visibility_stack: Vec, // #pragma GCC visibility push/pop pragma_default_visibility: Option, error_count: usize, diagnostics: DiagnosticEngine, enum_constants: FxHashMap, // For constant-expr eval unevaluable_enum_constants: FxHashSet, // Enum consts with unknown values struct_tag_alignments: FxHashMap, // For __alignof__ lookups } ``` Key design points: - **Token array + position index.** The parser operates on a pre-lexed `Vec` rather than a streaming lexer. This enables O(1) lookahead at any depth and trivial backtracking by saving and restoring `pos`. - **Typedef tracking.** The `typedefs` set is pre-seeded with ~90 common standard library type names (`size_t`, `uint32_t`, `FILE`, `va_list`, etc.) and grows as `typedef` declarations are parsed. The `shadowed_typedefs` set tracks typedef names that have been redeclared as local variables in the current scope, which is necessary for correct disambiguation. - **Attribute accumulator.** `ParsedDeclAttrs` is a temporary buffer that collects storage-class specifiers, type qualifiers, and GCC attributes as they are encountered during type specifier parsing. It is consumed and reset at each declaration boundary. Its lifecycle is "set during `parse_type_specifier`, consumed by declaration builders in `declarations.rs`." - **Pragma state.** The parser maintains push/pop stacks for `#pragma pack` (struct field alignment) and `#pragma GCC visibility` (ELF symbol visibility). These are emitted by the preprocessor as synthetic tokens and consumed during parsing. - **Semantic caches.** `enum_constants` maps enum constant names to their integer values, enabling the parser to evaluate constant expressions in contexts like `__attribute__((aligned(1 << ENUM_CONST)))`. `unevaluable_enum_constants` tracks enum constants whose values couldn't be computed at parse time (e.g., `SIZE = sizeof(some_typedef)`); these are recognized as constants by `_Static_assert` even though their values are unknown. `struct_tag_alignments` stores computed struct/union alignments for `__alignof__` lookups on tag-only references. The main entry point is `Parser::parse()`, which loops over `parse_external_decl()` until EOF, collecting `ExternalDecl` nodes into a `TranslationUnit`. --- ## Expression Parsing and Precedence Climbing The expression parser uses **operator precedence climbing** via a table-driven design. Rather than writing a separate function for each of C's ten binary precedence levels, a single `parse_binary_expr` method is parameterized by a `PrecedenceLevel` enum: ```rust enum PrecedenceLevel { LogicalOr, // || LogicalAnd, // && BitwiseOr, // | BitwiseXor, // ^ BitwiseAnd, // & Equality, // == != Relational, // < <= > >= Shift, // << >> Additive, // + - Multiplicative, // * / % } ``` The `token_to_binop` method on `Parser` maps a `(&TokenKind, PrecedenceLevel)` pair to an optional `BinOp`. The shared loop in `parse_binary_expr` repeatedly: 1. Parses the next-tighter level via `parse_next_tighter`. 2. Checks if the current token maps to an operator at this level. 3. If so, consumes the operator, parses the right-hand operand (at the next-tighter level for left-associativity), and wraps both sides in a `BinaryOp` node. 4. Repeats until no operator matches. This eliminates what would otherwise be ten nearly identical recursive functions while preserving correct associativity and precedence. The full call hierarchy from loosest to tightest binding is: ``` parse_expr -- comma operator (,) parse_assignment_expr -- = += -= *= /= %= <<= >>= &= ^= |= parse_conditional_expr -- ? : (ternary, including GNU omitted-middle) parse_binary_expr(LogicalOr) ...10 levels... parse_binary_expr(Multiplicative) parse_cast_expr -- (type)expr, compound literals parse_unary_expr -- prefix ++/-- +/- ~ ! & * sizeof _Alignof parse_postfix_expr -- postfix ++/-- [] . -> fn() parse_primary_expr -- literals, identifiers, parens, builtins ``` Assignment is parsed as right-associative by having `parse_assignment_expr` recurse to itself for the right operand. The ternary operator similarly recurses for both the then-branch (via `parse_expr`, allowing comma expressions) and the else-branch (via `parse_conditional_expr`, disallowing commas). ### Cast Expression Ambiguity `parse_cast_expr` demonstrates speculative parsing. When it sees `(`, it: 1. Saves the current position and attribute state. 2. Advances past `(` and checks `is_type_specifier()`. 3. If a type specifier is found, parses it through abstract declarator suffixes. 4. If `)` follows, checks for `{` (compound literal) or parses the operand as a cast. 5. If any step fails, restores position and falls through to `parse_unary_expr` (treating the `(` as a parenthesized expression). This same speculative pattern is used by `parse_sizeof_expr`, where `sizeof(type)` and `sizeof expr` must be disambiguated. --- ## The Typedef/Identifier Ambiguity C's grammar is famously context-sensitive because an identifier can be either a type name (if previously declared as a `typedef`) or a variable/function name. This creates genuine ambiguity: ```c typedef int T; void f(void) { T * x; // declaration: pointer to T named x? or expression: T times x? T(y); // declaration: T y? or function call: T(y)? } ``` The parser resolves this with four mechanisms: 1. **Typedef set.** `self.typedefs` tracks all names introduced by `typedef` declarations. `is_type_specifier()` returns `true` for identifiers in this set, causing the parser to enter the declaration path rather than the expression path. 2. **Typedef shadowing.** When a local variable declaration uses a name that shadows a typedef (e.g., `int size_t = 42;`), the name is added to `shadowed_typedefs`. The `is_type_specifier()` check excludes shadowed names, so subsequent uses parse as identifiers. The shadowing set is saved/restored at compound statement boundaries to handle scope correctly. 3. **Label vs. declaration.** A typedef name followed by `:` is a label, not a declaration (e.g., `size_t: stmt;`). The `is_typedef_label()` check prevents the parser from misinterpreting labeled statements as declarations. 4. **Builtin typedefs.** The parser pre-seeds `typedefs` with approximately 90 common standard library type names (`size_t`, `int32_t`, `FILE`, `va_list`, `pthread_t`, GCC vector types, etc.). These built-in entries ensure that code using standard types parses correctly even when compiling without real system headers (e.g., cross-compiling without a sysroot). --- ## Declarator Parsing and the Inside-Out Rule C declarators are notoriously difficult to read because they follow an **inside-out rule**: you start at the declared name, then alternate between suffixes (arrays, function parameters) and prefixes (pointers), reading outward through parentheses. For example: ```c int (*fp)(int) // fp is a pointer to a function(int) returning int int (*arr)[3][6] // arr is a pointer to an array[3] of array[6] of int int *(*fps[10])(int) // fps is array[10] of pointer to function(int) returning int* int (**fpp)(int) // fpp is a pointer to a pointer to a function(int) returning int ``` The declarator parser in `declarators.rs` handles this by decomposing the parse into three parts: 1. **Outer pointers.** Leading `*` tokens (with optional cv-qualifiers) before the name or parenthesized group. 2. **Inner derived declarators.** If a `(` is found that starts a parenthesized declarator (not a parameter list), the parser recurses into `parse_declarator()` to parse the inner declarator, then expects `)`. 3. **Outer suffixes.** After the name (or closing paren), the parser collects array dimensions `[expr]` and function parameter lists `(params)`. These three parts are then combined by `combine_declarator_parts`, which applies C's inside-out reading to produce a linear `Vec`. The method handles several cases: - **Function pointers:** `int (*fp)(int)` produces `[Pointer, FunctionPointer([int])]`. - **Pointer-to-array:** `int (*p)[3]` produces `[Array(3), Pointer]`. - **Array of function pointers:** `int (*fps[10])(int)` produces `[Pointer, FunctionPointer([int]), Array(10)]`. - **Double pointers to functions:** `int (**fpp)(int)` produces `[Pointer, FunctionPointer([int]), Pointer]`. - **Nested function pointers returning function pointers:** handled via `ParenAbstractDecl::NestedFnPtr` for types like `void(*(*)(void*))(void)`. ### Disambiguating `(` in Declarators The `is_paren_declarator()` helper distinguishes a parenthesized declarator (like `(*fp)`) from a function parameter list (like `(int x, int y)`). It peeks at the token after `(`: | Token after `(` | Interpretation | |---|---| | `*`, `^`, `(`, `[` | Declarator grouping | | Non-typedef identifier | Named declarator | | `__attribute__`, `__extension__` | Declarator grouping | | Type keyword (`int`, `void`, `struct`, ...) | Parameter list | | `)`, `...` | Parameter list (empty or variadic) | | Typedef name | Parameter list (since `(size_t)` is a parameter) | ### GCC Attributes on Declarators `parse_declarator_with_attrs` also collects GCC `__attribute__` annotations that may appear before and after the declarator proper. These include `mode(QI/HI/SI/DI/TI)` for integer width overrides, `common` for tentative definitions, and `aligned(N)` for alignment. Pre-declarator and post-declarator alignment values are merged by taking the maximum, per C11 section 6.7.5. --- ## Type Specifier Parsing C allows type specifier keywords in any order: `unsigned long int`, `long unsigned int`, and `int long unsigned` are all the same type. The parser in `types.rs` handles this by collecting boolean flags into a `TypeSpecFlags` struct: ```rust struct TypeSpecFlags { has_void: bool, has_bool: bool, has_float: bool, has_double: bool, has_complex: bool, has_char: bool, has_short: bool, has_int: bool, has_unsigned: bool, has_signed: bool, has_struct: bool, has_union: bool, has_enum: bool, has_typeof: bool, long_count: u32, // tracks "long" vs "long long" typedef_name: Option, } ``` The `parse_type_specifier()` method loops, consuming tokens and setting flags, until it hits a token that is not a type specifier, qualifier, or storage class. Then `resolve_type_flags()` maps the flag combination to a concrete `TypeSpecifier` variant: | Flags | Result | |---|---| | `unsigned` + `long_count=2` | `UnsignedLongLong` | | `signed` + `char` | `Char` | | `long` + `double` | `LongDouble` | | `complex` + `float` | `ComplexFloat` | | `complex` + `double` | `ComplexDouble` | | just `unsigned` | `UnsignedInt` | | just `long_count=1` | `Long` | | none of the above + `has_int` | `Int` | Storage-class specifiers (`static`, `extern`, `typedef`, `inline`, `_Thread_local`) and type qualifiers (`const`, `volatile`, `restrict`) are accumulated into the `ParsedDeclAttrs` buffer during the same loop, rather than being part of the returned `TypeSpecifier`. The method also handles: - **`struct`/`union`/`enum`** definitions and forward declarations, including packed structs and `#pragma pack` alignment propagation. - **`typeof(expr)`** and **`typeof(type)`** -- GCC extensions for computed types. - **`__auto_type`** -- GCC type inference from initializer. - **`_Atomic(type)`** -- C11 atomic type specifier (parsed but atomic qualifier not tracked). - **`_Alignas(N)`** and **`_Alignas(type)`** -- C11 alignment specifiers. - **`__attribute__((mode(...)))`** -- GCC integer width override that transforms the base type to a specific bit-width (`QI`=8, `HI`=16, `SI`=32, `DI`=64, `TI`=128) while preserving signedness. - **`__int128`** and **`__uint128_t`** -- 128-bit integer types (rejected on 32-bit targets with a diagnostic). --- ## Declaration Parsing Declarations are parsed in `declarations.rs`. The top-level entry point `parse_external_decl()` handles: 1. **Top-level `asm` directives** -- `asm("...")` outside any function, emitted verbatim as `ExternalDecl::TopLevelAsm`. 2. **`_Static_assert`** -- the assertion expression is evaluated at parse time; if it evaluates to zero, a compile error is emitted with the assertion message. No meaningful AST node is produced (at file scope, an empty `Declaration` sentinel is returned to satisfy the `ExternalDecl` return type). 3. **Implicit `int`** -- C89 compatibility: if no type specifier is found but the token sequence looks like `identifier(`, it is treated as a function with implicit `int` return type. 4. **Type specifier + declarator** -- The method parses a type specifier, then a declarator (with GCC attributes), then inspects what follows: - If the declarator ends with `Function(...)` and the next token is `{` (or a type specifier for K&R style), it is a **function definition**. - Otherwise, it is a **declaration** (variable, typedef, function prototype). ### Function Definitions `parse_function_def()` handles both modern and K&R-style parameter declarations: ```c // Modern (prototype) style: int add(int a, int b) { return a + b; } // K&R (identifier list) style: int add(a, b) int a; int b; { return a + b; } ``` For K&R functions, `parse_kr_params()` matches parameter names from the identifier list with type declarations that follow. The method also shadows typedef names that collide with parameter names for the duration of the function body, restoring the shadowing set afterward. The return type is computed from the derived declarator chain by `build_return_type`, which applies post-Function and pre-Function derivations (handling cases like `int (*func())[3]` where the return type is a pointer to an array). ### Variable Declarations `parse_declaration_rest()` handles multi-declarator declarations (`int x = 1, y = 2, *z;`), initializer parsing (including nested designated initializers with C99 `.field` and `[index]` syntax plus GCC range designators `[lo ... hi]`), and the transfer of accumulated GCC attributes to per-declarator `DeclAttributes`. A `DeclContext` struct groups per-declaration attribute state (alignment, `_Alignas` type, common flag, per-declarator attributes) to avoid threading many individual parameters between `parse_external_decl` and `parse_declaration_rest`. When a `typedef` declaration is parsed, the declared names are added to the `typedefs` set, making them available for disambiguation in subsequent parsing. ### Attribute Merging GCC attributes can appear in multiple positions on a declaration: ```c __attribute__((aligned(16))) int __attribute__((weak)) x __attribute__((section(".data"))); ``` The parser collects attributes from each position (type-level, declarator-level, post-declarator-level) and merges them. For alignment, the maximum value is taken per C11 section 6.7.5. For visibility, an explicit `__attribute__((visibility(...)))` takes priority over the current `#pragma GCC visibility` default. --- ## Statement Parsing The statement parser in `statements.rs` handles all C statement forms: | Statement | AST variant | |---|---| | `expr;` (expression statement) | `Stmt::Expr` | | `return expr;` | `Stmt::Return` | | `if (cond) then else` | `Stmt::If` | | `while (cond) body` | `Stmt::While` | | `do body while (cond);` | `Stmt::DoWhile` | | `for (init; cond; inc) body` | `Stmt::For` | | `{ items }` | `Stmt::Compound` | | `break;` / `continue;` | `Stmt::Break` / `Stmt::Continue` | | `switch (expr) body` | `Stmt::Switch` | | `case expr:` / `default:` | `Stmt::Case` / `Stmt::Default` | | `case lo ... hi:` | `Stmt::CaseRange` (GCC extension) | | `goto label;` | `Stmt::Goto` | | `goto *expr;` | `Stmt::GotoIndirect` (GCC computed goto) | | `label: stmt` | `Stmt::Label` | | `asm(...)` | `Stmt::InlineAsm` | | declaration in statement position | `Stmt::Declaration` (C23/GNU) | ### Compound Statements and Scope Management `parse_compound_stmt()` implements scope management. At entry it saves the typedef shadowing set and the declaration attribute flags; at exit it restores both. This prevents storage-class specifiers from declarations inside the block from leaking into the enclosing context -- critical for GCC statement expressions used inside `typeof()`: ```c typeof(({ extern void f(void); 42; })) x = 10; // 'extern' must not leak ``` The method also handles GNU `__label__` declarations at the start of blocks, which declare local labels scoped to the compound statement. Within a compound statement, the parser distinguishes declarations from statements by calling `is_type_specifier()`. If the current token begins a type specifier (and is not a typedef-name used as a label), it takes the declaration path; otherwise it takes the statement path. ### For Statements `for` statements receive special handling because C99 allows a declaration in the initializer position: ```c for (int i = 0; i < n; i++) { ... } ``` The parser checks `is_type_specifier()` in the initializer to distinguish `ForInit::Declaration` from `ForInit::Expr`. --- ## Inline Assembly (GCC Extended Syntax) The parser supports GCC extended inline assembly with the full four-colon syntax: ```c asm [volatile] [goto] [inline] ( "template string" // template with %0, %1, %[name] placeholders : [output operands] // first colon : [input operands] // second colon : [clobber list] // third colon : [goto labels] // fourth colon (asm goto) ); ``` Each operand is parsed as: ``` [symbolic_name] "constraint" (c_expression) ``` The symbolic name (in brackets) is optional. Constraints are string literals like `"=r"`, `"+m"`, `"i"`. The constraint string may be formed from concatenated string literals. Clobbers are bare string literals (`"memory"`, `"cc"`, register names). Goto labels are comma-separated identifiers. All four operand sections are optional; parsing stops at each colon boundary if there are no more sections. Template strings support concatenation of adjacent string literals. Top-level `asm("...")` directives (outside any function) are also supported and produce `ExternalDecl::TopLevelAsm` nodes for verbatim emission into the assembly output. --- ## AST Node Types The AST is defined as a collection of enums and structs in `ast.rs`. The root is `TranslationUnit`, which contains a `Vec`. ### Top-Level Structure ``` TranslationUnit Vec FunctionDef(FunctionDef) -- function definition with body Declaration(Declaration) -- variable/typedef/prototype declaration TopLevelAsm(String) -- asm("...") directive at file scope ``` ### `FunctionDef` Represents a function definition: return type, name, parameter list, variadic flag, compound statement body, function attributes, K&R flag, and source span. ### `Declaration` Represents variable declarations, typedef declarations, and function prototypes. Contains a base `TypeSpecifier`, a list of `InitDeclarator`s (each with a name, derived declarator chain, optional initializer, and per-declarator `DeclAttributes`), and packed boolean flags for storage class and qualifiers. Also carries alignment overrides, address space qualifiers (`__seg_gs`, `__seg_fs`), and vector size attributes. ### `TypeSpecifier` A large enum covering all C type forms: - **Primitive types:** `Void`, `Char`, `Short`, `Int`, `Long`, `LongLong`, `Float`, `Double`, `LongDouble`, `Signed`, `Unsigned`, `Bool`, `Int128` - **Unsigned variants:** `UnsignedChar` through `UnsignedInt128` - **Complex types:** `ComplexFloat`, `ComplexDouble`, `ComplexLongDouble` - **Aggregate types:** `Struct(name, fields, packed, pragma_pack, aligned)`, `Union(...)`, `Enum(name, variants, packed)` - **Derived types:** `Pointer(inner, address_space)`, `Array(inner, size)`, `FunctionPointer(ret, params, variadic)`, `BareFunction(ret, params, variadic)` - **Name-based:** `TypedefName(String)` - **Deferred:** `Typeof(Expr)`, `TypeofType(TypeSpecifier)`, `AutoType` - **Vector:** `Vector(inner, total_bytes)` — wraps a base element type with `__attribute__((vector_size(N)))` in cast/compound-literal/sizeof contexts Note: `Signed` and `Unsigned` are standalone variants that exist for completeness in type resolution but are not currently emitted by the parser (they are resolved into their concrete forms like `Int` or `UnsignedInt` during `resolve_type_flags`). ### `Expr` The expression enum has 43 variants covering: - **Literals:** `IntLiteral`, `UIntLiteral`, `LongLiteral`, `ULongLiteral`, `LongLongLiteral`, `ULongLongLiteral`, `FloatLiteral`, `FloatLiteralF32`, `FloatLiteralLongDouble` (with full 128-bit precision bytes), `StringLiteral`, `WideStringLiteral`, `Char16StringLiteral`, `CharLiteral`, `ImaginaryLiteral`, `ImaginaryLiteralF32`, `ImaginaryLiteralLongDouble` - **Names:** `Identifier` - **Operators:** `BinaryOp`, `UnaryOp`, `PostfixOp`, `Assign`, `CompoundAssign` - **Access:** `ArraySubscript`, `MemberAccess`, `PointerMemberAccess`, `AddressOf`, `Deref` - **Control flow:** `Conditional` (ternary), `GnuConditional` (omitted-middle `cond ?: else`), `Comma` - **Type operations:** `Cast`, `Sizeof`, `Alignof`, `GnuAlignof` (preferred alignment), `AlignofExpr`, `GnuAlignofExpr`, `VaArg` - **Compound constructs:** `FunctionCall`, `CompoundLiteral`, `StmtExpr` (GCC statement expression `({ ... })`) - **Generic selection:** `GenericSelection` (C11 `_Generic`) - **GCC extensions:** `LabelAddr` (`&&label` for computed goto), `BuiltinTypesCompatibleP` Every `Expr` variant carries a `Span` for source location tracking. The `ExprId` newtype wraps a stable pointer address for use as a cache key in downstream type-checking and constant-evaluation phases. ### `Stmt` Covers all C statement forms as described in the [Statement Parsing](#statement-parsing) section. The full set of variants: - `Expr(Option)` -- expression statement (the most common statement type) - `Return`, `If`, `While`, `DoWhile`, `For`, `Compound` - `Break`, `Continue`, `Switch`, `Case`, `Default` - `CaseRange` (GCC extension), `Goto`, `GotoIndirect` (GCC computed goto) - `Label`, `Declaration`, `InlineAsm` Each statement variant that represents a control-flow construct carries a `Span` for source location, accessible via `Stmt::span()`. ### `DerivedDeclarator` Describes type modifications applied by a declarator: ```rust enum DerivedDeclarator { Pointer, // * Array(Option>), // [expr] or [] Function(Vec, bool), // (params) with variadic flag FunctionPointer(Vec, bool), // (*)(params) -- pointer to function } ``` The distinction between `Function` and `FunctionPointer` is essential: `Function` appears in function definitions and prototypes, while `FunctionPointer` appears in declarators like `void (*fp)(int)` where the `(*...)` syntax introduces a pointer-to-function. ### Initializers ```rust enum Initializer { Expr(Expr), // simple: int x = 42; List(Vec), // braced: int a[] = {1, 2, 3}; } struct InitializerItem { designators: Vec, // .field, [index], [lo...hi] init: Initializer, // recursively nested } enum Designator { Index(Expr), // [expr] Range(Expr, Expr), // [lo ... hi] (GCC extension) Field(String), // .fieldname } ``` --- ## Packed Bitfield Attributes Four AST and parser-internal types use packed bitfield integers to store boolean flags instead of individual `bool` fields: | Struct | Flag type | Booleans stored | |---|---|---| | `FunctionAttributes` | `u16` (13 bits) | static, inline, extern, gnu_inline, always_inline, noinline, constructor, destructor, weak, used, fastcall, naked, noreturn | | `Declaration` | `u16` (9 bits) | static, extern, typedef, const, volatile, common, thread_local, transparent_union, inline | | `DeclAttributes` | `u16` (8 bits) | constructor, destructor, weak, error_attr, noreturn, used, fastcall, naked | | `ParsedDeclAttrs` | `u32` (19 bits) | typedef, static, extern, thread_local, inline, const, volatile, constructor, destructor, weak, used, gnu_inline, always_inline, noinline, noreturn, error_attr, transparent_union, fastcall, naked | Each flag is a named constant (power of two) in a companion module (e.g., `func_attr_flag::STATIC = 1 << 0`). Getters use `flags & MASK != 0` and setters use `flags |= MASK` / `flags &= !MASK`. This design achieves three goals: - **Memory savings.** 13 booleans in `FunctionAttributes` collapse from 13 bytes to 2 bytes. For `ParsedDeclAttrs`, 19 booleans collapse from 19 bytes to 4 bytes. Since AST nodes are allocated in large numbers, these savings are meaningful. - **Bulk save/restore.** `ParsedDeclAttrs::save_flags()` and `restore_flags()` capture and restore all 19 boolean flags as a single `u32`, used to prevent storage-class specifiers from leaking across compound statement scope boundaries. - **API ergonomics.** Callers use named getter/setter methods (`is_static()`, `set_inline(true)`) rather than raw bit manipulation. The `Debug` implementations expand flags into named booleans for readable output. Non-boolean attributes (section name, visibility string, alias target, asm register, cleanup function name, vector size, alignment overrides) remain as `Option` or `Option` fields alongside the packed flags. --- ## Error Recovery Strategy The parser employs several error recovery strategies to continue parsing after encountering invalid input: 1. **Emit and continue.** Most `expect*` methods report an error but do not halt. If the expected token is missing, the parser records the error and continues from the current position, producing a potentially malformed but complete AST. 2. **Contextual error messages.** Four `expect` variants produce increasingly specific diagnostics: - `expect(token)` -- generic: "expected `;` before `}`" - `expect_after(token, context)` -- positional: "expected `;` after return statement" with a fix-it hint - `expect_context(token, context)` -- contextual: "expected `(` after `if`" with a fix-it hint - `expect_closing(token, open_span)` -- matching: "expected `)` before `;`" with a note "to match this `(` at file.c:10:5" and a fix-it hint 3. **Fix-it hints.** Error diagnostics include insertion suggestions (e.g., "insert `;`") that an IDE or error display could use to suggest corrections. 4. **Speculative backtracking.** For ambiguous constructs (casts vs. parenthesized expressions, type specifiers vs. identifiers), the parser saves its position, attempts one interpretation, and backtracks on failure. This avoids committing to a parse that might be wrong. 5. **Top-level skip.** In `parse()`, if `parse_external_decl()` returns `None` for a non-EOF, non-semicolon token, the parser reports an error and advances past the unrecognized token to resynchronize. 6. **Error counting.** `error_count` tracks the total number of parse errors, allowing the driver to decide whether to abort compilation after parsing. --- ## Known Limitations - **Parser does not invoke the preprocessor.** The compiler has a fully integrated C preprocessor (see `frontend/preprocessor/`), but the driver orchestrates the pipeline: preprocessing runs first, then lexing, then parsing. From the parser's perspective, input tokens are already macro-expanded and include-resolved. The parser also pre-seeds a set of builtin typedef names (see below) so that standard types like `size_t` parse correctly even when compiling without real system headers. - **No type checking during parsing.** The parser produces an untyped AST. Type resolution, implicit conversions, and semantic checks happen in later lowering phases. This means some constructs that are syntactically valid but semantically illegal will parse successfully. - **Typedef scope is approximated.** Typedef names are tracked globally with a shadowing set for local scopes. This is correct for the vast majority of code but does not model C's full block-scope typedef rules with complete precision (e.g., a typedef introduced inside a `for` loop initializer may remain visible longer than the standard specifies). - **`_Atomic` is parsed but not fully modeled.** `_Atomic(T)` is treated as equivalent to `T` -- the atomic qualification is syntactically consumed but not preserved in the AST. - **`restrict` is consumed but ignored.** The `restrict` qualifier is recognized and skipped but does not appear in the AST, since it is an optimization hint that does not affect code generation in this compiler. - **Constant expression evaluation is limited.** The parser can evaluate simple constant expressions (for enum values and alignment attributes) using a lookup table of previously parsed enum constants, but cannot evaluate arbitrary compile-time expressions involving sizeof of composite types. - **`_Generic` const tracking is partial.** The parser tracks `is_const` flags on `_Generic` associations and parameter declarations to distinguish const-qualified pointer types, but CType itself does not carry qualifiers. This means const-sensitivity works for the common cases but not for complex expressions like casts or subscripts. - **No C++ support.** This is a C parser only. C++ constructs (classes, templates, namespaces, references, overloading) are not recognized.