3 unstable releases

| Version | Date |
|---|---|
| 0.2.1 | Jun 17, 2025 |
| 0.2.0 | Jun 1, 2025 |
| 0.1.0 | May 12, 2025 |
Auto LSP Codegen
To generate an AST, provide a Tree-sitter node-types.json and the LanguageFn of any language to the generate function of the auto_lsp_codegen crate.
cargo add auto_lsp_codegen
[!NOTE] Although auto_lsp_codegen is a standalone crate, the generated code depends on the main auto_lsp crate.
Usage
The auto_lsp_codegen crate exposes a single generate function, which takes:
- A node-types.json,
- A LanguageFn,
- An optional HashMap<&str, &str> to rename tokens (see Custom Tokens).

It returns a TokenStream.
How you choose to use the TokenStream is up to you.
The most common setup is to call it from a build.rs script and write the generated code to a Rust file.
Note, however, that the output can be quite large: for example, Python's AST results in ~11,000 lines of code.
use auto_lsp_codegen::generate;
use std::{fs, path::PathBuf};

fn main() {
    // Only regenerate when AST_GEN is set to something other than "0".
    if std::env::var("AST_GEN").unwrap_or("0".to_string()) == "0" {
        return;
    }

    let output_path = PathBuf::from("./src/generated.rs");

    // Render the TokenStream to a string and write it to the output file.
    fs::write(
        output_path,
        generate(
            tree_sitter_python::NODE_TYPES,
            &tree_sitter_python::LANGUAGE.into(),
            None,
        )
        .to_string(),
    )
    .unwrap();
}
You can also invoke it from your own CLI or tool if needed.
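Because the generated file can be large, rewriting it on every run may cause needless recompilation of the crate that includes it. A minimal sketch, assuming a hypothetical helper named write_if_changed (it is not part of auto_lsp_codegen), that only touches the file when the generated text differs:

```rust
use std::{fs, io, path::Path};

/// Hypothetical helper (not part of `auto_lsp_codegen`): writes `contents`
/// to `path` only when the file is missing or its text differs.
/// Returns `Ok(true)` if the file was written.
pub fn write_if_changed(path: &Path, contents: &str) -> io::Result<bool> {
    match fs::read_to_string(path) {
        // Identical output: leave the file (and its mtime) untouched.
        Ok(existing) if existing == contents => Ok(false),
        // Missing, unreadable, or different: (re)write it.
        _ => {
            fs::write(path, contents)?;
            Ok(true)
        }
    }
}
```

In a build script like the one above, the string produced by generate(...).to_string() would be passed as contents.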
How Codegen Works
The generated code structure depends on the Tree-sitter grammar.
Structs for Rules
Each rule in node-types.json becomes a dedicated Rust struct. For example, given the rule:
function_definition: $ => seq(
  optional('async'),
  'def',
  field('name', $.identifier),
  field('type_parameters', optional($.type_parameter)),
  field('parameters', $.parameters),
  optional(
    seq(
      '->',
      field('return_type', $.type),
    ),
  ),
  ':',
  field('body', $._suite),
),
The generated struct would look like this:
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionDefinition {
    pub name: std::sync::Arc<Identifier>,
    pub body: std::sync::Arc<Block>,
    pub type_parameters: Option<std::sync::Arc<TypeParameter>>,
    pub parameters: std::sync::Arc<Parameters>,
    pub return_type: Option<std::sync::Arc<Type>>,
    /* ... */
}
Field Matching
To match fields, codegen uses the field_id() method from the Tree-sitter cursor.
From the above example, the generated builder might look like this:
builder.builder(db, &node, Some(id), |b| {
    b.on_field_id::<Identifier, 19u16>(&mut name)?
        .on_field_id::<Block, 6u16>(&mut body)?
        .on_field_id::<TypeParameter, 31u16>(&mut type_parameters)?
        .on_field_id::<Parameters, 23u16>(&mut parameters)?
        .on_field_id::<Type, 24u16>(&mut return_type)
});
Each u16 represents the unique field ID assigned by the Tree-sitter language parser.
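The idea of routing children by field ID can be illustrated without Tree-sitter. The following is a simplified model, not the generated builder: Child, build, and the use of plain strings instead of AST nodes are all assumptions made for illustration, and the constants only mirror the IDs from the example above.

```rust
/// Field IDs mirroring the numbers in the example above; real values are
/// assigned by the Tree-sitter language parser.
const NAME: u16 = 19;
const BODY: u16 = 6;
const RETURN_TYPE: u16 = 24;

/// Hypothetical stand-in for a child node: its field ID (if any) and text.
pub struct Child {
    pub field_id: Option<u16>,
    pub text: &'static str,
}

/// Simplified model of the struct; the real one holds `Arc` AST nodes.
#[derive(Debug, Default, PartialEq)]
pub struct FunctionDefinition {
    pub name: Option<String>,
    pub body: Option<String>,
    pub return_type: Option<String>,
}

/// Routes each child to the struct field whose ID matches. Children
/// without a known field ID (such as the `def` keyword) are skipped.
pub fn build(children: &[Child]) -> FunctionDefinition {
    let mut out = FunctionDefinition::default();
    for child in children {
        let slot = match child.field_id {
            Some(NAME) => &mut out.name,
            Some(BODY) => &mut out.body,
            Some(RETURN_TYPE) => &mut out.return_type,
            _ => continue,
        };
        *slot = Some(child.text.to_string());
    }
    out
}
```

Matching on a numeric ID rather than a field name avoids string comparisons during traversal, which is presumably why the generated code embeds the IDs as const generics.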
Handling Children
If a node has no named fields, a children enum is generated to represent all possible variants.
- If the children are unnamed, a generic "Operator_" enum is generated.
- If the children are named, the enum will be a concatenation of all possible child node types with underscores, using sanitized Rust-friendly names.
For example, given the rule:
_statement: $ => choice(
  $._simple_statement,
  $._compound_statement,
),
The generated enum would look like this:
pub enum SimpleStatement_CompoundStatement {
    SimpleStatement(SimpleStatement),
    CompoundStatement(CompoundStatement),
}
[!NOTE] If the generated enum name becomes too long, consider using a Tree-sitter supertype to group nodes together.
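The naming rule for children enums can be sketched as follows. This is an approximation under stated assumptions: pascal_case and children_enum_name are hypothetical names, and the crate's actual sanitizer may handle more cases (for example, characters that are not valid in Rust identifiers).

```rust
/// Converts a Tree-sitter node type such as `_simple_statement`
/// to PascalCase (`SimpleStatement`), dropping leading underscores.
pub fn pascal_case(node_type: &str) -> String {
    node_type
        .split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect()
}

/// Joins the PascalCase names of all child node types with underscores.
pub fn children_enum_name(node_types: &[&str]) -> String {
    node_types
        .iter()
        .map(|t| pascal_case(t))
        .collect::<Vec<_>>()
        .join("_")
}
```

With the two child types from the _statement rule, this yields SimpleStatement_CompoundStatement, which also shows how quickly the name grows as more child types are added.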
The kind_id() method is used to determine child kinds during traversal.
The AstNode::contains method relies on this to check whether a node kind belongs to a specific struct or enum variant.
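A rough model of this kind check is sketched below. The ContainsKind trait, the from_kind constructor, and the kind IDs 101 and 102 are all made up for illustration; the real AstNode trait in auto_lsp may have a different signature.

```rust
/// Hypothetical trait modelled on the `AstNode::contains` idea.
pub trait ContainsKind {
    fn contains(kind_id: u16) -> bool;
}

pub struct SimpleStatement;
pub struct CompoundStatement;

// Made-up kind IDs; real ones come from the Tree-sitter language.
impl ContainsKind for SimpleStatement {
    fn contains(kind_id: u16) -> bool {
        kind_id == 101
    }
}

impl ContainsKind for CompoundStatement {
    fn contains(kind_id: u16) -> bool {
        kind_id == 102
    }
}

#[allow(non_camel_case_types)]
pub enum SimpleStatement_CompoundStatement {
    SimpleStatement(SimpleStatement),
    CompoundStatement(CompoundStatement),
}

// An enum contains a kind if any of its variants does.
impl ContainsKind for SimpleStatement_CompoundStatement {
    fn contains(kind_id: u16) -> bool {
        SimpleStatement::contains(kind_id) || CompoundStatement::contains(kind_id)
    }
}

impl SimpleStatement_CompoundStatement {
    /// Picks the variant whose type claims the given kind ID, if any.
    pub fn from_kind(kind_id: u16) -> Option<Self> {
        if SimpleStatement::contains(kind_id) {
            Some(Self::SimpleStatement(SimpleStatement))
        } else if CompoundStatement::contains(kind_id) {
            Some(Self::CompoundStatement(CompoundStatement))
        } else {
            None
        }
    }
}
```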
Vec and Option Fields
repeat and repeat1 in the grammar will generate a Vec field.
optional(...) will generate an Option<T> field.
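In node-types.json, each field entry carries "multiple" and "required" flags, which is where repetition and optionality surface. A sketch of how those flags could map to Rust field types, assuming the Arc wrapping shown in FunctionDefinition above; field_type is a hypothetical helper, and wrapping Vec elements in Arc is an assumption rather than confirmed output of the crate:

```rust
/// Hypothetical helper: picks a Rust field type from the `multiple` and
/// `required` flags of a node-types.json field entry.
pub fn field_type(node_type: &str, multiple: bool, required: bool) -> String {
    let inner = format!("std::sync::Arc<{node_type}>");
    if multiple {
        // repeat / repeat1: zero or more (or one or more) children.
        format!("Vec<{inner}>")
    } else if !required {
        // optional(...): the child may be absent.
        format!("Option<{inner}>")
    } else {
        inner
    }
}
```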
Token Naming
Unnamed tokens are mapped to Rust enums using a built-in token map. For instance:
{ "type": "+", "named": false },
{ "type": "+=", "named": false },
{ "type": ",", "named": false },
{ "type": "-", "named": false },
{ "type": "-=", "named": false },
Generates:
pub enum Token_Plus {}
pub enum Token_PlusEqual {}
pub enum Token_Comma {}
pub enum Token_Minus {}
pub enum Token_MinusEqual {}
Tokens with regular identifiers are converted to PascalCase.
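The two naming paths can be sketched together. This is an illustration only: token_enum_name and pascal_case are hypothetical names, the punctuation table is limited to the five tokens listed above, and applying the Token_ prefix to identifier-like tokens is an assumption.

```rust
/// Converts an identifier-like token (e.g. `not_in`) to PascalCase.
fn pascal_case(ident: &str) -> String {
    ident
        .split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect()
}

/// Hypothetical sketch of token naming: punctuation goes through a lookup
/// table, anything else is treated as a regular identifier.
pub fn token_enum_name(token: &str) -> String {
    let name = match token {
        "+" => "Plus".to_string(),
        "+=" => "PlusEqual".to_string(),
        "," => "Comma".to_string(),
        "-" => "Minus".to_string(),
        "-=" => "MinusEqual".to_string(),
        other => pascal_case(other),
    };
    // Assumption: every token enum shares the `Token_` prefix.
    format!("Token_{name}")
}
```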
Custom Tokens
If your grammar defines additional unnamed tokens not covered by the default map, you can provide a custom token mapping to generate appropriate Rust enum names.
use auto_lsp_codegen::generate;
use std::collections::HashMap;

let _result = generate(
    tree_sitter_python::NODE_TYPES,
    &tree_sitter_python::LANGUAGE.into(),
    Some(HashMap::from([
        ("+", "Plus"),
        ("+=", "PlusEqual"),
        (",", "Comma"),
        ("-", "Minus"),
        ("-=", "MinusEqual"),
    ])),
);
Tokens that are not in the default map will be added, and tokens that already exist in it will be overwritten by your custom names.
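These add-or-overwrite semantics match what HashMap::extend does. A minimal sketch, assuming a hypothetical merge_tokens helper; how the crate merges the maps internally is not shown in this document:

```rust
use std::collections::HashMap;

/// Hypothetical helper: layers an optional custom token map over defaults.
pub fn merge_tokens<'a>(
    mut defaults: HashMap<&'a str, &'a str>,
    custom: Option<HashMap<&'a str, &'a str>>,
) -> HashMap<&'a str, &'a str> {
    if let Some(custom) = custom {
        // `extend` inserts new keys and replaces the values of existing ones.
        defaults.extend(custom);
    }
    defaults
}
```

Passing None leaves the defaults untouched, which corresponds to the None argument in the build.rs example.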
Super Types
Tree-sitter supports supertypes, which allow grouping related nodes under a common type.
For example, in the Python grammar:
{
  "type": "_compound_statement",
  "named": true,
  "subtypes": [
    {
      "type": "class_definition",
      "named": true
    },
    {
      "type": "decorated_definition",
      "named": true
    },
    /* ... */
    {
      "type": "with_statement",
      "named": true
    }
  ]
},
This becomes a Rust enum:
pub enum CompoundStatement {
    ClassDefinition(ClassDefinition),
    DecoratedDefinition(DecoratedDefinition),
    /* ... */
    WithStatement(WithStatement),
}
[!NOTE] Some super types might contain other super types, in which case, the generated enum will flatten the hierarchy.
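Flattening can be pictured as recursively replacing each nested supertype with its own subtypes. The sketch below is an assumption about the general approach, not the crate's code: flatten is a hypothetical function, the hierarchy is assumed to be acyclic, and the sample data in the usage note is heavily abbreviated.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: expands `supertype` into its concrete subtypes,
/// inlining any subtype that is itself a supertype (a key in `subtypes`).
/// Assumes the hierarchy has no cycles.
pub fn flatten<'a>(
    supertype: &'a str,
    subtypes: &HashMap<&'a str, Vec<&'a str>>,
) -> Vec<&'a str> {
    let mut out = Vec::new();
    if let Some(children) = subtypes.get(supertype) {
        for &child in children {
            if subtypes.contains_key(child) {
                // Nested supertype: splice in its subtypes instead.
                out.extend(flatten(child, subtypes));
            } else {
                out.push(child);
            }
        }
    }
    out
}
```

For instance, if expression lists primary_expression and lambda, and primary_expression lists identifier and call, flattening expression gives identifier, call, lambda, so the resulting enum would have one variant per concrete node.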