#clang #ast #abstract-syntax-tree #format #data #processing

clang-ast

Data structures for processing Clang’s -ast-dump=json format

9 releases

0.1.8 Aug 1, 2022
0.1.7 Apr 30, 2022
0.1.6 May 14, 2021
0.1.4 Apr 29, 2021

#4 in Programming languages

Download history 1728/week @ 2022-04-25 1813/week @ 2022-05-02 1865/week @ 2022-05-09 1817/week @ 2022-05-16 1577/week @ 2022-05-23 1869/week @ 2022-05-30 2650/week @ 2022-06-06 3332/week @ 2022-06-13 3363/week @ 2022-06-20 3755/week @ 2022-06-27 2564/week @ 2022-07-04 2777/week @ 2022-07-11 3721/week @ 2022-07-18 4240/week @ 2022-07-25 4367/week @ 2022-08-01 5126/week @ 2022-08-08

17,524 downloads per month
Used in cxxbridge-macro

MIT/Apache

100KB
2.5K SLoC

-ast-dump=json

github crates.io docs.rs build status

This library provides deserialization logic for efficiently processing Clang's -ast-dump=json format from Rust.

[dependencies]
clang-ast = "0.1"

Format overview

An AST dump is generated by a compiler command like:

$  clang++ -Xclang -ast-dump=json -fsyntax-only path/to/source.cc

The high-level structure is a tree of nodes, each of which has an "id" and a "kind", zero or more further fields depending on what the node kind is, and finally an optional "inner" array of child nodes.

As an example, for an input file containing just the declaration class S;, the AST would be as follows:

{
  "id": "0x1fcea38",                 //<-- root node
  "kind": "TranslationUnitDecl",
  "inner": [
    {
      "id": "0xadf3a8",              //<-- first child node
      "kind": "CXXRecordDecl",
      "loc": {
        "offset": 6,
        "file": "source.cc",
        "line": 1,
        "col": 7,
        "tokLen": 1
      },
      "range": {
        "begin": {
          "offset": 0,
          "col": 1,
          "tokLen": 5
        },
        "end": {
          "offset": 6,
          "col": 7,
          "tokLen": 1
        }
      },
      "name": "S",
      "tagUsed": "class"
    }
  ]
}

Library design

By design, the clang-ast crate does not provide a single great big data structure that exhaustively covers every possible field of every possible Clang node type. There are three major reasons:

  • Performance — these ASTs get quite large. For a reasonable mid-sized translation unit that includes several platform headers, you can easily get an AST that is tens to hundreds of megabytes of JSON. To maintain performance of downstream tooling built on the AST, it's critical that you deserialize only the few fields which are directly required by your use case, and allow Serde's deserializer to efficiently ignore all the rest.

  • Stability — as Clang is developed, the specific fields associated with each node kind are expected to change over time in non-additive ways. This is nonproblematic because the churn on the scale of individual nodes is minimal (maybe one change every several years). However, if there were a data structure that promised to be able to deserialize every possible piece of information in every node, practically every change to Clang would be a breaking change to some node somewhere despite your tooling not caring anything at all about that node kind. By deserializing only those fields which are directly relevant to your use case, you become insulated from the vast majority of syntax tree changes.

  • Compile time — a typical use case involves inspecting only a tiny fraction of the possible nodes or fields, on the order of 1%. Consequently your code will compile 100× faster than if you tried to include everything in the data structure.


Data structures

The core data structure of the clang-ast crate is Node<T>.

pub struct Node<T> {
    pub id: Id,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

The caller must provide their own kind type T, which is an enum or struct as described below. T determines exactly what information the clang-ast crate will deserialize out of the AST dump.

By convention you should name your T type Clang.


T = enum

Most often, you'll want Clang to be an enum. In this case your enum must have one variant per node kind that you care about. The name of each variant matches the "kind" entry seen in the AST.

Additionally there must be a fallback variant, which must be named either Unknown or Other, into which clang-ast will put all tree nodes not matching one of the expected kinds.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

fn main() {
    let json = std::fs::read_to_string("ast.json").unwrap();
    let node: Node = serde_json::from_str(&json).unwrap();

}

The above is a simple example with variants for processing "kind": "NamespaceDecl", "kind": "EnumDecl", and "kind": "EnumConstantDecl" nodes. This is sufficient to extract the set of variants of every enum in the translation unit, and the enums' namespace (possibly anonymous) and enum name (possibly anonymous).

Newtype variants are fine too, particularly if you'll be deserializing more than one field for some nodes.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    EnumConstantDecl(EnumConstantDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumConstantDecl {
    pub name: String,
}

T = struct

Rarely, it can make sense to instantiate Node with Clang being a struct type, instead of an enum. This allows for deserializing a uniform group of data out of every node in the syntax tree.

The following example struct collects the "loc" and "range" of every node if present; these fields provide the file name / line / column position of nodes. Not every node kind contains this information, so we use Option to collect it for just the nodes that have it.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String,  // or clang_ast::Kind
    pub loc: Option<clang_ast::SourceLocation>,
    pub range: Option<clang_ast::SourceRange>,
}

If you really need, it's also possible to store every other piece of key/value information about every node via a weakly typed Map<String, Value> and the Serde flatten attribute.

use serde::Deserialize;
use serde_json::{Map, Value};

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String,  // or clang_ast::Kind
    #[serde(flatten)]
    pub data: Map<String, Value>,
}

Hybrid approach

To deserialize kind-specific information about a fixed set of node kinds you care about, as well as some uniform information about every other kind of node, you can use a hybrid of the two approaches by giving your Other / Unknown fallback variant some fields.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    Other {
        kind: clang_ast::Kind,
    },
}

Source locations

Many node kinds expose the source location of the corresponding source code tokens, which includes:

  • the filepath at which they're located;
  • the chain of #includes by which that file was brought into the translation unit;
  • line/column positions within the source file;
  • macro expansion trace for tokens constructed by expansion of a C preprocessor macro.

You'll find this information in fields called "loc" and/or "range" in the JSON representation.

{
  "id": "0x1251428",
  "kind": "NamespaceDecl",
  "loc": {                           //<--
    "offset": 7004,
    "file": "/usr/include/x86_64-linux-gnu/c++/10/bits/c++config.h",
    "line": 258,
    "col": 11,
    "tokLen": 3,
    "includedFrom": {
      "file": "/usr/include/c++/10/utility"
    }
  },
  "range": {                         //<--
    "begin": {
      "offset": 6994,
      "col": 1,
      "tokLen": 9
    },
    "end": {
      "offset": 7155,
      "line": 266,
      "col": 1,
      "tokLen": 1
    }
  },
  ...
}

The naive deserialization of these structures is challenging to work with because Clang uses field omission to mean "same as previous". So if a "loc" is printed without a "file" inside, it means the loc is in the same file as the immediately previous loc in serialization order.

The clang-ast crate provides types for deserializing this source location information painlessly, producing Arc<str> as the type of filepaths which may be shared across multiple source locations.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
    pub loc: clang_ast::SourceLocation,    //<--
    pub range: clang_ast::SourceRange,     //<--
}

Node identifiers

Every syntax tree node has an "id". In JSON it's the memory address of Clang's internal memory allocation for that node, serialized to a hex string.

The AST dump uses ids as backreferences in nodes of directed acyclic graph nature. For example the following MemberExpr node is part of the invocation of an operator bool conversion, and thus its syntax tree refers to the resolved operator bool conversion function declaration:

{
  "id": "0x9918b88",
  "kind": "MemberExpr",
  "valueCategory": "rvalue",
  "referencedMemberDecl": "0x12d8330",     //<--
  ...
}

The node it references, with memory address 0x12d8330, is found somewhere earlier in the syntax tree:

{
  "id": "0x12d8330",                       //<--
  "kind": "CXXConversionDecl",
  "name": "operator bool",
  "mangledName": "_ZNKSt17integral_constantIbLb1EEcvbEv",
  "type": {
    "qualType": "std::integral_constant<bool, true>::value_type () const noexcept"
  },
  "constexpr": true,
  ...
}

Due to the ubiquitous use of ids for backreferencing, it is valuable to deserialize them not as strings but as a 64-bit integer. The clang-ast crate provides an Id type for this purpose, which is cheaply copyable, hashable, and comparible more cheaply than a string. You may find yourself with lots of hashtables keyed on Id.


License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~135–255KB