3 unstable releases

0.2.0	Jun 30, 2024
0.1.1	Jun 16, 2024
0.1.0	Jun 16, 2024

#807 in Encoding

134 downloads per month

MIT license

98KB
1.5K SLoC

Serde Datalog

Serde Datalog provides an implementation of the Serializer trait from Serde to generate facts from any data structure whose type implements the serde::Serializable trait. In Datalog parlance, Serde Datalog serializes data structures to EDBs.

Serde Datalog has two main components: an extractor that generates facts about data structures, and a backend that materializes these facts into an explicit representation. You can swap out different implementations of the backend to change the representation of facts.

Example

Consider the following enum type that implements the Serialize trait:

#[derive(Serialize)]
enum Foo {
    A(Box<Foo>),
    B(i64)
}

Then consider the enum instance Foo::A(Foo::B(10)). The extractor generates the following facts to represent this data structure:

Element 1 is a newtype variant
Element 1 has type Foo and variant name A
The first field of Element 1 references Element 2
Element 2 is a newtype variant
Element 2 has type Foo and variant name B
The first field of Element 2 references Element 3
Element 3 is an i64
Element 3 has value 10

The extractor generates facts from a data structure through flattening: it generates unique identifiers for each element within the data structure, and references between elements are "unswizzled" into identifiers.

For each of these facts, the extractor will make the following calls to an extractor backend.

For each fact, the extractor will make calls to an extractor backend to materialize the fact. For example, we can use the vector backend to materialize these extracted facts as vectors of tuples. You can then use these vectors as inputs to queries for Datalog engines embedded in Rust, such as Ascent or Crepe.

let input = Foo::A(Box::new(Foo::B(10)));
let mut extractor = DatalogExtractor::new(backend::vector::Backend::default());
input.serialize(&mut extractor);

// Now we can inspect the tables in the backend to see what facts got
// extracted from the input.

let data: backend::vector::BackendData<ElemId> = extractor.get_backend().get_data();

// there are 3 total elements
assert!(data.type_table.len() == 3);

// there are 2 enum variant elements
assert!(data.variant_type_table.len() == 2);

// there is 1 number element
assert!(data.number_table.len() == 1);

Alternatively, you can store the generated facts in a SQLite file with the Souffle SQLite backend. You can then use this file as an input EDB for Datalog queries executed by Souffle.

let input = Foo::A(Box::new(Foo::B(10)));
let mut backend = backend::souffle_sqlite::Backend::default();
let mut extractor = DatalogExtractor::new(&mut backend);
input.serialize(&mut extractor);
backend.dump_to_db("input.db");

Command-line Tool

Serde Datalog also comes as a command-line tool serde_datalog that can convert data from a variety of input formats such as JSON or YAML to a SQLite file using the Souffle SQLite backend. This allows you to use Souffle Datalog as a query language for data formats, much like jq or yq.

Example

Consider the following JSON file census.json containing borough-level population data in New York City from the 2020 census:

{
	"boroughs": [
		{ "name": "Bronx", "population": 1472654 },
		{ "name": "Brooklyn", "population": 2736074 },
		{ "name": "Manhattan", "population": 1694251 },
		{ "name": "Queens", "population": 2405464 },
		{ "name": "Staten Island", "population": 495747 }
	]
}

We can write a Souffle Datalog query to calculate the total population of New York City. First, extract a fact database from the JSON file using the following invocation of serde_datalog:

> serde_datalog census.json -o census.db

Next, we write the actual query in a Souffle Datalog file, census.dl:

#include "schemas/serde_string_key.dl"

.decl boroPopulation(boro: ElemId, population: number)
boroPopulation(boro, population) :-
    rootElem(_, root),
    map(root, "boroughs", boroList),
    seq(boroList, _, boro),
    map(boro, "population", popId),
    number(popId, population).

.decl totalPopulation(total: number)
totalPopulation(sum pop : { boroPopulation(_, pop) }).

.input type, bool, number, string, map, struct, seq, tuple, structType, variantType(IO=sqlite, dbname="census.db")
.output totalPopulation(IO=stdout)

Note that the the schema defined in schemas/serde_string_key.dl assumes that maps can only have string keys. This is true for formats like JSON or TOML. The file schemas/serde.dl defines a more general schema that does not have this assumption, and thus can represent any value serializable by Serde. The serde_datalog tool generates facts in the former schema when applicable (i.e. when processing input in JSON or TOML format), but will generate facts that conform to the latter schema otherwise.

An Example with Recursion

Datalog excels in queries that involve recursion. For example, consider this JSON file that contains information about package dependencies:

{
    "packages": [
        { "package": "A", "dependencies": ["B"] },
        { "package": "B", "dependencies": ["C", "D"] }
    ]
}

We can write a query that computes the transitive dependencies of package A as follows:

#include "schemas/serde_string_key.dl"

.decl dependsOn(package1: symbol, package2: symbol)
dependsOn(package1, package2) :-
    rootElem(_, root),
    map(root, "packages", plist),
    seq(plist, _, p),
    map(p, "package", pname),
    string(pname, package1),
    map(p, "dependencies", pdeps),
    seq(pdeps, _, dep),
    string(dep, package2).

dependsOn(package1, package3) :-
    dependsOn(package1, package2),
    dependsOn(package2, package3).

.decl depsA(dep: symbol)
depsA(dep) :- dependsOn("A", dep).

.input rootElem, type, bool, number, string, map, struct, seq, tuple, structType, variantType(IO=sqlite, dbname="test3_json.db")
.output depsA(IO=stdout)

This query will return the following output:

---------------
depsA
===============
B
C
D
===============

Dependencies

~24MB
~466K SLoC