3 unstable releases
0.2.0 | Jun 30, 2024 |
---|---|
0.1.1 | Jun 16, 2024 |
0.1.0 | Jun 16, 2024 |
#645 in Encoding
134 downloads per month
98KB
1.5K
SLoC
Serde Datalog
Serde Datalog provides an implementation of the Serializer
trait from
Serde to generate facts from any data structure whose type
implements the serde::Serializable
trait. In Datalog parlance, Serde Datalog
serializes data structures to EDBs.
Serde Datalog has two main components: an extractor that generates facts about data structures, and a backend that materializes these facts into an explicit representation. You can swap out different implementations of the backend to change the representation of facts.
Example
Consider the following enum type that implements the Serialize
trait:
#[derive(Serialize)]
enum Foo {
A(Box<Foo>),
B(i64)
}
Then consider the enum instance Foo::A(Foo::B(10))
. The extractor
generates the following facts to represent this data structure:
- Element 1 is a newtype variant
- Element 1 has type
Foo
and variant nameA
- The first field of Element 1 references Element 2
- Element 2 is a newtype variant
- Element 2 has type
Foo
and variant nameB
- The first field of Element 2 references Element 3
- Element 3 is an i64
- Element 3 has value 10
The extractor generates facts from a data structure through flattening: it generates unique identifiers for each element within the data structure, and references between elements are "unswizzled" into identifiers.
For each of these facts, the extractor will make the following calls to an extractor backend.
For each fact, the extractor will make calls to an extractor backend to materialize the fact. For example, we can use the vector backend to materialize these extracted facts as vectors of tuples. You can then use these vectors as inputs to queries for Datalog engines embedded in Rust, such as Ascent or Crepe.
let input = Foo::A(Box::new(Foo::B(10)));
let mut extractor = DatalogExtractor::new(backend::vector::Backend::default());
input.serialize(&mut extractor);
// Now we can inspect the tables in the backend to see what facts got
// extracted from the input.
let data: backend::vector::BackendData<ElemId> = extractor.get_backend().get_data();
// there are 3 total elements
assert!(data.type_table.len() == 3);
// there are 2 enum variant elements
assert!(data.variant_type_table.len() == 2);
// there is 1 number element
assert!(data.number_table.len() == 1);
Alternatively, you can store the generated facts in a SQLite file with the Souffle SQLite backend. You can then use this file as an input EDB for Datalog queries executed by Souffle.
let input = Foo::A(Box::new(Foo::B(10)));
let mut backend = backend::souffle_sqlite::Backend::default();
let mut extractor = DatalogExtractor::new(&mut backend);
input.serialize(&mut extractor);
backend.dump_to_db("input.db");
Command-line Tool
Serde Datalog also comes as a command-line tool serde_datalog
that can convert
data from a variety of input formats such as JSON or YAML to a SQLite file
using the Souffle SQLite backend. This allows you to use Souffle Datalog as a
query language for data formats, much like jq
or yq.
Example
Consider the following JSON file census.json
containing borough-level
population data in New York City from the 2020 census:
{
"boroughs": [
{ "name": "Bronx", "population": 1472654 },
{ "name": "Brooklyn", "population": 2736074 },
{ "name": "Manhattan", "population": 1694251 },
{ "name": "Queens", "population": 2405464 },
{ "name": "Staten Island", "population": 495747 }
]
}
We can write a Souffle Datalog query to calculate the total population of
New York City. First, extract a fact database from the JSON file using
the following invocation of serde_datalog
:
> serde_datalog census.json -o census.db
Next, we write the actual query in a Souffle Datalog file, census.dl
:
#include "schemas/serde_string_key.dl"
.decl boroPopulation(boro: ElemId, population: number)
boroPopulation(boro, population) :-
rootElem(_, root),
map(root, "boroughs", boroList),
seq(boroList, _, boro),
map(boro, "population", popId),
number(popId, population).
.decl totalPopulation(total: number)
totalPopulation(sum pop : { boroPopulation(_, pop) }).
.input type, bool, number, string, map, struct, seq, tuple, structType, variantType(IO=sqlite, dbname="census.db")
.output totalPopulation(IO=stdout)
Note that the the schema defined in schemas/serde_string_key.dl
assumes
that maps can only have string keys. This is true for formats like JSON or TOML.
The file schemas/serde.dl
defines a more general schema that does not have
this assumption, and thus can represent any value serializable by Serde.
The serde_datalog
tool generates facts in the former schema when applicable
(i.e. when processing input in JSON or TOML format), but will generate facts that
conform to the latter schema otherwise.
An Example with Recursion
Datalog excels in queries that involve recursion. For example, consider this JSON file that contains information about package dependencies:
{
"packages": [
{ "package": "A", "dependencies": ["B"] },
{ "package": "B", "dependencies": ["C", "D"] }
]
}
We can write a query that computes the transitive dependencies of package A
as follows:
#include "schemas/serde_string_key.dl"
.decl dependsOn(package1: symbol, package2: symbol)
dependsOn(package1, package2) :-
rootElem(_, root),
map(root, "packages", plist),
seq(plist, _, p),
map(p, "package", pname),
string(pname, package1),
map(p, "dependencies", pdeps),
seq(pdeps, _, dep),
string(dep, package2).
dependsOn(package1, package3) :-
dependsOn(package1, package2),
dependsOn(package2, package3).
.decl depsA(dep: symbol)
depsA(dep) :- dependsOn("A", dep).
.input rootElem, type, bool, number, string, map, struct, seq, tuple, structType, variantType(IO=sqlite, dbname="test3_json.db")
.output depsA(IO=stdout)
This query will return the following output:
---------------
depsA
===============
B
C
D
===============
Dependencies
~24MB
~458K SLoC