8 releases

0.1.9 May 15, 2024
0.1.8 May 7, 2024
0.1.6 Apr 28, 2024

#742 in Parser implementations

MIT license

275KB
7K SLoC

Dapt

Dapt is in heavy development, so any use in production may run into breaking changes. If you have suggestions or run into problems, please open an issue or PR.


Dapt (data packet) is a Rust object that serializes and deserializes dynamic data using the serde crate. Dapt lets you traverse your data structure with a jq-like syntax. The project is intended to act as the data packet for a stream processing engine that handles unstructured data, though it is generic enough to be used for many other things.

Example

use dapt::Dapt;

fn main() {
    let data = r#"
    {
        "name": "John Doe",
        "age": 30,
        "phones": [
            {
                "type": "home",
                "number": "212 555-1234"
            },
            {
                "type": "office",
                "number": "646 555-4567"
            }
        ]
    }
    "#;

    // use serde_json to parse the data into a Dapt object
    let d: Dapt = serde_json::from_str(data).unwrap();

    // get just the phone numbers
    let numbers = d.get("phones[].number").unwrap();

    // write the data back out
    println!("{}", serde_json::to_string_pretty(&numbers).unwrap());
}

// output will be:
// [
//   "212 555-1234",
//   "646 555-4567"
// ]

Run it yourself with cargo run --example simple.

Matching Strategy

Dapt is made up of two key parts:

  • A byte slice which holds the data in a binary format
  • A vector of indexes to locations within that binary data

This means that a dapt packet can point to multiple locations in the underlying data at once. As a user, you traverse the packet with the get method, which returns a new packet whose indexes match the path you provided. The underlying binary data is not copied during this process, which reduces heap allocations.
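To make the two-part layout concrete, here is a minimal sketch that uses only the standard library. The Packet type, its fields, and the narrow method are hypothetical names chosen for illustration and are not dapt's actual internals. The sketch only shows how a shared buffer plus a vector of offsets lets a derived packet point at several locations without copying the data.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a packet: a shared byte buffer plus the
/// offsets this particular view points at. Not dapt's real layout.
#[derive(Debug, Clone)]
struct Packet {
    data: Rc<[u8]>,
    indexes: Vec<usize>,
}

impl Packet {
    /// Copies the bytes once into a shared buffer, pointing at offset 0.
    fn new(bytes: &[u8]) -> Self {
        Packet { data: Rc::from(bytes), indexes: vec![0] }
    }

    /// Returns a new view pointing at `indexes`. Only the buffer's
    /// reference count is bumped; the bytes themselves are not copied.
    fn narrow(&self, indexes: Vec<usize>) -> Self {
        Packet { data: Rc::clone(&self.data), indexes }
    }

    /// Reads the byte at each location this view points to.
    fn bytes_at_indexes(&self) -> Vec<u8> {
        self.indexes.iter().map(|&i| self.data[i]).collect()
    }
}

fn main() {
    let root = Packet::new(b"abcdef");
    let view = root.narrow(vec![1, 4]);

    // Both packets share one allocation.
    assert!(Rc::ptr_eq(&root.data, &view.data));
    println!("{:?}", view.bytes_at_indexes());
}
```

The derived view holds two indexes into the same buffer as the root, which is the property the get method is described as relying on.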

Dapt paths are nodes separated by a period (.). Each node of the path looks at the document to find ALL matching values. Let's step through the previous example to get a better idea of how dapt works. Our path was phones[].number. First, dapt uses the node phones to traverse down the document:

[
  {
    "type": "home",
    "number": "212 555-1234"
  },
  {
    "type": "office",
    "number": "646 555-4567"
  }
]

Next, dapt uses [] to traverse the array. The array node allows you to specify an index, but because we didn't, our dapt packet now points to both elements of the array.

{
  "type": "home",
  "number": "212 555-1234"
}

{
  "type": "office",
  "number": "646 555-4567"
}

Now when we use the number node, it returns the number field of both objects, because each location we point to is traversed with the same node.

"212 555-1234"

"646 555-4567"

When we ask serde to serialize this, dapt recognizes that it points to multiple locations and serializes the values it points to as an array. These values don't have to be the same type; check out the mixed example to see how dapt handles this.
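The walkthrough above can be sketched with the standard library alone. The Value and Node types and the step and get functions below are hypothetical and much simpler than dapt's implementation (which works on binary data, not a tree). They only illustrate the rule that every node is applied to every location currently pointed at.

```rust
/// A tiny JSON-like value, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Array(Vec<Value>),
    Map(Vec<(String, Value)>),
}

/// One step of a path: a field name, or an array node with an optional index.
enum Node {
    Field(&'static str),
    Array(Option<usize>),
}

/// Applies one node to every current location and collects all matches.
fn step<'a>(current: Vec<&'a Value>, node: &Node) -> Vec<&'a Value> {
    let mut next = Vec::new();
    for value in current {
        match (node, value) {
            (Node::Field(name), Value::Map(fields)) => {
                next.extend(fields.iter().filter(|(k, _)| k.as_str() == *name).map(|(_, v)| v));
            }
            (Node::Array(Some(i)), Value::Array(items)) => next.extend(items.get(*i)),
            (Node::Array(None), Value::Array(items)) => next.extend(items.iter()),
            _ => {} // the node does not apply to this kind of value
        }
    }
    next
}

/// Walks the whole path, starting from a single pointer at the root.
fn get<'a>(root: &'a Value, path: &[Node]) -> Vec<&'a Value> {
    path.iter().fold(vec![root], |current, node| step(current, node))
}

fn phone(kind: &str, number: &str) -> Value {
    Value::Map(vec![
        ("type".to_string(), Value::Str(kind.to_string())),
        ("number".to_string(), Value::Str(number.to_string())),
    ])
}

/// The document from the example, minus the age field.
fn sample() -> Value {
    Value::Map(vec![
        ("name".to_string(), Value::Str("John Doe".to_string())),
        ("phones".to_string(), Value::Array(vec![
            phone("home", "212 555-1234"),
            phone("office", "646 555-4567"),
        ])),
    ])
}

fn main() {
    let doc = sample();
    // Equivalent in spirit to the path phones[].number
    let path = [Node::Field("phones"), Node::Array(None), Node::Field("number")];
    for value in get(&doc, &path) {
        println!("{:?}", value);
    }
}
```

Replacing Node::Array(None) with Node::Array(Some(1)) narrows the result to the office number only, which mirrors the optional index described above.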

Available Path Nodes

  • Field Literal: ex. host.name This is the most basic node, it matches a field with the exact name.
  • Array: ex. hosts[0].name Matches elements of an array. The index is optional: if supplied, only the element at that index matches; otherwise every element matches.
  • Array Wildcard: ex. *.name Matches all children of a map, one level deep only.
  • Recursive: ex. ~.name Recursively searches downward for the matching node. Any node can follow a recursive node; the only requirement is that it is followed by some child node.
  • Regex: ex. /^host.*/.name Matches all fields that match the regex. The regex is a Rust regex, and is matched against the field name.
  • First: ex. host.{name,ip.*} Matches the first node that returns values. Each child of a match is a full path. If the child could match multiple values, it will still only match the first value.
  • Multi: ex. host.(name|ip.*) Matches all paths specified.
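The difference between the one-level wildcard and the recursive node is easiest to see side by side. The sketch below is standard-library only and uses a hypothetical Value type and helper functions. Whether dapt's recursive node also checks the starting level, and the order in which it returns matches, are assumptions made here for illustration.

```rust
/// A tiny map-or-string value, used only for this illustration.
#[derive(Debug, PartialEq)]
enum Value {
    Str(&'static str),
    Map(Vec<(&'static str, Value)>),
}

/// Field literal: the direct children with exactly this name.
fn field<'a>(value: &'a Value, name: &str) -> Vec<&'a Value> {
    match value {
        Value::Map(fields) => fields
            .iter()
            .filter(|(key, _)| *key == name)
            .map(|(_, child)| child)
            .collect(),
        Value::Str(_) => Vec::new(),
    }
}

/// Wildcard (*): every direct child of a map, one level deep.
fn wildcard(value: &Value) -> Vec<&Value> {
    match value {
        Value::Map(fields) => fields.iter().map(|(_, child)| child).collect(),
        Value::Str(_) => Vec::new(),
    }
}

/// Recursive (~) followed by a field: try the field at this level
/// (an assumption), then at every descendant, depth first.
fn recursive<'a>(value: &'a Value, name: &str) -> Vec<&'a Value> {
    let mut found = field(value, name);
    for child in wildcard(value) {
        found.extend(recursive(child, name));
    }
    found
}

/// Equivalent in spirit to the path hosts.*.name
fn hosts_wildcard_name(doc: &Value) -> Vec<&Value> {
    field(doc, "hosts")
        .into_iter()
        .flat_map(wildcard)
        .flat_map(|child| field(child, "name"))
        .collect()
}

fn sample() -> Value {
    Value::Map(vec![(
        "hosts",
        Value::Map(vec![
            ("a", Value::Map(vec![("name", Value::Str("alpha")), ("ip", Value::Str("10.0.0.1"))])),
            ("b", Value::Map(vec![
                ("name", Value::Str("beta")),
                ("meta", Value::Map(vec![("name", Value::Str("nested"))])),
            ])),
        ]),
    )])
}

fn main() {
    let doc = sample();
    println!("{:?}", hosts_wildcard_name(&doc)); // stops one level below hosts
    println!("{:?}", recursive(&doc, "name")); // also finds the nested name
}
```

In this sketch the wildcard path finds alpha and beta, while the recursive search additionally reaches the name nested under meta.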

Query Features

dapt has query features, which are defined in the query module. You can use the Select struct to aggregate data, the Filter struct to filter data, or the Query struct to run SQL-like queries.

Query

I could tease you with filters and aggregations first, but let's just show off what we can do. Let's say you have data coming in with a structure like the following:

{
  "tickets_purchased": "3",
  "state": "NY",
  "name": "John Doe",
  "purchase_date": "2021-01-01"
}

You could then write a query such as:

SELECT
  sum("tickets_purchased") as "total.tickets",
  count() as "total.purchases",
  "name"
WHERE
  "state" IN ['NY', 'CA']
HAVING "total.tickets" > 20
GROUP BY "name"
ORDER BY "total.tickets" DESC
TOP 3

This query would return something similar to:

[
  {
    "total": {
      "tickets": 30,
      "purchases": 10
    },
    "name": "John Doe"
  },
  {
    "total": {
      "tickets": 25,
      "purchases": 5
    },
    "name": "Jane Doe"
  },
  {
    "total": {
      "tickets": 21,
      "purchases": 5
    },
    "name": "John Smith"
  }
]

The interface for creating a query and collecting data is quite easy:

use dapt::query::Query;
use dapt::Dapt;

fn main() {
  let q = Query::new("SELECT sum(\"tickets_purchased\") as \"total.tickets\", count() as \"total.purchases\", \"name\" WHERE \"state\" IN ['NY', 'CA'] HAVING \"total.tickets\" > 20 GROUP BY \"name\" ORDER BY \"total.tickets\" DESC TOP 3");

  // load up some data
  let data: Vec<Dapt> = vec![
    serde_json::from_str(r#"{"tickets_purchased": "3", "state": "NY", "name": "John Doe", "purchase_date": "2021-01-01"}"#).unwrap(),
    serde_json::from_str(r#"{"tickets_purchased": "30", "state": "NY", "name": "John Doe", "purchase_date": "2021-01-01"}"#).unwrap(),
    serde_json::from_str(r#"{"tickets_purchased": "25", "state": "NY", "name": "Jane Doe", "purchase_date": "2021-01-01"}"#).unwrap(),
    serde_json::from_str(r#"{"tickets_purchased": "20", "state": "NY", "name": "John Smith", "purchase_date": "2021-01-01"}"#).unwrap(),
  ];

  for d in data {
    q.process(d);
  }

  let results = q.results();
}
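To show what the query computes for the four sample records above, here is a plain-Rust sketch of the same filter, grouping, and ordering steps. It does not use dapt. The Purchase struct, the Row alias, and the top_buyers function are hypothetical, the purchase_date field is left out because the query never reads it, and skipping unparsable ticket counts is an assumption rather than dapt's documented behavior.

```rust
use std::collections::HashMap;

/// One input record, mirroring the fields the query reads.
struct Purchase {
    tickets_purchased: &'static str,
    state: &'static str,
    name: &'static str,
}

/// One output row: (name, total tickets, total purchases).
type Row = (String, u64, u64);

fn top_buyers(purchases: &[Purchase]) -> Vec<Row> {
    // WHERE "state" IN ['NY', 'CA'], then GROUP BY "name" with sum and count.
    let mut groups: HashMap<&str, (u64, u64)> = HashMap::new();
    for p in purchases.iter().filter(|p| ["NY", "CA"].contains(&p.state)) {
        // Ticket counts arrive as strings in the sample data.
        if let Ok(tickets) = p.tickets_purchased.parse::<u64>() {
            let entry = groups.entry(p.name).or_insert((0, 0));
            entry.0 += tickets;
            entry.1 += 1;
        }
    }

    let mut rows: Vec<Row> = groups
        .into_iter()
        .filter(|(_, (tickets, _))| *tickets > 20) // HAVING "total.tickets" > 20
        .map(|(name, (tickets, count))| (name.to_string(), tickets, count))
        .collect();

    // ORDER BY "total.tickets" DESC; ties broken by name to keep output stable.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows.truncate(3); // TOP 3
    rows
}

fn sample() -> Vec<Purchase> {
    vec![
        Purchase { tickets_purchased: "3", state: "NY", name: "John Doe" },
        Purchase { tickets_purchased: "30", state: "NY", name: "John Doe" },
        Purchase { tickets_purchased: "25", state: "NY", name: "Jane Doe" },
        Purchase { tickets_purchased: "20", state: "NY", name: "John Smith" },
    ]
}

fn main() {
    for (name, tickets, purchases) in top_buyers(&sample()) {
        println!("{name}: {tickets} tickets over {purchases} purchases");
    }
}
```

With this data, John Doe totals 33 tickets over 2 purchases and Jane Doe totals 25 over 1. John Smith's total of 20 does not satisfy the strict > 20 condition, so he is dropped before ordering.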

Dependencies

~6–10MB
~183K SLoC