Using the PancakeDB API

A typical PancakeDB use case involves

  • human-driven management of PancakeDB tables with create_table.
  • writing incoming events or processed data as streams (up to 256 rows at a time) to PancakeDB with write_to_partition.
  • reading millions of rows (or more) at a time, often in a distributed manner, from PancakeDB for batch processing (e.g. Spark or Hive), training machine learning models, or exporting to data warehouses. This involves calling list_segments and then read_segment_column for each listed segment and desired column.

Importantly, PancakeDB expects to receive many small writes per second, which it answers with reliably low latency, and a few very large reads per second, which it answers with high throughput.

gRPC API specification

PancakeDB's main API protocol is gRPC. See service.proto for its service definition. Below is a description of what each API call does.

Alter Table

Modifies the configuration of an existing table, e.g. by adding columns. Fails if any of the added columns already exist.

Create Table

Creates a table.

The keys in the partitioning and columns objects are column names.

This call ensures the table exists, following the behavior requested by mode:

  • FAIL_IF_EXISTS works like most other databases' create table commands, returning an error if the table already exists.
  • OK_IF_EXACT will not give an error if the table already exists, as long as its schema is identical.
  • ADD_NEW_COLUMNS will not give an error if the table already exists, as long as the existing columns are a subset of the requested ones. Any new columns in the request will be added to the table. This is a declarative way to create a table.

In a successful response, already_exists indicates whether the table already existed prior to the request, and columns_added is a list of the column names that were added by the request.

Your schema is important. Partitioning is the only way to filter your data when reading it. For example, you may wish to partition by a timestamp truncated to the hour so that your users can quickly select all rows falling within a time range. You may partition by multiple fields, but keep in mind that each partition should ultimately be large (>100k rows) to be efficient.
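
For instance, in the JSON schema notation used by the REST create_table route below, a table partitioned by an hour-truncated timestamp and a string dimension might look like the following sketch. The names event_hour, country, user_id, and payload are made up for illustration, nestedListDepth of 0 is assumed to denote a flat (non-nested) column, and the hour truncation itself would be done by the writer when choosing the partition value.

{
  "partitioning": {
    "event_hour": {"dtype": "TIMESTAMP_MINUTE"},
    "country": {"dtype": "STRING"}
  },
  "columns": {
    "user_id": {"dtype": "INT64", "nestedListDepth": 0},
    "payload": {"dtype": "STRING", "nestedListDepth": 0}
  }
}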

Previous data architectures typically required partitioning by a timestamp at hourly or daily granularity to fit with the compaction schedule and make sure each partition has a moderate amount of data. With PancakeDB, table design is liberated from those restrictions, and the developer can choose partitioning to match their filter query pattern instead.

There is a limit of 255 columns per table. If you find yourself needing more than that, it's probably time to rethink your data model.

Delete from Segment

This deletes rows from a segment by row ID. You can determine which row IDs you want to delete by scanning the built-in _row_id column when you read a segment.

Deleted rows are not immediately removed from disk. They persist until the segment is next compacted, which could be up to a day later. It is therefore necessary to call read_segment_deletions whenever reading a segment in order to get an accurate view of the data.

Drop Table

Deletes a table and all its data.

Caution: dropping a table is not reversible. It actually deletes your data on disk.

Get Schema

Returns the schema of an existing table.

List Segments

This lists all segments (large groups of rows) in the table that match all the specified partition filters. Additional metadata about each segment is only returned if requested, because retrieving the metadata is a bit slower.

List Tables

Returns a list of all tables.

Read Segment Column

This is how you read columnar data from PancakeDB. It returns data as a stream of encoded bytes, so you'll typically want to use a client library to decode the data.

If codec is included in the response, it means the byte data following the JSON blob is compressed; otherwise it is uncompressed. The data can be decoded with the PancakeDB core library.

Either a correlation ID (for the first request for the segment column) or a continuation token (for each following request) is required. You should use the same correlation ID you are using for the segment's deletions (see below) and other columns.

Read Segment Deletions

Returns compressed data for a list of booleans representing whether each row is deleted. You must use the same correlation ID you are using for reading the segment's columns.

This route is also hard to use directly, so it is better to use a client library's "decode" functionality.

If you read a segment's columns without its deletions, you will have an incorrect view of the data if any recent deletions have happened. The deletion data is very lightweight, so this step should not be skipped.

Write to Partition

Write rows to a partition of a PancakeDB table.

This is the main way to get data into PancakeDB. Each write_to_partition request may contain up to 256 rows. If you need to write to multiple partitions at once, you must send multiple requests.

REST API specification

PancakeDB also supports certain routes via HTTP+JSON for ease of use and demonstration purposes.

Example curl:

curl \
  -XGET \
  -H "Content-Type: application/json" \
  $IP:$PORT/rest/list_segments \
  -d '{"tableName": "'$TABLE_NAME'"}'

Create Table

POST /rest/create_table

request body format:

{
  "tableName": "...",
  "schema": {
    "partitioning": {
      "...": {
        "dtype": "STRING" | "INT64" | "BOOL" | "TIMESTAMP_MINUTE"
      },
      ...
    },
    "columns": {
      "...": {
        "dtype": "STRING" | "INT64" | "BOOL" | "BYTES" | "FLOAT32" | "FLOAT64" | "TIMESTAMP_MICROS",
        "nestedListDepth": int
      },
      ...
    }
  },
  "mode": "FAIL_IF_EXISTS" | "OK_IF_EXACT" | "ADD_NEW_COLUMNS"
}

response body format:

{
  "already_exists": bool,
  "columns_added": ["...", ...]
}
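
For example, a create_table request for a hypothetical table holding the columns used in the write_to_partition example below might look like this, reusing the curl conventions from the example above. The table name, the event_hour partition field, the mode, and the nestedListDepth values (assumed 0 for flat columns, 1 for a nested list of strings) are illustrative assumptions.

curl \
  -XPOST \
  -H "Content-Type: application/json" \
  $IP:$PORT/rest/create_table \
  -d '{
    "tableName": "'$TABLE_NAME'",
    "schema": {
      "partitioning": {
        "event_hour": {"dtype": "TIMESTAMP_MINUTE"}
      },
      "columns": {
        "my_int_col": {"dtype": "INT64", "nestedListDepth": 0},
        "my_nested_string_col": {"dtype": "STRING", "nestedListDepth": 1},
        "my_timestamp_col": {"dtype": "TIMESTAMP_MICROS", "nestedListDepth": 0}
      }
    },
    "mode": "OK_IF_EXACT"
  }'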

Drop Table

POST /rest/drop_table

request body format:

{
  "tableName": "..."
}

response body format:

{}
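
A matching curl, following the same conventions as the example above, might be as follows; keep in mind that this irreversibly deletes the table's data:

curl \
  -XPOST \
  -H "Content-Type: application/json" \
  $IP:$PORT/rest/drop_table \
  -d '{"tableName": "'$TABLE_NAME'"}'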

List Tables

GET /rest/list_tables

request body format:

{}

response body format:

{
  "tables": [
    {"tableName": "..."},
    ...
  ]
}

Write to Partition

POST /rest/write_to_partition

request body format:

{
  "tableName": "...",
  "partition": {
    "...": <value>,
    ...
  },
  "rows": [
    {
      "...": <value>,
    },
    ...
  ]
}

response body format:

{}

For example, a valid row would be

{
  "my_int_col": 33,
  "my_nested_string_col": ["foo", "bar"],
  "my_timestamp_col": "2022-01-01T00:00:00Z"
}
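
Putting the request body format and this example row together, a curl against a hypothetical table partitioned by an event_hour timestamp (as in the create_table example above) might look like the sketch below. The partition key and the format of its value are assumptions; they must match the table's actual partitioning schema.

curl \
  -XPOST \
  -H "Content-Type: application/json" \
  $IP:$PORT/rest/write_to_partition \
  -d '{
    "tableName": "'$TABLE_NAME'",
    "partition": {
      "event_hour": "2022-01-01T00:00:00Z"
    },
    "rows": [
      {
        "my_int_col": 33,
        "my_nested_string_col": ["foo", "bar"],
        "my_timestamp_col": "2022-01-01T00:00:00Z"
      }
    ]
  }'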
