2 releases

0.1.1 Feb 23, 2023
0.1.0 Feb 23, 2023

#36 in #query-engine

MIT/Apache

515KB
12K SLoC

Rust 9K SLoC // 0.0% comments TypeScript 2K SLoC // 0.1% comments GraphQL 531 SLoC // 0.4% comments

chaindexer

build status tests status chat

The open source indexer and query engine for blockchain data.

Installation

Cargo is required to build/install (rustup is an easy way to install it). Then from the root of this repo, run

cargo install --path .

Quickstart

The query engine supports querying directly from an RPC node (i.e. without pre-indexing the entire chain). To spin up a SQL REPL and start querying data from your RPC, you can simply run:

ETH_RPC_API=<your rpc url> chaindexer sql

Queries

Running queries and building indices both require an RPC api that you can connect to. But it is possible to query chain data without building an index ahead of time. This can be done as long as the queries ran are highly selective in terms of block numbers accessed. In other words: since on-chain data is range partitioned by block number, queries must only access a specific range of block numbers--otherwise the entire chain state would have to be indexed (see next section for how to do that.)

For example, a valid query would be something like:

select * from eth.logs l
join eth.blocks b
on b.hash = l.block_hash
where b.number >= (eth_current_block() - 10) -- 10 most recent blocks

Predicates that can be evaluated on block numbers will always be pushed down to the table scan. As long as queries have predicates like this that can be pushed down so that only a small subset of block numbers are needed to be retrieved, you should be able to query data directly via the RPC api (i.e. without a full index).

You can also have the query engine automatically filter out all but the last n most recent blocks by specifying. Running:

chaindexer sql --last-n-blocks 10

select * from eth.logs l
join eth.blocks b
on b.hash = l.block_hash;

is equivalent to the query shown previously. For doing interactive querying without doing a pre-index, using the last-n-blocks option is highly recommended.

NOTE: Other SQL interfaces

A JDBC compatible server is coming soon (this would allow you to use this in something like PyCharm or any other database client that accepts JDBC).

Available schemas/tables

Schemas, tables and each table's columns can be shown via SQL:

SHOW TABLES; -- shows available tables, what tables are available will depend on your config
DESCRIBE eth.blocks; -- describe a single table

For more info on the CLI commands, run: chaindexer help

Indexing

To query the full chain, an index for that chain must be constructed and saved either on disk, or on a blob store (currently IPFS, and S3). An index is essentially just a file that contains mappings for tables and block range to the location of files containing the raw data for that table/block range. Each index is backed by a single storage config, which specifies the location of the index file as well as where to place the actual data partitions.

Config

You'll probably want a config file to specify your storage layer before building an index:

chaindexer config

This will create a TOML file that you can edit. For example, to use S3 (other configuration examples can be seen in the source code at src/storage/conf.rs, datatype is named StorageConf) as your storage layer, you can open up the config file at ~/.chaindexer/config.toml and change the stores.eth section (s3 credentials are read from the environment):

[stores.eth]
type="s3"
bucket="<your s3 bucket>"
# s3 key prefix for where all partition files will be placed
prefix="/prefix"
# the root index will be stored here. the index file contains mappings to the partition
# files which will be loaded into prefix.
# this file will also be under prefix, so if your prefix is `/prefix`, the full s3 key
# of the root index file would be `/prefix/eth.db`
filename="eth.db"

Each chain has its own storage configuration, which is why the storage layer for eth is specified under stores.eth. If you'd rather store data on disk, the storage layer in the default generated config file should be fine.

Building an index

Now with an eth store layer specified, we can run the index command to start building the ethereum index. Currently running index indexes the raw chain data only (we are actively working to move some of our custom data pipelines into this open source project).

chaindexer index --chain eth

Now non-selective queries can be run (if the partition index is not fully built, it will just treat the blocks that is built for as the only blocks that exist.)

Index time

We recognize that indexing an entire chain just to run queries is unrealistic for most users. We are currently working to put our pre-indexed ethereum onto IPFS, so that other users can use the query engine without needing to index the entire chain.

Tech overview

Chain data is indexed and then persisted using Apache Parquet on various storage layers. The query engine is leverages DataFusion, a highly flexible and extensible in-memory query engine which uses Apache Arrow as its in-memory columnar data format.

Each index is a SQLite database that describes to the query engine how to access the raw data during each table scan. For example, an ethereum index would have a row in the SQLite db mapping from the blocks table, for block range 0 - 1,000,000 to a parquet file on S3 (or other storage layers, including your disk). That parquet file would have 1,000,000 rows, representing all the blocks in that range.

Dependencies

~97MB
~2M SLoC