4 releases (2 breaking)

0.3.2	Jan 15, 2024
0.3.0	Dec 1, 2023
0.2.0	Aug 13, 2023
0.1.0	Jul 6, 2023

MIT/Apache

❄️🧊 cryo 🧊❄️

cryo is the easiest way to extract blockchain data to parquet, csv, json, or a python dataframe.

cryo is also extremely flexible, with many options to control how data is extracted + filtered + formatted.

cryo is an early WIP; please report bugs + feedback to the issue tracker.

note that cryo's default settings will slam a node too hard for use with 3rd party RPC providers. Instead, --requests-per-second and --max-concurrent-requests should be used to impose ratelimits. Such settings will be handled automatically in a future release.
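
As a minimal sketch of the advice above, the snippet below assembles a rate-limited cryo invocation as an argument list that could be passed to `subprocess.run`. The two ratelimit flags come from cryo's help output; the helper name `build_cryo_command` and the limit values (50 requests per second, 10 concurrent requests) are illustrative assumptions, not recommended settings.

```python
import shlex


def build_cryo_command(datatype, blocks, requests_per_second, max_concurrent_requests):
    """Assemble a rate-limited cryo command as an argument list (hypothetical helper)."""
    return [
        "cryo",
        datatype,
        "--blocks", blocks,
        "--requests-per-second", str(requests_per_second),
        "--max-concurrent-requests", str(max_concurrent_requests),
    ]


# The limit values are placeholders; choose ones that fit your provider's quota.
command = build_cryo_command("logs", "16M:17M", 50, 10)
print(shlex.join(command))
```

The right limits depend on the RPC provider's plan, so treat the numbers as a starting point to tune.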

to discuss cryo, check out the telegram group

Example Usage
Installation
Data Schemas
Code Guide
Documentation
1. Basics
2. Syntax
3. Datasets

Example Usage

use as cryo <dataset> [OPTIONS]

Example	Command
Extract all logs from block 16,000,000 to block 17,000,000	`cryo logs -b 16M:17M`
Extract blocks, transactions, or traces missing from current directory	`cryo blocks txs traces`
Extract to csv instead of parquet	`cryo blocks txs traces --csv`
Extract only certain columns	`cryo blocks --columns number timestamp`
Dry run to view output schemas or expected work	`cryo storage_diffs --dry`
Extract all USDC events	`cryo logs --contract 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48`

For a more complex example, see the Uniswap Example.

cryo uses ETH_RPC_URL env var as the data source unless --rpc <url> is given
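
The precedence described above can be sketched in a few lines of Python. This is not cryo's own code: `resolve_rpc_url` is a hypothetical helper, the URLs are placeholders, and raising an error when neither source is set is an assumption (the text above does not say what cryo does in that case).

```python
import os


def resolve_rpc_url(cli_rpc=None, environ=None):
    """Return the --rpc value if given, else ETH_RPC_URL (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    if cli_rpc:
        return cli_rpc
    url = environ.get("ETH_RPC_URL")
    if not url:
        # Assumed behavior: fail loudly when no data source is configured.
        raise ValueError("no RPC url: pass --rpc <url> or set ETH_RPC_URL")
    return url


# Placeholder URLs, not real endpoints.
env = {"ETH_RPC_URL": "http://localhost:8545"}
from_env = resolve_rpc_url(environ=env)
from_flag = resolve_rpc_url("http://example.invalid:8545", environ=env)
print(from_env, from_flag)
```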

Installation

The simplest way to use cryo is as a cli tool:

Method 1: install from source

git clone https://github.com/paradigmxyz/cryo
cd cryo
cargo install --path ./crates/cli

This method requires having rust installed. See rustup for instructions.

Method 2: install from crates.io

cargo install cryo_cli

This method requires having rust installed. See rustup for instructions.

Make sure that ~/.cargo/bin is on your PATH. One way to do this is by adding the line export PATH="$HOME/.cargo/bin:$PATH" to your ~/.bashrc or ~/.profile.
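
If the `cryo` command is not found after installing, a quick check like the one below can confirm whether `~/.cargo/bin` is on your PATH. This is a small sketch; `cargo_bin_on_path` is a hypothetical helper name, and it only compares path entries literally (it does not resolve symlinks).

```python
import os
from pathlib import Path


def cargo_bin_on_path(path_value, home):
    """Return True if <home>/.cargo/bin is one of the entries in a PATH string."""
    cargo_bin = Path(home) / ".cargo" / "bin"
    entries = [Path(entry) for entry in path_value.split(os.pathsep) if entry]
    return cargo_bin in entries


if __name__ == "__main__":
    ok = cargo_bin_on_path(os.environ.get("PATH", ""), Path.home())
    print("~/.cargo/bin on PATH:", ok)
```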

Python Installation

cryo can also be installed as a python package:

Installing `cryo` python from pypi

(make sure rust is installed first, see rustup)

pip install maturin
pip install cryo

Installing `cryo` python from source

pip install maturin
git clone https://github.com/paradigmxyz/cryo
cd cryo/crates/python
maturin build --release
pip install --force-reinstall <OUTPUT_OF_MATURIN_BUILD>.whl

Data Schemas

Many cryo cli options will affect output schemas by adding/removing columns or changing column datatypes.

cryo will always print out data schemas before collecting any data. To view these schemas without collecting data, use --dry to perform a dry run.

Schema Design Guide

An attempt is made to ensure that the dataset schemas conform to a common set of design guidelines:

- By default, rows should contain enough information in their columns to be order-able (unless the rows do not have an intrinsic order).
- Columns should usually be named by their JSON-RPC or ethers.rs defaults, except in cases where a much more explicit name is available.
- To make joins across tables easier, a given piece of information should use the same datatype and column name across tables when possible.
- Large ints such as u256 should allow multiple conversions. A value column of type u256 should allow: value_binary, value_string, value_f32, value_f64, value_u32, value_u64, and value_d128. These types can be specified at runtime using the --u256-types argument.
- Columns related to non-identifying cryptographic signatures are omitted by default. For example, state_root of a block or v/r/s of a transaction.
- Integer values that can never be negative should be stored as unsigned integers.
- Every table should allow a chain_id column so that data from multiple chains can be easily stored in the same table.
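
To illustrate the u256 guideline above, here is a rough Python sketch that expands one u256 value into the three conversions listed as defaults in the help output (binary, string, f64). The column suffixes follow the naming in the guide; the helper `u256_columns` is hypothetical, and the 32-byte big-endian layout for the binary column is an assumption rather than something stated in this document.

```python
def u256_columns(name, value):
    """Expand one u256 value into several typed columns (illustrative only)."""
    if not 0 <= value < 2**256:
        raise ValueError("value does not fit in u256")
    return {
        f"{name}_binary": value.to_bytes(32, "big"),  # assumed: 32-byte big-endian
        f"{name}_string": str(value),  # exact decimal representation
        f"{name}_f64": float(value),  # convenient but lossy for large values
    }


row = u256_columns("value", 10**18)
print(row["value_string"], row["value_f64"])

# Large values lose precision as f64: both of these map to the same float.
lossy = float(2**256 - 1) == float(2**256 - 2)
print(lossy)
```

Keeping the string or binary column alongside the float is what makes the lossy conversion safe to use for quick analysis.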

Standard types across tables:

block_number: u32
transaction_index: u32
nonce: u32
gas_used: u64
gas_limit: u64
chain_id: u64
timestamp: u32

JSON-RPC

cryo currently obtains all of its data using the JSON-RPC protocol standard.

dataset	blocks per request	results per block	method
Blocks	1	1	`eth_getBlockByNumber`
Transactions	1	multiple	`eth_getBlockByNumber`, `eth_getBlockReceipts`, `eth_getTransactionReceipt`
Logs	multiple	multiple	`eth_getLogs`
Contracts	1	multiple	`trace_block`
Traces	1	multiple	`trace_block`
State Diffs	1	multiple	`trace_replayBlockTransactions`
Vm Traces	1	multiple	`trace_replayBlockTransactions`
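
For readers unfamiliar with JSON-RPC, the sketch below builds the request body for the `eth_getBlockByNumber` method from the table above. The payload shape follows the standard Ethereum JSON-RPC format; the helper name `block_request` is hypothetical, and passing `False` for the second parameter (return transaction hashes rather than full transaction objects) is an illustrative choice, not necessarily what cryo sends.

```python
import json


def block_request(block_number, request_id=1):
    """Build an eth_getBlockByNumber JSON-RPC request body."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        # params: block number as a hex quantity, then a flag for full transaction objects
        "params": [hex(block_number), False],
    }


payload = block_request(16_000_000)
print(json.dumps(payload))
```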

cryo uses ethers.rs to perform JSON-RPC requests, so it can be used with any chain that ethers.rs is compatible with. This includes Ethereum, Optimism, Arbitrum, Polygon, BNB, and Avalanche.

A future version of cryo will be able to bypass JSON-RPC and query node data directly.

Code Guide

Code is arranged into the following crates:
- cryo_cli: convert textual data into cryo function calls
- cryo_freeze: core cryo code
- cryo_python: cryo python adapter
- cryo_to_df: procedural macro for generating dataset definitions

Do not use panics (including panic!, todo!, unwrap(), and expect()) except in the following circumstances: tests, build scripts, lazy static blocks, and procedural macros.

cryo help

(output of cryo help)

cryo extracts blockchain data to parquet, csv, or json

Usage: cryo [OPTIONS] [DATATYPE]...

Arguments:
  [DATATYPE]...  datatype(s) to collect, use cryo datasets to see all available

Options:
      --remember    Remember current command for future use
  -v, --verbose     Extra verbosity
      --no-verbose  Run quietly without printing information to stdout
  -h, --help        Print help
  -V, --version     Print version

Content Options:
  -b, --blocks <BLOCKS>...           Block numbers, see syntax below
      --timestamps <TIMESTAMPS>...   Timestamp numbers in unix, overridden by blocks
  -t, --txs <TXS>...                 Transaction hashes, see syntax below
  -a, --align                        Align chunk boundaries to regular intervals,
                                     e.g. (1000 2000 3000), not (1106 2106 3106)
      --reorg-buffer <N_BLOCKS>      Reorg buffer, save blocks only when this old,
                                     can be a number of blocks [default: 0]
  -i, --include-columns [<COLS>...]  Columns to include alongside the defaults,
                                     use `all` to include all available columns
  -e, --exclude-columns [<COLS>...]  Columns to exclude from the defaults
      --columns [<COLS>...]          Columns to use instead of the defaults,
                                     use `all` to use all available columns
      --u256-types <U256_TYPES>...   Set output datatype(s) of U256 integers
                                     [default: binary, string, f64]
      --hex                          Use hex string encoding for binary columns
  -s, --sort [<SORT>...]             Columns(s) to sort by, `none` for unordered
      --exclude-failed               Exclude items from failed transactions

Source Options:
  -r, --rpc <RPC>                    RPC url [default: ETH_RPC_URL env var]
      --network-name <NETWORK_NAME>  Network name [default: name of eth_getChainId]

Acquisition Options:
  -l, --requests-per-second <limit>  Ratelimit on requests per second
      --max-retries <R>              Max retries for provider errors [default: 5]
      --initial-backoff <B>          Initial retry backoff time (ms) [default: 500]
      --max-concurrent-requests <M>  Global number of concurrent requests
      --max-concurrent-chunks <M>    Number of chunks processed concurrently
      --chunk-order <CHUNK_ORDER>    Chunk collection order (normal, reverse, or random)
  -d, --dry                          Dry run, collect no data

Output Options:
  -c, --chunk-size <CHUNK_SIZE>      Number of blocks per file [default: 1000]
      --n-chunks <N_CHUNKS>          Number of files (alternative to --chunk-size)
      --partition-by <PARTITION_BY>  Dimensions to partition by
  -o, --output-dir <OUTPUT_DIR>      Directory for output files [default: .]
      --subdirs <SUBDIRS>...         Subdirectories for output files
                                     can be `datatype`, `network`, or custom string
      --label <LABEL>                Label to add to each filename
      --overwrite                    Overwrite existing files instead of skipping
      --csv                          Save as csv instead of parquet
      --json                         Save as json instead of parquet
      --row-group-size <GROUP_SIZE>  Number of rows per row group in parquet file
      --n-row-groups <N_ROW_GROUPS>  Number of rows groups in parquet file
      --no-stats                     Do not write statistics to parquet files
      --compression <NAME [#]>...    Compression algorithm and level [default: lz4]
      --report-dir <REPORT_DIR>      Directory to save summary report
                                     [default: {output_dir}/.cryo/reports]
      --no-report                    Avoid saving a summary report

Dataset-specific Options:
      --address <ADDRESS>...         Address(es)
      --to-address <address>...      To Address(es)
      --from-address <address>...    From Address(es)
      --call-data <CALL_DATA>...     Call data(s) to use for eth_calls
      --function <FUNCTION>...       Function(s) to use for eth_calls
      --inputs <INPUTS>...           Input(s) to use for eth_calls
      --slot <SLOT>...               Slot(s)
      --contract <CONTRACT>...       Contract address(es)
      --topic0 <TOPIC0>...           Topic0(s) [aliases: event]
      --topic1 <TOPIC1>...           Topic1(s)
      --topic2 <TOPIC2>...           Topic2(s)
      --topic3 <TOPIC3>...           Topic3(s)
      --event-signature <SIG>...     Event signature for log decoding
      --inner-request-size <BLOCKS>  Blocks per request (eth_getLogs) [default: 1]
      --js-tracer <tracer>           Event signature for log decoding

Optional Subcommands:
      cryo help                      display help message
      cryo help syntax               display block + tx specification syntax
      cryo help datasets             display list of all datasets
      cryo help <DATASET(S)>         display info about a dataset

cryo syntax

(output of cryo help syntax)

Block specification syntax
- can use numbers                    --blocks 5000 6000 7000
- can use ranges                     --blocks 12M:13M 15M:16M
- can use a parquet file             --blocks ./path/to/file.parquet[:COLUMN_NAME]
- can use multiple parquet files     --blocks ./path/to/files/*.parquet[:COLUMN_NAME]
- numbers can contain { _ . K M B }  5_000 5K 15M 15.5M
- omitting range end means latest    15.5M: == 15.5M:latest
- omitting range start means 0       :700 == 0:700
- minus on start means minus end     -1000:7000 == 6001:7001
- plus sign on end means plus start  15M:+1000 == 15M:15.001M
- can use every nth value            2000:5000:1000 == 2000 3000 4000
- can use n values total             100:200/5 == 100 124 149 174 199

Timestamp specification syntax
- can use numbers                    --timestamp 5000 6000 7000
- can use ranges                     --timestamp 12M:13M 15M:16M
- can use a parquet file             --timestamp ./path/to/file.parquet[:COLUMN_NAME]
- can use multiple parquet files     --timestamp ./path/to/files/*.parquet[:COLUMN_NAME]
- can contain { _ . m h d w M y }    31_536_000 525600m 8760h 365d 52.143w 12.17M 1y
- omitting range end means latest    15.5M: == 15.5M:latest
- omitting range start means 0       :700 == 0:700
- minus on start means minus end     -1000:7000 == 6001:7001
- plus sign on end means plus start  15M:+1000 == 15M:15.001M
- can use n values total             100:200/5 == 100 124 149 174 199

Transaction specification syntax
- can use transaction hashes         --txs TX_HASH1 TX_HASH2 TX_HASH3
- can use a parquet file             --txs ./path/to/file.parquet[:COLUMN_NAME]
                                     (default column name is transaction_hash)
- can use multiple parquet files     --txs ./path/to/ethereum__logs*.parquet

cryo datasets

(output of cryo help datasets)

cryo datasets
─────────────
- address_appearances
- balance_diffs
- balance_reads
- balances
- blocks
- code_diffs
- code_reads
- codes
- contracts
- erc20_balances
- erc20_metadata
- erc20_supplies
- erc20_transfers
- erc20_approvals
- erc721_metadata
- erc721_transfers
- eth_calls
- four_byte_counts (alias = 4byte_counts)
- geth_calls
- geth_code_diffs
- geth_balance_diffs
- geth_storage_diffs
- geth_nonce_diffs
- geth_opcodes
- javascript_traces (alias = js_traces)
- logs (alias = events)
- native_transfers
- nonce_diffs
- nonce_reads
- nonces
- slots (alias = storages)
- storage_diffs (alias = slot_diffs)
- storage_reads (alias = slot_reads)
- traces
- trace_calls
- transactions (alias = txs)
- vm_traces (alias = opcode_traces)

dataset group names
───────────────────
- blocks_and_transactions: blocks, transactions
- call_trace_derivatives: contracts, native_transfers, traces
- geth_state_diffs: geth_balance_diffs, geth_code_diffs, geth_nonce_diffs, geth_storage_diffs
- state_diffs: balance_diffs, code_diffs, nonce_diffs, storage_diffs
- state_reads: balance_reads, code_reads, nonce_reads, storage_reads

use cryo help <DATASET> to print info about a specific dataset
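
The group names above are shorthand for several datasets at once. The sketch below shows that expansion in Python, using the group memberships exactly as listed; the `expand_datasets` helper and its de-duplication behavior are assumptions for illustration, not a description of cryo's internals.

```python
DATASET_GROUPS = {
    "blocks_and_transactions": ["blocks", "transactions"],
    "call_trace_derivatives": ["contracts", "native_transfers", "traces"],
    "geth_state_diffs": [
        "geth_balance_diffs",
        "geth_code_diffs",
        "geth_nonce_diffs",
        "geth_storage_diffs",
    ],
    "state_diffs": ["balance_diffs", "code_diffs", "nonce_diffs", "storage_diffs"],
    "state_reads": ["balance_reads", "code_reads", "nonce_reads", "storage_reads"],
}


def expand_datasets(names):
    """Replace group names with their member datasets, keeping first-seen order."""
    expanded = []
    for name in names:
        for dataset in DATASET_GROUPS.get(name, [name]):
            if dataset not in expanded:
                expanded.append(dataset)
    return expanded


selected = expand_datasets(["state_diffs", "blocks", "balance_diffs"])
print(selected)
```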

`lib.rs`:

cryo_freeze extracts EVM data to parquet, csv, or json