120 releases (53 stable)
| Version | Release date |
|---|---|
| 3.6.1 (new) | Feb 3, 2026 |
| 3.5.0 | Jan 30, 2026 |
| 3.3.0 | Dec 22, 2025 |
| 3.1.0 | May 30, 2025 |
| 0.0.1 | Dec 17, 2020 |
Chewdata
This application is a lightweight ETL written in Rust, designed to act as a high-performance connector between heterogeneous systems. It focuses on parallelism, low overhead, and modular feature selection, making it suitable for both local workloads and cloud-native deployments.
| Feature | Values | Description |
|---|---|---|
| Generate data | - | Generate synthetic data for testing and development |
| Supported data formats | json [E], jsonl [E], csv [D], toml [D], xml [D], yaml [E], text [E], parquet [D] | Read and write multiple structured and semi-structured formats |
| Multiple connectors | mongodb [D], bucket [D], curl [D], psql [D], local [E], cli [E], inmemory [E] | Read, write, and clean data across different backends |
| Multiple HTTP auths | basic [D], bearer [D], jwt [D] | Authentication strategies for the curl connector |
| Data transformation | tera [E] | Transform data on the fly using templates |
| Configuration formats allowed | json [E], yaml [E], hjson [E] | Job definitions provided via versionable config files |
| Parallel / sequential reading | cursor [E], offset [E] | Flexible pagination strategies for data ingestion |
| Application Performance Monitoring (APM) | apm [D] | Export traces and metrics |
[E] - Enabled by default; disable with --no-default-features
[D] - Disabled by default; enable explicitly via --features
More useful information:
- Requires only rustup, with no external runtime dependencies
- No garbage collector
- Runs work in parallel
- Fully cross-platform
- Uses async/await for concurrent execution with zero-cost abstractions
- Highly parallel execution model
- Read multiple files concurrently from:
  - Local filesystem
  - S3 compatible bucket
- Query CSV / JSON / Parquet files using S3 Select
- Deployable as an AWS Lambda
- Configuration is easy to version and review
- Generate test data on the fly
- Built-in data validation
  - Separate streams for valid and invalid records
- Compile only what you need:
  - cargo build --no-default-features --features "toml,psql"
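To picture the "separate streams for valid and invalid records" item from the list above, here is a minimal sketch in plain Rust. It is not chewdata's implementation: the `Record` type, the `validate` rule, and the `split_streams` function are hypothetical names chosen only for this illustration.

```rust
/// Hypothetical record type, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    name: String,
    age: i64,
}

/// Hypothetical rule: a record is valid when its name is non-empty
/// and its age is not negative.
fn validate(record: &Record) -> Result<(), String> {
    if record.name.is_empty() {
        return Err("name must not be empty".to_string());
    }
    if record.age < 0 {
        return Err(format!("age must be >= 0, got {}", record.age));
    }
    Ok(())
}

/// Route each record to the valid stream or, together with the reason
/// it was rejected, to the invalid stream.
fn split_streams(records: Vec<Record>) -> (Vec<Record>, Vec<(Record, String)>) {
    let mut valid = Vec::new();
    let mut invalid = Vec::new();
    for record in records {
        match validate(&record) {
            Ok(()) => valid.push(record),
            Err(reason) => invalid.push((record, reason)),
        }
    }
    (valid, invalid)
}

fn main() {
    let records = vec![
        Record { name: "alice".to_string(), age: 30 },
        Record { name: String::new(), age: 5 },
        Record { name: "bob".to_string(), age: -1 },
    ];
    let (valid, invalid) = split_streams(records);
    println!("valid: {:?}", valid);
    println!("invalid: {:?}", invalid);
}
```

Keeping the rejection reason next to each invalid record makes it possible to write the invalid stream to its own destination for later inspection.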
Getting started
Setup from source code
Requirements:
- Rust
- Docker and Docker Compose for testing the code locally
Commands to execute:
git clone https://github.com/jmfiaschi/chewdata.git chewdata
cd chewdata
cp .env.dev .env
vim .env # Edit the .env file
just build
just test
If all the tests pass, the project is ready. Read the project's justfile (used by the just commands above) to see which shortcuts are available.
To explore the project through examples, see the ./examples directory.
Setup from cargo package
Default installation
This command installs the project with the default features, marked [E] in the feature table.
cargo install chewdata
With minimal features
If you only need to read and write JSON files, transform them, and store them in the local environment:
cargo install chewdata --no-default-features
With custom features
If you want to choose which features to add to your installation:
cargo install chewdata --no-default-features --features "xml,bucket"
Please refer to the features documentation.
How to change the log level
To change the log level of the command, you must select it at installation time through a feature:
cargo install chewdata --no-default-features --features "log-release-trace"
echo '{"field1":"value1"}' | RUST_LOG=trace chewdata '[{"type":"reader","document":{"type":"json"},"connector":{"type":"cli"}},{"type":"writer","document":{"type":"json"},"connector":{"type":"cli"}}]'
If you want to filter logs, you can use the directive syntax from tracing_subscriber.
echo '{"field1":"value1"}' | RUST_LOG=chewdata=trace chewdata '[{"type":"reader","document":{"type":"json"},"connector":{"type":"cli"}},{"type":"writer","document":{"type":"json"},"connector":{"type":"cli"}}]'
Run
First, check how the command works with the --help option:
chewdata --help
...
USAGE:
chewdata [OPTIONS] [JSON]
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-f, --file <FILE> Init steps with file configuration in input
ARGS:
<JSON> Init steps with a json/hjson configuration in input
Without configuration
The command can run without configuration. In that case the application waits for JSON data on standard input. By default, it writes the JSON data to the output and stops when you enter an empty value.
$ cargo run
$ [{"key":"value"},{"name":"test"}]
$ --enter--
[{"key":"value"},{"name":"test"}]
More examples without configuration, using a file as input:
$ cat ./data/multi_lines.json | cargo run
[{...}]
or
$ cat ./data/multi_lines.json | just run
[{...}]
or
$ cat ./data/multi_lines.json | chewdata
[{...}]
With configuration
The configuration is useful for customizing the list of steps. It supports the hjson format, which allows a richer syntax such as comments.
$ cat ./data/multi_lines.csv | cargo run --features "csv" '[{"type":"reader","document":{"type":"csv"}},{"type":"writer"}]'
[{...}] // Will transform the csv data into json format
or
$ cat ./data/multi_lines.csv | just run-with-json '[{"type":"reader","document":{"type":"csv"}},{"type":"writer"}]'
[{...}] // Will transform the csv data into json format
or
$ cat ./data/multi_lines.csv | chewdata '[{"type":"reader","document":{"type":"csv"}},{"type":"writer"}]'
[{...}] // Will transform the csv data into json format
Another example, with a configuration file passed as an argument:
$ echo '[{"type":"reader","connector":{"type":"cli"},"document":{"type":"csv"}},{"type":"writer"}]' > my_etl.conf.json
$ cat ./data/multi_lines.csv | cargo run --features "csv" -- --file my_etl.conf.json
[{...}]
or
$ echo '[{"type":"reader","connector":{"type":"cli"},"document":{"type":"csv"}},{"type":"writer"}]' > my_etl.conf.json
$ cat ./data/multi_lines.csv | just run-with-file my_etl.conf.json
[{...}]
or
$ echo '[{"type":"reader","connector":{"type":"cli"},"document":{"type":"csv"}},{"type":"writer"}]' > my_etl.conf.json
$ cat ./data/multi_lines.csv | chewdata --file my_etl.conf.json
[{...}]
PS: The JSON configuration file can be replaced by a YAML one.
Chain commands
It is possible to chain chewdata commands:
task_A=$(echo '{"variable": "a"}' | chewdata '[{"type":"r"},{"type":"transformer","actions":[{"field":"/","pattern":"{{ input | json_encode() }}"},{"field":"value","pattern":"10"}]},{"type":"w", "doc":{"type":"jsonl"}}]') &&\
task_B=$(echo '{"variable": "b"}' | chewdata '[{"type":"r"},{"type":"transformer","actions":[{"field":"/","pattern":"{{ input | json_encode() }}"},{"field":"value","pattern":"20"}]},{"type":"w", "doc":{"type":"jsonl"}}]') &&\
echo $task_A | CHEWDATA_VAR_B=$task_B chewdata '[{"type":"r"},{"type":"transformer","actions":[{"field":"var_b","pattern":"{{ get_env(name=\"VAR_B\") }}"},{"field":"result","pattern":"{{ output.var_b.value * input.value }}"},{"field":"var_b","type":"remove"}]},{"type":"w"}]'
[{"result":200}]
Apply custom environment variables
To inject an environment variable, prefix its name with CHEWDATA_. For example, CHEWDATA_VAR_B is read in the configuration as VAR_B.
The prefix is mandatory, both to avoid collisions with other variables and for security.
CHEWDATA_VAR_B=$task_B CHEWDATA_CURL_ENDPOINT=my_endpoint RUST_LOG=info chewdata '[{"type":"r"},{"type":"transformer","actions":[{"field":"var_b","pattern":"{{ get_env(name=\"VAR_B\") }}"},{"field":"result","pattern":"{{ output.var_b.value * input.value }}"},{"field":"var_b","type":"remove"}]},{"type":"w", "connector": {"type": "curl","endpoint": "{{ CURL_ENDPOINT }}","path": "/post","method": "post"}}]'
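The prefix rule can be pictured with a small Rust sketch. It only mirrors the behaviour visible in the command above (CHEWDATA_VAR_B is read as VAR_B, CHEWDATA_CURL_ENDPOINT as CURL_ENDPOINT) and is not taken from chewdata's source. The `template_vars` function and the `PREFIX` constant are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Assumed prefix, inferred from the examples in this README.
const PREFIX: &str = "CHEWDATA_";

/// Keep only the variables that start with the prefix and expose them
/// under their name without the prefix. All other variables are ignored.
fn template_vars<I>(vars: I) -> BTreeMap<String, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter_map(|(key, value)| {
            key.strip_prefix(PREFIX)
                // Skip a variable named exactly "CHEWDATA_".
                .filter(|name| !name.is_empty())
                .map(|name| (name.to_string(), value))
        })
        .collect()
}

fn main() {
    // Print the variables of the current process that would be exposed.
    for (name, value) in template_vars(std::env::vars()) {
        println!("{name}={value}");
    }
}
```

Filtering on a dedicated prefix means unrelated variables such as PATH or cloud credentials are never exposed to the configuration by accident.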
How it works?
This program executes the steps defined in a configuration file that you provide in JSON or YAML format:
Example:
[
    {
        "type": "erase",
        "connector": {
            "type": "local",
            "path": "./my_file.out.csv"
        }
    },
    {
        "type": "reader",
        "connector": {
            "type": "local",
            "path": "./my_file.csv"
        }
    },
    {
        "type": "writer",
        "connector": {
            "type": "local",
            "path": "./my_file.out.csv"
        }
    },
    ...
]
These steps are executed in FIFO order.
All steps are linked together by an input and an output context queue.
When a step finishes handling data, a new context is created and sent to the output queue. The next step then handles this new context.
Step1(Context) -> Q1[Contexts] -> StepN(Context) -> QN[Contexts] -> StepN+1(Context)
Each step runs asynchronously. Each queue has a size limit that can be customized in the step's configuration.
See the step module for the list of available steps and their configuration. See the /examples folder for examples of how to build and use a configuration file.
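The chain of steps and bounded queues described above can be sketched with the Rust standard library. This is an illustration only, not chewdata's code: it uses threads and `std::sync::mpsc::sync_channel` as a stand-in for the asynchronous steps, and `Context`, `spawn_step`, and `run_pipeline` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical stand-in for the context passed between steps.
#[derive(Debug)]
struct Context {
    value: i64,
}

/// Start a step that applies `f` to every incoming context and pushes
/// the result to its own output queue, bounded to `limit` contexts.
fn spawn_step<F>(input: Receiver<Context>, limit: usize, f: F) -> Receiver<Context>
where
    F: Fn(Context) -> Context + Send + 'static,
{
    let (tx, rx) = sync_channel(limit);
    thread::spawn(move || {
        for ctx in input {
            // `send` blocks while the output queue is full, so a slow
            // downstream step slows down the steps before it.
            if tx.send(f(ctx)).is_err() {
                break; // the next step has stopped
            }
        }
        // `tx` is dropped here, which closes the queue and ends the next step.
    });
    rx
}

/// Step1 -> Q1 -> Step2 -> Q2, then collect what leaves the last queue.
fn run_pipeline(inputs: Vec<i64>) -> Vec<i64> {
    let (tx, rx) = sync_channel::<Context>(2);
    let producer = thread::spawn(move || {
        for value in inputs {
            tx.send(Context { value }).unwrap();
        }
    });
    let q1 = spawn_step(rx, 2, |c| Context { value: c.value + 1 });
    let q2 = spawn_step(q1, 2, |c| Context { value: c.value * 10 });
    let out: Vec<i64> = q2.into_iter().map(|c| c.value).collect();
    producer.join().unwrap();
    out
}

fn main() {
    println!("{:?}", run_pipeline(vec![1, 2, 3])); // prints [20, 30, 40]
}
```

Because every queue is FIFO and each step here processes one context at a time, the contexts leave the pipeline in the order they entered it.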
List of steps with the configurations
How to contribute?
Follow the GitHub flow.
Follow the semantic-release specification.
After code modifications, please run all tests.
just test
Useful links