40 releases (23 stable)

1.14.0 Feb 8, 2023
1.13.0 Jan 28, 2023
1.13.0-beta.2 Nov 27, 2022
1.13.0-beta.1 Jul 26, 2022
0.0.1 Dec 17, 2020

#221 in Encoding

150 downloads per month

MIT/Apache and maybe GPL-3.0

540KB
11K SLoC

Chewdata

This application is a lightweight ETL written in Rust that can be used as a connector between systems.

| Feature | Values | Description |
| --- | --- | --- |
| Generate data | - | Generate data for testing |
| Supported formats | json [E], jsonl [E], csv [D], toml [D], xml [D], yaml [E], text [E], parquet [D] | Read and write in these formats |
| Multi Connectors | mongodb [D], bucket [D], curl [D], psql [D], local [E], io [E], inmemory [E] | Read / Write / Clean data |
| Multi Http auths | basic [D], bearer [D], jwt [D] | Offer different ways to authenticate curl requests |
| Transform data | tera [E] | Transform the data on the fly |
| Configuration formats allowed | json [E], yaml [E] | The project needs a job configuration as input |
| Read data in parallel or sequential mode | cursor [E], offset [E] | With this type of paginator, the data can be read in different ways |
| Application Performance Monitoring (APM) | apm [D] | Send APM logs to Jaeger |

[E] - Feature enabled by default. Use the --no-default-features argument to disable all default features.

[D] - Feature disabled by default; enable it with the --features argument.

More useful information:

  • It needs only rustup
  • No garbage collector
  • Works in parallel
  • Cross-platform
  • Uses async/await for concurrent tasks with zero-cost abstractions
  • Read multiple files in parallel, locally or in a bucket
  • Search data in multiple csv/json/parquet files with S3 Select
  • Can be deployed to AWS Lambda
  • The configuration is easily versionable
  • Can generate data on the fly for testing purposes
  • Control and validate the data; handle bad and valid data in dedicated streams
  • Enable only the required features: --no-default-features --features "toml psql"
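
The feature flags above can be combined into a single build command. The following is a minimal sketch, assuming it is run from a checkout of the project; the variable names FEATURES and BUILD_CMD are illustrative only, and --release is the standard Cargo flag for an optimized build rather than something this project requires.

```shell
# Features to enable; "toml" and "psql" are the names used in the list above.
FEATURES="toml psql"

# Assemble the command first so it can be inspected before it runs.
BUILD_CMD="cargo build --release --no-default-features --features \"$FEATURES\""
echo "$BUILD_CMD"

# Run the build only inside a project checkout where cargo is installed.
if [ -f Cargo.toml ] && command -v cargo >/dev/null 2>&1; then
    eval "$BUILD_CMD"
fi
```

Features marked [E] that are still needed have to be listed explicitly once --no-default-features is used.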

Getting started

Setup from source code

Requirement:

Commands to execute:

git clone https://github.com/jmfiaschi/chewdata.git chewdata
cd chewdata
cp .env.dev .env
vim .env # Edit the .env file
make build
make unit-tests
make integration-tests

If all the tests pass, the project is ready. Read the Makefile to see which shortcuts are available.

For examples that help you discover this project, see the ./examples directory.

Run the ETL

If you run the program without parameters, the application waits for you to write JSON data on the standard input. By default, the program writes JSON data to the output and stops when you enter an empty value.

$ cargo run
$ [{"key":"value"},{"name":"test"}]
$ enter
[{"key":"value"},{"name":"test"}]

Another example, without an ETL configuration and with a file as input:

$ cat ./data/multi_lines.json | cargo run
[{...}]

or

$ cat ./data/multi_lines.json | make run
[{...}]

Another example, with a JSON ETL configuration passed as an argument:

$ cat ./data/multi_lines.csv | cargo run '[{"type":"reader","document":{"type":"csv"}},{"type":"writer"}]'
[{...}] // Will transform the csv data into json format

or

$ cat ./data/multi_lines.csv | make run json='[{\"type\":\"reader\",\"document\":{\"type\":\"csv\"}},{\"type\":\"writer\"}]'
[{...}] // Will transform the csv data into json format

Another example, with an ETL configuration file passed as an argument:

$ echo '[{"type":"reader","connector":{"type":"io"},"document":{"type":"csv"}},{"type":"writer"}]' > my_etl.conf.json
$ cat ./data/multi_lines.csv | cargo run -- --file my_etl.conf.json
[{...}]

or

$ echo '[{"type":"reader","connector":{"type":"io"},"document":{"type":"csv"}},{"type":"writer"}]' > my_etl.conf.json
$ cat ./data/multi_lines.csv | make run file=my_etl.conf.json
[{...}]
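
For longer pipelines, the one-line echo becomes hard to read. As a sketch, the same configuration can be written with a here-document, one step per entry. This assumes only that the configuration is standard JSON, where whitespace between tokens is not significant.

```shell
# Write the same reader/writer configuration as above, one step per entry.
cat > my_etl.conf.json <<'EOF'
[
  {
    "type": "reader",
    "connector": { "type": "io" },
    "document": { "type": "csv" }
  },
  { "type": "writer" }
]
EOF

# Strip spaces and newlines to compare the file with the one-line form.
tr -d ' \n' < my_etl.conf.json; echo
```

The file can then be passed to the program exactly as before, with --file my_etl.conf.json or file=my_etl.conf.json.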

Aliases and default values can be used to shorten the configuration:

$ echo '[{"type":"r","doc":{"type":"csv"}},{"type":"w"}]' > my_etl.conf.json
$ cat ./data/multi_lines.csv | make run file=my_etl.conf.json
[{...}]
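
Comparing the two configurations used above suggests how the short form is built: "r" stands for "reader", "doc" for "document" and "w" for "writer", and the "connector" entry is left out so that the default applies. The sketch below puts both forms side by side and prints their lengths. The variable names LONG and SHORT are illustrative, and the equivalence of the two forms is inferred from the examples in this section rather than from a documented list of aliases.

```shell
# Full form, as used in the file-configuration example.
LONG='[{"type":"reader","connector":{"type":"io"},"document":{"type":"csv"}},{"type":"writer"}]'

# Short form, using aliases and relying on the default connector.
SHORT='[{"type":"r","doc":{"type":"csv"}},{"type":"w"}]'

# Print the length of each configuration string.
echo "full form:  ${#LONG} characters"
echo "short form: ${#SHORT} characters"
```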

How to contribute

In progress...

After code modifications, please run all tests.

make test

Dependencies

~54–94MB
~2M SLoC