2 stable releases

1.3.0	Oct 14, 2024
1.2.0	Oct 13, 2024

#14 in #iceberg

Used in 8 crates

MIT license

16KB
115 lines

evolution

A robust, platform-agnostic, and highly efficient framework for converting old fixed-length files to future-proof targets suitable for analytics and data science.

The evolution project was created in response to the emerging need for a tool that can transform old fixed-length files, fully automatically, into data formats that integrate seamlessly with the modern data analytics landscape.

We utilize the native speed of Rust together with multithreading and SIMD techniques to efficiently transform your old fixed-length files (of any size!) to a more modern target. The only target currently implemented is parquet, but we aim to implement support for delta, iceberg, indradb, and more.

The project is structured as a monorepo that hosts all of the evolution framework components, each of which lives under crates/ as its own module. This modular design allows anyone to implement their own target converters that integrate seamlessly with the core framework's existing functionality.

Installation

The easiest way to install an evolution binary on your system with support for all implemented output targets is by using the Cargo package manager (which downloads it from this link). This binary can be found at examples/full in this repo.

cargo install evolution

(available features)
 - mock
 - nightly

Alternatively, you can build everything from source by cloning the repo and compiling with Cargo.

git clone https://github.com/firelink-data/evolution.git
cd evolution
cargo build --release

If you want to integrate any of the evolution crates into your own project, simply add them as dependencies in your project's Cargo.toml file, as you would any other third-party dependency:

[dependencies]
evolution-common = "1.2.0"
evolution-schema = "1.2.0"

Schema setup

Automatic file conversion requires a valid schema that specifies the structure of the source file you want to convert. A valid schema, in this context, is a JSON file which adheres to this template. If you are unsure whether your own schema file is valid according to the template, you can use this validator tool.

An example schema can be found here, and if you are unsure about valid values for datatypes, alignment modes, and padding symbols, please refer to the template which lists all valid values. For specifics on all the currently supported padding modes, characters, and default values, please see the padder crate (which we also maintain).
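To make the ideas of column length, alignment mode, and padding symbol concrete, here is a minimal sketch of how one fixed-length line can be split into values. It is an illustration only: the `Column` and `Alignment` types and the `parse_line` function are hypothetical, are not part of the evolution or padder crates, and do not reflect the actual schema template. The sketch also assumes ASCII input, so that column lengths in bytes and in characters coincide.

```rust
/// How a value is aligned inside its fixed-width slot.
/// Hypothetical type for illustration; not taken from the evolution crates.
#[derive(Clone, Copy)]
enum Alignment {
    Left,
    Right,
}

/// Hypothetical column description: name, width in bytes, alignment, and padding symbol.
struct Column {
    name: &'static str,
    length: usize,
    alignment: Alignment,
    pad: char,
}

/// Split one fixed-length line into (column name, value) pairs and strip the padding.
/// Returns `None` if the line is too short or a slice would cut a multi-byte character.
fn parse_line<'a>(line: &'a str, columns: &[Column]) -> Option<Vec<(&'static str, &'a str)>> {
    let mut fields = Vec::with_capacity(columns.len());
    let mut offset = 0;
    for col in columns {
        let raw = line.get(offset..offset + col.length)?;
        let value = match col.alignment {
            // Left-aligned values are padded on the right.
            Alignment::Left => raw.trim_end_matches(col.pad),
            // Right-aligned values are padded on the left.
            Alignment::Right => raw.trim_start_matches(col.pad),
        };
        fields.push((col.name, value));
        offset += col.length;
    }
    Some(fields)
}

fn main() {
    let columns = [
        Column { name: "id", length: 5, alignment: Alignment::Right, pad: '0' },
        Column { name: "name", length: 8, alignment: Alignment::Left, pad: ' ' },
    ];
    let fields = parse_line("00042Alice   ", &columns).expect("line shorter than schema");
    for (name, value) in fields {
        println!("{name} = {value}");
    }
}
```

A real schema carries more information than this (datatypes, for instance), so treat the template as the authoritative description of what a schema may contain.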

Quick start

If you installed the program as explained above, running the binary without any arguments prints the following usage message:

Efficiently evolve your old fixed-length data files into modern file formats. 

Usage: evolution.exe [OPTIONS] <COMMAND>

Commands:
  convert  Convert a fixed-length file to another file format
  mock     Generate mocked fixed-length files
  help     Print this message or the help of the given subcommand(s)

Options:
  -N, --n-threads <N_THREADS>
          Enable multithreading and set the number of threads (logical cores) to use [default: 1]
  -C, --thread-channel-capacity <THREAD_CHANNEL_CAPACITY>
          The maximum capacity of the thread channel (in number of messages) [default: 32]
  -R, --read-buffer-size <READ_BUFFER_SIZE>
          The size of the read buffer used when converting (in bytes) [default: 5368709120]
  -W, --write-buffer-size <WRITE_BUFFER_SIZE>
          The size of the write buffer used when mocking (in rows) [default: 1000000]
  -h, --help
          Print help
  -V, --version
          Print version

To control the log verbosity, set the RUST_LOG environment variable to the desired level, e.g., INFO.

Threading

To find out how many threads (logical cores) you have available, run one of the following commands, depending on your host system:

Windows:
- Command: Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors
- Use the value found under NumberOfLogicalProcessors.
Unix:
- Command: lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
- The number of logical cores is calculated as: threads per core × cores per socket × sockets.
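
The same number can also be obtained programmatically. The sketch below is not part of evolution: `logical_cores` is a hypothetical helper that applies the formula above, and `available_threads` wraps the Rust standard library function `std::thread::available_parallelism`, which reports how much parallelism the current process can use. That value may be lower than the hardware total, for example when CPU affinity or container limits apply.

```rust
use std::thread;

/// Logical cores from the three values reported by lscpu.
/// Hypothetical helper for illustration only.
fn logical_cores(threads_per_core: usize, cores_per_socket: usize, sockets: usize) -> usize {
    threads_per_core * cores_per_socket * sockets
}

/// Parallelism available to this process, falling back to 1 if it cannot be determined.
fn available_threads() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    // Example: 2 threads per core, 8 cores per socket, 1 socket.
    println!("lscpu formula: {}", logical_cores(2, 8, 1));
    println!("available to this process: {}", available_threads());
}
```

Either value is a reasonable upper bound for the -N, --n-threads option.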

License

All code is copyright of firelink and published under the MIT license. Please see LICENSE for specific information.

Dependencies

~0.4–1MB
~23K SLoC