1 unstable release

Uses old Rust 2015

0.1.0 Dec 7, 2017

#19 in #unsigned

Custom license

83KB
1.5K SLoC

ETL


This package is a general-purpose Extract-Transform-Load (ETL) library for Rust, built to load arbitrary plain-text files into data frame objects.

Features:

  • Delimiter specification (comma, tab, etc.)
  • Data types:
    • Signed / unsigned integers
    • Floating point numbers
    • Text fields
    • Boolean values
  • Transformations:
    • Concatenation (of text fields)
    • Mapping (from one text field to another)
    • Conversion between types
    • Scaling of values (for numeric values, e.g. between -1 and 1)
    • Normalization of values
    • Vectorization (one-hot or feature hashing)
  • Filtering
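
To make two of the transformations above concrete, here is a minimal standalone sketch of min-max scaling and one-hot vectorization in plain Rust. It does not use this crate's API; the function names `scale_to_range` and `one_hot` are hypothetical and chosen only for illustration, and the library's own implementation may differ.

```rust
/// Linearly rescale `values` into the range [lo, hi] (min-max scaling).
/// If every value is identical, each output is the midpoint of the range.
/// Hypothetical helper, not part of the etl crate.
fn scale_to_range(values: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let span = max - min;
    values
        .iter()
        .map(|&v| {
            if span == 0.0 {
                (lo + hi) / 2.0
            } else {
                lo + (v - min) / span * (hi - lo)
            }
        })
        .collect()
}

/// One-hot encode a text column.
/// Returns the sorted distinct categories and, for each input value,
/// a row with a 1 in the position of its category and 0 elsewhere.
/// Hypothetical helper, not part of the etl crate.
fn one_hot(values: &[&str]) -> (Vec<String>, Vec<Vec<u8>>) {
    let mut categories: Vec<String> = values.iter().map(|s| s.to_string()).collect();
    categories.sort();
    categories.dedup();
    let rows: Vec<Vec<u8>> = values
        .iter()
        .map(|v| categories.iter().map(|c| (c.as_str() == *v) as u8).collect())
        .collect();
    (categories, rows)
}
```

For example, `scale_to_range(&[0.0, 5.0, 10.0], -1.0, 1.0)` maps the smallest value to -1 and the largest to 1, matching the "between -1 and 1" case mentioned in the list.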

Configuration is handled through a TOML file. For example:

## data_config.toml

[[source_files]]
name = "source1.csv"
delimiter = ","
fields = [ { source_name = "a_text_field", field_type = "Text", add_to_frame = false },
           { source_name = "another_text_field", field_type = "Text", add_to_frame = false } ]

[[source_files]]
name = "source2.tsv"
delimiter = "\t"
fields = [ { source_name = "an_integer", field_type = "Signed" },
           { source_name = "another_integer", field_type = "Signed" },
           { source_name = "a_category", field_type = "Text" },
           { source_name = "an_unused_float", field_type = "Float", add_to_frame = false } ]

[[transforms]]
method = { action = "Concatenate",  separator = " & " }
source_fields = [ "a_text_field", "another_text_field" ]
target_name = "a_new_text_field"

[[transforms]]
source_fields = [ "a_category" ]
target_name = "category_mapped_to_integers"

[transforms.method]
action = "Map"
default_value = "-1"
map = { "first_category" = "0", "second_category" = "1" }
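
As a rough illustration of what the two transforms in this configuration are expected to produce per record, the sketch below reimplements "Concatenate" and "Map" in plain Rust. It is an assumption-based reading of the config keys (`separator`, `map`, `default_value`), not the crate's code; `concatenate` and `map_value` are hypothetical names.

```rust
use std::collections::HashMap;

/// Join the values of several text fields from one record using `separator`.
/// Hypothetical helper mirroring the "Concatenate" action.
fn concatenate(fields: &[&str], separator: &str) -> String {
    fields.join(separator)
}

/// Look `value` up in `map`; fall back to `default_value` when it is absent.
/// Hypothetical helper mirroring the "Map" action.
fn map_value(value: &str, map: &HashMap<&str, &str>, default_value: &str) -> String {
    map.get(value).copied().unwrap_or(default_value).to_string()
}
```

With the mapping from the config above, a record whose `a_category` is `"second_category"` would yield `"1"`, and any category missing from the table would yield the default `"-1"`.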

To load a configuration file:

use std::path::PathBuf;

// Resolve data_config.toml relative to the directory of this source file.
let data_path = PathBuf::from(file!()).parent().unwrap().join("data_config.toml");

let (config, df) = DataFrame::load(data_path.as_path()).unwrap();

let mut fieldnames = df.fieldnames();
fieldnames.sort();
assert_eq!(fieldnames, ["a_category", "a_new_text_field", "an_integer", "another_integer",
    "category_mapped_to_integers"]);

Once loaded, the data frame can be converted into a matrix for further processing.

let (config, df) = DataFrame::load(data_path.as_path()).unwrap();
let (fieldnames, mat) = df.as_matrix().unwrap();
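
The README does not show the matrix type that `as_matrix` returns, so the following is only a generic sketch of the idea: packing equally long numeric columns into a row-major buffer alongside their field names. The function `columns_to_matrix` and its `BTreeMap` input are assumptions made for illustration and are not part of this crate.

```rust
use std::collections::BTreeMap;

/// Pack equally long numeric columns into a flat, row-major matrix.
/// Returns the column names in sorted order plus the flattened data,
/// or `None` if the columns do not all have the same length.
/// Hypothetical helper, not part of the etl crate.
fn columns_to_matrix(columns: &BTreeMap<String, Vec<f64>>) -> Option<(Vec<String>, Vec<f64>)> {
    let names: Vec<String> = columns.keys().cloned().collect();
    let n_rows = columns.values().next().map_or(0, |c| c.len());
    if columns.values().any(|c| c.len() != n_rows) {
        return None;
    }
    let mut data = Vec::with_capacity(n_rows * names.len());
    for row in 0..n_rows {
        // BTreeMap iterates in key order, so values line up with `names`.
        for col in columns.values() {
            data.push(col[row]);
        }
    }
    Some((names, data))
}
```

Returning the field names together with the data keeps the column order explicit, which is the same pairing the `(fieldnames, mat)` tuple above suggests.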

Dependencies

~70MB
~1M SLoC