#dataframe #table #csv #sql #matrix

bin+lib rowboat

A package for working with row/column/cell structured data AKA dataframe

8 releases

0.0.8 Oct 14, 2024
0.0.7 Oct 11, 2024
0.0.5 Sep 28, 2024

#45 in Database implementations

Download history 358/week @ 2024-09-05 242/week @ 2024-09-12 17/week @ 2024-09-19 320/week @ 2024-09-26 179/week @ 2024-10-03 371/week @ 2024-10-10 27/week @ 2024-10-17

756 downloads per month

MIT license

89KB
2.5K SLoC

Rowboat 🛶

Dataframe in rust 🦀

+-----------+------+-----------+
| strangs   | nums | null nums |
+-----------+------+-----------+
| sugar     |    0 |       -10 |
| sweets    |    1 |      Null |
| candy pop |    2 |       200 |
| caramel   |    3 |       400 |
| chocolate |    4 |       777 |
+-----------+------+-----------+

Import

use rowboat::dataframe::*;

Create

From rows

using the row! macro

let df = Dataframe::from_rows(
    vec!["id", "name", "score", "val"],
    vec![
        row!(1, "Sally", 23, true),
        row!(2, "Jasper", 41, false),
        row!(3, "Jake", 33, true),
    ],
)
.unwrap();

From csv

With ToRow proc-macro

#[derive(Deserialize, ToRow)]
struct MyRow {
    name: String,
    score: i64,
    val: bool,
}

let df = Dataframe::from_csv::<MyRow>("./tests/test.csv").unwrap();

Or implement ToRow manually

impl ToRow for MyRow {
    fn to_row(&self) -> Vec<Cell> {
        vec![self.name.as_str().into(), self.age.into(), self.val.into()]
    }
    fn labels(&self) -> Vec<String> {
        vec!["name".to_string(), "age".to_string(), "val".to_string()]
    }
}

From structs

Create from a Vec<T> where T implements ToRow

#[derive(ToRow)]
struct MyRow {
    name: String,
    score: i64,
    val: bool,
}

let df = Dataframe::from_structs(vec![
    MyRow {
        name: "Jake".to_string(),
        age: 23,
        val: true,
    },
    MyRow {
        name: "Sally".to_string(),
        age: 44,
        val: false,
    },
    MyRow {
        name: "Jasper".to_string(),
        age: 61,
        val: true,
    },
])
.unwrap();

With null values

let df = Dataframe::from_rows(
    vec!["name", "age", "score", "val"],
    vec![
        row!("Sasha", None::<i64>, 160, Some(false)),
        row!("Jane", Some(24), 70, None::<bool>),
        row!("Jerry", None::<i64>, 40, Some(true)),
    ],
)
.unwrap();

With timestamp

let df = Dataframe::from_rows(
    vec!["id", "label", "at"],
    vec![
        row!(2, "Noon", Timestamp(2024, 8, 26, 12, 15, 0)),
        row!(3, "Night", Timestamp(2024, 8, 26, 22, 45, 0)),
        row!(1, "Morning", Timestamp(2024, 8, 26, 8, 5, 0)),
    ],
)
.unwrap();

Supported types

  • Int(i64)
  • Uint(u64)
  • Str(String)
  • Bool(bool)
  • Float(f64)
  • DateTime(chrono::NaiveDateTime)
  • Null(Box<Cell>)

Display

All

df.print();

Head

df.head(5);

Tail

df.tail(5);

Metadata

Info

Print shape and types

df.info();
// DF Info
// Shape: 3_col x 5_row
// Columns: strangs <Str>, nums <Int>, null nums <Int>

Describe

df.describe().print();

Creates a describe df and prints it:

+---------+---------+------+-----------+
| ::      | strangs | nums | null nums |
+---------+---------+------+-----------+
| count   |       5 |    5 |         5 |
| mean    |    Null |    2 |    341.75 |
| std     |    Null | 1.41 |    301.15 |
| min     |    Null |    0 |       -10 |
| 25%     |    Null |  0.5 |        95 |
| 50%     |    Null |    2 |       300 |
| 75%     |    Null |  3.5 |     588.5 |
| max     |    Null |    4 |       777 |
| unique  |       5 | Null |      Null |
| top idx |       0 | Null |      Null |
| freq    |       1 | Null |      Null |
+---------+---------+------+-----------+

Column names

df.col_names();

Extend

Add column

df.add_col("value", vec![-10, 30, 20, 4]).unwrap();
+----+-------+--------+  +-------+
| id | name  | active |    value |
+----+-------+--------+  +-------+
|  0 | Jake  | true   |      -10 |
|  1 | Jane  | true   |       30 |
|  2 | Sally | false  |       20 |
|  3 | Sam   | false  |        4 |
+----+-------+--------+  +-------+

Add row

df.add_row(row!(4, "Susan", false, 7)).unwrap();
+----+-------+--------+-------+
| id | name  | active | value |
+----+-------+--------+-------+
|  0 | Jake  | true   |   -10 |
|  1 | Jane  | true   |  Null |
|  2 | Sally | false  |   200 |
|  3 | Sam   | false  |   400 |

+    +       +        +       +
|  4 | Susan | false  |     7 |
+----+-------+--------+-------+

Concat

Extend vertically, essentially a union join

df.concat(other_df).unwrap();
+-----------+------+-----------+
| strangs   | nums | null nums |
+-----------+------+-----------+
| sugar     |    0 |       -10 |
| sweets    |    1 |      Null |
| candy pop |    2 |       200 |
| caramel   |    3 |       400 |

+           +      +           +

| chocolate |    4 |       777 |
| cinnamon  |    5 |       300 |
| syrup     |    6 |      Null |
| sprinkles |    7 |      -500 |
+-----------+------+-----------+

Join

Extend horizontally on left/right column value match

Inner join

// join(other_df, left_column, right_column)
let result_df = df.join(&other_df, "id", "user_id").unwrap();
+----+-------+--------  +  -----+---------+
| id | name  | active       uid | balance |
+----+-------+--------  +  -----+---------+
|  0 | Jake  | true           0 |     -10 |
|  1 | Jane  | true           1 |    Null |
|  2 | Sally | false          2 |     200 |
|  3 | Sam   | false          3 |     400 |
|  4 | Susan | false          4 |     777 |
+----+-------+--------  +  -----+---------+

Left join

let result_df = df.left_join(&other_df, "id", "user_id").unwrap();

More on columns

Copy/update an existing column into a new column

df.add_col(
    "age is even",
    df.col_values("age")
        .unwrap()
        .iter()
        .map(|cell| match cell {
            Cell::Int(age) => Some(age % 2 == 0),
            _ => None::<bool>,
        })
        .collect(),
)
.unwrap();

Create a column derived from multiple source column values

df.add_col(
    "id and age odd",
    df.col_slice(["id", "age"].into())
        .unwrap()
        .iter()
        .map(|row| {
            let id_odd = match row.get("id").unwrap() {
                Cell::Int(v) => v % 2 != 0,
                _ => false,
            };
            let score_odd = match row.get("age").unwrap() {
                Cell::Int(v) => v % 2 != 0,
                _ => false,
            };
            id_odd && score_odd
        })
        .collect(),
)
.unwrap();

Slice

By index

// to_dataframe copies DataSlice into new Dataframe
df.slice(1, 4).unwrap().to_dataframe();
+      +       +        +       +
|  100 | Jane  | true   |  Null |
|  200 | Sally | false  |   200 |
|  300 | Sam   | false  |   400 |
+      +       +        +       +

By column

df.col_slice(["name", "age"].into())
    .unwrap()
    .to_dataframe();
+--------+-----+
  name   | age 
+--------+-----+
  Jane   |  24 
  Sally  |  56 
  Susan  |  43 
  Jasper |  78 
  Sam    |  37 
+--------+-----+

Get cell

// (row_index, col_name)
let cell = df.cell(1, "score").unwrap();

Reshape

Drop columns

Drop specified columns

df.drop_cols(["name", "registered"].into());

Retain columns

Drop all columns other than those specified

df.retain_cols(["name", "registered"].into());

Rename column

df.rename_col("strangs", "Strings").unwrap();

Filter

Operation enum variants:

  • Eq equal
  • Neq not equal
  • Gt greater than
  • Lt less than
  • GtEq greater or equal than
  • LtEq less or equal than
  • Mod(i64) mod i is
  • Regex matches regex

Simple

// where age val is not null
let df = df.filter(exp("age", Neq, None::<i64>)).unwrap();
Before                     After
+--------+------+-------+  +--------+------+-------+
| name   | age  | value |  | name   | age  | value |
+--------+------+-------+  +--------+------+-------+
| Jane   | Null |   -10 |  | Sally  |   56 |  Null |
| Sally  |   56 |  Null |  | Susan  |   43 |   200 |
| Susan  |   43 |   200 |  | Sam    |   37 |   777 |
| Jasper | Null |   400 |  +--------+------+-------+
| Sam    |   37 |   777 |
+--------+------+-------+

Complex

Nest as many and/or/not/exp as needed

let df = df
    .filter(or(vec![
        and(vec![exp("id", Gt, 2), exp("score", Lt, 1000)]),
        exp("val", Eq, false),
    ]))
    .unwrap();

Negate

Wrap any expression in not() to inverse the result

// filter odd values
let df = df.filter(not(exp("age", Mod(2), 0))).unwrap();

Mutate

By column

df.col_mut("id")
    .unwrap()
    .apply(|cell| {
        if let Cell::Int(val) = cell {
            *val *= 2
        }
    })
    .unwrap();

By cell

Directly

// index, column, new_value
df.set_val(2, "score", 500).unwrap();

Via function

// index, column, function
df.update_val(0, "score", |cell| {
    if let Cell::Int(val) = cell {
        *val *= 2
    }
})
.unwrap();

Sort

Simple

// sort by, sort dir [Asc | Desc]
df.sort("at", Asc).unwrap();

Complex

Use this method for multi column sorting

let sorted = df
    .into_sort()
    .sort("one", Asc)
    .sort("two", Asc)
    .sort("three", Desc)
    .collect()
    .unwrap();

Iterate

Iter

let unames = df
    .iter()
    .map(|row| match row.get("username") {
        Some(Cell::Str(val)) => val,
        _ => "None",
    })
    .collect::<Vec<&str>>();

Into iter

A consuming df.into_iter() is also available

Iter chunk

df.iter_chunk(2).for_each(|chunk| chunk.print());

Group by

Reducer enum variants

  • Count
  • Sum
  • Prod
  • Mean
  • Min
  • Max
  • Top
  • Unique
  • Coalesce
  • NonNull

Query

Group df by common group_by values then do selects to reduce data groups into a new dataframe

// Source column, reducer, new alias name
let grouped_df = df
    .group_by("department")
    .select("department", Coalesce, "department")
    .select("name", Count, "count")
    .select("salary", Max, "max sal")
    .select("salary", Min, "min sal")
    .select("age", Mean, "avg age")
    .to_dataframe()
    .unwrap();

Above query transforms this raw data:

+--------+-------------+--------+-----+
| name   | department  | salary | age |
+--------+-------------+--------+-----+
| Jasper | Sales       |    100 |  29 |
| James  | Marketing   |    200 |  44 |
| Susan  | Sales       |    300 |  65 |
| Jane   | Marketing   |    400 |  47 |
| Sam    | Sales       |    100 |  55 |
| Sally  | Engineering |    200 |  30 |
+--------+-------------+--------+-----+

Into this new dataframe:

+-------------+-------+---------+---------+---------+
| department  | count | max sal | min sal | avg age |
+-------------+-------+---------+---------+---------+
| Sales       |     3 |     300 |     100 |   49.67 |
| Marketing   |     2 |     400 |     200 |    45.5 |
| Engineering |     1 |     200 |     200 |      30 |
+-------------+-------+---------+---------+---------+

Grouped chunks

Group df by common chunk_by values into a Vec<Dataframe>

df.to_slice()
    .chunk_by("State")
    .unwrap()
    .iter()
    .for_each(|chunk| chunk.print());

Store

To csv

df.to_csv("./tests/test.csv").unwrap();

To SQL

Convert the df into chunks of SQL insert statements with corresponding Vec<String> args. Meant to be compatible with sqlx library.

df.iter_sql("my_table", 500).for_each(|(query, args)| {
    // insert records
    println!("{query}");
    println!("{:?}", args);
});

DataSlice type also has a to_sql method.

Examples

For more examples, see ./tests/integration_test.rs, ./tests/example/example.rs, and ./tests/example/example_from_sql.rs

Dependencies

~5–7MB
~112K SLoC