8 releases
0.0.8 | Oct 14, 2024 |
---|---|
0.0.7 | Oct 11, 2024 |
0.0.5 | Sep 28, 2024 |
#45 in Database implementations
756 downloads per month
89KB
2.5K
SLoC
Rowboat 🛶
Dataframe in rust 🦀
+-----------+------+-----------+
| strangs | nums | null nums |
+-----------+------+-----------+
| sugar | 0 | -10 |
| sweets | 1 | Null |
| candy pop | 2 | 200 |
| caramel | 3 | 400 |
| chocolate | 4 | 777 |
+-----------+------+-----------+
Import
use rowboat::dataframe::*;
Create
From rows
using the row!
macro
let df = Dataframe::from_rows(
vec!["id", "name", "score", "val"],
vec![
row!(1, "Sally", 23, true),
row!(2, "Jasper", 41, false),
row!(3, "Jake", 33, true),
],
)
.unwrap();
From csv
With ToRow proc-macro
#[derive(Deserialize, ToRow)]
struct MyRow {
name: String,
score: i64,
val: bool,
}
let df = Dataframe::from_csv::<MyRow>("./tests/test.csv").unwrap();
Or implement ToRow manually
impl ToRow for MyRow {
fn to_row(&self) -> Vec<Cell> {
vec![self.name.as_str().into(), self.age.into(), self.val.into()]
}
fn labels(&self) -> Vec<String> {
vec!["name".to_string(), "age".to_string(), "val".to_string()]
}
}
From structs
Create from a Vec<T>
where T
implements ToRow
#[derive(ToRow)]
struct MyRow {
name: String,
score: i64,
val: bool,
}
let df = Dataframe::from_structs(vec![
MyRow {
name: "Jake".to_string(),
age: 23,
val: true,
},
MyRow {
name: "Sally".to_string(),
age: 44,
val: false,
},
MyRow {
name: "Jasper".to_string(),
age: 61,
val: true,
},
])
.unwrap();
With null values
let df = Dataframe::from_rows(
vec!["name", "age", "score", "val"],
vec![
row!("Sasha", None::<i64>, 160, Some(false)),
row!("Jane", Some(24), 70, None::<bool>),
row!("Jerry", None::<i64>, 40, Some(true)),
],
)
.unwrap();
With timestamp
let df = Dataframe::from_rows(
vec!["id", "label", "at"],
vec![
row!(2, "Noon", Timestamp(2024, 8, 26, 12, 15, 0)),
row!(3, "Night", Timestamp(2024, 8, 26, 22, 45, 0)),
row!(1, "Morning", Timestamp(2024, 8, 26, 8, 5, 0)),
],
)
.unwrap();
Supported types
Int(i64)
Uint(u64)
Str(String)
Bool(bool)
Float(f64)
DateTime(chrono::NaiveDateTime)
Null(Box<Cell>)
Display
All
df.print();
Head
df.head(5);
Tail
df.tail(5);
Metadata
Info
Print shape and types
df.info();
// DF Info
// Shape: 3_col x 5_row
// Columns: strangs <Str>, nums <Int>, null nums <Int>
Describe
df.describe().print();
Creates a describe df and prints it:
+---------+---------+------+-----------+
| :: | strangs | nums | null nums |
+---------+---------+------+-----------+
| count | 5 | 5 | 5 |
| mean | Null | 2 | 341.75 |
| std | Null | 1.41 | 301.15 |
| min | Null | 0 | -10 |
| 25% | Null | 0.5 | 95 |
| 50% | Null | 2 | 300 |
| 75% | Null | 3.5 | 588.5 |
| max | Null | 4 | 777 |
| unique | 5 | Null | Null |
| top idx | 0 | Null | Null |
| freq | 1 | Null | Null |
+---------+---------+------+-----------+
Column names
df.col_names();
Extend
Add column
df.add_col("value", vec![-10, 30, 20, 4]).unwrap();
+----+-------+--------+ +-------+
| id | name | active | value |
+----+-------+--------+ +-------+
| 0 | Jake | true | -10 |
| 1 | Jane | true | 30 |
| 2 | Sally | false | 20 |
| 3 | Sam | false | 4 |
+----+-------+--------+ +-------+
Add row
df.add_row(row!(4, "Susan", false, 7)).unwrap();
+----+-------+--------+-------+
| id | name | active | value |
+----+-------+--------+-------+
| 0 | Jake | true | -10 |
| 1 | Jane | true | Null |
| 2 | Sally | false | 200 |
| 3 | Sam | false | 400 |
+ + + + +
| 4 | Susan | false | 7 |
+----+-------+--------+-------+
Concat
Extend vertically, essentially a union join
df.concat(other_df).unwrap();
+-----------+------+-----------+
| strangs | nums | null nums |
+-----------+------+-----------+
| sugar | 0 | -10 |
| sweets | 1 | Null |
| candy pop | 2 | 200 |
| caramel | 3 | 400 |
+ + + +
| chocolate | 4 | 777 |
| cinnamon | 5 | 300 |
| syrup | 6 | Null |
| sprinkles | 7 | -500 |
+-----------+------+-----------+
Join
Extend horizontally on left/right column value match
Inner join
// join(other_df, left_column, right_column)
let result_df = df.join(&other_df, "id", "user_id").unwrap();
+----+-------+-------- + -----+---------+
| id | name | active uid | balance |
+----+-------+-------- + -----+---------+
| 0 | Jake | true 0 | -10 |
| 1 | Jane | true 1 | Null |
| 2 | Sally | false 2 | 200 |
| 3 | Sam | false 3 | 400 |
| 4 | Susan | false 4 | 777 |
+----+-------+-------- + -----+---------+
Left join
let result_df = df.left_join(&other_df, "id", "user_id").unwrap();
More on columns
Copy/update an existing column into a new column
df.add_col(
"age is even",
df.col_values("age")
.unwrap()
.iter()
.map(|cell| match cell {
Cell::Int(age) => Some(age % 2 == 0),
_ => None::<bool>,
})
.collect(),
)
.unwrap();
Create a column derived from multiple source column values
df.add_col(
"id and age odd",
df.col_slice(["id", "age"].into())
.unwrap()
.iter()
.map(|row| {
let id_odd = match row.get("id").unwrap() {
Cell::Int(v) => v % 2 != 0,
_ => false,
};
let score_odd = match row.get("age").unwrap() {
Cell::Int(v) => v % 2 != 0,
_ => false,
};
id_odd && score_odd
})
.collect(),
)
.unwrap();
Slice
By index
// to_dataframe copies DataSlice into new Dataframe
df.slice(1, 4).unwrap().to_dataframe();
+ + + + +
| 100 | Jane | true | Null |
| 200 | Sally | false | 200 |
| 300 | Sam | false | 400 |
+ + + + +
By column
df.col_slice(["name", "age"].into())
.unwrap()
.to_dataframe();
+--------+-----+
name | age
+--------+-----+
Jane | 24
Sally | 56
Susan | 43
Jasper | 78
Sam | 37
+--------+-----+
Get cell
// (row_index, col_name)
let cell = df.cell(1, "score").unwrap();
Reshape
Drop columns
Drop specified columns
df.drop_cols(["name", "registered"].into());
Retain columns
Drop all columns other than those specified
df.retain_cols(["name", "registered"].into());
Rename column
df.rename_col("strangs", "Strings").unwrap();
Filter
Operation enum variants:
Eq
equalNeq
not equalGt
greater thanLt
less thanGtEq
greater or equal thanLtEq
less or equal thanMod(i64)
modi
isRegex
matches regex
Simple
// where age val is not null
let df = df.filter(exp("age", Neq, None::<i64>)).unwrap();
Before After
+--------+------+-------+ +--------+------+-------+
| name | age | value | | name | age | value |
+--------+------+-------+ +--------+------+-------+
| Jane | Null | -10 | | Sally | 56 | Null |
| Sally | 56 | Null | | Susan | 43 | 200 |
| Susan | 43 | 200 | | Sam | 37 | 777 |
| Jasper | Null | 400 | +--------+------+-------+
| Sam | 37 | 777 |
+--------+------+-------+
Complex
Nest as many and/or/not/exp as needed
let df = df
.filter(or(vec![
and(vec![exp("id", Gt, 2), exp("score", Lt, 1000)]),
exp("val", Eq, false),
]))
.unwrap();
Negate
Wrap any expression in not()
to inverse the result
// filter odd values
let df = df.filter(not(exp("age", Mod(2), 0))).unwrap();
Mutate
By column
df.col_mut("id")
.unwrap()
.apply(|cell| {
if let Cell::Int(val) = cell {
*val *= 2
}
})
.unwrap();
By cell
Directly
// index, column, new_value
df.set_val(2, "score", 500).unwrap();
Via function
// index, column, function
df.update_val(0, "score", |cell| {
if let Cell::Int(val) = cell {
*val *= 2
}
})
.unwrap();
Sort
Simple
// sort by, sort dir [Asc | Desc]
df.sort("at", Asc).unwrap();
Complex
Use this method for multi column sorting
let sorted = df
.into_sort()
.sort("one", Asc)
.sort("two", Asc)
.sort("three", Desc)
.collect()
.unwrap();
Iterate
Iter
let unames = df
.iter()
.map(|row| match row.get("username") {
Some(Cell::Str(val)) => val,
_ => "None",
})
.collect::<Vec<&str>>();
Into iter
A consuming df.into_iter()
is also available
Iter chunk
df.iter_chunk(2).for_each(|chunk| chunk.print());
Group by
Reducer enum variants
Count
Sum
Prod
Mean
Min
Max
Top
Unique
Coalesce
NonNull
Query
Group df by common group_by
values then do selects to reduce data groups into a new dataframe
// Source column, reducer, new alias name
let grouped_df = df
.group_by("department")
.select("department", Coalesce, "department")
.select("name", Count, "count")
.select("salary", Max, "max sal")
.select("salary", Min, "min sal")
.select("age", Mean, "avg age")
.to_dataframe()
.unwrap();
Above query transforms this raw data:
+--------+-------------+--------+-----+
| name | department | salary | age |
+--------+-------------+--------+-----+
| Jasper | Sales | 100 | 29 |
| James | Marketing | 200 | 44 |
| Susan | Sales | 300 | 65 |
| Jane | Marketing | 400 | 47 |
| Sam | Sales | 100 | 55 |
| Sally | Engineering | 200 | 30 |
+--------+-------------+--------+-----+
Into this new dataframe:
+-------------+-------+---------+---------+---------+
| department | count | max sal | min sal | avg age |
+-------------+-------+---------+---------+---------+
| Sales | 3 | 300 | 100 | 49.67 |
| Marketing | 2 | 400 | 200 | 45.5 |
| Engineering | 1 | 200 | 200 | 30 |
+-------------+-------+---------+---------+---------+
Grouped chunks
Group df by common chunk_by
values into a Vec<Dataframe>
df.to_slice()
.chunk_by("State")
.unwrap()
.iter()
.for_each(|chunk| chunk.print());
Store
To csv
df.to_csv("./tests/test.csv").unwrap();
To SQL
Convert the df into chunks of SQL insert statements with corresponding Vec<String>
args. Meant to be compatible with sqlx
library.
df.iter_sql("my_table", 500).for_each(|(query, args)| {
// insert records
println!("{query}");
println!("{:?}", args);
});
DataSlice type also has a to_sql
method.
Examples
For more examples, see ./tests/integration_test.rs
, ./tests/example/example.rs
, and ./tests/example/example_from_sql.rs
Dependencies
~5–7MB
~112K SLoC