9 unstable releases

new 0.7.1 Jan 14, 2025
0.7.0 Oct 16, 2023
0.6.1 Dec 3, 2022
0.6.0 Jun 15, 2022
0.3.1 Mar 11, 2021


MIT/Apache

Datasets

linfa-datasets provides a collection of commonly used datasets ready to be used in tests and examples.

The Big Picture

linfa-datasets is a crate in the linfa ecosystem, an effort to create a toolkit for classical Machine Learning implemented in pure Rust, akin to Python's scikit-learn.

Current State

Currently the following datasets are provided:

| Name | Description | #samples, #features, #targets | Targets | Reference |
| --- | --- | --- | --- | --- |
| iris | The Iris dataset provides samples of flower properties, belonging to three different classes. Only two of them are linearly separable. It was introduced by Ronald Fisher in 1936 as an example for linear discriminant analysis. | 150, 4, 1 | Multi-class classification | here |
| winequality | The winequality dataset measures different properties of wine, such as acidity, and gives a quality score from 3 to 8. It was collected in the north of Portugal. | 1599, 11, 1 | Multi-class classification | here |
| diabetes | The diabetes dataset gives samples of human biological measures, such as BMI, age, and blood measures, and is used to predict the progression of diabetes. | 441, 10, 1 | Regression | here |
| linnerud | The linnerud dataset contains samples from 20 middle-aged men in a fitness club. It relates their physical capability to biological measures. | 20, 3, 3 | Regression | here |

The purpose of this crate is to facilitate dataset loading and make it as simple as possible. Loaded datasets are returned as a linfa::Dataset structure with named features. The crate also includes helper functions for reading arrays from CSV files.

Additionally, this crate provides utility functions to randomly generate test datasets.

Using a dataset

To use one of the provided datasets in your project, add the linfa-datasets crate to your Cargo.toml and enable the feature named after the dataset:

linfa-datasets = { version = "0.x", features = ["winequality"] }

You can then use the dataset in your working code:

let (train, valid) = linfa_datasets::winequality()
    .split_with_ratio(0.8);
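
To make the idea of a ratio split concrete, here is a minimal std-only sketch. It is not linfa's implementation: `split_rows` is a hypothetical helper, and both the contiguous cut (no shuffling) and the rounding up of the training size are assumptions made for illustration.

```rust
/// Hypothetical helper, not part of linfa: cut `rows` into a leading
/// training slice and a trailing validation slice.
fn split_rows<T>(rows: &[T], ratio: f32) -> (&[T], &[T]) {
    assert!((0.0..=1.0).contains(&ratio), "ratio must lie in [0, 1]");
    // Assumption: the training size is rounded up and capped at the row count.
    let n_train = ((rows.len() as f32) * ratio).ceil() as usize;
    rows.split_at(n_train.min(rows.len()))
}

fn main() {
    let rows: Vec<u32> = (0..10).collect();
    let (train, valid) = split_rows(&rows, 0.8);
    println!("train = {:?}, valid = {:?}", train, valid);
}
```

Because the cut in this sketch follows row order, a dataset sorted by class would need shuffling first to get representative splits.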

Reading from a file

linfa-datasets is also capable of reading 2D arrays from CSV files:

use std::fs::File;

let file = File::open("data.csv.gz").unwrap();
// Read the array from a GZipped CSV file with a header and separated by commas
let array = linfa_datasets::array_from_gz_csv(file, true, b',').unwrap();
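
The header flag and separator byte passed above can be illustrated with a small std-only sketch that works on already-decompressed text. `parse_csv` is a hypothetical helper, not the crate's reader: it skips gzip decoding entirely, and the column names in the sample are made up.

```rust
/// Hypothetical helper, not the crate's API: parse delimited text into
/// rows of f64, optionally skipping a header line.
fn parse_csv(text: &str, has_headers: bool, separator: u8) -> Result<Vec<Vec<f64>>, String> {
    let sep = separator as char;
    let skip = if has_headers { 1 } else { 0 };
    let mut rows: Vec<Vec<f64>> = Vec::new();
    for (idx, line) in text.lines().enumerate().skip(skip) {
        if line.trim().is_empty() {
            continue; // ignore blank lines
        }
        // Parse every field; the first failure aborts with the line number.
        let row = line
            .split(sep)
            .map(|field| field.trim().parse::<f64>())
            .collect::<Result<Vec<f64>, _>>()
            .map_err(|e| format!("line {}: {}", idx + 1, e))?;
        // A 2D array needs every row to have the same number of columns.
        if let Some(first) = rows.first() {
            if first.len() != row.len() {
                return Err(format!(
                    "line {}: expected {} columns, found {}",
                    idx + 1,
                    first.len(),
                    row.len()
                ));
            }
        }
        rows.push(row);
    }
    Ok(rows)
}

fn main() {
    // Illustrative data only; the column names are invented for this sketch.
    let text = "acidity,sugar,quality\n7.4,1.9,5\n7.8,2.6,5\n";
    let rows = parse_csv(text, true, b',').unwrap();
    println!("{} rows x {} columns", rows.len(), rows[0].len());
}
```

In this sketch, setting the header flag to false on a file that does have a header makes the first line fail to parse as numbers, which is why the flag has to match the file.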

Data generation

To generate datasets randomly, enable the generate feature on linfa-datasets. The API is in the generate module of the crate.
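
As a rough illustration of what randomly generated test data can look like, the sketch below scatters points around fixed centroids using only the standard library. It does not use or mirror the generate module: `Lcg` and `make_blobs` are hypothetical names, the tiny generator stands in for a real random number generator, and uniform noise is an assumption chosen for brevity.

```rust
/// Tiny linear congruential generator, used only to keep the sketch
/// dependency-free. Not suitable for serious statistical work.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the top 53 bits so the value fits an f64 mantissa exactly.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical helper: `per_blob` points scattered uniformly within
/// `spread` of each centroid, one row per point.
fn make_blobs(centroids: &[[f64; 2]], per_blob: usize, spread: f64, rng: &mut Lcg) -> Vec<[f64; 2]> {
    let mut points = Vec::with_capacity(centroids.len() * per_blob);
    for c in centroids {
        for _ in 0..per_blob {
            let dx = (rng.next_f64() * 2.0 - 1.0) * spread;
            let dy = (rng.next_f64() * 2.0 - 1.0) * spread;
            points.push([c[0] + dx, c[1] + dy]);
        }
    }
    points
}

fn main() {
    let centroids = [[0.0, 0.0], [10.0, 10.0]];
    let mut rng = Lcg(42);
    let points = make_blobs(&centroids, 3, 0.5, &mut rng);
    println!("generated {} points", points.len());
}
```

Seeding the generator explicitly keeps the generated data reproducible, which is usually what a test needs.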
