### 1 unstable release

0.1.0 (Feb 25, 2024)

#**554** in Math

**52** downloads per month

**LGPL-3.0-or-later**

245KB

5K SLoC

# std-dev

Your Swiss Army knife for swiftly processing any amount of data. Implemented for industrial and educational purposes alike.

This codebase is well-documented and commented, in an effort to expose the wonderful algorithms of data analysis to the masses.

We're ever expanding, but for now the following are implemented.

- Standard deviation, both for generic slices and clusters.
- Fast median and mean for large datasets with limited options of values (clusters)
- O(n) - linear time - algorithms, both for arbitrary generic lists (any type of number) and clusters:
    - percentile
    - median
    - standard deviation
    - mean
- Ordinary least squares for linear and polynomial regression
- Naive (O(n²)) Theil-Sen estimator for both linear and polynomial (O(n^m), where m is the degree + 1) regression
- Exponential/growth and power regression, with **correct handling of negatives** (most other applications silently ignore them)
- "best fit" method if you don't know which regression model to use
- (binary) A basic plotting feature to preview the equation in relation to the input data
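To give a feel for what the naive Theil-Sen estimator in the list above does, here is a minimal, self-contained sketch for the linear case. It is not this crate's API: the names `theil_sen` and `median` are made up for illustration, and the sketch assumes finite `f64` inputs. It takes the median of all pairwise slopes, which is where the O(n²) comes from (sorting the slopes adds a log factor here, for simplicity).

```rust
/// Median of a slice of finite floats. Sorts in place for simplicity.
fn median(values: &mut [f64]) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).expect("NaN in input"));
    let n = values.len();
    if n % 2 == 1 {
        values[n / 2]
    } else {
        (values[n / 2 - 1] + values[n / 2]) / 2.0
    }
}

/// Naive linear Theil-Sen (illustrative, not the crate's implementation).
/// Slope is the median of all pairwise slopes, intercept is the median of
/// `y - slope * x`. Returns `None` if no two points have different x values.
fn theil_sen(points: &[(f64, f64)]) -> Option<(f64, f64)> {
    let mut slopes = Vec::new();
    for (i, &(x1, y1)) in points.iter().enumerate() {
        for &(x2, y2) in &points[i + 1..] {
            // Skip vertical pairs, their slope is undefined.
            if x1 != x2 {
                slopes.push((y2 - y1) / (x2 - x1));
            }
        }
    }
    if slopes.is_empty() {
        return None;
    }
    let slope = median(&mut slopes);
    let mut intercepts: Vec<f64> = points.iter().map(|&(x, y)| y - slope * x).collect();
    let intercept = median(&mut intercepts);
    Some((slope, intercept))
}

fn main() {
    // Points on y = 2x + 1, plus one gross outlier at x = 4.
    let points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 100.0)];
    let (slope, intercept) = theil_sen(&points).unwrap();
    println!("y = {slope}x + {intercept}");
}
```

With this input the outlier does not move the result: six of the ten pairwise slopes are exactly 2, so the median slope is 2 and the median intercept is 1.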

# Usage

This application can be used as a **library** (with optional cargo features),
as an interactive **CLI** program, and by **piping** data to it through standard input.

It accepts any comma/space separated values. Scientific notation is supported. This is minimalistic by design, as other programs may be used to produce/modify the data before it's processed by us.
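As a rough illustration of that input format, the sketch below splits a string on commas and whitespace and parses each token as an `f64`. The function name `parse_values` is hypothetical and this is not the parser std-dev uses; it only relies on the standard library, whose float parser already understands scientific notation such as `3e2`.

```rust
use std::num::ParseFloatError;

/// Splits on commas and any whitespace, then parses every non-empty token.
/// Returns the first parse error encountered, if any.
fn parse_values(input: &str) -> Result<Vec<f64>, ParseFloatError> {
    input
        .split(|c: char| c == ',' || c.is_whitespace())
        .filter(|token| !token.is_empty())
        .map(str::parse::<f64>)
        .collect()
}

fn main() {
    // Mixed separators and scientific notation in one input.
    let values = parse_values("1, 2.5 3e2,4.0E-1\n-7").unwrap();
    println!("{values:?}");
}
```

Here the input parses to the five values 1, 2.5, 300, 0.4 and -7, while an input such as `1, x` yields an error instead of being silently skipped.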

## Shell completion

Using the subcommand `completion`, std-dev automatically generates shell completions for your shell and tries to put them in the appropriate location.

When using Bash or Zsh, you should run std-dev as root, as we need root privileges to write to their completion directories.
Alternatively, use the `--print` option to write the completion file yourself.

# Cargo features

When using this as a library, I recommend disabling all features except `base` (`std-dev = { version = "0.1", default-features = false, features = ["base"] }`)
and enabling those you need.

- `bin` (default, binary feature): This enables the binary to compile.
- `prettier` (default, binary feature): Makes the binary output prettier. Includes colours and prompts for interactive use.
- `completion` (default, binary feature): Enable the ability to generate shell completions.
- `regression` (default, library and binary feature): Enables all regression estimators. This requires `nalgebra`, which provides linear algebra.
- `ols` (default, library feature): Enables the use of OLS, which is the "default" estimator. This also enables polynomial Theil-Sen for degrees > 2 & polynomial regression in `best_fit` functions.
- `arbitrary-precision` (default, library feature): Uses arbitrary precision algebra for >10 degree polynomial regression.
- `percentile-rand` (default, base, library feature): Enables the recommended `pivot_fn` for percentile-related functions.
- `simplify-fraction` (default, base, library feature): Fractions are simplified. Relaxes the requirements for fraction input and implements Eq & Ord for fractions.
- `generic-impls` (default, base, library feature): Makes `mean`, `standard_deviation`, and percentile resolving generic over numbers. This enables you to use numerical types from other libraries without hassle.
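To show why a pivot function matters for percentile-related functions, here is a small quickselect sketch with a pluggable pivot chooser. Everything in it is an assumption made for illustration: the `quickselect` and `xorshift_pivot` names, the signature of the `pivot_fn` parameter, and the use of a tiny xorshift generator as a stand-in for a real random source. The crate's own functions may look quite different.

```rust
/// Returns the k-th smallest element (0-based) in expected linear time.
/// `pivot_fn` receives the current search window and returns the index
/// (relative to that window) of the element to use as the pivot.
fn quickselect<F>(values: &mut [f64], k: usize, pivot_fn: &mut F) -> f64
where
    F: FnMut(&[f64]) -> usize,
{
    assert!(k < values.len(), "k out of bounds");
    let (mut lo, mut hi) = (0, values.len()); // search window is [lo, hi)
    loop {
        if hi - lo == 1 {
            return values[lo];
        }
        // Move the chosen pivot to the end of the window.
        let p = lo + pivot_fn(&values[lo..hi]);
        values.swap(p, hi - 1);
        let pivot = values[hi - 1];
        // Lomuto partition: elements smaller than the pivot go to [lo, store).
        let mut store = lo;
        for i in lo..hi - 1 {
            if values[i] < pivot {
                values.swap(i, store);
                store += 1;
            }
        }
        values.swap(store, hi - 1); // pivot is now at its sorted position
        if k == store {
            return values[store];
        } else if k < store {
            hi = store;
        } else {
            lo = store + 1;
        }
    }
}

/// Pseudo-random index in `0..len` from a xorshift64 step (illustrative only).
fn xorshift_pivot(state: &mut u64, len: usize) -> usize {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state % len as u64) as usize
}

fn main() {
    let mut data = vec![9.0, 1.0, 8.0, 2.0, 7.0, 3.0, 6.0];
    let k = data.len() / 2; // index of the median for an odd-length list
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15; // any non-zero seed works
    let mut random_pivot = |window: &[f64]| xorshift_pivot(&mut state, window.len());
    let median = quickselect(&mut data, k, &mut random_pivot);
    println!("median = {median}");
}
```

A randomly chosen pivot makes the quadratic worst case unlikely for any fixed input order, which is presumably why a randomised pivot is the recommended default. A deterministic chooser such as `|window: &[f64]| window.len() / 2` also works with this sketch.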

# Documentation

Documentation of the main branch can be found at doc.icelk.dev.

To document with information on which cargo features enable the code,
set the environment variable `RUSTDOCFLAGS` to `--cfg docsrs`
(e.g. in Fish `set -x RUSTDOCFLAGS "--cfg docsrs"`)
and then run `cargo +nightly doc`.

# Performance

This library aims to be as fast as possible while maintaining easily readable code.

## Clusters

As all algorithms are executed in linear time now, this is not as useful, but nevertheless an interesting feature. If you already have clustered data, this feature is great.
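To make the idea concrete, the sketch below treats a cluster as a plain `(value, count)` pair, builds the pairs with a hash map, and then computes the mean and the population standard deviation by visiting each unique value once. The function names are hypothetical and the representation is an assumption; the crate's own cluster type may be structured differently, and it may use a different standard deviation convention.

```rust
use std::collections::HashMap;

/// Groups equal values into (value, count) pairs in O(n).
/// Floats are keyed by their bit pattern, so 0.0 and -0.0 are kept apart.
fn cluster(values: &[f64]) -> Vec<(f64, usize)> {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for v in values {
        *counts.entry(v.to_bits()).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(bits, count)| (f64::from_bits(bits), count))
        .collect()
}

/// Mean and population standard deviation in O(m), where m is the number
/// of unique values. Returns `None` for empty input.
fn mean_and_std_dev(clusters: &[(f64, usize)]) -> Option<(f64, f64)> {
    let total: usize = clusters.iter().map(|&(_, count)| count).sum();
    if total == 0 {
        return None;
    }
    let n = total as f64;
    // Each unique value contributes `value * count` to the sum.
    let mean = clusters.iter().map(|&(v, c)| v * c as f64).sum::<f64>() / n;
    let variance = clusters
        .iter()
        .map(|&(v, c)| (v - mean).powi(2) * c as f64)
        .sum::<f64>()
        / n;
    Some((mean, variance.sqrt()))
}

fn main() {
    // Eight measurements, but only three unique values.
    let heights = [50.0, 52.0, 50.0, 48.0, 50.0, 52.0, 48.0, 50.0];
    let clusters = cluster(&heights);
    let (mean, std_dev) = mean_and_std_dev(&clusters).unwrap();
    println!("{} clusters, mean = {mean}, standard deviation = {std_dev}", clusters.len());
}
```

Building the pairs touches every entry once, while the two passes in `mean_and_std_dev` only touch the three unique values.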

When using the clusters feature (turning your list into a `ClusterList`),
calculations are done per *unique* value.
Say you have a dataset of infant height, in centimeters.
That's probably only going to be some 40 different values, but potentially millions of entries.
Using clusters, all that data is only processed as `O(40)`, not `O(millions)`. (I know that notation isn't right, but you get my point.)

Creating this cluster involves adding all the values to a map. This takes `O(n)` time, but is very slow compared to all other algorithms.
After creation, most operations in this library are executed in `O(m)` time, where m is the count of unique values.

#### Dependencies

~0–12MB

~95K SLoC