5 releases

0.1.2 Aug 21, 2023
0.1.1 Aug 21, 2023
0.1.0 Jul 7, 2023
0.0.2-alpha Jun 8, 2023
0.0.1-alpha May 18, 2023

#818 in Algorithms

70 downloads per month
Used in osm-io

MIT/Apache

140KB
1.5K SLoC

text-file-sort

This crate implements a sort algorithm for text files composed of lines or line records, such as CSV or TSV files.

This crate can sort any data file composed of lines or line records, that is, lines made up of fields separated by a delimiter. Examples of such files are pg_dump output, CSV, and GTFS data files. The motivation for writing this crate was the need to sort pg_dump files of the OpenStreetMap database, which contain billions of lines, by the primary key of each table before converting the data to PBF format.
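To make the idea of sorting line records by a key field concrete, here is a minimal in-memory sketch that uses only the standard library. It is not this crate's API: the helper names `key_field` and `sort_records_by_key` are hypothetical, and the sketch assumes a numeric key, such as an integer primary key in a tab-separated dump.

```rust
/// Returns the field at `index` of a delimited line, or "" when the field is missing.
/// (Hypothetical helper for illustration, not part of text-file-sort.)
fn key_field(line: &str, delimiter: char, index: usize) -> &str {
    line.split(delimiter).nth(index).unwrap_or("")
}

/// Sorts line records by a numeric key field.
/// Records whose key does not parse as an integer sort before all numeric keys,
/// because `None` orders before `Some(_)`. The sort is stable.
fn sort_records_by_key(lines: &mut [String], delimiter: char, index: usize) {
    lines.sort_by_key(|line| key_field(line, delimiter, index).parse::<i64>().ok());
}
```

Comparing parsed integers rather than raw text matters for keys like these: a plain string comparison would place "10" before "2".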

This implementation can be used to sort very large files, taking advantage of multiple CPU cores and providing memory usage control.
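Files that do not fit in memory are commonly sorted with an external merge sort: sort bounded chunks in memory, write each chunk to a temporary run file, then merge the runs. The sketch below illustrates that general technique with the standard library only. It is single-threaded, and it is an assumption that this resembles what the crate does internally; the names `write_run` and `external_sort` and the `chunk_lines` parameter are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Lines, Write};
use std::path::{Path, PathBuf};

/// Sorts `chunk` in memory, writes it to a new run file in `tmp_dir`, and empties it.
fn write_run(chunk: &mut Vec<String>, tmp_dir: &Path, run_id: usize) -> io::Result<PathBuf> {
    chunk.sort();
    let path = tmp_dir.join(format!("run-{run_id}.txt"));
    let mut writer = BufWriter::new(File::create(&path)?);
    for line in chunk.drain(..) {
        writeln!(writer, "{line}")?;
    }
    writer.flush()?;
    Ok(path)
}

/// Sorts the lines of `input` into `output`, buffering at most `chunk_lines`
/// lines while the sorted runs are generated.
fn external_sort(input: &Path, output: &Path, tmp_dir: &Path, chunk_lines: usize) -> io::Result<()> {
    // Phase 1: split the input into sorted runs on disk.
    let mut runs: Vec<PathBuf> = Vec::new();
    let mut chunk: Vec<String> = Vec::new();
    for line in BufReader::new(File::open(input)?).lines() {
        chunk.push(line?);
        if chunk.len() >= chunk_lines {
            let run_id = runs.len();
            runs.push(write_run(&mut chunk, tmp_dir, run_id)?);
        }
    }
    if !chunk.is_empty() {
        let run_id = runs.len();
        runs.push(write_run(&mut chunk, tmp_dir, run_id)?);
    }

    // Phase 2: k-way merge. The min-heap holds the current head line of each run.
    let mut readers: Vec<Lines<BufReader<File>>> = Vec::new();
    for path in &runs {
        readers.push(BufReader::new(File::open(path)?).lines());
    }
    let mut heap = BinaryHeap::new();
    for (i, reader) in readers.iter_mut().enumerate() {
        if let Some(line) = reader.next() {
            heap.push(Reverse((line?, i)));
        }
    }
    let mut writer = BufWriter::new(File::create(output)?);
    while let Some(Reverse((line, i))) = heap.pop() {
        writeln!(writer, "{line}")?;
        if let Some(next) = readers[i].next() {
            heap.push(Reverse((next?, i)));
        }
    }
    writer.flush()?;

    // Close the run files before deleting them.
    drop(readers);
    for path in runs {
        std::fs::remove_file(path)?;
    }
    Ok(())
}
```

In this sketch `chunk_lines` is the memory control knob: a smaller value lowers peak memory but produces more runs to merge. A multi-core variant could sort several chunks at the same time, at the cost of holding several chunks in memory at once.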

Issues

Issues are welcome and appreciated. Please submit them at https://github.com/navigatorsguild/text-file-sort/issues

Benchmarks

Benchmarks generated by benchmark-rs

Examples

use std::path::PathBuf;
use text_file_sort::sort::Sort;

// optimized for use with Jemalloc
use tikv_jemallocator::Jemalloc;
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

// parallel record sort
fn sort_records(input: PathBuf, output: PathBuf, tmp: PathBuf) -> Result<(), anyhow::Error> {
    let mut text_file_sort = Sort::new(vec![input.clone()], output.clone());

    // Set the number of CPU cores the sort will attempt to use. If the number exceeds the
    // number of available CPU cores, the work is split among the available cores with
    // somewhat degraded performance. The default is to use all available cores.
    text_file_sort.with_tasks(2);

    // Set the directory for intermediate results. The default is the system temp dir,
    // std::env::temp_dir(). For large files it is recommended to provide a dedicated
    // directory for intermediate files, preferably on the same file system as the output.
    text_file_sort.with_tmp_dir(tmp);

    text_file_sort.sort()
}
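Since the comments above note that requesting more tasks than there are cores degrades performance, one option is to cap the requested count before configuring the sort. The following is a minimal sketch using only the standard library. The helper name `capped_tasks` is hypothetical, and passing its result to `with_tasks` assumes that method accepts a `usize`.

```rust
use std::thread;

/// Caps a requested task count at the number of cores the OS reports.
/// Falls back to 1 when the core count cannot be determined, and never returns 0.
/// (Hypothetical helper for illustration, not part of text-file-sort.)
fn capped_tasks(requested: usize) -> usize {
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    requested.clamp(1, available)
}
```

Under that assumption, a call such as `text_file_sort.with_tasks(capped_tasks(8))` would ask for at most 8 tasks and fewer on smaller machines.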

License: MIT OR Apache-2.0

Dependencies

~5–14MB
~188K SLoC