qsv: Blazing-fast CSV data-wrangling toolkit
Hi-ho "Quicksilver" away!
qsv (pronounced "Quicksilver") is a command line program for querying, indexing, slicing, analyzing, filtering, enriching, transforming, sorting, validating & joining CSV files. Commands are simple, fast & composable.
* Installation Options
* Whirlwind Tour / Notebooks
* Performance Tuning
* 👉 Benchmarks 🚀
* Environment Variables
* Feature Flags
* NYC School of Data 2022 slides
Try it out at qsv.dathere.com!
|apply||Apply series of string, date, math & currency transformations to given CSV column/s. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection).|
|applydp||A slimmed-down version of apply, available only in the qsvdp binary variant.|
|behead||Drop headers from a CSV.|
|cat||Concatenate CSV files by row or by column.|
|count||Count the rows in a CSV file. (15.82 seconds for the 15gb, 28m row NYC 311 dataset without an index. Instantaneous with an index.)|
|dedup||Remove duplicate rows (see also the extdedup command).|
|describegpt||Infer extended metadata about a CSV using a GPT model from OpenAI's API.|
|diff||Find the difference between two CSVs with ludicrous speed! e.g. compare two CSVs with 1M rows x 9 columns in under 600ms!|
|enum||Add a new column enumerating rows with incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.|
|excel||Export a specified Excel/ODS sheet to a CSV file.|
|exclude||Remove a set of CSV data from another set based on the specified columns.|
|explode||Explode rows into multiple ones by splitting a column value based on the given separator.|
|extdedup||Remove duplicate rows from an arbitrarily large CSV/text file using a memory-mapped, on-disk hash table. Unlike the dedup command, it does not need to load the entire file into memory.|
|extsort||Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm.|
|fetch||Fetch data from web services for every row using HTTP GET. Comes with HTTP/2 adaptive flow control, jql JSON query language support, dynamic throttling (RateLimit) & caching with optional Redis support for persistent caching.|
|fill||Fill empty values.|
|fixlengths||Force a CSV to have same-length records by either padding or truncating them.|
|flatten||A flattened view of CSV records. Useful for viewing one record at a time.|
|fmt||Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)|
|foreach||Loop over a CSV to execute shell commands. (not available on Windows)|
|frequency||Build frequency tables of each column. Uses multithreading to go faster if an index is present.|
|geocode||Geocode a location against an updatable local copy of the Geonames cities database. With caching and multi-threading, it geocodes up to 360,000 records/sec!|
|generate||Generate test data by profiling a CSV using Markov decision process machine learning.|
|headers||Show the headers of a CSV. Or show the intersection of all headers between many CSV files.|
|index||Create an index (📇) for a CSV. This is very quick (even the 15gb, 28m row NYC 311 dataset takes all of 15 seconds to index) & provides constant time indexing/random access into the CSV. With an index, the count, sample & slice commands are instantaneous.|
|input||Read CSV data with special commenting, quoting, trimming, line-skipping & non-UTF8 encoding handling rules. Typically used to "normalize" a CSV for further processing with other qsv commands.|
|join||Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast.|
|joinp||Inner, outer, cross, anti, semi & asof joins using the Pola.rs engine. Unlike the join command, it can process files larger than memory.|
|jsonl||Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See the tojsonl command to convert CSV to JSONL.|
|luau||Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a Luau 0.604 expression/script for every row of a CSV file (sequential mode), or using random access with an index (random access mode). Can process a single Luau expression or full-fledged data-wrangling scripts using lookup tables with discrete BEGIN, MAIN and END sections. It is not just another qsv command; it is qsv's Domain-specific Language (DSL) with numerous qsv-specific helper functions to build production data pipelines.|
|partition||Partition a CSV based on a column value.|
|pseudo||Pseudonymise the values of the given column by replacing them with an incremental identifier.|
|py||Create a new computed column or filter rows by evaluating a Python expression on every row of a CSV file. Python's f-strings are particularly useful for extended formatting, with the ability to evaluate Python expressions as well.|
|rename||Rename the columns of a CSV efficiently.|
|replace||Replace CSV data using a regex. Applies the regex to each field individually.|
|reverse||Reverse the order of rows in a CSV. Unlike sort --reverse, it does not sort the data; it simply reverses the row order.|
|safenames||Modify headers of a CSV to only have "safe" names - guaranteed "database-ready"/"CKAN-ready" names.|
|sample||Randomly draw rows (with optional seed) from a CSV using reservoir sampling, using memory proportional to the sample size. If an index is present, it uses random indexing with constant memory.|
|schema||Infer schema from CSV data, replete with data type & domain/range validation, and output it in JSON Schema format. Uses multithreading to go faster if an index is present. See also the validate command.|
|search||Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows.|
|searchset||Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.|
|select||Select, re-order, duplicate or drop columns.|
|slice||Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).|
|snappy||Does streaming compression/decompression of the input using Google's Snappy framing format.|
|sniff||Quickly sniff & infer CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, average record length, number of records, content length & estimated number of records if sniffing a CSV on a URL, number of fields, field names & data types). It is also a general mime type detector.|
|sort||Sort CSV data in alphabetical (with case-insensitive option), numerical, reverse, unique or random (with optional seed) order (see also the extsort command).|
|sortcheck||Check if a CSV is sorted. With the --json option, also retrieve record count, sort breaks & duplicate count.|
|split||Split one CSV file into many CSV files of N chunks. Uses multithreading to go faster if an index is present.|
|sqlp||Run blazing-fast Polars SQL queries against several CSVs - converting queries to fast LazyFrame expressions and processing larger-than-memory CSV files.|
|stats||Compute summary statistics (sum, min/max/range, min/max length, mean, stddev, variance, nullcount, sparsity, quartiles, IQR, lower/upper fences, skewness, median, mode/s, antimode/s & cardinality) & make GUARANTEED data type inferences (Null, String, Float, Integer, Date, DateTime, Boolean) for each column in a CSV. Uses multithreading to go faster if an index is present (with an index, it can compile "streaming" stats on NYC's 311 data (15gb, 28m rows) in less than 20 seconds).|
|table||Show aligned output of a CSV using elastic tabstops. To interactively view CSV files, qsv pairs well with csvlens.|
|to||Convert CSV files to PostgreSQL, SQLite, XLSX, Parquet and Data Package.|
|tojsonl||Smartly convert CSV to newline-delimited JSON (JSONL/NDJSON). By scanning the CSV first, it "smartly" infers the appropriate JSON data type for each column. See the jsonl command to convert JSONL to CSV.|
|transpose||Transpose the rows/columns of a CSV.|
|validate||Validate CSV data blazingly-fast using JSON Schema Validation & put invalid records into a separate file with an accompanying detailed validation error report file (e.g. up to 350,000 rows/second using NYC's 311 schema generated by the schema command). If no JSON schema file is provided, validates if a CSV conforms to the RFC 4180 standard and is UTF-8 encoded.|
✨: enabled by a feature flag.
📇: uses an index when available.
🤯: loads the entire CSV into memory, though dedup, stats & transpose have "streaming" modes as well.
😣: uses additional memory proportional to the cardinality of the columns in the CSV.
🧠: expensive operations are memoized (cached) with available inter-session Redis caching for fetch commands.
🐻❄️: command powered by Pola.rs engine.
🤖: command uses Natural Language Processing & General AI techniques.
🏎️: multithreaded and/or faster when an index (📇) is available.
🚀: multithreaded even without an index.
: has CKAN-aware integration options.
🌐: has web-aware options.
Option 1: Download Prebuilt Binaries
These prebuilt binaries are also built with CPU optimizations enabled for x86_64 (e.g. SSE4.2, AVX2, AVX512, etc. on Intel and AMD processors) and Apple Silicon processors (ARM64 SIMD NEON) for even more performance gains.
For Windows, an MSI Installer wrapping the x86_64-pc-windows-msvc build is also available for download.
For macOS, "ad-hoc" signatures are used to sign our binaries, so you will need to set appropriate Gatekeeper security settings or run the following command to remove the quarantine attribute from qsv before you run it for the first time:
```
# replace qsv with qsvlite or qsvdp if you installed those binary variants
xattr -d com.apple.quarantine qsv
```
Verifying the Integrity of the Prebuilt Binaries Zip Archives
```
# if you don't have zipsign installed yet
cargo install zipsign

# verify the integrity of the downloaded prebuilt binary zip archive
# after downloading the zip archive and the qsv-zipsign-public.key file.
# replace <PREBUILT-BINARY-ARCHIVE.zip> with the name of the downloaded zip archive
# e.g. zipsign verify zip qsv-0.118.0-aarch64-apple-darwin.zip qsv-zipsign-public.key
zipsign verify zip <PREBUILT-BINARY-ARCHIVE.zip> qsv-zipsign-public.key
```
Option 2: Homebrew
```
brew install qsv
```
Option 3: Install with Rust
If you have Rust installed, you can also install from source using Rust's cargo command[^1]:
[^1]: Of course, you'll also need a linker & a C compiler. Linux users should generally install GCC or Clang, according to their distribution's documentation. For example, if you use Ubuntu, you can install the build-essential package. On macOS, you can get a C compiler by running `xcode-select --install`. For Windows, this means installing Visual Studio 2022. When prompted for workloads, include "Desktop Development with C++", the Windows 10 or 11 SDK & the English language pack, along with any other language packs you require.
```
cargo install qsv --locked --features all_features
```

The binary will be installed in `~/.cargo/bin`.

```
# to install qsv with all features enabled
cargo install qsv --locked --bin qsv --features feature_capable,apply,generate,luau,fetch,foreach,python,to,self_update,polars

# or shorthand
cargo install qsv --locked --bin qsv -F all_features

# or enable only the apply and polars features
cargo install qsv --locked --bin qsv -F feature_capable,apply,polars

# or to install qsvlite
cargo install qsv --locked --bin qsvlite -F lite

# or to install qsvdp
cargo install qsv --locked --bin qsvdp -F datapusher_plus,luau
```
Option 4: Compile from Source
Compiling from source also works similarly[^1]:
```
git clone https://github.com/jqnatividad/qsv.git
cd qsv
cargo build --release --locked --bin qsv --features all_features
```

The compiled binary will end up in `./target/release/`.

```
# to compile qsv with all features enabled
cargo build --release --locked --bin qsv --features feature_capable,apply,generate,luau,fetch,foreach,python,to,self_update,polars

# shorthand
cargo build --release --locked --bin qsv -F all_features

# or build qsv with only the fetch and foreach features enabled
cargo build --release --locked --bin qsv -F feature_capable,fetch,foreach

# for qsvlite
cargo build --release --locked --bin qsvlite -F lite

# for qsvdp
cargo build --release --locked --bin qsvdp -F datapusher_plus,luau
```
NOTE: To build with Rust nightly, see Nightly Release Builds.
There are three binary variants of qsv:
- qsv - feature-capable (✨), with the prebuilt binaries enabling all applicable features except Python [^2]
- qsvlite - all features disabled (~13% of the size of qsv)
- qsvdp - optimized for use with DataPusher+ with only the DataPusher+ relevant commands; an embedded luau interpreter; applydp, a slimmed-down version of the apply feature; the --progressbar option disabled; and self-update only checking for new releases, requiring an explicit --update (~12% of the size of qsv)
[^2]: The foreach feature is not available on Windows. The python feature is not enabled on the prebuilt binaries. Compile qsv with a Python development environment installed if you want to enable the python feature (Python 3.7 & above supported). The luau feature is enabled by default on the prebuilt binaries if the platform supports it.
Regular Expression Syntax
The --select option and several commands (e.g. sqlp) allow the user to specify regular expressions. We use the regex crate to parse, compile and execute these expressions. [^3]
Its syntax can be found here and "is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched."
If you want to test your regular expressions, regex101 supports the syntax used by the
regex crate. Just select the "Rust" flavor.
qsv recognizes UTF-8/ASCII encoded CSV (.csv) & TSV (.tsv & .tab) files. CSV files are assumed to have "," (comma) as a delimiter, and TSV files, "\t" (tab). The delimiter is a single ASCII character that can be set either by the --delimiter command-line option or the QSV_DEFAULT_DELIMITER environment variable, or automatically detected when QSV_SNIFF_DELIMITER is set.
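To illustrate the extension-to-delimiter convention described above, here is a small sketch (not qsv's actual code - the helper name is made up, and the ".sz" handling mirrors the extended-extension behavior described in the Snappy section later in this document):

```python
import os

def default_delimiter(path: str) -> str:
    """Pick a delimiter from the file extension, per the convention above:
    comma for .csv, tab for .tsv/.tab (case-insensitive)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".sz":  # extended extension, e.g. nyc311.tsv.sz
        return default_delimiter(os.path.splitext(path)[0])
    return "\t" if ext in (".tsv", ".tab") else ","

print(default_delimiter("nyc311.csv") == ",")      # True
print(default_delimiter("nyc311.tsv.sz") == "\t")  # True
```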
When using the --output option, qsv will UTF-8 encode the file & automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv & .tab files.
The fetch & fetchpost commands also produce JSONL files when invoked without the --new-column option, & TSV files when the --output file has a .tsv or .tab extension.
The sniff & validate commands produce JSON files with their JSON options following the JSON API 1.1 specification, so they can return detailed machine-friendly metadata that can be used by other systems.
The schema command produces a JSON Schema Validation (Draft 7) file with the ".schema.json" file extension, which can be used with the
validate command to validate other CSV files with an identical schema.
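As a rough illustration of what schema-driven validation entails (hypothetical code - qsv's validate uses full JSON Schema, not this simplified checker; the function and field names are invented for the example):

```python
def validate_row(row: dict, schema: dict) -> list:
    """Check each field against a declared type/range, collecting errors
    instead of stopping at the first invalid field."""
    errors = []
    for field, rules in schema.items():
        value = row.get(field, "")
        if rules.get("type") == "integer":
            try:
                n = int(value)
            except ValueError:
                errors.append(f"{field}: {value!r} is not an integer")
                continue
            if "minimum" in rules and n < rules["minimum"]:
                errors.append(f"{field}: {n} is below minimum {rules['minimum']}")
    return errors

schema = {"age": {"type": "integer", "minimum": 0}}
print(validate_row({"age": "42"}, schema))  # [] - valid row
print(validate_row({"age": "-1"}, schema))  # one range error
```

Invalid rows keep their error messages, which is what makes a detailed per-record validation report possible.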
The excel command recognizes Excel & Open Document Spreadsheet (ODS) files (.xls, .xlsx, .xlsm, .xlsb & .ods).
The sqlp command returns query results in CSV, JSON, Parquet & Arrow IPC formats. Polars SQL also supports reading external files directly in various formats with its read_csv, read_ndjson, read_parquet & read_ipc table functions.
The sniff command can also detect the mime type of any file with the --just-mime option, be it local or remote (http and https schemes supported).
It can detect more than 120 file formats, including MS Office/Open Document files, JSON, XML,
PDF, PNG, JPEG and specialized geospatial formats like GPX, GML, KML, TML, TMX, TSX, TTML.
Click here for a complete list.
qsv supports automatic compression/decompression using the Snappy frame format. Snappy was chosen instead of more popular compression formats like gzip because it was designed for high-performance streaming compression & decompression (up to 2.58 gb/sec compression, 0.89 gb/sec decompression).
For all commands except the index, extdedup & extsort commands, if the input file has an ".sz" extension, qsv will automatically do streaming decompression as it reads it. Further, if the input file has an extended CSV/TSV ".sz" extension (e.g. nyc311.csv.sz/nyc311.tsv.sz/nyc311.tab.sz), qsv will also use the file extension to determine the delimiter to use.
Similarly, if the
--output file has an ".sz" extension, qsv will automatically do streaming compression as it writes it.
If the output file has an extended CSV/TSV ".sz" extension, qsv will also use the file extension to determine the delimiter to use.
Note however that compressed files cannot be indexed, so index-accelerated commands (frequency, schema, split, stats & tojsonl) will not be multi-threaded. Random access is also disabled without an index, so slice will not be instantaneous and luau's random-access mode will not be available.
There is also a dedicated
snappy command with four subcommands for direct snappy file operations — a multithreaded
compress subcommand (4-5x faster than the built-in, single-threaded auto-compression); a
decompress subcommand with detailed compression metadata; a
check subcommand to quickly inspect if a file has a Snappy header; and a
validate subcommand to confirm if a Snappy file is valid.
The snappy command can be used to compress/decompress ANY file, not just CSV/TSV files.
Using the snappy command, we can compress NYC's 311 data (15gb, 28m rows) to 4.95 gb in 5.77 seconds with the multithreaded compress subcommand - 2.58 gb/sec with a 0.33 (3.01:1) compression ratio. With snappy decompress, we can roundtrip decompress the same file in 16.71 seconds - 0.89 gb/sec.
Compare that to zip 3.0, which compressed the same file to 2.9 gb in 248.3 seconds on the same machine - 43x slower at 0.06 gb/sec with a 0.19 (5.17:1) compression ratio - for just an additional 14% (2.45 gb) of saved space. zip also took 4.3x longer to roundtrip decompress the same file in 72 seconds - 0.20 gb/sec.
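The Snappy frame format begins with a fixed stream identifier chunk (chunk type 0xFF, 3-byte little-endian length of 6, then the ASCII bytes "sNaPpY"), which is the kind of marker a header check like the check subcommand can look for. A minimal, illustrative sketch (not qsv's code):

```python
# The 10-byte Snappy framing-format stream identifier:
# chunk type 0xFF, 3-byte little-endian length (6), then b"sNaPpY".
SNAPPY_STREAM_ID = b"\xff\x06\x00\x00sNaPpY"

def has_snappy_header(first_bytes: bytes) -> bool:
    """True if the buffer starts with the Snappy stream identifier."""
    return first_bytes.startswith(SNAPPY_STREAM_ID)

print(has_snappy_header(SNAPPY_STREAM_ID + b"\x00\x0b\x00\x00"))  # True
print(has_snappy_header(b"PK\x03\x04"))  # False - that's a zip file
```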
RFC 4180 CSV Standard
qsv follows the RFC 4180 CSV standard. However, in real life, CSV formats vary significantly & qsv is actually not strictly compliant with the specification so it can process "real-world" CSV files. qsv leverages the awesome Rust CSV crate to read/write CSV files.
Click here to find out more about how qsv conforms to the standard using this crate.
When dealing with "atypical" CSV files, you can use the
input command to normalize them to be RFC 4180-compliant.
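For a feel of the RFC 4180 quoting rules qsv follows, here is a small illustration (using Python's csv module rather than qsv itself): fields containing delimiters or quotes are wrapped in double quotes, and embedded double quotes are doubled.

```python
import csv
import io

# Write one record whose fields contain a quote and a comma.
buf = io.StringIO()
csv.writer(buf).writerow(["a", 'says "hi"', "x,y"])
line = buf.getvalue()
print(line)  # a,"says ""hi""","x,y"

# Reading it back recovers the original fields exactly.
row = next(csv.reader(io.StringIO(line)))
assert row == ["a", 'says "hi"', "x,y"]
```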
qsv requires UTF-8 encoded input (of which ASCII is a subset).
Should you need to re-encode CSV/TSV files, you can use the
input command to "lossy save" to UTF-8 - replacing invalid UTF-8 sequences with
� (U+FFFD REPLACEMENT CHARACTER).
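What "lossy save" means in practice can be seen with Python's equivalent replace-on-decode behavior (illustrative only, not how qsv is implemented):

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here
text = raw.decode("utf-8", errors="replace")
print(text)  # caf� au lait - the invalid byte became U+FFFD
assert text == "caf\ufffd au lait"
```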
Windows Excel Usage Note
Unlike other modern operating systems, Microsoft Windows' default encoding is UTF16-LE. This will cause problems when redirecting qsv's output to a CSV file & trying to open it with Excel (which ignores the comma delimiter, with everything in the first column if the file is UTF16-LE encoded):
```
qsv stats wcp.csv > wcpstats.csv
```
Which is weird, since you would think Microsoft's own Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file on Windows, use the
--output option instead:
```
# DON'T: redirection on Windows produces a UTF16-LE encoded file
qsv stats wcp.csv > wcpstats.csv

# DO: the --output option produces a UTF-8 encoded file
qsv stats wcp.csv --output wcpstats.csv
```
Alternatively, qsv can add a Byte Order Mark (BOM) to the beginning of a CSV to indicate it's UTF-8 encoded. You can do this by setting the QSV_OUTPUT_BOM environment variable to 1. This will allow Excel on Windows to properly recognize the CSV file as UTF-8 encoded.
Note that this problem does not occur on Excel on macOS, as macOS uses UTF-8 as its default encoding.
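The effect of a UTF-8 BOM can be mimicked in Python with the utf-8-sig codec, which prepends the 3-byte BOM (EF BB BF) that Excel on Windows uses to detect the encoding (an illustrative sketch; the file name is arbitrary):

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wcpstats.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["city", "population"])

# The first three bytes of the file are the UTF-8 BOM.
with open(path, "rb") as f:
    head = f.read(3)
print(head == b"\xef\xbb\xbf")  # True
```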
For complex data-wrangling tasks, you can use Luau and Python scripts.
Luau is recommended over Python for complex data-wrangling tasks as it is faster, more memory-efficient, has no external dependencies and has several data-wrangling helper functions as qsv's DSL.
See Luau vs Python for more info.
qsv supports three memory allocators - mimalloc (default), jemalloc and the standard allocator.
See Memory Allocator for more info.
It also has Out-of-Memory prevention, with two modes - NORMAL (default) & CONSERVATIVE.
See Out-of-Memory Prevention for more info.
Environment Variables & dotenv file support
qsv supports an extensive list of environment variables and supports
.env files to set them.
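A .env file is just KEY=VALUE lines. A minimal sketch of how such a file might be parsed (hypothetical code - qsv's actual dotenv handling is richer; the variable names shown are real qsv environment variables used purely as sample data):

```python
def parse_dotenv(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

sample = '''
# qsv settings
QSV_DEFAULT_DELIMITER=";"
QSV_OUTPUT_BOM=1
'''
print(parse_dotenv(sample))  # {'QSV_DEFAULT_DELIMITER': ';', 'QSV_OUTPUT_BOM': '1'}
```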
qsv has several feature flags that can be used to enable/disable optional features.
See Features for more info.
Minimum Supported Rust Version
qsv's MSRV policy is to require the latest Rust version supported by Homebrew. However, if the latest Rust stable has been released for more than a week and Homebrew has not yet updated its Rust formula, qsv will go ahead and require the latest Rust stable version.
Goals / Non-Goals
QuickSilver's goals, in priority order, are to be:
- As Fast as Possible - To do so, it has frequent releases, an aggressive MSRV policy, takes advantage of CPU features, employs various caching strategies, uses HTTP/2, and is multi-threaded when possible and it makes sense. See Performance for more info.
- Able to Process Very Large Files - Most qsv commands are streaming, using constant memory, and can process arbitrarily large CSV files. For those commands that require loading the entire CSV into memory (denoted by 🤯), qsv has Out-of-Memory prevention, batch processing strategies and "ext"ernal commands that use the disk to process larger than memory files. See Memory Management for more info.
- A Complete Data-Wrangling Toolkit - qsv aims to be a comprehensive data-wrangling toolkit that you can use for quick analysis and investigations, but is also robust enough for production data pipelines. Its many commands are targeted towards common data-wrangling tasks and can be combined/composed into complex data-wrangling scripts with its Luau-based DSL.
Luau will also serve as the backbone of a whole library of qsv recipes - reusable scripts for common tasks (e.g. street-level geocoding, removing PII, data enrichment, etc.) that prompt for easily modifiable parameters.
- Composable/Interoperable - qsv is designed to be composable, with a focus on interoperability with other common CLI tools like 'awk', 'xargs', 'ripgrep', 'sed', etc., and with well known ETL/ELT tools like Airbyte, Airflow, Pentaho Kettle, etc. Its commands can be combined with other tools via pipes, and it supports other common file formats like JSONL, Parquet, Arrow IPC, Excel, ODS, PostgreSQL, SQLite, etc. See File Formats for more info.
- As Portable as Possible - qsv is designed to be portable, with installers on several platforms with an integrated self-update mechanism. In preference order, it supports Linux, macOS and Windows. See Installation Options for more info.
- As Easy to Use as Possible - qsv is designed to be easy to use. As easy-to-use that is, as command line interfaces go 🤷. Its commands have numerous options but have sensible defaults if a user does not want to use options. The usage text is written for a data analyst audience, not developers; and there are numerous examples in the usage text, with the tests doubling as examples as well. In the future, it will also have a TUI (Terminal User Interface) mode.
- As Secure as Possible - qsv is designed to be secure. It has no external runtime dependencies, is written in Rust, and its codebase is automatically audited for security vulnerabilities with automated DevSkim and "cargo audit" GitHub Actions workflows. It uses the latest stable Rust version, with an aggressive MSRV policy and the latest version of all its dependencies. It has an extensive test suite with more than 1,200 tests, including several property tests which randomly generate parameters for oft-used commands. It also has a Security Policy. Its prebuilt binary archives are zipsigned, so you can manually verify their integrity. Its self-update mechanism also verifies the integrity of the downloaded binaries archive before applying an update. However, it does not use cryptographically secure random number generators, as the performance penalty is too high and qsv's sample use cases do not require it. (Search the codebase for "//DevSkim: ignore DS148264" to find instances where qsv uses a non-cryptographically secure random number generator.)
- As Easy to Contribute to as Possible - qsv is designed to be easy to contribute to, with a focus on maintainability. Its architecture allows the easy addition of self-contained commands gated by feature flags, the source code is heavily commented, the usage text is embedded, and there are helper functions that make it easy to create tests. See Features and Contributing for more info.
QuickSilver's non-goals are to be:
- As Small as Possible - qsv is designed to be small, but not at the expense of performance, features, composability, portability, usability, security or maintainability. However, we do have a qsvlite variant that is ~13% of the size of qsv and a qsvdp variant that is ~12% of the size of qsv. Those variants, however, have reduced functionality. Further, several commands are gated behind feature flags, so you can compile qsv with only the features you need.
- Multi-lingual - qsv's usage text and messages are English-only. There are no plans to support other languages. This does not mean it can only process English CSVs.
It can process well-formed CSVs in any language so long as they're UTF-8 encoded. Further, it supports alternate delimiters/separators other than comma; the apply whatlang operation detects 69 languages; and its apply thousands, currency and eudex operations support different languages and country conventions for number, currency and date parsing/formatting. Finally, though the default Geonames index of the geocode command is English-only, the index can be rebuilt with the geocode index-update subcommand using the --languages option to return place names in multiple languages (with support for 253 languages).
qsv has ~1,220 tests in the tests directory.
Each command has its own test suite in a separate file with the convention test_<command>.rs.
Apart from preventing regressions, the tests also serve as good illustrative examples, and are often linked from the usage text of each corresponding command.
To test each binary variant:
```
# to test qsv
cargo test --features all_features

# to test qsvlite
cargo test --features lite

# to test all tests with "stats" in the name with qsvlite
cargo test stats --features lite

# to test qsvdp
cargo test --features datapusher_plus,luau

# to test a specific command
# here we test only stats and use the
# t alias for test and the -F shortcut for --features
cargo t stats -F all_features

# to test a specific command with a specific feature
# here we test only the luau command with the luau feature
cargo t luau -F feature_capable,luau

# to test the count command with multiple features
cargo t count -F feature_capable,luau,polars

# to test using an alternate allocator
# other than the default mimalloc allocator
cargo t --no-default-features -F all_features,jemallocator
```
Dual-licensed under MIT or the UNLICENSE.
Quicksilver (qsv) is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. On top of xsv's 20 commands, it adds numerous new features; 37 additional commands; 5 apply subcommands & 36 operations; 5 to subcommands; 3 cat subcommands; 7 geocode subcommands & 4 index operations; and 4 snappy subcommands.
See FAQ for more details.
qsv was made possible by datHere - standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used.
This project is unrelated to Intel's Quick Sync Video.