6 releases (2 stable)

1.3.0	Apr 24, 2024
1.0.0	May 30, 2023
0.1.3	Aug 4, 2022
0.1.1	Jun 6, 2022
0.1.0	Apr 26, 2022

Drug Extraction CLI

Demo

demo-gif

Description

This application takes a CSV file of search terms and parses text records from another CSV file to detect and extract mentions of those terms, using string similarity algorithms to account for common misspellings. It is named for the drug searching it most commonly does for us at IPOP, but it is flexible enough to accept any type of search term.
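To give a feel for how a string similarity measure catches misspellings, here is a minimal Python sketch of the optimal string alignment (OSA) distance, the metric the Output Data Dictionary lists for the edits column. It is an illustration only, not the tool's own code: the names osa_distance, is_candidate, and MAX_EDITS are hypothetical, and the cutoff of 2 is taken from the upper limit documented for that column.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Insertions, deletions, substitutions, and swaps of two adjacent
    characters each cost one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "io" typed as "oi".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Assumption: mirrors the 0-2 range documented for the `edits` column.
MAX_EDITS = 2


def is_candidate(search_term: str, word: str) -> bool:
    """True when word is within MAX_EDITS edits of search_term."""
    return osa_distance(search_term, word) <= MAX_EDITS
```

With this sketch, is_candidate("heroin", "herion") is true because the swapped letters count as a single edit, while an unrelated word such as "water" is rejected.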

NOTE: In our text-preprocessing, we specifically allow hyphens ("-") due to their frequency in drug terminologies. If you want to see this functionality removed or put behind a feature flag, please file an Issue.
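The effect of keeping hyphens can be sketched in Python as follows. This is an assumption-heavy illustration, not the tool's actual preprocessing: the function name clean_text is hypothetical, and lowercasing, dropping other punctuation, and collapsing whitespace are guesses at what "cleaned and normalized" might involve. Only the hyphen handling comes from the note above.

```python
import re

# Assumption: anything that is not a lowercase ASCII letter, digit,
# whitespace, or hyphen is treated as a separator.
_NOT_ALLOWED = re.compile(r"[^a-z0-9\s-]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase text and strip punctuation, but keep hyphens."""
    lowered = text.lower()
    without_punctuation = _NOT_ALLOWED.sub(" ", lowered)
    return _WHITESPACE.sub(" ", without_punctuation).strip()
```

Under these assumptions a hyphenated name such as "Co-Codamol" survives as the single token "co-codamol" instead of being split in two, which is why the hyphen matters for matching.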

If you are wondering about specific use cases, check out the Examples folder!

Requires

cargo package manager (rust toolchain)
just (optional dev-dependency if you clone this repo)
Valid UTF-8 encoded CSV data
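If you are unsure whether a CSV file meets the UTF-8 requirement, a quick check before running the tool can save a failed run. The Python sketch below is one way to do that; the function name first_invalid_utf8_offset is hypothetical, and it reads the whole file into memory, so it is only meant for files that fit comfortably in RAM.

```python
from pathlib import Path


def first_invalid_utf8_offset(path):
    """Return None if the file decodes as UTF-8.

    Otherwise return the byte offset of the first invalid byte.
    """
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None
```

A return value of None means the file is safe to pass along; any other value points at the first byte to inspect, which often turns out to be a character saved in a legacy encoding such as Latin-1.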

Installation

To install the drug-extraction-cli application, follow the instructions for your ecosystem:

Python Developers / Data Scientists

Please use pipx since it is designed specifically for this use case of installing Python CLI apps into isolated virtual environments.

pipx install extract-drugs

Rust Developers

cargo install drug-extraction-cli

IMPORTANT! Both of these will install an executable called extract-drugs. Whichever packaging index you install from, the binary keeps this name so that commands stay intuitive.

INFO: The naming discrepancy is due to how maturin handles package names and to wanting both to keep the same CLI command/name and to maintain the Rust namespace. Apologies, but you'll be fine 🙂.

Usage

This application has two commands: interactive and search. Both have the same underlying functionality. search lets you pass the configuration as command-line arguments and is better suited to automated processing or advanced users, while interactive prompts you for the same configuration options and is better for new or first-time users.

API documentation for the library can be found on docs.rs.

Interactive

This will present you with a series of prompts to help you select correct options. Highly recommended for new users or one-off runs.

Usage:

extract-drugs interactive

This command is demoed in the GIF above.

Search

search functions the same as interactive but allows you to declaratively provide the configuration options.

Output Data Dictionary

This tool will output an output.csv file with the following format:

Column Name	Description	Data Type	Limits/Ranges
row_id	Identifier from `--id-col` if provided, else line number of row in `--data-file`	String	None
search_term	The search term, cleaned and normalized. This is the actual term that was compared.	String	None
matched_term	The matched term, cleaned and normalized. This is the actual term that was compared.	String	None
edits	The `osa` edit distance	Integer	0-2 (top limit due to exclusion filter)
similarity_score	The `jaro_winkler` similarity score	Float	0.95-1.0 (bottom limit due to exclusion filter)
search_field	The field that this match was found in, from `--search-cols`	String	None
metadata	The attached metadata to `search_term` in the search_terms file	String or None	None
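As a starting point for working with the results, the Python sketch below counts matches per search term. It assumes that output.csv is comma-delimited and that its header row uses exactly the column names from the table above; the function name count_matches and its max_edits parameter are hypothetical.

```python
import csv
from collections import Counter


def count_matches(path="output.csv", max_edits=2):
    """Count rows per search_term whose edit distance is at most max_edits."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: `edits` is written as a plain integer.
            if int(row["edits"]) <= max_edits:
                counts[row["search_term"]] += 1
    return counts
```

Calling count_matches(max_edits=0) restricts the tally to exact matches, which makes it easy to see how many detections relied on the misspelling tolerance.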

Examples

For a whole showcase of example runs of this tool check out the shell scripts inside the examples folder.

For a showcase of the potential analytical value that can be derived from running this tool, check out the Jupyter Notebooks in the same folder!

Support

If you encounter any issues or need support please either contact @nanthony007 or open an issue.

Contributing

See CONTRIBUTING.md.

MIT License

LICENSE
