6 releases (2 stable)

new 1.3.0 Apr 24, 2024
1.0.0 May 30, 2023
0.1.3 Aug 4, 2022
0.1.1 Jun 6, 2022
0.1.0 Apr 26, 2022

#615 in Parser implementations

34 downloads per month

MIT license

30KB
530 lines


Drug Extraction CLI

Demo

demo-gif

Description

This application takes a CSV file of search terms and parses text records from another CSV file to detect and extract mentions of those terms, using string similarity algorithms to account for common misspellings. It is named for the drug searching we most commonly do at IPOP, but it is flexible enough to accept any type of search term.
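As a rough illustration of the idea described above, the sketch below scans a text record word by word and compares each word against every search term. The `Mention` struct, the `find_mentions` function, and the word-by-word strategy are all hypothetical names and assumptions made for illustration; they are not the crate's actual API. The comparison itself is passed in as a closure, so any similarity rule can be plugged in.

```rust
/// A single detected mention (hypothetical shape, for illustration only).
#[derive(Debug, PartialEq)]
struct Mention {
    search_term: String,
    matched_term: String,
}

/// Compare every whitespace-separated word in `text` against every search
/// term and collect the pairs that `is_match` accepts.
/// ASSUMPTION: word-by-word scanning is a simplification, not necessarily
/// how the real tool tokenizes its input.
fn find_mentions<F>(text: &str, search_terms: &[&str], is_match: F) -> Vec<Mention>
where
    F: Fn(&str, &str) -> bool,
{
    let mut found = Vec::new();
    for word in text.split_whitespace() {
        for &term in search_terms {
            if is_match(term, word) {
                found.push(Mention {
                    search_term: term.to_string(),
                    matched_term: word.to_string(),
                });
            }
        }
    }
    found
}
```

Calling `find_mentions(text, &terms, |t, w| t == w)` gives exact matching only; swapping the closure for a similarity check is what would let misspelled words count as matches.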

NOTE: In our text preprocessing, we specifically allow hyphens ("-") due to their frequency in drug terminologies. If you want to see this functionality removed or put behind a feature flag, please file an Issue.
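To make the hyphen rule concrete, here is a minimal sketch of a cleaning step that preserves hyphens. The function name `clean_text` and the exact rules (lowercasing, replacing other punctuation with spaces) are assumptions for illustration; the crate's real preprocessing may differ.

```rust
/// Normalize text before comparison: lowercase ASCII letters, keep letters,
/// digits, and hyphens, and turn every other character into a word break.
/// NOTE: hypothetical helper, not the crate's actual implementation.
fn clean_text(input: &str) -> String {
    let replaced: String = input
        .chars()
        .map(|c| {
            if c.is_alphanumeric() || c == '-' {
                // Only ASCII letters are lowercased; other characters pass through.
                c.to_ascii_lowercase()
            } else {
                ' '
            }
        })
        .collect();
    // Collapse runs of whitespace into single spaces and trim the ends.
    replaced.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Under these assumed rules, a hyphenated name such as "Co-Codamol" stays one token ("co-codamol") instead of being split into two words.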

If you are wondering about specific use cases, check out the Examples folder!

Requires

  • cargo package manager (rust toolchain)
  • just (optional dev-dependency if you clone this repo)
  • Valid UTF-8 encoded CSV data

Installation

To install the drug-extraction-cli application, use whichever of the following fits your setup:

Python Developers / Data Scientists

Please use pipx since it is designed specifically for this use case of installing Python CLI apps into isolated virtual environments.

pipx install extract-drugs

Rust Developers

cargo install drug-extraction-cli

IMPORTANT! Both of these will install an executable called extract-drugs. Whichever packaging index you install from, the binary keeps that name so that commands stay intuitive.

INFO: The naming discrepancy is due to how maturin handles package names, combined with wanting to keep the same CLI command name while maintaining the Rust namespace. Apologies, but you'll be fine 🙂.

Usage

This application has two commands: interactive and search. Both have the same underlying functionality. The search command lets you pass command-line arguments and is better suited to automated processing or advanced users, while the interactive command prompts you for the same configuration options and is better for new or first-time users.

API documentation for the library can be found on docs.rs.

Interactive

This will present you with a series of prompts to help you select the correct options. Highly recommended for new users or one-off runs.

Usage:

extract-drugs interactive

This command is demoed in the GIF above.

Search

The search command functions the same as interactive but allows you to declaratively provide the configuration options.

Output Data Dictionary

This tool writes an output.csv file with the following format:

| Column Name | Description | Data Type | Limits/Ranges |
| --- | --- | --- | --- |
| row_id | Identifier from --id-col if provided, else line number of row in --data-file | String | None |
| search_term | The search term, cleaned and normalized. This is the actual term that was compared. | String | None |
| matched_term | The matched term, cleaned and normalized. This is the actual term that was compared. | String | None |
| edits | The OSA (optimal string alignment) edit distance | Integer | 0-2 (top limit due to exclusion filter) |
| similarity_score | The Jaro-Winkler similarity score | Float | 0.95-1.0 (bottom limit due to exclusion filter) |
| search_field | The field that this match was found in, from --search-cols | String | None |
| metadata | The attached metadata to search_term in the search_terms file | String or None | None |
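To give a feel for what the edits and similarity_score columns measure, the sketch below implements the two textbook metrics in plain Rust. These are standard definitions written from scratch for illustration, not the crate's own code; the function names are hypothetical, and the 0.7 cutoff for the Winkler prefix bonus follows a common convention that the tool may or may not share.

```rust
/// Optimal string alignment (OSA) distance: the number of insertions,
/// deletions, substitutions, and adjacent transpositions needed to turn
/// `a` into `b`, where no substring is edited more than once.
fn osa_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());
    // d[i][j] = distance between the first i chars of a and the first j of b.
    let mut d = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        d[i][0] = i;
    }
    for j in 0..=m {
        d[0][j] = j;
    }
    for i in 1..=n {
        for j in 1..=m {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            let mut best = (d[i - 1][j] + 1)
                .min(d[i][j - 1] + 1)
                .min(d[i - 1][j - 1] + cost);
            // Adjacent transposition, e.g. "yl" <-> "ly".
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(d[i - 2][j - 2] + 1);
            }
            d[i][j] = best;
        }
    }
    d[n][m]
}

/// Jaro similarity in the range 0.0 to 1.0.
fn jaro(a: &[char], b: &[char]) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Characters match only if they are equal and close enough in position.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut b_used = vec![false; b.len()];
    let mut a_matches: Vec<char> = Vec::new();
    for (i, &ca) in a.iter().enumerate() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_used[j] && b[j] == ca {
                b_used[j] = true;
                a_matches.push(ca);
                break;
            }
        }
    }
    if a_matches.is_empty() {
        return 0.0;
    }
    let b_matches: Vec<char> = (0..b.len()).filter(|&j| b_used[j]).map(|j| b[j]).collect();
    // Matched characters that appear in a different order, counted in pairs.
    let transpositions = a_matches
        .iter()
        .zip(b_matches.iter())
        .filter(|(x, y)| x != y)
        .count()
        / 2;
    let m = a_matches.len() as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - transpositions as f64) / m) / 3.0
}

/// Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix of up to
/// four characters. ASSUMPTION: the bonus is applied only above 0.7.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let j = jaro(&a, &b);
    if j <= 0.7 {
        return j;
    }
    let prefix = a.iter().zip(b.iter()).take(4).take_while(|(x, y)| x == y).count();
    j + 0.1 * prefix as f64 * (1.0 - j)
}
```

For the misspelling "fentanly" against the search term "fentanyl", these definitions give an OSA distance of 1 and a Jaro-Winkler score of about 0.975, both inside the ranges listed in the table.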

Examples

For a full showcase of example runs of this tool, check out the shell scripts inside the examples folder.

For a showcase of the potential analytical value that can be derived from running this tool, check out the Jupyter Notebooks in the same folder!

Support

If you encounter any issues or need support, please either contact @nanthony007 or open an issue.

Contributing

See CONTRIBUTING.md.

MIT License

LICENSE
