#extraction #nlp #text #drug

app drug-extraction-cli

A core library for extracting drugs from text records

4 releases

Uses new Rust 2021

0.1.3 Aug 4, 2022
0.1.2 Aug 2, 2022
0.1.1 Jun 6, 2022
0.1.0 Apr 26, 2022

#34 in Parser tooling


74 downloads per month

MIT license


Drug Extraction CLI

This is the CLI application that consumes the Core library.

Full API documentation can be found on docs.rs.



This application takes a CSV file and parses text records to detect and extract drug mentions using string similarity algorithms to account for common misspellings.

In general, we expect users to know which string similarity algorithm they want and the limits on its interpretations. For more resources on string similarity algorithms see the main ToolBox page. When in doubt, the defaults in the interactive command are quite reasonable.

If you are wondering about specific use cases, check out the Workflows section below!


  • cargo package manager (rust toolchain)
  • just (optional dev-dependency if you clone this repo)


Cargo ships with the Rust toolchain, which can be installed with a single curl + sh command (see here).

To install the drug-extraction-cli application, simply:

cargo install drug-extraction-cli

This will install an executable called extract-drugs.

*IMPORTANT*: The cli package is drug-extraction-cli but the binary program is renamed to extract-drugs for more intuitive commands.


This application has three primary commands: interactive, simple-search, and drug-search. Both of the search commands share similar flags/options while the interactive command guides users through selecting options.

In any of the commands, you can set max-edits to 0 or threshold to 1.0 to return only exact matches 🙂
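To make the exact-match rule concrete, here is a minimal Python sketch of the two kinds of cutoff. It is an illustration only, not the tool's implementation: `levenshtein`, `normalized_similarity`, and `matches` are hypothetical helper names, and the normalization used here (1 minus the distance divided by the longer length) is one common convention that may differ from what the tool does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed convention: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def matches(word, target, max_edits=None, threshold=None):
    """Apply an edit-distance cutoff or a similarity cutoff, whichever is given."""
    if max_edits is not None:
        return levenshtein(word, target) <= max_edits
    if threshold is not None:
        return normalized_similarity(word, target) >= threshold
    raise ValueError("pass either max_edits or threshold")
```

Under these definitions, "fentanil" is one edit away from "fentanyl", so it matches with `max_edits=1` but not with `max_edits=0`. Likewise, a similarity of exactly 1.0 is only reachable when the distance is 0, which is why both settings reduce to exact matching. Note that plain Levenshtein counts swapping two adjacent letters ("coacine" vs. "cocaine") as two edits, whereas Damerau-style metrics count it as one.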


This will present you with a series of prompts to help you select correct options. Highly recommended for new users or one-off runs.


extract-drugs interactive

This helps users select the correct options by, for example, only providing an edit distance limit when an edit-distance metric is selected.

There is also a series of select prompts for choosing the desired id-column and target-column.


It also provides options for the desired output format.

The output file will always be named extracted_drugs and will be suffixed by either .csv or .jsonl depending on your selected format.
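Because the file name is fixed and only the suffix varies, downstream scripts can load either format with the same helper. The following is a small sketch using only the Python standard library; `load_extracted` is a hypothetical name, and nothing here assumes particular field names in the output.

```python
import csv
import json
from pathlib import Path


def load_extracted(path):
    """Return the output rows as a list of dicts, for either .csv or .jsonl."""
    path = Path(path)
    if path.suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as handle:
            # One JSON object per non-empty line.
            return [json.loads(line) for line in handle if line.strip()]
    raise ValueError(f"unexpected output suffix: {path.suffix!r}")
```

Keep in mind that `csv.DictReader` returns every value as a string, while the JSONL path preserves JSON types such as numbers and booleans, so numeric fields may need converting after a CSV load.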

Example: interactive-example

Output: interactive-output

As you can see we even provide some useful comments on the data collected.


Simple Search works great if you have only a few drugs you are interested in and want to search for them exclusively. As noted above, the search variants are a better fit if you are familiar with Linux/shell commands and comfortable passing flags, or if you are using the tool for automation, such as running on multiple files and combining results. For an example of the latter, see Workflows.

BONUS: You can use this to search for MORE than just drugs. For example, we frequently use it to search for COVID-19 (and variants), Narcan/Naloxone, and other key words in the healthcare industry.

You should pass in your search-words separated by a "|" symbol.

The usage here is a bit more complicated:

extract-drugs simple-search \
        cli/data/records.csv \
        --algorithm "l" \
        --max-edits 1 \
        --id-column "Case Number" \
        --target-column "Primary Cause" \
        --search-words "coacine|heroin|Fentanil" \
        --format csv

Remember, you can always use --help after any command to get help and more information.
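For automation, the same invocation can be assembled from Python. The sketch below only uses the flags shown in the example above; `simple_search_command` and `run_simple_search` are hypothetical helper names, and the default values simply mirror that example rather than any documented defaults.

```python
import subprocess


def simple_search_command(data_file, search_words, id_column, target_column,
                          algorithm="l", max_edits=1, output_format="csv"):
    """Build the argument list for `extract-drugs simple-search`."""
    return [
        "extract-drugs", "simple-search", str(data_file),
        "--algorithm", algorithm,
        "--max-edits", str(max_edits),
        "--id-column", id_column,
        "--target-column", target_column,
        # The tool expects a single argument with words separated by "|".
        "--search-words", "|".join(search_words),
        "--format", output_format,
    ]


def run_simple_search(*args, **kwargs):
    """Run the command; requires extract-drugs to be installed and on PATH."""
    command = simple_search_command(*args, **kwargs)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list means no shell is involved, so the "|" separators and column names containing spaces need no extra quoting.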

Help: simple-search-help


For more direct access to drugs of interest, we provide interactivity with the popular RxNorm resource. We utilize RxClass in order to get a group of drugs as opposed to a single/few drugs (for which simple-search should be used).

We specifically use this request, so if you need to know exactly what we are looking for you will find it there in the RxClass documentation.

RxClass/RxNorm Operations

This command requires knowledge of two key data points regarding your target RxClass:

  1. The RxClass ID
  2. The RxClass RelaSource

We can find these by using the NIH/NLM's RxNav explorer. This page contains all of the information that we will need. Below is a screenshot demonstrating usage of the navigator to find the correct parameters to pass to this drug-extraction tool.


A full list of RelaSources can be found here, although I recommend just sticking to either ATC or MESH and then using the RxNav explorer to find the target Class ID.

The usage here is very similar to simple-search but replaces search-words with RxClass information.

extract-drugs drug-search cli/data/records.csv \
    --algorithm "d" \
    --max-edits 2 \
    --target-column "Primary Cause" \
    --rx-class-id N02A \
    --rx-class-relasource ATC \
    --format jsonl

Remember, you can always use --help after any command to get help and more information.

Help: drug-search-help


To get more help use the CLI:

extract-drugs --help



We see two primary workflows for this tool. I do not consider one-off runs to need a repeatable "workflow", which is why the interactive command is suggested for them 😃.

However, sometimes we want more. Maybe we want to run the tool using drug-search on an RxClass and then run it again using simple-search and combine the results. Or maybe we want to run the exact same command twice just switching the --target-column to another text field.
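Since the output file is always named extracted_drugs, repeated runs need to move each result aside before the next run starts. Here is a minimal sketch of the "same command, different --target-column" case. It assumes the tool writes its output into the current working directory, which is an assumption rather than documented behavior; `run_per_column` and the renamed file pattern are hypothetical, and the `runner` parameter exists only so the subprocess call can be swapped out for testing.

```python
import shutil
import subprocess
from pathlib import Path


def run_per_column(base_command, target_columns, workdir=".",
                   runner=subprocess.run):
    """Run the same search once per target column, keeping each CSV output.

    base_command must contain everything except --target-column and --format.
    """
    workdir = Path(workdir)
    saved = []
    for column in target_columns:
        command = [*base_command, "--target-column", column, "--format", "csv"]
        runner(command, check=True, cwd=workdir)
        # Assumed location of the output: the working directory of the run.
        produced = workdir / "extracted_drugs.csv"
        safe_name = column.lower().replace(" ", "_")
        kept = workdir / f"extracted_drugs_{safe_name}.csv"
        shutil.move(str(produced), str(kept))
        saved.append(kept)
    return saved
```

The returned paths can then be loaded and concatenated with whatever analysis tool you prefer. The same pattern applies to running drug-search followed by simple-search: run one, rename its output, then run the other.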

A (very) general workflow:

flowchart LR
    title[General Data Flow]
    A(Get your data) --> B(Preprocess your data)
    B --> C2(Run other tools)
    C2 --> D
    B --> C(Run Drug-Extraction ToolBox)
    B --> D(Analyze your data/output)
    C --> D
    D --> E(Report results)

We provide some convenience scripts on the main Toolbox page for a few key workflows we see. These are written in Python and have no dependencies beyond the language itself, so they should run smoothly. Note that the tool is invoked from Python via the subprocess module and the data is then manipulated inside Python. Python was chosen because of my personal proficiency with it and its ubiquity as a data science tool.

Workflow Examples

  1. Convert CSV output to wide-form (flag-oriented)
    1. This is useful for researchers/analysts who want to see record-level results of the tool.
    2. This requires that you provided an --id-column to the tool.
    3. This example uses simple-search, so be sure to adapt it to drug-search by switching both the command options and the column you access in the CSV file.
    4. After this you can join both files on the ID columns using an analytic tool like pandas
  2. Running on multiple columns and combining results
  3. Running drug-search then simple-search and combining results
  4. de-workflow
    1. A simple wrapper around this project to automate multi-column runs, wide-form data creation, and report generation.
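As a rough illustration of the first workflow, the wide-form conversion can be sketched in a few lines of standard-library Python. This is not the convenience script from the Toolbox page: `to_wide` is a hypothetical helper, and the field names passed to it (`record_id` and `search_term` in the usage note) are placeholders that must be replaced with the actual column names in your output file.

```python
from collections import defaultdict


def to_wide(rows, id_field, term_field):
    """Turn one-row-per-match data into one-row-per-record 0/1 flags."""
    terms = sorted({row[term_field] for row in rows})
    # Every record starts with all flags at 0; matches flip them to 1.
    flags = defaultdict(lambda: dict.fromkeys(terms, 0))
    for row in rows:
        flags[row[id_field]][row[term_field]] = 1
    return [{id_field: record_id, **record_flags}
            for record_id, record_flags in flags.items()]
```

For example, `to_wide(rows, "record_id", "search_term")` yields one dict per record ID with a 0/1 column per detected term. Records with no matches never appear in the tool's output, so they will be missing from the wide-form table as well; a left join from the original data onto this table (for instance in pandas) is one way to bring them back with empty flags.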


If you encounter any issues or need support please either contact @nanthony007 or open an issue.


Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

See CONTRIBUTING.md for more details.

MIT License


