
corpus-preproc

A preprocessor for text and HTML corpora

1 unstable release

0.1.0 Feb 6, 2022


MIT license


Corpus Preprocessor


CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.

Features

  • Parallel processing of files in a directory (CLI only)
  • NFKC and whitespace normalization
  • Removal of modifiers and marks
  • Lower-case folding
  • Trimming of punctuation around words
  • Replace words with an <unk> placeholder if they meet any of the following criteria:
    • The word contains an at sign (@)
    • The word lacks alphabetic characters
    • The word contains two punctuation characters in a row, as in http://
  • HTML code is parsed and CSS selectors can be used to:
    • Remove undesired elements
    • Insert newlines after paragraphs and line breaks
    • Extract the main content of an HTML document
  • Text is automatically converted to UTF-8 if the original encoding is in the Encoding Standard.
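The character- and word-level rules above can be sketched in Python. This is an illustrative approximation only, not the crate's actual Rust implementation; the exact set of mark categories and punctuation characters it uses may differ:

```python
import unicodedata

def normalize_chars(text, keep_marks=False, lowercase=True):
    # NFKC normalization folds compatibility characters (e.g. fullwidth forms)
    text = unicodedata.normalize("NFKC", text)
    if not keep_marks:
        # Decompose, then drop combining marks (Mn/Mc/Me) and
        # modifier symbols/letters (Sk/Lm) -- assumed categories
        text = "".join(
            c for c in unicodedata.normalize("NFKD", text)
            if unicodedata.category(c) not in ("Mn", "Mc", "Me", "Sk", "Lm")
        )
    if lowercase:
        text = text.lower()
    # Whitespace normalization: collapse runs of whitespace to single spaces
    return " ".join(text.split())

PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def replace_unk(word):
    # Approximate <unk> criteria from the feature list above
    has_at = "@" in word
    no_alpha = not any(c.isalpha() for c in word)
    double_punct = any(a in PUNCT and b in PUNCT for a, b in zip(word, word[1:]))
    return "<unk>" if (has_at or no_alpha or double_punct) else word
```

For example, `replace_unk("http://example.com")` yields `<unk>` because of the consecutive punctuation in `://`, while `normalize_chars("Ça   VA")` yields `ca va`.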

Usage

Command Line Interface (CLI)

# Install
$ cargo install corpus-preproc
# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>

ARGS:
    <INPUT>     
    <OUTPUT>    

OPTIONS:
    -c
            Clean HTML tags

        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content

        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]

    -h, --help
            Print help information

    -l
            Perform case-folding

    -m
            Keep modifiers and marks on normalization

    -n
            Perform NFKC and whitespace normalization

        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]

    -p
            Trim punctuation surrounding words

    -t <THREADS>
            Number of threads to use [default: 4]
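Putting the options together, a typical invocation (assuming a hypothetical ./html_corpus/ input directory; all flags are as documented in the help text above) might look like:

```shell
# Clean an HTML corpus directory into a single normalized text file:
# -c enables HTML cleaning, -n NFKC/whitespace normalization,
# -l case folding, -p punctuation trimming, -t 8 uses 8 threads
corpus-preproc clean -c -n -l -p -t 8 ./html_corpus/ ./corpus.txt
```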

HTTP API

Startup

$ corpus-preproc serve 127.0.0.1:8000

Python Example

The requests Python library needs to be installed.

import requests
import json

DEFAULT_CONFIG = {
  "htmlClean": {
    "enabled": True,
    "contentSelector": None,
    "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
    "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
  },
  "charNormalization": {
    "enabled": True,
    "keepModifiersAndMarks": False,
    "lowercase": True,
  },
  "wordNormalization": {
    "enabled": True,
    "replacePii": True,
  }
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'), # optional
        'data': (None, text, 'text/plain'),
    }
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text

clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
assert clean == "hello world", f"unexpected output: {clean!r}"

TODO

  • Normalize or remove inner word separators
  • Replace indicatif with linya
  • Export and load CLI options as JSON files

Wishlist

Speed

  • Use the efficient plain text preprocessors of tokenizers
  • Use a better text data structure such as ropey or tendril
  • Determine the feasibility of processing text as a stream instead of loading the entire file buffer into memory
    • See lol-html and html5ever issue #149

Functionality

  • Implement quality control (minimum and maximum sentence length)
  • Implement a PDF text extractor with pdf-extract
  • Implement a docx/pptx/odt text extractor with dotext or docx
  • Implement a stemmer with rust-stemmers
  • Implement sentence filtering based on a desired language with fasttext-rs and a language identification model
  • Automatically concatenate common MWEs (multi-word expressions) with MITIE (Rust bindings missing) or phrase

Interoperability

  • Python bindings
