
corpus-preproc

A preprocessor for text and HTML corpora

1 unstable release

0.1.0 Feb 6, 2022


MIT license


Corpus Preprocessor


CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.

Features

  • Parallel processing of files in a directory (CLI only)
  • NFKC and whitespace normalization
  • Removal of modifiers and marks
  • Lower-case folding
  • Trimming of punctuation around words
  • Replace words with an <unk> placeholder if they meet any of the following criteria:
    • The word contains an at sign (@)
    • The word lacks alphabetic characters
    • The word contains two punctuation characters in a row, as in http://
  • HTML code is parsed and CSS selectors can be used to:
    • Remove undesired elements
    • Insert newlines after paragraphs and line breaks
    • Extract the main content of an HTML document
  • Text is automatically converted to UTF-8 if the original encoding is in the Encoding Standard.
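The character- and word-level rules above can be sketched in Python. This is an illustrative approximation only, not the crate's actual Rust implementation; the exact set of mark categories and punctuation characters it uses may differ:

```python
import unicodedata

def normalize_chars(text, keep_marks=False, lowercase=True):
    # NFKC normalization folds compatibility characters (e.g. fullwidth forms)
    text = unicodedata.normalize("NFKC", text)
    if not keep_marks:
        # Decompose, then drop combining marks (Mn/Mc/Me) and
        # modifier symbols/letters (Sk/Lm) -- assumed categories
        text = "".join(
            c for c in unicodedata.normalize("NFKD", text)
            if unicodedata.category(c) not in ("Mn", "Mc", "Me", "Sk", "Lm")
        )
    if lowercase:
        text = text.lower()
    # Whitespace normalization: collapse runs of whitespace to single spaces
    return " ".join(text.split())

PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def replace_unk(word):
    # Approximate <unk> criteria from the feature list above
    has_at = "@" in word
    no_alpha = not any(c.isalpha() for c in word)
    double_punct = any(a in PUNCT and b in PUNCT for a, b in zip(word, word[1:]))
    return "<unk>" if (has_at or no_alpha or double_punct) else word
```

For example, `replace_unk("http://example.com")` yields `<unk>` because of the consecutive punctuation in `://`, while `normalize_chars("Ça   VA")` yields `ca va`.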

Usage

Command Line Interface (CLI)

# Install
$ cargo install corpus-preproc
# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>

ARGS:
    <INPUT>     
    <OUTPUT>    

OPTIONS:
    -c
            Clean HTML tags

        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content

        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]

    -h, --help
            Print help information

    -l
            Perform case-folding

    -m
            Keep modifiers and marks on normalization

    -n
            Perform NFKC and whitespace normalization

        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]

    -p
            Trim punctuation surrounding words

    -t <THREADS>
            Number of threads to use [default: 4]
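Putting the options together, a typical invocation (assuming a hypothetical ./html_corpus/ input directory; all flags are as documented in the help text above) might look like:

```shell
# Clean an HTML corpus directory into a single normalized text file:
# -c enables HTML cleaning, -n NFKC/whitespace normalization,
# -l case folding, -p punctuation trimming, -t 8 uses 8 threads
corpus-preproc clean -c -n -l -p -t 8 ./html_corpus/ ./corpus.txt
```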

HTTP API

Startup

$ corpus-preproc serve 127.0.0.1:8000

Python Example

The requests Python library needs to be installed.

import requests
import json

DEFAULT_CONFIG = {
  "htmlClean": {
    "enabled": True,
    "contentSelector": None,
    "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
    "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
  },
  "charNormalization": {
    "enabled": True,
    "keepModifiersAndMarks": False,
    "lowercase": True,
  },
  "wordNormalization": {
    "enabled": True,
    "replacePii": True,
  }
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'), # optional
        'data': (None, text, 'text/plain'),
    }
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text

clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
assert clean == "hello world", f"unexpected output: {clean!r}"

TODO

  • Normalize or remove inner word separators
  • Replace indicatif with linya
  • Export and load CLI options as JSON files

Wishlist

Speed

  • Use the efficient plain text preprocessors of tokenizers
  • Use a better text data structure such as ropey or tendril
  • Determine the feasibility of processing text as a stream instead of loading the entire file buffer into memory
    • See lol-html and html5ever issue #149

Functionality

  • Implement quality control (minimum and maximum sentence length)
  • Implement a PDF text extractor with pdf-extract
  • Implement a docx/pptx/odt text extractor with dotext or docx
  • Implement a stemmer with rust-stemmers
  • Implement sentence filtering based on a desired language with fasttext-rs and a language identification model
  • Automatically concatenate common MWEs (multi-word expressions) with MITIE (Rust bindings missing) or phrase

Interoperability

  • Python bindings
