# Corpus Preprocessor
CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.
## Features
- Parallel processing of files in a directory (CLI only)
- NFKC and whitespace normalization
- Removal of modifiers and marks
- Lower-case folding
- Trimming of punctuation around words
- Replace words with an `<unk>` placeholder if they meet any of the following criteria (see the sketch after this list):
  - Word has an at sign `@`
  - Word lacks alphabetic characters
  - Word has two punctuation chars in a row, such as `http://`
- HTML code is parsed and CSS selectors can be used to:
  - Remove undesired elements
  - Insert newlines after paragraphs and line breaks
  - Extract the main content of an HTML document
- Text is automatically converted to UTF-8 if the original encoding is in the Encoding Standard.
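To make the replacement rules concrete, the sketch below spells out the three `<unk>` criteria in plain Python. It is an illustration of the rules as listed above, not the crate's implementation; the simplified `is_punct` helper and the sample words are assumptions of this sketch.

```python
# A rough sketch of the <unk> criteria above, for illustration only;
# this is not the crate's actual implementation.
def is_punct(c: str) -> bool:
    # Simplified notion of "punctuation": anything that is neither
    # alphanumeric nor whitespace.
    return not (c.isalnum() or c.isspace())

def should_replace_with_unk(word: str) -> bool:
    if "@" in word:                          # word has an at sign
        return True
    if not any(c.isalpha() for c in word):   # word lacks alphabetic characters
        return True
    # word has two punctuation chars in a row, e.g. the "//" in "http://"
    return any(is_punct(a) and is_punct(b) for a, b in zip(word, word[1:]))

words = ["hello", "user@example.com", "42", "http://example.org"]
print(["<unk>" if should_replace_with_unk(w) else w for w in words])
# ['hello', '<unk>', '<unk>', '<unk>']
```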
## Usage

### Command Line Interface (CLI)
```sh
# Install
$ cargo install corpus-preproc

# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>

ARGS:
    <INPUT>
    <OUTPUT>

OPTIONS:
    -c
            Clean HTML tags

        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content

        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]

    -h, --help
            Print help information

    -l
            Perform case-folding

    -m
            Keep modifiers and marks on normalization

    -n
            Perform NFKC and whitespace normalization

        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]

    -p
            Trim punctuation surrounding words

    -t <THREADS>
            Number of threads to use [default: 4]
```
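Putting the options together, a typical invocation might look like the following. The input directory, output path, and `article` selector are placeholders, not defaults:

```sh
# Clean HTML, apply NFKC normalization, case-fold, and trim punctuation
# using 8 threads; paths and the content selector are hypothetical examples.
$ corpus-preproc clean -c -n -l -p -t 8 \
    --content-selector "article" \
    ./html_dump/ ./corpus.txt
```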
### HTTP API

#### Startup

```sh
$ corpus-preproc serve 127.0.0.1:8000
```
#### Python Example

The `requests` Python library needs to be installed.
```python
import requests
import json

DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text

clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
assert clean == "hello world"
```
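Building on the snippet above, one way to tweak the settings is to copy `DEFAULT_CONFIG` and override individual fields. Sending the full config object each time is an assumption of this sketch; the API's handling of partial configs isn't documented above.

```python
import copy

# Hypothetical variation: disable HTML cleaning for input that is already
# plain text. Reuses DEFAULT_CONFIG, requests, and json from the example
# above; assumes the full config object must be sent on every request.
config = copy.deepcopy(DEFAULT_CONFIG)
config["htmlClean"]["enabled"] = False

files = {
    'config': (None, json.dumps(config), 'application/json'),
    'data': (None, "Already PLAIN text!!!", 'text/plain'),
}
print(requests.post('http://127.0.0.1:8000/preproc', files=files).text)
```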
## TODO

- Normalize or remove inner word separators
- Replace `indicatif` with `linya`
- Export and load CLI options as JSON files
## Wishlist

### Speed

- Use the efficient plain text preprocessors of `tokenizers`
- Use a better text data structure such as `ropey` or `tendril`
- Determine feasibility to process text as a stream instead of loading the entire file buffer into memory
  - See `lol-html` and `html5ever` issue #149
### Functionality

- Implement quality control (minimum and maximum sentence length)
- Implement PDF text extractor with `pdf-extract`
- Implement docx/pptx/odt text extractor with `dotext` or `docx`
- Implement stemmer with `rust-stemmers`
- Implement sentence filtering based on desired language with `fasttext-rs` and a language identification model
- Automatically concatenate common MWEs with `MITIE` (Rust bindings missing) or `phrase`
### Interoperability
- Python bindings