#extract #text #html #content #document #boilerplate #port

boilerpipe

Library for text extraction from HTML documents

6 releases (breaking)

0.6.0 Aug 10, 2021
0.5.0 Apr 12, 2021
0.4.0 Mar 9, 2021
0.3.0 Jan 15, 2021
0.1.0 Nov 12, 2020

#1795 in Text processing

MIT license

480KB
1K SLoC

Boilerpipe

This is a Rust port of the Golang port of the excellent Java library boilerpipe, which strips boilerplate from HTML documents and extracts their text content.

This library implements only the Article Extractor, and it extracts text content only (no images, links, etc.).
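To give a rough idea of what "cleaning up boilerplate" means, here is a minimal sketch of a block-level heuristic that keeps text blocks with enough words and few links. It is not this crate's API and not the actual Article Extractor rules: the `TextBlock` type, the `extract_text` function, and both thresholds are hypothetical and chosen only for illustration.

```rust
/// A block of text plus the number of its words that are link text.
/// Hypothetical type for illustration; not part of the boilerpipe crate.
#[derive(Debug, Clone)]
struct TextBlock {
    text: String,
    linked_words: usize,
}

impl TextBlock {
    fn new(text: &str, linked_words: usize) -> Self {
        TextBlock {
            text: text.to_string(),
            linked_words,
        }
    }

    /// Number of whitespace-separated words in the block.
    fn word_count(&self) -> usize {
        self.text.split_whitespace().count()
    }

    /// Fraction of words that are link text; 0.0 for an empty block.
    fn link_density(&self) -> f64 {
        let words = self.word_count();
        if words == 0 {
            0.0
        } else {
            self.linked_words as f64 / words as f64
        }
    }

    /// Assumed thresholds: at least 10 words and at most a third of them linked.
    fn is_content(&self) -> bool {
        self.word_count() >= 10 && self.link_density() <= 0.33
    }
}

/// Keep the blocks classified as content and join them with blank lines.
fn extract_text(blocks: &[TextBlock]) -> String {
    blocks
        .iter()
        .filter(|b| b.is_content())
        .map(|b| b.text.as_str())
        .collect::<Vec<_>>()
        .join("\n\n")
}

fn main() {
    let blocks = vec![
        // Short, fully linked block: typical navigation boilerplate.
        TextBlock::new("Home About Contact", 3),
        // Longer block without links: treated as article text.
        TextBlock::new(
            "This paragraph has more than ten words and contains no links at all.",
            0,
        ),
    ];
    println!("{}", extract_text(&blocks));
}
```

In this sketch the navigation block is dropped and only the longer paragraph is printed. A real extractor first has to split the HTML into blocks and typically uses more features than these two.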

Dependencies

~7–15MB
~171K SLoC