#extract #text #html #content #document #boilerplate #port

boilerpipe

Library for text extraction from HTML documents

6 releases (breaking)

0.6.0 Aug 10, 2021
0.5.0 Apr 12, 2021
0.4.0 Mar 9, 2021
0.3.0 Jan 15, 2021
0.1.0 Nov 12, 2020

#1855 in Text processing

140 downloads per month

MIT license

480KB
1K SLoC

Boilerpipe

This is a Rust port of the Golang port of the excellent Java library boilerpipe, which strips boilerplate from HTML documents and extracts their text content.

This library implements only the Article Extractor and returns text content only (no images, links, etc.).
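
To give a rough feel for what "cleaning up the boilerplate" means, here is a minimal, self-contained sketch of the general idea: classify text blocks by shallow features such as word count and link density, and keep only the blocks that look like article content. It does not use this crate's API. The `Block` type, the function names, and the thresholds are all hypothetical and chosen only for illustration.

```rust
/// Hypothetical text block: its text plus the number of its words that sit inside links.
struct Block {
    text: String,
    link_words: usize,
}

/// Returns true when a block looks like article content.
/// The thresholds are illustrative assumptions, not the values boilerpipe uses.
fn is_content(block: &Block) -> bool {
    let words = block.text.split_whitespace().count();
    if words == 0 {
        return false;
    }
    let link_density = block.link_words as f64 / words as f64;
    // Short or link-heavy blocks (menus, footers) are treated as boilerplate.
    words >= 10 && link_density < 0.33
}

/// Joins the text of every block classified as content, one block per line.
fn extract_text(blocks: &[Block]) -> String {
    blocks
        .iter()
        .filter(|b| is_content(b))
        .map(|b| b.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let blocks = vec![
        // A navigation bar: three words, all of them links.
        Block { text: "Home About Contact".to_string(), link_words: 3 },
        // A paragraph of body text with no links.
        Block {
            text: "The quick brown fox jumps over the lazy dog and keeps running through the field."
                .to_string(),
            link_words: 0,
        },
    ];
    println!("{}", extract_text(&blocks));
}
```

In this sketch the navigation block is dropped and only the paragraph survives. A real extractor also has to parse the HTML into blocks first and typically considers neighbouring blocks as well, which is omitted here.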

Dependencies

~7–14MB
~162K SLoC