#golang #text-content #extract

boilerpipe

Library for text extraction from HTML documents

6 releases (breaking)

0.6.0 Aug 10, 2021
0.5.0 Apr 12, 2021
0.4.0 Mar 9, 2021
0.3.0 Jan 15, 2021
0.1.0 Nov 12, 2020

#5 in #text-content

Download history: 2 to 46 downloads per week between 2024-07-20 and 2024-11-02

63 downloads per month

MIT license

480KB
1K SLoC

Boilerpipe

This is a Rust port of the Golang port of the excellent Java library boilerpipe, which removes boilerplate and extracts text content from HTML documents.

This library implements only the Article Extractor and extracts text content only (no images, links, etc.).
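
To give a rough feel for what boilerplate removal means, here is a minimal, dependency-free sketch. It does not use this crate's API and is not the Article Extractor algorithm. It assumes the page has already been split into text blocks, and it keeps blocks that have enough words and few linked words. The `TextBlock` type, the `extract_content` function, and the thresholds are all hypothetical names and values chosen for illustration.

```rust
/// A block of text taken from an HTML document, with the shallow features
/// this sketch relies on. Hypothetical type, not part of the crate.
#[derive(Debug, Clone, PartialEq)]
struct TextBlock {
    text: String,
    /// Number of words in this block that sit inside link elements.
    linked_words: usize,
}

impl TextBlock {
    fn new(text: &str, linked_words: usize) -> Self {
        TextBlock { text: text.to_string(), linked_words }
    }

    /// Number of whitespace-separated words in the block.
    fn word_count(&self) -> usize {
        self.text.split_whitespace().count()
    }

    /// Fraction of words that are link text; 0.0 for an empty block.
    fn link_density(&self) -> f64 {
        let words = self.word_count();
        if words == 0 {
            0.0
        } else {
            self.linked_words as f64 / words as f64
        }
    }
}

/// Keep blocks that look like article content: enough words, few links.
/// The thresholds are illustrative assumptions, passed in by the caller.
fn extract_content(blocks: &[TextBlock], min_words: usize, max_link_density: f64) -> String {
    blocks
        .iter()
        .filter(|b| b.word_count() >= min_words && b.link_density() <= max_link_density)
        .map(|b| b.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let blocks = vec![
        // Navigation bar: short and fully linked, so it is dropped.
        TextBlock::new("Home About Contact", 3),
        // Article paragraph: long and unlinked, so it is kept.
        TextBlock::new(
            "The committee approved the new budget after a long debate on Tuesday evening.",
            0,
        ),
        // Share widget: short and fully linked, so it is dropped.
        TextBlock::new("Share this article", 3),
    ];
    println!("{}", extract_content(&blocks, 5, 0.3));
}
```

A real extractor works on parsed HTML and uses more features than these two; the sketch only shows why navigation bars and share widgets tend to be filtered out while article paragraphs survive.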

Dependencies

~8–14MB
~181K SLoC