#web-scraping #full-text #readability #article #scrape

article_scraper

Scrape article contents from the web. Powered by fivefilters full-text feed configurations and Mozilla Readability.

11 stable releases

2.1.0 Mar 24, 2024
2.0.0 Jun 23, 2023
2.0.0-alpha.0 Apr 23, 2023
1.1.7 Jan 21, 2021
1.0.0 Apr 28, 2020

230 downloads per month
Used in 2 crates

GPL-3.0-or-later

415KB
5K SLoC

article scraper

The article_scraper crate provides a simple way to extract meaningful content from the web. It offers two ways of locating the desired content:

1. Rust implementation of Full-Text RSS

This approach uses website-specific extraction rules, which makes results fast and accurate. The disadvantages are that a rule's config must be updated whenever the website changes, and that a new extraction rule is needed for every website.

A central repository of extraction rules, along with information about writing your own rules, can be found in the ftr-site-config repository. Please consider contributing new rules or updates to it.

article_scraper embeds all the rules in the ftr-site-config repository for convenience. Custom and updated rules can be loaded from a user_configs path.
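
To give a feel for what such a rule looks like, here is a minimal sketch of a site config in the Full-Text RSS plain-text format, where each line is a `directive: value` pair and most values are XPath expressions. The file name, the site example.com, and every XPath expression below are invented for illustration; a real rule would be named after the site's hostname and placed in the user_configs directory.

```text
# example.com.txt (hypothetical file; the XPath expressions are invented)
title: //h1[@class='headline']
author: //span[@class='byline']
body: //div[@id='article-body']
strip: //div[contains(@class, 'related')]
strip_id_or_class: newsletter-signup
test_url: https://example.com/news/some-article
```

The `body` expression selects the article content, the `strip` directives remove unwanted elements from it, and `test_url` records a page the rule is expected to work on.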

2. Mozilla Readability

If the ftr-config-based extraction fails, the Mozilla Readability algorithm is used as a fallback. This re-implementation tries to mimic the original as closely as possible.
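
The two-stage idea (site-specific rule first, generic heuristic second) can be sketched with the standard library alone. This is a toy illustration of the fallback pattern, not the crate's internals: `SiteRule`, `extract_with_rule`, `extract_generic`, and `extract` are hypothetical names, the rule is a pair of literal markers rather than XPath, and the generic stage simply picks the longest paragraph instead of running Readability.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a site-specific rule: the article body is the
/// text between two literal markers. (Real ftr rules use XPath expressions.)
struct SiteRule {
    start: &'static str,
    end: &'static str,
}

/// Stage 1: apply the rule registered for `host`, if any.
/// Returns None when there is no rule or the markers are not found.
fn extract_with_rule(rules: &HashMap<&str, SiteRule>, host: &str, html: &str) -> Option<String> {
    let rule = rules.get(host)?;
    let from = html.find(rule.start)? + rule.start.len();
    let len = html[from..].find(rule.end)?;
    Some(html[from..from + len].trim().to_string())
}

/// Stage 2: generic heuristic used as a fallback. This toy version returns the
/// longest <p>...</p> body; the real Readability algorithm is far more involved.
fn extract_generic(html: &str) -> Option<String> {
    html.split("<p>")
        .skip(1) // everything before the first <p> is not paragraph text
        .filter_map(|chunk| chunk.split("</p>").next())
        .map(str::trim)
        .max_by_key(|text| text.len())
        .map(str::to_string)
}

/// Full pipeline: try the site-specific rule first, then fall back.
fn extract(rules: &HashMap<&str, SiteRule>, host: &str, html: &str) -> Option<String> {
    extract_with_rule(rules, host, html).or_else(|| extract_generic(html))
}
```

With a rule registered for one host, `extract` uses the rule when its markers match and otherwise hands the same page to the generic stage. A stale rule therefore degrades to a best-effort result rather than failing outright.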

Example

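use article_scraper::ArticleScraper;
use reqwest::Client;
use url::Url;

// Run this inside an async context (e.g. a Tokio runtime), since the calls below are awaited.
// None: no user_configs path, so only the embedded rules are used.
let scraper = ArticleScraper::new(None).await;
// Url::parse returns a Result, so it has to be unwrapped before use.
let url = Url::parse("https://www.nytimes.com/interactive/2023/04/21/science/parrots-video-chat-facetime.html").unwrap();
let client = Client::new();
// false: do not download images.
let article = scraper.parse(&url, false, &client, None).await.unwrap();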

CLI

Various features of this crate can be used via article_scraper_cli.

Usage: article_scraper_cli [OPTIONS] <COMMAND>

Commands:
  all          Use the complete pipeline
  readability  Only use the Readability parser
  ftr          Only use (a subset of) the Ftr parser
  help         Print this message or the help of the given subcommand(s)

Options:
  -d, --debug          Turn debug logging on
  -o, --output <FILE>  Destination of resulting HTML file
  -h, --help           Print help
  -V, --version        Print version

Dependencies

~17–31MB
~555K SLoC