2 releases

0.1.1	Aug 22, 2020
0.1.0	Aug 22, 2020

#2214 in Parser implementations

Used in the-daily-stallman

MIT/Apache

315KB
14K SLoC

extrablatt

Customizable article scraping & curation library and CLI. Also runs in Wasm.

Basic Wasm example with some CORS limitations: https://mattsse.github.io/extrablatt/

Inspired by newspaper.

HTML scraping is done via select.rs.

Features

News URL identification
Text extraction
Top image extraction
All image extraction
Keyword extraction
Author extraction
Publishing date
References

Customizable for specific news sites/layouts via the Extractor trait.
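
The idea of per-site customization through a trait can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: `Page`, `TitleExtractor`, `DefaultExtractor`, `HeadlineExtractor`, and the `between` helper are hypothetical names and are not extrablatt's actual `Extractor` API (see the documentation for the real method signatures). The sketch only shows the general pattern of a trait with a default strategy that a site-specific type overrides.

```rust
/// Hypothetical stand-in for a downloaded page; not an extrablatt type.
struct Page<'a> {
    html: &'a str,
}

/// Returns the trimmed text between the first `open` and the next `close`.
fn between(haystack: &str, open: &str, close: &str) -> Option<String> {
    let start = haystack.find(open)? + open.len();
    let end = haystack[start..].find(close)? + start;
    let text = haystack[start..end].trim();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Hypothetical trait: a default strategy that implementors may override.
trait TitleExtractor {
    /// Default: use the contents of the first `<title>` element.
    fn title(&self, page: &Page<'_>) -> Option<String> {
        between(page.html, "<title>", "</title>")
    }
}

/// Uses the default strategy unchanged.
struct DefaultExtractor;
impl TitleExtractor for DefaultExtractor {}

/// Site-specific strategy for an imaginary outlet that marks its
/// headline with `<h1 class="headline">`.
struct HeadlineExtractor;
impl TitleExtractor for HeadlineExtractor {
    fn title(&self, page: &Page<'_>) -> Option<String> {
        between(page.html, "<h1 class=\"headline\">", "</h1>")
            // Fall back to the default strategy if the headline is missing.
            .or_else(|| DefaultExtractor.title(page))
    }
}

fn main() {
    let page = Page {
        html: "<html><head><title>Site | Section</title></head>\
               <body><h1 class=\"headline\">Real headline</h1></body></html>",
    };
    println!("{:?}", DefaultExtractor.title(&page));
    println!("{:?}", HeadlineExtractor.title(&page));
}
```

String searching keeps the sketch free of dependencies; the crate itself scrapes HTML with select.rs rather than matching on raw strings.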

Documentation

Full documentation: https://docs.rs/extrablatt

Example

Extract all articles from a news outlet.

use extrablatt::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the builder at the news outlet's front page.
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    // Consume the site as an asynchronous stream of article results.
    let mut stream = site.into_stream();

    while let Some(article) = stream.next().await {
        match article {
            Ok(article) => println!("article '{:?}'", article.content.title),
            // Print the error and continue with the next article.
            Err(err) => println!("{:?}", err),
        }
    }

    Ok(())
}

Command Line

Install

cargo install extrablatt --features="cli"

Usage

USAGE:
    extrablatt <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.

Extract a set of specific articles and store the result as JSON.

extrablatt article "https://www.example.com/article1.html", "https://www.example.com/article2.html" -o "articles.json"

License

Licensed under either of these:

Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)

Dependencies

~11–24MB
~355K SLoC