2 releases
0.1.1 | Aug 22, 2020
0.1.0 | Aug 22, 2020
#1678 in Parser implementations
Used in the-daily-stallman
315KB, 14K SLoC
extrablatt
Customizable article scraping & curation library and CLI. Also runs in Wasm.
Basic Wasm example with some CORS limitations: https://mattsse.github.io/extrablatt/
Inspired by newspaper.
HTML scraping is done via select.rs.
Features
- News URL identification
- Text extraction
- Top image extraction
- All image extraction
- Keyword extraction
- Author extraction
- Publishing date
- References
Customizable for specific news sites/layouts via the Extractor trait.
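The customization idea can be illustrated with a minimal, self-contained sketch: a trait supplies default extraction behaviour, and a site-specific type overrides only what differs for one layout. Everything here (`Page`, `TitleExtractor`, `DefaultExtractor`, `SomeNewsExtractor`, the " | Some News" suffix) is hypothetical and chosen for illustration; it does not reproduce the actual `Extractor` trait or its method signatures, which are described at https://docs.rs/extrablatt.

```rust
/// Hypothetical stand-in for a parsed page; not a type from extrablatt or select.rs.
struct Page {
    title_tag: Option<String>,
    headline: Option<String>,
}

/// Hypothetical trait showing the pattern only: a default method that
/// site-specific implementations may override.
trait TitleExtractor {
    /// Default behaviour: use the trimmed content of the `<title>` tag.
    fn title(&self, page: &Page) -> Option<String> {
        page.title_tag.as_deref().map(|t| t.trim().to_string())
    }
}

/// Uses the default behaviour unchanged.
struct DefaultExtractor;
impl TitleExtractor for DefaultExtractor {}

/// Assumed layout for one site: prefer the headline element and strip
/// the site name that this site appends to its titles.
struct SomeNewsExtractor;
impl TitleExtractor for SomeNewsExtractor {
    fn title(&self, page: &Page) -> Option<String> {
        page.headline
            .as_deref()
            .or(page.title_tag.as_deref())
            .map(|t| t.trim_end_matches(" | Some News").trim().to_string())
    }
}
```

Calling `SomeNewsExtractor.title(&page)` on a page whose only title is "Budget passes | Some News" yields the title without the site suffix, while `DefaultExtractor` returns it unchanged.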
Documentation
Full documentation: https://docs.rs/extrablatt
Example
Extract all articles from a news outlet.
use extrablatt::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;
    let mut stream = site.into_stream();
    while let Some(article) = stream.next().await {
        if let Ok(article) = article {
            println!("article '{:?}'", article.content.title);
        } else {
            println!("{:?}", article);
        }
    }
    Ok(())
}
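The loop above prints successes and failures as they arrive. If the results are collected first, they can be separated afterwards. The following is a sketch using only the standard library; `Article` and `FetchError` here are hypothetical stand-ins, not the types the crate's stream actually yields.

```rust
/// Hypothetical stand-in for a downloaded article.
#[derive(Debug, PartialEq)]
struct Article {
    title: Option<String>,
}

/// Hypothetical stand-in for a download or parse failure.
#[derive(Debug, PartialEq)]
struct FetchError(String);

/// Split collected results into the articles that were extracted and
/// the errors that occurred, preserving the original order of each.
fn split_results(
    results: Vec<Result<Article, FetchError>>,
) -> (Vec<Article>, Vec<FetchError>) {
    let mut articles = Vec::new();
    let mut errors = Vec::new();
    for result in results {
        match result {
            Ok(article) => articles.push(article),
            Err(error) => errors.push(error),
        }
    }
    (articles, errors)
}
```

Keeping the errors instead of discarding them makes it possible to report or retry the failed URLs later.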
Command Line
Install
cargo install extrablatt --features="cli"
Usage
USAGE:
    extrablatt <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.
Extract a set of specific articles and store the result as JSON:
extrablatt article "https://www.example.com/article1.html", "https://www.example.com/article2.html" -o "articles.json"
License
Licensed under either of these:
- Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
Dependencies
~10–23MB, ~332K SLoC