#web-scraping #rss #feed #automatic #web

scrapyard

Automatic web scraper and RSS generator library

6 releases

0.3.1 Nov 3, 2023
0.3.0 Nov 3, 2023
0.2.2 Oct 31, 2023
0.1.0 Oct 30, 2023

#857 in Web programming

AGPL-3.0

50KB
1K SLoC

Quickstart

Get started by creating an event loop.

use std::path::PathBuf;

use actix_web::{App, HttpServer};
use scrapyard::Feeds;

#[tokio::main]
async fn main() {
    // initialise values
    scrapyard::init(None).await;

    // load feeds from a config file,
    // or create a default config file
    let feeds_path = PathBuf::from("feeds.json");
    let feeds = Feeds::load_json(&feeds_path)
        .await
        .unwrap_or_else(|| {
            let default = Feeds::default();
            default.save_json();
            default
        });

    // start the event loop; this will not block
    feeds.start_loop().await;

    // as long as the program is running,
    // the feeds will be updated regularly
    HttpServer::new(|| App::new())
        .bind(("0.0.0.0", 8080)).unwrap()
        .run().await.unwrap();
}

Configuration

By default, config files can be found in ~/.config/scrapyard (Linux), /Users/[Username]/Library/Application Support/scrapyard (macOS) or C:\Users\[Username]\AppData\Roaming\scrapyard (Windows).

To change the config directory location, specify the path:

let config_path = PathBuf::from("/my/special/path");
scrapyard::init(Some(config_path)).await;

Here are all the options in the main configuration file, scrapyard.json:

{
    "store": String, // e.g. /home/user/.local/share/scrapyard/
    "max-retries": Number, // number of retries before giving up on a request
    "request-timeout": Number, // number of seconds before giving up on a request
    "script-timeout": Number, // number of seconds before giving up on the extractor script
}
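
For illustration, a filled-in scrapyard.json might look like this (the values below are arbitrary examples, not the crate's defaults):

{
    "store": "/home/user/.local/share/scrapyard/",
    "max-retries": 3,
    "request-timeout": 30,
    "script-timeout": 60
}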

Adding feeds

To add feeds, edit feeds.json.

{
    "origin": String, // origin of the feed
    "label": String, // text id of the feed
    "max-length": Number, // maximum number of items allowed in the feed
    "fetch-length": Number, // maximum number of items allowed to be fetched each interval
    "interval": Number, // number of seconds between fetching,
    "idle-limit": Number, // number of seconds without requests to that feed before fetching stops
    "sort": Boolean, // to sort by publish date or not
    "extractor": [String], // all command line args to run the extractor, i.e. ["node", "extractor.js"]

    "title": String, // displayed feed title
    "link": String, // displayed feed source url
    "description": String, // displayed feed description
    "fetch": Boolean // should the crate fetch the content, or let the script do it
}
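
For illustration, a single feed entry might look like this (all values below are made up; the generated default file is the authoritative template):

{
    "origin": "https://example.com/blog",
    "label": "example-blog",
    "max-length": 50,
    "fetch-length": 10,
    "interval": 3600,
    "idle-limit": 86400,
    "sort": true,
    "extractor": ["node", "extractor.js"],

    "title": "Example Blog",
    "link": "https://example.com/blog",
    "description": "Posts from Example Blog",
    "fetch": true
}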

You can also include additional fields from PseudoChannel to override the default empty values.

Getting feeds

Among the functions under FeedOption, there are two kinds of fetch functions.

Force fetching always requests a new copy of the feed, ignoring the fetch interval. Lazy fetching only fetches a new copy when the existing one is out of date. The distinction is particularly relevant when the library is used without the auto-fetch loop.
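
As a rough sketch only — force_fetch and lazy_fetch below are placeholder names, not confirmed API; check the functions under FeedOption for the real names:

// hypothetical sketch: force_fetch and lazy_fetch are placeholder
// names standing in for the real fetch functions under FeedOption
async fn serve_feed(feeds: &scrapyard::Feeds) {
    // force fetch: always requests a fresh copy, ignoring the interval
    let _fresh = feeds.force_fetch("example-blog").await;

    // lazy fetch: returns the stored copy unless it is out of date,
    // useful when running without the auto-fetch loop
    let _current = feeds.lazy_fetch("example-blog").await;
}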

Extractor scripts

An extractor script must accept one command line argument and print a single JSON response to stdout; in JavaScript, a plain console.log() will do.

The argument specifies a file path; that file contains the input arguments for the scraper.

Input file contents:

{
    "url": String, // origin of the info fetched
    "webstr": String?, // response body from the url, only present if fetch = true
    "preexists": [PseudoItem], // don't output these again, to avoid duplication
    "lengthLeft": Number // maximum number of items before the fetch-length quota is met

    // plus every field from the feed's entry in feeds.json
}

Expected output:

{
    "items": [PseudoItem], // list of items extracted
    "continuation": String? // optionally continue fetching in the next url
}
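
As a minimal sketch of that contract (not taken from the crate), here is an extractor in Rust that reads the input file named by its first argument and emits an empty item list; a real extractor would parse the input, scrape the page, and fill in the items:

use std::{env, fs};

fn main() {
    // the single command line argument is the path to the JSON input file
    let input_path = env::args().nth(1).expect("expected input file path");
    let _input = fs::read_to_string(&input_path).expect("failed to read input file");

    // a real extractor would parse `_input`, scrape "url" (or use
    // "webstr" when fetch = true), and build the item list from it

    // print the expected output shape to stdout
    println!(r#"{{"items": []}}"#);
}

Any language works the same way, as long as the binary or interpreter invocation is listed in the feed's extractor field.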

License: AGPL-3.0

Dependencies

~12–27MB
~445K SLoC