
canadian_news_scraper

A library that provides an API for scraping three Canadian news sites and returning the data

7 releases

0.1.6 Jan 6, 2021
0.1.5 Jan 6, 2021

MIT license

26KB
576 lines

Canadian News Scraper

by Jace Mattson

This is a Rust library that can be included in other programs to pull data from, at the moment, three Canadian news sites.

Explanation

The library API is fairly simple. It provides the scrape() function, located in the scraper module, which takes a NewsEnum and scrapes the site associated with that variant.

pub async fn scrape(news: NewsEnum) -> Vec<News>

scrape() uses a series of functions in the scraper module to determine which news site's markup it needs to parse in order to scrape the website you requested.
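The dispatch described above can be pictured as a match on the enum. This is a minimal sketch and not the crate's actual code: SiteKind, SiteMarkup, and markup_for are hypothetical names, the Cbc variant is assumed (only GlobalNews appears in this README), and the selector strings are placeholders rather than real site markup.

```rust
// Hypothetical stand-in for the crate's NewsEnum. Only GlobalNews appears in
// the README; the Cbc variant is assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SiteKind {
    GlobalNews,
    Cbc,
}

// Per-site markup description. The selector strings are placeholders, not
// the selectors used by the real sites or by the crate.
#[derive(Debug, PartialEq, Eq)]
pub struct SiteMarkup {
    pub site_name: &'static str,
    pub article_selector: &'static str,
    pub title_selector: &'static str,
}

// Map each enum variant to the markup rules a scraper for it would need.
pub fn markup_for(site: SiteKind) -> SiteMarkup {
    match site {
        SiteKind::GlobalNews => SiteMarkup {
            site_name: "Global News",
            article_selector: "li.story",     // placeholder
            title_selector: "h3.story-title", // placeholder
        },
        SiteKind::Cbc => SiteMarkup {
            site_name: "CBC",
            article_selector: "a.card",    // placeholder
            title_selector: "h3.headline", // placeholder
        },
    }
}
```

Because the match is exhaustive, adding a new variant to the enum produces a compile error until a markup entry is written for it.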

Data Structure

scrape() returns a Vec<News>. News is a struct defined in the news module alongside NewsEnum and NewsSite.

pub struct News {
    pub news_enum: NewsEnum,
    pub news_site: String,
    pub article_link: String,
    pub img_link: String,
    pub title: String,
    pub desc: String,
    pub author: String,
    pub metadata: String,
    pub article_text: String,
    pub article_date: String,
    pub scrape_date: DateTime<Utc>
}
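Since the result is a plain Vec, callers can post-process it with ordinary iterator code. As one possible illustration, the sketch below drops repeated articles by link, which could matter when the same site is scraped repeatedly. Article and dedupe_by_link are hypothetical names; Article is a trimmed stand-in for News that keeps only two of its fields so the example has no external dependencies.

```rust
use std::collections::HashSet;

// Trimmed, hypothetical stand-in for the crate's News struct, keeping only
// the fields this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Article {
    pub article_link: String,
    pub title: String,
}

// Keep the first article seen for each link, preserving input order.
pub fn dedupe_by_link(items: Vec<Article>) -> Vec<Article> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        // HashSet::insert returns false when the link was already present.
        .filter(|a| seen.insert(a.article_link.clone()))
        .collect()
}
```

The same pattern would apply to the full News struct by keying on its article_link field.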

Use Case

The main use case for this library is a REST API I'm creating, which will scrape these sites daily and store the results in a database that can then be accessed through the REST application.

Example


// scrape() is async, so it must be awaited inside an async context.
let news_site = news::NewsEnum::GlobalNews;
let val: Vec<News> = scraper::scrape(news_site).await;


You can see similar examples in the tests module located in lib.rs.

Dependencies

~10–20MB
~293K SLoC