14 unstable releases (3 breaking)
Version | Released |
---|---|
0.4.8 | Dec 6, 2024 |
0.4.7 | Dec 5, 2024 |
0.4.4 | Nov 30, 2024 |
0.3.1 | Nov 15, 2024 |
0.1.0 | Nov 8, 2024 |
Newslookout
A lightweight web scraping platform built for scanning and processing news and data. It is a Rust port of the Python application of the same name.
[Illustration: the multi-threaded data pipeline, in which Retriever 1, Retriever 2 and Retriever 3 feed Data Processing Module 1 and Data Processing Module 2.]
Architecture
This library sets up a web scraping pipeline and executes it as follows:
- Starts each web retriever module in its own separate thread; these threads run in parallel to get the content from their respective websites.
- Each page's content is populated into a document struct and transmitted by the web retriever module threads to the data processing chain.
- Simultaneously, the data processing modules (which form the data processing chain) are started. The retrieved documents are passed to these threads in serial order, based on the priority configured for each data processing module.
- Each data processing module processes the content and may add to or modify the document it receives. It then passes the document on to the next data processing thread in order of priority.
- The data processing pipeline supports popular LLM services such as ChatGPT and Google Gemini, as well as self-hosted LLMs via Ollama. The relevant API keys must be configured as environment variables before using these plugins.
- At the end, the document is written to disk as a JSON file.
- The retrieved URLs are saved to an SQLite database table to serve as a reference so these are not retrieved again in the next run.
- Adequate wait times are configured during web retrieval to avoid overloading the target website. All events and actions are logged to a central log file. A PID file is written and checked to prevent multiple instances from running at once; if desired, multiple instances can still be launched by running the application with separate config files.
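The flow described in the list above can be sketched using only the standard library. This is a minimal illustration, not the crate's actual implementation: `Doc`, `Stage` and `run_sketch_pipeline` are hypothetical names, page fetching is simulated, and the stages are assumed to arrive already sorted by priority.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the crate's document struct.
#[derive(Debug, Clone, PartialEq)]
pub struct Doc {
    pub url: String,
    pub text: String,
}

/// One data processing step: receives a document and returns it,
/// possibly modified.
pub type Stage = Box<dyn Fn(Doc) -> Doc + Send>;

/// Runs one thread per retriever and one thread per processing stage.
/// `stages` is assumed to be in priority order already.
pub fn run_sketch_pipeline(retriever_urls: Vec<Vec<String>>, stages: Vec<Stage>) -> Vec<Doc> {
    let (tx, mut rx) = mpsc::channel::<Doc>();
    let mut handles = Vec::new();

    // Retrievers run in parallel and all feed the same channel.
    for urls in retriever_urls {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for url in urls {
                // A real retriever would download and parse the page here.
                let doc = Doc { text: format!("content of {}", url), url };
                if tx.send(doc).is_err() {
                    break; // the processing chain has shut down
                }
            }
        }));
    }
    // Drop the original sender so the channel closes once all retrievers finish.
    drop(tx);

    // Each stage reads from the previous stage's channel and writes to its own.
    for stage in stages {
        let (next_tx, next_rx) = mpsc::channel::<Doc>();
        let input = rx;
        handles.push(thread::spawn(move || {
            for doc in input {
                if next_tx.send(stage(doc)).is_err() {
                    break;
                }
            }
        }));
        rx = next_rx;
    }

    // Collect whatever leaves the last stage, then wait for all threads.
    let mut docs: Vec<Doc> = rx.into_iter().collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    // Arrival order depends on thread scheduling, so sort for a stable result.
    docs.sort_by(|a, b| a.url.cmp(&b.url));
    docs
}
```

Every document passes through the stages in the same order, while different documents can be at different stages at the same moment. That is what lets retrieval and processing overlap.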
This package enables building a full-fledged, multi-threaded web scraping solution that runs in batch mode on very meagre resources (e.g. a single-core CPU with less than 4GB of RAM).
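The single-instance check mentioned in the architecture list could look roughly like the sketch below. It shows one way to build such a guard with the standard library and is not a description of what `newslookout::init_pid_file` does internally; `PidFileGuard` is a hypothetical name.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::process;

/// Holds the PID file for the lifetime of the process and removes it on drop.
pub struct PidFileGuard {
    path: PathBuf,
}

impl PidFileGuard {
    /// Creates the PID file, failing with `ErrorKind::AlreadyExists`
    /// if the file is already present (i.e. another instance is running).
    pub fn acquire(path: &Path) -> io::Result<Self> {
        // `create_new` makes the existence check and the creation one atomic step.
        let mut file = OpenOptions::new().write(true).create_new(true).open(path)?;
        writeln!(file, "{}", process::id())?;
        Ok(PidFileGuard { path: path.to_path_buf() })
    }
}

impl Drop for PidFileGuard {
    fn drop(&mut self) {
        // Best effort: ignore errors if the file has already been removed.
        let _ = fs::remove_file(&self.path);
    }
}
```

In a scheme like this, two instances can coexist only if each is given its own PID file path, which is consistent with launching each instance from a separate config file. A file left behind by a crashed process would have to be removed by hand in this sketch.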
Quick Start
Add this to your Cargo.toml:
    [dependencies]
    newslookout = "0.3.0"
Usage
Get started with just a few lines of code, for example:
    use std::env;
    use config;
    use newslookout;

    fn main() {
        // the config file path must be given as the first command line argument
        if env::args().len() < 2 {
            println!("Usage: newslookout_app <config_file>");
            panic!("Provide config file as a command line parameter, (expect 2 parameters, but got {})",
                   env::args().len()
            );
        }

        let config_file: String = env::args().nth(1).unwrap();
        println!("Loading configuration from file: {}", config_file);
        let app_config: config::Config = newslookout::utils::read_config(config_file);

        // run the built-in pipeline with the plugins enabled in the config file
        let docs_retrieved: Vec<newslookout::document::DocInfo> = newslookout::run_app(app_config);
        // use this collection of retrieved document-information structs for any further custom processing
    }
Create your own custom plugins and run these in the Pipeline
Declare a custom retriever plugin and add it to the pipeline to fetch data using your customised logic.
    fn run_pipeline(config: &config::Config) -> Vec<Document> {
        newslookout::init_logging(config);
        newslookout::init_pid_file(config);
        log::info!("Starting the custom pipeline");

        let mut retriever_plugins = newslookout::pipeline::load_retriever_plugins(config);
        let mut data_proc_plugins = newslookout::pipeline::load_dataproc_plugins(config);

        // add custom data retriever (`my_plugin` is your own plugin, defined elsewhere):
        retriever_plugins.push(my_plugin);

        let docs_retrieved = newslookout::pipeline::start_data_pipeline(
            retriever_plugins,
            data_proc_plugins,
            config
        );
        log::info!("Data pipeline completed processing {} documents.", docs_retrieved.len());
        // use docs_retrieved for any further custom processing.

        // `config` is already a reference, so it is passed as-is
        newslookout::cleanup_pid_file(config);

        // hand the retrieved documents back to the caller
        docs_retrieved
    }
Similarly, you can also declare and use custom data processing plugins, e.g.:
    data_proc_plugins.push(my_own_data_processing);
Note that data processing plugins run serially, in the order of priority defined in the config file.
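To make the ordering rule concrete, here is a small standalone sketch of priority-ordered processing. The `DataProcPlugin` struct and `run_in_priority_order` function are hypothetical and do not mirror the crate's real plugin type; the sketch also assumes that a lower priority number means the plugin runs earlier.

```rust
/// Hypothetical plugin shape, used only for this illustration.
pub struct DataProcPlugin {
    pub name: &'static str,
    /// Assumption: lower numbers run first.
    pub priority: u32,
    pub process: fn(String) -> String,
}

/// Applies every plugin to `text`, one after another, in priority order.
/// Returns the final text together with the names in the order they ran.
pub fn run_in_priority_order(
    mut plugins: Vec<DataProcPlugin>,
    text: String,
) -> (String, Vec<&'static str>) {
    // The order in which plugins were pushed does not matter;
    // only the configured priority does.
    plugins.sort_by_key(|p| p.priority);

    let mut order = Vec::new();
    let mut current = text;
    for plugin in &plugins {
        current = (plugin.process)(current);
        order.push(plugin.name);
    }
    (current, order)
}
```

Because each plugin receives the output of the one before it, changing a priority in the config can change the final document, not just the timing.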
A few pre-built modules are provided for specific websites. These can be readily extended to other websites as required.
Refer to their source code in the plugins folder when rolling out your own plugins.
Configuration
The entire application is driven by its config file. Refer to the example config file in the repository.