2 releases
Version | Date |
---|---|
0.1.1 | Jul 13, 2024 |
0.1.0 | Jul 6, 2024 |
#25 in #crawler
14KB
172 lines
Stream crawler
stream_crawler is a Rust crate that provides an asynchronous web crawling utility. It processes URLs, extracts page content and child URLs, and retries failed requests. It uses the tokio runtime for asynchronous operations and the reqwest library for HTTP requests.
Features
- Asynchronous crawling using tokio
- Extracts URLs from <a> tags in HTML
- Retries failed requests up to a specified number of attempts
- Limits the number of concurrent requests using a semaphore
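The retry behaviour listed above can be pictured with a small standard-library sketch. This is not the crate's implementation: `retry_with_limit` and `flaky_fetch` are hypothetical names introduced here, and the real crate performs its requests asynchronously with reqwest rather than through a synchronous closure.

```rust
/// Calls `op` until it succeeds or `max_attempts` calls have been made.
/// Returns the first success, or the last error seen.
/// (Hypothetical helper, not part of stream_crawler.)
fn retry_with_limit<T, E>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise try again with the next attempt number.
            Err(_) => attempt += 1,
        }
    }
}

/// Hypothetical stand-in for an HTTP request: fails twice, then succeeds.
fn flaky_fetch(attempt: u32) -> Result<String, String> {
    if attempt < 3 {
        Err(format!("attempt {} timed out", attempt))
    } else {
        Ok(String::from("<html>ok</html>"))
    }
}
```

With a limit of 3 the stand-in request succeeds on the last attempt; with a limit of 2 the caller receives the error from the second attempt. A production crawler would usually add a delay between attempts, which this sketch leaves out.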
Installation
Add this to your Cargo.toml:
[dependencies]
stream_crawler = "0.1.0"
tokio = { version = "1", features = ["full"] }
# Provides StreamExt, which the usage example imports
tokio-stream = "0.1"
reqwest = { version = "0.11", features = ["json"] }
scraper = "0.12"
Usage
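use stream_crawler::scrape;
use tokio_stream::StreamExt;

#[tokio::main]
async fn main() {
    // Seed URLs to start crawling from.
    let urls = vec![
        String::from("https://www.google.com"),
        String::from("https://www.twitter.com"),
    ];
    // The numeric arguments configure retries and concurrency;
    // see the crate's inline documentation for their exact meaning.
    let mut result_stream = scrape(urls, 3, 5, 10).await;
    // Print each result as soon as the stream yields it.
    while let Some(data) = result_stream.next().await {
        println!("Processed URL: {:?}", data);
    }
}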
Functionality
scrape function:
- Takes a vector of URLs, a retry attempt limit, and a maximum number of concurrent processes. The usage example passes four arguments; the fourth is not described in this README, so check the inline documentation for it.
- Returns a stream of ProcessedUrl structures.
ProcessedUrl structure:
- Contains the original URL, the parent URL (if any), the HTML content of the page, and a list of child URLs extracted from <a> tags.
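To make the relationship between a page and its child URLs concrete, here is a rough standard-library sketch that fills a ProcessedUrl-shaped value from raw HTML. It is an assumption-laden illustration, not the crate's code: `extract_links` and `process_page` are hypothetical names, the scan only handles double-quoted href values, and the crate lists the scraper library as a dependency, which would handle HTML far more robustly. Whether the crate stores hrefs as written or resolves them against the page URL is not stated in this README; the sketch keeps them as written.

```rust
/// Mirrors the fields documented for the crate's `ProcessedUrl`.
#[derive(Debug, PartialEq)]
pub struct ProcessedUrl {
    pub parent: Option<String>,
    pub url: String,
    pub content: String,
    pub children: Vec<String>,
}

/// Naive link extraction: collects double-quoted `href` values from `<a ...>` tags.
/// (Hypothetical helper; a real HTML parser should be preferred.)
fn extract_links(html: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(tag_start) = rest.find("<a ") {
        // Text after the "<a " prefix, up to the end of the opening tag.
        let after_tag = &rest[tag_start + 3..];
        let tag_end = after_tag.find('>').unwrap_or(after_tag.len());
        let tag = &after_tag[..tag_end];
        if let Some(href_pos) = tag.find("href=\"") {
            let value = &tag[href_pos + 6..];
            if let Some(close) = value.find('"') {
                links.push(value[..close].to_string());
            }
        }
        // Continue scanning after this opening tag.
        rest = &after_tag[tag_end..];
    }
    links
}

/// Builds a `ProcessedUrl` from a fetched page (hypothetical helper).
fn process_page(parent: Option<&str>, url: &str, html: &str) -> ProcessedUrl {
    ProcessedUrl {
        parent: parent.map(str::to_string),
        url: url.to_string(),
        content: html.to_string(),
        children: extract_links(html),
    }
}
```

Anchors without an href attribute contribute nothing to the children list, so a page with no links yields an empty vector rather than an error.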
Documentation
Refer to the inline documentation for detailed usage and examples.
ProcessedUrl
#[derive(Debug, PartialEq)]
pub struct ProcessedUrl {
    pub parent: Option<String>,
    pub url: String,
    pub content: String,
    pub children: Vec<String>,
}
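One way a consumer might use the children field is to decide which URLs to crawl next without revisiting pages. The sketch below is only an illustration under assumptions: `unseen_children` is a hypothetical helper, the struct is redeclared locally so the block compiles on its own, and nothing here describes how the crate itself tracks visited URLs.

```rust
use std::collections::HashSet;

/// Local copy of the struct shown above, so this sketch is self-contained.
#[derive(Debug, PartialEq)]
pub struct ProcessedUrl {
    pub parent: Option<String>,
    pub url: String,
    pub content: String,
    pub children: Vec<String>,
}

/// Returns the child URLs of `page` that are not yet in `seen`,
/// in their original order, and records them in `seen`.
/// (Hypothetical helper, not part of stream_crawler.)
fn unseen_children(page: &ProcessedUrl, seen: &mut HashSet<String>) -> Vec<String> {
    page.children
        .iter()
        // `insert` returns false for URLs already recorded, which filters them out.
        .filter(|child| seen.insert((*child).clone()))
        .cloned()
        .collect()
}
```

Because the set is updated while filtering, a URL that appears twice on the same page is returned only once.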
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
License
This project is licensed under the MIT License.
Dependencies
~9–21MB
~291K SLoC