6 releases

0.2.0 May 4, 2022
0.1.31 Jan 7, 2022
0.1.4 Mar 18, 2022

#7 in #driven

MIT license

34KB
580 lines

waxy

A crawler in the works for Rust. NOTE: See GitHub for the most recent docs, since changes to the README require another cargo publish.

This is a work in progress.

The "presser" is the crawler. It is being built out to generate, or "press", different general document formats such as "HtmlRecord", "XMLRecord", and so on. The HtmlPresser presses HtmlRecords.

The specific records will implement methods to parse themselves.
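
As a rough illustration of a record that parses itself, here is a minimal sketch using only the standard library. `SketchRecord` and `first_tag_text` are hypothetical names made up for this example; they are not waxy's types or methods, and the naive string search stands in for a real HTML parser.

```rust
/// Hypothetical stand-in for a record; NOT waxy's actual HtmlRecord.
#[derive(Debug)]
pub struct SketchRecord {
    /// URL the document came from.
    pub origin: String,
    /// Raw HTML, kept private so it cannot be mutated from outside.
    html: String,
}

impl SketchRecord {
    pub fn new(origin: &str, html: &str) -> Self {
        SketchRecord {
            origin: origin.to_string(),
            html: html.to_string(),
        }
    }

    /// Returns the trimmed text between the first `<tag>` and the next
    /// `</tag>`, or `None` if either is missing. Attributes on the opening
    /// tag are not handled (assumption made to keep the sketch short).
    pub fn first_tag_text(&self, tag: &str) -> Option<String> {
        let open = format!("<{}>", tag);
        let close = format!("</{}>", tag);
        let start = self.html.find(&open)? + open.len();
        let end = self.html[start..].find(&close)? + start;
        Some(self.html[start..end].trim().to_string())
    }
}
```

A caller would build the record once and then ask it questions, handling `None` instead of assuming the tag exists.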

Crawling, and blind crawling in particular, is a slow process here. The last thing anyone wants from a crawler is to be unable to crawl.

You can currently:

  • crawl any site whose content is not JavaScript-only
  • parse documents for things like headers, metadata, and links (domain / non-domain)

If you need a more "premium" crawler, check back later, maybe. Incorporating "puppeteer-like" functionality will require working with a web driver, which adds many layers of complexity for general use.

This will be a tool used by the oddjob server.

Scraping functionality is coming. However, other parser options may not be available.

main dependencies of waxy:

  1. reqwest https://docs.rs/reqwest/latest/reqwest/
  2. scraper https://crates.io/crates/scraper
  3. tokio-test https://crates.io/crates/tokio-test

There are other, more minor dependencies; please refer to the Cargo.toml for info.

Notes:

03/18/22

  • Changed record methods to return Options. Pressers are used the same way as before. I also made changes to the curate method.
  • Changed HTML to be parsed once when Record is created
  • Changed domain checking to host checking.
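
To make the host-checking idea concrete, here is a minimal sketch that splits anchors into same-host and other-host links. `host_of` and `split_by_host` are hypothetical helpers written for this example with plain string handling; waxy's own implementation is not shown here and may differ. In this sketch a subdomain counts as a different host, which is an assumption.

```rust
/// Hypothetical helper: extracts the host from an absolute URL.
/// Plain string handling only; IPv6 literals are not handled.
pub fn host_of(url: &str) -> Option<&str> {
    // Drop the scheme, e.g. "https://". Relative URLs yield None.
    let rest = url.split_once("://")?.1;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Splits anchors into (same host as `origin`, everything else).
/// Anchors whose host cannot be determined land in the second list.
pub fn split_by_host<'a>(origin: &str, anchors: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let origin_host = host_of(origin);
    anchors
        .iter()
        .copied()
        .partition(|a| origin_host.is_some() && host_of(a) == origin_host)
}
```

Comparing whole hosts is stricter than comparing registered domains: `blog.example.com` and `example.com` are treated as different sites in this sketch.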

Moving forward, I am going to start making more methods for searching record strings and returning search weights. Every record is self-contained, and the HTML is private because it should not be mutated outside the record itself.

This pretty much uses strings for everything.

how to use:

[dependencies]
waxy = "0.2.0"
tokio = { version = "1", features = ["full"] }

use waxy::pressers::HtmlPresser;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    

    //Wax worker

    /*
    
    create a single record from url

    */
    match HtmlPresser::press_record("https://example.com").await{
        Ok(res)=>{
            println!("{:?}", res);
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    
    crawl a vector of urls for a vector of documents

    */

    match HtmlPresser::press_records(vec!["https://example.com"]).await{
        Ok(res)=>{
            println!("{:?}", res.len());
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    
    crawl a domain, the "1" is the limit of pages you are willing to crawl

    */

    match HtmlPresser::press_records_blind("https://funnyjunk.com",1).await{
        Ok(res)=>{
            println!("{:?}", res.len());
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    /*
    blind crawl a domain for links, 
    inputs:
    url to site
    link limit, limit of the number of links willing to be grabbed
    page limit, limit of the number of pages to crawl for links
    */

    match HtmlPresser::press_urls("https://example.com",1,1).await{
        Ok(res)=>{
            println!("{:?}", res.len());
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    blind crawl a domain for links that match a pattern, 
    inputs:
    url to site
    pattern the url should match
    link limit, limit of the number of links willing to be grabbed
    page limit, limit of the number of pages to crawl for links
    */
    match HtmlPresser::press_curated_urls("https://example.com", ".", 1,1).await{
        Ok(res)=>{
            println!("{:?}", res);
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();

    /*
    blind crawl a domain for documents whose urls match a pattern, 
    inputs:
    url to site
    pattern the url should match
    page limit, limit of the number of pages to crawl for links
    */
    match HtmlPresser::press_curated_records("https://example.com", ".", 1).await{
        Ok(res)=>{
            println!("{:?}", res);
        },
        Err(..)=>{
            println!("went bad")
        }

    }

    println!();
    println!("----------------------");
    println!();
    
    //get doc
    let record = HtmlPresser::press_record("https://funnyjunk.com").await.unwrap();

    //get anchors
    println!("{:?}",record.anchors().unwrap());
    println!();
    println!("{:?}",record.anchors_curate(".").unwrap());
    println!();
    println!("{:?}",record.domain_anchors().unwrap());
    println!();
    //call headers
    println!("{:?}",record.headers);
    println!();
    //call meta data
    println!("{:?}",record.html_meta().unwrap());
    println!();
    //tag text and html
    println!("{:?}",record.tag_html("title").unwrap());
    println!();
    println!("{:?}",record.tag_text("title").unwrap());
    println!();
    println!();
    //get all emails contained in the string.
    if let Some(emails) = record.get_emails() {
        println!("{:?}",emails);
    }else{
        println!("no emails")
    }

    Ok(())


}
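
The page limit and link limit used in the blind-crawl calls above can be pictured with a small sketch that runs on an in-memory link graph instead of the network. `LinkGraph`, `blind_crawl`, and `curate` are hypothetical names for this example only. The breadth-first order and the substring match are assumptions; the README does not say how waxy orders pages or whether its pattern is a substring or a regular expression.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical in-memory "site": maps a page URL to the links found on it.
pub type LinkGraph<'a> = HashMap<&'a str, Vec<&'a str>>;

/// Breadth-first crawl from `start`, visiting at most `page_limit` pages.
/// Returns the pages in the order they were visited.
pub fn blind_crawl<'a>(graph: &LinkGraph<'a>, start: &'a str, page_limit: usize) -> Vec<&'a str> {
    let mut visited = Vec::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    seen.insert(start);
    queue.push_back(start);

    while let Some(page) = queue.pop_front() {
        // Stop as soon as the page budget is used up.
        if visited.len() >= page_limit {
            break;
        }
        visited.push(page);
        // Queue every link that has not been seen before.
        if let Some(links) = graph.get(page) {
            for &link in links {
                if seen.insert(link) {
                    queue.push_back(link);
                }
            }
        }
    }
    visited
}

/// Keeps links containing `pattern` (substring match, an assumption),
/// returning at most `link_limit` of them.
pub fn curate<'a>(links: &[&'a str], pattern: &str, link_limit: usize) -> Vec<&'a str> {
    links
        .iter()
        .copied()
        .filter(|l| l.contains(pattern))
        .take(link_limit)
        .collect()
}
```

With a page limit of 1, only the start page is visited, which matches the spirit of the `1` passed in the examples above.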


Dependencies

~12–25MB
~362K SLoC