2 releases (1 stable)

1.0.0 Oct 13, 2024
0.1.0 Feb 17, 2023

#1304 in Network programming

191 downloads per month

MIT/Apache

1.5MB
2.5K SLoC

Concurrent Tor

A comprehensive scraping runtime.

Features

  • Multiple Tor clients
  • Persistent job store across restarts
  • Concurrent requests
  • Supported request types (all in the same runtime):
    • HTTP
    • Headless browser
    • Headed browser
  • Custom job scheduling
  • Event monitoring
  • Request timeouts
  • Client renewals (new IP) on max requests
  • Configurable by config file

See an Example

# Try it out!
git clone https://github.com/Sean-McConnachie/concurrent_tor.git
cd concurrent_tor/examples/basic
cargo run --release --features use_tor_backend
# Or use it as a dependency!
concurrent_tor = "1.0.0"

Architecture

[Figure: architecture diagram]

Things to watch out for

  • Check the example if you are unsure about how to organise your code.
  • Ensure the hashing function for each request type is deterministic (the same request always produces the same hash) if you want to prevent duplicate requests.
  • Ensure you use the correct flags for all of your request types.
    • In process_job(...) you need to use job.request.as_any().downcast_ref().unwrap();
    • This will panic if you receive a different request type than expected, for example because the platform was set incorrectly somewhere else!
  • You must return the job passed by reference in process_job(...).
    • Use QueueJob::Completed(job.into()) or another variant.
    • The program will rightfully panic if you don't return the job.
  • target_circulation determines how many jobs are passed to the de-queuer.
    • This must be greater than the number of workers.
    • Preferably keep a slight excess so that some jobs are always waiting in the queue and workers don't need to wait for a round trip.
  • Ensure your Monitor implementation receives every event.
    • Ensure you check the AtomicBool flag passed to your monitor on each iteration (see example).
    • Or use the provided EmptyMonitor if you don't care about events.
  • Do not send jobs to the HTTP workers, headed browser workers, or headless browser workers unless at least one worker of that type is active. Doing so will cause a panic because there are no receiving channels!
  • The HTTP backend relies on hyper and it is relatively low-level.
  • If you find any bugs, please report them through a GitHub issue :)
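
Two of the points above (the downcast in process_job(...) and the deterministic hash) can be sketched in isolation. This is a minimal illustration using only the standard library, not the crate's actual types: the Request trait, HttpRequest, BrowserRequest, http_url and stable_hash are all hypothetical names, and FNV-1a is just one example of a hash with no random seed.

```rust
use std::any::Any;

// Hypothetical stand-in for a request trait; only `as_any` matters here.
trait Request {
    fn as_any(&self) -> &dyn Any;
}

// Two illustrative request types (assumed names, not the crate's API).
struct HttpRequest {
    url: String,
}

struct BrowserRequest {
    url: String,
}

impl Request for HttpRequest {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Request for BrowserRequest {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Returns the URL only if the request really is an `HttpRequest`.
// `downcast_ref` yields `None` on a type mismatch; calling `.unwrap()` on
// that `None` is what turns a wrongly routed job into a panic.
fn http_url(request: &dyn Request) -> Option<&str> {
    request
        .as_any()
        .downcast_ref::<HttpRequest>()
        .map(|r| r.url.as_str())
}

// 64-bit FNV-1a: no random seed, so equal inputs hash equally across restarts.
fn stable_hash(url: &str) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    url.bytes().fold(OFFSET_BASIS, |hash, byte| {
        (hash ^ u64::from(byte)).wrapping_mul(PRIME)
    })
}
```

Passing a BrowserRequest to http_url returns None instead of panicking, which makes the failure mode visible; a randomly seeded hasher, by contrast, would give different hashes for the same URL after a restart and defeat duplicate detection in a persistent store.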

geckodriver

If you want to use the browser, you'll need to provide the path to your local geckodriver.

Tests

There is currently a single test.

  • It uses the non-tor client (i.e. reqwest)
  • Spawns an actix web server
  • Organises events from:
    • Concurrent Tor backend using a custom monitor
    • The web server
    • The user implementations
  • Sorts all events by time
  • Ensures the order of execution is correct (since this is essentially a state machine)
  • Ensures clients actually get renewed
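
The general idea of merging events from several sources, sorting them by time, and checking the resulting order can be sketched as follows. This is a simplified illustration and not the test's actual code: the Event struct, the EventKind variants, and the three-step lifecycle are assumptions made for the example.

```rust
use std::collections::HashMap;

// Hypothetical event kinds; the real test records events from the backend,
// the web server, and the user implementations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventKind {
    Dispatched,
    ServerReceived,
    Completed,
}

#[derive(Debug, Clone, Copy)]
struct Event {
    at_millis: u64,
    job_id: u32,
    kind: EventKind,
}

// Merge events from all sources and sort them by timestamp.
// `sort_by_key` is stable, so events with equal timestamps keep merge order.
fn merge_by_time(sources: Vec<Vec<Event>>) -> Vec<Event> {
    let mut all: Vec<Event> = sources.into_iter().flatten().collect();
    all.sort_by_key(|e| e.at_millis);
    all
}

// Check that every job moves Dispatched -> ServerReceived -> Completed,
// treating the per-job history as a small state machine.
fn order_is_valid(events: &[Event]) -> bool {
    let mut last: HashMap<u32, EventKind> = HashMap::new();
    for e in events {
        let allowed = match (last.get(&e.job_id), e.kind) {
            (None, EventKind::Dispatched) => true,
            (Some(EventKind::Dispatched), EventKind::ServerReceived) => true,
            (Some(EventKind::ServerReceived), EventKind::Completed) => true,
            _ => false,
        };
        if !allowed {
            return false;
        }
        last.insert(e.job_id, e.kind);
    }
    true
}
```

Tracking only the last state per job keeps the check linear in the number of events, at the cost of not reporting which transition failed.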

It's pretty beefy, so good luck if you read through it!

Dependencies

~82MB
~1.5M SLoC