#crawler #amqp #queue #configured #thread #flatcrawl #flats

app flatcrawl-crawler

This is a rust implementation of a set of webpage crawlers. New crawlers can be easily configured and the output can be written to an AMQP queue.

1 stable release

1.0.0 Jan 24, 2020

#10 in #configured

Custom license

44KB
1.5K SLoC

flatcrawl-crawlers

This repository is part of my flatcrawl project. It contains a Rust implementation of crawlers/scrapers for different real estate websites. It will scan those websites in scheduled cycles and extract information on new flats. Those new flats are then parsed into a consistent layout and sent away for further processing.

I chose Rust for this project, because I wanted to learn the language and also it seemed to be a good fit because of its capabilities like speed and thread safetiness.

The flatcrawl project

The purpose of the project is to collect flats from different rental sites and expose them in a consistent shape. Eventually it lets users define custom searches and provides them with instant updates on new matching flats.

Clarification: flats are not stored on the server. The purpose is not to create a competing portal, but to extend usability and help users find the right flat quickly by receiving updates from several sites without the hassle to setup and maintain different searches.

Infrastructure

Flats that are found by this tool and its set of crawlers will be transmitted via AMQP to a message broker (in my case RabbitMQ) to be picked up by different processors. Those processors can be found in their own repository and can be anything from email notifications to instant messaging bots. Currently there's only an implementation of a telegram bot, but you could imagine all kinds of different services that will listen to the queue and push new flats to interested users.

Setup & Requirements

The application can be setup easily, all you will have to do is to copy the config.sample.toml to a file called config.toml. Now you can edit the settings within the file. The thread_count will specify how many threads will be used for the different crawlers and indirectly how many TCP connections will be created in parallel. The amqp section defines the endpoint where the message broker can be found. I simply ran an existing docker image on my domain with some PLAIN authetication.

To actually run the application on your machine, you will need to compile it first. Installing Rust is quite easy, find the instructions on their website.

Run

Once Rust is installed and the program is configured via the config.toml, you can start it up via

cargo run

On the first run it will download and compile all the dependencies as well. This might take up to a few minutes even.

Dependencies

~28MB
~582K SLoC