4 releases
0.1.3 | Jun 18, 2022 |
---|---|
0.1.2 | Apr 20, 2022 |
0.1.1 | Apr 3, 2022 |
0.1.0 | Mar 27, 2022 |
#344 in HTTP server
839 downloads per month
Used in ngxav
1.5MB
174 lines
isbot
Rust library to detect bots using a user-agent string.
Features
- Focused on speed, simplicity, and ensuring real browsers don't get falsely identified as bots
- Tested on over 12k bot user-agents and 180k browser user-agents - updated bot and browser lists are downloaded as part of the integration test suite
- Easy to plugin as middleware to Actix, Rocket, or other Rust web frameworks
- Includes a default collection of bot user-agent regular expressions, optionally included at compile time
- Allows user-agent patterns to be manually added and removed at runtime
Usage
Add this to your Cargo.toml
:
[dependencies]
isbot = "0.1.2"
The example below uses the default bot patterns to correctly identify the Googlebot-Image
user-agent as a bot and the Opera
user-agent as a browser.
use isbot::Bots;
let bots = Bots::default();
assert_eq!(bots.is_bot("Googlebot-Image/1.0"), true);
assert_eq!(bots.is_bot("Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1"), false);
Middleware: Actix or Rocket
isbot
can be added as middleware to enable global or per-handler rejections of known bots.
Actix Example
There are multiple ways to use isbot
in Actix. One way is to pass a Bots
instance into the app data as state. For example:
use isbot::Bots;
struct AppState {
bots: Bots,
}
let state = AppState {
bots: Bots::default(),
};
let app = App::new()
.app_data(web::Data::new(state))
.route("/", web::get().to(index));
Request handlers can use the Bots
data to filter out bots:
async fn index(req: HttpRequest, data: web::Data<AppState>) -> HttpResponse {
if let Some(user_agent) = get_user_agent(req.headers()) {
if data.bots.is_bot(user_agent) {
return HttpResponse::Forbidden().body("Bots not allowed");
}
}
HttpResponse::Ok().body("Home")
}
Another option is to use add the middleware using the Actix wrap_fn
function and globally reject all requests from bots. These 2 examples plus other options and details can be seen in the test examples:
Customizing
Bot user-agent patterns can be customized by adding or removing patterns, using the append
and remove
methods.
Add bot pattern
To add new bot patterns, use append
to specify an array of regular expression patterns. For example:
let mut bots = isbot::Bots::default();
assert_eq!(bots.is_bot("Mozilla/5.0 (CustomNewTestB0T /1.2)"), false);
bots.append(&[r"CustomNewTestB0T\s/\d\.\d"]);
assert_eq!(bots.is_bot("Mozilla/5.0 (CustomNewTestB0T /1.2)"), true);
Remove bots
To remove bot patterns, use remove
and specify an array of existing patterns to remove. For example, to remove the Chrome Lighthouse user-agent pattern to indicate it is not a bot:
let mut bots = isbot::Bots::default();
bots.remove(&["Chrome-Lighthouse"]);
assert_eq!(bots.is_bot("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36 Chrome-Lighthouse"), false);
Custom Bot list
The default user-agent regular expression patterns are managed in the bot_regex_patterns.txt file.
If you don't want to use the default bot patterns you can supply your own list. Since the default bot patterns are automatically added to the library at compile time you should first disable the default feature. The include-default-bots
feature is enabled by default so the patterns defined in bot_regex_patterns.txt
are included in the library at compile time.
You can exclude the patterns by disabling the default features and then including your own bot regular expressions. To do that set default-features
to false in your Cargo.toml
dependency definition. For example:
[dependencies]
isbot = { version = "0.1.1", default-features = false }
And then use Bots::new()
to supply a newline delimited list of regular expressions. For example:
use isbot::Bots;
let custom_user_agent_patterns = r#"
^Googlebot-Image/
bingpreview/"#;
let bots = Bots::new(custom_user_agent_patterns);
assert_eq!(bots.is_bot("Googlebot-Image/1.0"), true);
Testing
Some of the test fixture data is download from multiple sources to ensure the latest user-agents are validated.
To download the latest test data fixures, run the download_fixture_data.rs
executable:
cargo run --bin download_fixture_data --features="download-fixture-data"
This will update files in the fixtures directory.
Unit and integration tests
To run all unit and integration tests:
cargo test
Actix tests
To validate changes to the Actix examples run the following:
cargo test --example actix_example
Rocket tests
To validate changes to the Rocket examples run the following:
cargo test --example rocket_example
Philosophy
Bot detection is a gray area since there are no clear lines on what defines a bot user-agent and a real browser user-agent. Some libraries focus on broadly classifying bots and trying to identify as many as possible, with the risk that real user browser may be caught and falsely flagged as bots.
This library's focus is on identifying known bots while primarily ensuring no real users or browsers are falsely flagged. All of the bot user-agent patterns are validated against a large number of real browsers and bot patterns to eliminate false positives.
For example, the user-agent string below is identified as both a bot and a real browser by various libraries and data sources:
Mozilla/5.0 (Linux; Android 4.2.1; CUBOT GT99 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19
Credits
There are many excellent bot detection libraries available for other languages and awesome developers maintaining bot and user-agent identification data. This library draws inspiration from many of them, especially:
Library | Language |
---|---|
https://github.com/omrilotan/isbot | JavaScript |
https://github.com/JayBizzle/Crawler-Detect/ | PHP |
https://github.com/matomo-org/device-detector | PHP |
https://github.com/fnando/browser | Ruby |
https://github.com/biola/Voight-Kampff | Ruby |
The following data sources are used directly or as inspiration for the static test data and downloaded user-agent identification:
Data Source | Notes |
---|---|
user-agents.net | User-Agents Database |
myip.ms | List of known web bots & spiders |
monperrus | Collection of user-agents used by robots, crawlers, and spiders |
ua-core | Regex file and data to build language ports of Browserscope's user-agent parser |
Contributing
See the Contributing guide.
License
isbot
is distributed under the terms of the MIT license. See LICENSE for details.
Dependencies
~2–4MB
~69K SLoC