robotxt
Also check out other spire-rs projects.
An implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC 9309 specification).
Features
- parser to enable robotxt::{Robots}. Enabled by default.
- builder to enable robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
- optimal to optimize overlapping and global rules, potentially improving matching speed at the cost of longer parsing times.
- serde to enable the serde::{Deserialize, Serialize} implementation, allowing the caching of related rules (see the sketch below).
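For instance, the serde feature makes it possible to serialize a parsed ruleset once and restore it later instead of re-parsing the file on every run. A minimal sketch, assuming the serde feature is enabled and serde_json (not a dependency of this crate) is available:

use robotxt::Robots;

fn main() -> serde_json::Result<()> {
    let txt = "User-Agent: foobot\nDisallow: /private/";
    let robots = Robots::from_bytes(txt.as_bytes(), "foobot");

    // Serialize the parsed rules once and cache the JSON (on disk, in Redis, etc.).
    let cached = serde_json::to_string(&robots)?;

    // Later, restore the rules without touching the original robots.txt again.
    let restored: Robots = serde_json::from_str(&cached)?;
    assert!(!restored.is_relative_allowed("/private/secret.txt"));
    Ok(())
}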
Examples
- parse the most specific user-agent in the provided robots.txt file:
use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#;

    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
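If no group matches the requested user-agent (and there is no * group), RFC 9309 applies no rules to that crawler. A minimal sketch of that default behaviour, assuming the parser follows the RFC here:

use robotxt::Robots;

fn main() {
    // Only foobot is restricted in this file.
    let txt = "User-Agent: foobot\nDisallow: /";

    // barbot matches no group, so no rules apply and every path stays allowed.
    let r = Robots::from_bytes(txt.as_bytes(), "barbot");
    assert!(r.is_relative_allowed("/anything.txt"));
}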
- build a new robots.txt file in a declarative manner:
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_2.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt.to_string());
    Ok(())
}
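The generated output can then be persisted and served from the site root; a minimal sketch of writing it to disk (the output path is illustrative):

use robotxt::RobotsBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let txt = RobotsBuilder::default()
        .group(["foobot"], |u| u.crawl_delay(5).disallow("/private/"))
        .sitemap("https://example.com/sitemap.xml".try_into()?)
        .to_string();

    // Write the rendered file so a web server can expose it at /robots.txt.
    std::fs::write("robots.txt", txt)?;
    Ok(())
}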
Links
- Request for Comments: 9309 on RFC-Editor.com
- Introduction to Robots.txt on Google.com
- How Google interprets Robots.txt on Google.com
- What is a Robots.txt file on Moz.com
Notes
- The parser is based on Smerity/texting_robots.
- The Host directive is not supported.