URL Cleaner
Websites often put unique identifiers into URLs so that, when you send a tweet to a friend, Twitter knows it was you who sent it to them.
As most people do not understand this tracking and therefore cannot consent to it, it is polite to remove the malicious query parameters before sending URLs to people.
URL Cleaner is an extremely versatile tool designed to make this process as fast and easy as possible.
URL Cleaner's default configuration also has a number of options to, for example, change twitter.com links to vxtwitter.com links or fandom.com links to breezewiki.com links.
Basic usage
By default, compiling URL Cleaner includes the `default-config.json` file in the binary. Because of this, URL Cleaner can be used simply with `url-cleaner "https://example.com/of?a=dirty#url"`.
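For example (assuming the `url-cleaner` binary is on your `PATH`; cleaned URLs are printed to STDOUT, one per line):

```sh
# Clean a single URL passed as an argument.
url-cleaner "https://example.com/of?a=dirty#url"

# Clean many URLs at once: command line arguments first, then one URL per line of STDIN.
cat urls.txt | url-cleaner
```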
The default config shouldn't ever change the semantics of a URL. Opening a URL before and after cleaning should always give the same result (except for stuff like the categories Amazon puts in one of its 7 billion navbars, but do you really care about that?).
Because websites tend to not document what parts of their URLs are and aren't necessary, the default config almost certainly runs into issues when trying to clean niche URLs like advanced search queries or API endpoints.
If you find any instance of the default config changing the meaning/result of a URL, please open an issue.
Additionally, if you find any example of a malformed URL that can be unambiguously transformed into what was intended (`https://abc.tumblr.com.tumblr.com` -> `https://abc.tumblr.com` and `https://bsky.app/profile/abc` -> `https://bsky.app/profile/abc.bsky.social`), please open an issue.
Since these are somewhat common when social media sites have dedicated fields for other social media accounts, they're worth handling.
Anonymity
In theory, if you're the only one sharing posts from a website without URL trackers, the website could realize that and track you in the same way.
In practice you are very unlikely to be the only one sharing clean URLs. Search engines generally provide URLs without trackers[citation needed], some people manually remove trackers, and some websites like vxtwitter.com automatically strip URL trackers.
However, for some websites (such as Amazon) URL Cleaner strips more stuff than search engines do. In those cases anonymity does fall back to many people using URL Cleaner and providing cover for each other.
As with Tor, protests, and anything else where privacy matters, safety comes in numbers.
Variables
Variables let you specify behaviour with the `--var name=value --var name2=value2` command line syntax.
Various variables are included in the default config for things I want to do frequently.
- `twitter-embed-domain`: The domain to use for twitter.com URLs when the `discord-compatibility` flag is specified. Defaults to `vxtwitter.com`.
- `breezewiki-domain`: The domain to use to turn `fandom.com` URLs into BreezeWiki URLs. Defaults to `breezewiki.com`.
- `tor2web-suffix`: The suffix to append to the end of `.onion` domains if the `tor2web` flag is set. Should not start with `.` as that's added automatically. Left unset by default.
If a variable is specified in a config's `"params"` field, it can be unspecified using `--unvar var1 --unvar var2`.
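As a sketch of the syntax (the status URL is illustrative; `twitter-embed-domain` and `discord-compatibility` are the documented names above):

```sh
# Explicitly set the embed domain used when the discord-compatibility flag is on.
url-cleaner --flag discord-compatibility --var twitter-embed-domain=vxtwitter.com "https://twitter.com/user/status/123"

# Unspecify a variable that the config's "params" field sets.
url-cleaner --unvar twitter-embed-domain "https://twitter.com/user/status/123"
```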
Flags
Flags let you specify behaviour with the `--flag name --flag name2` command line syntax.
Various flags are included in the default config for things I want to do frequently.
- `no-unmangle`: Disable turning `https://user.example.com.example.com` into `https://user.example.com`, and `https://example.com/https://example.com/abc` and `https://example.com/xyz/https://example.com/abc` into `https://example.com/abc`.
- `no-https-upgrade`: Disable replacing `http://` URLs with `https://` URLs.
- `unmobile`: Convert `https://m.example.com`, `https://mobile.example.com`, `https://abc.m.example.com`, and `https://abc.mobile.example.com` into `https://example.com` and `https://abc.example.com`.
- `youtube-unshort`: Turns `https://youtube.com/shorts/abc` URLs into `https://youtube.com/watch?v=abc` URLs.
- `discord-external`: Replace `images-ext-1.discordapp.net` URLs with the original images they refer to.
- `discord-compatibility`: Sets the domain of `twitter.com` URLs to the domain specified by the `twitter-embed-domain` variable.
- `breezewiki`: Sets the domain of `fandom.com` and BreezeWiki URLs to the domain specified by the `breezewiki-domain` variable.
- `unbreezewiki`: Turn BreezeWiki URLs into `fandom.com` URLs.
- `onion-location`: Send an HTTP GET request to the URL and apply the `Onion-Location` response header if found.
- `tor2web`: Append the suffix specified by the `tor2web-suffix` variable to `.onion` domains.
- `tor2web2tor`: Replace `**.onion.**` domains with `**.onion` domains.
Flags can be added to configs by using the `FlagSet` condition and specified at runtime by doing `--flag flag1 --flag flag2`.
If a flag is set in a config's `"params"` field, it can be unset using `--unflag flag1 --unflag flag2`.
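For example (the URLs are illustrative; the flags are the documented ones above):

```sh
# Convert a mobile URL into its desktop equivalent.
url-cleaner --flag unmobile "https://m.example.com/page"

# Turn a YouTube short into a regular watch URL.
url-cleaner --flag youtube-unshort "https://youtube.com/shorts/abc"

# Unset a flag if the config's "params" field sets it.
url-cleaner --unflag no-https-upgrade "http://example.com"
```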
Custom rules
Although proper documentation of the config schema is pending me being bothered to do it, the `url_cleaner` crate itself is well documented and the structs and enums are (I think) fairly easy to understand.
The main files you want to look at are `conditions.rs` and `mappers.rs`.
Additionally, `url_part.rs`, `string_location.rs`, and `string_modification.rs` are very important for more advanced rules.
Until I publish URL Cleaner on crates.io/docs.rs, you can read the documentation by `git clone`ing this repository, running `cargo doc --no-deps` in the root directory, and then viewing the files in `target/doc/url_cleaner` in a web browser.
If the URL in your web browser looks like `file:///run/...` and the webpage is white with numbers on the left side, you should run `python3 -m http.server` in the root directory and open `http://localhost:8000/target/doc/url_cleaner/`.
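Putting those steps together (the repository URL is a placeholder):

```sh
git clone <repository-url> url-cleaner
cd url-cleaner
cargo doc --no-deps
# If opening target/doc/url_cleaner/index.html directly renders as plain text,
# serve it over HTTP instead and open http://localhost:8000/target/doc/url_cleaner/
python3 -m http.server
```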
Tips for people who don't know Rust's syntax:
- `Option<...>` just means a value can be `null` in the JSON. `{"abc": "xyz"}` and `{"abc": null}` are both valid states for an `abc: Option<String>` field.
- `Box<...>` has no bearing on JSON syntax or possible values. It's just used so Rust can put types inside themselves.
- `Vec<...>` and `HashSet<...>` are written as lists. `HashMap<..., ...>` and `HeaderMap` are written as maps in JSON. `HeaderMap` keys are always lowercase.
- `u8`, `u16`, `u32`, `u64`, `u128`, and `usize` are unsigned (never negative) integers. `i8`, `i16`, `i32`, `i64`, `i128`, and `isize` are signed (maybe negative) integers. `usize` is a `u32` on 32-bit computers and a `u64` on 64-bit computers. Likewise, `isize` is an `i32` and an `i64` under the same conditions. In practice, if a number makes sense to be used in a field then it'll fit.
- If a field starts with `r#` (like `r#else`) you write it without the `r#` (like `"else"`). `r#` is just Rust syntax for "this isn't a keyword".
- `StringSource`, `GlobWrapper`, `RegexWrapper`, `RegexParts`, and `CommandWrapper` can be written as both strings and maps. `RegexWrapper` and `RegexParts` don't do any handling of `/.../i`-style syntax. `CommandWrapper` doesn't do any argument parsing.
- `#[serde(default)]` and `#[serde(default = "...")]` allow for a field to be omitted when the desired value is almost always the same. `#[serde(skip_serializing_if = "...")]` lets the `--print-config` CLI flag omit unnecessary details (like when a field's value is its default value).
- `#[serde(from = "...")]`, `#[serde(into = "...")]`, `#[serde(remote = "...")]`, `#[serde(serialize_with = "...")]`, `#[serde(deserialize_with = "...")]`, and `#[serde(with = "...")]` are implementation details that can be mostly ignored. `#[serde(remote = "Self")]` is a very strange way to allow a struct to be deserialized from a map or a string. See serde_with#702 for details.
Additionally, regex support uses the `regex` crate, which doesn't support look-around and backreferences.
Certain common regex operations are not possible to express without those, but this should never come up in practice.
Custom rule performance
A few commits before the one that added this text, I moved the "always remove these query parameters" rule to the bottom.
That cut the runtime for amazon URLs in half.
The reason is fairly simple: instead of removing some of the query parameters and then removing all of them, removing all of them first means the "remove these parameters" rule has nothing left to do.
While I have done my best to ensure URL Cleaner is as fast as I can get it, that does not mean you shouldn't be careful with rule order.
I know to most people in most cases, 10k URLs in 120ms versus 10k URLs in 60ms is barely noticeable, but that kind of thinking is why video games require mortgages.
MSRV
The Minimum Supported Rust Version is the latest stable release. URL Cleaner may or may not work on older versions, but there's no guarantee.
Untrusted input
Although URL Cleaner has various feature flags that can be disabled to make handling untrusted input safer, no guarantees are made, especially if the config file being used is untrusted.
That said, if you find something to be unnecessarily unsafe, please open an issue so it can be fixed.
(Note that URL Cleaner doesn't use any `unsafe` code. I mean safety in terms of IP leaks and stuff.)
Backwards compatibility
URL Cleaner is currently in heavy flux, so expect library APIs and the config schema to change at any time for any reason.
Command line details
Parsing output
Unless `Mapper::(e|)Print(ln|)` or a `Debug` variant is used, the following should always be true:
1. Input URLs are a list of URLs starting with the URLs provided as command line arguments, followed by each line of STDIN.
2. The nth line of STDOUT corresponds to the nth input URL.
3. If the nth line of STDOUT is empty, then something about reading/parsing/cleaning that URL failed.
4. The nth non-empty line of STDERR corresponds to the nth empty line of STDOUT.
5. Currently, empty STDERR lines are not printed when a URL succeeds. While printing them would make parsing the output easier, it would cause visual clutter on terminals. While this will likely never change by default, parsers should be sure to follow rule 4 strictly in case this is added as an option.
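A minimal sketch of a consumer that relies on those rules (assuming one input URL per line of `urls.txt` and a config with no `Print`/`Debug` mappers):

```sh
# Rule 2: output line n corresponds to input line n.
# Rule 3: an empty output line means that URL failed.
url-cleaner < urls.txt > cleaned.txt 2> errors.txt
paste urls.txt cleaned.txt | awk -F'\t' '$2 == "" { print "failed to clean: " $1 }'
```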
JSON output
There is currently no support for JSON output. This will be added once I can conceptualize it.
Panic policy
URL Cleaner should only ever panic under the following circumstances:
- Parsing the CLI arguments failed.
- Loading/parsing the config failed.
- Printing the config failed.
- Testing the config failed.
- Reading from/writing to STDIN/STDOUT/STDERR has a catastrophic error.
- Running out of memory, resulting in a standard library function/method panicking. This should be extremely rare.
Outside of these cases, URL Cleaner should never panic. However, as this is equivalent to saying "URL Cleaner has no bugs", no actual guarantees can be made.
Funding
URL Cleaner does not accept donations. If you feel the need to donate please instead donate to The Tor Project and/or The Internet Archive.
Default config sources
The people and projects I have stolen various parts of the default config from.