19 releases
Version | Date |
---|---|
0.1.18 | Sep 7, 2024 |
0.1.16 | Jul 15, 2024 |
0.1.12 | Mar 4, 2024 |
0.1.7 | Oct 10, 2023 |
0.1.6 | Jul 22, 2023 |
wdict
Create dictionaries by scraping webpages or crawling local files.
Similar tools (some features inspired by them):
Take it for a spin
# build with nix and run the result
nix build .#
./result/bin/wdict --help
# just run it directly
nix run .# -- --help
# run it without cloning
nix run github:pyqlsa/wdict -- --help
# install from crates.io
# (NixOS users may need to do this within a dev shell)
cargo install wdict
# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help
# ...or a release version
cargo build --release
./target/release/wdict --help
Usage
Create dictionaries by scraping webpages or crawling local files.
Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>|--path <PATH>|--resume|--resume-strict>
Options:
-u, --url <URL>
URL to start crawling from
--theme <THEME>
Pre-canned theme URLs to start crawling from (for fun)
Possible values:
- star-wars: Star Wars themed URL <https://www.starwars.com/databank>
- tolkien: Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
- witcher: Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>
- pokemon: Pokemon themed URL <https://www.smogon.com>
- bebop: Cowboy Bebop themed URL <https://cowboybebop.fandom.com/wiki/Cowboy_Bebop>
- greek: Greek Mythology themed URL <https://www.theoi.com>
- greco-roman: Greek and Roman Mythology themed URL <https://www.gutenberg.org/files/22381/22381-h/22381-h.htm>
- lovecraft: H.P. Lovecraft themed URL <https://www.hplovecraft.com>
-p, --path <PATH>
Local file path to start crawling from
--resume
Resume crawling from a previous run; state file must exist; existence of dictionary is optional; parameters from state are ignored, instead favoring arguments provided on the command line
--resume-strict
Resume crawling from a previous run; state file must exist; existence of dictionary is optional; 'strict' enforces that all arguments from the state file are observed
-d, --depth <DEPTH>
Limit the depth of crawling URLs
[default: 1]
-m, --min-word-length <MIN_WORD_LENGTH>
Only save words greater than or equal to this value
[default: 3]
-x, --max-word-length <MAX_WORD_LENGTH>
Only save words less than or equal to this value
[default: 18446744073709551615]
-j, --include-js
Include javascript from <script> tags and URLs
-c, --include-css
Include CSS from <style> tags and URLs
--filters <FILTERS>...
Filter strategy for words; multiple can be specified (comma separated)
[default: none]
Possible values:
- deunicode: Transform unicode according to <https://github.com/kornelski/deunicode>
- decancer: Transform unicode according to <https://github.com/null8626/decancer>
- all-numbers: Ignore words that consist of all numbers
- any-numbers: Ignore words that contain any number
- no-numbers: Ignore words that contain no numbers
- only-numbers: Keep only words that exclusively contain numbers
- all-ascii: Ignore words that consist of all ascii characters
- any-ascii: Ignore words that contain any ascii character
- no-ascii: Ignore words that contain no ascii characters
- only-ascii: Keep only words that exclusively contain ascii characters
- none: Leave the word as-is
--site-policy <SITE_POLICY>
Site policy for discovered URLs
[default: same]
Possible values:
- same: Allow crawling URL, only if the domain exactly matches
- subdomain: Allow crawling URLs if they are the same domain or subdomains
- sibling: Allow crawling URLs if they are the same domain or a sibling
- all: Allow crawling all URLs, regardless of domain
-r, --req-per-sec <REQ_PER_SEC>
Number of requests to make per second
[default: 5]
-l, --limit-concurrent <LIMIT_CONCURRENT>
Limit the number of concurrent requests to this value
[default: 5]
-o, --output <OUTPUT>
File to write dictionary to (will be overwritten if it already exists)
[default: wdict.txt]
--append
Append extracted words to an existing dictionary
--output-state
Write crawl state to a file
--state-file <STATE_FILE>
File to write state, json formatted (will be overwritten if it already exists)
[default: state-wdict.json]
-v, --verbose...
Increase logging verbosity
-q, --quiet...
Decrease logging verbosity
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
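The number filters and length bounds above are easy to mix up, so here is a minimal Rust sketch of how they could be read. It is not wdict's implementation and does not use the crate's library interface: `NumberFilter`, `keep`, `within_length`, and `select` are hypothetical names made up for illustration. The sketch assumes that a "number" means an ASCII digit and that word length is counted in characters; wdict may define either differently. The default `--max-word-length` of 18446744073709551615 is the largest 64-bit unsigned integer, so it acts as "no upper bound" in practice.

```rust
/// Hypothetical mirror of four `--filters` values from the help text.
/// These are illustrative names, not types exported by wdict.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NumberFilter {
    AllNumbers,  // ignore words that consist only of digits
    AnyNumbers,  // ignore words that contain at least one digit
    NoNumbers,   // ignore words that contain no digits
    OnlyNumbers, // keep only words that consist only of digits
}

// Assumption: "number" means an ASCII digit (0-9).
fn all_digits(word: &str) -> bool {
    !word.is_empty() && word.chars().all(|c| c.is_ascii_digit())
}

fn any_digit(word: &str) -> bool {
    word.chars().any(|c| c.is_ascii_digit())
}

/// Returns true when the word survives the given filter.
pub fn keep(word: &str, filter: NumberFilter) -> bool {
    match filter {
        NumberFilter::AllNumbers => !all_digits(word),
        NumberFilter::AnyNumbers => !any_digit(word),
        NumberFilter::NoNumbers => any_digit(word),
        NumberFilter::OnlyNumbers => all_digits(word),
    }
}

/// Both bounds are inclusive, matching "greater than or equal to" and
/// "less than or equal to" in the help text.
/// Assumption: length is counted in characters, not bytes.
pub fn within_length(word: &str, min: usize, max: usize) -> bool {
    let len = word.chars().count();
    len >= min && len <= max
}

/// Applies the length bounds and one number filter, preserving input order.
pub fn select<'a>(
    words: &[&'a str],
    filter: NumberFilter,
    min: usize,
    max: usize,
) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|w| within_length(w, min, max) && keep(w, filter))
        .collect()
}
```

Under these assumptions, `all-numbers` and `only-numbers` are exact complements, while `any-numbers` and `no-numbers` split the input on whether a word contains a digit at all. With the default minimum length of 3, a two-letter word is dropped regardless of which filter is chosen.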
Lib
This crate exposes a library, but for the time being, the interfaces should be considered unstable.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.