#tags #command-line-tool #word-analysis #writing-analysis

bin+lib tagalyzer

A CLI tool to gather statistics on collections of plaintext-adjacent files

6 releases (3 breaking)

0.3.0 Oct 23, 2023
0.2.0 Oct 20, 2023
0.1.1 Oct 9, 2023
0.0.2 Sep 22, 2023

#1535 in Text processing

43 downloads per month

MIT/Apache

40KB
679 lines

Tagalyzer

Tagalyzer is a CLI tool that counts words in files, then prints the counts in a human-readable format. I made it to analyze my own writing and help me pick tags for blog posts.

This tool will eventually be a relative word-frequency analyzer. The goal is to point it at a directory or list of files and have it report statistics for the sum total of all words across all files, as well as break out how word frequency varies by file.
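The core idea, per-file counts merged into a corpus-wide total, can be sketched in plain Rust with the standard library. This is only an illustration of the technique, not the crate's actual code; the file names and sample texts are made up:

```rust
use std::collections::HashMap;

// Count word frequencies in one text: split on non-alphanumeric
// characters and fold case so "The" and "the" share a key.
fn word_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Stand-ins for files on disk.
    let files = [
        ("post-1.md", "the cat sat on the mat"),
        ("post-2.md", "the dog and the cat"),
    ];

    // Merge each file's counts into a corpus-wide total.
    let mut total: HashMap<String, usize> = HashMap::new();
    for (name, text) in files {
        let counts = word_counts(text);
        println!("{name}: {counts:?}");
        for (word, n) in counts {
            *total.entry(word).or_insert(0) += n;
        }
    }

    // "the" appears twice in each sample, four times overall.
    assert_eq!(total["the"], 4);
    assert_eq!(total["cat"], 2);
}
```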

Install

CLI

If you want to analyze writing samples yourself, install the command line tool:

cargo install tagalyzer

After that, try running tagalyzer --help to see the usage, then check out the examples below.

Library

If you want to use the library to do text analysis in your own project, use Cargo to add Tagalyzer as a dependency:

cargo add tagalyzer

Examples

$ tagalyzer LICENSE-* # Glob matching, case-insensitive text processing
Sorted wordcount for LICENSE-MIT
software       : 10
without        : 4
including      : 4
--- [snip] ---
Sorted wordcount for LICENSE-APACHE
work           : 33
any            : 30
license        : 26
--- [snip] ---
$ tagalyzer LICENSE-MIT -c # Case sensitive when counting, not when filtering
Sorted wordcount for LICENSE-MIT
Software       : 6
SOFTWARE       : 4
ANY            : 3
this           : 3
including      : 3
OTHER          : 3
--- [snip] ---
$ tagalyzer LICENSE-MIT -ci # Case sensitive, filters "or" but not "OR"
Sorted wordcount for LICENSE-MIT
OR             : 8
THE            : 7
Software       : 6
OF             : 5
IN             : 5
SOFTWARE       : 4
--- [snip] ---
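The flags above toggle case sensitivity independently for counting and for stopword filtering. A minimal sketch of those semantics, assuming a hypothetical `count` helper and stopword list (this is not the tool's implementation):

```rust
use std::collections::HashMap;

// `count_sensitive` keeps "Software" and "SOFTWARE" as separate keys;
// `filter_sensitive` decides whether a stopword list containing only
// "or" also removes "OR".
fn count(
    text: &str,
    stopwords: &[&str],
    count_sensitive: bool,
    filter_sensitive: bool,
) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        let filtered = if filter_sensitive {
            stopwords.contains(&word)
        } else {
            stopwords.iter().any(|s| s.eq_ignore_ascii_case(word))
        };
        if filtered {
            continue;
        }
        let key = if count_sensitive {
            word.to_string()
        } else {
            word.to_lowercase()
        };
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let stop = ["or"];
    let text = "Software OR software or SOFTWARE";

    // Case-insensitive everywhere: both "or" and "OR" are filtered,
    // and the three software variants collapse into one key.
    let a = count(text, &stop, false, false);
    assert_eq!(a["software"], 3);
    assert_eq!(a.get("or"), None);

    // Case-sensitive counting and filtering: "OR" survives the filter
    // and each casing gets its own count.
    let b = count(text, &stop, true, true);
    assert_eq!(b["OR"], 1);
    assert_eq!(b["Software"], 1);
}
```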

Long-Term Plans

I plan to develop this tool into both a CLI binary and a parallel library, providing an out-of-the-box solution and deep customization, respectively. It will fit into my workflow by reporting the frequency of words and phrases (i.e. runs of up to n words or characters) across the directory where I keep all my blog posts, which I can use to help me decide on a set of applicable tags.
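The planned phrase analysis amounts to counting word n-grams. A small sketch of that idea in plain Rust (illustrative only, not the crate's API; the helper name is made up):

```rust
use std::collections::HashMap;

// Count every run of 1..=n consecutive words (word n-grams) in a text.
fn ngram_counts(text: &str, n: usize) -> HashMap<String, usize> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut counts = HashMap::new();
    for len in 1..=n {
        // `windows` yields each overlapping run of `len` words.
        for window in words.windows(len) {
            *counts.entry(window.join(" ")).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let counts = ngram_counts("to be or not to be", 2);
    assert_eq!(counts["to be"], 2); // the bigram appears twice
    assert_eq!(counts["be"], 2);
    assert_eq!(counts["be or"], 1);
}
```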

License

This work is licensed under either the MIT or Apache 2.0 license at the choice of the user.

Contributions are assumed to be licensed under MIT unless otherwise stated.

The Rust language and various libraries are used in this project under the MIT license.

Contributing

Contributions are always welcome! The project is hosted on GitLab. Bug reports, commits, or even just suggestions are appreciated.

If you do want to contribute code, I'm more familiar with merging branches than forks. I have gating tests and lints in CI, which should be equivalent to the code block below. If the code or results ever differ between this block running locally and what happens in CI, please open an issue.

cargo fmt &&
cargo test &&
cargo clippy --no-deps -- \
    -Dclippy::pedantic \
    -Dclippy::nursery \
    -Dclippy::style \
    -Dclippy::unwrap_used \
    -Dclippy::expect_used \
    -Dclippy::missing_docs_in_private_items \
    -Dclippy::single_char_lifetime_names \
    -Dclippy::use_self \
    -Dclippy::str_to_string \
    -Ddead_code \
    -Aclippy::needless_return \
    -Aclippy::tabs_in_doc_comments \
    -Aclippy::needless_raw_string_hashes \
    -Dwarnings

Dependencies

~3.5–5MB
~88K SLoC