#markdown #command-line-tool #word #markdown-syntax #unicode-aware #counting #document

bin+lib count-md

A simple, configurable command-line tool and Rust library for Unicode-aware, Markdown-aware, HTML-aware word counting in Markdown documents

1 unstable release

0.1.0 Aug 2, 2024

#6 in #unicode-aware

BlueOak-1.0.0

27KB
730 lines

count-md

A simple, configurable command-line tool and Rust library for Unicode-aware, Markdown-aware, HTML-aware word counting in Markdown documents.

That is: this tool will correctly count words in a Unicode-aware way, without incorrectly including Markdown syntax or HTML tags. It can include or exclude content like blockquotes, footnotes, code blocks, and so on, and ships with reasonable defaults out of the box for each!

Example

You might have a file with content like this:

# Title

This is some text!

> Here is a quote from someone else.

Here is more text.

If you wanted to know the number of non-quoted words, including the title but not including the blockquote, you would simply run count-md <path to the file>, and it will helpfully report that there are 9 words total. By contrast, wc -w will report that there are 18 words: it includes the blockquote, of course, but it also includes the # for the title and the > for the blockquote, neither of which is desirable!

Status

Support for including or or excluding the following Markdown features:

  • Headings
  • Blockquotes
    • Nested blockquotes
    • Admonitions[^admonitions]
  • Code blocks
  • Inline code
  • Block HTML 🚧 Partial
  • Footnotes
  • Tables
  • Math

[^admonitions]: Admonitions are not blockquotes, but they are listed here because that is how they work syntactically.

Library

The core functionality here can be used as a Rust library.[^c] There are two main entry points:

  • count: accepts a &str and counts it with the default set of options, equivalent to running count-md with zero options on the command line.

  • count_with_options: accepts a &str and an Options value (a bitmask), which allows you to configure each option directly. For the equivalent to running count-md with some option, use Options::DEFAULT and combine it with other flags:

    • With bitmasking directly: Options::DEFAULT | Options::IncludeBlockquotes

    • With the methods supplied by the bitflags library, insert and remove:

      let mut options = Options::DEFAULT;
      options.insert(Options::IncludeBlockquotes);
      options.remove(Options::IncludeHeadings);
      

See the documentation for more!

[^c]: In the future, I may also supply C bindings, but those need quite a bit of vetting before I am comfortable doing that!

Dependencies

~5–13MB
~155K SLoC