2 releases

0.1.1	Oct 26, 2021
0.1.0	Oct 26, 2021

#6 in #trail

MIT license

36KB
797 lines

Fetch MediaWiki site configuration

Helper script to properly configure parse_wiki_text for different wikis.

Originally at https://github.com/portstrom/fetch_mediawiki_configuration (Internet Archive snapshot), the repo is now deleted, along with all the user's related wiki repos. The libraries are preserved in various github forks and on docs.rs. However, no public copies of fetch_mediawiki_configuration are available.

This project is a recreation of the functionality of the original script. Inferences and assumptions made are documented under Implementation notes.

Usage

The project uses the cargo package manager. Build and run in the usual manner.

Pass command line argument --help for more usage information. For example:

cargo run -- --help

Implementation notes

All information needed for the ConfigurationSource is fetched from the MediaWiki Action API instance at the given domain. We use the query siteinfo metadata endpoint, with siprop set to the categories we need. See https://www.mediawiki.org/wiki/API:Siteinfo for more detailed documentation of the response.

Normalization

parse_wiki_text stores most configuration values using a trie, which implicitly inserts new values with all possible case-folded variants. For extension_tags, a case-sensitive data structure is used, but ASCII characters in the wiki text are converted to lowercase before comparisons. As extension tags are expected to be ASCII-only, this leaves us with effectively case-insensitive comparison. For link_trail, a HashSet<char> is used and the characters in it are compared to the wiki text directly.

In other words, all fields of ConfigurationSource are case-insensitive, except for link_trail. This implementation normalizes all configuration values, apart from link-trail characters, to lowercase. In addition, duplicates are removed and values are sorted in ascending order in each field; although not strictly necessary for correctness, it removes clutter and improves reproducibility.

Namespaces

category_namespaces and file_namespaces are extracted from siprop=namespaces and siprop=namespacealiases.

These fields must contain the primary and canoncial namespace names in addition to the aliases, despite the misleading names and lacking documentation.

Extension tags and protocols

extension_tags is straightforwardly extracted from siprop=extensiontags.

protocols is straightforwardly extracted from siprop=protocols.

Link trail

link_trail is extracted from siprop=general under key linktrail.

This is a PHP PCRE pattern containing two groups:

the trailing section, to be a part of the link it follows
everything after

We do some simple parsing of the pattern using the regex-syntax crate. The modifiers are also parsed and used where applicable. If group 1 is empty, the link trail has no characters. If it is a repetiton structure, the repeated part is extracted recursively, only allowing constructs that yield single-character sequences. Otherwise, the regex is considered invalid.

Limitations

This approach only accepts regexes with a specific structure, and does not take into account differences between PHP PCREs and rust regexes. However, the patterns are not expected to be diverse enough to cause problems with respect to structure, nor complex enough for the syntax differences to matter. These differences are for the most part minor or edge cases, since both syntaxes derive directly from Perl regexes.

A more serious limitation is that link_trail cannot store anything more complicated than a simple set of characters. If the repeated part in the regex contains concatenations, lookaheads, or similar, it cannot be represented in the field, and a fatal error results. This does affect a few actual wiki instances (try e.g. ca.wiktionary.org or se.wikipedia.org), but as it is a limitation of parse_wiki_text there is currently no way around it.

Magic words

magic_words is extracted from siprop=magicwords.

All magic words and their aliases are searched. We only accept those both prefixed and suffixed by __, and remove the prefix and suffix.

Redirect

redirect_magic_words is extracted from the same category.

The aliases of the magic word with name redirect are collected. Since the parse_wiki_text parser performs a lookup among these only after it has already found the starting #, we remove any starting #. In addition, redirect itself must also be included.

License

This project is licensed under the MIT license.

Dependencies

~9–21MB
~320K SLoC