#annotation #linguistics #nlp #standoff

app stam-tools

Command-line tools for working with standoff annotations on text (STAM)

5 releases

0.2.0 Jun 7, 2023
0.1.3 Apr 19, 2023
0.1.2 Apr 2, 2023
0.1.1 Mar 27, 2023
0.1.0 Mar 25, 2023

#890 in Text processing

GPL-3.0-only

100KB
2.5K SLoC


Project status: active – the project has reached a stable, usable state and is being actively developed.

STAM Tools

A collection of command-line tools for working with STAM.

Various tools are grouped under the stam tool, and invoked with a subcommand:

  • stam annotate - Add annotations (or datasets or resources) from STAM JSON files
  • stam info - Return information regarding a STAM model.
  • stam init - Initialize a new STAM annotationstore
  • stam import - Import STAM data in tabular form from a simple TSV (Tab Separated Values) format.
  • stam print - Output the text of any resources in the model.
  • stam export - Export STAM data in tabular form to a simple TSV (Tab Separated Values) format. This is not lossless but provides a decent view on the data. It provides a lot of flexibility by allowing you to configure the output columns.
  • stam validate - Validate a STAM model.
  • stam save - Write a STAM model to file(s). This can be used to switch between STAM JSON and STAM CSV output, based on the extension.
  • stam tag - Regular-expression based tagger on plain text.

For many of these, you can set --verbose for extra details in the output.
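For example, a minimal session might look like the following (the store file name is hypothetical, and the exact flags may differ per subcommand; consult each subcommand's --help):

```shell
# Initialise a new annotation store, then inspect it with extra details
stam init my.store.stam.json
stam info --verbose my.store.stam.json
```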

Installation

From source

$ cargo install stam-tools

Usage

Add the --help flag after the subcommand for extensive usage instructions.

Most tools take as input a STAM JSON file containing an annotation store. Any files mentioned via the @include mechanism are loaded automatically.

Instead of passing STAM JSON files, you can read from stdin and/or write to stdout by setting the filename to -; this works in many places.
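For instance, the following hypothetical pipeline reads a store from stdin (the file name is an assumption for illustration):

```shell
# Print information about a store read from stdin
cat my.store.stam.json | stam info -
```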

These tools also support reading and writing STAM CSV.

Tools

stam export

The stam export tool is used to export STAM data into a tabular data format (TSV, tab separated values). You can configure precisely which columns you want to export using the --columns parameter. See stam export --help for a list of supported columns.

One of the more powerful features is that you can specify custom columns by giving a set ID, a delimiter and a key ID (the delimiter defaults to a slash), for instance: my_set/part_of_speech. This will output the corresponding value in that column, if it exists.

This export function is not lossless, that is, it can not encode everything that STAM supports, unlike STAM JSON and STAM CSV. It does, however, give you a great deal of flexibility to quickly output only the data relevant for whatever your specific purpose is.
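A sketch of such an export (the store file name is hypothetical; check stam export --help for the exact column names your version supports):

```shell
# Export one row per annotation, with a custom column from set my_set
stam export --columns Text,TextResource,BeginOffset,EndOffset,my_set/part_of_speech my.store.stam.json
```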

stam import

The stam import tool is used to import tabular data from a TSV (Tab Separated Values) file into STAM. Like stam export, you can configure precisely what columns you want to import, using the --columns parameter. By default, the import function will attempt to parse the first line of your TSV file as the header and use that to figure out the column configuration. You will often want to set --annotationset to set a default annotation set to use for custom columns. If you set --annotationset my_set then a column like part_of_speech will be interpreted in that set (same as if you wrote my_set/part_of_speech explicitly).

Here is a simple example of a possible import TSV file (with --annotationset my_set):

Text	TextResource	BeginOffset	EndOffset	part_of_speech
Hello	hello.txt	0	5	interjection
world	hello.txt	6	10	noun

The import function has some special abilities. If your TSV data does not mention specific offsets in the text resource(s), these will be looked up automatically during the import procedure. If the text resources do not exist in the first place, they can be reconstructed (within certain constraints; the reconstructed text will likely be in tokenised form only). If your data does not explicitly reference a resource, use the --resource parameter to point to an existing resource that will act as a default, or --new-resource for the reconstruction behaviour.

By setting --resource hello.txt or --new-resource hello.txt you can import the following much more minimal TSV:

Text	part_of_speech
Hello	interjection
world	noun
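As a sketch (file names are hypothetical, and the stam invocation assumes stam-tools is installed), such a file can be built and imported like this:

```shell
# Build the minimal two-column TSV shown above (printf expands \t to a tab)
printf 'Text\tpart_of_speech\n' > min.tsv
printf 'Hello\tinterjection\n' >> min.tsv
printf 'world\tnoun\n' >> min.tsv

# Import it, reconstructing the resource (hypothetical invocation):
#   stam import --annotationset my_set --new-resource hello.txt min.tsv

# Sanity check: two data rows follow the header
tail -n +2 min.tsv | wc -l
```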

The importer supports empty lines within the TSV file. When reconstructing text, these will typically map to a newline in the to-be-constructed text (this is configurable with --outputdelimiter2). Likewise, the delimiter between rows is configurable with --outputdelimiter and defaults to a space.

Note that stam import cannot import everything stam export can produce. It can only import rows exported with --type Annotation (the default), in which each row corresponds to one annotation.

stam tag

The stam tag tool matches regular expressions against text and subsequently associates annotations with the matches. It can be used, for example, for tokenisation or other tagging tasks.

The stam tag command takes a TSV file containing regular-expression rules for the tagger. The file contains the following columns:

  1. The regular expression, following the syntax of Rust's regex crate. The expression may contain one or more capture groups identifying the items to be tagged; in that case, anything else is considered context and will not be tagged.
  2. The ID of the annotation data set
  3. The ID of the data key
  4. The value to set. If this follows the syntax $1, $2, etc., the value of the corresponding capture group (1-indexed) will be assigned instead.

Example:

#EXPRESSION	#ANNOTATIONSET	#DATAKEY	#DATAVALUE
\w+(?:[-_]\w+)*	simpletokens	type	word
[\.\?,/]+	simpletokens	type	punctuation
[0-9]+(?:[,\.][0-9]+)	simpletokens	type	number
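For illustration, the rules above could be written to a file and applied as follows (the store file name and the exact stam tag flags are assumptions, not taken from the tool's --help):

```shell
# Write the example rules (printf expands \t to a tab; escaped backslashes
# such as \\w are emitted literally as \w)
printf '#EXPRESSION\t#ANNOTATIONSET\t#DATAKEY\t#DATAVALUE\n' > rules.tsv
printf '\\w+(?:[-_]\\w+)*\tsimpletokens\ttype\tword\n' >> rules.tsv
printf '[0-9]+(?:[,\\.][0-9]+)\tsimpletokens\ttype\tnumber\n' >> rules.tsv

# Apply the tagger (hypothetical invocation):
#   stam tag --rules rules.tsv my.store.stam.json

# Every line should have four tab-separated fields
awk -F '\t' '{ print NF }' rules.tsv
```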

Dependencies

~7–9MB
~155K SLoC