A collection of command-line tools for working with STAM.
Various tools are grouped under the
stam tool, and invoked with a subcommand:
- stam annotate: Add annotations (or datasets or resources) from STAM JSON files.
- stam info: Return information regarding a STAM model.
- stam init: Initialize a new STAM annotation store.
- stam import: Import STAM data in tabular form from a simple TSV (Tab Separated Values) format.
- stam print: Output the text of any resources in the model.
- stam export: Export STAM data in tabular form to a simple TSV (Tab Separated Values) format. This is not lossless but provides a decent view of the data. It provides a lot of flexibility by allowing you to configure the output columns.
- stam validate: Validate a STAM model.
- stam save: Write a STAM model to file(s). This can be used to switch between STAM JSON and STAM CSV output, based on the extension.
- stam tag: Regular-expression based tagger on plain text.
For many of these, you can set
--verbose for extra details in the output.
$ cargo install stam-tools
Add the --help flag after the subcommand for extensive usage instructions.
Most tools take as input a STAM JSON file containing an annotation store. Any
files mentioned via the
@include mechanism are loaded automatically.
Instead of passing STAM JSON files, you can read from stdin and/or output to
stdout by setting the filename to - (a single hyphen). This works in many places.
These tools also support reading and writing STAM CSV.
The stam export tool is used to export STAM data into a tabular data format
(TSV, tab separated values). You can configure precisely what columns you want
to export using the --columns parameter. See stam export --help for a
list of supported columns.
One of the more powerful functions is that you can specify custom columns by
specifying a set ID, a delimiter and a key ID (the delimiter by default is a
slash), for instance: my_set/part_of_speech. This will then output the
corresponding value in that column, if it exists.
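The custom column notation can be illustrated with a short Python sketch. The function name `parse_custom_column` is hypothetical, and splitting at the last delimiter is an assumption made here for illustration; it is not taken from the tool's implementation.

```python
def parse_custom_column(column, delimiter="/"):
    """Split a custom column name such as 'my_set/part_of_speech'
    into (set ID, key ID).

    Assumption: the split happens at the LAST delimiter, so that a set ID
    may itself contain the delimiter. The real tool may behave differently.
    """
    set_id, sep, key_id = column.rpartition(delimiter)
    if not sep or not set_id or not key_id:
        raise ValueError(f"expected SET{delimiter}KEY, got {column!r}")
    return set_id, key_id


set_id, key_id = parse_custom_column("my_set/part_of_speech")
print(set_id, key_id)
```

A different delimiter can be passed as the second argument, mirroring the configurable delimiter described above.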
This export function is not lossless: unlike STAM JSON and STAM CSV, it cannot encode everything that STAM supports. It does, however, give you a great deal of flexibility to quickly output only the data relevant to your specific purpose.
The stam import tool is used to import tabular data from a TSV (Tab Separated
Values) file into STAM. Like stam export, you can configure precisely what
columns you want to import, using the --columns parameter. By default, the
import function will attempt to parse the first line of your TSV file as the
header and use that to figure out the column configuration. You will often
want to set --annotationset to set a default annotation set to use for
custom columns. If you set --annotationset my_set then a column like
part_of_speech will be interpreted in that set (same as if you wrote
my_set/part_of_speech explicitly).
Here is a simple example of a possible import TSV file (with --annotationset my_set):
Text	TextResource	BeginOffset	EndOffset	part_of_speech
Hello	hello.txt	0	5	interjection
world	hello.txt	6	11	noun
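To make the header-driven column configuration concrete, here is a minimal Python sketch of how such a file could be read. It is not the tool's implementation: `read_annotations` and `BUILTIN_COLUMNS` are hypothetical names, and only the four built-in column names that appear in the example header are recognised.

```python
import csv
import io

# Assumption: only the built-in column names visible in the example header.
BUILTIN_COLUMNS = {"Text", "TextResource", "BeginOffset", "EndOffset"}


def read_annotations(tsv_text, default_set, delimiter="/"):
    """Parse TSV text whose first line is the header; one annotation per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    annotations = []
    for row in reader:
        data = {}
        for column, value in row.items():
            if column in BUILTIN_COLUMNS:
                continue
            # Custom column: 'set/key', or a bare 'key' in the default set.
            set_id, sep, key_id = column.rpartition(delimiter)
            if not sep:
                set_id = default_set
            data[(set_id, key_id)] = value
        annotations.append({
            "resource": row["TextResource"],
            "begin": int(row["BeginOffset"]),
            "end": int(row["EndOffset"]),
            "text": row["Text"],
            "data": data,
        })
    return annotations


tsv = (
    "Text\tTextResource\tBeginOffset\tEndOffset\tpart_of_speech\n"
    "Hello\thello.txt\t0\t5\tinterjection\n"
)
first = read_annotations(tsv, default_set="my_set")[0]
print(first)
```

The bare part_of_speech column ends up under the pair (my_set, part_of_speech), which is the effect the default annotation set is described as having.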
The import function has some special abilities. If your TSV data does not
mention specific offsets in a text resource, they will be looked up
automatically during the import procedure. If the text resources don't even
exist in the first place, they can be reconstructed (within certain
constraints; the output text will likely be in tokenised form only). If your
data does not explicitly reference a resource, use the --resource parameter
to point to an existing resource that will act as a default, or
--new-resource for the reconstruction behaviour.
With --resource hello.txt or --new-resource hello.txt you can import the
following much more minimal TSV:
Text	part_of_speech
Hello	interjection
world	noun
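The automatic offset lookup can be pictured as a left-to-right search through the resource text. The Python sketch below is an illustration only: `find_offsets` is a hypothetical name, the text "Hello world" is an assumed content for hello.txt, and the search strategy is a guess rather than a description of what stam import actually does.

```python
def find_offsets(text, tokens):
    """Locate each token in text, searching left to right.

    Returns (token, begin, end) tuples; end offsets are exclusive,
    following Python's slice convention.
    """
    offsets = []
    cursor = 0  # never search before the end of the previous match
    for token in tokens:
        begin = text.find(token, cursor)
        if begin == -1:
            raise ValueError(f"{token!r} not found after offset {cursor}")
        end = begin + len(token)
        offsets.append((token, begin, end))
        cursor = end
    return offsets


# Assumed content of hello.txt, for illustration only.
print(find_offsets("Hello world", ["Hello", "world"]))
```

Advancing the cursor after every match keeps repeated tokens from resolving to the same occurrence twice.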
The importer supports empty lines within the TSV file. When reconstructing
text, these will map to (typically) a newline in the to-be-constructed text
(this is configurable with --outputdelimiter2). Likewise, the delimiter
between rows is configurable with --outputdelimiter, and defaults to a space.
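The reconstruction behaviour can be sketched in a few lines of Python. This is an assumption-laden illustration, not the tool's code: `reconstruct` is a hypothetical name, its two delimiter arguments stand in for --outputdelimiter and --outputdelimiter2, and the handling of leading or repeated empty lines is a guess.

```python
def reconstruct(rows, delimiter=" ", delimiter2="\n"):
    """Rebuild a text from the Text column of consecutive TSV rows.

    An empty string in rows stands for an empty line in the TSV file.
    Returns the text and (token, begin, end) offsets, end exclusive.
    """
    text = ""
    offsets = []
    pending = ""  # separator to emit before the next token
    for token in rows:
        if token == "":
            # Empty line: the next token is preceded by delimiter2.
            if text:
                pending = delimiter2
            continue
        text += pending
        begin = len(text)
        text += token
        offsets.append((token, begin, len(text)))
        pending = delimiter
    return text, offsets


text, offsets = reconstruct(["Hello", "world", "", "Bye"])
print(repr(text), offsets)
```

Because the offsets are recorded while the text is being built, each annotation can point into the new resource without a separate lookup step.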
stam import cannot import everything it can itself export. It can only import rows
exported with --type Annotation (the default), in which each row
corresponds with one annotation.
The stam tag tool can be used for matching regular expressions in text and
subsequently associating annotations with the found results. It can be used,
for example, for tokenization or other tagging tasks.
The stam tag command takes a TSV file containing regular expression rules for the tagger.
The file contains the following columns:
- The regular expression, in the syntax of the Rust regex crate. The expression may contain one or more capture groups containing the items that will be tagged; in that case anything else is considered context and will not be tagged.
- The ID of the annotation data set.
- The ID of the data key.
- The value to set. If this follows the syntax $1, $2, etc., it will assign the value of that capture group (1-indexed).
#EXPRESSION	#ANNOTATIONSET	#DATAKEY	#DATAVALUE
\w+(?:[-_]\w+)*	simpletokens	type	word
[\.\?,/]+	simpletokens	type	punctuation
[0-9]+(?:[,\.][0-9]+)	simpletokens	type	number
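To show how rules of this shape can drive a tagger, here is a minimal Python sketch using the standard re module. It is not how stam tag is implemented: the `tag` function is hypothetical, every rule is applied independently with no conflict resolution between overlapping matches, and Python's regular expression dialect differs in places from the one the tool uses.

```python
import re

# The three rules from the example, as (expression, set, key, value).
RULES = [
    (r"\w+(?:[-_]\w+)*", "simpletokens", "type", "word"),
    (r"[\.\?,/]+", "simpletokens", "type", "punctuation"),
    (r"[0-9]+(?:[,\.][0-9]+)", "simpletokens", "type", "number"),
]


def tag(text, rules):
    """Return (begin, end, set ID, key ID, value) for every match of every rule."""
    annotations = []
    for pattern, set_id, key_id, value in rules:
        regex = re.compile(pattern)
        for match in regex.finditer(text):
            if regex.groups:
                # Only the capture groups are tagged; the rest is context.
                spans = [match.span(g) for g in range(1, regex.groups + 1)
                         if match.group(g) is not None]
            else:
                spans = [match.span()]
            # Substitute $1, $2, ... with the matching capture group.
            resolved = re.sub(r"\$(\d+)",
                              lambda ref: match.group(int(ref.group(1))) or "",
                              value)
            for begin, end in spans:
                annotations.append((begin, end, set_id, key_id, resolved))
    return annotations


for annotation in tag("Hello, world", RULES):
    print(annotation)
```

Results are ordered by rule and then by position in the text. Note that the number rule, as written, only matches digits followed by a separator and more digits, so a bare integer is picked up by the word rule instead.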