
bin+lib annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests

7 releases (breaking)

new 0.6.0 Apr 22, 2024
0.5.0 Jan 19, 2024
0.4.0 Nov 13, 2023
0.3.1 Aug 4, 2023
0.1.0 Apr 12, 2023


Apache-2.0

685KB
16K SLoC


Annatto

This software tests and converts data within the RUEG research group at Humboldt-Universität zu Berlin. The tests continuously evaluate the state of the RUEG corpus data in order to identify issues regarding compatibility, consistency, and integrity early, and to facilitate data handling with regard to annotation, releases, and integration.

For efficiency, annatto relies on the graphANNIS representation and already provides a basic set of data-handling modules.

Installing and running annatto

Annatto is a command line program, which is available pre-compiled for Linux, Windows and macOS. Download and extract the latest release file for your platform.

The main usage of annatto is through the command line interface. Run

annatto --help

to get more help on the sub-commands. The most important command is annatto run <workflow-file>, which runs all the modules defined in the given workflow file.

Modules

Annatto comes with a number of modules, which have different types:

Importer modules allow importing files from different formats. More than one importer can be used in a workflow, but then the corpus data needs to be merged using one of the merger manipulators. When running a workflow, the importers are executed first and in parallel.

Graph operation modules change the imported corpus data. They are executed one after another (non-parallel) and in the order they have been defined in the workflow.

Exporter modules export the data into different formats. More than one exporter can be used in a workflow. When running a workflow, the exporters are executed last and in parallel.
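The execution order described above can be sketched as a minimal workflow file. The paths are illustrative, and the textgrid and graphml formats are taken from the examples in this document:

```toml
# Importers run first, in parallel
[[import]]
path = "textgrid/exampleCorpus/"
format = "textgrid"

[import.config]

# Graph operations run next, one after another, in the order given
[[graph_op]]
action = "check"

[graph_op.config]
tests = []

# Exporters run last, in parallel
[[export]]
path = "output/exampleCorpus"
format = "graphml"

[export.config]
```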

To list all available formats (importer, exporter) and graph operations run

annatto list

To show information about modules for the given format or graph operation use

annatto info <name>

Creating a workflow file

Annatto workflow files list which importers, graph operations and exporters to execute. We use a TOML file with the ending .toml to configure the workflow. TOML files can be as simple as key-value pairs, like config-key = "config-value". But they also allow representing more complex structures, such as lists. The TOML website has a great "Quick Tour" section which explains the basic concepts of TOML with examples.
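As a quick orientation, the two TOML constructs that annatto workflows rely on are key-value pairs and arrays of tables (the double-bracket [[...]] headers, which can repeat). The keys below are arbitrary illustrations, not annatto options:

```toml
# A simple key-value pair
title = "my workflow"

# An array (list) value
numbers = [ 1, 2, 3 ]

# An array of tables: each repeated [[step]] header appends one entry
[[step]]
name = "first"

[[step]]
name = "second"
```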

Import

An import step starts with the header [[import]], a configuration value for the key path defining where to read the corpus from, and the key format declaring which format the corpus is encoded in. The file path is relative to the workflow file. Importers also have an additional configuration section that follows the [[import]] section and is marked with the [import.config] header.

[[import]]
path = "textgrid/exampleCorpus/"
format = "textgrid"

[import.config]
tier_groups = { tok = [ "pos", "lemma", "Inf-Struct" ] }
skip_timeline_generation = true
skip_audio = true
skip_time_annotations = true
audio_extension = "wav"

You can have more than one importer; simply list all the importers at the beginning of the workflow file. An importer always needs a configuration header, even if it does not set any specific configuration option.

[[import]]
path = "a/mycorpus/"
format = "format-a"

[import.config]

[[import]]
path = "b/mycorpus/"
format = "format-b"

[import.config]

[[import]]
path = "c/mycorpus/"
format = "format-c"

[import.config]

# ...

Graph operations

Graph operations use the header [[graph_op]] and the key action to describe which action to execute. Since there are no files to import/export, they don't have a path configuration.

[[graph_op]]
action = "check"

[graph_op.config]
# Empty list of tests
tests = []
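An empty tests list makes the check a no-op. A sketch of a check with one actual test, using the AQL query tok (which matches every token) and an open upper bound on the match count:

```toml
[[graph_op]]
action = "check"

[graph_op.config]
report = "list"

# Each [[graph_op.config.tests]] table adds one test
[[graph_op.config.tests]]
query = "tok"
expected = [ 1, inf ]
description = "There is at least one token."
```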

Export

Exporters work similarly to importers, but use the header [[export]] instead.

[[export]]
path = "output/exampleCorpus"
format = "graphml"

[export.config]
add_vis = "# no vis"
guess_vis = true

Full example

You cannot mix import, graph operation, and export headers: first list all the import steps, then the graph operations, and then the export steps.

[[import]]
path = "conll/ExampleCorpus"
format = "conllu"
config = {}

[[graph_op]]
action = "check"

[graph_op.config]
report = "list"

[[graph_op.config.tests]]
query = "tok"
expected = [ 1, inf ]
description = "There is at least one token."

[[graph_op.config.tests]]
query = "node ->dep node"
expected = [ 1, inf ]
description = "There is at least one dependency relation."

[[export]]
path = "graphml/"
format = "graphml"

[export.config]
add_vis = "# no vis"
guess_vis = true

Developing annatto

You need to install Rust to compile the project. For developing annatto, we recommend installing the Cargo subcommands cargo-llvm-cov (for code coverage) and cargo-release (for performing releases), as described in the following sections.

Execute tests

You can run the tests with the default cargo test command. To calculate the code coverage, you can use cargo-llvm-cov:

cargo llvm-cov --open --all-features --ignore-filename-regex 'tests?\.rs'

Performing a release

You need to have cargo-release installed to perform a release. Execute the following cargo command once to install it.

cargo install cargo-release

To perform a release, switch to the main branch and execute:

cargo release [LEVEL] --execute

The level should be patch, minor or major depending on the changes made in the release. Running the release command will also trigger a CI workflow to create release binaries on GitHub.

Funding

Die Forschungsergebnisse dieser Veröffentlichung wurden gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – SFB 1412, 416591334 sowie FOR 2537, 313607803, GZ LU 856/16-1.

This research was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) – SFB 1412, 416591334 and FOR 2537, 313607803, GZ LU 856/16-1.
