2 releases
Version | Released |
---|---|
0.1.2 | Dec 6, 2022 |
0.1.1 | Dec 3, 2022 |
0.1.0 | |
0.0.0 | |
agldt
author: Caio Geraldes caio.geraldes@usp.br
Tools for parsing treebanks from AGLDT
Basic usage
use serde_xml_rs::from_str;
use std::fs::read_to_string;
use agldt::parser::*;
fn main() {
    // Read the raw treebank XML from disk.
    let src = read_to_string("/path/to/agldt/tlg0007.tlg004.perseus-grc1.tb.xml").unwrap();
    // Preprocess the source to smooth over AGLDT's XML oddities, then parse it into a `Treebank`.
    let doc = from_str::<Treebank>(&preprocess(&src)).unwrap();
    assert_eq!(doc.count_words(), 9451);
    assert_eq!(doc.count_tokens(), 10709);
}
Description of parsing stages
Preprocessing
Pre-processes the source .xml code so that the treebank can be deserialized.
There are some oddities in the schema used in AGLDT's xml header and body that would otherwise make deserializing it into a struct quite messy.
This is kind of a bodge, but should do the trick.
Oddities
The main oddity in AGLDT's use of xml occurs inside the tag <respStmt>, where the tag <persName> might contain either a single string value or a series of tags:
<respStmt>
    <persName>Bridget Almas</persName>
    <resp>responsible for the annotation environment and cts:urn technology</resp>
    <address>Tufts University</address>
</respStmt>
<respStmt>
    <persName>
        <short>Vanessa Gorman</short>
        <name>Vanessa Gorman</name>
        <address>vbgorman@gmail.com</address>
        <uri>http://data.perseus.org/sosol/users/Vanessa%20Gorman</uri>
    </persName>
    <resp>annotator of the text</resp>
</respStmt>
To solve this oddity, we apply two regex replacements so as to move the <name> and <address> tags inside <persName>.
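As a rough illustration of how the two shapes of <persName> could be brought to a single form, here is a minimal sketch. It is not the crate's implementation: it uses plain string scanning instead of regex, the function name `wrap_bare_pers_name` is hypothetical, and it only covers the assumption that a bare-text <persName> should end up with its text inside a <name> tag (it does not touch <address>).

```rust
/// Hypothetical helper (not part of `agldt`): wraps the text of a bare
/// `<persName>text</persName>` in a `<name>` tag, leaving `<persName>`
/// elements that already contain child tags untouched.
fn wrap_bare_pers_name(src: &str) -> String {
    let open = "<persName>";
    let close = "</persName>";
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    while let Some(start) = rest.find(open) {
        let body_start = start + open.len();
        let len = match rest[body_start..].find(close) {
            Some(len) => len,
            // Unclosed tag: stop rewriting and keep the remainder as-is.
            None => break,
        };
        let body = &rest[body_start..body_start + len];
        // Copy everything up to and including the opening tag.
        out.push_str(&rest[..body_start]);
        if body.contains('<') {
            // Already a series of tags: keep unchanged.
            out.push_str(body);
        } else {
            // Single string value: wrap it in <name>.
            out.push_str("<name>");
            out.push_str(body.trim());
            out.push_str("</name>");
        }
        out.push_str(close);
        rest = &rest[body_start + len + close.len()..];
    }
    out.push_str(rest);
    out
}

fn main() {
    let bare = "<respStmt><persName>Bridget Almas</persName></respStmt>";
    let fixed = wrap_bare_pers_name(bare);
    assert_eq!(
        fixed,
        "<respStmt><persName><name>Bridget Almas</name></persName></respStmt>"
    );
    println!("{fixed}");
}
```

After a step like this, both kinds of <respStmt> expose the person's name under <persName>/<name>, so a single struct field can hold it.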
A handful of other oddities concern the use of the tags <primary>, <secondary> and <annotator> inside the tag <sentence>.
Those are also removed by the regex in the current version.
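Removing those tags can be pictured with the sketch below. It is only an approximation of the idea: the crate uses regex, while this version uses plain string scanning from the standard library. The helper name `strip_element` and the sample sentence are hypothetical, and the sketch assumes the tags carry no attributes and are never nested inside themselves.

```rust
/// Hypothetical helper (not part of `agldt`): removes every
/// `<tag>…</tag>` element, content included, from `src`.
fn strip_element(src: &str, tag: &str) -> String {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    while let Some(start) = rest.find(&open) {
        // Keep everything before the opening tag.
        out.push_str(&rest[..start]);
        match rest[start..].find(&close) {
            // Skip past the closing tag and keep scanning.
            Some(end) => rest = &rest[start + end + close.len()..],
            // No closing tag: keep the remainder untouched and stop.
            None => {
                out.push_str(&rest[start..]);
                return out;
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // Made-up sentence fragment, for illustration only.
    let src = "<sentence id=\"1\"><primary>a</primary><secondary>b</secondary><word id=\"1\"/></sentence>";
    let mut cleaned = src.to_string();
    for tag in ["primary", "secondary", "annotator"] {
        cleaned = strip_element(&cleaned, tag);
    }
    assert_eq!(cleaned, "<sentence id=\"1\"><word id=\"1\"/></sentence>");
    println!("{cleaned}");
}
```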
Finally, the head value is sometimes an empty string, which I have not yet found a way to deserialize. As 0 is not used anywhere else, I replace empty strings with "0".
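The empty-string replacement amounts to something like the sketch below. The function name `fill_empty_heads` and the sample words are hypothetical, and the sketch assumes the attribute is always written with double quotes; the crate itself may do this differently.

```rust
/// Hypothetical helper (not part of `agldt`): replaces empty `head`
/// attributes with `"0"`. The leading space in the pattern keeps other
/// attributes whose names merely end in "head" from matching.
fn fill_empty_heads(src: &str) -> String {
    src.replace(" head=\"\"", " head=\"0\"")
}

fn main() {
    // Made-up word elements, for illustration only.
    let src = r#"<word id="1" form="a" head=""/><word id="2" form="b" head="1"/>"#;
    let fixed = fill_empty_heads(src);
    assert_eq!(
        fixed,
        r#"<word id="1" form="a" head="0"/><word id="2" form="b" head="1"/>"#
    );
    println!("{fixed}");
}
```

With every head guaranteed to hold digits, the field can be declared as a plain integer instead of an optional or string type.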
Deserialization
Uses serde to deserialize the preprocessed xml into Rust structs. I did my best to keep the metadata accessible, but some fields are still missing and will be included later.