6 releases (3 breaking)
Version | Released |
---|---|
0.5.2 | Apr 1, 2021 |
0.5.0 | Feb 5, 2021 |
0.4.1 | Feb 3, 2021 |
0.2.2 | Aug 9, 2020 |
0.1.0 | Aug 4, 2020 |
Text span utilities for Rust and Python
- Rust doc: https://docs.rs/textspan
Usage (Python)
Install: pip install pytextspan
align_spans
def align_spans(spans: List[Tuple[int, int]], text: str, original_text: str) -> List[List[Tuple[int, int]]]: ...
Converts the spans defined in text to those defined in original_text.
This is useful, for example, when you have spans computed on a normalized text and need the corresponding spans in the original text.
>>> import textspan
>>> spans = [(0, 3), (3, 6)]
>>> text = "foobarbaz"
>>> original_text = "FOo.BåR baZ"
>>> textspan.align_spans(spans, text, original_text)
[[(0, 3)], [(4, 7)]]
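To make the normalized-text use case concrete, here is a minimal pure-Python sketch that reproduces the result above by recording, during normalization, which original character each normalized character came from. It is an illustration only and not how textspan computes alignments: the helpers normalize_with_offsets and to_original_spans are hypothetical, and the normalization rule (NFKD, drop combining marks and non-alphanumerics, lowercase) is an assumption chosen to fit this example.

```python
import unicodedata
from typing import List, Tuple

Span = Tuple[int, int]


def normalize_with_offsets(original: str) -> Tuple[str, List[int]]:
    """Hypothetical normalizer: returns the normalized text and, for each
    normalized character, the index of the original character it came from."""
    chars: List[str] = []
    origin: List[int] = []
    for i, ch in enumerate(original):
        for d in unicodedata.normalize("NFKD", ch):
            # Drop combining marks (accents) and anything that is not a letter or digit.
            if unicodedata.combining(d) or not d.isalnum():
                continue
            for low in d.lower():
                chars.append(low)
                origin.append(i)
    return "".join(chars), origin


def to_original_spans(spans: List[Span], origin: List[int]) -> List[List[Span]]:
    """Map half-open spans over the normalized text back to the original text."""
    result: List[List[Span]] = []
    for start, end in spans:
        runs: List[Span] = []
        for i in origin[start:end]:
            if runs and runs[-1][1] >= i:
                # Same or adjacent original character: extend the current run.
                runs[-1] = (runs[-1][0], i + 1)
            else:
                runs.append((i, i + 1))
        result.append(runs)
    return result


# "\u00e5" is the precomposed character "å".
text, origin = normalize_with_offsets("FOo.B\u00e5R baZ")
print(text)                                          # foobarbaz
print(to_original_spans([(0, 3), (3, 6)], origin))   # [[(0, 3)], [(4, 7)]]
```

Tracking offsets like this only works when you control the normalizer. align_spans needs only the two strings, which is what makes it convenient when the normalization is done by a third-party tool.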
align_spans_by_mapping
def align_spans_by_mapping(spans: List[Tuple[int, int]], mapping: List[List[int]]) -> List[List[Tuple[int, int]]]: ...
Converts the spans by the given mapping.
Generally speaking, the character correspondence between two texts is not necessarily surjective or injective, and may not even be a mathematical map: a character in textA may have no corresponding character in textB, or may correspond to several characters in textB. Thus, you should provide mapping as List[List[int]], where mapping[i] lists the indices in textB that correspond to character i of textA.
>>> import textspan
>>> spans = [(0, 2), (3, 4)]
>>> mapping = [[0, 1], [], [2], [4, 5, 6]]
>>> textspan.align_spans_by_mapping(spans, mapping)
[[(0, 2)], [(4, 7)]]
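The conversion in the example above can be sketched in a few lines of pure Python: collect the target indices of every character inside a span, then merge consecutive indices into half-open spans. The names group_runs and align_spans_by_mapping_sketch are hypothetical, and this is an illustrative reading of the documented behavior, not the library's implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def group_runs(indices: List[int]) -> List[Span]:
    """Merge indices into half-open spans of consecutive integers."""
    spans: List[Span] = []
    for i in sorted(set(indices)):
        if spans and spans[-1][1] == i:
            spans[-1] = (spans[-1][0], i + 1)  # extend the current run
        else:
            spans.append((i, i + 1))  # start a new run
    return spans


def align_spans_by_mapping_sketch(
    spans: List[Span], mapping: List[List[int]]
) -> List[List[Span]]:
    """Illustrative re-implementation (assumption, not the library's code)."""
    result: List[List[Span]] = []
    for start, end in spans:
        # Every target index reachable from the characters in [start, end).
        targets = [j for i in range(start, end) for j in mapping[i]]
        result.append(group_runs(targets))
    return result


print(align_spans_by_mapping_sketch([(0, 2), (3, 4)], [[0, 1], [], [2], [4, 5, 6]]))
# [[(0, 2)], [(4, 7)]]
```

A span whose characters all map to nothing yields an empty list in this sketch, which is why each input span produces a list of spans rather than a single span.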
get_original_spans
def get_original_spans(tokens: List[str], original_text: str) -> List[List[Tuple[int, int]]]: ...
Returns the span indices of original_text from the tokens based on the shortest edit script (SES).
This is useful, for example, when you have tokens produced from a normalized text and need their spans in the original text.
>>> import textspan
>>> tokens = ["foo", "bar"]
>>> textspan.get_original_spans(tokens, "FO.o  BåR")
[[(0, 2), (3, 4)], [(6, 9)]]
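The idea of aligning token characters against the original text can be approximated with the standard library. The sketch below uses difflib.SequenceMatcher, which finds longest matching blocks and is not a shortest-edit-script algorithm, so it is a stand-in for illustration and may differ from textspan on other inputs. The functions fold and get_original_spans_sketch are hypothetical, and the case and accent folding is an assumption.

```python
import difflib
import unicodedata
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def fold(ch: str) -> str:
    """Map one character to one comparable character (base letter, lowercased)."""
    base = unicodedata.normalize("NFKD", ch)[0]
    return base.lower()[0]


def get_original_spans_sketch(tokens: List[str], original_text: str) -> List[List[Span]]:
    folded = "".join(fold(c) for c in original_text)
    joined = "".join(fold(c) for token in tokens for c in token)
    matcher = difflib.SequenceMatcher(None, joined, folded, autojunk=False)

    # target[i]: index in original_text matched to the i-th token character, if any.
    target: List[Optional[int]] = [None] * len(joined)
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            target[a + k] = b + k

    result: List[List[Span]] = []
    pos = 0
    for token in tokens:
        runs: List[Span] = []
        for i in range(pos, pos + len(token)):
            j = target[i]
            if j is None:
                continue  # this token character has no counterpart
            if runs and runs[-1][1] == j:
                runs[-1] = (runs[-1][0], j + 1)  # extend the current run
            else:
                runs.append((j, j + 1))
        result.append(runs)
        pos += len(token)
    return result


# Two spaces between "o" and "B"; "\u00e5" is the precomposed character "å".
print(get_original_spans_sketch(["foo", "bar"], "FO.o  B\u00e5R"))
# [[(0, 2), (3, 4)], [(6, 9)]]
```

The token "foo" comes back as two spans because the "." in the original text interrupts it.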
lift_span_index
def lift_span_index(span: Tuple[int, int], target_spans: List[Tuple[int, int]]) -> Tuple[Tuple[int, bool], Tuple[int, bool]]: ...
Examples:
>>> import textspan
>>> spans = [(0, 3), (3, 4), (4, 9), (9, 12)]
>>> assert textspan.lift_span_index((2, 10), spans) == (0, 4)
lift_spans_index
def lift_spans_index(spans: List[Tuple[int, int]], target_spans: List[Tuple[int, int]]) -> List[Tuple[Tuple[int, bool], Tuple[int, bool]]]: ...
remove_span_overlaps
def remove_span_overlaps(tokens: List[Tuple[int, int]]) -> List[Tuple[int, int]]: ...
Removes overlapping spans from the given spans.
When two spans overlap, the one that starts first is kept. If two overlapping spans start at the same position, the longer one is kept.
>>> import textspan
>>> spans = [(0, 2), (0, 3), (2, 4), (5, 7)]
>>> assert textspan.remove_span_overlaps(spans) == [(0, 3), (5, 7)]
remove_span_overlaps_idx
def remove_span_overlaps_idx(tokens: List[Tuple[int, int]]) -> List[int]: ...
Removes overlapping spans from the given spans and returns the indices of the spans that are kept.
When two spans overlap, the one that starts first is kept. If two overlapping spans start at the same position, the longer one is kept.
>>> import textspan
>>> spans = [(0, 2), (0, 3), (2, 4), (5, 7)]
>>> assert textspan.remove_span_overlaps_idx(spans) == [1, 3]
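The selection rule described for both functions can be sketched as a sort followed by a greedy pass. This is a hypothetical re-implementation for illustration (remove_span_overlaps_idx_sketch is not part of the library), and it assumes spans are half-open, so that spans which merely touch, such as (0, 3) and (3, 5), do not count as overlapping.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def remove_span_overlaps_idx_sketch(spans: List[Span]) -> List[int]:
    """Return indices of the spans kept after removing overlaps."""
    # Earliest start first; among equal starts, the longer span first.
    order = sorted(range(len(spans)), key=lambda i: (spans[i][0], -spans[i][1]))
    kept: List[int] = []
    current_end: Optional[int] = None
    for i in order:
        start, end = spans[i]
        # Keep the span only if it begins at or after the end of the last kept span.
        if current_end is None or start >= current_end:
            kept.append(i)
            current_end = end
    return kept


spans = [(0, 2), (0, 3), (2, 4), (5, 7)]
kept = remove_span_overlaps_idx_sketch(spans)
print(kept)                       # [1, 3]
print([spans[i] for i in kept])   # [(0, 3), (5, 7)]
```

Returning indices rather than spans is handy when each span carries extra data, such as a label, stored in a parallel list.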