#indentation #token #tokenizer #tokenize #text-processing

indent_tokenizer

Generate tokens based on indentation

4 releases (breaking)

Uses old Rust 2015

0.4.0 Jan 29, 2018
0.3.0 Jan 28, 2018
0.2.0 Apr 22, 2017
0.1.0 Apr 17, 2017

#8 in #indentation

GPL-3.0 license

56KB
1K SLoC

Identation tokenizer

A small an simple indentation tokenizer

A similar project is indentation_flattener. In this case we flatten the input adding PUSH_INDENT, and POP_INDENT. This looks better for PEG grammars.

Usage

Add to cargo.toml [source, toml]

[dependencies]
indent_tokenizer = "0.2.0"

See example below

Modifs

0.2.0:: Removed general types (as String, u32 or usize) + Using concrete types (new types)

Indentation format

Tabs are no valid on indentation grouping.

Let's see by example.

.Simple valid input

.....
...
    ....
        ....
        ....
    ....
    ....
....
....
    ....

Indentation groups can have any number of spaces

.Valid indentation different spaces

.....               level0
  ....              level1  <--
        ....        level2
  ....              level1  <--
  ....              level1  <--
....                level0
....                level0
      ....          level1  <--

It's not a good idea to have same level with different spaces, but it's allowed when you are creating a new level.

In this example, last level1 is idented with more spaces than previus ones

.Invalid indentation

.....
...
    ....
        ....
       ....     <--  incorrect indentation
    ....        <--  correct previous ident level
    ....
....
....
    ....

In order to go back a level, the indentation has to match with the previous on this level.

As we saw in previous example, increasing level is free indentation.

.Start line indicator

|.....
    |.....
    |......
        |......

You can start lines with |, but it's optional.

.Indentation indicator is optional

.....
    |.....
     ......
     ......
        ......

Look that | is one position previous to indentation level.

It is usefull when you need to start with spaces.

.I want to start a line with spaces

.....
    | .....     <- This line starts with an space
    |  ......   <- Starting with 2 spaces
    |.....      <- starts with no spaces
     .....      <- starting with no spaces
     ...        <- starting with no spaces

.My line starts with a |

.....
    ||.....     This line starts with a `|`
    |......     This one starts with `.`

A line is empty when there are no content or it only has spaces.

.Empty lines

.....
    .....
    .....
    .....
    .....     next line is empty

    .....     next line is empty

.....
.....         next line is empty

What if I want represent empty lines?

.Representing empty lines

.....
    .....
    .....
    .....
    .....     I want new line after this line
   |

    .....     and three new lines, please
   |
   |
   |

What if I want to represent spaces at end of line?

Spaces at end of line will not be erased, therefore, you don't need to do anything about it.

But could be intesting to represent it because some editors can run trailing or just because you can visualize it.

.Representing spaces at end line

.....
    .....
    .....
    .....
    This line keeps 2 spaces and end  |
    and you know it

    Next line is properly indented and only has spaces
   |   |

In fact, you can write | at end of all lines. It will be removed.

Next strings, are equivalent.

.| it's optional at end of line

.....|
    .....|
    .....|
    .....|


.....
    .....
    .....
    .....

But I could need a pipe | at end of line

.pipe at end of line

.....
    .....
    .....
    .....
    This line ends with a pipe||

.Pitfall

|.....
.....   <- Invalid, remember, indentation mark | is previus to real indentation


|.....
 .....   <- This is OK, but not elegant


| ....   <- I want to start with an space
|.....   <- This is redundant, but more clear
 

Tokens

  • Each change of leven represent an end of token.
  • An empty line, is used to separate tokens on same level
  • A token contain lines and a list of tokens

.Tokens

This is the first token
    This is another token, because it's on a different level
        And another token
    This is also a different token

A token can contain
multiple lines
    This is another token
    with three
    lines

Empty lines can be used to
separate tokens
    This is a token,
    that continues
    here. Next empty line define
    a token division

    And this is a different one
    with a couple of lines

Identation tokenizer API

Version 0.2 removed general types as String, usize, u32...

Instead, it's created an specific type on each context.

Concrete types:: [source, rust]

#[derive(Debug, PartialEq, Copy, Clone)]
pub struct LineNum(u32);

#[derive(Debug, PartialEq, Clone, Eq)]
pub struct SLine(String);
  • LineNum to represent the line number
  • SLine to respresent the line string

Internally, the system uses more new types as NSpaces to represent number of spaces

Function to call:: [source, rust]

pub fn tokenize(input: &str) -> Result<Vec<Token>, Error> 

Token type:: [source, rust]

#[derive(Debug, PartialEq)]
pub struct Token {
    pub lines: Vec<SLine>,
    pub tokens: Vec<Token>,
}

Error type:: [source, rust]

#[derive(Debug, PartialEq)]
pub struct Error {
    pub line: LineNum,
    pub desc: String,
}

Thats all

Look into lib.rs to see the api and tests.rs to se examples

Examples

You can look into tests.rs, there are several tests.

.Complex example [source, rust]

    let tokens = tokenize("
0
    || 01a
     01b
     01c

     02a
     02b

        |020a
        ||020b

        |  021a
        |021b
1a
1b
    11a
   ||11b
    11c

    12a  ||
   |12b  ||
2a
    21a
    21b
   |
   |

")

The result will be

[source, rust]

   vec![Token {
            lines: vec![SLine::from("0")],
            tokens: vec![Token {
                            lines: vec![SLine::from("| 01a"),
                                        SLine::from("01b"),
                                        SLine::from("01c")],
                            tokens: vec![],
                        },
                        Token {
                            lines: vec![SLine::from("02a"), SLine::from("02b")],
                            tokens: vec![Token {
                                            lines: vec![SLine::from("020a"),
                                                        SLine::from("|020b")],
                                            tokens: vec![],
                                        },
                                        Token {
                                            lines: vec![SLine::from("  021a"),
                                                        SLine::from("021b")],
                                            tokens: vec![],
                                        }],
                        }],
        },
        Token {
            lines: vec![SLine::from("1a"), SLine::from("1b")],
            tokens: vec![Token {
                            lines: vec![SLine::from("11a"),
                                        SLine::from("|11b"),
                                        SLine::from("11c")],
                            tokens: vec![],
                        },
                        Token {
                            lines: vec![SLine::from("12a  |"), SLine::from("12b  |")],
                            tokens: vec![],
                        }],
        },
        Token {
            lines: vec![SLine::from("2a")],
            tokens: vec![Token {
                            lines: vec![SLine::from("21a"),
                                        SLine::from("21b"),
                                        SLine::from(""),
                                        SLine::from("")],
                            tokens: vec![],
                        }],
        }];

No runtime deps