
bin+lib litua

4 stable releases

2.0.0 Apr 12, 2023
1.1.1 Feb 10, 2023
1.1.0 Feb 4, 2023
1.0.0 Jan 27, 2023


litua

author:
tajpulo
version:
2.0.0

Read a text document, receive its tree in Lua and manipulate it before representing it as string.

What is it about?

The input

Text documents occur in many contexts. We like them as a simple means to document ideas and concepts, and they help us communicate. But sometimes we want to transform them into other text formats or process their content. litua helps with that in a particular way.

You can write a text document like this:

In olden times when wishing still helped one, there lived a king whose daughters were all beautiful; and the youngest was so beautiful that the sun itself, which has seen so much, was astonished whenever it shone in her face.

But this text is boring. You usually care about markup. Markup consists of special instructions which annotate text:

In olden times when wishing still helped one, there lived a {bold king} whose daughters were all {italic beautiful}; and the youngest was so beautiful that the sun itself, which has seen so much, was astonished whenever it shone in her face.

In this case, the texts {bold X} and {italic Y} have a special meaning. For example, they could mean that the text is rendered in a special style (e.g. X in a bold font and Y in cursive script). In general, we define the litua input syntax in the following manner:

{element[attr1=value1][attribute2=val2] text content of element}
  • element is the name of the markup element. Its name indicates its semantics. In terms of litua, we can define the semantics ourselves.
  • attr1 and attribute2 are attributes of this markup element. They give more details about the markup element. For example, an attribute could name the font face used to represent the element. In essence, we have an attribute attr1 here which is associated with the value value1, and an attribute attribute2 which is associated with the value val2.
  • text content of element is the text affected by this markup.

And finally, I will tell you a secret: value1, val2, and text content of element need not be plain text. Each of them can contain elements itself. Thus, the following is permitted in litua input syntax:

{bold[font-face=Bullshit Sans] {italic Blockchain managed information density}}

In this sense, litua input syntax is very similar to XML (<element attr1="value1" attribute2="val2">text content of element</element>), LISP (e.g. (element :attr1 "value1" :attribute2 "val2" "text content of element")), and markup languages in general. By the way, if you literally need a { or } in your document, you can escape these semantics by writing {left-curly-brace} or {right-curly-brace} respectively instead. litua input syntax files must always be encoded in UTF-8.
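If the text of a document is produced by another program, literal braces have to be escaped before litua reads them. The following is a minimal sketch in plain Lua of how that could be done. The name escape_braces is a hypothetical helper and not part of litua:

```lua
-- Hypothetical helper (not part of litua): replace literal curly braces
-- by the escape elements described above.
local BRACE_ESCAPES = {
    ["{"] = "{left-curly-brace}",
    ["}"] = "{right-curly-brace}",
}

local function escape_braces(text)
    -- gsub scans the input once, so the braces inside the replacement
    -- strings are not escaped a second time. The parentheses drop
    -- gsub's second return value (the number of substitutions).
    return (text:gsub("[{}]", BRACE_ESCAPES))
end

print(escape_braces("set = {1, 2}"))
-- prints: set = {left-curly-brace}1, 2{right-curly-brace}
```

The helper is only meant for plain text. Applying it to text that already contains litua markup would escape the markup as well.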

Processing the document

Let us put the element-example in litua input syntax into a text document (doc.lit). Then we can invoke litua:

bash$  litua doc.lit

The output is written to a file with the extension out: doc.out. And it is super boring, because it is exactly the input:

bash$  cat doc.out
{element[attr1=value1][attribute2=val2] text content of element}

It becomes interesting once I tell you that there is a representation of this element in Lua:

local node = {
    -- the string giving the node type
    ["call"] = "element",
    -- the key-value pairs of arguments.
    -- values are sequences of strings or nodes
    ["args"] = { ["attr1"] = { [1] = "value1" }, ["attribute2"] = { [1] = "val2" } },
    -- the sequence of elements occurring in the body of a node.
    -- the items of content can be strings or nodes themselves
    ["content"] = {
        [1] = "text content of element"
    },
}

For example, node.call allows you to access the name of the markup element. node.content[1] allows you to access the string which is the first and only content member of element in Lua. Remember that in Lua, the first element in a collection type is stored at index 1 (not 0 as in the majority of programming languages).
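To get a feeling for this structure, the following sketch walks such a table and writes it back in litua input syntax. It is plain Lua and only approximates what litua does by default. The name to_litua is hypothetical, and the argument keys are sorted because a Lua table does not remember the order in which its keys were written:

```lua
-- Hypothetical helper (not part of litua): turn a node table back into
-- litua input syntax. Strings are returned unchanged.
local function to_litua(item)
    if type(item) == "string" then
        return item
    end

    local parts = { "{", item.call }

    -- collect and sort the argument keys for a deterministic output
    local keys = {}
    for key in pairs(item.args or {}) do
        keys[#keys + 1] = key
    end
    table.sort(keys)

    for _, key in ipairs(keys) do
        parts[#parts + 1] = "[" .. key .. "="
        for _, value in ipairs(item.args[key]) do
            parts[#parts + 1] = to_litua(value) -- values may be nodes
        end
        parts[#parts + 1] = "]"
    end

    if item.content and #item.content > 0 then
        parts[#parts + 1] = " "
        for _, child in ipairs(item.content) do
            parts[#parts + 1] = to_litua(child) -- children may be nodes
        end
    end

    parts[#parts + 1] = "}"
    return table.concat(parts)
end

local node = {
    call = "element",
    args = { attr1 = { "value1" }, attribute2 = { "val2" } },
    content = { "text content of element" },
}
print(to_litua(node))
-- prints: {element[attr1=value1][attribute2=val2] text content of element}
```

Because both argument values and content items are handled by the same recursive call, nested elements such as {bold {italic text}} need no special treatment.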

Now create a Lua file hooks.lua in the same directory (the name must start with hooks and must end with .lua) with the following content:

Litua.convert_node_to_string("element", function (node)
    return "The " .. tostring(node.call) .. " said: " .. tostring(node.content[1])
end)

Now let us invoke litua again:

bash$  litua doc.lit
[]
bash$  cat doc.out
The element said: text content of element

Wow, we just modified how the document is processed 😍

Hooks

In fact, we used a concept called a hook to modify the behavior. We register a hook with convert_node_to_string, and litua triggers it whenever it converts a matching node to a string. A hook is a Lua function. Let us read the Lua syntax:

Litua.convert_node_to_string("element", function (node)
    return "The " .. tostring(node.call) .. " said: " .. tostring(node.content[1])
end)
  • Litua.convert_node_to_string is a function, which is defined by litua whenever you run litua.
  • The first argument must be a string, namely "element", which tells litua when to call the second argument.
  • The second argument starts with the keyword function and ends with the keyword end. This is the hook. It is a function and takes one argument called node. It can run arbitrary code and specifically it returns a string in the end which is built from the data in the node variable. .. is the string concatenation operator in Lua and tostring is a builtin Lua function which converts any value into a string object.
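The same pattern can be applied to the {bold king} example from the beginning. The following sketch, which could be saved in a file such as hooks.lua, renders bold elements as HTML. It assumes that every item in node.content can be turned into text with tostring. The function name render_bold and the guard around the registration are choices made for this example, not requirements of litua:

```lua
-- Sketch of a hook that renders {bold ...} as an HTML <b> element.
-- Assumption: every item of node.content can be turned into text with
-- tostring (plain strings, or children that were already converted).
local function render_bold(node)
    local parts = {}
    for _, item in ipairs(node.content) do
        parts[#parts + 1] = tostring(item)
    end
    return "<b>" .. table.concat(parts) .. "</b>"
end

-- The Litua table only exists when litua runs this file. The guard
-- keeps the file loadable in a plain Lua interpreter for testing.
if Litua then
    Litua.convert_node_to_string("bold", render_bold)
end

print(render_bold({ call = "bold", args = {}, content = { "king" } }))
-- prints: <b>king</b>
```

Keeping the hook in a named local function makes it possible to try it out with a hand-written node table before running litua on a real document.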

The complete set of hooks is given here:

  • Litua.on_setup
    purpose: registers a hook which is run initially and meant to optionally initialize the Litua.global variable as you need it
    default behavior: does nothing
    hook: The hook takes no argument, and returns nil
  • Litua.modify_initial_string
    purpose: registers a hook which is run after on_setup and meant to optionally pre-process the source code of the text document
    default behavior: returns the original source code
    hook: The hook takes the source code as a string, the filepath of the source file as a string, and returns the updated source code as string
  • Litua.read_new_node
    purpose: registers a hook which is run after turning the document into a hierarchy of elements. It allows you to look at some node before modifying it
    default behavior: does nothing
    hook: The hook takes a copy of the current node, the tree depth as integer, and returns nil
  • Litua.modify_node
    purpose: registers a hook which is run after read_new_node and allows you to actually modify a node
    default behavior: returns the original node
    hook: The hook takes the current node, the tree depth as integer, and the filter name, and returns (some node or string) and nil
  • Litua.read_modified_node
    purpose: registers a hook which is run after modify_node. It allows you to look at some node after modifying it
    default behavior: does nothing
    hook: The hook takes a copy of the current node, the tree depth as integer, and returns nil
  • Litua.convert_node_to_string
    purpose: registers a hook which defines how to represent a node as a string
    default behavior: returns its original string representation in litua input syntax
    hook: The hook takes the current node, the tree depth as integer, and the filter name, and returns a string and nil
  • Litua.modify_final_string
    purpose: registers a hook which is run once the hierarchy has been converted into a string and is meant to optionally post-process the resulting text
    default behavior: returns the provided string representation
    hook: The hook takes the string representation as a string, and returns a string
  • Litua.on_teardown
    purpose: registers a hook which is run finally and meant to tear down variables in Litua.global as you need it
    default behavior: does nothing
    hook: The hook takes no argument, and returns nil

Be aware that the document always lives within one invisible top-level node called document. So if you use a document element in your input file and define a hook for the element document as well, don't be surprised about the additional invocation of this hook.
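As an illustration of a modifying hook, the following sketch replaces every {counter} element with an incrementing number. Several things here are assumptions: the element name counter is made up, Litua.modify_node is assumed to be registered like convert_node_to_string (element name first, hook second), and the counter is kept in a local variable instead of Litua.global to keep the sketch self-contained:

```lua
-- Sketch: replace every {counter} element with an incrementing number.
-- Assumptions: the element name "counter" is made up for this example,
-- and Litua.modify_node is registered like convert_node_to_string
-- (element name first, hook second).
local count = 0

local function number_counter(node, depth, filter)
    count = count + 1
    -- first return value: the replacement (here a string),
    -- second return value: nil, as described in the hook list
    return tostring(count), nil
end

-- The Litua table only exists when litua runs this file.
if Litua then
    Litua.modify_node("counter", number_counter)
end
```

With this hook loaded, each {counter} in the input should be replaced by 1, 2, 3 and so on, in the order in which litua visits the nodes.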

Examples

I highly recommend going through the examples in this order to get an idea of how to use the hooks:

  1. enumeration – replace a call with an incrementing counter
  2. replacements – first define substitution pairs and then apply them
  3. literate-programming – define documentation and code block and write them to different files
  4. markup – serialize the tree to HTML5

Why should I use it?

Litua is a simple text processing utility for text documents with a hierarchical structure. It is reminiscent of tools like XSLT, but people often complain that XSLT is too foreign to common programming languages. As an alternative, I provide litua with a parser for the litua input syntax, a mapping of data from Rust to Lua, a runtime in Lua, and a writer for text files.

How to install

This is a single static executable. It only depends on basic system libraries like pthread, math and libc. It ships the entire Lua 5.4 interpreter with the executable. I expect it to work out-of-the-box on your operating system.

How to run

Call the litua executable with -h to get information about additional arguments:

litua -h

Litua input specification

The following document defines the syntax (see also design/litua-lexer-state-diagram.jpg):

Node       = (Text | RawString | Function){0,}
Text       = (NOT the symbols "{" or "}"){1,}
RawString  = "{<" Whitespace (NOT the string Whitespace-and-">}") Whitespace ">}"
           | "{<<" Whitespace (NOT the string Whitespace-and-">>}") Whitespace ">>}"
           | "{<<<" Whitespace (NOT the string Whitespace-and-">>>}") Whitespace ">>>}"
           … continue up to 126 "<" characters
Function   = "{" Call "}"
           | "{" Call Whitespace "}"
           | "{" Call Whitespace Node "}"
           | "{" Call ( "[" Key "=" Node "]" ){1,} "}"
           | "{" Call ( "[" Key "=" Node "]" ){1,} Whitespace "}"
           | "{" Call ( "[" Key "=" Node "]" ){1,} Whitespace Node "}"

Call       = (NOT the symbols "}", "[" or "<")(NOT the symbols "[" or "<"){0,}
Key        = (NOT the symbol "="){1,}
Whitespace = any of the 25 Unicode Whitespace characters

In essence, don't use "<" or "[" in function call names, or "=" in argument keys. Keep the number of opening and closing braces balanced (though this is not enforced by the syntax).
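Since balanced braces are not enforced by the syntax, a small check before running litua can help. The following is a sketch in plain Lua. The function braces_balanced is a hypothetical helper and not part of litua, and it deliberately ignores raw strings, so braces inside raw strings are counted as well:

```lua
-- Hypothetical pre-flight check (not part of litua): report whether
-- the curly braces in a document are balanced.
-- Simplification: raw strings ({< ... >}) are not treated specially,
-- so unbalanced braces inside raw strings are reported as well.
local function braces_balanced(source)
    local depth = 0
    for brace in source:gmatch("[{}]") do
        if brace == "{" then
            depth = depth + 1
        else
            depth = depth - 1
            if depth < 0 then
                return false -- closing brace without an opening one
            end
        end
    end
    return depth == 0
end

print(braces_balanced("{bold[font-face=Sans] {italic text}}")) -- prints: true
print(braces_balanced("{bold text"))                           -- prints: false
```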

Improvements

The following parts can be improved:

  • the .parsed.expected files are not checked in the test suite, because Rust's HashMap representation is not consistent across builds
  • verify that error handling for user code works well
  • improve error reporting for syntax errors (token positions, etc.)

Source Code

The source code is available on GitHub.

License

See the LICENSE file (Hint: MIT license).

Changelog

0.9
first public release with raw strings and four examples
1.0.0
improves stdout/stderr, improved documentation, CI builds, upload to crates.io
1.1.0
bugfix third argument of modify-node hook, modify-hook may now also return strings
1.1.1
bugfix: interrupted '>' sequences inside raw string content can be used again, removed hook checks from testsuite
2.0
improved docs, require whitespace before ">" in raw strings

Issues

Please report any issues on the GitHub issues page.
