bin+lib natural_syntax_ls

Natural language syntax highlighting

2 releases

0.0.1 Jul 12, 2024
0.0.0 Jul 11, 2024

MIT license

Natural Language Syntax Highlighting

Natural-Syntax-LS is a language server that highlights different parts of speech (POS) in plain text.

Installation

  1. Download libtorch v2.1 as per Rust-BERT's documentation.

    Tips.

    You can find the libtorch download URL in tch-rs's build script. The LIBTORCH environment variable should point to the torch/ directory.

    Why automatic installation does not work.

    Rust-BERT has an "automatic installation" option that uses tch-rs's build script to download libtorch. However, the resulting binary fails to run because the downloaded libtorch is not on LD_LIBRARY_PATH. Alternatively, you could statically link libtorch, but you would still have to download libtorch yourself.

  2. Install the natural_syntax_ls package with Cargo or friends to get the natural-syntax-ls binary:

    cargo install natural_syntax_ls --no-default-features

    Passing --no-default-features disables the automatic libtorch download (automatic installation).

    Why automatic installation is the default.

    Because otherwise it would be a pain to run the continuous integration.

Editor setup

✅ Neovim setup with LSPConfig

Paste the natural_syntax_ls_setup function below into your Neovim configuration and call it with your client's capabilities. See my config for an example.

The natural_syntax_ls_setup function.
local function natural_syntax_ls_setup(capabilities)
    local lspconfig = require('lspconfig')
    require('lspconfig.configs')['natural_syntax_ls'] = {
        default_config = {
            cmd = { 'natural-syntax-ls' },
            filetypes = { 'text' },
            single_file_support = true,
        },
        docs = {
            description = [[The Natural Syntax Language Server for highlighting parts of speech.]],
        },
    }
    lspconfig['natural_syntax_ls'].setup {
        -- Lua has no field shorthand: a bare `capabilities,` would be stored at index 1.
        capabilities = capabilities,
        init_options = {
            token_map_update = {
                -- Customize your POS-token mapping here. E.g.:
                --[[
                -- Disable coordinating conjunctions highlighting.
                CC = vim.NIL, -- `nil` does not work because it gets ignored.
                -- Highlight wh-determiners as enum members without any modifiers.
                WDT = { type = "enumMember" },
                -- Highlight determiners as read-only classes.
                DT = { type = "class", modifiers = { "readonly" } },
                ]]
            },
        },
    }
end

Customizations:

  • I only set the filetypes field to text, but you can enable natural-syntax-ls for other file types as well. Note, however, that the language server's semantic tokens supersede Tree-sitter highlighting by default.
  • By specifying the token_map_update field in init_options, you can customize the mapping between parts of speech and semantic tokens.
    • The default mapping is in the pos2token_bits function in semantic_tokens.rs.
    • Part of speech tags are the variants of the PartOfSpeech enum in lib.rs.
    • Token types and modifiers are variants of TokenType and TokenModifier in semantic_tokens.rs, all in camelCase.
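The update semantics described above (a null entry disables a tag, any other entry overrides or adds it) can be sketched as follows. This is a minimal illustration, not the crate's actual code: the Token struct, apply_update, the token helper, and the "default" entries are all hypothetical names and values chosen for the example. The real defaults live in pos2token_bits.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a semantic token assignment:
/// a token type plus zero or more modifiers (camelCase strings).
#[derive(Debug, Clone, PartialEq)]
struct Token {
    ty: String,
    modifiers: Vec<String>,
}

/// Small helper to build a `Token` from string slices.
fn token(ty: &str, modifiers: &[&str]) -> Token {
    Token {
        ty: ty.to_string(),
        modifiers: modifiers.iter().map(|m| m.to_string()).collect(),
    }
}

/// Applies `update` on top of `base`.
/// `None` plays the role of `vim.NIL` in the Lua config: it disables the tag.
/// `Some(token)` overrides an existing entry or adds a new one.
fn apply_update(base: &mut HashMap<String, Token>, update: HashMap<String, Option<Token>>) {
    for (tag, entry) in update {
        match entry {
            Some(t) => {
                base.insert(tag, t);
            }
            None => {
                base.remove(&tag);
            }
        }
    }
}

/// Mirrors the commented-out example in the Neovim configuration.
fn example_update() -> HashMap<String, Token> {
    // Assumed defaults, for illustration only.
    let mut map = HashMap::new();
    map.insert("CC".to_string(), token("keyword", &[]));
    map.insert("DT".to_string(), token("class", &[]));

    let mut update = HashMap::new();
    update.insert("CC".to_string(), None); // disable coordinating conjunctions
    update.insert("WDT".to_string(), Some(token("enumMember", &[])));
    update.insert("DT".to_string(), Some(token("class", &["readonly"])));

    apply_update(&mut map, update);
    map
}
```

Calling example_update() yields a mapping in which CC is no longer highlighted, WDT is newly mapped, and DT carries the readonly modifier.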

❓ Visual Studio Code and other editor setup

No official support, but community plugins are welcome.

I do not currently use VSCode or these other editors, so I do not wish to maintain plugins for them.

However, it should be straightforward to implement plugins for them since Natural-Syntax-LS implements the Language Server Protocol (LSP). So, please feel free to make a plugin yourself and create an issue for me to link it here.

Selected specification

Prediction Scheduling

Each document has at most one prediction running at a time. While a prediction is in progress, new updates are queued, and the latest update replaces any previously queued update.
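The scheduling rule can be pictured as a small per-document state machine. The sketch below is an assumption about one way to express it, not the server's actual implementation: DocSchedule, on_update, and on_finished are hypothetical names, and the document text is modeled as a plain String.

```rust
/// Hypothetical scheduling state for a single document.
#[derive(Debug, Default)]
struct DocSchedule {
    /// True while a prediction for this document is in flight.
    running: bool,
    /// The most recent update received while a prediction was running.
    pending: Option<String>,
}

impl DocSchedule {
    /// Called for every document update.
    /// Returns the text to start predicting on right away, or `None`
    /// if a prediction is already running and the update was queued.
    fn on_update(&mut self, text: String) -> Option<String> {
        if self.running {
            // Overwrites any previously queued update: only the latest survives.
            self.pending = Some(text);
            None
        } else {
            self.running = true;
            Some(text)
        }
    }

    /// Called when the in-flight prediction completes.
    /// Returns the queued text to predict on next, if there is one.
    fn on_finished(&mut self) -> Option<String> {
        let next = self.pending.take();
        // Stay in the running state only if another prediction starts now.
        self.running = next.is_some();
        next
    }
}
```

With this model, sending updates v1, v2, and v3 in quick succession starts a prediction on v1, queues v2, and then replaces it with v3. When the first prediction finishes, only v3 is predicted and v2 is never processed.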

Debugging

We use tracing-subscriber with the env-filter feature to emit logs. Please configure the log level by setting the RUST_LOG environment variable.

On macOS, you may need to set DYLD_LIBRARY_PATH to run the tests.

Future work

  • Customizing the mapping between part of speech and semantic token.
  • Support languages other than English. This simply requires a new model.
  • Incremental updates and semantic token ranges.
  • Do not overwrite Markdown/LaTeX syntax highlighting.
