Tictonix
Description
This crate provides functionality for working with vector representations of words (embeddings) and positional encoding. It is intended for use in NLP tasks, deep learning, and your custom projects.
This project is also the second step (step 1: the tokenizer) toward our own implementation of an LLM based on the Transformer architecture.
Provided functionality:
- Structure of Embeddings:
  - Creating a new embedding matrix by various methods, such as Gaussian, Xavier, or Uniform.
  - Constructing the resulting embedding matrix for an array of tokens (indices), and obtaining a specific embedding for a token (index).
  - Updating (replacing) the embedding for a particular token (index).
- Structure of PositionalEncoding:
  - Creating a new positional encoding matrix by various methods, such as Sinusoidal PE, Relative PE, or Rotary PE.
  - Applying positional encodings to the embedding matrix.
  - Returning a part of the positional encoding matrix for a sequence, and a particular positional encoding by its position.
- Structure of MatrixIO:
  - Saving the embedding matrix to a file, and retrieving it from the file. Available formats are .safetensors and .npy.
Update (important clarification): in this implementation, the embedding matrix has columns corresponding to tokens; that is, each column is the embedding of one token (see the sketch below).
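For intuition, here is a minimal, self-contained sketch of that column-per-token layout with Gaussian initialization. It deliberately does not use the crate's API; the dependencies (ndarray, rand, rand_distr) and all names are assumptions chosen only for illustration.

```rust
// Illustrative sketch only: not the tictonix API.
// Assumed dependencies: ndarray = "0.16", rand = "0.8", rand_distr = "0.4".
use ndarray::Array2;
use rand::thread_rng;
use rand_distr::{Distribution, Normal};

fn main() {
    let d_model = 4;  // embedding dimension (rows)
    let n_tokens = 6; // number of tokens (columns)

    // Gaussian initialization: each entry is drawn from N(0, 0.02).
    let normal = Normal::new(0.0f32, 0.02).expect("valid standard deviation");
    let mut rng = thread_rng();
    let embeddings = Array2::from_shape_fn((d_model, n_tokens), |_| normal.sample(&mut rng));

    // Column j is the embedding of token j, matching the convention described above.
    let token_id = 2;
    let token_embedding = embeddings.column(token_id);
    println!("embedding for token {token_id}: {token_embedding}");
}
```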
Installing
Add to your Cargo.toml:
[dependencies]
tictonix = "0.9.0"
Usage
See the examples for usage.
Documentation
See the documentation for the project.
Glossary
- Tokenization is the process of breaking text into separate elements called tokens. Tokens can be words, characters, sub-words, or other units, depending on the chosen tokenization method. This process is an important step in text preprocessing for Natural Language Processing (NLP) tasks.
- LLMs (large language models) are models based on deep learning architectures (e.g., Transformer) that are trained on huge amounts of textual data. They are designed to perform a wide range of natural language processing tasks, such as text generation, translation, question answering, classification, and others. LLMs are capable of generalizing knowledge and performing tasks on which they have not been explicitly trained (zero-shot or few-shot learning).
- Transformer is a neural network architecture proposed in 2017 that uses the attention mechanism to process sequences of data such as text. The main advantage of Transformer is its ability to process long sequences and take context into account regardless of the distance between elements of the sequence. This architecture is the basis for most modern LLMs (such as GPT, BERT and others).
- Embedding is a numerical (vector) representation of text data (tokens, words, phrases or sentences).
- Positional Encoding is a technique used in the Transformer architecture to convey information about the order of elements in a sequence. Since the Transformer has no built-in notion of order (unlike recurrent networks), positional encoding adds special signals to token embeddings that depend on their position in the sequence. This allows the model to take into account the order of words or other elements in the input data.
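As a concrete illustration of the positional encoding entry above, the sketch below computes the standard sinusoidal encoding (sine/cosine at geometrically spaced frequencies, as in the original Transformer paper) and adds it to an embedding matrix stored with tokens as columns. It is a sketch under assumptions, not the crate's API; ndarray is an assumed dependency.

```rust
// Illustrative sketch only: not the tictonix API. Assumed dependency: ndarray = "0.16".
use ndarray::Array2;

/// Standard sinusoidal positional encoding (Vaswani et al., 2017),
/// laid out with tokens as columns: pe[[i, pos]] encodes dimension i at position pos.
fn sinusoidal_pe(d_model: usize, n_positions: usize) -> Array2<f32> {
    Array2::from_shape_fn((d_model, n_positions), |(i, pos)| {
        // Paired dimensions (2k, 2k+1) share the same frequency.
        let k = (i / 2) as f32;
        let angle = pos as f32 / 10000f32.powf(2.0 * k / d_model as f32);
        if i % 2 == 0 { angle.sin() } else { angle.cos() }
    })
}

fn main() {
    let (d_model, seq_len) = (4, 8);

    // Token embeddings with columns as tokens (zeros here just to keep the example short).
    let mut embeddings = Array2::<f32>::zeros((d_model, seq_len));

    // Applying positional encodings to the embedding matrix is element-wise addition.
    embeddings += &sinusoidal_pe(d_model, seq_len);

    // Each column now holds embedding + positional signal for that position.
    println!("{}", embeddings.column(3));
}
```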
P.S.
Don't forget to leave a ⭐ if you found this project useful.