#pdf #document #search #command-line-tool #management #dms #fuzzy-search

app dmslite

Local and fast Document Management System CLI-tool for storing and searching PDF documents

3 releases

0.1.2 Mar 9, 2024
0.1.1 Mar 9, 2024
0.1.0 Mar 9, 2024

#1798 in Command line utilities

MIT license

29KB
520 lines

crates.io

DMSLite

DMSLite is a secure and lightweight command-line tool for document management. It provides efficient document indexing, searching, and AI-based categorization, ensuring fast performance even with large document volumes, all while keeping operations entirely local on your machine for maximum privacy and security.

Usage

If your roots bin folder is in $PATH you can type dmslite everywhere to:

  1. Consume Documents: Add documents to a specified folder to process. (E.g. with the command c)
  2. Search Documents: Use the CLI to search for documents by content, title, or creation date (fuzzy word similarity search). (E.g. with the command s followed by the search phrase)
  3. Open Documents: Open a Document found with the search with its default application right from the cli tool. (E.g. with the command o followed by the id found out by a search before)
  4. Delete Documents: Delete a Document found with the search by its id. (E.g. with the command d followed by the id found out by a prior search.)

Installation and Setup

cargo install dmslite

Prerequisites

  1. PostgreSQL database
  2. Tesseract installed in your local language
  3. Ollama setup with a local Model.
  4. pdftoppm (Installed with sudo apt install poppler-utils)
  5. xdg-open. To be able to open Docuemnts right from the terminal.

PostgreSQL Database Setup

  1. CREATE USER dmslite WITH PASSWORD 'dmslite';
  2. As a psql superuser, create a PostgreSQL databaseand Schema:
    psql -U postgres
    CREATE DATABASE dmslite OWNER dmslite;
    CREATE SCHEMA dmslite;
  1. Write your Password under src/settings.rs in the String PSQL_PASSWD
  2. As the dmslite User, create search Indices, main_table and document_content table
    CREATE EXTENSION pg_trgm;
    
    -- create indices
    CREATE INDEX idx_content_trgm ON document_content USING gin (content gin_trgm_ops);
    CREATE INDEX idx_summary_trgm ON document_content USING gin (summary    gin_trgm_ops);
    CREATE INDEX idx_buzzwords_trgm ON document_content USING gin (buzzwords gin_trgm_ops);
    
    -- create tables
    CREATE TABLE dmslite.main_table (
        id SERIAL PRIMARY KEY,
        upload_date DATE,
        filepath VARCHAR(255),
        title TEXT
    );
    
    CREATE TABLE dmslite.document_content (
        id SERIAL PRIMARY KEY,
        -- Other columns in table2
        content TEXT,
        summary TEXT,
        buzzwords TEXT,
        -- Add more columns as needed
        FOREIGN KEY (id) REFERENCES main_table(id) ON DELETE CASCADE
    );
    

Ollama Custom Models Setup

Build custom Ollama models:
For Memory restricted machines the gemma:2b Model is recommended (Default).
Otherwise Choose llama2.
Set this inside the Modelfiles as the FROM <model> Command. 
```
ollama create doc_buzzword_generator -f doc_buzzword_generator
ollama create doc_summarizer -f doc_summarizer
ollama create doc_title_generator -f doc_title_generator
```

Settings

  1. Make a folder for consumation of documents.
  2. Make a folder for indexed storage of documents.
  3. Write the two absolute folder paths to the strings CONSUME_PATH and STORAGE_PATH in the file src/settings.rs.
    They must be absolute paths starting with /home/<user>/...
  4. Set the String TESSERACT_LANG to your tesseract Language flag. (E.g. "eng" or "deu")

Uninstall/Delete

Postgres

DROP TABLE document_content;
DROP TABLE main_table;
DROP INDEX IF EXISTS idx_content_trgm;
DROP INDEX IF EXISTS idx_summary_trgm;
DROP INDEX IF EXISTS idx_buzzwords_trgm;
DROP FUNCTION fuzzy_search_document_content;

Ollama

ollama rm doc_buzzword_generator
ollama rm doc_summarizer
ollama rm doc_title_generator

Dependencies

~11–25MB
~382K SLoC