3 releases

0.0.3	Dec 15, 2023
0.0.2	Nov 19, 2023
0.0.1	Nov 13, 2023

Custom license

Text Summarization With TF-IDF In Rust

An implementation of an extractive text summarization system that uses the TF-IDF scores of the words in a text to rank its sentences and generate a summary.

Note: Do read the blog on TowardsDataScience.

Contents

Usage
Contributing
Useful External Resources

Usage

Usage in Rust

Compiling this project requires Rust's nightly toolchain (needed by punkt, a dependency of this project), which can be installed with rustup:

$> rustup toolchain install nightly
$> cargo new <crate-name>
$> cd <crate-name>
$> rustup override set nightly

See the official guides on nightly builds and overrides for details.

Add the dependency tfidf-text-summarizer = "0.0.1" in your project's Cargo.toml,

[package]
...

[dependencies]
tfidf-text-summarizer = "0.0.1"

and then execute cargo build to download the crate (and its dependencies punkt and rayon).

The crate provides two functions to extract a summary from a given text. Both take two parameters, text: &str and reduction_factor: f32, where text is the document to summarize and reduction_factor is the fraction of its sentences to include in the generated summary.

For instance, if reduction_factor = 0.4 and the number of sentences in text is 20, then the extracted summary will contain the top-8 (40% of 20) most-informative sentences from text.
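The arithmetic in this example can be sketched in a few lines. The helper name summary_len is hypothetical, and rounding to the nearest integer is an assumption; the crate itself may truncate or round differently.

```rust
/// Number of sentences kept for a given reduction factor.
/// Hypothetical helper: rounding to nearest is an assumption, not the crate's documented behaviour.
fn summary_len(num_sentences: usize, reduction_factor: f32) -> usize {
    // Keep the factor inside [0, 1] so the result never exceeds the input size.
    let factor = reduction_factor.clamp(0.0, 1.0);
    (num_sentences as f32 * factor).round() as usize
}

fn main() {
    // 40% of 20 sentences
    println!("{}", summary_len(20, 0.4)); // prints 8
}
```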

tfidf-text-summarizer::summarize: Computes the TF-IDF score of each word in text, then ranks each sentence by the normalized sum of the TF-IDF scores of the words it contains. The normalization factor is the number of tokens in the sentence. It returns a String containing the extracted summary.
tfidf-text-summarizer::par_summarize: Similar to summarize, but uses Rayon to parallelize some operations in the summarization pipeline. For larger texts, par_summarize outperforms summarize on a multi-core system.

use summarizer::{summarize, par_summarize};
use std::fs;

fn main() {
    // Read the document to summarize
    let text: String = fs::read_to_string("wiki.txt")
        .expect("Could not read wiki.txt");
    // Keep roughly 40% of the sentences
    let reduction_factor: f32 = 0.4;

    // Use summarize or par_summarize here
    let summary: String = summarize(text.as_str(), reduction_factor);

    println!("Summary is {}", summary);
}
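To make the ranking described for summarize more concrete, here is a minimal, standard-library-only sketch of TF-IDF sentence scoring. It is not the crate's code. The function name sketch_summarize is hypothetical, sentences are split naively on '.', '!' and '?' rather than with punkt, and the exact TF and IDF definitions (term frequency over the whole text, document frequency counted per sentence, natural logarithm) are assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercased alphanumeric tokens of a sentence.
fn tokenize(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Hypothetical sketch: rank sentences by the mean TF-IDF score of their tokens.
fn sketch_summarize(text: &str, reduction_factor: f32) -> String {
    // Naive sentence splitting (assumption: the crate uses punkt instead).
    let sentences: Vec<&str> = text
        .split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();
    let tokens: Vec<Vec<String>> = sentences.iter().map(|s| tokenize(s)).collect();
    let total_tokens: usize = tokens.iter().map(Vec::len).sum();

    // tf: occurrences in the whole text; df: number of sentences containing the word.
    let mut tf: HashMap<&str, f32> = HashMap::new();
    let mut df: HashMap<&str, f32> = HashMap::new();
    for sent in &tokens {
        for w in sent {
            *tf.entry(w.as_str()).or_insert(0.0) += 1.0;
        }
        let unique: HashSet<&str> = sent.iter().map(String::as_str).collect();
        for w in unique {
            *df.entry(w).or_insert(0.0) += 1.0;
        }
    }
    let n = sentences.len() as f32;
    let tfidf: HashMap<&str, f32> = tf
        .iter()
        .map(|(w, count)| (*w, (*count / total_tokens as f32) * (n / df[*w]).ln()))
        .collect();

    // Sentence score: sum of token scores divided by the number of tokens.
    let mut ranked: Vec<(usize, f32)> = tokens
        .iter()
        .enumerate()
        .map(|(i, sent)| {
            let sum: f32 = sent.iter().map(|w| tfidf[w.as_str()]).sum();
            (i, if sent.is_empty() { 0.0 } else { sum / sent.len() as f32 })
        })
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Keep the top-scoring sentences and restore their original order.
    let keep = ((n * reduction_factor.clamp(0.0, 1.0)).round() as usize).min(sentences.len());
    let mut chosen: Vec<usize> = ranked.iter().take(keep).map(|&(i, _)| i).collect();
    chosen.sort_unstable();
    chosen.iter().map(|&i| sentences[i]).collect::<Vec<_>>().join(" ")
}

fn main() {
    let text = "Rust is fast. Rust is safe. Cats sleep all day.";
    println!("{}", sketch_summarize(text, 0.34));
}
```

In this toy input, the words shared by the first two sentences receive a lower IDF, so the third sentence scores highest and is the one kept.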

Usage with C/C++ codebases and with a Debian package

Building an executable with GCC

Static libraries can be generated by setting crate-type = ["staticlib"] in the [lib] section of Cargo.toml. The resulting library (.a archive), together with a C header file generated with cbindgen, lets us call the summarize and par_summarize functions from C/C++ projects.
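As a rough illustration of what exporting a function from a static library involves, the sketch below wraps a Rust function in a C-compatible signature. All names here (summarize_ffi, summary_free, summarize_stub) are hypothetical and are not the crate's exported API; summarize_stub merely stands in for the real summarizer so the example is self-contained. On newer Rust editions the attribute is written #[unsafe(no_mangle)].

```rust
use std::ffi::{c_char, CStr, CString};
use std::slice;

/// Hypothetical stand-in for the real summarizer: keeps the leading sentences.
fn summarize_stub(text: &str, reduction_factor: f32) -> String {
    let sentences: Vec<&str> = text.split_inclusive('.').collect();
    let keep = (sentences.len() as f32 * reduction_factor).round() as usize;
    sentences[..keep.min(sentences.len())].concat()
}

/// C-callable wrapper (hypothetical name). The caller passes a byte buffer and its length.
/// Returns a NUL-terminated string that must be released with `summary_free`.
#[no_mangle]
pub extern "C" fn summarize_ffi(text: *const u8, length: usize, reduction_factor: f32) -> *mut c_char {
    if text.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: the caller promises `text` points to `length` readable bytes.
    let bytes = unsafe { slice::from_raw_parts(text, length) };
    let text = String::from_utf8_lossy(bytes);
    let summary = summarize_stub(&text, reduction_factor);
    match CString::new(summary) {
        Ok(s) => s.into_raw(),
        Err(_) => std::ptr::null_mut(), // summary contained an interior NUL byte
    }
}

/// Gives ownership of a string returned by `summarize_ffi` back to Rust and frees it.
#[no_mangle]
pub extern "C" fn summary_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: `ptr` must come from `CString::into_raw` in `summarize_ffi`.
        unsafe { drop(CString::from_raw(ptr)) };
    }
}

fn main() {
    let text = "One. Two. Three. Four.";
    let ptr = summarize_ffi(text.as_ptr(), text.len(), 0.5);
    let summary = unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned();
    println!("{}", summary);
    summary_free(ptr);
}
```

Returning the string through CString::into_raw and pairing it with an explicit free function is one common way to hand heap memory across the FFI boundary; whether the crate does the same is not stated in this README.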

Using the Rust-generated static library with C

Using the summarize function in C code (see examples/c for a complete example):

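#include "summarizer.h"
#include <stdlib.h>
#include <stdio.h>

int main( int argc , char** argv ) {
    if ( argc < 2 ) {
        fprintf( stderr , "Usage: %s <text-file>\n" , argv[ 0 ] ) ;
        return 1 ;
    }
    // Open the input file and determine its size
    FILE* file_ptr = fopen( argv[ 1 ] , "r" ) ;
    if ( file_ptr == NULL ) {
        perror( "Could not open file" ) ;
        return 1 ;
    }
    fseek( file_ptr , 0 , SEEK_END ) ;
    long size = ftell( file_ptr ) ;
    fseek( file_ptr , 0 , SEEK_SET ) ;
    // One extra zeroed byte keeps the buffer NUL-terminated
    char* buffer = (char*) calloc( size + 1 , sizeof( char ) ) ;
    fread( buffer , sizeof( char ) , size , file_ptr ) ;
    fclose( file_ptr ) ;
    // Summarize the text, keeping about half of its sentences
    const char* summarized_text = (const char*) summarize( buffer , size , 0.5f ) ;
    printf( "%s\n" , summarized_text ) ;
    free( buffer ) ;
    return 0 ;
}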

Building the Debian package

Following the steps in Using the Rust-generated static library with C, we can copy the C header file summarizer.h and the static library into the debian/summarizer directory:

$> cp target/x86_64-unknown-linux-gnu/release/libsummarizer.a debian/summarizer/
$> cp examples/c/summarizer.h debian/summarizer/

A. Packaging the header and library

We can now build a Debian package which will perform the following tasks after its installation on the user's system,

Copy libsummarizer.a to /usr/local/lib/
Copy summarizer.h to /usr/include/

These two steps are accomplished with the postinst script in debian/summarizer/DEBIAN/

#!/bin/bash
cp ../libsummarizer.a /usr/local/lib/
cp ../summarizer.h /usr/include/

The control file in the same directory provides information about the package:

Package: Summarizer
Version: 0.0.1
Maintainer: Shubham Panchal
Architecture: amd64
Description: A text summarizer based on TF-IDF

To build the package with dpkg-deb utility and then rename it, we can write a simple Bash script build_package.sh,

#!/bin/bash
dpkg-deb --build summarizer
mkdir -p packages
mv summarizer.deb packages/summarizer-v0.0.1-amd64.deb

To build the package, execute build_package.sh,

$> cd debian
$> bash build_package.sh

The package summarizer-v0.0.1-amd64.deb will be generated in the debian/packages directory.

Installing the Debian package

To install the Debian package, use the dpkg utility:

$> sudo dpkg -i summarizer-v0.0.1-amd64.deb

Usage in Android

We can compile the Rust code to shared libraries targeting armeabi-v7a and arm64 architectures. After installing the Android NDK package and necessary toolchains with rustup, we can compile the .so libraries. See the android module in src/lib.rs for the JNI functions.

See examples/android/README.md for more details.

Contributing

The project can be improved on the following points (taken from the blog):

The current implementation requires the nightly build of Rust only because of a single dependency, punkt. punkt is a sentence tokenizer used to determine sentence boundaries in the text, after which the other computations are made. If punkt can be built with stable Rust, the current implementation will no longer require nightly Rust.
Adding newer metrics to rank sentences, especially ones that capture inter-sentence dependencies. TF-IDF is not the most accurate scoring function and has its own limitations. Building sentence graphs and using them to score sentences can greatly enhance the overall quality of the extracted summary.
The summarizer has not been benchmarked against a known dataset. ROUGE scores (R1, R2 and RL) are frequently used to assess the quality of a generated summary against standard datasets like the New York Times dataset or the CNN/Daily Mail dataset. Measuring performance against standard benchmarks will give developers more confidence in the implementation.
Completing the Python implementation in examples/python.
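For the benchmarking point above, a minimal sketch of the simplest of these metrics, ROUGE-1 recall, might look like the following. The function names are hypothetical, tokenization is a naive lowercase split on non-alphanumeric characters, and a real evaluation would normally also report precision, F1, R2 and RL using an established ROUGE implementation.

```rust
use std::collections::HashMap;

/// Lowercased unigram counts of a text (naive tokenization, an assumption).
fn unigram_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for w in text.split(|c: char| !c.is_alphanumeric()).filter(|w| !w.is_empty()) {
        *counts.entry(w.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// ROUGE-1 recall: clipped unigram overlap divided by the number of unigrams in the reference.
fn rouge1_recall(candidate: &str, reference: &str) -> f32 {
    let cand = unigram_counts(candidate);
    let refc = unigram_counts(reference);
    let total: usize = refc.values().sum();
    if total == 0 {
        return 0.0;
    }
    // Each reference word is matched at most as often as it occurs in the candidate.
    let overlap: usize = refc
        .iter()
        .map(|(w, &n)| n.min(cand.get(w).copied().unwrap_or(0)))
        .sum();
    overlap as f32 / total as f32
}

fn main() {
    let reference = "the cat sat on the mat";
    let candidate = "the cat sat";
    // 3 of the 6 reference unigrams are covered by the candidate.
    println!("{}", rouge1_recall(candidate, reference)); // prints 0.5
}
```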

Useful External Resources
