10 releases
0.1.9 | Nov 17, 2023 |
---|---|
0.1.8 | Nov 16, 2023 |
Telegram CSV Parser
This Rust program parses a CSV file containing data collected from Telegram. The primary goal is to extract messages sent by users.
Technical Description
The parser uses the telegram_csv_parser crate, which is based on the Pest parser generator. It follows a set of grammar rules defined in the csv.pest file. The CSV file is expected to have a specific structure where messages are identified based on the presence of "PeerUser(user_id=" in the row.
Parsing Process
- The program reads the CSV file (example_collected_data_from_telegram.csv) into memory.
- The CSVParser parses the content based on the specified grammar rules.
- For each row, it checks whether the row contains "PeerUser(user_id=", which indicates a user's message.
- If a message is found, the program increments the message count and counts the words in that message.
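The steps above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: it skips the Pest grammar, treats each line as one row, and assumes that the message text is the last comma-separated field. The names Stats and collect_stats are hypothetical. In a real run the content would come from a file, for example via std::fs::read_to_string.

```rust
/// Marker that, per the description above, identifies a user's message.
const USER_MARKER: &str = "PeerUser(user_id=";

/// Totals gathered over the whole input (hypothetical type, not the crate's API).
#[derive(Debug, Default, PartialEq)]
struct Stats {
    messages: usize,
    words: usize,
}

/// Counts rows containing the marker and the words in their last field.
/// Assumptions: one row per line, message text in the last comma-separated field.
fn collect_stats(content: &str) -> Stats {
    let mut stats = Stats::default();
    for row in content.lines() {
        if row.contains(USER_MARKER) {
            stats.messages += 1;
            // rsplit always yields at least one item, so the fallback is only a safeguard.
            let text = row.rsplit(',').next().unwrap_or("");
            stats.words += text.split_whitespace().count();
        }
    }
    stats
}

fn main() {
    // In-memory sample standing in for the CSV file contents.
    let content = "1,PeerUser(user_id=42),hello there world\n\
                   2,PeerChannel(channel_id=7),ignored row\n\
                   3,PeerUser(user_id=7),ok\n";
    let stats = collect_stats(content);
    println!("{:?}", stats);
}
```

Splitting on the last comma breaks for quoted messages that contain commas, which is one reason a grammar-based parser is the better fit for real data.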
Parsing Logic
Grammar rules for parsing the fields of a CSV file
quoted_string = { "\"" ~ (!"\"" ~ ANY)* ~ "\"" }
This rule defines a quoted string enclosed in double quotes. The first "\"" matches the opening double quote. (!"\"" ~ ANY)* matches any sequence of characters other than a double quote, capturing everything between the quotes. The final "\"" matches the closing double quote.
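To make the rule's behavior concrete, here is a hand-written Rust equivalent. It is only a sketch of what the rule accepts, not code from the crate, and the function name quoted_string is chosen here to mirror the rule.

```rust
/// Hand-written equivalent of: "\"" ~ (!"\"" ~ ANY)* ~ "\""
/// Returns the text between the quotes and the remaining input,
/// or None when the input does not start with a complete quoted string.
fn quoted_string(input: &str) -> Option<(&str, &str)> {
    // Opening double quote.
    let rest = input.strip_prefix('"')?;
    // (!"\"" ~ ANY)* stops at the next double quote, which must exist to close the string.
    let close = rest.find('"')?;
    Some((&rest[..close], &rest[close + 1..]))
}

fn main() {
    println!("{:?}", quoted_string("\"hello, world\",next"));
    println!("{:?}", quoted_string("\"unterminated"));
}
```

Because the rule stops at the first double quote it meets, a doubled quote inside a field (the common CSV escape) ends the string early: the input "a""b" yields only a as the quoted string.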
value = { quoted_string | (!"," ~ (!"\n" ~ ANY))* }
This rule defines a value, which is either a quoted string or any sequence of characters that contains neither a comma nor a newline. quoted_string is tried first. If it does not match, (!"," ~ (!"\n" ~ ANY))* consumes characters one at a time until it reaches a comma or the end of the line, so an unquoted value may be empty.
row = { value ~ ("," ~ value)* }
This rule defines a row, which consists of one or more values separated by commas. value matches the first value, and ("," ~ value)* matches zero or more occurrences of a comma followed by another value.
file = { SOI ~ (row ~ ("\r\n" | "\n"))* ~ EOI }
This rule defines a file, which starts with the start of input (SOI). (row ~ ("\r\n" | "\n"))* matches zero or more occurrences of a row followed by either a Windows-style line ending (\r\n) or a Unix-style line ending (\n). It ends with the end of input (EOI).
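The four rules can be mirrored by a small hand-written recognizer, which makes it easier to see what the grammar accepts. This is a sketch for illustration only: it is not generated by Pest, the function names simply copy the rule names, and it assumes that csv.pest defines no implicit WHITESPACE or COMMENT rule.

```rust
/// quoted_string = { "\"" ~ (!"\"" ~ ANY)* ~ "\"" }
/// Returns the number of bytes matched, quotes included.
fn quoted_string(s: &str) -> Option<usize> {
    let rest = s.strip_prefix('"')?;
    rest.find('"').map(|i| i + 2)
}

/// value = { quoted_string | (!"," ~ (!"\n" ~ ANY))* }
/// The second alternative can match nothing, so a value never fails.
fn value(s: &str) -> usize {
    quoted_string(s)
        .unwrap_or_else(|| s.find(|c: char| c == ',' || c == '\n').unwrap_or(s.len()))
}

/// row = { value ~ ("," ~ value)* }
/// Returns the values of the row and the number of bytes consumed.
fn row(s: &str) -> (Vec<&str>, usize) {
    let mut pos = value(s);
    let mut values = vec![&s[..pos]];
    while s[pos..].starts_with(',') {
        let len = value(&s[pos + 1..]);
        values.push(&s[pos + 1..pos + 1 + len]);
        pos += 1 + len;
    }
    (values, pos)
}

/// file = { SOI ~ (row ~ ("\r\n" | "\n"))* ~ EOI }
/// Returns the parsed rows, or None when the input does not match.
fn file(s: &str) -> Option<Vec<Vec<&str>>> {
    let mut rows = Vec::new();
    let mut pos = 0;
    loop {
        let (values, len) = row(&s[pos..]);
        let after = pos + len;
        // Each row must be followed by a line ending; otherwise the repetition stops.
        let newline = if s[after..].starts_with("\r\n") {
            2
        } else if s[after..].starts_with('\n') {
            1
        } else {
            break;
        };
        rows.push(values);
        pos = after + newline;
    }
    // EOI: everything must have been consumed.
    if pos == s.len() { Some(rows) } else { None }
}

fn main() {
    let sample = "id,\"hello, world\"\n1,PeerUser(user_id=42)\n";
    println!("{:?}", file(sample));
    println!("{:?}", file("a,b"));
}
```

Two consequences follow from the rules as written. The last row must end with a line ending, so input without a trailing newline is rejected. An unquoted value also consumes a carriage return, so with Windows-style line endings the final unquoted value of a row keeps a trailing \r and the line ending is matched by the "\n" alternative.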
Usage
Ensure you have Rust and Cargo installed. Then run the following commands:
cargo run -- -f name_of_the_file_to_parse.csv
Or
cargo run -- --file name_of_the_file_to_parse.csv
By default, the program creates an output file and stores the results there.
You can also specify the path of the output file where the results should be saved:
cargo run -- -f examples\example_collected_data_from_telegram.csv -o my_file_to_save.txt
Or
cargo run -- -f examples\example_collected_data_from_telegram.csv --output my_file_to_save.txt
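For readers curious how options like these can be handled, the sketch below parses -f/--file and -o/--output with the standard library only. It is an assumption-laden illustration: the crate may well use an argument-parsing library instead, the names Options and parse_args are hypothetical, and the default output path is left unspecified because the text above does not name it.

```rust
/// Parsed command-line options (hypothetical type, not the crate's API).
#[derive(Debug, PartialEq)]
struct Options {
    input: String,
    /// None means "use the program's default output file".
    output: Option<String>,
}

/// Parses the arguments that follow the program name.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Options, String> {
    let mut input = None;
    let mut output = None;
    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        match arg.as_str() {
            "-f" | "--file" => {
                input = Some(iter.next().ok_or("missing value after -f/--file")?);
            }
            "-o" | "--output" => {
                output = Some(iter.next().ok_or("missing value after -o/--output")?);
            }
            other => return Err(format!("unknown argument: {}", other)),
        }
    }
    let input = input.ok_or("the -f/--file option is required")?;
    Ok(Options { input, output })
}

fn main() {
    // Skip the program name, then parse what is left.
    match parse_args(std::env::args().skip(1)) {
        Ok(options) => println!("{:?}", options),
        Err(message) => eprintln!("error: {}", message),
    }
}
```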
If you need help, run:
cargo run -- -help
Documentation
To generate and open the documentation, run:
cargo doc --open
or use the following link.
Example
Running the following command:
Or
Or
Result: