#reddit #dump #search #tool #io #command #pushshift

app reddit-search

A search tool for the pushshift.io Reddit dumps

34 releases (9 breaking)

new 0.10.0 Apr 22, 2024
0.8.4 Feb 20, 2024
0.8.1 Nov 29, 2023

#231 in Command line utilities

Download history 106/week @ 2024-02-14 166/week @ 2024-02-21 178/week @ 2024-02-28 2/week @ 2024-03-06 4/week @ 2024-03-13 1/week @ 2024-03-27 4/week @ 2024-04-03 118/week @ 2024-04-10 414/week @ 2024-04-17

537 downloads per month

MIT license

32KB
602 lines

reddit-search

a tool for searching the pushshift reddit dumps written in rust. Available from crates.io via cargo install reddit-search

if you do not have cargo or rustc, please follow the steps outlined in the official rust documentation: https://www.rust-lang.org/tools/install

The dumps are available via torrent from here: https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee

usage

To see command line parameters, use reddit-search -h or --help

Sample usage commands

Basic Usage

reddit-search --input <input file path> --output <output file path> --fields <field:value> ...

Presets

Preset Name Description
en_news Subreddits focused on global and regional news and current events.
en_politics A range of subreddits covering various political discussions, humor, and memes, including general politics and specific political orientations.
en_science Subreddits dedicated to general science, scientific inquiries, and discussions on scientific advancements.
en_hate_speech Subreddits known for promoting hate speech and controversial content.
controversial Content with high levels of controversy across various themes.

Each preset is a collection of filters designed to target specific themes. Should you be interested in using this and would like additional filters to be added, do not hesitate to contact me.

Descriptions of the fields contained within reddit dumps

Note that not all data contains all of these fields (for example, a comment from 2007 would not have the "gilded" field since that system was not implemented until later.)

Boolean values are saved numerically (0 is false, 1 is true)

Field Description
archived Boolean indicating if the item is archived
id Unique identifier of the item
controversiality Boolean indicating if the item is controversial
body Text content of the item
ups Number of upvotes
score_hidden Boolean indicating if the score is hidden
edited Boolean indicating if the item has been edited
distinguished Status of the item (e.g., null, moderator)
created_utc UTC timestamp of item creation
name Another unique identifier (?)
gilded Number indicating how many times the item was gilded
score Total score of the item
subreddit_id Identifier of the subreddit
link_id Identifier of the link to the comment
author_flair_text Text of the author's flair
subreddit Name of the subreddit
retrieved_on UTC timestamp of when the item was retrieved
parent_id Identifier of the parent item
downs Number of downvotes
author_flair_css_class CSS class of the author's flair
author Name of the author

Versioning

Older versions of the program can be downloaded using version overrides with cargo. Tags are not carried over to github.

Disclaimer

"reddit-search" is an independent open-source tool designed for searching Zstandard (zst) dumps of Reddit posts and comments. It is not affiliated, associated, authorized, endorsed by, or in any way officially connected with Reddit, Inc. The use of the Reddit name and any related trademarks in this project is purely for descriptive purposes. The trademarks and product names belong to their respective owners, who are not affiliated with, do not endorse, and do not sponsor "reddit-search". This project is developed under the principles of fair use and open source collaboration.

Dependencies

~8–17MB
~190K SLoC