#json-schema #parquet #data #csv #generate #parquet-file #schema-file

bin+lib csv_generator

A tool that generates CSV or Parquet files with synthetic data based on a provided JSON schema

1 stable release

new 1.0.0 Nov 19, 2024

#2589 in Parser implementations

Download history 55/week @ 2024-11-14

55 downloads per month

MIT/Apache

20KB
312 lines

datahobbit - CSV or Parquet Generator

A Rust command-line tool that generates CSV or Parquet files with synthetic data based on a provided JSON schema. It supports custom delimiters for CSV, displays a progress bar during generation, and efficiently handles large datasets using parallel processing.

Features

  • Flexible Schema Definition: Define your data structure using a JSON schema file.
  • Synthetic Data Generation: Generates realistic data for various data types.
  • CSV and Parquet Support: Output data in CSV or Parquet format.
  • Parallel Processing: Utilizes multi-threading for fast data generation.
  • Custom Delimiters: Supports optional delimiters for CSV, defaulting to a comma.
  • Progress Indicator: Displays a progress bar during data generation.
  • Error Handling: Provides clear error messages for unsupported data types or invalid input.
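
The parallel-processing feature can be pictured with the sketch below. It is not datahobbit's actual implementation (the README does not say which threading approach the tool uses); it only shows one way to split a record count across worker threads with the standard library while keeping row order. The names `generate_rows` and `fake_integer` are hypothetical, and the value generator is a deterministic stand-in for a real random number generator.

```rust
use std::thread;

/// Deterministic stand-in for a random integer in 0..=1000, derived from the
/// row index. Hypothetical helper; a real generator would use a proper RNG.
fn fake_integer(row: u64) -> u64 {
    // SplitMix64-style bit mixing, reduced to the 0..=1000 range.
    let mut z = row.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    (z ^ (z >> 31)) % 1001
}

/// Generate `records` rows on up to `workers` threads, preserving row order.
fn generate_rows(records: u64, workers: u64) -> Vec<String> {
    let workers = workers.max(1);
    let chunk = (records + workers - 1) / workers; // ceiling division

    thread::scope(|s| {
        // Each worker builds the rows for one contiguous range of indices.
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let start = (w * chunk).min(records);
                let end = (start + chunk).min(records);
                s.spawn(move || {
                    (start..end)
                        .map(|row| format!("{},{}", row, fake_integer(row)))
                        .collect::<Vec<String>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the rows sorted by index.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<String>>()
    })
}

fn main() {
    for row in generate_rows(10, 4) {
        println!("{row}");
    }
}
```

Splitting by contiguous index ranges means no locking is needed while rows are built; the only synchronization point is the final join.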

Installation

To build and run datahobbit, you need Rust and Cargo installed on your system.

  1. Clone the Repository

    git clone https://github.com/yourusername/datahobbit.git
    cd datahobbit
    
  2. Build the Project

    cargo build --release
    

    This will create an executable in the target/release directory.

Usage

Command-Line Options

Run the executable with the following options:

USAGE:
    datahobbit [OPTIONS] <input> <output>

ARGS:
    <input>     Sets the input JSON schema file
    <output>    Sets the output file (either .csv or .parquet)

OPTIONS:
    -d, --delimiter <DELIMITER>           Sets the delimiter to use in the CSV file (default is ',')
    -h, --help                            Print help information
    -r, --records <RECORDS>               Sets the number of records to generate
        --format <FORMAT>                 Sets the output format: either "csv" or "parquet" (default is "csv")
        --max-file-size <MAX_FILE_SIZE>   Sets the maximum file size for Parquet files in bytes (default is 512 MB)
    -V, --version                         Print version information

Schema Definition

The JSON schema defines the structure of the output file, including column names and data types. Here is an example schema:

{
  "columns": [
    { "name": "id", "type": "integer" },
    { "name": "first_name", "type": "first_name" },
    { "name": "last_name", "type": "last_name" },
    { "name": "email", "type": "email" },
    { "name": "phone_number", "type": "phone_number" },
    { "name": "age", "type": "integer" },
    { "name": "bio", "type": "sentence" },
    { "name": "is_active", "type": "boolean" }
  ]
}
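
To make the shape of the schema concrete, here is a minimal in-memory model of it in Rust. This is a sketch, not the tool's own types: `Column`, `Schema`, `from_pairs`, and `header` are hypothetical names, the schema is built by hand instead of being parsed from JSON, and the header row assumes the output starts with the column names, which this README does not state.

```rust
/// One entry of the "columns" array: a column name plus the name of its type.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    // The JSON key is "type"; renamed here because `type` is a Rust keyword.
    data_type: String,
}

/// The whole schema: an ordered list of columns.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    columns: Vec<Column>,
}

impl Schema {
    /// Build a schema from (name, type) pairs, in column order.
    fn from_pairs(pairs: &[(&str, &str)]) -> Self {
        let columns = pairs
            .iter()
            .map(|(name, data_type)| Column {
                name: name.to_string(),
                data_type: data_type.to_string(),
            })
            .collect();
        Schema { columns }
    }

    /// Join the column names with the given delimiter.
    fn header(&self, delimiter: char) -> String {
        let sep = delimiter.to_string();
        let names: Vec<&str> = self.columns.iter().map(|c| c.name.as_str()).collect();
        names.join(sep.as_str())
    }
}

fn main() {
    let schema = Schema::from_pairs(&[
        ("id", "integer"),
        ("first_name", "first_name"),
        ("email", "email"),
        ("is_active", "boolean"),
    ]);
    println!("{}", schema.header(','));
    for column in &schema.columns {
        println!("{} has type {}", column.name, column.data_type);
    }
}
```

Column order in the array is preserved, so the order of entries in the schema file determines the order of columns in this model.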

Examples

Generate a CSV with Default Settings

cargo run -- schema.json output.csv --records 100000
  • Generates 100,000 records.
  • Uses the default comma delimiter.

Generate a Parquet File

cargo run -- schema.json output.parquet --records 100000 --format parquet
  • Generates 100,000 records.
  • Outputs data in Parquet format.

Generate a Parquet File with Custom Size Limit

cargo run -- input_schema.json output.parquet --records 1000000 --format parquet --max-file-size 10485760

  • Generates 1,000,000 records.
  • Outputs data in Parquet format.
  • Uses a maximum file size of 10 MB, creating additional files as needed.
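
The value 10485760 in the command above is 10 × 1024 × 1024, so "MB" is used in the binary sense here. The small sketch below does that arithmetic for `--max-file-size`; `mib_to_bytes` is a hypothetical helper, and applying the same convention to the 512 MB default is an assumption.

```rust
/// Bytes in one binary megabyte (MiB).
const BYTES_PER_MIB: u64 = 1024 * 1024;

/// Convert a size in binary megabytes to bytes, the unit --max-file-size expects.
const fn mib_to_bytes(mib: u64) -> u64 {
    mib * BYTES_PER_MIB
}

fn main() {
    // Matches the value used in the example command.
    println!("10 MB  -> --max-file-size {}", mib_to_bytes(10));
    // Assumes the documented 512 MB default uses the same binary convention.
    println!("512 MB -> --max-file-size {}", mib_to_bytes(512));
}
```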

Generate a CSV with a Custom Delimiter

cargo run -- input_schema.json output.csv --records 100000 --delimiter ';'
  • Generates 100,000 records.
  • Uses a semicolon (;) as the delimiter.
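
A custom delimiter matters most when generated values can contain it, for example a sentence that includes a semicolon. The sketch below shows one common way to build a delimited row, quoting fields in the style of RFC 4180. Whether datahobbit quotes fields this way is not documented here, so treat `escape_field` and `format_row` as hypothetical helpers.

```rust
/// Quote a field if it contains the delimiter, a double quote, or a line break.
/// Embedded double quotes are doubled, as in RFC 4180.
fn escape_field(field: &str, delimiter: char) -> String {
    let needs_quotes = field.contains(delimiter)
        || field.contains('"')
        || field.contains('\n')
        || field.contains('\r');
    if needs_quotes {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Join already-generated field values into one row using the chosen delimiter.
fn format_row(fields: &[&str], delimiter: char) -> String {
    let sep = delimiter.to_string();
    fields
        .iter()
        .map(|f| escape_field(f, delimiter))
        .collect::<Vec<String>>()
        .join(sep.as_str())
}

fn main() {
    let fields = ["1", "Ada", "likes tea; dislikes coffee"];
    println!("{}", format_row(&fields, ';'));
    println!("{}", format_row(&fields, ','));
}
```

With `;` as the delimiter the third field is quoted; with `,` it is written as-is because it contains no comma.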

Display Help Information

cargo run -- --help

Supported Data Types

The following data types are supported in the schema:

  • integer: Generates random integers between 0 and 1000.
  • float: Generates random floating-point numbers between 0.0 and 1000.0.
  • string: Generates random words.
  • boolean: Generates random boolean values (true or false).
  • name: Generates full names.
  • first_name: Generates first names.
  • last_name: Generates last names.
  • email: Generates email addresses.
  • password: Generates passwords with lengths between 8 and 16 characters.
  • sentence: Generates sentences containing 5 to 10 words.
  • phone_number: Generates phone numbers.
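
The list above, together with the promise of clear errors for unsupported types, maps naturally onto an enum with a fallible parser. The following is a sketch under that assumption and not the tool's source: `DataType` and `UnsupportedType` are hypothetical names, the error wording is invented, and matching is case-sensitive only because the README spells every type in lowercase.

```rust
use std::fmt;
use std::str::FromStr;

/// The type names listed in this README, one variant each.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Integer,
    Float,
    Text, // the schema name is "string"
    Boolean,
    Name,
    FirstName,
    LastName,
    Email,
    Password,
    Sentence,
    PhoneNumber,
}

/// Error returned for a type name that is not in the list.
#[derive(Debug, PartialEq, Eq)]
struct UnsupportedType(String);

impl fmt::Display for UnsupportedType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "unsupported data type: {}", self.0)
    }
}

impl FromStr for DataType {
    type Err = UnsupportedType;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "integer" => Ok(DataType::Integer),
            "float" => Ok(DataType::Float),
            "string" => Ok(DataType::Text),
            "boolean" => Ok(DataType::Boolean),
            "name" => Ok(DataType::Name),
            "first_name" => Ok(DataType::FirstName),
            "last_name" => Ok(DataType::LastName),
            "email" => Ok(DataType::Email),
            "password" => Ok(DataType::Password),
            "sentence" => Ok(DataType::Sentence),
            "phone_number" => Ok(DataType::PhoneNumber),
            other => Err(UnsupportedType(other.to_string())),
        }
    }
}

fn main() {
    for name in ["integer", "phone_number", "uuid"] {
        match name.parse::<DataType>() {
            Ok(data_type) => println!("{name} -> {data_type:?}"),
            Err(err) => println!("error: {err}"),
        }
    }
}
```

Validating every column's type up front, before any rows are generated, lets a schema mistake surface immediately instead of partway through a large run.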

Example Usage in Schema

Each line below is a single entry in the schema's "columns" array:

{ "name": "age", "type": "integer" }
{ "name": "description", "type": "sentence" }
{ "name": "is_verified", "type": "boolean" }

License

This project is licensed under the MIT License.


Author: Daniel Beach (dancrystalbeach@gmail.com)

Version: 1.0


Dependencies

~46MB
~882K SLoC