#database-table #fake-data #avro #date #min-max #generate #json

bin+lib datagen

An easy to use tool to generate fake data in bulk and export it as Avro, Parquet or directly into your database as tables

4 releases

0.1.4 Nov 16, 2019
0.1.3 Jun 19, 2019
0.1.2 Jun 19, 2019
0.1.1 Jun 5, 2019
0.1.0 Jun 5, 2019

#893 in Encoding

MIT license

36KB
815 lines

DataGen

An easy-to-use tool to generate fake/dummy data in bulk and export it as Avro, CSV, or JSON files, or directly into your database as tables (coming soon!).

DataGen is a command-line application written in Rust that generates dummy data in bulk from a YAML schema.

Features

  • Export Data as Files
    • CSV
    • Avro
    • Json
    • Parquet
  • Export Data into Database
    • Postgres
    • MySQL
  • Supports Int, Long, Double, Float, String, Date, DateTime
  • Supports one_of to generate random values from a list
  • Supports min and max for numeric and date fields
  • Supports mean and std for numeric fields
  • Supports custom date formatting for Date and DateTime datatypes
  • Generate unique records by respecting the primary key attribute
  • Generate multiple datasets with PrimaryKey/ForeignKey
  • Support Richer types - Date, Map, Arrays, Nested Records

Installation

At the moment, the installation is done only through Cargo. Please install Cargo by following the instructions from https://www.rust-lang.org/tools/install.

Once Cargo is installed, you can install the binary from crates.io with:


cargo install datagen

Note: The binary is placed in <HOME_DIR>/.cargo/bin/, which the Cargo installer normally adds to your PATH. If it is not there, please add it to your PATH.
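
If the shell cannot find datagen after installation, the directory can be added for the current session along these lines. This is a minimal sketch that assumes a POSIX-compatible shell and the default Cargo location of $HOME/.cargo/bin; the variable name cargo_bin is only illustrative, and the path will differ if CARGO_HOME points elsewhere.

```shell
# Assumption: Cargo installed binaries to the default location, $HOME/.cargo/bin.
cargo_bin="$HOME/.cargo/bin"

# Add the directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$cargo_bin:"*) ;;
  *) PATH="$cargo_bin:$PATH"; export PATH ;;
esac

# Report where the shell resolves datagen, if it is installed.
command -v datagen || echo "datagen not found on PATH"
```

To make the change permanent, the same lines can go into your shell's startup file.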

Usage example

CSV

datagen csv "<output_dir>/output.csv" "<schema_yaml_dir>/schema.yaml" 100 "^"

Avro

datagen avro "<output_dir>/output.avro" "<schema_yaml_dir>/schema_simple.yaml" 100

Json

datagen json "<output_dir>/output.json" "<schema_yaml_dir>/all_examples.yaml" 100

Schema YAML

---
name: person_schema
dataset:
  name: person_table
  columns:
    - {name: id, not_null: false, dtype: int}
    - {name: name, dtype: name}
    - {name: age, dtype: age}
    - {name: adult, default: 'false', dtype: boolean}
    - {name: gender, dtype: string, one_of: ["M", "F"]}
    - {name: dob, dtype: "date", min: "01/01/1950" , max: "03/01/2014", format: "%d/%m/%Y"}
    - {name: event_date, dtype: "datetime", min: "2014-11-28 12:00:09" , max: "2014-11-30 12:00:09", format: "%Y-%m-%d %H:%M:%S"}
    - {name: score, dtype: "int", mean: 1.00, std: 0.36}
    - {name: distance, dtype: "int", min: 19000, max: 221377}
    - {name: weight, dtype: "float", min: 1.00, max: 500.00}

Date format specifiers are documented at: https://docs.rs/chrono/0.4.9/chrono/format/strftime/index.html#specifiers
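
The specifiers used in the schema above (%d, %m, %Y, %H, %M, %S) carry the same meaning in the POSIX date utility, so a format string can be previewed quickly from the shell. This is only a preview aid under that assumption: it does not call DataGen, and specifiers beyond these common ones may differ between chrono and date.

```shell
# Preview the two format strings from the schema, using the current time.
date_fmt="%d/%m/%Y"
datetime_fmt="%Y-%m-%d %H:%M:%S"

formatted_date="$(date "+$date_fmt")"
formatted_datetime="$(date "+$datetime_fmt")"

echo "$formatted_date"
echo "$formatted_datetime"
```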

An example schema YAML is located at <PROJECT_ROOT>/test_data/schema_options.yaml
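
To try the schema and the command line together, the sketch below writes a reduced schema and generates 10 CSV rows with a comma delimiter. The temporary directory and file names are illustrative. The columns are a subset of the example above, and the argument order (format, output file, schema file, row count, delimiter) follows the usage examples. Whether DataGen accepts exactly this reduced schema is an assumption.

```shell
# Hypothetical working directory; any writable directory works.
work_dir="$(mktemp -d)"
schema="$work_dir/schema.yaml"

# A reduced schema using only column options shown in the example above.
cat > "$schema" <<'EOF'
---
name: person_schema
dataset:
  name: person_table
  columns:
    - {name: id, not_null: false, dtype: int}
    - {name: gender, dtype: string, one_of: ["M", "F"]}
    - {name: dob, dtype: "date", min: "01/01/1950", max: "03/01/2014", format: "%d/%m/%Y"}
    - {name: distance, dtype: "int", min: 19000, max: 221377}
EOF

# Argument order: format, output file, schema file, row count, delimiter.
if command -v datagen >/dev/null 2>&1; then
  datagen csv "$work_dir/output.csv" "$schema" 10 ","
  head -n 3 "$work_dir/output.csv"
else
  echo "datagen is not installed; schema written to $schema"
fi
```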

Development setup

  1. Clone the repo
  2. Run cargo build
  3. Run cargo test -- --color always --nocapture
  4. Run program (& Profit!)

CSV

cargo run -- "csv" "<output_dir>/output.csv" "<schema_yaml_dir>/schema.yaml" 100 ";"

Avro

cargo run -- "avro" "<output_dir>/output.avro" "<schema_yaml_dir>/schema_simple.yaml" 100

Json

cargo run -- "json" "<output_dir>/output.json" "<schema_yaml_dir>/schema.yaml" 100

Release History

  • 0.1.0

    • Support for CSV (no headers)
    • Support for Avro (primitive types)
  • 0.1.1

    • Support for custom delimiters for CSV
  • 0.1.3

    • Json support added
    • Support for semantic types (name, date, latitude, phone etc)
  • 0.1.4

    • Supports one_of eg.

              - {name: "day_of_week", dtype: "string", one_of: ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]}
      
    • Support for min and max for numeric columns

              - {name: "age", dtype: "int", min: 1 , max: 130}
      
    • Support for Date and Datetime (along with min and max)

              - {name: "event_time", dtype: "datetime", min: "2014-11-28 12:00:09" , max: "2014-11-30 12:00:09", format: "%Y-%m-%d %H:%M:%S"}
              - {name: "dob", dtype: "date", min: "01/01/1920" , max: "03/01/2019", format: "%d/%m/%Y"}
      
    • Support for semantic types (name, date, latitude, phone etc)

Meta

Arun Manivannan – @arunma – arun@arunma.com

Distributed under the MIT license. See LICENSE for more information.

https://github.com/arunma/datagen

Contributing

You want to help out? Awesome!

  1. This is my first Rust project. If you are an experienced Rust programmer, I can't thank you enough for doing a code review.
  2. If you are interested in adding new sinks to the project, reporting bugs, adding features, or improving the docs, thank you in advance. Your efforts are very much appreciated.

Dependencies

~9MB
~153K SLoC