Porter

Porter is a data-cleaning tool that performs full data extraction from MySQL, MongoDB, and text files (CSV/TSV/JSON) and pushes the extracted records into Redis queues. It supports resumable uploads, customizable wait delays, and configurable batch sizes.

Installation

$ git clone https://github.com/zhiweio/porter.git && cd porter
$ python3 setup.py install --user

Usage

Configuration Parameters

Reader Types

Specify the data source type:

reader: [mysql|mongo|json|file|csv]

Redis Configuration

Configure Redis connection and data queue settings:

redis:
  host: Redis server address
  port: Redis port
  db: Redis database number
  password: Redis password (optional)
  key: Task name
  queue_key_prefix: Prefix for data queue names (defaults to `porter.queue.` if left empty)
  cache_key_prefix: Prefix for cache names (defaults to `porter.cache.` if left empty)
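
To make the prefix defaults concrete, here is a minimal Python sketch of how the queue and cache names could be derived from the `redis` section. It assumes the final name is simply the prefix concatenated with `key`; the function `resolve_redis_keys` is hypothetical and not part of Porter's API.

```python
DEFAULT_QUEUE_PREFIX = "porter.queue."
DEFAULT_CACHE_PREFIX = "porter.cache."


def resolve_redis_keys(redis_conf):
    """Return (queue_name, cache_name) for a parsed `redis` config section.

    Hypothetical helper: assumes names are built as prefix + key.
    """
    key = redis_conf["key"]
    # Empty or missing prefixes fall back to the documented defaults.
    queue_prefix = redis_conf.get("queue_key_prefix") or DEFAULT_QUEUE_PREFIX
    cache_prefix = redis_conf.get("cache_key_prefix") or DEFAULT_CACHE_PREFIX
    return queue_prefix + key, cache_prefix + key


print(resolve_redis_keys({"key": "task_read_from_mysql", "queue_key_prefix": ""}))
```

Under these assumptions, leaving both prefixes empty yields `porter.queue.task_read_from_mysql` and `porter.cache.task_read_from_mysql`.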

MySQL Configuration

mysql:
  host: MySQL server address
  port: MySQL port
  db: Database name
  table: Table name
  user: MySQL username
  password: MySQL password
  pk: Primary key (defaults to `id` if left empty)
  column: Columns to upload (uploads all columns if left empty)
  append_db_info: Optional, upload database and table names (true or false, defaults to false)
  appendices: Optional, additional fields to upload (refer to the template file for details)

MongoDB Configuration

mongo:
  host: MongoDB server address
  port: MongoDB port
  db: Database name
  collection: Collection name
  user: MongoDB username
  password: MongoDB password
  column: Columns to upload (uploads all columns if left empty)
  appendices: Same as above

Text File Configuration

file:
  path: File path
  delimiter: File delimiter
  header: true if the first row is a header (uploads data as JSON); false if not (uploads raw data)
  appendices: Same as above
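
The effect of the `header` setting can be illustrated with a short Python sketch. This is not Porter's reader; `encode_rows` is a hypothetical function that only mirrors the documented behavior: rows become JSON objects keyed by the header when `header` is true, and lines pass through unchanged when it is false.

```python
import csv
import io
import json


def encode_rows(text, delimiter=",", header=True):
    """Yield one payload string per row of a delimited text file (illustrative)."""
    if not header:
        # header: false -> every non-empty line is passed through as raw data.
        for line in text.splitlines():
            if line:
                yield line
        return
    # header: true -> the first row supplies the field names for each JSON object.
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        yield json.dumps(row, ensure_ascii=False)


sample = "id,name\n1,alpha\n2,beta\n"
print(list(encode_rows(sample)))                # JSON objects, header row consumed
print(list(encode_rows(sample, header=False)))  # raw lines, first line included
```

Note that with `header: false` the first line is treated as data, so a file that does contain a header row would have that row uploaded as well.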

JSON File Configuration

json:
  path: File path
  appendices: Same as above

Usage Examples

MySQL Configuration Example

---
reader: mysql

redis:
  host: "127.0.0.1"
  port: 6379
  db: 56
  password: "123456"
  key: task_read_from_mysql
  queue_key_prefix: porter.queue.
  cache_key_prefix: porter.cache.

mysql:
  host: "127.0.0.1"
  port: 3306
  db: db_economic
  table: t_macroindex
  user: test
  password: "123456"
  pk: id
  column: 

# Additional fields can be uploaded to Redis.
# The format of appendices should be ^\w+:\w+$, interpreted as key-value pairs.
# Set append_db_info to true to also upload database and table names.
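
The `^\w+:\w+$` format for appendices mentioned in the comments above can be checked before a task is run. The following is a minimal sketch, assuming each entry is a plain `key:value` string; `parse_appendices` is a hypothetical helper, not a function exported by Porter.

```python
import re

# Pattern quoted from the template comments.
APPENDIX_PATTERN = re.compile(r"^\w+:\w+$")


def parse_appendices(entries):
    """Convert `key:value` strings into a dict, rejecting malformed entries."""
    fields = {}
    for entry in entries:
        # fullmatch also rejects a trailing newline, which `$` alone would allow.
        if not APPENDIX_PATTERN.fullmatch(entry):
            raise ValueError(f"appendix {entry!r} does not match ^\\w+:\\w+$")
        key, value = entry.split(":", 1)
        fields[key] = value
    return fields


print(parse_appendices(["source:mysql", "env:test"]))
```

Entries containing spaces or quotes fail this check, since `\w` matches only letters, digits, and underscores.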

MongoDB Configuration Example

---
reader: mongo
...

mongo:
  host: "127.0.0.1"
  port: 7055
  db: iEnterprise
  collection: TLDetailData
  user: root
  password: "123456"
  column: 

JSON File Example

---
reader: json
...

json:
  path: /path/to/data/test.json
#  appendices:
#    - field_name:"hello world"

CSV File Example

---
reader: file
...

file:
  path: /path/to/data/test.csv
  delimiter: ","
  header: true
#  appendices:
#    - field_name:"hello world"

Plain Text File Example

Upload raw data without JSON formatting:

---
reader: file
...

file:
  path: /path/to/data/test.csv
  delimiter: 
  header: false
#  appendices:
#    - "hello world"

Command Usage

Usage: porter [OPTIONS] [sync|monitor|clear|new]

  A command-line tool for extracting data and loading it into Redis.

Options:
  -V, --version                   Show the version and exit.
  -f, --config-file PATH          Path to the task config file.
  -l, --limit INTEGER             Limit the number of records read from the data source per batch [default: 1000].
  --limit-scale INTEGER           Maximum Redis queue size is (limit * scale) [default: 3].
  --blocking / -B, --no-blocking  Enable blocking mode [default: True].
  -t, --time-sleep INTEGER        Time (in seconds) to wait when the queue reaches the maximum limit [default: 10].
  -C, --clean-type [status|queue|all]
                                  Type of Redis cache to clear.
  -T, --task-type [mysql|mongo|json|file|csv]
                                  Type of task template.
  -o, --output-task-file PATH     Save task template to a file.
  -v, --verbose                   Enable verbose mode (print debug information).
  --debug-file PATH               Path to a file for DEBUG logging.
  -h, --help                      Show help information and exit.
  1. sync: Synchronize data.
  2. monitor: Monitor the task status.
  3. clear: Clear the task status.
  4. new: Create a new task template.
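
The interaction between `--limit`, `--limit-scale`, `--blocking`, and `--time-sleep` can be sketched as a simple backpressure loop. This is an illustration under stated assumptions, not Porter's implementation: the names `max_queue_size` and `wait_for_capacity` are hypothetical, the queue is treated as full when its length is greater than or equal to `limit * scale`, and non-blocking mode is assumed to skip the check entirely.

```python
import time


def max_queue_size(limit=1000, limit_scale=3):
    """Upper bound on queue length: limit * scale, per the option help."""
    return limit * limit_scale


def wait_for_capacity(get_queue_length, limit=1000, limit_scale=3,
                      blocking=True, time_sleep=10, sleep=time.sleep):
    """Wait until the queue has room; return how many times we slept."""
    if not blocking:
        return 0  # assumed: --no-blocking pushes regardless of queue length
    ceiling = max_queue_size(limit, limit_scale)
    waits = 0
    while get_queue_length() >= ceiling:
        sleep(time_sleep)  # back off before checking the queue again
        waits += 1
    return waits


# Simulated queue lengths: full twice, then drained below the ceiling of 300.
lengths = iter([300, 300, 120])
print(wait_for_capacity(lambda: next(lengths), limit=100, sleep=lambda s: None))
```

In a real run, `get_queue_length` would query Redis for the queue length; it is injected here so the logic can be exercised without a server.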

Example Commands

Sync 100 records per batch, waiting whenever the queue reaches its maximum size (limit * scale):

$ porter sync -f task_template.yaml -l 100

Sync all data without blocking:

$ porter sync -f task_template.yaml --no-blocking

Sync data with verbose logging:

$ porter sync -f task_template.yaml -l 100 -v --debug-file /tmp/porter.log

Monitor the task progress:

$ porter monitor -f task_template.yaml
{
    "db": "db_economic",
    "table": "t_macroindex",
    "count": 61,
    "page": 6,
    "record": { ... }
}

Clear cache data (options: all, status, or queue):

$ porter clear -f task_template.yaml --clean-type all

Create a task configuration template (options: mysql, mongo, json, file, csv):

$ porter new -T mysql

Create a task template and save it to a file:

$ porter new -T mysql -o /tmp/mysql.yaml

Notes

Each data sync task must have a unique combination of cache_key_prefix + key. However, multiple tasks can push data to the same queue, meaning they can share the same queue_key_prefix + key.
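
The uniqueness rule can be checked across a set of task configs before they are launched. The sketch below assumes the cache name is `cache_key_prefix + key` with the documented default prefix; `find_cache_key_conflicts` and the sample task values are hypothetical.

```python
from collections import Counter


def find_cache_key_conflicts(tasks):
    """Return cache names (cache_key_prefix + key) used by more than one task."""
    names = [
        (task.get("cache_key_prefix") or "porter.cache.") + task["key"]
        for task in tasks
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Each dict stands for the `redis` section of one task config.
tasks = [
    {"key": "orders", "cache_key_prefix": "porter.cache.mysql."},
    {"key": "orders", "cache_key_prefix": "porter.cache.mongo."},
    {"key": "orders", "cache_key_prefix": "porter.cache.mysql."},
]
print(find_cache_key_conflicts(tasks))
```

Here the first and third tasks collide on `porter.cache.mysql.orders`, while sharing the same `queue_key_prefix + key` across tasks would be acceptable.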
