[New Feature] Faster Bulk-Data Loading in YugabyteDB #11765

@ymahajan

Description

Jira Link: DB-4641

Master issue tracking improvements that make it easier and faster to load large amounts of data into YugabyteDB.

Phase 1

| Status | Feature | GitHub Issue | Comments |
| --- | --- | --- | --- |
| | Faster non-transactional writes during bulk load | #7809 | Allow faster writes on the COPY command by using the session variable yb_force_non_transactional_writes. |
| | Disable transactional writes during bulk data loading for indexes | #11266 | Add a yb_disable_transactional_writes session variable to improve the latency of bulk data loading for index tables, for example when the COPY command is used, which goes into the insert write path (not delete or update). |
| | Implement async flush for the COPY command | #11628 | Currently we synchronously wait for a flush response every time we flush. Making this asynchronous reduces the time spent waiting and improves COPY performance. |
| | Speed up YSQL inserts by skipping lookup of keys being inserted | #11269 | During bulk load (for example, inserts via the COPY command), skip the lookup of the key being inserted to speed up inserts. This is similar to the upsert mode supported in YCQL. |
| | Optimize memory allocation/deallocation in bulk insert/COPY using Protobuf's arena | #11720 | When running a bulk insert/COPY command, about 15 percent of CPU time in the PostgreSQL backend is spent on memory allocation/deallocation. |
| | Improve performance by eliminating serialization to the WAL format | #11409 | When writing data to the RocksDB layer, the additional step of serializing to the WAL format is unnecessary and leads to wasted work. |
| | Tuning parameters for faster COPY performance | #12293 | Tuning parameters for faster COPY performance. |
| | Pack columns in DocDB storage format for better performance | #3520 | Packing columns into a single RocksDB entry per row, instead of one per column as we do currently, improves YSQL performance. |
| ⬜️ | Parallelize the COPY command | #11453 | Distribute the COPY operation internally across multiple workers. |
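The two session variables in the first rows above can be combined with a COPY-based load. A minimal sketch, assuming a YSQL session; the table name, file path, and the exact boolean syntax accepted by these variables are assumptions for illustration, not taken from the linked issues:

```sql
-- Hypothetical bulk-load session: relax transactional guarantees
-- for the duration of the load, then restore the default.
SET yb_disable_transactional_writes = true;   -- skip the distributed-transaction path for inserts
COPY my_table FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);
SET yb_disable_transactional_writes = false;  -- re-enable transactional writes afterwards
```

Because these settings trade transactional guarantees for throughput, they are intended for controlled bulk-load windows rather than normal operation.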

Phase 2

| Status | Feature | GitHub Issue | Comments |
| --- | --- | --- | --- |
| ⬜️ | Streaming ingest to YugabyteDB without using JDBC | | Inserting around 1 billion records per day through the streaming interface; transferring this volume of records over the JDBC interface would be inefficient. One option is to implement the Spark RDD write interface. |

Metadata

Labels

area/ycql (Yugabyte CQL, YCQL), kind/enhancement (enhancement of an existing feature), priority/medium (medium priority issue)
