
Add configurable insert_batch_size JDBC session property#8434

Merged
hashhar merged 1 commit into trinodb:master from starburstdata:sm/batch-size
Jul 6, 2021

Conversation


@sergey-melnychuk sergey-melnychuk commented Jun 30, 2021

Example: set session postgresql.insert_batch_size = 42;
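For context, JDBC session properties in Trino are scoped per catalog. A minimal usage sketch, assuming a catalog named postgresql (the value is illustrative):

```sql
-- session-level override of the batch size added in this PR
SET SESSION postgresql.insert_batch_size = 1000;

-- revert to the catalog default
RESET SESSION postgresql.insert_batch_size;
```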

@kokosing (Member) left a comment

Please squash the first 3 commits together. Also, a good rule is to put cleanup commits before the actual change: it makes them easier to extract into a separate PR, and when you address review comments it lowers the chance of conflicts.

@hashhar hashhar self-requested a review July 1, 2021 10:55
@hashhar (Member) left a comment

Looks good overall.

Some minor comments.

@sergey-melnychuk sergey-melnychuk changed the title from "Add configurable JDBC batchSize property" to "Add configurable insert_batch_size JDBC session property" on Jul 1, 2021
@hashhar (Member) left a comment

Looks good % a couple of nitpicks.

A suggestion about one of the tests.

@hashhar (Member) left a comment

LGTM % comments.

@wendigo Should we add a test in BaseJdbcConnectorTest which tries to set the session property and run an insert? Does that actually test anything?

@wendigo (Contributor) commented Jul 2, 2021

@hashhar I don't know whether it adds value, since we are not able to determine how many batches were executed.

@sergey-melnychuk Maybe let's add an insert test with small and large batch sizes (e.g. insert 100 rows with batch sizes of 1, 50, 64, 100 and 10000) to see how JdbcPageSink handles multi-batch and single-batch inserts.
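As a sanity check on the suggested cases, a small self-contained sketch (not the actual JdbcPageSink, just the flush-on-threshold arithmetic this PR introduces) shows how many JDBC batches executeBatch() would see for 100 rows at each suggested batch size:

```java
// Hypothetical model of the page sink's batching: a batch is flushed
// whenever `maxBatchSize` rows are pending, plus one final partial
// batch on finish().
class BatchCount {
    static int batches(int rows, int maxBatchSize) {
        int batches = 0;
        int pending = 0;
        for (int i = 0; i < rows; i++) {
            pending++;
            if (pending >= maxBatchSize) { // full batch -> flush
                batches++;
                pending = 0;
            }
        }
        if (pending > 0) { // remaining rows flushed at the end
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // 100 rows with the batch sizes suggested above
        System.out.println(BatchCount.batches(100, 1));     // 100 batches
        System.out.println(BatchCount.batches(100, 50));    // 2
        System.out.println(BatchCount.batches(100, 64));    // 2 (64 + 36)
        System.out.println(BatchCount.batches(100, 100));   // 1
        System.out.println(BatchCount.batches(100, 10000)); // 1
    }
}
```

The interesting boundary cases are a batch size smaller than, equal to, and larger than the row count, plus one that does not divide it evenly.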

@sergey-melnychuk (Author)

@hashhar @wendigo WDYT about the slightly refactored JdbcPageSink in 2f00609? I have extracted the factory methods into JdbcPageSinkProvider (the only user of the JdbcPageSink constructor) to decouple the extended initialisation logic. This provides a simple and convenient way to test whether the batch size property is respected during page appends.

Such a test could be extended to other methods of JdbcPageSink, but IMO that would be out of scope for this PR.
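A minimal sketch of the decoupling idea described above (hypothetical names, not the actual Trino classes): the provider owns the extended initialisation and hands the sink a plain value, so a unit test can construct the sink directly with any batch size.

```java
// Hypothetical simplification of the sink: it only knows its batch size.
class PageSink {
    private final int maxBatchSize;
    private int pending;
    int flushes; // exposed so a test can observe batching behaviour

    PageSink(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }

    void appendRow() {
        if (++pending >= maxBatchSize) {
            flushes++; // would call statement.executeBatch() in the real sink
            pending = 0;
        }
    }
}

// Hypothetical provider: session/config resolution lives here,
// not in the sink constructor.
class PageSinkProvider {
    PageSink createPageSink(int sessionBatchSize) {
        return new PageSink(sessionBatchSize);
    }
}
```

With this split, a test can assert on flush counts without standing up any JDBC infrastructure.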

@sergey-melnychuk sergey-melnychuk force-pushed the sm/batch-size branch 3 times, most recently from 675baa9 to 265354f Compare July 3, 2021 08:44
@kokosing (Member) left a comment

Skimmed

@sergey-melnychuk (Author)

Thanks @kokosing, I will move the refactoring and unit test out to a separate PR and then add an e2e test similar to the existing ones.

IMO such refactoring still improves SRP and separation of concerns; the lack of them seems to make testing JdbcPageSink unnecessarily complicated. But I will 'disagree and commit' anyway :)

@sergey-melnychuk (Author)

Leaving the refactoring commit out of this PR; now preparing the test using BaseJdbcConnectorTest as a blueprint.

@sergey-melnychuk sergey-melnychuk force-pushed the sm/batch-size branch 2 times, most recently from 04aaab2 to 51825fd Compare July 5, 2021 11:29
@hashhar (Member) left a comment

LGTM % comment.

Looks good to go otherwise.

@sergey-melnychuk sergey-melnychuk force-pushed the sm/batch-size branch 3 times, most recently from 2a88fd0 to d9d232f Compare July 5, 2021 15:05
@wendigo (Contributor) commented Jul 6, 2021

LGTM, please squash the commits into a single one @sergey-melnychuk

Co-authored-by: Ashhar Hasan <hashhar_dev@outlook.com>
@hashhar hashhar merged commit 70e849c into trinodb:master Jul 6, 2021
@hashhar hashhar deleted the sm/batch-size branch July 6, 2021 17:56
@hashhar hashhar added this to the 360 milestone Jul 6, 2021
@hashhar hashhar mentioned this pull request Jul 7, 2021
@findepi (Member) commented Jul 12, 2021

Looking at the code comments (there are none), it's not clear why we want to have this configurable.

As in #3775 (comment) -- we should understand why this needs to be configurable before making it so.
I didn't read the PR comments, but ideally this knowledge should be retained as a code comment.

@sergey-melnychuk (Author)

@findepi PTAL #8527
cc @wendigo

return insertBatchSize;
}

@Config("insert.batch-size")
Member:

is it applicable to CREATE TABLE AS as well?

Member:

Yes. The PageSink gets used there too. write.batch-size?

Member:

sounds good!


// between performance and pushdown capabilities
private int domainCompactionThreshold = 32;

private int insertBatchSize = 1000;
Member:

This doesn't belong in JdbcMetadataConfig, since it's not consumed in the JdbcMetadata layer.

Member:

I think this is mostly because there were no session properties contributed by BaseJdbcConfig.
Maybe this should change? WDYT?


Hi @hashhar, @findepi,

Thank you for the discussion. I'm currently working with Trino 465 (OSS) and using the JDBC connector to insert data into SQL Server.

I've observed that even after setting write_batch_size = 5000, the inserts are still performed row-by-row in SQL Server Profiler (sp_execute per row), which impacts performance significantly for large datasets.

From my analysis:

  • The property insert_batch_size was introduced in PR #8434 ("Add configurable insert_batch_size JDBC session property") to control batch size.
  • However, in open source Trino, this property is not exposed as a session property (SET SESSION insert_batch_size = ...).
  • Instead, write_batch_size is used, but it seems to be ignored or not applied effectively in some cases.

My question:
Is there a known reason why write_batch_size doesn't lead to true batched inserts (e.g., multi-row VALUES or bulk operations) when using the JDBC connector?

I'd appreciate any insights or guidance on how to achieve better batching behavior.

Thanks for your time and contributions to the project!

Member:

Hi! The property was renamed to write_batch_size, so it can be configured with SET SESSION <catalog_name>.write_batch_size = ...
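A minimal sketch of configuring the renamed property, assuming a catalog named sqlserver (the value is illustrative):

```sql
-- per-catalog session override of the write batch size
SET SESSION sqlserver.write_batch_size = 5000;
```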

Member:

write_batch_size is a generic JDBC-level property, but it doesn't guarantee that write operations are performed in bulk - that depends on how the JDBC driver is implemented. In the case of SQL Server, sp_execute could be invoked per row. Have you tried setting the bulkInsertCopy property in SQL Server - https://learn.microsoft.com/en-us/sql/connect/jdbc/using-bulk-copy-with-the-jdbc-driver?view=sql-server-ver17#sqlserverbulkcopyoptions
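For reference, the Trino SQL Server connector documents a catalog property that routes writes through the SQLServerBulkCopy API. A hedged catalog-file sketch (hostname and database name are placeholders):

```properties
# etc/catalog/sqlserver.properties -- illustrative values
connector.name=sqlserver
connection-url=jdbc:sqlserver://example.host:1433;databaseName=trino_test
# use SQLServerBulkCopy instead of per-row prepared-statement batches
sqlserver.bulk-copy-for-write=true
```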


Hi @Praveen2112,

Thank you for the clarification!

I understand that write_batch_size is a JDBC-specific property and doesn't guarantee true bulk inserts — it only controls batch size for parameterized statements.

However, I'm trying to achieve real bulk insert performance (like BULK INSERT or bcp) when inserting 1.5M rows from Trino to MSSQL.

My goal is to see in SQL Server Profiler:

BULK INSERT [trino_test].[aaaaaa].[orders] 
FROM 'virtual_stream' 
WITH (TABLOCK, BATCHSIZE = 1000)


Could you please advise what configuration changes are required in Trino 465 (Open Source) to enable this behavior?

I've already confirmed that:
  • dbs-ai-sqldev_trino-test.bulk_copy_for_write = true is set in the catalog
  • retry_policy = 'NONE' is set in the session
Despite this, I still observe row-by-row inserts via sp_execute in SQL Profiler.

Is there any additional configuration or known limitation in Open Source Trino that prevents the use of SQLServerBulkCopy API?

public static final String AGGREGATION_PUSHDOWN_ENABLED = "aggregation_pushdown_enabled";
public static final String TOPN_PUSHDOWN_ENABLED = "topn_pushdown_enabled";
public static final String DOMAIN_COMPACTION_THRESHOLD = "domain_compaction_threshold";
public static final String INSERT_BATCH_SIZE = "insert_batch_size";
Member:

Same here -- it shouldn't be in this class, since it's not consumed by the metadata layer.

8 participants