@brendanstennett
Contributor

@brendanstennett brendanstennett commented May 26, 2025

Description

Fixes #24546

This PR aims to allow the use of COPY FROM statements when sinking data into Redshift.

The Redshift connector inherits from BaseJdbcConnector, which uses batched INSERT statements to execute sink operations. Even in non-transactional mode, this can only push about 1,000 rows per second. This change stages the rows to a Parquet file first, then issues a COPY FROM statement to load the table. We are seeing 250K rows per second or more using this method.
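
As a rough illustration of the second step, the statement issued after staging could look like the output of the sketch below. This is not the code from this PR: the helper name `build_copy_statement`, the table name, and the staging prefix are hypothetical, and only the general Redshift `COPY ... FROM ... IAM_ROLE ... FORMAT AS PARQUET` form is assumed.

```python
def build_copy_statement(table: str, s3_location: str, iam_role: str) -> str:
    """Return a Redshift COPY statement that loads staged Parquet files.

    Hypothetical helper for illustration only; the connector's real
    implementation may build the statement differently.
    """
    def quote(value: str) -> str:
        # Single quotes inside SQL string literals are escaped by doubling them.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"COPY {table} "
        f"FROM {quote(s3_location)} "
        f"IAM_ROLE {quote(iam_role)} "
        "FORMAT AS PARQUET"
    )


print(build_copy_statement(
    "my_schema.my_table",                  # hypothetical target table
    "s3://my-bucket/my-prefix/query-id/",  # hypothetical staging prefix
    "arn:aws:iam::123456789000:role/redshift_iam_role",
))
```

The table identifier is interpolated as-is here, so a real implementation would need to quote it properly.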

This has been running in production for 2+ months on our own branch.

This functionality needs to be enabled by specifying the following config option:

redshift.batched-inserts-copy-location=s3://my-bucket/my-prefix

The following options are also required when specifying the above:

redshift.batched-inserts-copy-iam-role=arn:aws:iam::123456789000:role/redshift_iam_role
s3.region=region
s3.aws-access-key=KEY
s3.aws-secret-key=SECRET
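
The dependency between these options can be sketched as a small check. This is only an illustration of the rule described above (COPY FROM is off unless the copy location is set, and the other four options are then required); the function name `missing_copy_options` is hypothetical and the connector's actual validation may differ.

```python
COPY_LOCATION = "redshift.batched-inserts-copy-location"

# Options that must accompany the copy location, per the description above.
REQUIRED_WITH_COPY = (
    "redshift.batched-inserts-copy-iam-role",
    "s3.region",
    "s3.aws-access-key",
    "s3.aws-secret-key",
)


def missing_copy_options(props: dict[str, str]) -> list[str]:
    """Return the required options that are absent from a catalog config."""
    if COPY_LOCATION not in props:
        # COPY FROM stays disabled, so nothing else is required.
        return []
    return [key for key in REQUIRED_WITH_COPY if key not in props]


example = {
    COPY_LOCATION: "s3://my-bucket/my-prefix",
    "s3.region": "region",
}
print(missing_copy_options(example))
```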

A suggested IAM policy for this role and user:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket"
			],
			"Resource": "arn:aws:s3:::my-bucket"
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:PutObject",
				"s3:DeleteObject"
			],
			"Resource": "arn:aws:s3:::my-bucket/*"
		}
	]
}
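
If the bucket name differs per environment, the same policy can be generated rather than hand-edited. A minimal sketch, assuming only the structure shown above; the helper name `copy_staging_policy` is hypothetical.

```python
import json


def copy_staging_policy(bucket: str) -> dict:
    """Build the suggested policy for a given staging bucket (illustrative)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Object operations apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


print(json.dumps(copy_staging_policy("my-bucket"), indent=2))
```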

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Redshift
* Add support for Redshift COPY FROM statements for batch insert operations ({issue}`24546`)

@cla-bot cla-bot bot added the cla-signed label May 26, 2025
@github-actions github-actions bot added the redshift Redshift connector label May 26, 2025
@brendanstennett brendanstennett force-pushed the redshift-copy branch 2 times, most recently from a2827db to 9187510 Compare May 26, 2025 18:31
@ebyhr ebyhr added the needs-docs This pull request requires changes to the documentation label May 26, 2025
@brendanstennett brendanstennett changed the title [WIP] Redshift batch inserts using COPY FROM operation Redshift batch inserts using COPY FROM operation May 30, 2025
@github-actions github-actions bot added the docs label May 30, 2025
@brendanstennett
Contributor Author

@ebyhr Added the requested documentation. For the tests to pass, the repo needs a new environment variable, REDSHIFT_S3_COPY_ROOT, pointing to a location similar to the one already configured for REDSHIFT_S3_UNLOAD_ROOT.

@brendanstennett brendanstennett marked this pull request as ready for review June 2, 2025 13:19
@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Jun 23, 2025
Contributor

Do we need a toggle to control whether COPY FROM is enabled? Otherwise, we can only control it at the session level.

Contributor Author

Contributor

We should mention that the current implementation sets the value to true by default.

Contributor Author

Are you suggesting that we set this to true by default, or just that we highlight that it is? As implemented, it's actually false by default and only true if redshift.batched-inserts-copy-location is set.

Contributor Author

Updated the docs to mention that this is disabled by default.

@github-actions github-actions bot removed the stale label Jun 24, 2025
@brendanstennett
Contributor Author

@chenjian2664 Thank you for having a look. I left some replies to your comments. What is the best way to get the additional env var added to GHA? Once that's in, I can rebase and we can rerun the test suite.

@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Jul 21, 2025
@github-actions

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Aug 11, 2025

Labels

cla-signed, docs, needs-docs, redshift, stale

Development

Successfully merging this pull request may close these issues.

Support Redshift COPY for bulk loads

3 participants