- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.4k
Redshift batch inserts using COPY FROM operation #25866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a2827db    to
    9187510      
    Compare
  
    9187510    to
    0fca059      
    Compare
  
    0fca059    to
    0d4c2da      
    Compare
  
    | @ebyhr Added requested documentation. We need the ENV var set on the repo for the tests to pass. 
 | 
| This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. | 
        
          
                ...rino-redshift/src/main/java/io/trino/plugin/redshift/RedshiftBatchedInsertsCopyPageSink.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a toggle to control that we enable the COPY FROM or not, otherwise we only control it on session level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the functionality is only enabled when the redshift.batched-inserts-copy-location is set.  See:
https://github.com/trinodb/trino/pull/25866/files/0d4c2da0b150293a8c3b005aa9b9260b5ee098ae#diff-17fd662125cb688e826b7ba43089b3fe313d2f28d124b8378e893f58f3d924ecR80
This is then SESSION set able back to false I believe based on how this is implemented:
https://github.com/trinodb/trino/pull/25866/files/0d4c2da0b150293a8c3b005aa9b9260b5ee098ae#diff-332ac3ff4e15ae6df7a4cbdf8aecb64cfaa8ef4ac6613861b0fba9149a613563
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should mention current implementation set the value to true by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting that we enable this to true by default or just highlight that it is.  As implemented it's actually false by default and only true if redshift.batched-inserts-copy-location is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the docs to mention that this is disabled by default.
| @chenjian2664 Thank you for having a look. I dropped some replies to your comments. What is the best way to get the additional env var added to GHA? Once that's is in, I can rebase and we can rerun the test suite. | 
0d4c2da    to
    4b04b86      
    Compare
  
    | This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. | 
| Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. | 
Description
Fixes #24546
This PR aims to allow the use of
COPY FROMstatements when sinking data into Redshift.The Redshift connector inherits
BaseJdbcConnectorwhich uses batched INSERT statements to execute sink operations. Even when using non transactional mode, this can only push about 1000 rows per second. This change stages the rows to a parquet file first, then issues aCOPY FROMstatement to load the table. We are noticing 250K rows per second or more using this method.This has been running in production for 2+ months on our own branch.
This functionality needs to be enabled by specifying the following config option:
The following options are also required when specifying the above:
A suggested IAM Policy to for this role and user:
Additional context and related issues
REDSHIFT_S3_COPY_ROOTsimilar toREDSHIFT_S3_UNLOAD_ROOTRelease notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: