@duoxoud duoxoud commented Mar 27, 2025

(Reopen #11334)
Closes #7584

This PR addresses a feature request to improve Glue schema generation. It introduces a new configuration option that lets users exclude non-current fields from the Glue schema, which reduces confusion for Athena users who primarily query current data.

In PR #3888, Glue schema generation was modified to include all historical fields. The intent was to help users recognize previously used columns and avoid reusing their names. In practice, however, this has confused users (see, for example, the issue described in #7584).

The current behaviour remains unchanged by default: the new option defaults to false (GLUE_NON_CURRENT_FIELDS_DISABLED_DEFAULT = false), so non-current fields are still included unless a user opts out.
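
For illustration only, the behaviour described above can be sketched in Python. This is not the PR's Java code: the `Field` class, the `to_glue_columns` function, and the `non_current_fields_disabled` argument are hypothetical names, and deduplicating historical fields by name is an assumption. Only the `iceberg.field.current` column parameter comes from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Iceberg schema field."""
    field_id: int
    name: str
    type: str


def _column(field, is_current):
    # Glue-style column dict; the parameter marks whether the field is current.
    return {
        "Name": field.name,
        "Type": field.type,
        "Parameters": {"iceberg.field.current": "true" if is_current else "false"},
    }


def to_glue_columns(current, historical, non_current_fields_disabled=False):
    """Build Glue-style columns from current and historical schema fields.

    With the flag left at False (the default), historical fields that are no
    longer in the current schema are appended and marked as non-current.
    With the flag set to True, only current fields are emitted.
    """
    columns = [_column(f, is_current=True) for f in current]
    if non_current_fields_disabled:
        return columns

    seen = {f.name for f in current}
    for f in historical:
        # Assumption: skip historical fields whose name is already present.
        if f.name not in seen:
            seen.add(f.name)
            columns.append(_column(f, is_current=False))
    return columns


current_fields = [Field(1, "id", "bigint")]
historical_fields = [Field(1, "id", "bigint"), Field(2, "legacy", "string")]

default_names = [c["Name"] for c in to_glue_columns(current_fields, historical_fields)]
opt_out_names = [
    c["Name"]
    for c in to_glue_columns(
        current_fields, historical_fields, non_current_fields_disabled=True
    )
]
```

Here `default_names` includes the dropped `legacy` column, while `opt_out_names` contains only `id`, which is the view an Athena user querying current data would expect.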

@github-actions github-actions bot added the AWS label Mar 27, 2025
@duoxoud duoxoud force-pushed the option-to-disable-non-current-fields-in-glue branch from e216cf9 to c65ca7f Compare March 27, 2025 11:00
@duoxoud duoxoud force-pushed the option-to-disable-non-current-fields-in-glue branch from c65ca7f to 1b83550 Compare March 27, 2025 11:11
@duoxoud duoxoud marked this pull request as ready for review March 27, 2025 11:14
@nastra nastra requested a review from jackye1995 March 27, 2025 11:49
@duoxoud duoxoud changed the title from "[AWS] Add parameter of excluding non-current fields in Glue" to "AWS: Add parameter of excluding non-current fields in Glue" Mar 27, 2025
@borjagonzal

This change fixes an issue we have been facing in our stack for some time.
Thank you for pushing this one, @duoxoud!

@xiaoxuandev
Contributor

Displaying non-current columns is intentional in Glue, as users may use Lake Formation and need to access dropped columns. Users should not rely on Glue for the latest table status; Iceberg metadata should always be considered the source of truth.

@duoxoud
Author

duoxoud commented Apr 1, 2025

> Displaying non-current columns is intentional in Glue, as users may use Lake Formation and need to access dropped columns. Users should not rely on Glue for the latest table status; Iceberg metadata should always be considered the source of truth.

Hello Xiaoyuan 👋,
Thank you for your thoughtful feedback on this PR 😃.
I completely understand your point about displaying non-current columns, and I agree there are legitimate use cases for it.
However, as discussed in issue #7584, there are important use cases where the current behavior creates challenges:

  • When integrating with Athena, Redshift Spectrum, or third-party data catalogs that don't support this behavior
  • In my specific case, we're using CDC logs to reconcile Postgres tables where we often don't control schema changes, and our end users rely exclusively on the latest Glue table status

To be clear, this PR doesn't aim to change the default behavior, but rather to add a new configuration option that would allow users to choose whether non-current columns are displayed. This provides flexibility for both use cases:

  • Users who need access to dropped columns can maintain the current behavior
  • Users who need Glue to reflect only the current schema can opt into that alternative

Would you be open to considering this approach since it preserves the existing functionality while adding an option for users with different requirements?
Thanks again for your review 🙇

@xiaoxuandev
Copy link
Contributor

@duoxoud I still believe this change isn't necessary and could break the Lake Formation (LF) integration. For your use case, could you explain why end users rely solely on the latest Glue table status? This approach isn't reliable, since the Glue schema can be modified independently while the Iceberg schema remains unchanged. Even if they do depend on Glue, it's still possible to filter out non-current columns using the iceberg.field.current column parameter. Again, this is not recommended.
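
As a rough sketch of the filtering mentioned above (not code from this PR), a consumer could drop non-current columns from a Glue table definition by checking `iceberg.field.current`. The `current_columns` helper is a hypothetical name, the input is assumed to be a dict shaped like the `Table` entry of a Glue `get_table` response, and treating columns without the parameter as current is an assumption.

```python
def current_columns(glue_table):
    """Return only the columns not marked as non-current.

    `glue_table` is assumed to look like the "Table" entry of a Glue
    get_table response: {"StorageDescriptor": {"Columns": [...]}}.
    """
    columns = (glue_table.get("StorageDescriptor") or {}).get("Columns") or []
    kept = []
    for column in columns:
        params = column.get("Parameters") or {}
        # Assumption: a missing parameter means the column is current.
        flag = str(params.get("iceberg.field.current", "true")).lower()
        if flag == "true":
            kept.append(column)
    return kept


# Hand-written example table; no AWS call is made here.
example_table = {
    "Name": "orders",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "bigint",
             "Parameters": {"iceberg.field.current": "true"}},
            {"Name": "legacy", "Type": "string",
             "Parameters": {"iceberg.field.current": "false"}},
            {"Name": "note", "Type": "string"},
        ]
    },
}

visible_names = [c["Name"] for c in current_columns(example_table)]
```

As noted in the comment above, this relies on Glue rather than on Iceberg metadata, so it only reflects whatever the Glue schema currently contains.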

@duoxoud
Copy link
Author

duoxoud commented May 7, 2025

Hello @jackye1995 👋, do you happen to have time to check this PR? Many thanks 🫡

@github-actions

github-actions bot commented Jun 7, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 7, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jun 15, 2025

Development

Successfully merging this pull request may close these issues.

AWS: provide option to hide old fields in Glue table