[FEATURE]: Migrate AWS IAM Instance Profiles to UC Storage Credentials #862

Closed
1 task done
Tracked by #893
nfx opened this issue Jan 30, 2024 · 0 comments · Fixed by #973
Labels
enhancement (New feature or request), feat/account-level (cross-workspace installations), migrate/external (go/uc/upgrade SYNC EXTERNAL TABLES step)

Comments

@nfx
Collaborator

nfx commented Jan 30, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

Many customers use AWS instance profiles, so we need to ensure that a matching UC storage credential exists for each instance profile.

Related issues:

Proposed Solution

  1. check all instance profiles
  2. check all storage credentials
  3. see which instance profiles have matching storage credentials
  4. report what credentials are missing
  5. prompt-confirm creation of storage credential from instance profile
  6. prompt for trust relationship between an instance profile and UC for prod environment
  7. give the user three options: generate a Terraform config, invoke the AWS CLI, or pick an existing role
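
Steps 3 and 4 above amount to a set difference between the IAM roles behind the instance profiles and the roles already covered by storage credentials. The following is only a minimal sketch of that comparison, not the implementation that was eventually merged. The `InstanceProfile` and `StorageCredential` records, the `missing_credentials` and `report` helpers, and the sample ARNs are all hypothetical names introduced for illustration; in practice the two lists would come from the workspace and from Unity Catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceProfile:
    # Hypothetical record: ARN of the instance profile and of the IAM role behind it.
    instance_profile_arn: str
    iam_role_arn: str


@dataclass(frozen=True)
class StorageCredential:
    # Hypothetical record: credential name and the IAM role it assumes.
    name: str
    role_arn: str


def missing_credentials(profiles, credentials):
    """Return instance profiles whose IAM role has no storage credential (steps 3-4)."""
    covered = {c.role_arn for c in credentials}
    return [p for p in profiles if p.iam_role_arn not in covered]


def report(missing):
    """Format one line per instance profile that still needs a storage credential."""
    return [f"missing storage credential for {p.iam_role_arn}" for p in missing]


# Sample data (made up for this sketch).
profiles = [
    InstanceProfile(
        "arn:aws:iam::123456789012:instance-profile/etl",
        "arn:aws:iam::123456789012:role/etl",
    ),
    InstanceProfile(
        "arn:aws:iam::123456789012:instance-profile/bi",
        "arn:aws:iam::123456789012:role/bi",
    ),
]
credentials = [StorageCredential("etl-cred", "arn:aws:iam::123456789012:role/etl")]

for line in report(missing_credentials(profiles, credentials)):
    print(line)
```

Matching on the role ARN rather than on a name keeps the check independent of how credentials were named; the prompts in steps 5 to 7 would then run only over the returned list.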

Additional Context

No response

@nfx nfx added enhancement New feature or request needs-triage labels Jan 30, 2024
@nfx nfx added this to UCX Jan 30, 2024
@github-project-automation github-project-automation bot moved this to Triage in UCX Jan 30, 2024
@nfx nfx added feat/account-level cross-workspace installations credentials migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step and removed needs-triage labels Jan 30, 2024
@nfx nfx closed this as completed in #973 Mar 8, 2024
nfx pushed a commit that referenced this issue Mar 8, 2024
…ls` command (#973)

## Changes
A few more things to be done
- [x] Added `load` function to `AWSResourcePermissions` to return
identified instance profiles
- [x] Added `IamRoleMigration` class under `aws/credentials.py` to
migrate the identified AWS instance profiles

### Linked issues

Resolves #862

Related PR:
- #874

### Functionality 

- [x] added relevant user documentation
- [x] added new CLI command `databricks labs ucx migrate-credentials`

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests

---------

Co-authored-by: qziyuan <[email protected]>
@github-project-automation github-project-automation bot moved this from Triage to Archive in UCX Mar 8, 2024
nfx added a commit that referenced this issue Mar 8, 2024
* Added AWS IAM roles support to `databricks labs ucx migrate-credentials` command ([#973](#973)). This commit adds AWS Identity and Access Management (IAM) role support to the `databricks labs ucx migrate-credentials` command. It resolves issue [#862](#862) and is related to pull request [#874](#874). It adds a `load` function to `AWSResourcePermissions` that returns the identified instance profiles, and an `IamRoleMigration` class under `aws/credentials.py` that migrates them. User documentation and the new CLI command `databricks labs ucx migrate-credentials` have been added, and the changes were tested manually and with unit and integration tests. New methods such as `add_uc_role_policy` and `update_uc_trust_role` support the migration of AWS IAM roles.
* Added `create-catalogs-schemas` command to prepare destination catalogs and schemas before table migration ([#1028](#1028)). The Databricks Labs Unity Catalog (UCX) tool has been updated with a new `create-catalogs-schemas` command to facilitate the creation of destination catalogs and schemas prior to table migration. This command should be executed after the `create-table-mapping` command and is designed to prepare the workspace for migrating tables to UC. Additionally, a new `CatalogSchema` class has been added to the `hive_metastore` package to manage the creation of catalogs and schemas in the Hive metastore. This new functionality simplifies the process of preparing the destination Hive metastore for table migration, reducing the likelihood of user errors and ensuring that the metastore is properly configured. Unit tests have been added to the `tests/unit/hive_metastore` directory to verify the behavior of the `CatalogSchema` class and the new `create-catalogs-schemas` command. This command is intended for use in contexts where GCP is not supported.
* Added automated upgrade option to set up cluster policy ([#1024](#1024)). This commit introduces an automated upgrade option for setting up a cluster policy for older versions of UCX, separating the cluster creation policy from install.py to installer.policy.py and adding an upgrade script for older UCX versions. A new class, `ClusterPolicyInstaller`, is added to the `policy.py` file in the `installer` package to manage the creation and update of a Databricks cluster policy for Unity Catalog Migration. This class handles creating a new cluster policy with specific configurations, extracting external Hive Metastore configurations, and updating job policies. Additionally, the commit includes refactoring, removal of library references, and a new script, v0.15.0_added_cluster_policy.py, which contains the upgrade function. The changes are tested through manual and automated testing with unit tests and integration tests. This feature is intended for software engineers working with the project.
* Added crawling for init scripts on local files to assessment workflow ([#960](#960)). This commit introduces the ability to crawl init scripts stored on local files and S3 as part of the assessment workflow, resolving issue [#9](#9)
* Added database filter for the `assessment` workflow ([#989](#989)). In this release, we have added a new configuration option, `include_databases`, to the assessment workflow which allows users to specify a list of databases to include for migration, rather than crawling all the databases in the Hive Metastore. This feature is implemented in the `TablesCrawler`, `UdfsCrawler`, `GrantsCrawler` classes and the associated functions such as `_all_databases`, `getIncludeDatabases`, `_select_databases`. These changes aim to improve efficiency and reduce unnecessary crawling, and are accompanied by modifications to existing functionality, as well as the addition of unit and integration tests. The changes have been manually tested and verified on a staging environment.
* Estimate migration effort based on assessment database ([#1008](#1008)). In this release, a new functionality has been added to estimate the migration effort for each asset in the assessment database. The estimation is presented in days and is displayed on a new estimates dashboard with a summary widget for a global estimate per object type, along with assumptions and scope for each object type. A new `query` parameter has been added to the `SimpleQuery` class to support this feature. Additional changes include the update of the `_install_viz` and `_install_query` methods, the inclusion of the `data_source_id` in the query metadata, and the addition of tests to ensure the proper functioning of the new feature. A new fixture, `mock_installation_with_jobs`, has been added to support testing of the assessment estimates dashboard.
* Explicitly write to `hive_metastore` from `crawl_tables` task ([#1021](#1021)). In this release, we have improved the clarity and specificity of our handling of the `hive_metastore` in the `crawl_tables` task. Previously, the `df.write.saveAsTable` method was used without explicitly specifying the `hive_metastore` database, which could result in ambiguity. To address this issue, we have updated the `saveAsTable` method to include the `hive_metastore` database, ensuring that tables are written to the correct location in the Hive metastore. These changes are confined to the `src/databricks/labs/ucx/hive_metastore/tables.scala` file and affect the `crawl_tables` task. While no new methods have been added, the existing `saveAsTable` method has been modified to enhance the accuracy and predictability of our interaction with the Hive metastore.
* Improved documentation for `databricks labs ucx move` command ([#1025](#1025)). The `databricks labs ucx move` command has been updated with new improvements to its documentation, providing enhanced clarity and ease of use for developers and administrators. This command facilitates the movement of UC tables/table(s) from one schema to another, either in the same or different catalog, during the table upgrade process. A significant enhancement is the preservation of the source table's permissions when moving to a new schema or catalog, maintaining the original table's access controls, simplifying the management of table permissions, and streamlining the migration process. These improvements aim to facilitate a more efficient table migration experience, ensuring that developers and administrators can effectively manage their UC tables while maintaining the desired level of access control and security.
* Updated databricks-sdk requirement from ~=0.20.0 to ~=0.21.0 ([#1030](#1030)). In this update, the `databricks-sdk` package requirement has been updated to version `~=0.21.0` from `~=0.20.0`. This new version addresses several bugs and provides enhancements, including the fix for the `get_workspace_client` method in GCP, the use of the `all-apis` scope with the external browser, and an attempt to initialize all Databricks globals. Moreover, the API's settings nesting approach has changed, which may cause compatibility issues with previous versions. Several new services and dataclasses have been added to the API, and documentation and examples have been updated accordingly. There are no updates to the `databricks-labs-blueprint` and `PyYAML` dependencies in this commit.
@nfx nfx mentioned this issue Mar 8, 2024
nfx added a commit that referenced this issue Mar 8, 2024
dmoore247 pushed a commit that referenced this issue Mar 23, 2024
…ls` command (#973)

dmoore247 pushed a commit that referenced this issue Mar 23, 2024