Extract command codes and unify the checks for spark_conf, cluster_policy, init_scripts #855
Conversation
…wlers.py and let clusters, jobs, pipelines, init_scripts call those functions.
- Apply the _check_cluster_failures changes from PR #845.
- Merge changes from PR #845: move the _check_cluster_failures logic to assessment/crawlers.py, and remove the ClusterInfo logic from _check_cluster_failures, since it is tied to assessment/clusters.py and should not be involved when assessment/jobs.py calls it.
- Filter out job clusters when scanning all-purpose clusters; add _try_fetch for ClustersCrawler.
Codecov Report

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #855      +/-   ##
==========================================
+ Coverage   85.67%   85.86%   +0.19%
==========================================
  Files          42       42
  Lines        5311     5335      +24
  Branches      969      968       -1
==========================================
+ Hits         4550     4581      +31
+ Misses        542      537       -5
+ Partials      219      217       -2

☔ View full report in Codecov by Sentry.
Don't import private functions from modules. Fix the bugs related to union types; you've effectively re-introduced a bug in this PR. I'm fine with getting rid of mixins, but then make the top-level functions in `crawlers.py` type-annotated. I'd still prefer those functions to live in domain-grouped modules, e.g. all cluster-related functions and classes in `clusters.py`, all job-related functions in `jobs.py`, etc.
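For illustration only, roughly what a type-annotated, top-level helper like that could look like; the body and the failure message are placeholders, not the actual `crawlers.py` code:

```python
from databricks.sdk.service.compute import ClusterDetails


def check_cluster_failures(cluster: ClusterDetails, source: str) -> list[str]:
    """Top-level, fully annotated check shared by clusters, jobs and pipelines (sketch)."""
    failures: list[str] = []
    if cluster.spark_conf is None:
        return failures
    # Placeholder check: flag Azure service principal credentials in the Spark conf.
    if any(key.startswith("fs.azure.account") for key in cluster.spark_conf):
        failures.append(f"Azure service principal credentials configured in {source}.")
    return failures
```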
@@ -92,3 +96,72 @@ def spark_version_compatibility(spark_version: str | None) -> str:
    if (10, 0) <= version < (11, 3):
        return "kinda works"
    return "supported"


def _check_spark_conf(conf: dict[str, str], source) -> list[str]:
These functions are private (their names start with `_`). You cannot export private methods from a module.
Moved those functions back to the cluster mixin. I made `check_cluster_failures` and `check_spark_conf` public functions now, since the former is used by `jobs.py` and the latter by `pipelines.py` as well.
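A rough sketch of the mixin shape described here; the method bodies are placeholders and the exact signatures in the code may differ:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterDetails


class ClustersMixin:
    _ws: WorkspaceClient

    def check_spark_conf(self, conf: dict[str, str], source: str) -> list[str]:
        """Public so that pipelines.py can reuse it for pipeline Spark configs."""
        failures: list[str] = []
        # ... shared spark_conf checks (Azure SP credentials, etc.) would go here ...
        return failures

    def check_cluster_failures(self, cluster: ClusterDetails, source: str) -> list[str]:
        """Public so that jobs.py can reuse it for job clusters."""
        failures: list[str] = []
        if cluster.spark_conf is not None:
            failures.extend(self.check_spark_conf(cluster.spark_conf, source))
        return failures
```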
    return None


def _check_cluster_policy(ws: WorkspaceClient, cluster, source):
Let's add type annotations to the top-level members if we really don't want mixins.
Fixed; type annotations were added even though we went back to the mixin.
    return failures


def _check_cluster_failures(ws: WorkspaceClient, cluster: ClusterDetails | compute.ClusterSpec, source):
Don't use union types; they have already caused a lot of bugs. Convert the cluster spec into cluster details instead:
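A hedged sketch of one way to do that conversion, relying on the Databricks SDK dataclasses' `as_dict()`/`from_dict()` helpers (the function name here is illustrative, not the project's code):

```python
from databricks.sdk.service.compute import ClusterDetails, ClusterSpec


def as_cluster_details(spec: ClusterSpec) -> ClusterDetails:
    # Round-trip through the dict representation instead of accepting a union type.
    return ClusterDetails.from_dict(spec.as_dict())
```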
Fixed; it was a miss when merging with #845.
    return failures


def _safe_get_cluster_policy(ws: WorkspaceClient, policy_id: str) -> Policy | None:
So, a logical question: do we want diamond-shaped dependencies or not? E.g. `_safe_get_cluster_policy`: where should it live, in `crawlers.py` or in `clusters.py`? And what about `_check_cluster_failures`?
flowchart TD
assessment --> crawlers
crawlers --> clusters
crawlers --> jobs
crawlers --> init_scripts
crawlers --> pipelines
crawlers --> azure_spns
clusters --> runtime.py
jobs --> runtime.py
init_scripts --> runtime.py
pipelines --> runtime.py
azure_spns --> runtime.py
or something like this:
flowchart TD
assessment --> crawlers
azure_spns -->|mixin| clusters
crawlers --> clusters
clusters -->|mixin| jobs
crawlers --> jobs
azure_spns -->|mixin| init_scripts
crawlers --> init_scripts
jobs -->|mixin| pipelines
crawlers --> pipelines
crawlers --> azure_spns
clusters --> runtime.py
jobs --> runtime.py
init_scripts --> runtime.py
pipelines --> runtime.py
azure_spns --> runtime.py
I'd like to go with the first dependency structure; it's clearer to me. As I understand it, `clusters.py` crawls and checks all the all-purpose clusters; `jobs.py` crawls and checks all jobs (so far it only checks job clusters, but we may need to check the job code in the future); and `pipelines.py` crawls and checks all DLT pipelines (right now it only checks the pipeline config, but it should check the pipeline cluster as well).
It's cleaner for them to pick up the `_check_cluster_failures` function (which checks the Spark conf, init scripts, and cluster policy) from `crawlers.py`, instead of letting jobs and pipelines inherit `_check_cluster_failures` from the clusters module.
Some logic should stay in the domain-grouped modules, because it is not commonly shared across crawlers:
- checking the all-purpose cluster access mode (we don't have this logic yet, but may in the future): the cluster may need to be put into shared mode, which has more limitations, but this check does not apply to job clusters;
- checking the job code (we don't have this logic yet, but may in the future).
    return failures


def _check_cluster_init_script(ws: WorkspaceClient, init_scripts, source):
If we're not doing mixins, then why is this function defined so far away from `_get_init_script_data`?
Moved them to the mixin and put these two functions next to each other.
from databricks.sdk.service import compute
from databricks.sdk.service.compute import ClusterDetails, Policy
Line 9 is redundant because you've imported the whole `compute` package. Use one style of imports; don't mix them.
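In other words (an illustrative snippet, not the project's code), pick one of the two styles:

```python
# Either import the package once and qualify the names...
from databricks.sdk.service import compute

policy: compute.Policy | None = None
cluster: compute.ClusterDetails | None = None

# ...or import the individual names directly, but don't combine both styles:
# from databricks.sdk.service.compute import ClusterDetails, Policy
```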
fixed
    logger,
    spark_version_compatibility,
)
from databricks.labs.ucx.assessment.crawlers import _check_cluster_failures, logger
Don't import `logger`; initialize one at the top of the module.
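The usual pattern for that, as a minimal example (not UCX-specific code), is a module-level logger:

```python
import logging

logger = logging.getLogger(__name__)
```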
fixed
… 2. Make jobs inherit the cluster mixin so it can reuse `check_cluster_failures`. 3. Make pipelines inherit the cluster mixin so it can reuse `check_spark_conf`. 4. Move `check_init_script` to the mixin in `init_scripts.py`.
A few nits remaining.
        try:
            data = self._ws.dbfs.read(file_api_format_destination).data
            if data is not None:
                return base64.b64decode(data).decode("utf-8")
That's 9 levels of nesting. Can you reduce the nesting so that it's a bit more readable?
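One way to flatten this, sketched with assumed names (`_read_dbfs_file` is hypothetical and the error handling is simplified): pull the DBFS read into a small helper that returns early.

```python
import base64
import logging

from databricks.sdk import WorkspaceClient

logger = logging.getLogger(__name__)


def _read_dbfs_file(ws: WorkspaceClient, path: str) -> str | None:
    # Hypothetical helper: return the decoded file contents, or None if unreadable.
    try:
        data = ws.dbfs.read(path).data
    except Exception as err:  # the real code would catch a narrower SDK exception
        logger.warning(f"Unable to read init script at {path}: {err}")
        return None
    if data is None:
        return None
    return base64.b64decode(data).decode("utf-8")
```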
        return failures

    def _get_init_script_data(self, init_script_info: InitScriptInfo) -> str | None:
        if init_script_info.dbfs is not None and init_script_info.dbfs.destination is not None:
Nesting can be reduced by refactoring `if COND: <20 lines>` into `if not COND: continue` followed by the 20 lines.
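As a concrete, hypothetical illustration of that refactor (names invented for the example), inverting the condition keeps the loop body flat:

```python
def dbfs_destinations(init_scripts) -> list[str]:
    destinations: list[str] = []
    for info in init_scripts:
        # Guard clause: skip non-DBFS init scripts instead of nesting the happy path.
        if info.dbfs is None or info.dbfs.destination is None:
            continue
        destinations.append(info.dbfs.destination)
    return destinations
```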
from databricks.labs.ucx.assessment.crawlers import (
    _AZURE_SP_CONF_FAILURE_MSG,
    _INIT_SCRIPT_DBFS_PATH,
This constant can live in the init script mixin
@@ -20,7 +22,7 @@ class JobInfo:
     creator: str | None = None


-class JobsMixin(ClustersMixin):
+class JobsMixin:
For consistency, shouldn't it be named `CheckJobsMixin`?
* Added "what" property for migration to scope down table migrations ([#856](#856)). * Added job count in the assessment dashboard ([#858](#858)). * Adopted `installation` package from `databricks-labs-blueprint` ([#860](#860)). * Debug logs to print only the first 96 bytes of SQL query by default, tunable by `debug_truncate_bytes` SDK configuration property ([#859](#859)). * Extract command codes and unify the checks for spark_conf, cluster_policy, init_scripts ([#855](#855)). * Improved installation failure with actionable message ([#840](#840)). * Improved validating groups membership cli command ([#816](#816)). Dependency updates: * Updated databricks-labs-blueprint requirement from ~=0.1.0 to ~=0.2.4 ([#867](#867)).
* Added "what" property for migration to scope down table migrations ([#856](#856)). * Added job count in the assessment dashboard ([#858](#858)). * Adopted `installation` package from `databricks-labs-blueprint` ([#860](#860)). * Debug logs to print only the first 96 bytes of SQL query by default, tunable by `debug_truncate_bytes` SDK configuration property ([#859](#859)). * Extract command codes and unify the checks for spark_conf, cluster_policy, init_scripts ([#855](#855)). * Improved installation failure with actionable message ([#840](#840)). * Improved validating groups membership cli command ([#816](#816)). Dependency updates: * Updated databricks-labs-blueprint requirement from ~=0.1.0 to ~=0.2.4 ([#867](#867)).
* Added "what" property for migration to scope down table migrations ([#856](#856)). * Added job count in the assessment dashboard ([#858](#858)). * Adopted `installation` package from `databricks-labs-blueprint` ([#860](#860)). * Debug logs to print only the first 96 bytes of SQL query by default, tunable by `debug_truncate_bytes` SDK configuration property ([#859](#859)). * Extract command codes and unify the checks for spark_conf, cluster_policy, init_scripts ([#855](#855)). * Improved installation failure with actionable message ([#840](#840)). * Improved validating groups membership cli command ([#816](#816)). Dependency updates: * Updated databricks-labs-blueprint requirement from ~=0.1.0 to ~=0.2.4 ([#867](#867)).
Changes
Extract the command codes for checking spark_conf, cluster_policy, and init_scripts, and put them into unified functions in `src/databricks/labs/ucx/assessment/crawlers.py`. The new functions will then be called by (a sketch of the intended call pattern follows the list):
- `src/databricks/labs/ucx/assessment/clusters.py`
- `src/databricks/labs/ucx/assessment/init_scripts.py`
- `src/databricks/labs/ucx/assessment/jobs.py`
- `src/databricks/labs/ucx/assessment/pipelines.py`
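For illustration, a hedged sketch of that call pattern; `check_cluster_failures` is the assumed public name of the unified function, and `assess_job_cluster` is a hypothetical caller, so the exact signatures in the code may differ:

```python
# assessment/jobs.py (sketch): reuse the unified checks from crawlers.py
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterDetails

from databricks.labs.ucx.assessment.crawlers import check_cluster_failures  # assumed public name


def assess_job_cluster(ws: WorkspaceClient, cluster: ClusterDetails, job_name: str) -> list[str]:
    # Delegate the shared spark_conf / cluster_policy / init_scripts checks to crawlers.py.
    return check_cluster_failures(ws, cluster, f"Job cluster of job {job_name}")
```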
Linked issues
Resolves #823
Functionality
Tests