Detect dbutils.notebook.run #1284

ericvergnaud
Contributor

@ericvergnaud ericvergnaud commented Apr 5, 2024

Changes

  • detect dependencies on notebooks called via dbutils.notebook.run

Linked issues

Resolves #1200

add sample with RUN cell
fix issue where non-PI comments preceding language PI would prevent language PI detection
* main:
  remove `isort` (databrickslabs#1280)
  Addressed Issue with Disabled Feature in certain regions (databrickslabs#1275)
  Improve documentation (databrickslabs#1162)
  Add roadmap workflows and tasks to Table Migration Workflow document (databrickslabs#1274)
  Fix integration test with new DeployedWorkflows (databrickslabs#1250)
  Document troubleshooting guide (databrickslabs#1226)
  Split `DeployedWorkflows` out of `WorkflowsDeployment` (databrickslabs#1248)
  Inject `_TASKS` via constructor to `WorkflowsDeployment` instead of a global variable (databrickslabs#1247)
  Decouple `InstallState` from `WorkspaceDeployment` constructor
  Add document for table migration workflow (databrickslabs#1229)
  Decouple `InstallState` from `WorkflowsDeployment` constructor (databrickslabs#1246)
* main:
  Build notebook dependency graph for `%run` cells (databrickslabs#1279)

# Conflicts:
#	src/databricks/labs/ucx/source_code/notebook.py
#	tests/unit/source_code/test_notebook.py
@ericvergnaud ericvergnaud requested review from a team and larsgeorge-db April 5, 2024 14:57
@ericvergnaud ericvergnaud marked this pull request as draft April 5, 2024 14:57
@ericvergnaud ericvergnaud marked this pull request as ready for review April 8, 2024 12:01
@ericvergnaud ericvergnaud changed the title [WIP] Detect dbutils.notebook.run Detect dbutils.notebook.run Apr 8, 2024

codecov bot commented Apr 8, 2024

Codecov Report

Attention: Patch coverage is 88.23529%, with 18 lines in your changes missing coverage. Please review.

Project coverage is 89.23%. Comparing base (b2a11cf) to head (e1fcb4f).
Report is 35 commits behind head on main.

❗ Current head e1fcb4f differs from pull request most recent head 2224c3f. Consider uploading reports for the commit 2224c3f to get more accurate results

Files                                                    Patch %   Lines
...c/databricks/labs/ucx/source_code/python_linter.py    81.15%    8 Missing and 5 partials ⚠️
src/databricks/labs/ucx/source_code/notebook.py          94.04%    2 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1284      +/-   ##
==========================================
- Coverage   90.02%   89.23%   -0.79%     
==========================================
  Files          62       70       +8     
  Lines        7430     8242     +812     
  Branches     1335     1454     +119     
==========================================
+ Hits         6689     7355     +666     
- Misses        470      589     +119     
- Partials      271      298      +27     


src/databricks/labs/ucx/source_code/astlinter.py
class MatchingVisitor(ast.NodeVisitor):

    def __init__(self, node_type: type, match_nodes: list[tuple[str, type]]):
        self.matched_nodes: list[ast.AST] = []
Collaborator

if something is public, let's expose this as a method - would be easier to refactor.

Contributor Author

done

class PythonLinter(ASTLinter, Linter):
    def lint(self, code: str) -> Iterable[Advice]:
        self.parse(code)
        nodes = self.locate(ast.Call, [("run", ast.Attribute), ("notebook", ast.Attribute), ("dbutils", ast.Name)])
Collaborator

what if the code is self._dbutils.notebook.run(...)?

Contributor Author

I agree it's theoretically possible to do that, but I can't think of any benefit for users in keeping a local private copy of a public API.

To your point, we can't address these edge cases for now. Tbh, I can think of thousands of them, such as:

run_notebook = dbutils.notebook.run
.../...
run_notebook('some notebook')

I have created ticket #1334 for that
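
The matching discussed in this thread can be sketched with the standard `ast` module alone. This is a minimal illustration, not the PR's `locate` API: `find_notebook_run_calls` is a hypothetical helper. It matches a call only when the callee is literally `dbutils.notebook.run`, which shows why both the `self._dbutils` form and the aliased form go undetected.

```python
import ast


def find_notebook_run_calls(code: str) -> list[ast.Call]:
    """Return calls whose callee is literally `dbutils.notebook.run` (hypothetical helper)."""
    matches = []
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (
            isinstance(func, ast.Attribute) and func.attr == "run"
            and isinstance(func.value, ast.Attribute) and func.value.attr == "notebook"
            and isinstance(func.value.value, ast.Name) and func.value.value.id == "dbutils"
        ):
            matches.append(node)
    return matches


direct = "dbutils.notebook.run('some notebook')"
aliased = "run_notebook = dbutils.notebook.run\nrun_notebook('some notebook')"
private = "self._dbutils.notebook.run('some notebook')"

for label, source in [("direct", direct), ("aliased", aliased), ("private", private)]:
    print(label, len(find_notebook_run_calls(source)))
```

Catching the aliased form would need some tracking of assignments, which is presumably what the follow-up ticket is meant to cover.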

Comment on lines 48 to 52
def __init__(self):
    self._module: ast.Module | None = None

def parse(self, code: str):
    self._module = ast.parse(code)
Collaborator

Suggested change
- def __init__(self):
-     self._module: ast.Module | None = None
- def parse(self, code: str):
-     self._module = ast.parse(code)
+ def __init__(self, code: str):
+     self._module = ast.parse(code)

so that we either fail initialising or have a valid state.

Contributor Author

I respectfully disagree. A constructor should not perform processing, especially not when that processing may fail under uncontrolled conditions (source code received from outside)

Collaborator

Factory is just fine

Contributor Author

actually the reason for this is that ASTLinter is a base class for PythonLinter, which also needs to follow Linter conventions... changing that now

Contributor Author

done
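
One way to reconcile the two positions in this thread is a factory method: the constructor only stores an already parsed module, and the parsing that may fail happens before any instance exists. The sketch below is an assumption about what such a factory could look like, not the code that was merged; the `parse` classmethod and `count_calls` are illustrative names.

```python
import ast


class ASTLinter:
    """Hypothetical holder for a parsed module; every instance is in a valid state."""

    def __init__(self, module: ast.Module):
        # No processing here: the constructor only stores what it is given.
        self._module = module

    @classmethod
    def parse(cls, code: str) -> "ASTLinter":
        # ast.parse may raise SyntaxError on untrusted source; no instance is created in that case.
        return cls(ast.parse(code))

    def count_calls(self) -> int:
        # Illustrative query over the stored tree.
        return sum(isinstance(node, ast.Call) for node in ast.walk(self._module))


linter = ASTLinter.parse("f(1)\ng(2)")
print(linter.count_calls())
```

With this shape there is no `None` state to guard against, and callers decide how to handle a `SyntaxError` at the call site.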

@@ -94,7 +150,8 @@ def is_runnable(self) -> bool:
statements = parse_sql(self._original_code)
return len(statements) > 0
except SQLParseError:
return False
sqlglot_logger.warning(f"Failed to parse SQL using 'sqlglot': {self._original_code}")
Collaborator

let's also log the sqlglot error, so that we can create issues over there.

Contributor Author

done
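
A rough sketch of what logging the parser error alongside the offending code could look like, using only the standard `logging` module. `SQLParseError` and `parse_sql` here are stand-ins defined for the example (assumptions, not the sqlglot API), and the logger name is arbitrary.

```python
import logging

logger = logging.getLogger("sqlglot_sketch")  # arbitrary name for this example


class SQLParseError(Exception):
    """Stand-in for the parser's error type (hypothetical)."""


def parse_sql(code: str) -> list[str]:
    # Stand-in parser: rejects anything that does not start with SELECT.
    if not code.strip().upper().startswith("SELECT"):
        raise SQLParseError(f"unexpected input near {code[:10]!r}")
    return [code]


def is_runnable(code: str) -> bool:
    try:
        return len(parse_sql(code)) > 0
    except SQLParseError as e:
        # Pass the exception via exc_info so its message and traceback are logged with the SQL.
        logger.warning(f"Failed to parse SQL using 'sqlglot': {code}", exc_info=e)
        return False
```

Passing the exception through `exc_info` keeps the parser's own message in the log, which is the part needed to file an issue upstream.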

self._new_cell = args[3]
# PI stands for Processing Instruction
# pylint: disable=invalid-name
self._requires_isolated_PI = args[3]
Collaborator

Suggested change
- self._requires_isolated_PI = args[3]
+ self._requires_isolated_processing_instruction = args[3]

:)

Contributor Author

fixed the warning, but I guess people looking into this code would be informed enough by the comment?

Comment on lines 302 to 303
if cell_language.requires_isolated_pi:
    if line.startswith(LANGUAGE_PREFIX):
Collaborator

Suggested change
- if cell_language.requires_isolated_pi:
-     if line.startswith(LANGUAGE_PREFIX):
+ if cell_language.requires_isolated_pi and line.startswith(LANGUAGE_PREFIX):

def _make_runnable(self, lines: list[str], cell_language: CellLanguage):
    prefix = f"{self.comment_prefix} {MAGIC_PREFIX} "
    prefix_len = len(prefix)
    # pylint: disable=too-many-nested-blocks
Collaborator

Suggested change
- # pylint: disable=too-many-nested-blocks

this one can be avoided

Contributor Author

done

@@ -225,6 +292,41 @@ def make_cell(lines_: list[str]):

return cells

def _make_runnable(self, lines: list[str], cell_language: CellLanguage):
Collaborator

Suggested change
- def _make_runnable(self, lines: list[str], cell_language: CellLanguage):
+ def _unwrap_magic(self, lines: list[str], cell_language: CellLanguage):

i think it's a better name for this

Contributor Author

done

line = f"{cell_language.comment_prefix} {COMMENT_PI}{line}"
lines[i] = line

def make_unrunnable(self, code: str, cell_language: CellLanguage) -> str:
Collaborator

Suggested change
- def make_unrunnable(self, code: str, cell_language: CellLanguage) -> str:
+ def as_magic(self, code: str, cell_language: CellLanguage) -> str:

might be a better name

Contributor Author

done

def matched_nodes(self):
    return self._matched_nodes

# pylint: disable=invalid-name
Collaborator

Suggested change
- # pylint: disable=invalid-name
+ # visit_Call is the invalid naming convention, but it is required for NodeVisitor
+ # pylint: disable=invalid-name

Contributor Author

done

@@ -38,6 +51,43 @@ def build_dependency_graph(self, parent: DependencyGraph):
raise NotImplementedError()


class PythonLinter(Linter):
Collaborator

move this to astlinter.py ( -> python_analysis.py)

Contributor Author

done

path = cls.get_dbutils_notebook_run_path_arg(node)
if isinstance(path, ast.Constant):
    return Advisory(
        'notebook-auto-migrate',
Collaborator

Suggested change
- 'notebook-auto-migrate',
+ 'dbutils-notebook-run-literal',

Contributor Author

done

        node.end_col_offset or 0,
    )
return Advisory(
    'notebook-manual-migrate',
Collaborator

Suggested change
- 'notebook-manual-migrate',
+ 'dbutils-notebook-run-dynamic',

Contributor Author

done
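
The literal versus dynamic distinction behind the two advice codes can be illustrated with a small standalone function. This is a sketch under assumptions: `classify_run_path` is a hypothetical helper, it takes a single call expression as a string, and it assumes the path is the first positional argument or a keyword named `path`. It is not the PR's `get_dbutils_notebook_run_path_arg`.

```python
import ast


def classify_run_path(call_source: str) -> str:
    """Return the advice code for one `dbutils.notebook.run(...)` expression (hypothetical helper)."""
    call = ast.parse(call_source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a call expression")
    # Assumption: the path is the first positional argument, else a keyword named `path`.
    if call.args:
        path = call.args[0]
    else:
        path = next((kw.value for kw in call.keywords if kw.arg == "path"), None)
    if isinstance(path, ast.Constant):
        # A literal path can be resolved statically.
        return "dbutils-notebook-run-literal"
    # Variables, f-strings, concatenations and so on are only known at run time.
    return "dbutils-notebook-run-dynamic"


for source in [
    "dbutils.notebook.run('./leaf')",
    "dbutils.notebook.run(name)",
    "dbutils.notebook.run(f'./{folder}/leaf')",
]:
    print(source, "->", classify_run_path(source))
```

Only the literal case gives a dependency that can be added to the graph without executing the notebook, which is why the dynamic case is reported separately.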

@ericvergnaud ericvergnaud requested a review from nfx April 9, 2024 10:10
* main:
  Adding CSV, JSON and include path in mounts (databrickslabs#1329)
  Add missing step sync-workspace-info (databrickslabs#1330)
  disable annotation-unchecked mypy warning (databrickslabs#1331)
  Use service factory to resolve object dependencies (databrickslabs#1209)
Collaborator

@nfx nfx left a comment

lgtm

* main:
  Integrate detection of notebook dependencies (databrickslabs#1338)

# Conflicts:
#	src/databricks/labs/ucx/source_code/notebook.py
@nfx
Collaborator

nfx commented Apr 10, 2024

merged

Successfully merging this pull request may close these issues.

[FEATURE]: Detect notebook include graph by analysing dbutils.notebook.run(...) calls
3 participants