Skip to content
Closed
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 52 additions & 1 deletion dev/merge_spark_pr.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@
import subprocess
import sys
import urllib2
from collections import deque

try:
import jira.client
Expand Down Expand Up @@ -285,6 +286,56 @@ def resolve_jira_issues(title, merge_branches, comment):
for jira_id in jira_ids:
resolve_jira_issue(merge_branches, comment, jira_id)

def standardize_jira_ref(text):
# Standardize the [MODULE] SPARK-XXXXX prefix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be pretty cool to add a doctest here. I've generally found doctests to be pretty helpful when developing parsing / string manipulation stuff like this, such as the pull request title parser that I wrote for spark-prs.appspot.com: https://github.com/databricks/spark-pr-dashboard/blob/0ee9102b7bb30fa61e2f47c3d282c5a018a7d198/sparkprs/utils.py#L55

Unfortunately, this script isn't well-structured to support doctests because the main code isn't in a method. If you want to test this, we can move the lines below this function (starting on 291/341) into a new main() function, then call that from a `if name == 'main' block. It will blow up the size of the diff a bit, but that's not a huge deal since we don't backport changes to this file anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, Josh. Check out the latest diff and see if it's what you had in mind.
I like what you have in spark-pr-dashboard. Would it make more sense to reference your code from the merge script instead? It's a bit different & would require a bit of finagling, but I hate to have similar code duplicated in different places.

# Converts "[SPARK-XXX][mllib] Issue", "[MLLib] SPARK-XXX. Issue" or "SPARK XXX [MLLIB]: Issue" to "[MLLIB] SPARK-XXX: Issue"

#If the string is compliant, no need to process any further
if (re.search(r'\[[A-Z0-9_]+\] SPARK-[0-9]{3,5}: \S+', text)):
return text

# Extract JIRA ref(s):
jira_refs = deque()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this and components can't just be normal Python lists?

i.e.

jira_refs = []

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do convert them to lists, you can remove the deque import.

pattern = re.compile(r'(SPARK[-\s]*[0-9]{3,5})', re.IGNORECASE)
while (pattern.search(text) is not None):
ref = pattern.search(text).groups()[0]
# Replace any whitespace with a dash & convert to uppercase
jira_refs.append(re.sub(r'\s', '-', ref.upper()))
text = text.replace(ref, '')

# Extract spark component(s):
components = deque()
# Look for alphanumeric chars, spaces, and/or commas
pattern = re.compile(r'(\[[\w\s,]+\])', re.IGNORECASE)
while (pattern.search(text) is not None):
component = pattern.search(text).groups()[0]
# Convert to uppercase
components.append(component.upper())
text = text.replace(component, '')

# Cleanup remaining symbols:
pattern = re.compile(r'^\W+(.*)', re.IGNORECASE)
if (pattern.search(text) is not None):
text = pattern.search(text).groups()[0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I don't follow this totally, but could the rest of this function be written a bit more simply?

out_string = text.strip()
for tag in components + jira_refs:
  out_string = tag + " " + out_string
return out_string

# Assemble full text (module(s), JIRA ref(s), remaining text)
if (len(components) < 1):
components = ""
component_text = ' '.join(components).strip()
if (len(jira_refs) < 1):
jira_ref_text = ""
jira_ref_text = ' '.join(jira_refs).strip()

if (len(jira_ref_text) < 1 and len(component_text) < 1):
clean_text = text.strip()
elif (len(jira_ref_text) < 1):
clean_text = component_text + ' ' + text.strip()
elif (len(component_text) < 1):
clean_text = jira_ref_text + ': ' + text.strip()
else:
clean_text = component_text + ' ' + jira_ref_text + ': ' + text.strip()

return clean_text

branches = get_json("%s/branches" % GITHUB_API_BASE)
branch_names = filter(lambda x: x.startswith("branch-"), [x['name'] for x in branches])
Expand All @@ -296,7 +347,7 @@ def resolve_jira_issues(title, merge_branches, comment):
pr_events = get_json("%s/issues/%s/events" % (GITHUB_API_BASE, pr_num))

url = pr["url"]
title = pr["title"]
title = standardize_jira_ref(pr["title"])
body = pr["body"]
target_ref = pr["base"]["ref"]
user_login = pr["user"]["login"]
Expand Down