-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-13016] Remove avro-python3 dependency from Beam #15900
Conversation
Run Python 3.8 Postcommit |
Run Python 3.8 PostCommit |
Run PythonLint PreCommit |
1 similar comment
Run PythonLint PreCommit |
Codecov Report
@@ Coverage Diff @@
## master #15900 +/- ##
==========================================
+ Coverage 83.53% 83.59% +0.06%
==========================================
Files 445 445
Lines 61385 61284 -101
==========================================
- Hits 51278 51231 -47
+ Misses 10107 10053 -54
Continue to review full report at Codecov.
|
retest this please |
d92fa50
to
444e1d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's advertise this change in CHANGES.md so that it is visible in release notes.
444e1d6
to
47192b5
Compare
Run Python 3.8 PostCommit |
Even though use_fastavro can be passed as an input parameter, it would have no effect.
@pabloem @tvalentyn Can you review the PR? thanks. |
Run Python 3.8 PostCommit |
9bd1906
to
cb6e47c
Compare
Run Python 3.8 PostCommit |
Co-authored-by: tvalentyn <[email protected]>
Run Python 3.8 PostCommit |
1 similar comment
Run Python 3.8 PostCommit |
| CoGroupByKey() \ | ||
fastavro_read_pipeline \ | ||
| 'create-fastavro' >> Create(['%s*' % fastavro_output]) \ | ||
| 'read-fastavro' >> ReadAllFromAvro() \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also compare the values for the keys, to make sure that no values were not lost during write-read operation?
I think it could be accomplished by running co-GBK of a pcollection coming form | 'read-fastavro' >> ReadAllFromAvro() \
, and pcollection of generated data. Then, we can extract the set of elements tagged with first pcollection, and the second pcollection, and verify that these sets are the same for all elements in GBK output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed according to the above suggestions. Can you verify if the pipeline looks okay?
Run Python 3.8 PostCommit |
record_pcoll_values = v['record_pcoll'] | ||
fastavro_values = v['fastavro'] | ||
assertEqual(record_pcoll_values, fastavro_values) | ||
assertEqual(len(record_pcoll_values), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What guarantees that we have unique keys for each record (so that after GBK we only have 1 value per key)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will have unique keys for each record as the keys are integers in ascending order. This is an example of the output after CoGroupByKey. Here the number of records are 20 and keys are the integer(number attribute of the pcollection)
(0, {'record_pcoll': [{'label': 'abc', 'number': 0, 'number_str': '0', 'color': 'RED'}], 'fastavro': [{'label': 'abc', 'number': 0, 'number_str': '0', 'color': 'RED'}]}) (1, {'record_pcoll': [{'label': 'def', 'number': 1, 'number_str': '1', 'color': 'ORANGE'}], 'fastavro': [{'label': 'def', 'number': 1, 'number_str': '1', 'color': 'ORANGE'}]}) (2, {'record_pcoll': [{'label': 'ghi', 'number': 2, 'number_str': '2', 'color': 'YELLOW'}], 'fastavro': [{'label': 'ghi', 'number': 2, 'number_str': '2', 'color': 'YELLOW'}]}) (3, {'record_pcoll': [{'label': 'jkl', 'number': 3, 'number_str': '3', 'color': 'GREEN'}], 'fastavro': [{'label': 'jkl', 'number': 3, 'number_str': '3', 'color': 'GREEN'}]}) (4, {'record_pcoll': [{'label': 'mno', 'number': 4, 'number_str': '4', 'color': 'BLUE'}], 'fastavro': [{'label': 'mno', 'number': 4, 'number_str': '4', 'color': 'BLUE'}]}) (5, {'record_pcoll': [{'label': 'pqr', 'number': 5, 'number_str': '5', 'color': 'PURPLE'}], 'fastavro': [{'label': 'pqr', 'number': 5, 'number_str': '5', 'color': 'PURPLE'}]}) (6, {'record_pcoll': [{'label': 'stu', 'number': 6, 'number_str': '6', 'color': None}], 'fastavro': [{'label': 'stu', 'number': 6, 'number_str': '6', 'color': None}]}) (7, {'record_pcoll': [{'label': 'vwx', 'number': 7, 'number_str': '7', 'color': 'RED'}], 'fastavro': [{'label': 'vwx', 'number': 7, 'number_str': '7', 'color': 'RED'}]}) (8, {'record_pcoll': [{'label': 'abc', 'number': 8, 'number_str': '8', 'color': 'ORANGE'}], 'fastavro': [{'label': 'abc', 'number': 8, 'number_str': '8', 'color': 'ORANGE'}]}) (9, {'record_pcoll': [{'label': 'def', 'number': 9, 'number_str': '9', 'color': 'YELLOW'}], 'fastavro': [{'label': 'def', 'number': 9, 'number_str': '9', 'color': 'YELLOW'}]}) (10, {'record_pcoll': [{'label': 'ghi', 'number': 10, 'number_str': '10', 'color': 'GREEN'}], 'fastavro': [{'label': 'ghi', 'number': 10, 'number_str': '10', 'color': 'GREEN'}]}) (11, {'record_pcoll': [{'label': 'jkl', 'number': 11, 'number_str': '11', 'color': 'BLUE'}], 'fastavro': [{'label': 'jkl', 'number': 11, 'number_str': '11', 'color': 'BLUE'}]}) (12, {'record_pcoll': [{'label': 'mno', 'number': 12, 'number_str': '12', 'color': 'PURPLE'}], 'fastavro': [{'label': 'mno', 'number': 12, 'number_str': '12', 'color': 'PURPLE'}]}) (13, {'record_pcoll': [{'label': 'pqr', 'number': 13, 'number_str': '13', 'color': None}], 'fastavro': [{'label': 'pqr', 'number': 13, 'number_str': '13', 'color': None}]}) (14, {'record_pcoll': [{'label': 'stu', 'number': 14, 'number_str': '14', 'color': 'RED'}], 'fastavro': [{'label': 'stu', 'number': 14, 'number_str': '14', 'color': 'RED'}]}) (15, {'record_pcoll': [{'label': 'vwx', 'number': 15, 'number_str': '15', 'color': 'ORANGE'}], 'fastavro': [{'label': 'vwx', 'number': 15, 'number_str': '15', 'color': 'ORANGE'}]}) (16, {'record_pcoll': [{'label': 'abc', 'number': 16, 'number_str': '16', 'color': 'YELLOW'}], 'fastavro': [{'label': 'abc', 'number': 16, 'number_str': '16', 'color': 'YELLOW'}]}) (17, {'record_pcoll': [{'label': 'def', 'number': 17, 'number_str': '17', 'color': 'GREEN'}], 'fastavro': [{'label': 'def', 'number': 17, 'number_str': '17', 'color': 'GREEN'}]}) (18, {'record_pcoll': [{'label': 'ghi', 'number': 18, 'number_str': '18', 'color': 'BLUE'}], 'fastavro': [{'label': 'ghi', 'number': 18, 'number_str': '18', 'color': 'BLUE'}]}) (19, {'record_pcoll': [{'label': 'jkl', 'number': 19, 'number_str': '19', 'color': 'PURPLE'}], 'fastavro': [{'label': 'jkl', 'number': 19, 'number_str': '19', 'color': 'PURPLE'}]})
) | ||
result = self.test_pipeline.run() | ||
result.wait_until_finish() | ||
fastavro_pcoll = self.test_pipeline \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future changes, prefer not to use \ for concatenating strings, since it's error prone. Using brackets is preferable.
From PEP8: (https://www.python.org/dev/peps/pep-0008/)
The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.
(this was a preexisting issue in this file).
Remove the support for
avro
and makefastavro
as default to read Avro files.use_fastavro=False
would still usefastAvro
as default and merely kept for backward compatibility .Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunner
compliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.