-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: 1840 invalid characters #1892
Conversation
📝 Acceptance Test Report📋 Summary✅ The rule acceptance has passed for commit aefbc02 📊 Notices ComparisonNew Errors (5 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
Dropped Errors (1 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
New Warnings (0 out of 1602 datasets, ~0%) ✅No changes were detected due to the code change. Dropped Warnings (4 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
🛡️ Corruption Check0 out of 1602 sources (~0 %) are corrupted. ⏱️ Performance Assessment📈 Validation TimeAssess the performance in terms of seconds taken for the validation process.
📜 Memory Consumption
|
core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java
Outdated
Show resolved
Hide resolved
@tzujenchanmbd Curious about your thoughts on the acceptance tests. In cases where this is happening, it looks like it's because of how the producer is encoding accents (examples below). |
📝 Acceptance Test Report📋 Summary✅ The rule acceptance has passed for commit 477dbdd 📊 Notices ComparisonNew Errors (5 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
Dropped Errors (1 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
New Warnings (0 out of 1602 datasets, ~0%) ✅No changes were detected due to the code change. Dropped Warnings (4 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
🛡️ Corruption Check0 out of 1602 sources (~0 %) are corrupted. ⏱️ Performance Assessment📈 Validation TimeAssess the performance in terms of seconds taken for the validation process.
📜 Memory Consumption
|
Some examples of correct name in the screenshots:
Problem example on maps: https://maps.app.goo.gl/YqnS2Gj9goeWN8GT9 So it seems the issue usually happen on accented characters like "ó", "ö", "ï" in Western European languages. Perhaps dev team can help confirm, but I guess this is probably because of encoding and decoding mismatch during the data production process. For example, if the text was originally saved using a specific encoding (e.g. ISO-8859-1 or Windows-1252, "legacy" encoding covering characters commonly used in Western European languages, such as accented characters (é, ñ, ü) and special symbols), but is then read using a different encoding (e.g. UTF-8). In this case characters outside the ASCII range (like accented characters) will probably not decode correctly, leading to errors like the replacement character (�). |
We assumed that the feeds are in UTF-8, replacement characters and other variations might be due to the fact that is not in proper UTF-8. The legacy Google validator replaced the non-UTF-8 compatible characters with the replacement character. Maybe they have this implemented somewhere in their data pipeline to guarantee that the text is properly rendered in the UI, even with some characters "replaced", legacy validator code |
Revisions after discussion with @tzujenchanmbd: @davidgamez @qcdyx Is it feasible for us to parse the non-UTF-8 characters too? Ideally we could show them to the user in the notice table description, highlighted in bold so they know which characters are causing the problem.
I also want to note it looks like feeds with this error will have unparseable files that will mean notices are dropped, from the acceptance tests above. |
I suggest creating a different notice for non-UTF-8 text. There are two distinct situations: the first is the presence of a replacement character that is a valid UTF-8 character, and the second is the presence of an invalid UTF-8 character. I suspect that if we have a replacement character, it is due to a tool that transformed the feed and potentially replaced the invalid UTF-8 characters(or any other encoding) to UTF-8 or a different target encoding(violating the spec in this case). |
Regarding the dropped notices, they are expected because of the severity of the notice(Error). |
@davidgamez Makes sense. How's this for a revised notice description then, so it's more suggestive and less prescriptive that the issue is non-UTF-8 encoding: Notice name: Invalid character (not plural, to match the style of our other notices) |
The notice name and description make sense to me. |
📝 Acceptance Test Report📋 Summary✅ The rule acceptance has passed for commit 2cf9ad2 📊 Notices ComparisonNew Errors (5 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
Dropped Errors (1 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
New Warnings (0 out of 1602 datasets, ~0%) ✅No changes were detected due to the code change. Dropped Warnings (4 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
🛡️ Corruption Check0 out of 1602 sources (~0 %) are corrupted. ⏱️ Performance Assessment📈 Validation TimeAssess the performance in terms of seconds taken for the validation process.
📜 Memory Consumption
|
📝 Acceptance Test Report📋 Summary✅ The rule acceptance has passed for commit 738de05 📊 Notices ComparisonNew Errors (5 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
Dropped Errors (1 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
New Warnings (0 out of 1602 datasets, ~0%) ✅No changes were detected due to the code change. Dropped Warnings (4 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
🛡️ Corruption Check0 out of 1602 sources (~0 %) are corrupted. ⏱️ Performance Assessment📈 Validation TimeAssess the performance in terms of seconds taken for the validation process.
📜 Memory Consumption
|
LGTM! |
📝 Acceptance Test Report📋 Summary✅ The rule acceptance has passed for commit 44108c3 📊 Notices ComparisonNew Errors (5 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
Dropped Errors (1 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
New Warnings (0 out of 1602 datasets, ~0%) ✅No changes were detected due to the code change. Dropped Warnings (4 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
🛡️ Corruption Check0 out of 1602 sources (~0 %) are corrupted. ⏱️ Performance Assessment📈 Validation TimeAssess the performance in terms of seconds taken for the validation process.
📜 Memory Consumption
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
📝 Acceptance Test Report📋 Summary✅ The rule acceptance has passed for commit 3963b99 📊 Notices ComparisonNew Errors (5 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
Dropped Errors (1 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
New Warnings (0 out of 1602 datasets, ~0%) ✅No changes were detected due to the code change. Dropped Warnings (4 out of 1602 datasets, ~0%) ✅Details of new errors due to code change, which is less than the provided threshold of 1%.
🛡️ Corruption Check0 out of 1602 sources (~0 %) are corrupted. ⏱️ Performance Assessment📈 Validation TimeAssess the performance in terms of seconds taken for the validation process.
📜 Memory Consumption
|
Summary:
Closes #1840
Expected behavior:
Please make sure these boxes are checked before submitting your pull request - thanks!
gradle test
to make sure you didn't break anything