feat: 1840 invalid characters #1892

qcdyx · 2024-10-16T15:26:22Z

Summary:

Closes #1840

Expected behavior:

Please make sure these boxes are checked before submitting your pull request - thanks!

Run the unit tests with gradle test to make sure you didn't break anything
Add or update any needed documentation to the repo
Format the title like "feat: [new feature short description]". Title must follow the Conventional Commits specification (https://www.conventionalcommits.org/en/v1.0.0/).
Linked all relevant issues
Include screenshot(s) showing how this pull request works and fixes the issue(s)

github-actions · 2024-10-16T16:00:55Z

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit aefbc02
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7	invalid_characters
ch-unknown-lk2-gtfs-914	invalid_characters
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	invalid_characters
nl-unknown-allgo-keolis-gtfs-1077	invalid_characters
pt-setubal-carris-metropolitana-gtfs-1874	invalid_characters

Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance

New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
ch-unknown-lk2-gtfs-914	duplicate_route_name
ch-unknown-lk2-gtfs-914	fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914	fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914	missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	missing_timepoint_value
ch-unknown-lk2-gtfs-914	stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077	stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874	stop_without_stop_time
ch-unknown-lk2-gtfs-914	stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric	Dataset ID	Reference (s)	Latest (s)	Difference (s)
Average	--	4.02	4.06	⬆️+0.05
Median	--	1.40	1.46	⬆️+0.06
Standard Deviation	--	11.53	11.32	⬇️-0.21
Minimum in References Reports	us-california-flex-v2-developer-test-feed-3-gtfs-1819	0.50	0.73	⬆️+0.23
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	297.19	287.84	⬇️-9.36
Minimum in Latest Reports	us-california-catalina-express-gtfs-299	0.60	0.55	⬇️-0.06
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	297.19	287.84	⬇️-9.36

📜 Memory Consumption

Metric	Dataset ID	Reference	Latest	Difference
Average	--	486.19 MiB	480.76 MiB	⬇️-5.44 MiB
Median	--	245.95 MiB	246.85 MiB	⬆️+922.84 KiB
Standard Deviation	--	877.41 MiB	874.83 MiB	⬇️-2.58 MiB
Minimum in References Reports	us-oregon-hut-airport-shuttle-gtfs-635	34.05 MiB	34.09 MiB	⬆️+40.00 KiB
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	9.96 GiB	10.12 GiB	⬆️+161.15 MiB
Minimum in Latest Reports	us-virginia-jaunt-inc-gtfs-1324	34.06 MiB	34.05 MiB	⬇️-16.00 KiB
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	9.96 GiB	10.12 GiB	⬆️+161.15 MiB

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java

emmambd · 2024-10-16T17:48:10Z

@tzujenchanmbd Curious about your thoughts on the acceptance tests. In cases where this is happening, it looks like it's because of how the producer is encoding accents (examples below).

Is there some kind of guidance it would make sense for us to provide in the notice about how to encode these to prevent the issue from occurring?

github-actions · 2024-10-16T18:10:51Z

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 477dbdd
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7	invalid_characters
ch-unknown-lk2-gtfs-914	invalid_characters
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	invalid_characters
nl-unknown-allgo-keolis-gtfs-1077	invalid_characters
pt-setubal-carris-metropolitana-gtfs-1874	invalid_characters

Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance

New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
ch-unknown-lk2-gtfs-914	duplicate_route_name
ch-unknown-lk2-gtfs-914	fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914	fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914	missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	missing_timepoint_value
ch-unknown-lk2-gtfs-914	stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077	stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874	stop_without_stop_time
ch-unknown-lk2-gtfs-914	stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric	Dataset ID	Reference (s)	Latest (s)	Difference (s)
Average	--	4.02	4.14	⬆️+0.12
Median	--	1.38	1.43	⬆️+0.05
Standard Deviation	--	11.61	11.86	⬆️+0.25
Minimum in References Reports	us-california-flex-v2-developer-test-feed-3-gtfs-1819	0.51	0.62	⬆️+0.11
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	300.20	291.45	⬇️-8.76
Minimum in Latest Reports	ar-buenos-aires-subterraneos-de-buenos-aires-subte-gtfs-6	0.53	0.54	⬆️+0.01
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	300.20	291.45	⬇️-8.76

📜 Memory Consumption

Metric	Dataset ID	Reference	Latest	Difference
Average	--	487.60 MiB	476.31 MiB	⬇️-11.30 MiB
Median	--	248.03 MiB	245.48 MiB	⬇️-2.56 MiB
Standard Deviation	--	863.99 MiB	843.71 MiB	⬇️-20.28 MiB
Minimum in References Reports	us-california-flex-v2-developer-test-feed-1-gtfs-1817	34.05 MiB	34.06 MiB	⬆️+8.00 KiB
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.15 GiB	9.79 GiB	⬇️-366.70 MiB
Minimum in Latest Reports	tr-kocaeli-metro-izmir-gtfs-1824	34.07 MiB	34.05 MiB	⬇️-16.00 KiB
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.15 GiB	9.79 GiB	⬇️-366.70 MiB

tzujenchanmbd · 2024-10-16T19:23:41Z

Some examples from the screenshots (correct name -> name as displayed):

Rotterdam, Selma Lagerlöfweg -> Rotterdam, Selma Lagerl�fweg
Rotterdam, Port Saïdstraat -> Rotterdam, Port Sa�dstraat
Estación Washington -> Estaci��n Washington

Problem example on maps: https://maps.app.goo.gl/YqnS2Gj9goeWN8GT9

So it seems the issue usually happens with accented characters like "ó", "ö", "ï" in Western European languages.

Perhaps the dev team can help confirm, but my guess is that this comes from an encoding/decoding mismatch during the data production process. For example, the text may originally have been saved in a legacy encoding such as ISO-8859-1 or Windows-1252 (which cover characters commonly used in Western European languages, such as the accented characters é, ñ, ü, and special symbols) and then read using a different encoding such as UTF-8. In that case, characters outside the ASCII range (like accented characters) will probably not decode correctly, producing the replacement character (�).
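The mismatch described above can be reproduced with a minimal Java sketch. It assumes the producer saved the text as ISO-8859-1 and the reader decoded it as UTF-8; the class and method names are hypothetical and are not part of the validator.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Encodes the text as ISO-8859-1, then decodes the same bytes as UTF-8,
    // mimicking a file saved in a legacy encoding and read back as UTF-8.
    // (Assumed scenario for illustration only.)
    static String saveAsLatin1ThenReadAsUtf8(String original) {
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        // new String(bytes, charset) substitutes U+FFFD for malformed input.
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00f6 is "ö"; an escape is used so the source file encoding does not matter.
        String decoded = saveAsLatin1ThenReadAsUtf8("Rotterdam, Selma Lagerl\u00f6fweg");
        System.out.println(decoded);
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
    }
}
```

Pure ASCII text survives this round trip unchanged, which would be consistent with only the accented characters being affected.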

davidgamez · 2024-10-16T19:42:45Z

We assumed that feeds are in UTF-8; replacement characters and other variations might appear because a feed is not in proper UTF-8. The legacy Google validator replaced non-UTF-8-compatible characters with the replacement character. Maybe they have this implemented somewhere in their data pipeline to guarantee that the text is properly rendered in the UI, even with some characters "replaced" (see the legacy validator code).

emmambd · 2024-10-16T20:06:08Z

Revisions after discussion with @tzujenchanmbd:

@davidgamez @qcdyx Is it feasible for us to parse the non-UTF-8 characters too? Ideally we could show them to the user in the notice table description, highlighted in bold so they know which characters are causing the problem.

Notice name: Invalid character (not plural, to match the style of our other notices)
Notice description: This field contains invalid characters, marked in bold. Text must be encoded in UTF-8 in order to be valid. When reading text, use the same encoding that was used to save.

I also want to note that, judging from the acceptance tests above, feeds with this error will have unparsable files, which means other notices get dropped.

davidgamez · 2024-10-16T20:20:34Z

I suggest creating a different notice for non-UTF-8 text. There are two distinct situations: the first is the presence of a replacement character, which is itself a valid UTF-8 character; the second is the presence of an invalid UTF-8 character. I suspect that when we see a replacement character, it is because a tool transformed the feed and replaced the invalid UTF-8 characters (or characters in any other encoding) while converting to UTF-8 or to a different target encoding (violating the spec in that case).
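To make the two situations concrete, here is a rough Java sketch that tells them apart on raw bytes. It assumes a strict UTF-8 decoder is acceptable for the check; the class, enum, and method names are hypothetical and do not reflect how RowParser works.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Diagnosis {
    enum Result { CLEAN, CONTAINS_REPLACEMENT_CHARACTER, INVALID_UTF8 }

    // Classifies raw bytes: not decodable as UTF-8, decodable but containing
    // U+FFFD (which is itself a valid UTF-8 character), or clean.
    static Result diagnose(byte[] rawBytes) {
        // REPORT makes the decoder throw instead of silently inserting U+FFFD.
        CharsetDecoder strictDecoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            String text = strictDecoder.decode(ByteBuffer.wrap(rawBytes)).toString();
            return text.indexOf('\uFFFD') >= 0
                ? Result.CONTAINS_REPLACEMENT_CHARACTER
                : Result.CLEAN;
        } catch (CharacterCodingException e) {
            return Result.INVALID_UTF8;
        }
    }

    public static void main(String[] args) {
        System.out.println(diagnose("Estaci\u00f3n".getBytes(StandardCharsets.UTF_8)));
        System.out.println(diagnose("Estaci\uFFFDn".getBytes(StandardCharsets.UTF_8)));
        System.out.println(diagnose("Estaci\u00f3n".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A lenient decoder would collapse both problem cases into the same string containing U+FFFD, so the strict decoder is what keeps them distinguishable.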

davidgamez · 2024-10-16T20:23:00Z

Regarding the dropped notices, they are expected because of the severity of the notice (Error).

emmambd · 2024-10-16T20:29:26Z

@davidgamez Makes sense. How's this for a revised notice description, so that it suggests rather than asserts that the issue is non-UTF-8 encoding:

Notice name: Invalid character (not plural, to match the style of our other notices)
Notice description: This field contains invalid characters, such as the replacement character ("�"). Check that text was properly encoded in UTF-8 as required by GTFS.
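For illustration, a field-level check matching this description could look roughly like the sketch below. It only looks for the replacement character; the class name, method signature, and message format are assumptions and are not the validator's actual implementation.

```java
import java.util.Optional;

class InvalidCharacterCheck {
    // U+FFFD, the Unicode replacement character.
    static final char REPLACEMENT_CHARACTER = '\uFFFD';

    // Returns a notice-like message when the value contains U+FFFD,
    // or an empty Optional otherwise. The message format is illustrative only.
    static Optional<String> check(String fieldName, String fieldValue) {
        if (fieldValue == null || fieldValue.indexOf(REPLACEMENT_CHARACTER) < 0) {
            return Optional.empty();
        }
        return Optional.of(
            "invalid_character: field '" + fieldName
                + "' contains invalid characters: " + fieldValue);
    }

    public static void main(String[] args) {
        // Value with a replacement character: a message is produced.
        System.out.println(check("stop_name", "Rotterdam, Port Sa\uFFFDdstraat").isPresent()); // true
        // Correctly encoded value (\u00ef is "ï"): no message.
        System.out.println(check("stop_name", "Rotterdam, Port Sa\u00efdstraat").isPresent()); // false
    }
}
```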

davidgamez · 2024-10-16T20:36:34Z

The notice name and description make sense to me.

qcdyx · 2024-10-16T22:41:36Z

@davidgamez @emmambd

github-actions · 2024-10-16T23:24:42Z

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 2cf9ad2
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7	invalid_character
ch-unknown-lk2-gtfs-914	invalid_character
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	invalid_character
nl-unknown-allgo-keolis-gtfs-1077	invalid_character
pt-setubal-carris-metropolitana-gtfs-1874	invalid_character

Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance

New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
ch-unknown-lk2-gtfs-914	duplicate_route_name
ch-unknown-lk2-gtfs-914	fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914	fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914	missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	missing_timepoint_value
ch-unknown-lk2-gtfs-914	stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077	stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874	stop_without_stop_time
ch-unknown-lk2-gtfs-914	stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric	Dataset ID	Reference (s)	Latest (s)	Difference (s)
Average	--	4.02	4.05	⬆️+0.03
Median	--	1.39	1.44	⬆️+0.05
Standard Deviation	--	11.59	11.40	⬇️-0.19
Minimum in References Reports	us-california-flex-v2-developer-test-feed-2-gtfs-1818	0.52	0.58	⬆️+0.06
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	301.68	289.98	⬇️-11.70
Minimum in Latest Reports	us-massachusetts-massachusetts-area-express-max-gtfs-431	0.54	0.54	⬇️-0.00
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	301.68	289.98	⬇️-11.70

📜 Memory Consumption

Metric	Dataset ID	Reference	Latest	Difference
Average	--	494.48 MiB	479.71 MiB	⬇️-14.76 MiB
Median	--	247.23 MiB	245.94 MiB	⬇️-1.29 MiB
Standard Deviation	--	894.05 MiB	850.47 MiB	⬇️-43.58 MiB
Minimum in References Reports	ph-unknown-hm-transport-inc-and-robinsons-malls-gtfs-1105	34.05 MiB	34.07 MiB	⬆️+24.00 KiB
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.18 GiB	10.04 GiB	⬇️-146.33 MiB
Minimum in Latest Reports	us-oregon-high-desert-point-gtfs-636	34.05 MiB	34.05 MiB	⬇️-8.00 KiB
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.18 GiB	10.04 GiB	⬇️-146.33 MiB

github-actions · 2024-10-20T22:34:48Z

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 738de05
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7	invalid_character
ch-unknown-lk2-gtfs-914	invalid_character
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	invalid_character
nl-unknown-allgo-keolis-gtfs-1077	invalid_character
pt-setubal-carris-metropolitana-gtfs-1874	invalid_character

Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance

New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
ch-unknown-lk2-gtfs-914	duplicate_route_name
ch-unknown-lk2-gtfs-914	fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914	fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914	missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	missing_timepoint_value
ch-unknown-lk2-gtfs-914	stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077	stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874	stop_without_stop_time
ch-unknown-lk2-gtfs-914	stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric	Dataset ID	Reference (s)	Latest (s)	Difference (s)
Average	--	3.98	4.03	⬆️+0.04
Median	--	1.39	1.43	⬆️+0.04
Standard Deviation	--	11.35	11.20	⬇️-0.15
Minimum in References Reports	us-california-catalina-express-gtfs-299	0.54	0.64	⬆️+0.10
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	292.13	290.70	⬇️-1.42
Minimum in Latest Reports	us-california-santa-clarita-transit-gtfs-812	0.63	0.55	⬇️-0.09
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	292.13	290.70	⬇️-1.42

📜 Memory Consumption

Metric	Dataset ID	Reference	Latest	Difference
Average	--	486.24 MiB	476.25 MiB	⬇️-9.99 MiB
Median	--	245.94 MiB	245.94 MiB	⬆️+3.99 KiB
Standard Deviation	--	884.90 MiB	827.25 MiB	⬇️-57.65 MiB
Minimum in References Reports	us-oregon-hut-airport-shuttle-gtfs-635	34.05 MiB	34.05 MiB	⬇️-8.00 KiB
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.24 GiB	9.84 GiB	⬇️-400.14 MiB
Minimum in Latest Reports	tr-kocaeli-metro-izmir-gtfs-1824	34.07 MiB	34.05 MiB	⬇️-24.00 KiB
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.24 GiB	9.84 GiB	⬇️-400.14 MiB

emmambd · 2024-10-21T15:10:58Z

LGTM!

github-actions · 2024-10-21T15:19:47Z

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 44108c3
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7	invalid_character
ch-unknown-lk2-gtfs-914	invalid_character
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	invalid_character
nl-unknown-allgo-keolis-gtfs-1077	invalid_character
pt-setubal-carris-metropolitana-gtfs-1874	invalid_character

Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance

New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
ch-unknown-lk2-gtfs-914	duplicate_route_name
ch-unknown-lk2-gtfs-914	fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914	fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914	missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	missing_timepoint_value
ch-unknown-lk2-gtfs-914	stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077	stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874	stop_without_stop_time
ch-unknown-lk2-gtfs-914	stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric	Dataset ID	Reference (s)	Latest (s)	Difference (s)
Average	--	4.00	4.03	⬆️+0.04
Median	--	1.39	1.43	⬆️+0.04
Standard Deviation	--	11.54	11.37	⬇️-0.18
Minimum in References Reports	us-massachusetts-massachusetts-area-express-max-gtfs-431	0.51	0.64	⬆️+0.13
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	300.37	297.15	⬇️-3.22
Minimum in Latest Reports	tr-kocaeli-metro-izmir-gtfs-1824	0.56	0.54	⬇️-0.02
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	300.37	297.15	⬇️-3.22

📜 Memory Consumption

Metric	Dataset ID	Reference	Latest	Difference
Average	--	478.72 MiB	478.05 MiB	⬇️-682.76 KiB
Median	--	246.71 MiB	246.50 MiB	⬇️-222.88 KiB
Standard Deviation	--	852.26 MiB	862.92 MiB	⬆️+10.66 MiB
Minimum in References Reports	us-massachusetts-massachusetts-area-express-max-gtfs-431	34.49 MiB	34.49 MiB	⬇️0 bytes
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.21 GiB	10.08 GiB	⬇️-139.33 MiB
Minimum in Latest Reports	us-california-flex-v2-developer-test-feed-3-gtfs-1819	34.50 MiB	34.48 MiB	⬇️-24.00 KiB
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.21 GiB	10.08 GiB	⬇️-139.33 MiB

davidgamez

LGTM

github-actions · 2024-10-21T18:13:58Z

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 3963b99
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7	invalid_character
ch-unknown-lk2-gtfs-914	invalid_character
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	invalid_character
nl-unknown-allgo-keolis-gtfs-1077	invalid_character
pt-setubal-carris-metropolitana-gtfs-1874	invalid_character

Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance

New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset	Notice Code
ch-unknown-lk2-gtfs-914	duplicate_route_name
ch-unknown-lk2-gtfs-914	fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914	fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077	fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914	missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	missing_timepoint_value
ch-unknown-lk2-gtfs-914	stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape
ch-unknown-lk2-gtfs-914	stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077	stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874	stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926	stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077	stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874	stop_without_stop_time
ch-unknown-lk2-gtfs-914	stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874	trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric	Dataset ID	Reference (s)	Latest (s)	Difference (s)
Average	--	4.08	4.07	⬇️-0.01
Median	--	1.41	1.42	⬆️+0.02
Standard Deviation	--	11.78	11.57	⬇️-0.21
Minimum in References Reports	us-massachusetts-massachusetts-area-express-max-gtfs-431	0.50	0.54	⬆️+0.03
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	302.57	298.73	⬇️-3.83
Minimum in Latest Reports	us-massachusetts-massachusetts-area-express-max-gtfs-431	0.50	0.54	⬆️+0.03
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	302.57	298.73	⬇️-3.83

📜 Memory Consumption

Metric	Dataset ID	Reference	Latest	Difference
Average	--	475.68 MiB	461.37 MiB	⬇️-14.31 MiB
Median	--	248.48 MiB	244.63 MiB	⬇️-3.85 MiB
Standard Deviation	--	828.31 MiB	783.94 MiB	⬇️-44.37 MiB
Minimum in References Reports	us-massachusetts-massachusetts-area-express-max-gtfs-431	34.48 MiB	34.50 MiB	⬆️+24.00 KiB
Maximum in Reference Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.20 GiB	10.11 GiB	⬇️-99.86 MiB
Minimum in Latest Reports	us-michigan-detroit-people-mover-gtfs-417	34.49 MiB	34.48 MiB	⬇️-8.00 KiB
Maximum in Latest Reports	gb-unknown-uk-aggregate-feed-gtfs-2014	10.20 GiB	10.11 GiB	⬇️-99.86 MiB

qcdyx added 3 commits October 16, 2024 11:05

added check for invalid characters in asString method

cace781

formatted code

84a7fde

Merge branch 'master' into 1840-invalid-characters

b3aa55e

qcdyx requested a review from davidgamez October 16, 2024 15:26

davidgamez reviewed Oct 16, 2024

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java (outdated, resolved)

qcdyx added 2 commits October 16, 2024 13:20

removed return null;

aba5d26

Merge branch 'master' into 1840-invalid-characters

5d32118

InvalidCharacterNotice changes based on requirement

2a2628f

Merge branch 'master' into 1840-invalid-characters

c7939f1

Merge branch 'master' into 1840-invalid-characters

4de8a76

davidgamez approved these changes Oct 21, 2024

Merge branch 'master' into 1840-invalid-characters

b8ac341

qcdyx merged commit 457cfc5 into master Oct 21, 2024
333 checks passed

qcdyx deleted the 1840-invalid-characters branch October 21, 2024 17:51

emmambd mentioned this pull request Oct 21, 2024

Validator Accepts Replacement Character in stop_name Field #1840

Closed

emmambd mentioned this pull request Nov 7, 2024

Validator v6 Fails to Validate GTFS File (UNPARSABLE_ROWS) #1918

Open
