👽 ♻️ Upgrade and refactor scripts and tests to match the server side upgrade #914
Merged
shankari
merged 21 commits into
e-mission:may_2023_point_release
from
shankari:upgrade_scripts
May 9, 2023
Conversation
During the lazy migration, the first place in the timeline could be the first place in this incremental batch, with confirmed trips before it. But those confirmed trips don't have confirmed places. So if the `timeline.first_place` has a cleaned `ending_trip`, we should replace it with the corresponding confirmed trip.

This fixes e-mission/e-mission-docs#898 (comment)

Testing done:
- re-ran the pipeline on the test users, did not have errors
- re-ran the pipeline on the durham dataset, did not have errors
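The swap described above can be sketched as follows. This is a minimal illustration, not the actual e-mission code: the function name, the dict shapes, and the `cleaned_to_confirmed` mapping are all assumptions.

```python
# Hypothetical keys, mirroring the e-mission naming convention
CLEANED_TRIP_KEY = "analysis/cleaned_trip"
CONFIRMED_TRIP_KEY = "analysis/confirmed_trip"

def fix_first_place_ending_trip(first_place, cleaned_to_confirmed):
    """If the first place's ending_trip still points at a cleaned trip
    (left over from before the migration), swap in the id of the
    corresponding confirmed trip."""
    ending = first_place.get("ending_trip")
    if ending is not None and ending["key"] == CLEANED_TRIP_KEY:
        first_place["ending_trip"] = {
            "key": CONFIRMED_TRIP_KEY,
            "id": cleaned_to_confirmed[ending["id"]],
        }
    return first_place
```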
… creating confirmed places

When we run the pipeline on an ongoing system, we may have a mix of old-style and new-style confirmed trips. This is because the trips created on previous runs will be "old style", but if there was unprocessed data before the current run, they would have been converted to "new style" trips, along with linked confirmed places, in the pipeline step before this one. So we cannot blindly assert that, if we see an "old style" trip, there are no confirmed places. Instead, we must assert that the start and end places are likely to be cleaned places.

Even this is tricky, because of the "likely to be" issue. Since the trips are linked, the start place of an "old style" trip is the end place of the previous "old style" trip. Since we convert the data trip by trip, we would have created the confirmed place that corresponds to the start place of a trip as the end place of the previous trip. So we cannot add a blanket check for start places.

Further, if we did process some data in the previous CREATE_CONFIRMED_OBJECTS stage, we would have created confirmed places for the last part of the timeline. The first of those confirmed places would be the end place of an "old style" confirmed trip. So we cannot add a blanket check for the end place either.

So we add more complex assertions for old-style trips:
- If it has a start confirmed place, the place should have been created as the end place while processing a previous composite trip
- If it has an end confirmed place, the place should be the start place of a "new style" confirmed trip

This fixes: e-mission/e-mission-docs#898 (comment)

Testing done:
- re-ran the pipeline on the test users, did not have errors
- re-ran the pipeline on the durham dataset, did not have errors
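The two assertions above can be sketched like this. Everything here is illustrative: the field names, and the two bookkeeping sets (`places_created_as_end`, ids we created as end places earlier in this run; `new_style_start_places`, start-place ids of "new style" confirmed trips) are assumptions, not the real e-mission data model.

```python
def check_old_style_trip(trip, places_created_as_end, new_style_start_places):
    """Validate an old-style trip's linked confirmed places, per the
    two rules above. Raises AssertionError on violation."""
    start = trip.get("start_confirmed_place")
    if start is not None:
        # Rule 1: the start place must have been created as the end
        # place of a previously processed composite trip
        assert start in places_created_as_end, \
            "start place was not created by a previous trip"
    end = trip.get("end_confirmed_place")
    if end is not None:
        # Rule 2: the end place must be the start place of a
        # "new style" confirmed trip
        assert end in new_style_start_places, \
            "end place is not the start of a new-style confirmed trip"
```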
Before this change, we followed the high-level structure:
- open DB cursor
- for entry in cursor
  - process(entry)

If there were a lot of entries, or the processing was slow, the cursor could time out in the for loop. In practice, the cursor timeout was ~10 minutes.

We improve the robustness of this codebase at the cost of additional memory usage by reading all entries into memory. We already read all trips into memory while building the trip model and while generating the public dashboard, so this does not seem to be an excessive memory burden. And we do not anticipate reading a large number of trips after this initial creation until the pipeline is reset.

This fixes: e-mission/e-mission-docs#898 (comment)

Testing done:
- Re-ran pipeline on durham data; no failures
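The before/after pattern can be sketched as below, assuming a pymongo-style iterable cursor; `run_stage` and `process` are illustrative names, not the actual pipeline code.

```python
def run_stage(cursor, process):
    """Process all entries from a DB cursor.

    Before: iterating the live cursor directly, i.e.
        for entry in cursor:
            process(entry)
    A slow `process` could outlive the server-side cursor timeout
    (~10 minutes in practice).

    After: materialize all entries up front, trading memory for
    robustness. The cursor is drained quickly and cannot time out
    mid-processing.
    """
    entries = list(cursor)
    for entry in entries:
        process(entry)
    return len(entries)
```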
…gle place

We can sometimes run into the case where the timeline for the cleaned -> composite conversion consists of a single place. With the previous implementation of `get_first_place` and `get_last_place`, we would return `None` if there were no trips, thus causing the conversion to fail. We now fix the methods so that they return the single place in this corner case.

Note that this is only the case in which there is a single place and no trips. If there is even one trip, we will have two places (start and end).

This fixes:
e-mission/e-mission-docs#898 (comment)
e-mission/e-mission-docs#898 (comment)
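A sketch of the corner-case fix, assuming a simple dict-based timeline with `trips` and `places` lists (illustrative, not the real e-mission Timeline class):

```python
def get_first_place(timeline):
    """Return the first place of the timeline. If there are no trips,
    the timeline is a single place; return it instead of None so the
    cleaned -> composite conversion can proceed."""
    if not timeline["trips"]:
        return timeline["places"][0] if timeline["places"] else None
    return timeline["trips"][0]["start_place"]

def get_last_place(timeline):
    """Symmetric fix for the last place."""
    if not timeline["trips"]:
        return timeline["places"][-1] if timeline["places"] else None
    return timeline["trips"][-1]["end_place"]
```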
If we run the pipeline multiple times without any new data, then we will have both a last cleaned place and a last confirmed place, and neither of them will have an exit_ts, since we are still at the last place where we arrived (lack of new data). So in this fairly common case, we can't copy the exit data over, because there is no exit data to copy! Fixed by checking whether exit information exists before copying it over.

This fixes:
e-mission/e-mission-docs#898 (comment)
e-mission/e-mission-docs#898 (comment)

Testing done:
- Re-ran pipeline on Durham data, no failures
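The guard can be sketched as below. The field names are assumptions modeled on the enter/exit fields described in the commit, not a copy of the real code.

```python
def copy_exit_data(cleaned_place, confirmed_place):
    """Copy exit fields from the cleaned place to the confirmed place,
    but only if they exist. When the pipeline ran with no new data we
    are still at the last place, so there is no exit_ts to copy."""
    for key in ("exit_ts", "exit_fmt_time", "exit_local_dt"):
        if key in cleaned_place:
            confirmed_place[key] = cleaned_place[key]
```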
…et code

Reset changes to fully support e-mission#895
- `place_queries.py`: support the `CONFIRMED_PLACE_KEY` as a trip key query
- `reset.py`:
  - find the reset timestamp based on the last confirmed place instead of the last cleaned place, since the confirmed timeline is the final timeline
  - return the last cleaned place as the cleaned place of the last confirmed place
  - open the confirmed place at the end of the confirmed timeline, in addition to the cleaned and raw places for their respective timelines
- indicate that the composite trip creation also has fuzz

```
def mark_composite_object_creation_done(user_id, last_processed_ts):
    if last_processed_ts is None:
        mark_stage_done(user_id, ps.PipelineStages.CREATE_COMPOSITE_OBJECTS, None)
    else:
        mark_stage_done(user_id, ps.PipelineStages.CREATE_COMPOSITE_OBJECTS,
            last_processed_ts + END_FUZZ_AVOID_LTE)
```

This fixes e-mission#911 (comment), but results in a new error e-mission#911 (comment):

```
======================================================================
FAIL: testResetToTsInMiddleOfTrip (__main__.TestPipelineReset) - Load data for both days
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/pipelineTests/TestPipelineReset.py", line 311, in testResetToTsInMiddleOfTrip
    self.compare_result(ad.AttrDict({'result': api_result}).result,
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/pipelineTests/TestPipelineReset.py", line 102, in compare_result
    self.assertEqual(rs.properties.exit_fmt_time, es.properties.exit_fmt_time)
AssertionError: '2016-07-25T17:12:27.853000-07:00' != '2016-07-22T17:21:36-07:00'
- 2016-07-25T17:12:27.853000-07:00
?          ^ - ----- ^^^
+ 2016-07-22T17:21:36-07:00
?          ^ + ^
----------------------------------------------------------------------
```
The question of FUZZ while resetting the pipeline is complicated. In general, we want to set the `last_processed_ts` to the `reset_ts`, but this causes some trips to be processed twice.

- So we added FUZZ to all stages: e-mission/e-mission-docs#191 (comment), e-mission/e-mission-docs#191 (comment), e-mission/e-mission-docs#191 (comment)
- But then it turned out that the CLEAN_RESAMPLE stage would miss some untracked-time trip-like objects, because the places on the two sides of the untracked time can have zero duration, so adding the fuzz would cause some trips to be missed: e-mission/e-mission-docs#191 (comment)
- But if a stage focused on trips only (no places), such as the label assist stages, we would again get duplicate trips without fuzz, so we created a list of stages that needed the fuzz: e-mission@807e1d4
- This list initially contained all the label assist stages, including CREATE_CONFIRMED_OBJECTS. However, we recently started creating a full confirmed timeline, including confirmed places and confirmed untracked time. So we changed the `last_processed_ts` to the timestamp of the last place (`last_expected_trip_done.data.end_ts -> last_processed_ts = cPlace['data']['enter_ts']`): e-mission@0aa7c83
- So we should no longer include the FUZZ for the CREATE_CONFIRMED_OBJECTS stage, since it breaks computation when there is untracked time: e-mission#911 (comment)

We do have to include it for the `CREATE_COMPOSITE_OBJECTS` stage, but we already did so in e-mission@509f2c8. So the only change required here is to remove `CREATE_CONFIRMED_OBJECTS`.

Testing done:
- All reset tests pass
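The reason the fuzz exists at all can be shown with a tiny sketch. This assumes (hypothetically) that a stage's query is inclusive, i.e. it picks up entries with `ts >= last_processed_ts`; the constant value is illustrative, not the real one.

```python
END_FUZZ_AVOID_LTE = 9.999  # illustrative epsilon, not the real constant

def next_batch(entries, last_processed_ts, fuzz=0.0):
    """Return the entries a stage would pick up on its next run,
    assuming an inclusive (>=) timestamp query. Without the fuzz, the
    entry at exactly last_processed_ts is processed a second time;
    with it, the boundary entry is skipped (which is why the fuzz is
    wrong for stages whose places can have zero duration)."""
    return [e for e in entries if e["ts"] >= last_processed_ts + fuzz]
```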
To make the tests easier to debug: if the flag is passed in, the entries are not deleted when the test completes, allowing us to inspect objects or connect to the database.
…at_fixes 🔙 ⚡ 🐛 🧠 ✅ Backwards compat, robustness and bug-fixes for migration and incremental pipeline runs
…exist

In 624082e, we added a check that

> if we have an end confirmed place, the place should be the start place of a "new style" confirmed trip

This is generally true EXCEPT when the confirmed place is the end of the confirmed timeline. In that case, there is no next trip. So we only check for the next trip if the place has a `starting_trip` field and we know that it is not the end of the chain.

This fixes e-mission/e-mission-docs#898 (comment)
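The narrowed check can be sketched as follows; the dict-based place representation and the set of new-style trip ids are assumptions for illustration.

```python
def check_end_confirmed_place(place, new_style_trip_ids):
    """Validate the end confirmed place, but only when it actually has
    a `starting_trip`, i.e. it is not the end of the confirmed
    timeline. Raises AssertionError on violation."""
    starting_trip = place.get("starting_trip")
    if starting_trip is None:
        return  # end of the chain: there is no next trip to validate
    assert starting_trip in new_style_trip_ids, \
        "place's starting trip must be a new-style confirmed trip"
```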
…at_fixes 🔙 Run validation check for the next trip only when it is expecte…
✨ 🐛 Couple of confirmed place fixes
…sentation to LEGACY

- Created a simple class (`json_wrappers.py`) which encapsulates the correct settings that we need for serializing and deserializing the JSON representations with the correct UUID representation
- Changed all the references in the `emission` directory to use the new wrapper functions
  - Including cleaning up prior hardcoded attempts at cleanup, such as e-mission@3b456e7#diff-cfddece09bbf132974a13acdfb77be4e585a4bff39999f559dc8200c1ffbe78d

Additional context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- e-mission@9c683cf
- e-mission@6ac02a2
- e-mission@edd8b77

As of 4.3.3, LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation:

> bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
> JSONOptions(strict_number_long=False, datetime_representation=0,
> strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
> uuid_representation=UuidRepresentation.UNSPECIFIED,
> unicode_decode_error_handler='strict', tzinfo=None,
> type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
> datetime_conversion=DatetimeConversion.DATETIME)

Testing done:
- Ensured that there were no imports of json_utils

```
$ find emission -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced, other than in the wrapper itself

```
$ find emission -name \*.py | xargs grep bju
emission/storage/json_wrappers.py:import bson.json_util as bju
emission/storage/json_wrappers.py:wrapped_object_hook = lambda s: bju.object_hook(s,
emission/storage/json_wrappers.py:    json_options = bju.RELAXED_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_default = lambda o: bju.default(o, json_options = bju.LEGACY_JSON_OPTIONS)
emission/storage/json_wrappers.py:wrapped_dumps = lambda o: bju.dumps(o, json_options = bju.LEGACY_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_loads = lambda s: bju.loads(s)
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find emission -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped" | wc -l
0
```
…sentation to LEGACY

Changed all the references in the `bin` scripts to also use the new wrappers to serialize/deserialize UUID objects. This is a follow-on to e-mission@51a6881, which also has the additional context.

Additional context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- 9c683cf
- 6ac02a2
- edd8b77

As of 4.3.3, LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation:

> bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
> JSONOptions(strict_number_long=False, datetime_representation=0,
> strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
> uuid_representation=UuidRepresentation.UNSPECIFIED,
> unicode_decode_error_handler='strict', tzinfo=None,
> type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
> datetime_conversion=DatetimeConversion.DATETIME)

Testing done:
- Ensured that there were no imports of json_utils

```
$ find bin -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced

```
$ find bin -name \*.py | xargs grep bju
$
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find bin -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped"
$
```
We could change the wrapper to take kwargs, but the easiest option for now is to just switch to json.dumps with a different default
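The "json.dumps with a different default" approach can be illustrated with the standard library alone. The `uuid_default` hook below is hypothetical; the real wrappers use `bson.json_util`, but the mechanism (a custom `default` callable that handles UUID objects) is the same.

```python
import json
import uuid

def uuid_default(o):
    """Hypothetical `default` hook: serialize UUID objects as strings
    so json.dumps can handle documents containing them."""
    if isinstance(o, uuid.UUID):
        return str(o)
    raise TypeError(f"not JSON serializable: {type(o).__name__}")

doc = {"user_id": uuid.UUID("12345678-1234-5678-1234-567812345678")}
s = json.dumps(doc, default=uuid_default)
```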
So that we can tweak the composite trip format as needed without having to reset the entire pipeline. Note that the composite trip stage is a display-only stage that packages up the trip information to send to the phone
We upgraded the server in #900 (fixing e-mission/e-mission-docs#856)
But:
This PR attempts to fix both issues.