👽 ♻️ Upgrade and refactor scripts and tests to match the server side upgrade #914
Merged
shankari
merged 21 commits into
e-mission:may_2023_point_release
from
shankari:upgrade_scripts
May 9, 2023
Conversation
During the lazy migration, the first place in the timeline could be the first place in this incremental batch, with confirmed trips before it. But those confirmed trips don't have confirmed places. So if the `timeline.first_place` has a cleaned `ending_trip`, we should replace it with the corresponding confirmed trip.

This fixes e-mission/e-mission-docs#898 (comment)

Testing done:
- re-ran the pipeline on the test users, did not have errors
- re-ran the pipeline on the durham dataset, did not have errors
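The swap described above can be sketched as follows. This is a minimal illustration, not the actual e-mission code: the function name, the dict shapes, and the `cleaned_to_confirmed` mapping are all assumptions.

```python
# Hypothetical keys, mirroring the e-mission naming convention
CLEANED_TRIP_KEY = "analysis/cleaned_trip"
CONFIRMED_TRIP_KEY = "analysis/confirmed_trip"

def fix_first_place_ending_trip(first_place, cleaned_to_confirmed):
    """If the first place's ending_trip still points at a cleaned trip
    (left over from before the migration), swap in the id of the
    corresponding confirmed trip."""
    ending = first_place.get("ending_trip")
    if ending is not None and ending["key"] == CLEANED_TRIP_KEY:
        first_place["ending_trip"] = {
            "key": CONFIRMED_TRIP_KEY,
            "id": cleaned_to_confirmed[ending["id"]],
        }
    return first_place
```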
… creating confirmed places

When we run the pipeline on an ongoing system, we may have a mix of old-style and new-style confirmed trips. This is because the trips created on previous runs will be "old style", but if there was unprocessed data before the current run, they would have been converted to "new style" trips, along with linked confirmed places, in the pipeline step before this one. So we cannot blindly assert that, if we see an "old style" trip, there are no confirmed places. Instead, we must assert that the start and end places are likely to be cleaned places.

Even this is tricky, because of the "likely to be" issue. Since the trips are linked, the start place of an "old style" trip is the end place of the previous "old style" trip. Since we convert the data trip by trip, we would have created the confirmed place that corresponds to the start place of a trip as the end place of the previous trip. So we cannot add a blanket check for start places.

Further, if we did process some data in the previous CREATE_CONFIRMED_OBJECTS stage, we would have created confirmed places for the last part of the timeline. The first of those confirmed places would be the end place of an "old style" confirmed trip. So we cannot add a blanket check for the end place either.

So we add more complex assertions for old-style trips:
- If it has a start confirmed place, the place should have been created as the end place while processing a previous composite trip
- If it has an end confirmed place, the place should be the start place of a "new style" confirmed trip

This fixes: e-mission/e-mission-docs#898 (comment)

Testing done:
- re-ran the pipeline on the test users, did not have errors
- re-ran the pipeline on the durham dataset, did not have errors
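The two assertions above can be sketched like this. Everything here is illustrative: the field names, and the two bookkeeping sets (`places_created_as_end`, ids we created as end places earlier in this run; `new_style_start_places`, start-place ids of "new style" confirmed trips) are assumptions, not the real e-mission data model.

```python
def check_old_style_trip(trip, places_created_as_end, new_style_start_places):
    """Validate an old-style trip's linked confirmed places, per the
    two rules above. Raises AssertionError on violation."""
    start = trip.get("start_confirmed_place")
    if start is not None:
        # Rule 1: the start place must have been created as the end
        # place of a previously processed composite trip
        assert start in places_created_as_end, \
            "start place was not created by a previous trip"
    end = trip.get("end_confirmed_place")
    if end is not None:
        # Rule 2: the end place must be the start place of a
        # "new style" confirmed trip
        assert end in new_style_start_places, \
            "end place is not the start of a new-style confirmed trip"
```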
Before this change, we followed the high-level structure:
- open DB cursor
- for entry in cursor
  - process(entry)

If there were a lot of entries, or the processing was slow, the cursor could time out in the for loop. In practice, the cursor timeout was ~10 minutes.

We improve the robustness of this codebase at the cost of additional memory usage by reading all entries into memory. We already read all trips into memory while building the trip model and while generating the public dashboard, so this does not seem to be an excessive memory burden. And we do not anticipate reading a large number of trips after this initial creation until the pipeline is reset.

This fixes: e-mission/e-mission-docs#898 (comment)

Testing done:
- Re-ran pipeline on durham data; no failures
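The before/after pattern can be sketched as below, assuming a pymongo-style iterable cursor; `run_stage` and `process` are illustrative names, not the actual pipeline code.

```python
def run_stage(cursor, process):
    """Process all entries from a DB cursor.

    Before: iterating the live cursor directly, i.e.
        for entry in cursor:
            process(entry)
    A slow `process` could outlive the server-side cursor timeout
    (~10 minutes in practice).

    After: materialize all entries up front, trading memory for
    robustness. The cursor is drained quickly and cannot time out
    mid-processing.
    """
    entries = list(cursor)
    for entry in entries:
        process(entry)
    return len(entries)
```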
…gle place

We can sometimes run into the case where the timeline for the cleaned -> composite conversion consists of a single place. With the previous implementation of `get_first_place` and `get_last_place`, we would return `None` if there were no trips, thus causing the conversion to fail. We now fix the methods so that they return the single place in this corner case.

Note that this is only the case in which there is a single place and no trips. If there is even one trip, we will have two places (start and end).

This fixes:
e-mission/e-mission-docs#898 (comment)
e-mission/e-mission-docs#898 (comment)
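A sketch of the corner-case fix, assuming a simple dict-based timeline with `trips` and `places` lists (illustrative, not the real e-mission Timeline class):

```python
def get_first_place(timeline):
    """Return the first place of the timeline. If there are no trips,
    the timeline is a single place; return it instead of None so the
    cleaned -> composite conversion can proceed."""
    if not timeline["trips"]:
        return timeline["places"][0] if timeline["places"] else None
    return timeline["trips"][0]["start_place"]

def get_last_place(timeline):
    """Symmetric fix for the last place."""
    if not timeline["trips"]:
        return timeline["places"][-1] if timeline["places"] else None
    return timeline["trips"][-1]["end_place"]
```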
If we run the pipeline multiple times without any new data, then we will have both a last cleaned place and a last confirmed place, and neither of them will have an exit_ts, since we are still at the last place where we arrived (lack of new data). So in this fairly common case, we can't copy the exit data over, because there is no exit data to copy! Fixed by checking whether exit information exists before copying it over.

This fixes:
e-mission/e-mission-docs#898 (comment)
e-mission/e-mission-docs#898 (comment)

Testing done:
- Re-ran pipeline on Durham data, no failures
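The guard can be sketched as below. The field names are assumptions modeled on the enter/exit fields described in the commit, not a copy of the real code.

```python
def copy_exit_data(cleaned_place, confirmed_place):
    """Copy exit fields from the cleaned place to the confirmed place,
    but only if they exist. When the pipeline ran with no new data we
    are still at the last place, so there is no exit_ts to copy."""
    for key in ("exit_ts", "exit_fmt_time", "exit_local_dt"):
        if key in cleaned_place:
            confirmed_place[key] = cleaned_place[key]
```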
…et code

Reset changes to fully support e-mission#895
- `place_queries.py`: support the `CONFIRMED_PLACE_KEY` as a trip key query
- `reset.py`:
  - find the reset timestamp based on the last confirmed place instead of the last cleaned place, since the confirmed timeline is the final timeline
  - return the last cleaned place as the cleaned place of the last confirmed place
  - open the confirmed place at the end of the confirmed timeline, in addition to the cleaned and raw places for their respective timelines
- indicate that the composite trip creation also has fuzz

```
def mark_composite_object_creation_done(user_id, last_processed_ts):
    if last_processed_ts is None:
        mark_stage_done(user_id, ps.PipelineStages.CREATE_COMPOSITE_OBJECTS, None)
    else:
        mark_stage_done(user_id, ps.PipelineStages.CREATE_COMPOSITE_OBJECTS,
            last_processed_ts + END_FUZZ_AVOID_LTE)
```

This fixes e-mission#911 (comment), but results in a new error e-mission#911 (comment):

```
======================================================================
FAIL: testResetToTsInMiddleOfTrip (__main__.TestPipelineReset) - Load data for both days
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/pipelineTests/TestPipelineReset.py", line 311, in testResetToTsInMiddleOfTrip
    self.compare_result(ad.AttrDict({'result': api_result}).result,
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/pipelineTests/TestPipelineReset.py", line 102, in compare_result
    self.assertEqual(rs.properties.exit_fmt_time, es.properties.exit_fmt_time)
AssertionError: '2016-07-25T17:12:27.853000-07:00' != '2016-07-22T17:21:36-07:00'
- 2016-07-25T17:12:27.853000-07:00
?          ^ - ----- ^^^
+ 2016-07-22T17:21:36-07:00
?          ^ + ^
----------------------------------------------------------------------
```
The question of FUZZ while resetting the pipeline is complicated. In general, we want to set the `last_processed_ts` to the `reset_ts`, but this causes some trips to be processed twice.

- So we added FUZZ to all stages: e-mission/e-mission-docs#191 (comment), e-mission/e-mission-docs#191 (comment), e-mission/e-mission-docs#191 (comment)
- But then it turned out that the CLEAN_RESAMPLE stage would miss some untracked-time trip-like objects, because the places on the two sides of the untracked time can have zero duration, so adding the fuzz would cause some trips to be missed: e-mission/e-mission-docs#191 (comment)
- But if a stage focused on trips only (no places), such as the label assist stages, we would again get duplicate trips without fuzz, so we created a list of stages that needed the fuzz: e-mission@807e1d4
- This list initially contained all the label assist stages, including CREATE_CONFIRMED_OBJECTS. However, we recently started creating a full confirmed timeline, including confirmed places and confirmed untracked time. So we changed the `last_processed_ts` to the timestamp of the last place (`last_expected_trip_done.data.end_ts -> last_processed_ts = cPlace['data']['enter_ts']`): e-mission@0aa7c83
- So we should no longer include the FUZZ for the CREATE_CONFIRMED_OBJECTS stage, since it breaks computation when there is untracked time: e-mission#911 (comment)

We do have to include it for the `CREATE_COMPOSITE_OBJECTS` stage, but we already did so in e-mission@509f2c8. So the only change required here is to remove `CREATE_CONFIRMED_OBJECTS`.

Testing done:
- All reset tests pass
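The reason the fuzz exists at all can be shown with a tiny sketch. This assumes (hypothetically) that a stage's query is inclusive, i.e. it picks up entries with `ts >= last_processed_ts`; the constant value is illustrative, not the real one.

```python
END_FUZZ_AVOID_LTE = 9.999  # illustrative epsilon, not the real constant

def next_batch(entries, last_processed_ts, fuzz=0.0):
    """Return the entries a stage would pick up on its next run,
    assuming an inclusive (>=) timestamp query. Without the fuzz, the
    entry at exactly last_processed_ts is processed a second time;
    with it, the boundary entry is skipped (which is why the fuzz is
    wrong for stages whose places can have zero duration)."""
    return [e for e in entries if e["ts"] >= last_processed_ts + fuzz]
```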
To make the tests easier to debug: if the flag is passed in, the entries are not deleted when the test completes, allowing us to inspect objects or connect to the database.
…at_fixes 🔙 ⚡ 🐛 🧠 ✅ Backwards compat, robustness and bug-fixes for migration and incremental pipeline runs
…exist

In 624082e, we added a check that

> if we have an end confirmed place, the place should be the start place of a "new style" confirmed trip

This is generally true EXCEPT when the confirmed place is the end of the confirmed timeline. In that case, there is no next trip. So we only check for the next trip if the place has a `starting_trip` field and we know that it is not the end of the chain.

This fixes e-mission/e-mission-docs#898 (comment)
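The narrowed check can be sketched as follows; the dict-based place representation and the set of new-style trip ids are assumptions for illustration.

```python
def check_end_confirmed_place(place, new_style_trip_ids):
    """Validate the end confirmed place, but only when it actually has
    a `starting_trip`, i.e. it is not the end of the confirmed
    timeline. Raises AssertionError on violation."""
    starting_trip = place.get("starting_trip")
    if starting_trip is None:
        return  # end of the chain: there is no next trip to validate
    assert starting_trip in new_style_trip_ids, \
        "place's starting trip must be a new-style confirmed trip"
```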
…at_fixes 🔙 Run validation check for the next trip only when it is expecte…
✨ 🐛 Couple of confirmed place fixes
…sentation to LEGACY

- Created a simple class (`json_wrappers.py`) which encapsulates the correct settings that we need for serializing and deserializing the JSON representations with the correct UUID representation
- Changed all the references in the `emission` directory to use the new wrapper functions
  - Including cleaning up prior hardcoded attempts at cleanup, such as e-mission@3b456e7#diff-cfddece09bbf132974a13acdfb77be4e585a4bff39999f559dc8200c1ffbe78d

Additional context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- e-mission@9c683cf
- e-mission@6ac02a2
- e-mission@edd8b77

As of 4.3.3, LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation:

> bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
> JSONOptions(strict_number_long=False, datetime_representation=0,
> strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
> uuid_representation=UuidRepresentation.UNSPECIFIED,
> unicode_decode_error_handler='strict', tzinfo=None,
> type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
> datetime_conversion=DatetimeConversion.DATETIME)

Testing done:
- Ensured that there were no imports of json_utils

```
$ find emission -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced, other than in the wrapper itself

```
$ find emission -name \*.py | xargs grep bju
emission/storage/json_wrappers.py:import bson.json_util as bju
emission/storage/json_wrappers.py:wrapped_object_hook = lambda s: bju.object_hook(s,
emission/storage/json_wrappers.py:    json_options = bju.RELAXED_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_default = lambda o: bju.default(o, json_options = bju.LEGACY_JSON_OPTIONS)
emission/storage/json_wrappers.py:wrapped_dumps = lambda o: bju.dumps(o, json_options = bju.LEGACY_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_loads = lambda s: bju.loads(s)
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find emission -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped" | wc -l
0
```
…sentation to LEGACY

Changed all the references in the `bin` scripts to also use the new wrappers to serialize/deserialize UUID objects. This is a follow-on to e-mission@51a6881, which also has the additional context.

Additional context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- 9c683cf
- 6ac02a2
- edd8b77

As of 4.3.3, LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation:

> bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
> JSONOptions(strict_number_long=False, datetime_representation=0,
> strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
> uuid_representation=UuidRepresentation.UNSPECIFIED,
> unicode_decode_error_handler='strict', tzinfo=None,
> type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
> datetime_conversion=DatetimeConversion.DATETIME)

Testing done:
- Ensured that there were no imports of json_utils

```
$ find bin -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced

```
$ find bin -name \*.py | xargs grep bju
$
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find bin -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped"
$
```
We could change the wrapper to take kwargs, but the easiest option for now is to just switch to json.dumps with a different default
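The "json.dumps with a different default" approach can be illustrated with the standard library alone. The `uuid_default` hook below is hypothetical; the real wrappers use `bson.json_util`, but the mechanism (a custom `default` callable that handles UUID objects) is the same.

```python
import json
import uuid

def uuid_default(o):
    """Hypothetical `default` hook: serialize UUID objects as strings
    so json.dumps can handle documents containing them."""
    if isinstance(o, uuid.UUID):
        return str(o)
    raise TypeError(f"not JSON serializable: {type(o).__name__}")

doc = {"user_id": uuid.UUID("12345678-1234-5678-1234-567812345678")}
s = json.dumps(doc, default=uuid_default)
```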
So that we can tweak the composite trip format as needed without having to reset the entire pipeline. Note that the composite trip stage is a display-only stage that packages up the trip information to send to the phone
We upgraded the server in #900 (fixing e-mission/e-mission-docs#856)
But:
This PR attempts to fix both issues.