
👽 ♻️ Upgrade and refactor scripts and tests to match the server side upgrade #914

Merged 21 commits on May 9, 2023

Conversation

shankari (Contributor) commented on May 8, 2023

We upgraded the server in #900 (fixing e-mission/e-mission-docs#856)

But:

  • we did not upgrade all the scripts (which are not currently checked as part of the automated tests)
  • we upgraded the tests in a fairly hacky way, by copy-pasting the JSON options into multiple locations instead of creating a wrapper

This PR attempts to fix both issues.

During the lazy migration, the first place in the timeline could be the first
place in this incremental batch, with confirmed trips before it. But those
confirmed trips don't have confirmed places.

So if the `timeline.first_place` has a cleaned `ending_trip`, we should replace
it with the corresponding confirmed trip.
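The replacement step above can be sketched as follows. This is an illustrative stand-in, not the actual e-mission code: the dict-shaped place and the `cleaned_to_confirmed` mapping are assumptions that substitute for the real timeline objects and DB lookup.

```python
def fixup_first_place(first_place, cleaned_to_confirmed):
    """first_place: dict whose optional 'ending_trip' may hold a cleaned trip id.
    cleaned_to_confirmed: maps cleaned trip ids to confirmed trip ids."""
    ending = first_place.get("ending_trip")
    if ending in cleaned_to_confirmed:
        # the ending trip is still a cleaned trip: swap in the confirmed trip
        first_place["ending_trip"] = cleaned_to_confirmed[ending]
    return first_place

place = {"_id": "p1", "ending_trip": "cleaned_trip_42"}
mapping = {"cleaned_trip_42": "confirmed_trip_42"}
fixed = fixup_first_place(place, mapping)  # ending_trip is now confirmed_trip_42
```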

This fixes
e-mission/e-mission-docs#898 (comment)

Testing done:
- re-ran the pipeline on the test users, did not have errors
- re-ran the pipeline on the durham dataset, did not have errors
… creating confirmed places

When we run the pipeline on an ongoing system, we may have a mix of old style
and new style confirmed trips. This is because trips created on previous runs
will be "old style", while any unprocessed data before the current run would
have been converted into "new style" trips, along with linked confirmed places,
in the pipeline step before this one.

So we cannot blindly assert that, if we see an "old style" trip, there are no
confirmed places. Instead, we must assert that the start and end places are
likely to be cleaned places.

Even this is tricky, because of the "likely to be" issue.

Since the trips are linked, the start place of an "old style" trip is the end
place of the previous "old style" trip. Since we convert the data trip by trip,
we would have created the confirmed place that corresponds to the start place
of a trip as the end place of the previous trip. So we cannot add a blanket
check for start places.

Further, if we did process some data in the previous CREATE_CONFIRMED_OBJECTS
stage, we would have created confirmed places for the last part of the
timeline. The first of those confirmed places would be the end place of an "old
style" confirmed trip. So we cannot add a blanket check for the end place either.

So we add more complex assertions for old-style trips:
- If the trip has a start confirmed place, that place should have been created
  as the end place while processing a previous composite trip
- If the trip has an end confirmed place, that place should be the start place
  of a "new style" confirmed trip
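A rough sketch of these two checks, expressed as boolean helpers rather than assertions; the dict shapes and field names mirror the description above but are assumptions for illustration, not the PR's actual test code.

```python
def start_place_ok(trip, places):
    """If the old-style trip has a confirmed start place, that place must have
    been created as the end place of a previously processed composite trip."""
    start = places.get(trip.get("start_confirmed_place"))
    if start is None:
        return True  # no confirmed start place: nothing to check
    return start.get("ending_trip") is not None

def end_place_ok(trip, places, new_style_start_places):
    """If the old-style trip has a confirmed end place, that place must be the
    start place of a "new style" confirmed trip."""
    end = places.get(trip.get("end_confirmed_place"))
    if end is None:
        return True  # no confirmed end place: nothing to check
    return end["_id"] in new_style_start_places

places = {"p1": {"_id": "p1", "ending_trip": "t0"}, "p2": {"_id": "p2"}}
trip = {"start_confirmed_place": "p1", "end_confirmed_place": "p2"}
ok = start_place_ok(trip, places) and end_place_ok(trip, places, {"p2"})
```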

This fixes:
e-mission/e-mission-docs#898 (comment)

Testing done:
- re-ran the pipeline on the test users, did not have errors
- re-ran the pipeline on the durham dataset, did not have errors
Before this change, we followed the high-level structure:
- open DB cursor
- for entry in cursor
    - process(entry)

If there were a lot of entries, or the processing was slow, the cursor could
time out inside the for loop. In practice, the cursor timeout was ~10 minutes.

We improve the robustness of this codebase, at the cost of additional memory
usage, by reading all entries into memory. We already read all trips into
memory while building the trip model and while generating the public
dashboard, so this does not seem to be an excessive memory burden.

And we do not anticipate reading a large number of trips after this initial
creation until the pipeline is reset.
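The restructuring can be sketched like this; the generator below is a stand-in for a pymongo cursor, and `process` is a placeholder for the real pipeline step.

```python
def process_all(cursor_like, process):
    """Materialize all entries before processing, so slow per-entry work
    cannot outlive the ~10 minute server-side cursor timeout."""
    entries = list(cursor_like)  # one fast read; the cursor is drained here
    for entry in entries:        # processing can now take arbitrarily long
        process(entry)
    return len(entries)

# stand-in for a pymongo cursor
fake_cursor = iter([{"_id": 1}, {"_id": 2}, {"_id": 3}])
count = process_all(fake_cursor, lambda e: None)  # count == 3
```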

This fixes:
e-mission/e-mission-docs#898 (comment)

Testing done:
- Re-ran pipeline on durham data; no failures
…gle place

We can sometimes run into the case where the timeline for cleaned -> composite
conversion consists of a single place.

With the previous implementation of `get_first_place` and `get_last_place`, we
would return `None` if there were no trips, thus causing the conversion to fail.

We now fix the methods so that they return the single place in this corner case.
Note that this is only the case in which there is a single place and no trips.
If there is even one trip, we will have two places (start and end).
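A minimal sketch of the corner-case fix, with a list-backed timeline as a stand-in for the real trip-linked implementation:

```python
def get_first_place(places, trips):
    if not trips:
        # corner case: a timeline with a single place and no trips; previously
        # we returned None here, breaking the cleaned -> composite conversion
        return places[0] if len(places) == 1 else None
    return places[0]  # the real code follows the trip links; simplified here

def get_last_place(places, trips):
    if not trips:
        return places[-1] if len(places) == 1 else None
    return places[-1]
```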

This fixes:
e-mission/e-mission-docs#898 (comment)
e-mission/e-mission-docs#898 (comment)
If we run the pipeline multiple times without any new data, then we will have
both a last cleaned place and a last confirmed place, and neither of them will
have an exit_ts, since we are still at the last place where we arrived (lack of
new data).

So in this fairly common case, we can't copy the exit data over, since there is
no exit data to copy!

Fixed by checking to see if exit information exists before copying it over.
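The guard can be sketched as follows; the exit field names are assumptions inferred from the `exit_fmt_time` assertion elsewhere in this PR, not the confirmed schema.

```python
EXIT_FIELDS = ("exit_ts", "exit_fmt_time", "exit_local_dt")  # assumed field list

def copy_exit_data(cleaned_place, confirmed_place):
    """Copy exit info from the cleaned place only if it has actually been
    exited; the last place of the timeline has no exit data yet."""
    if cleaned_place.get("exit_ts") is None:
        return confirmed_place  # still at the last place: nothing to copy
    for field in EXIT_FIELDS:
        if field in cleaned_place:
            confirmed_place[field] = cleaned_place[field]
    return confirmed_place
```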

This fixes:
e-mission/e-mission-docs#898 (comment)
e-mission/e-mission-docs#898 (comment)

Testing done:
- Re-ran pipeline on Durham data, no failures
…et code

Reset changes to fully support
e-mission#895

- `place_queries.py`: support the `CONFIRMED_PLACE_KEY` as a trip key query
- `reset.py`:
    - find the reset timestamp based on the last confirmed place instead of the
      last cleaned place since the confirmed timeline is the final timeline
    - return the last cleaned place as the cleaned place of the last confirmed place
    - open the confirmed place at the end of the confirmed timeline in addition
      to cleaned and raw places for their respective timelines
    - indicate that the composite trip creation also has fuzz

```
def mark_composite_object_creation_done(user_id, last_processed_ts):
    if last_processed_ts is None:
        mark_stage_done(user_id, ps.PipelineStages.CREATE_COMPOSITE_OBJECTS, None)
    else:
        mark_stage_done(user_id, ps.PipelineStages.CREATE_COMPOSITE_OBJECTS,
                        last_processed_ts + END_FUZZ_AVOID_LTE)
```

This fixes
e-mission#911 (comment)

But results in a new error
e-mission#911 (comment)

```
======================================================================
FAIL: testResetToTsInMiddleOfTrip (__main__.TestPipelineReset)
- Load data for both days
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/pipelineTests/TestPipelineReset.py", line 311, in testResetToTsInMiddleOfTrip
    self.compare_result(ad.AttrDict({'result': api_result}).result,
  File "/Users/kshankar/e-mission/gis_branch_tests/emission/tests/pipelineTests/TestPipelineReset.py", line 102, in compare_result
    self.assertEqual(rs.properties.exit_fmt_time, es.properties.exit_fmt_time)
AssertionError: '2016-07-25T17:12:27.853000-07:00' != '2016-07-22T17:21:36-07:00'
- 2016-07-25T17:12:27.853000-07:00
?          ^     - ----- ^^^
+ 2016-07-22T17:21:36-07:00
?          ^    +   ^
----------------------------------------------------------------------
```
The question of FUZZ while resetting the pipeline is complicated.

In general, we want to set the `last_processed_ts` to the `reset_ts`, but this
causes some trips to be processed twice.

- So we added FUZZ to all stages
e-mission/e-mission-docs#191 (comment)
e-mission/e-mission-docs#191 (comment)
e-mission/e-mission-docs#191 (comment)

- But then it turned out that the CLEAN_RESAMPLE stage would miss some
  untracked time trip-like objects because the places on the two sides of the
  untracked time can have zero duration, so adding the fuzz would cause some trips to be missed
e-mission/e-mission-docs#191 (comment)

- But if the stage focused on trips only (no places), such as the label assist
  stages, we would again get duplicate trips without fuzz, so we created a list
  of stages that needed the fuzz
  e-mission@807e1d4

- This list initially contained all the label assist stages, including CREATE_CONFIRMED_OBJECTS.
  However, we recently started creating a full confirmed timeline, including
  confirmed places and confirmed untracked time. So we changed the
  `last_processed_ts` to the timestamp of the last place
  (`last_expected_trip_done.data.end_ts -> last_processed_ts = cPlace['data']['enter_ts']`)
  e-mission@0aa7c83

- So we should no longer include the FUZZ for the CREATE_CONFIRMED_OBJECTS
  stage, since it breaks computation when there is untracked time
    e-mission#911 (comment)

We do have to include it for the `CREATE_COMPOSITE_OBJECTS` stage, but we already did so in
    e-mission@509f2c8

So the only change required here is to remove `CREATE_CONFIRMED_OBJECTS` from the list.
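The resulting fuzz handling can be sketched as below. The stage names and the list contents reflect the change described above, but the list itself and the `END_FUZZ_AVOID_LTE` value are assumptions for illustration:

```python
END_FUZZ_AVOID_LTE = 0.5  # seconds; value assumed for illustration

# stages whose reset timestamp gets the fuzz added
STAGES_WITH_FUZZ = [
    "CREATE_COMPOSITE_OBJECTS",
    # "CREATE_CONFIRMED_OBJECTS" removed: its last_processed_ts is now a
    # place enter_ts, and fuzz breaks the zero-duration untracked-time case
]

def stage_reset_ts(stage, reset_ts):
    """Compute the last_processed_ts to restore for a stage during reset."""
    if stage in STAGES_WITH_FUZZ:
        return reset_ts + END_FUZZ_AVOID_LTE
    return reset_ts
```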

Testing done:
- All reset tests pass
To make them easier to debug.
If the flag is passed in, then the entries are not deleted when the test is
complete, allowing us to inspect the objects or connect to the database.
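One way such a flag is typically wired up; the environment-variable name and the cleanup helper below are hypothetical, not the PR's actual implementation:

```python
import os

def should_clear_entries(environ=None):
    """Clear test entries on teardown unless the debug flag is set, in which
    case the entries stay in the DB for inspection after the run."""
    environ = os.environ if environ is None else environ
    return not environ.get("SKIP_TEARDOWN")

# e.g. in tearDown():
#     if should_clear_entries():
#         self.clearRelatedDb()   # hypothetical cleanup helper
```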
…at_fixes

🔙 ⚡ 🐛 🧠 ✅ Backwards compat, robustness and bug-fixes for migration and incremental pipeline runs
…exist

In 624082e, we added a check that

> if we have an end confirmed place, the place should be the start place of a
"new style" confirmed trip

This is generally true EXCEPT if the confirmed place is the end of the confirmed timeline.
In that case, there is no next trip.

So we check for the next trip only if the place has a "starting_trip" field,
i.e., when we know that it is not the end of the chain.
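A minimal sketch of the guard, using a dict-shaped place and the field name from the text above:

```python
def next_trip_to_validate(place):
    """Return the id of the next trip to check, or None at the end of the
    confirmed chain, where the final place has no 'starting_trip'."""
    if "starting_trip" not in place:
        return None  # end of the confirmed timeline: no next trip exists
    return place["starting_trip"]
```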

This fixes
e-mission/e-mission-docs#898 (comment)
…at_fixes

🔙 Run validation check for the next trip only when it is expecte…
…sentation to LEGACY

- Created a simple class (`json_wrappers.py`) which encapsulates the correct
  settings that we need for serializing and deserializing the JSON
  representations with the correct UUID representation
- Changed all the references in the `emission` directory to use the new wrapper functions
    - Including cleaning up prior hardcoded attempts at cleanup, such as
        e-mission@3b456e7#diff-cfddece09bbf132974a13acdfb77be4e585a4bff39999f559dc8200c1ffbe78d
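The wrapper idea can be illustrated with a pure-stdlib stand-in. The real module wraps `bson.json_util` (as the grep output below shows); this sketch only demonstrates the pattern of centralizing the serialization options in one place instead of copy-pasting them at every call site.

```python
import json
import uuid

def _default(obj):
    if isinstance(obj, uuid.UUID):
        return str(obj)  # serialize UUIDs consistently, defined once
    raise TypeError(f"not serializable: {obj!r}")

# one module-level pair of helpers; callers never spell out the options
wrapped_dumps = lambda o: json.dumps(o, default=_default)
wrapped_loads = json.loads

doc = {"user_id": uuid.UUID(int=0)}
s = wrapped_dumps(doc)  # '{"user_id": "00000000-0000-0000-0000-000000000000"}'
```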

Additional Context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- e-mission@9c683cf
- e-mission@6ac02a2
- e-mission@edd8b77

As of 4.3.3, the LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation

>  bson.json_util.LEGACY_JSON_OPTIONS: bson.json_util.JSONOptions =
>  JSONOptions(strict_number_long=False, datetime_representation=0,
>  strict_uuid=False, json_mode=0, document_class=dict, tz_aware=False,
>  uuid_representation=UuidRepresentation.UNSPECIFIED,
>  unicode_decode_error_handler='strict', tzinfo=None,
>  type_registry=TypeRegistry(type_codecs=[], fallback_encoder=None),
>  datetime_conversion=DatetimeConversion.DATETIME)¶

Testing done:
- Ensured that there were no imports of json_utils
```
$ find emission -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced, other than in the wrapper itself

```
$ find emission -name \*.py | xargs grep bju
emission/storage/json_wrappers.py:import bson.json_util as bju
emission/storage/json_wrappers.py:wrapped_object_hook = lambda s: bju.object_hook(s,
emission/storage/json_wrappers.py:    json_options = bju.RELAXED_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_default = lambda o: bju.default(o, json_options = bju.LEGACY_JSON_OPTIONS)
emission/storage/json_wrappers.py:wrapped_dumps = lambda o: bju.dumps(o, json_options = bju.LEGACY_JSON_OPTIONS.with_options(
emission/storage/json_wrappers.py:wrapped_loads = lambda s: bju.loads(s)
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find emission -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped" | wc -l
       0
```
…sentation to LEGACY

Changed all the references in the `bin` scripts to also use the new wrappers to
serialize/deserialize UUID objects.

This is a follow-on to
e-mission@51a6881
which also has the additional context

Additional Context:
- e-mission/e-mission-docs#856 (comment)
- https://pymongo.readthedocs.io/en/stable/examples/uuid.html
- 9c683cf
- 6ac02a2
- edd8b77

As of 4.3.3, the LEGACY_JSON_OPTIONS also has an UNSPECIFIED UUID representation
(the full `JSONOptions` dump is quoted earlier in this PR).

Testing done:
- Ensured that there were no imports of json_utils

```
$ find bin -name \*.py | xargs grep json_utils
$
```

- Ensured that all `bju` prefixes were replaced

```
$ find bin -name \*.py | xargs grep bju
$
```

- Ensured that all `esj` imports used the `wrapped` version of the name

```
$ find bin -name \*.py | xargs grep esj | egrep -v "import|esj.wrapped"
$
```
We could change the wrapper to take kwargs, but the easiest option for now is
to just switch to `json.dumps` with a different default.
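For instance, a caller that needs extra keyword arguments (such as `indent`) can call `json.dumps` directly and reuse only the default function; `wrapped_default` below is a stand-in for the real wrapper helper:

```python
import json
import uuid

def wrapped_default(obj):
    """Stand-in for the wrapper's default: UUIDs serialize as strings."""
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError(f"not serializable: {obj!r}")

# caller needing indent: bypass the wrapper, reuse only the default
pretty = json.dumps({"user_id": uuid.UUID(int=1)}, indent=2,
                    default=wrapped_default)
```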
So that we can tweak the composite trip format as needed without having to
reset the entire pipeline. Note that the composite trip stage is a display-only
stage that packages up the trip information to send to the phone.
@shankari shankari changed the base branch from master to may_2023_point_release May 9, 2023 04:22
@shankari shankari merged commit 79762b1 into e-mission:may_2023_point_release May 9, 2023
@shankari shankari deleted the upgrade_scripts branch May 9, 2023 04:22