Data cleaning with multiple platforms #177

Meng6 · 2022-03-11T22:50:09Z

Meng6
Mar 11, 2022
Collaborator

The first step of the data cleaning module is to fill NA with 0 for the selected event features. However, RAPIDS currently supports only a single platform (iOS or Android). We need to distinguish platforms across participants, and even within the same participant.

For example, application foreground features are only available on the Android platform. If a participant uses an Android device during the first week and an iOS device during the second week, we cannot fill every NA with 0 for the selected application foreground features (which is what we do currently). Instead, the first week's features should be filled with 0 and the second week's features should remain NA.
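To make the desired behavior concrete, here is a minimal sketch in plain Python. The `platform` column and the feature name `apps_foreground_count` are hypothetical placeholders, not actual RAPIDS column names, and `None` stands in for NA.

```python
# Hypothetical feature rows: one row per time segment.
# "platform" and "apps_foreground_count" are illustrative names only.
ANDROID_ONLY_FEATURES = ("apps_foreground_count",)


def fill_na_by_platform(rows, features=ANDROID_ONLY_FEATURES):
    """Fill missing (None) Android-only features with 0 on Android segments only."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row["platform"] == "android":
            for feature in features:
                if row[feature] is None:
                    row[feature] = 0
        cleaned.append(row)
    return cleaned


rows = [
    {"segment": "week1", "platform": "android", "apps_foreground_count": None},
    {"segment": "week2", "platform": "ios", "apps_foreground_count": None},
]
cleaned = fill_na_by_platform(rows)
```

With this rule, the week 1 (Android) value becomes 0 while the week 2 (iOS) value stays missing.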

To solve this issue, I came up with 2 ideas:

1. While downloading the raw data, we add one more column named platform to denote whether the data comes from an Android device or an iOS device. We keep that column while extracting the features within a time segment. The final output (all_sensor_features.csv) would then contain a platform column denoting whether a time segment is Android or iOS. However, there might be an issue in the feature extraction step: if multiple platforms are used during a single time segment, how do we assign a platform to it? To my understanding, it might be OK to use the majority platform (the platform with more samples).
2. Update our participant files. Instead of having a list of DEVICE_IDS with one START_DATE and one END_DATE, we can have a list of START_DATE and END_DATE values. For example, currently: DEVICE_IDS=[d1, d2], START_DATE=2022-01-01, END_DATE=2022-01-10; we can update it to: DEVICE_IDS=[d1, d2], START_DATE=[2022-01-01, 2022-01-04], END_DATE=[2022-01-03 23:59:59, 2022-01-10]. With the updated participant files, we can get the start and end date time per platform per participant. By adding the participant files as an extra input to the data cleaning module, we can assign Android or iOS to each time segment by comparing local_segment_start_datetime, local_segment_end_datetime, START_DATE, and END_DATE within the data cleaning step.
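A rough sketch of how the second idea could look, assuming the participant entry is available as a dictionary and that a PLATFORMS list (one entry per device) accompanies DEVICE_IDS. The helper names `parse`, `platform_periods`, and `platform_for_segment` are made up for illustration and are not part of RAPIDS.

```python
from datetime import datetime


def parse(value):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM-DD HH:MM:SS' into a datetime."""
    fmt = "%Y-%m-%d %H:%M:%S" if " " in value else "%Y-%m-%d"
    return datetime.strptime(value, fmt)


# Hypothetical participant entry following the proposed layout.
participant = {
    "DEVICE_IDS": ["d1", "d2"],
    "PLATFORMS": ["android", "ios"],  # assumed: one platform per device
    "START_DATE": ["2022-01-01", "2022-01-04"],
    "END_DATE": ["2022-01-03 23:59:59", "2022-01-10"],
}


def platform_periods(p):
    """Return (start, end, platform) tuples, one per device."""
    return [
        (parse(start), parse(end), platform)
        for start, end, platform in zip(p["START_DATE"], p["END_DATE"], p["PLATFORMS"])
    ]


def platform_for_segment(seg_start, seg_end, periods):
    """Return the platform whose period fully contains the segment, else None."""
    for start, end, platform in periods:
        if start <= seg_start and seg_end <= end:
            return platform
    return None


periods = platform_periods(participant)
week1_day = platform_for_segment(parse("2022-01-02"), parse("2022-01-02 23:59:59"), periods)
week2_day = platform_for_segment(parse("2022-01-05"), parse("2022-01-05 23:59:59"), periods)
straddling = platform_for_segment(parse("2022-01-03 12:00:00"), parse("2022-01-04 12:00:00"), periods)
```

A segment that straddles the device switch returns None here, which is the same open question raised under idea 1.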

Hi @JulioV, which one do you think is better? Or do you have any other suggestions? Both 1 and 2 might need your help to update the pull_phone_data part of the code. I can update the data cleaning section.

JulioV · 2022-03-16T19:13:18Z

JulioV
Mar 16, 2022
Collaborator

What I think we need for the imputation process is a CSV like this one:

timestamp,os
123,android
567,ios

When we loop through every pair of device/os keys here, we can do the following:

When the user specifies a device_id/os pair, we save the timestamp of the first row of the data for that device plus the OS indicated by the user.
When the user uses the keyword infer, we read the timestamp column in the infer_device_os function of container.R, in addition to the device_id and brand columns. Currently, the infer_device_os function returns android or ios, but with the new update it would return a vector c(timestamp, android/ios).

Then we would combine the vectors for all of a participant's devices as rows in a data frame and write that data frame to disk as a CSV that the imputation algorithm can read as an input file. This output CSV file would be defined in the pull_phone_data rule output.

For example, for a participant who used both an Android and an iOS device and has the keyword infer in the participant file, we would save a CSV like:

timestamp,os
123,android
567,ios

For a participant who used an Android device and has the keyword infer in the participant file, we would save a CSV like:

timestamp,os
123,android

For a participant who used an Android device and specified that OS in the participant file (no infer keyword), we would save a CSV like:

timestamp,os
timestamp_of_first_row_of_data,android
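As an illustration of the file format only, the sketch below writes such a CSV from a list of (first timestamp, OS) pairs, one per device. It is written in Python for brevity, whereas the real change would live in the R code; `os_timeline_csv` is a hypothetical helper name.

```python
import csv
import io


def os_timeline_csv(devices):
    """Build the timestamp,os CSV text from (first_timestamp, os) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", "os"])
    # Sort by timestamp so rows appear in the order the devices were used.
    for timestamp, os_name in sorted(devices):
        writer.writerow([timestamp, os_name])
    return buffer.getvalue()


csv_text = os_timeline_csv([(567, "ios"), (123, "android")])
print(csv_text)
```

Sorting by timestamp keeps the rows in usage order regardless of the order in which the devices are listed.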

For time segments in which the participant switched phones, we can choose the OS that had the most data in that segment, based on the timestamps in the new file we are going to create and the start and end time of the segment (what you said about the majority OS). These timestamp computations will have to be done in UTC, but I do not think the lack of time zone awareness is a problem.
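One possible reading of the majority rule, sketched in Python: each OS is assumed to be active from its timestamp until the next row's timestamp, and the time covered inside the segment is used as a proxy for "most data". Both of those are assumptions, and `majority_os` is a hypothetical helper.

```python
def majority_os(timeline, seg_start, seg_end):
    """Pick the OS covering most of [seg_start, seg_end].

    timeline: list of (timestamp, os) rows sorted by timestamp (UTC).
    Assumption: an OS is active from its timestamp until the next row's.
    """
    coverage = {}
    for i, (start, os_name) in enumerate(timeline):
        # The last OS in the file stays active indefinitely.
        end = timeline[i + 1][0] if i + 1 < len(timeline) else float("inf")
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            coverage[os_name] = coverage.get(os_name, 0) + overlap
    if not coverage:
        return None  # the segment ends before any device produced data
    return max(coverage, key=coverage.get)


timeline = [(123, "android"), (567, "ios")]
mostly_android = majority_os(timeline, 100, 600)
mostly_ios = majority_os(timeline, 500, 700)
before_data = majority_os(timeline, 0, 100)
```

Counting rows per OS inside the segment would be a closer match to "most data", but that needs the raw data rather than only the switch timestamps.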

Do you want to try modifying pull_phone_data.R and the phone data stream containers? I can review your progress in a PR.

Meng6 Mar 19, 2022
Collaborator Author

Thanks so much for your suggestions, @JulioV! I will modify the files you mentioned and create a PR.
