What I think we need for the imputation process is a CSV like this one:
When we loop through every pair of device/os keys here, we can do the following
Then we would put together all vectors for each device of a participant as rows in a data frame and write that data frame as a CSV to disk that you can read from the imputation algorithm (input file). This output CSV file would be defined in the For example, for a participant that used both an Android and iOS and has the keyword
For a participant that used an Android device and used the keyword
For a participant that used an Android and specified that OS in the participant file (no
For time segments in which the participant switched phones, we can choose the OS that had the most data in that segment, based on the timestamps in the new file we are going to create and the start and end times of the segment (what you said about the majority OS). These timestamp computations will have to be done in UTC, but I do not think the lack of time zone awareness is a problem. Do you want to give a try to modify
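To make the majority-OS idea concrete, here is a minimal sketch. It assumes the new file can be read as pairs of a UTC timestamp (in milliseconds) and an OS label; the function name `majority_os` and that record layout are hypothetical, not existing RAPIDS code.

```python
from collections import Counter


def majority_os(records, segment_start, segment_end):
    """Return the OS with the most rows inside a segment.

    records: iterable of (utc_timestamp_ms, os_name) pairs (hypothetical layout).
    segment_start, segment_end: inclusive segment bounds, also in UTC ms.
    Returns None when no row falls inside the segment.
    """
    counts = Counter(
        os_name
        for timestamp, os_name in records
        if segment_start <= timestamp <= segment_end
    )
    if not counts:
        return None
    # On a tie, most_common keeps the OS that was seen first.
    return counts.most_common(1)[0][0]


records = [
    (1_000, "android"),
    (2_000, "android"),
    (3_000, "ios"),
    (9_000, "ios"),
]
print(majority_os(records, 0, 5_000))  # segment with rows from both phones
```

A segment with no rows at all returns `None`, so the cleaning step could leave its features as NA instead of guessing an OS.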
The first step of the data cleaning module is to fill NA with 0 for the selected event features. However, RAPIDS currently supports only a single platform (iOS or Android). We need to distinguish platforms across participants, and even within the same participant.
For example, application foreground features are only available on the Android platform. If a participant uses an Android device during the first week and an iOS device during the second week, we cannot fill all NA values with 0 for the selected application foreground features (which is what we do currently). Instead, the first week's features should be filled with 0 and the second week's features should remain NA.
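As a rough illustration of the behavior we want, the sketch below fills NA only on Android segments. It assumes each time segment row already carries a `platform` label, and it uses plain dictionaries with `None` standing in for NA; the feature name `foreground_app_count` and the function name are made up for the example.

```python
def fill_android_only_features(rows, android_only_features):
    """Fill missing (None) Android-only features with 0, on Android rows only.

    rows: list of dicts, one per time segment, each with a "platform" key
          (an assumed label, not something the pipeline produces today).
    android_only_features: names of features that only exist on Android.
    Returns new dicts; the input rows are left unchanged.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row.get("platform") == "android":
            for feature in android_only_features:
                if new_row.get(feature) is None:
                    new_row[feature] = 0
        filled.append(new_row)
    return filled


segments = [
    {"segment": "week1", "platform": "android", "foreground_app_count": None},
    {"segment": "week2", "platform": "ios", "foreground_app_count": None},
]
cleaned = fill_android_only_features(segments, ["foreground_app_count"])
for row in cleaned:
    print(row["segment"], row["foreground_app_count"])
```

Here week 1 becomes 0, because the sensor was available and simply recorded nothing, while week 2 stays missing because the feature cannot exist on iOS.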
To solve this issue, I came up with 2 ideas:
1. Add a column named `platform` to denote whether the data is from an Android device or an iOS device. We keep that column while extracting the features within a time segment. The final output (`all_sensor_features.csv`) would contain one more column named `platform` to denote whether a time segment is for Android or iOS. However, there might be an issue with the feature extraction step: if multiple platforms are used during a single time segment, how do we assign a platform to it? To my understanding, it might be OK to use the majority platform (the platform with more samples).
2. Instead of multiple `DEVICE_IDS` with one `START_DATE` and one `END_DATE`, we can have a list of `START_DATE` and `END_DATE` values. For example, currently: `DEVICE_IDS=[d1, d2], START_DATE=2022-01-01, END_DATE=2022-01-10`; we can update it to: `DEVICE_IDS=[d1, d2], START_DATE=[2022-01-01, 2022-01-04], END_DATE=[2022-01-03 23:59:59, 2022-01-10]`. With the updated participant files, we can get the start and end date time per platform per participant. By adding the participant files as an extra input to the data cleaning module, we can assign Android or iOS per time segment by comparing `local_segment_start_datetime`, `local_segment_end_datetime`, `START_DATE`, and `END_DATE` within the data cleaning step.

Hi @JulioV, which one do you think is better? Or do you have any other suggestions? Both 1 and 2 might need your help to update the `pull_phone_data` part of the code. I can update the data cleaning section.
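For idea 2, here is a minimal sketch of how the data cleaning step could pick a platform for a segment from the per-device date ranges. It assumes a platform label is known for each device (the `platforms` list below is hypothetical), and it picks the platform whose date range overlaps the segment the longest, which is a stand-in for counting samples.

```python
from datetime import datetime, timedelta


def platform_for_segment(segment_start, segment_end, periods):
    """Return the platform whose date range overlaps the segment the most.

    periods: list of (platform, start, end) tuples with datetime bounds.
    Returns None when no period overlaps the segment.
    On equal overlap, the earlier entry in periods wins.
    """
    best_platform, best_overlap = None, timedelta(0)
    for platform, start, end in periods:
        overlap = min(segment_end, end) - max(segment_start, start)
        if overlap > best_overlap:
            best_platform, best_overlap = platform, overlap
    return best_platform


# Values from the example participant file above; platforms is an assumption.
platforms = ["android", "ios"]
start_dates = ["2022-01-01", "2022-01-04"]
end_dates = ["2022-01-03 23:59:59", "2022-01-10"]
periods = [
    (p, datetime.fromisoformat(s), datetime.fromisoformat(e))
    for p, s, e in zip(platforms, start_dates, end_dates)
]

segment_start = datetime.fromisoformat("2022-01-02 00:00:00")
segment_end = datetime.fromisoformat("2022-01-02 23:59:59")
print(platform_for_segment(segment_start, segment_end, periods))
```

Note that a date-only `END_DATE` such as `2022-01-10` parses as midnight at the start of that day, so we would need to decide whether the last day is meant to be included.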