
Processing segmented/sampled data #378

Open
atruskie opened this issue Sep 15, 2020 · 7 comments

@atruskie
Member

Is your feature request related to a problem? Please describe.

We've had a couple of datasets so far that consist of segmented or sampled data.
While we can produce indices and FCS for these data, the process is very inefficient and convoluted.

Describe the solution you'd like

We should be able to process multiple files from a sensor as if they were one large file, while also supporting both regular and irregular gaps in data.

Describe alternatives you've considered

The old process involves:

  1. loop through the files, calculating indices for each one
  2. use ConcatenateIndexFiles to stitch the results together

However, creating a full result set for many (many!) small, short files is extremely inefficient.

Additional context

Related to #191

@towsey
Contributor

towsey commented Sep 15, 2020 via email

@atruskie
Member Author

That's an interesting question. But, yes, I think that by default any files processed by #191 or this issue should be stored in a day-long format.

That format could be tricky to produce, though. The simplest approach would be to insert null values for missing data (not default values). To save space, we should read/write/use sparse matrices. It would also require all data to be aligned to absolute minutes.

@meperra

meperra commented Dec 16, 2020

Was this ever implemented? I am currently trying to manage some segmented data and process the indices outputs in R, but finding it tricky to organize the matrices without the null values that would keep each matrix the same length.

We'd like to average different indices across days that have recordings at different times, and while it's no problem adding null values to fill gaps at the end of the day, it's more difficult when recording gaps exist at the beginning of a 24-hour period.

@atruskie
Member Author

atruskie commented Dec 20, 2020

@meperra, this was a feature request. It's something we intend to do but it is not yet done.

While it is inefficient, we've certainly had more than one person complete analyses like this without the feature built in.

I am currently trying to manage some segmented data and process the indices outputs in R, but finding it tricky to organize the matrices without the null values that would keep each matrix the same length.

I'd like to learn more about why this is an issue.

Naively, you'd allocate a vector 1440 elements in size, per day, per index. For any minutes where there are data, you fill in the values. Then you're left with a properly structured vector (and that vector could be a slice of a larger matrix).

Note: I do not recommend filling cells where data is missing with a default or zero value... you'll bias your calculation. You need to properly skip missing cells and accept that different minutes can have different population sizes in the final aggregated day.
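A minimal sketch of that approach in R (the index values and start minute here are hypothetical, purely for illustration):

```r
# One 1440-element day vector per index, initialised to NA (missing data)
day_vector <- rep(NA_real_, 1440)

# Hypothetical file whose results start at minute 60 of the day (01:00)
file_indices <- c(0.41, 0.38, 0.45)  # one index value per analysed minute
start_minute <- 60                   # 0-based minute-of-day offset
day_vector[start_minute + seq_along(file_indices)] <- file_indices

# Aggregate by skipping missing minutes rather than zero-filling them
mean(day_vector, na.rm = TRUE)
```

Zero-filling instead of using NA would drag the average down for every unsampled minute; `na.rm = TRUE` keeps the population to the minutes that were actually recorded.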

Also note: averaging indices is not straightforward. @towsey, did you have a white paper on averaging indices?

@meperra

meperra commented Dec 22, 2020

The issue I run into is that the start time of the first file for a specific day is not specified, so the time before the first recording is not populated with NAs or 0s. Instead, the first minute of the first recording appears as the start of the day in the CSV files (index 0 in the CSV is always the first minute of the first recording, rather than the first minute of that day). The issue is not having different population sizes for different minutes in the aggregate day, but rather organizing the data so that each minute is in the correct place within its vector, so that I am averaging the same minutes, not different minutes, across days.

E.g. if my first recording starts at 1 AM, the hour at the start of that day is not recognized and the vector is 1380 elements long, with the first minute (1:00-1:01 AM) labelled as index 0. Ideally, that minute would be labelled as index 60 instead, and the minutes that precede it would be populated with 0s (which are then changed to NAs), just like the time between two non-consecutive recordings currently is; the vector would then be the same length as a full 24-hour recording.

It's not a huge issue to look at the timestamp and identify whether discrepancies in length occur at the end or the beginning of a concatenated day, and I think there is a workaround we can figure out in R, but I just wanted to see if there was a simpler solution. If the time between the true beginning of the day and the first recording could be recognized, and those cells could be populated, then any discrepancies in vector length would be at the end of the day, and those are easily fixed by adding elements to the vector in R (these elements would be NAs that are omitted when averages are taken).

I guess the short version is that my vector lengths vary because of discontinuous recordings, and I would like to use missing values (NAs) to arrange my data within a 1440-element vector so that each element in each vector is associated with the same minute of a 24-hour day. These missing values will be omitted when averages are taken, but I'm under the impression that I need to make sure the actual values are in the right place.

Hopefully that makes some sense?

@atruskie
Member Author

Your files should have a datestamp in them, right?
So a date-adjusted value would be: filename_datestamp + (index_row as minutes) - midnight_of_recording_day
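In R, for instance (the datestamp value and format here are assumptions; use whatever your filenames actually contain):

```r
# Hypothetical filename datestamp: "20201216_010000" (01:00 on 16 Dec 2020)
file_start <- as.POSIXct("20201216_010000", format = "%Y%m%d_%H%M%S", tz = "UTC")
midnight   <- trunc(file_start, units = "days")

# Minute-of-day offset of the file's first result row (here: 60)
offset <- as.numeric(difftime(file_start, midnight, units = "mins"))
# CSV row i (0-based) therefore describes minute-of-day offset + i
```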

All AP does is produce results for input files. This is the simplest way for it to operate and thus the most powerful (since it can be used for a variety of cases). All results it outputs are relative to the recording it is processing (not the start of the day).

Even if we move to processing multiple results, it's highly likely we will continue to produce results that are relative to the input recording. There are a large number of formats and non-trivial problems involved in using datestamps from filenames, which is why we leave that problem as an exercise for the reader. Without knowing all the complexities of your format, it's impossible for us to know what the right thing to do is (don't get me wrong, I have plans to make all of this easier... but the point holds).

For example: we process a single 1-minute file. What is the context here? Are there other files in the day (in a different folder)? Do we produce a massive day-sized matrix of results with only one one-minute slice filled? What if you only wanted results from that minute? What if the datestamp in the filename is wrong (this happens frequently)? What if you're doing a sampling experiment where you take every 5th recording and concatenate the results?

So, the basic process you should take is (a code sketch follows the list):

  1. gather all the files you want to analyze for a date (month, year, whatever, in whatever folder structure you have)
  2. generate indices for each
  3. use the datestamp in the filename (or whatever metadata is appropriate) to offset the results
  4. insert the results into the right place in your day-vector (or whatever larger structure you have)
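A minimal end-to-end sketch of those steps in R, assuming one indices CSV per input file, one row per analysed minute, a parseable datestamp in each filename, and all files belonging to the same day (the directory, filename pattern, and column name are hypothetical):

```r
# Hypothetical layout: indices/SITE1_20201216_010000__Towsey.Acoustic.Indices.csv
csv_files <- list.files("indices", pattern = "Indices\\.csv$", full.names = TRUE)

day_vector <- rep(NA_real_, 1440)  # one day, one index; NA marks missing minutes

for (f in csv_files) {
  # step 3: recover the file's start time from the filename datestamp
  stamp  <- regmatches(basename(f), regexpr("\\d{8}_\\d{6}", basename(f)))
  start  <- as.POSIXct(stamp, format = "%Y%m%d_%H%M%S", tz = "UTC")
  offset <- as.numeric(difftime(start, trunc(start, units = "days"), units = "mins"))

  # step 4: place each row's value at its absolute minute of the day
  rows <- read.csv(f)
  day_vector[offset + seq_len(nrow(rows))] <- rows$AcousticComplexity  # hypothetical column
}

mean(day_vector, na.rm = TRUE)  # missing minutes are skipped, not zero-filled
```

Extending this to a month is a matter of keying a matrix (or data frame) by date and filling one 1440-element row per day.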

@meperra

meperra commented Jan 5, 2021

That does make sense re: the fact that not everyone is doing the same analysis I am. I selfishly wanted everything easy and catered to my needs (haha, I apologize for being so lazy), but the basic process you outlined is not too tricky to figure out using the filenames! Thank you for your help; AP is a great program that has been incredibly helpful thus far.
