Pandas interface for observation timeseries #4

Open
lkangas opened this issue Sep 18, 2020 · 2 comments

lkangas commented Sep 18, 2020

Having the data available as a flat pandas DataFrame would greatly improve the usability and versatility of the open data output.

Example of turning a weather observations multipointcoverage result into a DataFrame:

import pandas as pd

rows = []
# obs.data is a nested dict: {time: {location: {parameter: {'value': ..., ...}}}}
for date, locations in obs.data.items():
    for loc, params in locations.items():
        # Keep only the numeric value of each parameter
        row = {key: entry['value'] for key, entry in params.items()}
        row['date'] = date
        row['location'] = loc
        rows.append(row)

df = pd.DataFrame(rows)
@pnuu pnuu self-assigned this Oct 6, 2020
@pnuu pnuu added the enhancement New feature or request label Oct 6, 2020
@pnuu pnuu added this to the v0.4.0 milestone Oct 6, 2020
adriennn commented Mar 2, 2021

The above doesn't work as such: the 'times' entry is a list, so you'll need to check whether the entry being processed has the 'values' key.
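
A rough sketch of that check (assuming the timeseries layout where each place maps to a 'times' list plus per-parameter dicts holding a 'values' list, as in the snippet further below; place stands for one key of obs.data):

# Keep only the entries that actually carry a 'values' list; 'times' itself is a plain list
params = {k: v['values'] for k, v in obs.data[place].items()
          if isinstance(v, dict) and 'values' in v}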

mikaelhg commented Aug 4, 2022

Here's what I've come up with:

import pandas as pd

from fmiopendata.wfs import download_stored_query

# query, start_time and end_time are defined elsewhere
args = ['timeseries=True', f"starttime={start_time}", f"endtime={end_time}"]
obs = download_stored_query(query, args=args)

# Collect the parameter names seen for any place; 'times' is the shared time axis
cols = {v for p in obs.data for v in obs.data[p]}
cols.discard('times')
cols = sorted(cols)  # sets are unordered, so fix a stable column order

dfs = []
for name in obs.data:
    data = {k: obs.data[name][k]['values'] for k in cols}
    idx = pd.DatetimeIndex(name='hour', data=obs.data[name]['times'])
    idx0 = pd.CategoricalIndex(name='place', data=[name] * idx.size)
    df = pd.DataFrame(data=data, index=[idx0, idx], columns=cols, dtype='float64')
    dfs.append(df)

df = pd.concat(dfs)

df.attrs.update({'location_metadata': obs.location_metadata})

df.to_parquet('data/airquality.parquet')

If you want to compress this: unfortunately pandas and NumPy don't offer compressed fixed-point formats, but you can cast to int16 and multiply the values by 10 to get a similar effect.
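A minimal sketch of that trick (assuming one decimal of precision is enough, the scaled values fit into int16, and missing values are replaced by a sentinel; the output path is just illustrative):

# Multiply by 10, round, and store as int16; divide by 10 on read to recover one decimal.
# int16 cannot hold NaN, so missing values are replaced with a sentinel first.
scaled = (df * 10).round().fillna(-32768).astype('int16')
scaled.to_parquet('data/airquality_int16.parquet')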

Alternative indexing:

    # Inside the loop above, instead of building idx and idx0 separately:
    mi = pd.MultiIndex.from_product([[name], obs.data[name]['times']], names=['place', 'hour'])
    df = pd.DataFrame(data=data, index=mi, columns=cols, dtype='float64')
