gfs is taking a long time #166

Open
peterdudfield opened this issue Aug 1, 2024 · 8 comments
@peterdudfield
Contributor

Detailed Description

It looks like downloading the data doesn't take too long, but mapping it to xarray does.

[Screenshot 2024-08-01 at 21:27:38, image not included]

I think the relevant code is here: https://github.com/openclimatefix/nwp-consumer/blob/main/src/nwp_consumer/internal/inputs/noaa/aws.py#L108

Context

  • always good to get data fast

Possible Implementation

  • speed up if possible?
@peterdudfield
Contributor Author

I can see in the data that this consumer does trim the data down to India. Is it worth doing this to help speed up the "mapping raw file to an xarray dataset" step?
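For illustration, here is a minimal sketch of what trimming to a region could look like in xarray. The bounding box values, the variable name `t2m`, and the helper `crop_to_box` are assumptions made up for this example, not taken from the consumer code. It also assumes latitude runs north to south, as in the 1-degree GFS files. Whether cropping helps depends on where the time is actually spent: it shrinks the data handled by later steps but does not by itself make the GRIB parsing cheaper.

```python
import numpy as np
import xarray as xr

# Hypothetical bounding box roughly covering India (degrees); adjust as needed.
LAT_NORTH, LAT_SOUTH = 38.0, 6.0
LON_WEST, LON_EAST = 67.0, 98.0


def crop_to_box(ds: xr.Dataset) -> xr.Dataset:
    """Select only the grid points inside the bounding box.

    Assumes latitude is stored north-to-south (descending), so the
    latitude slice runs from the northern edge to the southern edge.
    """
    return ds.sel(
        latitude=slice(LAT_NORTH, LAT_SOUTH),
        longitude=slice(LON_WEST, LON_EAST),
    )


# Synthetic stand-in for a global 1-degree field (181 x 360 points).
lat = np.arange(90.0, -91.0, -1.0)
lon = np.arange(0.0, 360.0, 1.0)
global_ds = xr.Dataset(
    {"t2m": (("latitude", "longitude"), np.zeros((lat.size, lon.size)))},
    coords={"latitude": lat, "longitude": lon},
)

india_ds = crop_to_box(global_ds)
print(dict(global_ds.sizes), "->", dict(india_ds.sizes))
```

Applying the crop to each dataset before the `xr.merge` calls would keep the merged arrays small, which may matter more for memory than for load time.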

@peterdudfield
Contributor Author

I just ran this locally and it seems to take about 10 seconds:

```python
import xarray as xr
import cfgrib

ds = cfgrib.open_datasets('gfs.t00z.pgrb2.1p00.f039')

# Keep only the datasets on the level types we care about
ds = [
    d
    for d in ds
    if any(x in d.coords for x in ["surface", "heightAboveGround", "isobaricInhPa"])
]

# Split into surface, heightAboveGround, and isobaricInhPa lists
surface = [d for d in ds if "surface" in d.coords]
heightAboveGround = [d for d in ds if "heightAboveGround" in d.coords]
isobaricInhPa = [d for d in ds if "isobaricInhPa" in d.coords]

# Update name of each data variable based off the attribute GRIB_stepType
for i, d in enumerate(surface):
    for variable in list(d.data_vars):
        d = d.rename(
            {variable: f"{variable}_surface_{d[variable].attrs['GRIB_stepType']}"}
        )
    surface[i] = d
for i, d in enumerate(heightAboveGround):
    for variable in list(d.data_vars):
        d = d.rename({variable: f"{variable}_{d[variable].attrs['GRIB_stepType']}"})
    heightAboveGround[i] = d

surface = xr.merge(surface)
# Drop unknown data variable
surface = surface.drop_vars("unknown_surface_instant", errors="ignore")
heightAboveGround = xr.merge(heightAboveGround)
isobaricInhPa = xr.merge(isobaricInhPa)

ds = xr.merge([surface, heightAboveGround, isobaricInhPa])

# Map the data to the internal dataset representation
# * Transpose the Dataset so that the dimensions are correctly ordered
# * Rechunk the data to a more optimal size
ds = (
    ds.rename({"time": "init_time"})
    .expand_dims("init_time")
    .expand_dims("step")
    .transpose("init_time", "step", ...)
    .sortby("step")
    .chunk(
        {
            "init_time": 1,
            "step": -1,
        },
    )
)
```

This makes me think perhaps the consumer doesn't have enough memory, and is therefore running slowly.

@peterdudfield
Contributor Author

I've tried upping the memory from 5 GB to 6 GB, but it doesn't seem to make much difference right now.

@peterdudfield
Contributor Author

I will try upping CPU from 1 GB to 2GB, as the logs show a lot of CPU usage.

@peterdudfield
Contributor Author

One other option is to save the file after mapping, so the mapping would only need to be done once.
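A rough sketch of that "map once, then reuse" idea follows. The helper `map_once`, the cache file naming, and the use of `pickle` are all assumptions for illustration only; a real consumer would more likely write the mapped dataset to something like zarr or netCDF. The `mapper` argument stands in for the slow "raw file to dataset" step.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def map_once(raw_path: Path, cache_dir: Path, mapper: Callable[[Path], Any]) -> Any:
    """Return the mapped result for raw_path, reusing a saved copy if present."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / (raw_path.name + ".mapped.pkl")  # hypothetical naming
    if cached.exists():
        # Mapping was already done for this raw file: load the saved result.
        with cached.open("rb") as f:
            return pickle.load(f)
    result = mapper(raw_path)  # the slow step, run only on a cache miss
    with cached.open("wb") as f:
        pickle.dump(result, f)
    return result


# Demo with a dummy mapper that records how often it is called.
calls = []


def slow_mapper(path: Path) -> dict:
    calls.append(path.name)
    return {"source": path.name}


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "gfs.t00z.pgrb2.1p00.f039"
    raw.write_bytes(b"")  # empty placeholder file
    first = map_once(raw, Path(tmp) / "cache", slow_mapper)
    second = map_once(raw, Path(tmp) / "cache", slow_mapper)

print("mapper calls:", len(calls))
```

The second call returns the saved result without invoking the mapper again, which is the behaviour the comment above is asking for.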

@peterdudfield
Contributor Author

> I will try upping CPU from 1 GB to 2GB, as the logs show a lot of CPU usage

Roughly, the time seems to be halved when adding more compute.

@peterdudfield peterdudfield mentioned this issue Aug 2, 2024
6 tasks
@peterdudfield
Contributor Author

It looks like the loading takes the longest time; the other steps are quick.
Note that it seems to load 2 files at the same time, possibly in parallel?
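One way to confirm which stage dominates is to time each one separately. This is a minimal sketch using only the standard library; the stage names and the `time.sleep` calls are placeholders standing in for the real load and merge steps.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Accumulated wall-clock seconds per stage name.
timings: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Add the elapsed time of the wrapped block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


# Placeholder workloads; in practice these would wrap e.g. the
# cfgrib.open_datasets call and the xr.merge calls respectively.
with timed("load"):
    time.sleep(0.05)
with timed("merge"):
    time.sleep(0.01)

slowest = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f}s")
print("slowest stage:", slowest)
```

Note that wall-clock timings of stages running in parallel overlap, so per-stage totals can add up to more than the overall run time.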

@peterdudfield
Contributor Author

Perhaps there is a good way to check whether the raw data file has already been downloaded to S3 and, if so, download it from there rather than from the GFS server. The same could apply to other NWP providers.
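The check could look roughly like the sketch below. To keep it self-contained, a plain dict stands in for the S3 bucket, and `fetch_raw` and `download_from_source` are hypothetical names, not functions from the consumer. With a real bucket, the membership test would be replaced by an existence check against S3 (for example a `head_object` call in boto3), which is an assumption about how this would be wired up.

```python
from typing import Callable, MutableMapping, Tuple


def fetch_raw(
    key: str,
    store: MutableMapping[str, bytes],
    download_from_source: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (data, origin), preferring a copy already held in the store."""
    if key in store:
        # Raw file was saved on an earlier run: skip the provider download.
        return store[key], "store"
    data = download_from_source(key)
    store[key] = data  # keep a copy so the next run can reuse it
    return data, "source"


# Demo: an in-memory dict stands in for the bucket.
bucket: MutableMapping[str, bytes] = {}
downloads = []


def fake_download(key: str) -> bytes:
    downloads.append(key)
    return b"raw-grib-bytes"


first = fetch_raw("gfs.t00z.pgrb2.1p00.f039", bucket, fake_download)
second = fetch_raw("gfs.t00z.pgrb2.1p00.f039", bucket, fake_download)
print(first[1], second[1], len(downloads))
```

Keeping the provider download behind a callable means the same wrapper could be reused for other NWP providers.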


2 participants