Merged
5 changes: 4 additions & 1 deletion jobs/kpi-forecasting/.gitignore
@@ -1,3 +1,6 @@
.cache
.idea
-.vscode
+.local
+.python-version
+.python_history
+.vscode
12 changes: 2 additions & 10 deletions jobs/kpi-forecasting/Dockerfile
@@ -1,5 +1,5 @@
-FROM python:3.8
-MAINTAINER Perry McManis <pmcmanis@mozilla.com>
+FROM python:3.10
+LABEL maintainer="Brad Ochocki <bochocki@mozilla.com>"

# https://github.com/mozilla-services/Dockerflow/blob/master/docs/building-container.md
ARG USER_ID="10001"
@@ -12,19 +12,11 @@ RUN groupadd --gid ${USER_ID} ${GROUP_ID} && \

WORKDIR ${HOME}

RUN apt install gcc
RUN apt install g++

RUN pip install --upgrade pip

RUN pip install pystan==2.19.1.1
RUN python3 -m pip install prophet --no-cache-dir

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

RUN pip install git+https://github.com/Nixtla/statsforecast.git

COPY . .

# Drop root and change ownership of the application folder to the user
4 changes: 2 additions & 2 deletions jobs/kpi-forecasting/README.md
@@ -21,9 +21,9 @@ pip install -r requirements.txt
Run the scripts with:

```sh
-python kpi_forecasting.py -c yaml/desktop.yaml
+python ~/kpi-forecasting/kpi_forecasting.py -c ~/kpi-forecasting/yaml/desktop_non_cumulative.yaml

-python kpi_forecasting.py -c yaml/mobile.yaml
+python ~/kpi-forecasting/kpi_forecasting.py -c ~/kpi-forecasting/yaml/mobile_non_cumulative.yaml
```

### On SQL Queries And Preprocessing
15 changes: 9 additions & 6 deletions jobs/kpi-forecasting/kpi-forecasting/Utils/PosteriorSampling.py
@@ -31,10 +31,9 @@ def get_confidence_intervals(
uncertainty_samples["ds"] > np.datetime64(final_observed_sample_date)
]
.groupby("{}".format(aggregation_unit_of_time))
-.sum()
+.sum(numeric_only=True)
)

-print(samples_df_grouped.tail())
# start the aggregated dataframe with the mean of the uncertainty samples
uncertainty_samples_aggregated = samples_df_grouped.mean(axis=1).reset_index()
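
For context, here is a minimal sketch of what `numeric_only=True` changes in the grouped sum. The frame and its column names (`sample_0`, `sample_1`) are made up for illustration and are not taken from the job's data. The point is only that non-numeric columns, such as the `ds` timestamps, are dropped before summing, so the row-wise mean that follows sees sample columns only.

```python
import pandas as pd

# Hypothetical stand-in for the uncertainty samples: one timestamp column,
# one grouping column, and two sample columns.
samples = pd.DataFrame(
    {
        "ds_month": ["2023-05", "2023-05", "2023-06"],
        "ds": pd.to_datetime(["2023-05-30", "2023-05-31", "2023-06-01"]),
        "sample_0": [1.0, 2.0, 3.0],
        "sample_1": [1.5, 2.5, 3.5],
    }
)

# numeric_only=True keeps only numeric columns, so "ds" is not summed.
grouped = samples.groupby("ds_month").sum(numeric_only=True)

# Row-wise mean across the remaining sample columns, as in the hunk above.
aggregated = grouped.mean(axis=1).reset_index()
```

Passing the flag explicitly keeps the behavior the same regardless of the pandas default for `numeric_only`.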

@@ -71,6 +70,8 @@ def get_confidence_intervals(
columns={"y": "value"}
).sort_values(by="{}".format(aggregation_unit_of_time))

+observed_aggregated = observed_aggregated.astype({"value": np.float64})
Contributor Author

observed_aggregated["value"] is stored as Int64Dtype, pandas' nullable integer extension type. For some reason, using this type breaks the following merge on line 100:

    all_aggregated = pd.merge(
        observed_aggregated,
        uncertainty_samples_aggregated,
        on=["{}".format(aggregation_unit_of_time), "value", "type"],
        how="outer",
    )

I think using float64 instead is an okay workaround here, since the values in the confidence intervals are reported as float64 anyway.
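
The workaround can be sketched as follows. This is an illustration under assumptions, not the job's actual data: the two frames, their values, and the `"actual"`/`"forecast"` labels are invented, and the sketch does not try to reproduce the original failure. It only shows the cast to float64 followed by the same kind of outer merge.

```python
import numpy as np
import pandas as pd

# Hypothetical observed data whose "value" column uses the nullable Int64 dtype.
observed = pd.DataFrame(
    {
        "ds_month": pd.to_datetime(["2023-03-01", "2023-04-01"]),
        "value": pd.array([100, 120], dtype="Int64"),
        "type": "actual",
    }
)

# Hypothetical forecast data whose "value" column is plain float64.
forecast = pd.DataFrame(
    {
        "ds_month": pd.to_datetime(["2023-05-01", "2023-06-01"]),
        "value": [130.5, 141.25],
        "type": "forecast",
    }
)

# Cast so both frames share the same dtype for the "value" merge key.
observed = observed.astype({"value": np.float64})

merged = pd.merge(
    observed,
    forecast,
    on=["ds_month", "value", "type"],
    how="outer",
)
```

Because the observed column has no missing values here, the cast to float64 loses nothing; a column containing `pd.NA` would need separate handling.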


# check if whether there are overlap in actual and forecast at the group level
if (
aggregation_unit_of_time == "ds_month"
@@ -83,10 +84,12 @@ def get_confidence_intervals(
).dayofyear
!= 1
):
-uncertainty_samples_aggregated.at[0, 1:] = (
-uncertainty_samples_aggregated.iloc[0, 1:]
-+ observed_aggregated.iloc[-1].value
-)
+# add observed samples from current time period to uncertainty samples for
+# the remainder of the period.
+uncertainty_samples_aggregated.iloc[0, 1:] += observed_aggregated["value"].iloc[
+-1
+]
Comment on lines +89 to +91
Contributor Author

This is the same intended logic as before, but the previous code doesn't work in newer versions of pandas, because observed_aggregated.iloc[-1].value doesn't return a single value; it returns an array of values. The attribute-style column access (.value) was also confusing: at first glance it looks like a typo of .values, which converts a pandas column to a numpy array.
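
A small sketch of the new form, using invented frames and column names (`mean`, `p90`) rather than the job's real output. It assumes, as in the hunk above, that the first column holds the period label and the remaining columns hold numeric values.

```python
import pandas as pd

# Hypothetical aggregated forecast: first column is the period, the rest are numeric.
uncertainty_samples_aggregated = pd.DataFrame(
    {
        "ds_month": ["2023-05", "2023-06"],
        "mean": [10.0, 20.0],
        "p90": [12.0, 24.0],
    }
)

# Hypothetical observed totals; the last row is the partially observed period.
observed_aggregated = pd.DataFrame(
    {"ds_month": ["2023-04", "2023-05"], "value": [5.0, 7.0]}
)

# Selecting the column by name and then the last position yields one scalar.
last_observed = observed_aggregated["value"].iloc[-1]

# Add that scalar to every numeric column of the first forecast row.
uncertainty_samples_aggregated.iloc[0, 1:] += last_observed
```

Indexing the column with `["value"]` before `.iloc[-1]` makes it explicit that a single number is being added, and it avoids any confusion with `.values`.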


observed_aggregated = observed_aggregated.loc[
observed_aggregated[aggregation_unit_of_time]
< observed_aggregated[aggregation_unit_of_time].max()
2 changes: 1 addition & 1 deletion jobs/kpi-forecasting/kpi-forecasting/kpi_forecasting.py
@@ -50,7 +50,7 @@ def main() -> None:
aggregation_unit_of_time=config["confidences"],
asofdate=predictions["ds"].max(),
final_observed_sample_date=dataset["ds"].max(),
-target="desktop",
+target=config["target"],
)

write_predictions_to_bigquery(predictions, config)
187 changes: 100 additions & 87 deletions jobs/kpi-forecasting/requirements.txt
@@ -1,94 +1,107 @@
adagio==0.2.4
ansi2html==1.8.0
antlr4-python3-runtime==4.11.1
appdirs==1.4.4
attrs==20.3.0
bcrypt==3.2.0
beautifulsoup4==4.10.0
BigQuery-Python==1.15.0
black==22.3.0
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.12
click==8.0.4
cmdstanpy==0.9.68
asttokens==2.2.1
backcall==0.2.0
blinker==1.6.2
cachetools==5.3.0
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmdstanpy==1.1.0
comm==0.1.3
contourpy==1.0.7
convertdate==2.4.0
cryptography==36.0.1
cycler==0.11.0
Cython==0.29.28
ephem==4.1.3
flake8==3.8.4
google-api-core==1.31.5
google-api-python-client==2.38.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.5.0
google-auth==1.35.0
google-cloud-bigquery-storage==1.0.0
google-cloud-bigquery==1.27.2
google-cloud-core==1.7.2
google-cloud-storage==1.31.0
google-crc32c==1.3.0
google-resumable-media==1.3.3
google==3.0.0
googleapis-common-protos==1.55.0
grpcio==1.44.0
hijri-converter==2.2.3
holidays==0.16
httplib2==0.20.4
idna==3.3
iniconfig==1.1.1
Jinja2==2.11.2
joblib==1.2.0
kiwisolver==1.3.2
korean-lunar-calendar==0.2.1
dash==2.9.3
dash-core-components==2.0.0
dash-html-components==2.0.0
dash-table==5.0.0
db-dtypes==1.1.1
debugpy==1.6.7
decorator==5.1.1
ephem==4.1.4
executing==1.2.0
Flask==2.3.2
fonttools==4.39.3
fs==2.4.16
fugue==0.8.3
fugue-sql-antlr==0.1.6
google-api-core==2.11.0
google-auth==2.17.3
google-cloud-bigquery==3.10.0
google-cloud-core==2.3.2
google-crc32c==1.5.0
google-resumable-media==2.5.0
googleapis-common-protos==1.59.0
grpcio==1.54.0
grpcio-status==1.54.0
hijri-converter==2.3.1
holidays==0.24
idna==3.4
ipykernel==6.23.0
ipython==8.13.2
itsdangerous==2.1.2
jedi==0.18.2
Jinja2==3.1.2
jupyter-dash==0.4.2
jupyter_client==8.2.0
jupyter_core==5.3.0
kiwisolver==1.4.4
korean-lunar-calendar==0.3.1
llvmlite==0.40.0
LunarCalendar==0.0.9
MarkupSafe==1.1.1
matplotlib==3.3.2
mccabe==0.6.1
more-itertools==8.6.0
mypy-extensions==0.4.3
numpy
oauthlib==3.2.0
packaging==21.3
pandas-gbq==0.13.2
pandas==1.3.5
paramiko==2.9.2
pathspec==0.9.0
Pillow==9.0.1
plotly==4.9.0
pluggy==0.13.1
protobuf==3.19.4
py==1.10.0
pyarrow==7.0.0
pyasn1-modules==0.2.8
pyasn1==0.4.8
pycodestyle==2.6.0
pycparser==2.21
pydata-google-auth==1.3.0
pyflakes==2.2.0
PyMeeus==0.5.11
PyNaCl==1.5.0
pyparsing==2.4.7
pytest-black==0.3.11
pytest-flake8==1.0.6
pytest==6.0.2
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numba==0.57.0
numpy==1.24.3
orjson==3.8.12
packaging==23.1
pandas==1.5.3
parso==0.8.3
patsy==0.5.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.5.0
platformdirs==3.5.0
plotly==5.14.1
plotly-resampler==0.8.3.2
prompt-toolkit==3.0.38
prophet==1.1.2
proto-plus==1.22.2
protobuf==4.23.0
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==12.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.15.1
PyMeeus==0.5.12
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2021.3
pytz==2023.3
PyYAML==6.0
regex==2020.11.13
requests-oauthlib==1.3.1
requests==2.27.1
retrying==1.3.3
rsa==4.8
setuptools-git==1.2
pyzmq==25.0.2
qpd==0.4.1
requests==2.30.0
retrying==1.3.4
rsa==4.9
scipy==1.10.1
six==1.16.0
soupsieve==2.3.1
statsforecast==1.1.0
statsmodels==0.13.2
storage==0.0.4.3
threadpoolctl==3.1.0
toml==0.10.2
tqdm==4.63.0
typed-ast==1.5.4
typing-extensions==3.10.0.0
ujson==5.1.0
uritemplate==4.1.1
urllib3==1.26.8
sqlglot==12.2.0
stack-data==0.6.2
statsforecast==1.5.0
statsmodels==0.14.0
tenacity==8.2.2
tornado==6.3.1
tqdm==4.65.0
trace-updater==0.0.9.1
traitlets==5.9.0
triad==0.8.7
urllib3==2.0.2
wcwidth==0.2.6
Werkzeug==2.3.4