feat: introduce benchmarking framework #73
Conversation
Conflicts: Snakefile rules/collect.smk
Conflicts: rules/collect.smk
@daniel-rdt This PR is not ready yet. I still have a bunch of todos. Nevertheless, I'm already happy to receive early feedback from you.
Thanks @tgilon for this initial implementation. The architecture follows a very sensible logic, and thanks for documenting everything so thoroughly up to here.
I like the idea of using the Wen et al. (2022) methodology to assess the backcasting. One idea for the plotting here might be to reproduce a graph similar to the one they introduced for their graphical abstract, which gives a more visual overview of the overall performance with respect to that set of indicators.
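Purely as a sketch of that idea (the `plot_kpi_radar` helper and the already-normalised indicator values below are made up for illustration, not the PR's actual plotting code), a radar-style overview could look like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_kpi_radar(indicators: pd.Series, ax=None):
    """Draw one scenario's normalised KPI values on a polar axis."""
    labels = list(indicators.index)
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
    # Close the polygon by repeating the first point.
    angles = np.concatenate([angles, angles[:1]])
    values = np.concatenate([indicators.to_numpy(float), indicators.to_numpy(float)[:1]])

    if ax is None:
        _, ax = plt.subplots(subplot_kw={"projection": "polar"})
    ax.plot(angles, values, marker="o")
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels)
    ax.set_ylim(0, 1)
    return ax


# Dummy, already-normalised indicators (1 = perfect agreement with the benchmark).
kpis = pd.Series({"MAE": 0.8, "RMSE": 0.7, "Correlation": 0.9, "Bias": 0.85})
plot_kpi_radar(kpis)
plt.savefig("kpi_radar_example.pdf")
```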
I also left some comments and small suggestions (it looks like more than it is, no worries, because it is a lot of related small code suggestions). Some additional high-level comments I have:
- The difference between `build_benchmark` and `make_benchmark` is difficult to grasp from the rule names. Maybe we can find a clearer name for one or both of them? Maybe something like `build_benchmark_statistics` (since this is mainly computing outputs using the statistics module) and `make_benchmark_indicators` or `compare_benchmark_metrics`?
- Add documentation / an overview of the output files that include the benchmark results to the PR description.
Co-authored-by: Daniel Rüdt <[email protected]>
I'm just reviewing the addition of the version control commit hash to the plots and benchmark data. Approved! This looks good and will be immediately helpful in allowing us to track benchmark development over time.
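For reference, a minimal sketch of how a figure can be stamped with the current commit hash (the helper name and the annotation placement are illustrative assumptions, not necessarily how this PR implements it):

```python
import subprocess

import matplotlib.pyplot as plt


def get_commit_hash() -> str:
    """Return the short hash of the current HEAD, or 'unknown' outside a repo."""
    try:
        return (
            subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
# Stamp the commit in the bottom-right corner so exported plots stay traceable.
fig.text(0.99, 0.01, f"commit: {get_commit_hash()}", ha="right", va="bottom", fontsize=6)
fig.savefig("benchmark_plot_example.pdf")
```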
Conflicts: doc/configuration.rst
```python
    return df


def _convert_units(
```
Redundant with #116, let's consolidate it in `_helpers.py`.
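For illustration only, a vectorised helper along these lines could live in `_helpers.py` (the function name, signature and factor table are placeholders, not the #116 implementation):

```python
import pandas as pd

# Placeholder factor table; the real mapping would cover the units actually used.
UNIT_FACTORS = {
    ("MWh", "TWh"): 1e-6,
    ("GWh", "TWh"): 1e-3,
    ("MW", "GW"): 1e-3,
}


def convert_units(df: pd.DataFrame, from_unit: str, to_unit: str,
                  value_col: str = "value") -> pd.DataFrame:
    """Scale a long-format table's value column from one unit to another."""
    factor = UNIT_FACTORS[(from_unit, to_unit)]
    out = df.copy()
    out[value_col] = out[value_col] * factor  # vectorised, no row-wise apply
    out["unit"] = to_unit
    return out
```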
Thanks @tgilon. The benchmarking architecture looks good overall. Great work! :)
As this is my second round of review, I appreciate the renaming of the rules and the cleaned-up configuration logic; the flow is now much clearer to understand.
I do have a few comments that need to be addressed, as I found an issue with the temporal aggregation and the `add_loss_factors` calculation. The major points I have are:
- Fix temporal aggregation
- Fix `get_loss_factors` function
- Add KPI summary figure for all KPIs and / or optionally add a new summary figure that combines all KPIs into one
- Consolidate unit conversion with vectorized version from #97
- Improve logging in a few places
```python
# Plot overview
indicators = pd.read_csv(kpis_in, index_col=0)
plot_overview(indicators, kpis_out, scenario)
```
If we don't use one overview figure for all KPIs, we should plot the current overview plot for all KPIs
The current plot shows the magnitude of the error. I found it easier to communicate this than using a radar chart. Perhaps something to improve.
```python
)

# Clean data
available_years = set(benchmarks_tyndp.year).intersection(benchmarks_n.year)  # noqa: F841
```
I think it is worth adding a comment and a logger here that specify that for the given scenario (e.g. DE) only the subset of years is available for comparison (which is clear if you know the TYNDP methodology but is still useful to log). This also better explains why the script later produces a lot of "insufficient data for growth error calculation" messages.
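A possible shape for that comment and log message (wording and the dummy data frames below are illustrative only):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Dummy stand-ins for the benchmark tables used in the snippet above.
benchmarks_tyndp = pd.DataFrame({"year": [2030, 2040, 2050]})
benchmarks_n = pd.DataFrame({"year": [2040, 2050]})

# For the given scenario (e.g. DE) only a subset of years is available for
# comparison, which also explains the later "insufficient data" messages
# from the growth error calculation.
available_years = set(benchmarks_tyndp.year).intersection(benchmarks_n.year)
logger.info(
    "Benchmark comparison restricted to years available in both sources: %s",
    sorted(available_years),
)
```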
```python
# Check if at least two sources are available to compare
if len(df.columns) != 2:
    logging.info(f"Skipping table {table}, need exactly two sources to compare.")
    return pd.DataFrame(), pd.Series("NA", index=[table], name="Missing")
```
I ran into the case that, because I am testing with CY 2013, the generation profile table from the model run was empty. To trace back the issue I had to find the logger warning that states why that is the case. Instead, it would be much easier for debugging (usability) if the user could quickly identify here why the table was empty. We could e.g. add a hint about what to check. Something like:
Suggested change:

```python
# Check if at least two sources are available to compare
if len(df.columns) != 2:
    logging.info(f"Skipping table {table}, need exactly two sources to compare. Please make sure that you are using the correct climate year 2009 to compare results.")
    return pd.DataFrame(), pd.Series("NA", index=[table], name="Missing")
```
```python
df_agg = (
    df_map.groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()
```
I am wondering if we have any time series in GWh instead of GW, which would mean that the length of the interval matters and the aggregation method needs to be `sum()` instead of `mean()`. For all GW values, though, it makes sense.
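A sketch of how the aggregation could branch on the quantity type; the `unit` argument and the dispatch logic are assumptions for illustration, not existing code:

```python
import pandas as pd


def aggregate_timeseries(df_map: pd.DataFrame, unit: str) -> pd.DataFrame:
    """Aggregate to the coarser snapshots, choosing the method by quantity type."""
    group_cols = ["carrier", "scenario", "year", "table", "map"]
    grouped = df_map.groupby(group_cols)
    if unit.endswith("Wh"):
        # Extensive quantity (e.g. GWh): the bin value is the sum of hourly values.
        return grouped.sum()
    # Intensive quantity (e.g. GW): the bin value is the average power.
    return grouped.mean()
```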
```python
aggregation_map = (
    pd.Series(idx_agg.rename("map"), index=idx_agg).reindex(idx_full).ffill()
)
df_map = df.join(aggregation_map, on="snapshot", how="left")

df_agg = (
    df_map.groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()
```
One thing I noticed here is that the mapping forward-fills past the last entry even if the snapshots from the model only cover a subset of the year (e.g. one week, like in the test configs). When aggregating below, this means that the very last snapshot is wrong, as it takes the mean over all of the rest of the year. The solution is to first filter from the first to the last snapshot given in the model:
Suggested change:

```python
aggregation_map = (
    pd.Series(idx_agg.rename("map"), index=idx_agg).reindex(idx_full).ffill()
)
df_map = df.join(aggregation_map, on="snapshot", how="left")

df_agg = (
    df_map
    .loc[:,:,:,:,idx_agg[0]:idx_agg[-1]]
    .groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()
```
Makes sense to me. Careful if there were ever extensive quantities (like GWh), since the last bin `df_agg.loc[df_idx_agg[-1], "map"]` is only a single-hour bin.
Good point, maybe then it would be a good idea to also bring the `snapshot_weightings` into this to make sure that the length of that last snapshot is considered?
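One hedged way to fold the weightings in, assuming a per-snapshot duration series is available (e.g. derived from `n.snapshot_weightings`) and that `value`, `snapshot` and the grouping keys are plain columns; names are illustrative:

```python
import pandas as pd


def weighted_aggregate(df_map: pd.DataFrame, weights: pd.Series) -> pd.Series:
    """Duration-weighted mean of 'value' per aggregation bin.

    Assumes df_map has 'value', 'snapshot' and the grouping keys as columns,
    and weights maps each snapshot to its duration in hours.
    """
    df = df_map.join(weights.rename("weight"), on="snapshot")
    group_cols = ["carrier", "scenario", "year", "table", "map"]
    df["weighted"] = df["value"] * df["weight"]
    agg = df.groupby(group_cols)[["weighted", "weight"]].sum()
    # Short trailing bins no longer distort the aggregate, since each snapshot
    # contributes in proportion to its duration.
    return agg["weighted"] / agg["weight"]
```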
OK, I don't have a final review yet, but to speed up the process let me already add the questions I have instead of bunching them.
```
rule prepare_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/resources/benchmarks_s_{clusters}_{opts}_{sector_opts}_{planning_horizons}.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),
        RESULTS + "validation/resources/benchmarks_tyndp.csv",


rule make_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),


rule plot_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf",
            **config["scenario"],
            run=config["run"]["name"],
        ),
```
No need for the lambdas
Suggested change:

```
rule prepare_benchmarks:
    input:
        expand(
            RESULTS
            + "validation/resources/benchmarks_s_{clusters}_{opts}_{sector_opts}_{planning_horizons}.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),
        RESULTS + "validation/resources/benchmarks_tyndp.csv",


rule make_benchmarks:
    input:
        expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),


rule plot_benchmarks:
    input:
        expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf",
            **config["scenario"],
            run=config["run"]["name"],
        ),
```
```python
n.statistics.withdrawal(
    comps=demand_comps,
    bus_carrier=elec_bus_carrier + ["gas", "H2", "coal", "oil"],
    groupby=["bus"] + grouper,
    nice_names=False,
    aggregate_across_components=True,
)
.reindex(eu27_idx, level="bus")
.groupby(by=grouper)
.sum()
.loc[lambda x: ~x.index.get_level_values("carrier").isin(exclude_carriers)]
.groupby(level="bus_carrier")
.sum()
```
Isn't a gas power plant modelled as a link from `gas` to `elec` bus carriers?
I have not dealt with `withdrawal` with multiple bus_carriers at the same time. What is the logic in the example?
I would expect it to add an entry for a gas power plant like `bus="EU", bus_carrier="gas", carrier="OCGT", withdrawal="x tonne of gas"` (which actually should not be part of final energy demand, since it converts to electricity, which is only a secondary carrier). This would then be filtered out since the bus="EU" is not a member state?
I am really confused how that gives FED.
I think withdrawal is measuring primary energy consumption here, isn't it?
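To make the filtering step being discussed concrete, a toy pandas example (bus names and numbers invented) of how an EU-level gas withdrawal row would drop out of the `reindex` on member-state buses:

```python
import pandas as pd

# Toy withdrawal table (invented numbers) with a MultiIndex of bus / carrier /
# bus_carrier. The EU-wide gas bus entry for OCGT is dropped by the reindex
# on member-state buses, so it never enters the demand aggregation.
withdrawal = pd.Series(
    {
        ("DE0", "electricity distribution grid", "AC"): 450.0,
        ("FR0", "electricity distribution grid", "AC"): 380.0,
        ("EU", "OCGT", "gas"): 120.0,
    }
)
withdrawal.index = withdrawal.index.set_names(["bus", "carrier", "bus_carrier"])

eu27_idx = pd.Index(["DE0", "FR0"], name="bus")  # member-state buses only
filtered = withdrawal.reindex(eu27_idx, level="bus")
print(filtered)  # the ("EU", "OCGT", "gas") row is gone
```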
```python
replace_dict = SCENARIO_DICT.copy()
replace_dict.update({"Lh2": "LH2"})
```
Suggested change:

```python
replace_dict = SCENARIO_DICT | {"Lh2": "LH2"}
```
OK, nothing else jumped out at me.
Co-authored-by: Daniel Rüdt <[email protected]> Co-authored-by: Jonas Hörsch <[email protected]>
Closes #31.
Changes proposed in this Pull Request
This PR introduces a benchmarking framework for continuous and systematic validation of Open TYNDP model outputs against TYNDP 2024 scenarios. This framework provides flexible and scalable validation across multiple metrics and benchmarking methods.
The following metrics from the TYNDP 2024 Scenarios report are considered relevant for benchmarking:
The data is published in the Scenarios package.
This PR is based on the methodology proposed by Wen et al. (2022). This methodology provides a multi-criteria approach to ensure:
This methodology defines the following indicators:
Hourly time series from the TYNDP 2024 will be aggregated to match the temporal resolution of Open-TYNDP.
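In pandas terms this is essentially the `reindex`/`ffill`/`groupby` mapping discussed in the review comments above; a self-contained sketch with dummy data (variable names only mirror those snippets):

```python
import pandas as pd

idx_full = pd.date_range("2040-01-01", periods=24, freq="h")  # hourly benchmark snapshots
idx_agg = idx_full[::3]                                       # 3-hourly model snapshots

hourly = pd.DataFrame({"snapshot": idx_full, "value": range(24)})

# Forward-fill each hourly timestamp to the model snapshot that opens its bin.
aggregation_map = (
    pd.Series(idx_agg.rename("map"), index=idx_agg).reindex(idx_full).ffill()
)
df_map = hourly.join(aggregation_map, on="snapshot", how="left")

df_agg = df_map.groupby("map")["value"].mean()
print(df_agg)  # one averaged value per model snapshot
```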
Summary tables are computed for both the overall and per-carrier results.
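As a rough illustration of that step (the columns and the single error metric below are placeholders; the actual indicator set defined by the methodology is not reproduced here):

```python
import pandas as pd

# Long-format comparison of model vs. benchmark values (dummy numbers).
comparison = pd.DataFrame(
    {
        "carrier": ["solar", "solar", "onwind", "onwind"],
        "model": [120.0, 150.0, 300.0, 310.0],
        "benchmark": [100.0, 160.0, 280.0, 330.0],
    }
)

comparison["abs_error"] = (comparison["model"] - comparison["benchmark"]).abs()

# Per-carrier and overall summaries of a single placeholder metric.
per_carrier = comparison.groupby("carrier")["abs_error"].mean().rename("MAE")
overall = pd.Series({"MAE": comparison["abs_error"].mean()})

print(per_carrier)
print(overall)
```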
Tasks
- `method` configuration is needed

Workflow
- Benchmarking configuration in `config/benchmarking.default.yaml`.
- `retrieve_additional_tyndp_data`: Retrieve the TYNDP 2024 Scenarios Report Data Figures package for benchmarking purposes. This rule will be deprecated once the data bundle has been updated (Update TYNDP 2024 data bundle on Zenodo #87).
- `clean_tyndp_benchmark`: Read and process the raw TYNDP 2024 Scenarios Report data. The output data structure is a long-format table.
- `build_statistics`: Compute the benchmark statistics from the optimised network. Run for every planning horizon. The output data structure is a long-format table.
- `make_benchmark`: Compute accuracy indicators for comparing model results against reference data from TYNDP 2024. `make_benchmarks` to collect `make_benchmark` outputs.
- `plot_benchmark`: Generate visualisation outputs for model validation. `plot_benchmarks` to collect `plot_benchmarks` outputs.
- Outputs are stored in the `results/validation/` folder. This includes:
  - `results/validation/resources/` for processed inputs information from both Open-TYNDP and TYNDP 2024.
  - `results/validation/csvs_s_{clusters}_{opts}_{sector_opts}_all_years/` for quantitative information for each table
  - `results/validation/graphics_s_{clusters}_{opts}_{sector_opts}_all_years/` for figures of each table
  - `results/validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv` as summary table
  - `results/validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf` as summary figure

Open Issues
Loss factors have been compared between the `Other data and Conversions` sheet (starting at line 215) and the Annex VI of the Scenarios Methodology Report, p. 117. While most of the values are identical, two significant discrepancies have been observed for both DE00 and EE00. From the available information, it is also unclear whether the same loss factor is used for all the nodes in countries with multiple nodes (such as Luxembourg).

Notes
The preliminary observations of the DE scenario, using a temporal resolution of 720SEG, are summarised below.
• Demand requires validation
• Heat and biofuels require mapping
• Climate year mismatch: model uses 2013 data, available benchmark uses 2009
• NT are close but the match is not perfect
• Transport and prosumer demand not yet incorporated for DE and GA
• ~~Climate year mismatch: model uses 2013 data, available benchmark uses 2009~~ solved with #109
• Energy and non-energy industrial uses require disaggregation
• Aviation hydrogen demand not modelled
• "Small scale res" category requires specification
• Demand shedding not yet implemented
• Additional generation sources require improvements
• Domestic production and import sources require distinction
• Supply modelling incomplete
• Mapping complete
• No biomass import assumed
• Import coverage incomplete
Example of indicators extracted from `kpis_eu27_s_all__all_years.csv` for NT scenario with 45SEG:

Example of indicators extracted from `power_generation_s_all__all_years.csv` for NT scenario with 45SEG:

Example of figure created for the final energy demand for NT scenario in 2040 with 45SEG:

Example of figure created for the generation profiles for DE scenario in 2040 with 720SEG:

Example of summary figure created for NT scenario:

Checklist
- Changed dependencies are added to `envs/environment.yaml`.
- Changes in configuration options are added in `config/config.default.yaml`.
- Changes in configuration options are documented in `doc/configtables/*.csv`.
- Changes in configuration options are added in `config/test/*.yaml`.
- Newly added data sources are documented in `doc/data_sources.rst`.
- A release note in `doc/release_notes.rst` is added.
- Changes are reflected in the `README` and `doc/index.rst`.