
Conversation

@tgilon tgilon commented Jul 17, 2025

Closes #31.

Changes proposed in this Pull Request

This PR introduces a benchmarking framework for continuous, systematic validation of Open-TYNDP model outputs against the TYNDP 2024 scenarios. The framework is designed to scale flexibly across multiple metrics and benchmarking methods.

The following metrics from the TYNDP 2024 Scenarios report are considered relevant for benchmarking:

  • Exogenous Inputs:
    • Fig. 5-10
      • Benchmark Final Energy demand by fuel, EU27 (TWh), (Fig 5, p24 and Fig 51, p63)
      • Benchmark Electricity demand per sector, EU27 (TWh), (Fig 6, p25 and Fig 52, p63)
      • Benchmark Methane demand by sector, EU27 (TWh), (Fig 8, p27 and Fig 53, p64)
      • Benchmark Hydrogen demand by sector, EU27 (TWh), (Fig 10, p28 and Fig 54, p64)
  • Investment and dispatch modelling outputs:
    • Fig. 13-40
      • Benchmark of net installed capacity for electricity generation, EU27 (GW), (Fig 25, p39 and Fig 55, p65)
      • Benchmark of electricity generation, EU27 (TWh), (Fig 26, p39 and Fig 56, p65)
      • Benchmark methane supply, EU27 (TWh), (Fig 32, p45 and Fig 57, p66)
      • Benchmark hydrogen supply, EU27 (TWh), (Fig 33, p46 and Fig 58, p67)
      • Benchmark biomass supply, EU27 (TWh), (Fig 59, p67)
      • Benchmark energy imports, EU27 (TWh), (Fig 40, p51 and Fig 60, p68)
      • Hourly generation profile of power generation, Fig 30, p35

The reference data is published in the TYNDP 2024 Scenarios Report Data Figures package.

This PR is based on the methodology proposed by Wen et al. (2022), a multi-criteria approach that ensures:

  • the diversity (each indicator has its own added value),
  • the effectiveness (each indicator provides essential and correct information),
  • the robustness (against diverse units and orders of magnitude), and
  • the compatibility (can be used to compare across countries) of the selected set of indicators.

This methodology defines the following indicators (a minimal computation sketch follows the list):

  • Missing: Count of carriers / sectors dropped due to missing values
  • sMPE (Symmetric Mean Percentage Error): Indicates the direction of the deviation between modeled scenarios and TYNDP 2024 outcomes, showing if the output is overall overestimated or underestimated.
  • sMAPE (Symmetric Mean Absolute Percentage Error): Indicates the absolute magnitude of the deviations, avoiding the cancellation of negative and positive errors.
  • sMdAPE (Symmetric Median Absolute Percentage Error): Provides skewness information to complement sMAPE.
  • RMSLE (Root Mean Square Logarithmic Error): Complements the percentage errors since it shows the logarithmic deviation values.
  • Growth error: Shows the error on the temporal scale. This indicator is ignored for dynamic time series (i.e. hourly generation profiles).
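For orientation, a minimal sketch of how the four error indicators can be computed for a pair of aligned series; the Growth error and Missing count are omitted, and the exact normalisation in the PR follows Wen et al. (2022):

```python
import numpy as np
import pandas as pd

def error_indicators(model: pd.Series, reference: pd.Series) -> dict:
    """Accuracy indicators for aligned model vs. reference values (sketch only)."""
    spe = 2 * (model - reference) / (model.abs() + reference.abs())
    log_err = np.log1p(model) - np.log1p(reference)  # assumes non-negative values
    return {
        "sMPE": spe.mean(),            # signed: over- vs. underestimation
        "sMAPE": spe.abs().mean(),     # magnitude; positive/negative errors don't cancel
        "sMdAPE": spe.abs().median(),  # median complements sMAPE with skewness information
        "RMSLE": np.sqrt((log_err**2).mean()),  # robust to units and orders of magnitude
    }
```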

Hourly time series from TYNDP 2024 will be aggregated to match the temporal resolution of Open-TYNDP, for example as sketched below.
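A minimal sketch of one way to perform this aggregation, assuming hourly values are averaged onto (possibly non-uniform) snapshot segments via a forward-fill mapping, mirroring the approach discussed in the review below:

```python
import pandas as pd

def aggregate_to_snapshots(hourly: pd.Series, snapshots: pd.DatetimeIndex) -> pd.Series:
    """Average an hourly profile onto coarser, possibly non-uniform snapshots (sketch).

    Each hour is mapped to the last snapshot at or before it, then averaged
    per segment; hours before the first snapshot are dropped.
    """
    segments = pd.Series(snapshots, index=snapshots).reindex(hourly.index, method="ffill")
    return hourly.groupby(segments).mean()
```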

Summary tables are computed for both the overall and per-carrier results.

Tasks

  • Validate architecture
  • Implement demonstrator
  • Decide if method configuration is needed
  • Use standard retrieve method instead of custom
  • Simplify configuration structure
  • Distinguish build and make benchmark more clearly
  • Address the discrepancies in the profiles when using time aggregation
  • Add plotting rule
  • Overall indicators misleading when substantial missing data
  • Test for all the scenarios
  • Assess if default values from the configuration are still relevant
  • Document the new configurations (incl. overview of outputs files and release note)
  • Add version to outputs
  • Clean log
  • Filter out EU27 statistics

Workflow

  1. New configuration file config/benchmarking.default.yaml.
  2. retrieve_additional_tyndp_data: Retrieve the TYNDP 2024 Scenarios Report Data Figures package for benchmarking purposes. This rule will be deprecated once the data bundle has been updated (Update TYNDP 2024 data bundle on Zenodo #87).
  3. (new) clean_tyndp_benchmark: Read and process the raw TYNDP 2024 Scenarios Report data. The output data structure is a long-format table (illustrated after this list).
  4. (new) build_statistics: Compute the benchmark statistics from the optimised network. Run for every planning horizon. The output data structure is a long-format table.
    • This rule takes loss factors into account for the electricity demand. Loss factors from the Supply Tool are assumed to be the correct ones.
  5. (new) make_benchmark: Compute accuracy indicators for comparing model results against reference data from TYNDP 2024.
  6. (new) make_benchmarks: Collect the make_benchmark outputs.
  7. (new) plot_benchmark: Generate visualisation outputs for model validation.
  8. (new) plot_benchmarks: Collect the plot_benchmark outputs.
  9. The full set of files produced for the benchmarking is stored in the results/validation/ folder. This includes:
    • results/validation/resources/ for processed input information from both Open-TYNDP and TYNDP 2024.
    • results/validation/csvs_s_{clusters}_{opts}_{sector_opts}_all_years/ for quantitative information for each table
    • results/validation/graphics_s_{clusters}_{opts}_{sector_opts}_all_years/ for figures of each table
    • results/validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv as summary table
    • results/validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf as summary figure
    • the structure of these outputs can be validated in the artifacts of the GitHub CI (e.g. artifacts section here)
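To illustrate the long-format structure mentioned in steps 3 and 4, a hypothetical example; column names and values are illustrative only, not the actual schema:

```python
import pandas as pd

# Hypothetical long-format benchmark table; the actual columns are
# defined by clean_tyndp_benchmark / build_statistics.
benchmarks = pd.DataFrame(
    {
        "table": ["electricity_demand", "electricity_demand"],
        "scenario": ["NT", "NT"],
        "year": [2030, 2040],
        "carrier": ["electricity", "electricity"],
        "source": ["Open-TYNDP", "TYNDP 2024"],
        "value": [1234.5, 2345.6],  # TWh, made-up numbers
    }
)
```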


Open Issues

  • The planning year of the generation profiles is not documented. For the time being, this PR assumes 2040.
  • Two sources for loss factors have been identified: the Supply Tool (sheet "Other data and Conversions", starting at line 215) and Annex VI of the Scenarios Methodology Report, p. 117. While most of the values are identical, two significant discrepancies have been observed, for DE00 and EE00. From the available information, it is also unclear whether the same loss factor is used for all nodes in countries with multiple nodes (such as Luxembourg).
    • DE00: 0.05 (report) and 0.03 (supply tool) in 2030
    • EE00: 0.00 (report) and 0.07 (supply tool) in 2030
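For context, a minimal sketch of how such a loss factor could gross up net electricity demand. Whether losses scale demand as net * (1 + lf) or net / (1 - lf) is an assumption here; the PR's get_loss_factors / add_loss_factors logic defines the convention actually used:

```python
def apply_loss_factor(net_demand_twh: float, loss_factor: float) -> float:
    """Gross up net electricity demand by a loss factor (sketch only).

    Assumes losses are a share of net demand, i.e. gross = net * (1 + lf);
    with lf = 0.05 vs. 0.03 for DE00 in 2030, the two sources differ by ~2%.
    """
    return net_demand_twh * (1 + loss_factor)
```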

Notes

The preliminary observations of the DE scenario, using a temporal resolution of 720SEG, are summarised below.

Preliminary observations:

Final energy demand
  • Incomplete demand coverage identified
  • Demand requires validation
  • Heat and biofuels require mapping
  • Climate year mismatch: model uses 2013 data, available benchmark uses 2009

Electricity demand
  • Only aggregated value can be compared
  • NT values are close but the match is not perfect
  • Transport and prosumer demand not yet incorporated for DE and GA
  • Climate year mismatch: model uses 2013 data, available benchmark uses 2009 (solved with #109)

Methane demand
  • Sectoral mapping incomplete
  • Energy and non-energy industrial uses require disaggregation

Hydrogen demand
  • Coverage gaps in multiple sectors
  • Energy and non-energy industrial uses require disaggregation
  • Aviation hydrogen demand not modelled

Power capacity
  • Offshore wind expanded beyond expected values
  • "Small scale res" category requires specification
  • Demand shedding not yet implemented

Power generation
  • Offshore wind generation exceeds expected values
  • Additional generation sources require improvements

Methane supply
  • Supply coverage incomplete
  • Domestic production and import sources require distinction

Hydrogen supply
  • "Low-carbon" and "renewable import" categories need clarification
  • Supply modelling incomplete

Biomass supply
  • Supply appears underestimated
  • Mapping complete

Energy imports
  • Methane import disaggregation limited by data aggregation
  • No biomass import assumed
  • Import coverage incomplete

Generation profiles
  • Climate year mismatch: model uses 2013 data, available benchmark uses 2009

Example of indicators extracted from kpis_eu27_s_all__all_years.csv for NT scenario with 45SEG:

| Table | sMPE | sMAPE | sMdAPE | RMSLE | Growth Error | Missing | version |
|---|---|---|---|---|---|---|---|
| Final energy demand | -0.20 | 0.33 | 0.23 | 0.45 | 0.01 | 6 | v0.2+gb167cb17f |
| Electricity demand | 0.02 | 0.02 | 0.02 | 0.03 | 0.00 | 0 | v0.2+gb167cb17f |
| Methane demand | NA | | | | | | v0.2+gb167cb17f |
| Hydrogen demand | -0.53 | 0.53 | 0.52 | 0.72 | | 10 | v0.2+gb167cb17f |
| Power capacity | -0.53 | 0.68 | 0.36 | 5.61 | -0.01 | 3 | v0.2+gb167cb17f |
| Power generation | -0.13 | 0.82 | 0.67 | 3.97 | -0.01 | 2 | v0.2+gb167cb17f |
| Methane supply | NA | | | | | | v0.2+gb167cb17f |
| Hydrogen supply | -0.76 | 1.13 | 1.01 | 9.60 | -0.00 | 5 | v0.2+gb167cb17f |
| Biomass supply | -1.48 | 1.48 | 1.48 | 4.43 | 0.51 | 1 | v0.2+gb167cb17f |
| Energy imports | -1.34 | 1.36 | 2.00 | 27.07 | 0.14 | 2 | v0.2+gb167cb17f |
| Generation profiles | NA | | | | | | v0.2+gb167cb17f |
| Total (excl. time series) | -0.62 | 0.98 | 0.82 | 11.06 | 0.02 | 31 | v0.2+gb167cb17f |

Example of indicators extracted from power_generation_s_all__all_years.csv for NT scenario with 45SEG:

| Carrier | sMPE | sMAPE | sMdAPE | RMSLE | Growth Error | version |
|---|---|---|---|---|---|---|
| Battery | -2.00 | 2.00 | 2.00 | 13.72 | 0.49 | v0.2+gb167cb17f |
| CHP and small thermal | -1.67 | 1.67 | 1.67 | 2.83 | -0.17 | v0.2+gb167cb17f |
| Coal + other fossil | 0.56 | 0.56 | 0.56 | 0.59 | -0.03 | v0.2+gb167cb17f |
| Hydro and pumped storage | -0.16 | 0.16 | 0.16 | 0.17 | -0.01 | v0.2+gb167cb17f |
| Hydrogen | -1.71 | 1.71 | 1.71 | 12.21 | 1.54 | v0.2+gb167cb17f |
| Methane | 0.20 | 0.20 | 0.20 | 0.27 | 0.03 | v0.2+gb167cb17f |
| Nuclear | -0.81 | 0.81 | 0.81 | 0.97 | -0.07 | v0.2+gb167cb17f |
| Oil | -0.08 | 0.27 | 0.27 | 0.29 | 0.05 | v0.2+gb167cb17f |
| Solar | -0.05 | 0.05 | 0.05 | 0.05 | -0.00 | v0.2+gb167cb17f |
| Wind offshore | -0.00 | 0.00 | 0.00 | 0.00 | 0.00 | v0.2+gb167cb17f |
| Wind onshore | -0.05 | 0.05 | 0.05 | 0.06 | -0.00 | v0.2+gb167cb17f |
| Demand shedding | | | | | | v0.2+gb167cb17f |
| Small scale res | | | | | | v0.2+gb167cb17f |
| Biofuels | | | | | | v0.2+gb167cb17f |

Example of figure created for the final energy demand for NT scenario in 2040 with 45SEG:
(figure: benchmarking_fed_NT_2030)

Example of figure created for the generation profiles for DE scenario in 2040 with 720SEG:
(figure: benchmarking_gen_profiles_DE_2040)

Example of summary figure created for NT scenario:
(figure: benchmarking_overview_NT)

Checklist

  • I tested my contribution locally and it works as intended.
  • Code and workflow changes are sufficiently documented.
  • Changed dependencies are added to envs/environment.yaml.
  • Changes in configuration options are added in config/config.default.yaml.
  • Changes in configuration options are documented in doc/configtables/*.csv.
  • Changes in configuration options are added in config/test/*.yaml.
  • OET license identifier is added to all edited or newly created code files.
  • Sources of newly added data are documented in doc/data_sources.rst.
  • A release note doc/release_notes.rst is added.
  • Major features are listed in README and doc/index.rst.

@tgilon tgilon added this to the Release v0.3 milestone Jul 17, 2025
@tgilon tgilon self-assigned this Jul 17, 2025
@tgilon tgilon linked an issue Jul 17, 2025 that may be closed by this pull request
@daniel-rdt daniel-rdt self-assigned this Jul 17, 2025
@tgilon tgilon added the major feature label (Major feature for the Open TYNDP.) Jul 23, 2025
@tgilon (Member, Author) commented Aug 14, 2025

@daniel-rdt This PR is not ready yet. I still have a bunch of todos. Nevertheless, I'm already happy to receive early feedback from you.

@daniel-rdt (Member) left a comment

Thanks @tgilon for this initial implementation. The architecture follows a very sensible logic, and thanks for documenting everything so thoroughly up to here.
I like the idea of using the Wen et al. (2022) methodology to assess the backcasting. One idea for the plotting might be to reproduce a graph similar to the one they introduced as their graphical abstract, which gives a more visual overview of the overall performance with respect to this set of indicators.
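For illustration, a minimal matplotlib sketch of such a radar-style overview; the indicator values are placeholders, taken from the electricity demand row in the tables above:

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["sMPE", "sMAPE", "sMdAPE", "RMSLE", "Growth error"]
values = [0.02, 0.02, 0.02, 0.03, 0.00]  # placeholder: electricity demand, NT

# Close the polygon by repeating the first point
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values, angles = values + values[:1], angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_title("Benchmark indicators (placeholder values)")
plt.show()
```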

I also left some comments and small suggestions (it looks like more than it is, no worries, since many are related small code suggestions). Some additional high-level comments I have:

  • the difference between build_benchmark and make_benchmark is difficult to grasp from the rule names. Maybe we can find a clearer name for one or both of them? Maybe something like build_benchmark_statistics (since this is mainly computing outputs using the statistics module) and make_benchmark_indicators or compare_benchmark_metrics?
  • add documentation / overview of the output files that include the benchmark results to the PR description

@willu47 (Member) left a comment

I'm just reviewing the addition of the version control commit hash to the plots and benchmark data. Approved! This looks good and will be immediately helpful in allowing us to track benchmark development over time.

return df


def _convert_units(
Member Author:

Redundant with #116, let's consolidate it in _helpers.py.

@daniel-rdt (Member) left a comment

Thanks @tgilon. The benchmarking architecture looks good overall. Great work! :)
As this is my second round of review, I appreciate the renaming of the rules and the cleaned-up configuration logic; the flow is now much clearer to understand.

I do have a few comments that need to be addressed, as I found an issue with the temporal aggregation and the add_loss_factors calculation. The major points I have are:

  • Fix temporal aggregation
  • Fix get_loss_factors function
  • Add KPI summary figure for all KPIs and / or optionally add new summary figure that combines all KPIs into one
  • Consolidate unit conversion with vectorized version from #97
  • Improve logging in a few places


# Plot overview
indicators = pd.read_csv(kpis_in, index_col=0)
plot_overview(indicators, kpis_out, scenario)
Member:

If we don't use one overview figure for all KPIs, we should plot the current overview plot for all KPIs

Member Author:

The current plot shows the magnitude of the error. I found it easier to communicate this than using a radar. Perhaps something to improve.

)

# Clean data
available_years = set(benchmarks_tyndp.year).intersection(benchmarks_n.year) # noqa: F841
Member:

I think it is worth adding a comment and a logger call here specifying that, for the given scenario (e.g. DE), only a subset of years is available for comparison (which is clear if you know the TYNDP methodology, but is still useful to log). This also better explains why the script later reports insufficient data for the growth error calculation.
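For illustration, the requested log line could look like this (the scenario and available_years names are assumed from the surrounding script):

```python
logging.info(
    f"Scenario {scenario}: only years {sorted(available_years)} overlap between "
    "Open-TYNDP results and the TYNDP 2024 benchmark; other horizons are skipped."
)
```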

Comment on lines +418 to +421
# Check if at least two sources are available to compare
if len(df.columns) != 2:
    logging.info(f"Skipping table {table}, need exactly two sources to compare.")
    return pd.DataFrame(), pd.Series("NA", index=[table], name="Missing")
Member:

I ran into a case where, because I am testing with CY 2013, the generation profile table from the model run was empty. To trace the issue back, I had to find the logger warning that states why. It would be much easier for debugging (usability) if the user could quickly identify here why the table was empty. We could e.g. add a hint about what to check. Something like:

Suggested change
Before:

# Check if at least two sources are available to compare
if len(df.columns) != 2:
    logging.info(f"Skipping table {table}, need exactly two sources to compare.")
    return pd.DataFrame(), pd.Series("NA", index=[table], name="Missing")

After:

# Check if at least two sources are available to compare
if len(df.columns) != 2:
    logging.info(f"Skipping table {table}, need exactly two sources to compare. Please make sure that you are using the correct climate year 2009 to compare results.")
    return pd.DataFrame(), pd.Series("NA", index=[table], name="Missing")


df_agg = (
    df_map.groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()
Member:

I am wondering if we have any time series in GWh instead of GW, which would mean that the length of the interval matters and the aggregation method needs to be sum() instead of mean(). For all GW values, though, mean() makes sense.

Comment on lines +99 to +106
aggregation_map = (
    pd.Series(idx_agg.rename("map"), index=idx_agg).reindex(idx_full).ffill()
)
df_map = df.join(aggregation_map, on="snapshot", how="left")

df_agg = (
    df_map.groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()
Member:

One thing I noticed here is that the mapping forward-fills past the last entry, even if the snapshots from the model only cover a subset of the year (e.g. one week, as in the test configs). When aggregating below, this means the very last snapshot is wrong, as it takes the mean over all of the rest of the year. The solution is to first filter from the first to the last snapshot given in the model:

Suggested change
Before:

aggregation_map = (
    pd.Series(idx_agg.rename("map"), index=idx_agg).reindex(idx_full).ffill()
)
df_map = df.join(aggregation_map, on="snapshot", how="left")
df_agg = (
    df_map.groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()

After:

aggregation_map = (
    pd.Series(idx_agg.rename("map"), index=idx_agg).reindex(idx_full).ffill()
)
df_map = df.join(aggregation_map, on="snapshot", how="left")
df_agg = (
    df_map
    .loc[:, :, :, :, idx_agg[0]:idx_agg[-1]]
    .groupby(["carrier", "scenario", "year", "table", "map"])
    .mean()

Member:

Makes sense to me. Careful if there are ever extensive quantities (like GWh), since the last bin, df_agg.loc[df_idx_agg[-1], "map"], is only a single one-hour bin.

@daniel-rdt (Member) commented Oct 1, 2025:

Good point, maybe it would then be a good idea to also bring the snapshot_weightings into this, to make sure that the length of that last snapshot is considered?
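For illustration, a sketch of how the snapshot weightings could be folded in, assuming a weights series aligned with the model snapshots (e.g. n.snapshot_weightings.generators in PyPSA); a weighted mean then accounts for the length of each bin, including the single-hour last one:

```python
import pandas as pd

def weighted_segment_mean(values: pd.Series, weights: pd.Series, seg: pd.Series) -> pd.Series:
    """Weighted mean per aggregation segment (sketch only).

    `values` and `weights` share the snapshot index; `seg` maps each
    snapshot to its aggregation bin, as aggregation_map does above.
    """
    return (values * weights).groupby(seg).sum() / weights.groupby(seg).sum()
```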

@coroa (Member) left a comment

Ok, I don't have a final review yet, but to speed up the process let me already add the questions I have instead of bunching them.

Comment on lines +120 to +148
rule prepare_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/resources/benchmarks_s_{clusters}_{opts}_{sector_opts}_{planning_horizons}.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),
        RESULTS + "validation/resources/benchmarks_tyndp.csv",


rule make_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),


rule plot_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf",
            **config["scenario"],
            run=config["run"]["name"],
        ),
Member:

No need for the lambdas

Suggested change
Before:

rule prepare_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/resources/benchmarks_s_{clusters}_{opts}_{sector_opts}_{planning_horizons}.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),
        RESULTS + "validation/resources/benchmarks_tyndp.csv",

rule make_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),

rule plot_benchmarks:
    input:
        lambda w: expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf",
            **config["scenario"],
            run=config["run"]["name"],
        ),

After:

rule prepare_benchmarks:
    input:
        expand(
            RESULTS
            + "validation/resources/benchmarks_s_{clusters}_{opts}_{sector_opts}_{planning_horizons}.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),
        RESULTS + "validation/resources/benchmarks_tyndp.csv",

rule make_benchmarks:
    input:
        expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.csv",
            **config["scenario"],
            run=config["run"]["name"],
        ),

rule plot_benchmarks:
    input:
        expand(
            RESULTS
            + "validation/kpis_eu27_s_{clusters}_{opts}_{sector_opts}_all_years.pdf",
            **config["scenario"],
            run=config["run"]["name"],
        ),

Comment on lines +99 to +111
n.statistics.withdrawal(
    comps=demand_comps,
    bus_carrier=elec_bus_carrier + ["gas", "H2", "coal", "oil"],
    groupby=["bus"] + grouper,
    nice_names=False,
    aggregate_across_components=True,
)
.reindex(eu27_idx, level="bus")
.groupby(by=grouper)
.sum()
.loc[lambda x: ~x.index.get_level_values("carrier").isin(exclude_carriers)]
.groupby(level="bus_carrier")
.sum()
Member:

Isn't a gas power plant modelled as a link from the gas to the elec bus carrier?

I have not dealt with withdrawal with multiple bus_carriers at the same time. What is the logic in this example?

I would expect it to add, for a gas power plant, an entry:

bus="EU", bus_carrier="gas", carrier="OCGT", withdrawal="x tonne of gas"

(which actually should not be part of final energy demand, since it converts to electricity, which is only a secondary carrier). Would this entry then be filtered out, since bus="EU" is not a member state?

I am really confused how that gives FED.

Member:

I think withdrawal is measuring primary energy consumption here, isn't it?

Comment on lines +92 to +93
replace_dict = SCENARIO_DICT.copy()
replace_dict.update({"Lh2": "LH2"})
Member:

Suggested change
Before:

replace_dict = SCENARIO_DICT.copy()
replace_dict.update({"Lh2": "LH2"})

After:

replace_dict = SCENARIO_DICT | {"Lh2": "LH2"}

@coroa (Member) left a comment

Ok, nothing else jumped out at me.

Labels: major feature (Major feature for the Open TYNDP.)
Successfully merging this pull request may close these issues: Integration of automated tests and benchmarks.
4 participants