Supersedes #151; I'm copying its description etc. below.

This PR removes the `Column` class for `Parameter`, as it's not necessary to keep around, and removing it is even slightly faster. See the following benchmarks:

Benchmarks with `Column`

Benchmarks now

It remains to be seen whether the rest of the normalization (see #143) is beneficial in terms of performance or not. That may not be urgent, though, as the performance already seems to be better than ixmp's.
TODO:
Of course, one question still remains open: what about the rest of the "normalization" PRs? What about the `.data` attribute? Might it not be faster to also normalize that with its own table?

For future reference: I tried (though the branch is not on GitHub). I followed the same approach as in #143 and added a `float` "values" column as well as a relationship to `Unit` as "units" to the `ParameterData` table. This worked fine except for one part: the `UPSERT` functionality.

With our current implementation, we load the JSON dict into a `pd.DataFrame` and then use `combine_first()` to achieve an operation that's very similar (if not identical) to `UPSERT`.
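For illustration, a minimal sketch of that pattern (the column names and data here are made up; the actual implementation differs):

```python
import pandas as pd

# Existing parameter data, keyed by its index columns (hypothetical names).
existing = pd.DataFrame(
    {"region": ["World", "Europe"], "values": [1.0, 2.0], "units": ["EJ", "EJ"]}
).set_index("region")

# Incoming data: updates "Europe", inserts "Asia".
incoming = pd.DataFrame(
    {"region": ["Europe", "Asia"], "values": [5.0, 3.0], "units": ["EJ", "EJ"]}
).set_index("region")

# combine_first() keeps the caller's values wherever the indexes overlap and
# fills the remaining rows from the argument -- an in-memory UPSERT.
upserted = incoming.combine_first(existing)
```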
With a dedicated `ParameterData` table, we would not need to load the data into memory, but could instead make use of sqlite's and postgres' `INSERT ... ON CONFLICT DO UPDATE` construct, which is fully supported by sqlalchemy. This kind of statement requires us to specify the `UniqueConstraint` or columns that should be checked for a conflict.
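A sketch of how that could look with sqlalchemy's sqlite dialect, using a trimmed-down, hypothetical stand-in for the `ParameterData` table (the real attempt had 15 nullable `IndexSet`-linked columns):

```python
from sqlalchemy import (
    Column,
    Float,
    Integer,
    MetaData,
    String,
    Table,
    UniqueConstraint,
    create_engine,
)
from sqlalchemy.dialects.sqlite import insert

metadata = MetaData()

# Hypothetical, trimmed-down ParameterData table; the real attempt had 15
# nullable IndexSet-linked columns plus parameter__id.
parameter_data = Table(
    "parameter_data",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("parameter__id", Integer, nullable=False),
    Column("indexset_1", String),
    Column("values", Float),
    Column("units", String),
    UniqueConstraint("parameter__id", "indexset_1"),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)

stmt = insert(parameter_data).values(
    parameter__id=1, indexset_1="World", values=1.0, units="EJ"
)
stmt = stmt.on_conflict_do_update(
    # The conflict target must correspond to a unique constraint or index --
    # and this is exactly what breaks with 15 mostly-NULL columns.
    index_elements=["parameter__id", "indexset_1"],
    set_={"values": stmt.excluded["values"], "units": stmt.excluded["units"]},
)
with engine.begin() as conn:
    conn.execute(stmt)  # first execution inserts the row
    conn.execute(stmt)  # second one hits the conflict and updates in place
```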
And this is where the problems come in: as for `TableData`, I defined 15 columns for data linked to `IndexSet`s. All of these plus the `parameter__id` were grouped in the `UniqueConstraint`. However, when I ran a test that only added data to two of these columns and specified these columns for the conflict check, sqlalchemy told me it couldn't find a `UniqueConstraint` fitting these two columns.

I can think of a few workarounds, none of which seems satisfying:

- We could target the whole `UniqueConstraint` for the conflict check instead, but that would most likely not work either because most of the columns are `NULL`. And `NULL`s are distinct, so even rows where the non-`NULL` values are identical would be seen as distinct. Postgres offers a `NULLS NOT DISTINCT` clause, but sqlite doesn't support it, and it's generally accepted SQL standard that they are always distinct.
- We could store a placeholder value instead of `NULL`, such as `"___"`. However, since this is not intended behaviour, this would not make our DB model more stable.
- We could create partial unique indexes (see the sketch after this list): each `Parameter` with X columns would get a partial unique index for all its rows that is only defined for these X columns. This would likely lead to thousands of indexes on this DB table, which would all be checked for each query to figure out if they are applicable, and so would considerably slow down this table over time.
- We could partition the table by its `NULL` columns and make one partition for each category. Each of these might then be able to get a (partial) unique index covering the whole partition, and we would only need to ensure that each `ParameterData` row coming in is stored in the right partition. I have not tried this, nor checked if partitions generally support having distinct indexes, etc., so I don't know how well this might work.
- We could define a separate `UniqueConstraint` on each column. Then, specifying two columns would trigger both constraints, but we are not sure how these would be evaluated. Would there be a conflict only in the case where we want it to be, i.e. when both constraints fail, or would the conflict arise already when just one constraint fails, which we would consider premature?
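To make the first and third ideas concrete, a sketch in sqlalchemy terms, reusing the hypothetical `parameter_data` table from above and assuming it also has an `indexset_2` column; if I remember correctly, the `postgresql_nulls_not_distinct` flag requires Postgres 15+ and a recent sqlalchemy:

```python
from sqlalchemy import Index, UniqueConstraint

# Idea 1, postgres-only: declare the big constraint with NULLS NOT DISTINCT.
# Sqlite has no equivalent, so this alone doesn't solve our problem.
# Would live inside the Table() definition:
UniqueConstraint(
    "parameter__id",
    "indexset_1",
    "indexset_2",  # ...and so on for all 15 columns
    postgresql_nulls_not_distinct=True,
)

# Idea 3, partial unique indexes: one per column combination actually in use,
# e.g. for Parameters indexed by indexset_1 only. Multiplied across all
# combinations, this is the index explosion described above.
Index(
    "uq_parameter_data__indexset_1",
    parameter_data.c.parameter__id,
    parameter_data.c.indexset_1,
    unique=True,
    sqlite_where=parameter_data.c.indexset_2.is_(None),
    postgresql_where=parameter_data.c.indexset_2.is_(None),
)
```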
Last but not least, even if we figure this problem out, we would be left with a collection that already firmly falls into the category "large" in sqlalchemy, as we would frequently want to have several thousand `ParameterData` rows in `.data`. They offer a few techniques for handling these large collections, but the most prominent seems to be a `WriteOnlyCollection`, which is not what we need: we often want to read the whole data stored in a `Parameter`.
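For context, the write-only pattern in sqlalchemy 2.0 looks roughly like this (hypothetical declarative models, independent of the sketches above):

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    WriteOnlyMapped,
    mapped_column,
    relationship,
)


class Base(DeclarativeBase):
    pass


class Parameter(Base):
    __tablename__ = "parameter"

    id: Mapped[int] = mapped_column(primary_key=True)
    # A write-only collection is never loaded implicitly; reads go through an
    # explicit SELECT, e.g. session.scalars(parameter.data.select()). That is
    # the opposite of our access pattern, where we usually want the whole
    # collection at once.
    data: WriteOnlyMapped["ParameterData"] = relationship()


class ParameterData(Base):
    __tablename__ = "parameter_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    parameter__id: Mapped[int] = mapped_column(ForeignKey("parameter.id"))
```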
Sqlalchemy also offers various relationship loading techniques. Generally, `lazy` is the default and `selectin` is the most efficient. These can be set per query, which I admittedly did not try, but I set the whole relationship (in the table definition) of `IndexSet` to `IndexSetData` to `selectin` (since that seems to take most of the time in our profiles), and our benchmarks were actually a little slower than with `lazy`.
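For completeness, the per-query variant I didn't try would look something like this (assuming a mapped `IndexSet.data` relationship and an open `session`):

```python
from sqlalchemy import select
from sqlalchemy.orm import selectinload

# Eager-load IndexSet.data with one extra SELECT ... WHERE ... IN (...) for
# this query only, instead of changing the table-level default strategy.
stmt = select(IndexSet).options(selectinload(IndexSet.data))
indexsets = session.scalars(stmt).all()
```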
> [!NOTE]
> One thing that might still be worth a try is to revert `indexset.data` to a json list and see if that enhances our performance.

For the time being, I will not continue work on this. The profiles generated for #150 show that `combine_first()` makes up at most 10% of the runtime, almost negligible compared to `validate_data()`. And since we already seem to surpass ixmp in performance, this work can wait at least, if continued at all (due to the problems outlined above).