Supersedes #151; I'm copying its description etc. below.

This PR removes the `Column` class for `Parameter`, as it's not necessary to keep around, and removing it is even slightly faster. See the following benchmarks:

Benchmarks with `Column`

Benchmarks now

It remains to be seen whether the rest of the normalization (see #143) is beneficial in terms of performance or not. That may not be urgent, though, as the performance already seems to be better than ixmp's.
TODO:
Of course, one question still remains open: what about the rest of the "normalization" PRs? What about the `.data` attribute? Might it not be faster to also normalize that with its own table?

For future reference: I tried (though the branch is not on GitHub). I followed the same approach as in #143 and added a `float` "values" column as well as a relationship to `Unit` as "units" to the `ParameterData` table. This worked fine except for one part: the `UPSERT` functionality.

With our current implementation, we load the JSON dict into a `pd.DataFrame` and then use `combine_first()` to achieve an operation that's very similar (if not identical) to `UPSERT`.
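For illustration, a minimal sketch of that pattern (the column names and data here are made up; the actual implementation differs):

```python
import pandas as pd

# Existing parameter data, keyed by its index columns (hypothetical names).
existing = pd.DataFrame(
    {"region": ["World", "Europe"], "values": [1.0, 2.0], "units": ["EJ", "EJ"]}
).set_index("region")

# Incoming data: updates "Europe", inserts "Asia".
incoming = pd.DataFrame(
    {"region": ["Europe", "Asia"], "values": [5.0, 3.0], "units": ["EJ", "EJ"]}
).set_index("region")

# combine_first() keeps the caller's values wherever the indexes overlap and
# fills the remaining rows from the argument -- an in-memory UPSERT.
upserted = incoming.combine_first(existing)
```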
With a dedicated `ParameterData` table, we would not need to load the data into memory, but could instead make use of sqlite's and postgres' `INSERT ... ON CONFLICT DO UPDATE` construct, which is fully supported by sqlalchemy. This kind of statement requires us to specify the `UniqueConstraint` or columns that should be checked for a conflict.
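A sketch of how that could look with sqlalchemy's sqlite dialect, using a trimmed-down, hypothetical stand-in for the `ParameterData` table (the real attempt had 15 nullable `IndexSet`-linked columns):

```python
from sqlalchemy import (
    Column,
    Float,
    Integer,
    MetaData,
    String,
    Table,
    UniqueConstraint,
    create_engine,
)
from sqlalchemy.dialects.sqlite import insert

metadata = MetaData()

# Hypothetical, trimmed-down ParameterData table; the real attempt had 15
# nullable IndexSet-linked columns plus parameter__id.
parameter_data = Table(
    "parameter_data",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("parameter__id", Integer, nullable=False),
    Column("indexset_1", String),
    Column("values", Float),
    Column("units", String),
    UniqueConstraint("parameter__id", "indexset_1"),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)

stmt = insert(parameter_data).values(
    parameter__id=1, indexset_1="World", values=1.0, units="EJ"
)
stmt = stmt.on_conflict_do_update(
    # The conflict target must correspond to a unique constraint or index --
    # and this is exactly what breaks with 15 mostly-NULL columns.
    index_elements=["parameter__id", "indexset_1"],
    set_={"values": stmt.excluded["values"], "units": stmt.excluded["units"]},
)
with engine.begin() as conn:
    conn.execute(stmt)  # first execution inserts the row
    conn.execute(stmt)  # second one hits the conflict and updates in place
```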
And this is where the problems come in: as for `TableData`, I defined 15 columns for data linked to `IndexSet`s. All of these plus the `parameter__id` were grouped in the `UniqueConstraint`. However, when I ran a test that only added data to two of these columns and specified these columns for the conflict check, sqlalchemy told me it couldn't find a `UniqueConstraint` fitting these two columns.

I can think of a few workarounds, none of which seems satisfying:

- We could target the whole `UniqueConstraint` for the conflict check instead, but that would most likely not work either because most of the columns are `NULL`. And `NULL`s are distinct, so even rows where the non-`NULL` values are identical would be seen as distinct. Postgres offers a `NULLS NOT DISTINCT` clause, but sqlite doesn't support it, and it's generally accepted SQL standard that they are always distinct.
- We could store a placeholder value instead of `NULL`, such as `"___"`. However, since this is not intended behaviour, this would not make our DB model more stable.
- We could create partial unique indexes (see the sketch after this list): each `Parameter` with X columns would get a partial unique index for all its rows that is only defined for these X columns. This would likely lead to thousands of indexes on this DB table, which would all be checked for each query to figure out if they are applicable, and so would considerably slow down this table over time.
- We could partition the table by its `NULL` columns and make one partition for each category. Each of these might then be able to get a (partial) unique index covering the whole partition, and we would only need to ensure that each `ParameterData` row coming in is stored in the right partition. I have not tried this, nor checked if partitions generally support having distinct indexes, etc., so I don't know how well this might work.
- We could define a separate `UniqueConstraint` on each column. Then, specifying two columns would trigger both constraints, but we are not sure how these would be evaluated. Would there be a conflict only in the case where we want it to be, i.e. when both constraints fail, or would the conflict arise already when just one constraint fails, which we would consider premature?
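To make the first and third ideas concrete, a sketch in sqlalchemy terms, reusing the hypothetical `parameter_data` table from above and assuming it also has an `indexset_2` column; if I remember correctly, the `postgresql_nulls_not_distinct` flag requires Postgres 15+ and a recent sqlalchemy:

```python
from sqlalchemy import Index, UniqueConstraint

# Idea 1, postgres-only: declare the big constraint with NULLS NOT DISTINCT.
# Sqlite has no equivalent, so this alone doesn't solve our problem.
# Would live inside the Table() definition:
UniqueConstraint(
    "parameter__id",
    "indexset_1",
    "indexset_2",  # ...and so on for all 15 columns
    postgresql_nulls_not_distinct=True,
)

# Idea 3, partial unique indexes: one per column combination actually in use,
# e.g. for Parameters indexed by indexset_1 only. Multiplied across all
# combinations, this is the index explosion described above.
Index(
    "uq_parameter_data__indexset_1",
    parameter_data.c.parameter__id,
    parameter_data.c.indexset_1,
    unique=True,
    sqlite_where=parameter_data.c.indexset_2.is_(None),
    postgresql_where=parameter_data.c.indexset_2.is_(None),
)
```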
Last but not least, even if we figure this problem out, we would be left with a collection that already firmly falls into the category "large" in sqlalchemy, as we would frequently want to have several thousand `ParameterData` rows in `.data`. They offer a few techniques for handling these large collections, but the most prominent seems to be a `WriteOnlyCollection`, which is not what we need: we often want to read the whole data stored in a `Parameter`.
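For context, the write-only pattern in sqlalchemy 2.0 looks roughly like this (hypothetical declarative models, independent of the sketches above):

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    WriteOnlyMapped,
    mapped_column,
    relationship,
)


class Base(DeclarativeBase):
    pass


class Parameter(Base):
    __tablename__ = "parameter"

    id: Mapped[int] = mapped_column(primary_key=True)
    # A write-only collection is never loaded implicitly; reads go through an
    # explicit SELECT, e.g. session.scalars(parameter.data.select()). That is
    # the opposite of our access pattern, where we usually want the whole
    # collection at once.
    data: WriteOnlyMapped["ParameterData"] = relationship()


class ParameterData(Base):
    __tablename__ = "parameter_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    parameter__id: Mapped[int] = mapped_column(ForeignKey("parameter.id"))
```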
Sqlalchemy also offers various relationship loading techniques. Generally, `lazy` is the default and `selectin` is the most efficient. These can be set per query, which I admittedly did not try, but I set the whole relationship (in the table definition) of `IndexSet` to `IndexSetData` to `selectin` (since that seems to take most of the time in our profiles), and our benchmarks were actually a little slower than with `lazy`.
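For completeness, the per-query variant I didn't try would look something like this (assuming a mapped `IndexSet.data` relationship and an open `session`):

```python
from sqlalchemy import select
from sqlalchemy.orm import selectinload

# Eager-load IndexSet.data with one extra SELECT ... WHERE ... IN (...) for
# this query only, instead of changing the table-level default strategy.
stmt = select(IndexSet).options(selectinload(IndexSet.data))
indexsets = session.scalars(stmt).all()
```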
> [!NOTE]
> One thing that might still be worth a try is to revert `indexset.data` to a json list and see if that enhances our performance.

For the time being, I will not continue work on this. The profiles generated for #150 show that `combine_first()` makes up at most 10% of the runtime, almost negligible compared to `validate_data()`. And since we already seem to surpass ixmp in performance, this work can wait at least, if continued at all (due to the problems outlined above).