
Change Log, Checkpoints and Run-Locks #166

Draft · wants to merge 49 commits into base: main
Conversation

@meksor (Contributor) commented Mar 4, 2025

This PR is meant to contain the three major parts of what we internally often call the "versioning" feature.

Change Log

The change log is an in-database representation of all transactions run on a given ixmp4 platform.
It requires generating "version tables" from the database models (by adding columns and removing constraints) and a central "transaction" table that tracks all committed transactions and is referenced by each version table.
This part of the feature is built around database-level concepts, so it may cover more functionality than strictly needed (e.g. there is no way to update a region through the API, yet the system still tracks update statements on the region table).
Currently this is handled by sqlalchemy-history, since it supports alembic migrations; it could be replaced with sqlalchemy-continuum, which would enable database-native versioning for postgres via postgres triggers.
Enabling and disabling these features on a per-platform basis (to avoid performance problems with sqlite) is still somewhat unclear to me and will need more time investment.
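
For reference, a minimal sketch (not this PR's actual code) of how this family of libraries is typically wired up; the Region model and its columns here are illustrative:

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_continuum import make_versioned

make_versioned(user_cls=None)  # must run before the models are defined

Base = declarative_base()

class Region(Base):
    __tablename__ = "region"
    __versioned__ = {}  # opt this model into version tracking
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String, unique=True)

# Configuring the mappers generates a `region_version` table plus a shared
# `transaction` table; every flush then also writes version rows.
sa.orm.configure_mappers()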

The tricky parts:

  1. This change entails an inevitable performance regression. While it will not change scaling characteristics (think big O), it will significantly worsen all linear operations. What was once a single query statement might become three or more (sequential!) statements with this change:

INSERT INTO xy (a, b, c) VALUES (...);

-- becomes:

INSERT INTO xy (a, b, c) VALUES (...);
INSERT INTO transactions (issued_at) VALUES (...);
INSERT INTO xy_versions (a, b, c, transaction__id, operation_type) VALUES (...);

Most of the overhead can be alleviated by using database-native table triggers, but these are hard to write and maintain, so I would suggest relying on a library for that as well.

  2. What to version & how to handle relationships

Versioning the data points table puts me in front of a choice. We can either:
a) prevent deletes entirely for all regions, variables and units, hiding the unused ones when SELECTing, or
b) version every single table in the database that is subject to a relation and query the version variant of the related table when rolling back.

I went with option b).

Run Checkpoints

To be able to distinguish different workflow steps in a sea of "transaction" records (one ixmp4 API call might create multiple records), we will need to add another table to ixmp4 which references run__id and transaction__id and stores a user-supplied message.

We now have *four* relevant tables.

The original table xy.

| a | b | c |
|---|---|---|
| A | B | C |

The version table xy_version.

| a | b | c | transaction_id | end_transaction_id | operation_type |
|---|---|---|----------------|--------------------|----------------|
| A | B | C | 1 | NULL | 0 |

The transaction table transactions.

| id | issued_at |
|----|-----------|
| 1 | 13:37 27.08.1999 |

The new checkpoints table. May also include a timestamp and user information.

| message | transaction_id | run_id |
|---------|----------------|--------|
| "Hello World" | 1 | 1 |

The last table (checkpoints) can be used to query all workflow steps for a specific run and to revert to said steps by looking at which transaction record they point to and replaying or rolling back the change history.
This would probably look kind of like this:

run = ...
with run.acquire_lock():
    run.checkpoints.create("Create run")
    run.iamc.add(my_data)
    run.checkpoints.create("Add data")

run.checkpoints.tabulate()

# msg | transaction_id
# "Create run" | 3
# "Add data" | 5

run.checkpoints.list()[0].revert()

run.iamc.tabulate()

# empty ... 

Run Locks

To make sure that a user does not include unintended changes by a different user, a locking mechanism is needed for each "run" object/row. A user shall be able to acquire a lock, preventing all other users from writing to objects related to the run.
Still unclear to me: if a user tries to acquire the lock on a locked run, should the operation
a) block until the other user releases the lock, or
b) fail immediately?

Also, I remember the uploads crashing and not releasing the lock being a problem in the past. How do we want to handle this in the future?
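
Both options map naturally onto row-level database locks. A minimal sketch, assuming SQLAlchemy and a mapped Run class (the names RunIsLocked and acquire_run_lock are illustrative, not this PR's API):

from sqlalchemy import select
from sqlalchemy.exc import OperationalError

class RunIsLocked(Exception):
    """Raised when another client holds the lock (illustrative)."""

def acquire_run_lock(session, run_id: int, block: bool = True):
    # `Run` is assumed to be a mapped ORM class with primary key `id`.
    stmt = select(Run).where(Run.id == run_id)
    # On postgres, nowait=True raises immediately if the row is locked
    # (option b); nowait=False blocks until the lock is released (option a).
    stmt = stmt.with_for_update(nowait=not block)
    try:
        return session.execute(stmt).scalar_one()
    except OperationalError as exc:
        raise RunIsLocked(run_id) from exc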

@glatterf42 (Member)

@khaeru, this PR and the open questions are likely interesting to you, too. Please chime in if you know the answers/have some preferences for those :)

I imagine people would not be happy to wait an unknown amount of time for someone else to release a lock, so my guess is: if you need a lock but can't get it, we should immediately raise an error.
And regarding the failed uploads in the past: I'm not sure what our options are. Theoretically, we could release any lock once an operation finishes, successful or not, but if it's not successful, additional edits from some other person who doesn't know about the failure might not be helpful. Is there a way to mark the Run as "broken" somehow, so that whoever acquires the lock next knows what they're getting into? Maybe they could even reach out to whoever held the lock last.

@khaeru (Member) commented Mar 6, 2025

> @khaeru, this PR and the open questions are likely interesting to you, too. Please chime in if you know the answers/have some preferences for those :)

Okay, will try! Here is some info that I hope is useful: maybe not direct answers to the questions, but context that might make the answers/requirements clearer.

First some things I am not 100% clear on, but just stating here so I can be corrected if I've misunderstood:

  • A Run has a unique ID.
  • message_ix etc. ("etc." to include message-ix-models, message_data, and the whole mass of other user code known and unknown to us) expects that the combination of (model name, scenario name, version) is unique. @glatterf42, through "Enable populate_test_platform() with IXMP4Backend" (ixmp#560) and previous work, has been able to implement this, meaning that all 3 of these identifiers also exist in ixmp4.
  • There is a 1:1 correspondence between (run_id) and (model name, scenario name, version).
  • The PR is adding tables with a column 'transaction_id' and also the name (in the example code snippet) 'checkpoint', which I sense are the same thing.

Some comments about requirements from the message_ix side:

  • message_ix etc. does not rely on the capability to 'revert' (per the snippet) from one (model name, scenario name, version, transaction_id) to (same, same, same, earlier transaction ID). IOW it would be an added feature and not affect whether we can provide a smooth migration path to code based on message_ix.
  • What they do expect/rely on:
    • Once a particular set of changes is 'committed' (term used in ixmp/ixmp_source), that becomes the content associated with the (model name, scenario name, version).
    • If a particular set of changes fails during a 'check_out'/'with TimeSeries.transact()' block, or is explicitly 'discarded', it disappears, and the prior content remains.
    • Nothing like 'transaction_id' is exposed to the user.
    • Thus, the user is not able to list, inspect, or access (model name, scenario name, version, transaction_id earlier than the latest).
  • I don't know if that makes it possible to have a different implementation, perhaps one with less performance impact (time, or storage, e.g. by not storing more than ~2 transaction_id associated with a given run_id) or maintenance overhead, but providing that info in case.
  • Per "uploads crashing and not releasing a lock", the term 'upload' I understand as referring to the Scenario Explorer and IAMC-structured ('time-series') data, and I am not aware of issues there.
    • Similar but distinct, we did have frequent issues where users would call ixmp.TimeSeries.check_out() (usually as message_ix.Scenario.check_out(), which is identical). Then their code would error or crash. Because they never explicitly called TimeSeries.commit() or TimeSeries.discard(), the database would record that the object (~ixmp4 Run) was "checked out to" them, and not allow any user (them, or another) to change it.
    • To address this issue I added (Add TimeSeries.transact(); clean up code tree ixmp#422) TimeSeries.transact() that—like the .acquire_lock() in the example above—is a context manager that avoids leaving the object in a "checked out" state, and have tried to encourage people to always use this.
    • This appears to have worked, at least I see fewer if any mentions of "locked Scenarios" that need support intervention.
    • So I support that ixmp4 has such a feature, and it is probably feasible to give a migration pathway from ixmp.TimeSeries.transact() to Run.acquire_lock().
    • I would also further say it is fine if ixmp4 does not support the older ixmp workflow along the lines of:
      s.check_out()
      s.add_par(...)
      if condition:
          s.commit()
      else:
          s.discard()
      In other words, if only "atomic transactions" via with run.acquire_lock() are provided, that would be livable. We can add noisy warnings to ixmp.TimeSeries.check_out() during the shim/migration period that say check_out() will not ever be supported with IXMP4Backend and encourage users to switch to using acquire_lock/transact. The 'check_out'/…/'discard' workflow was I guess designed with an eye to interactive use, e.g. in IPython/Jupyter, but with .transact() we can now teach users "best practices" without the code being more verbose.

In terms of locking/versioning/etc., here are some example cases and what I think we would expect:

  1. User A is working with a local database. They have multiple processes (e.g. running test suite jobs in parallel using pytest-xdist, or actually solving models through GAMS) that all try to make clones/copies of the same (model name, scenario name, version)—let's say version=1—around the same time. Each of these should get a distinct version number, e.g. (model name, scenario name, 2), (model name, scenario name, 3), (model name, scenario name, 4), etc. with associated run_ids. The calls that happen to arrive second, third, etc. should block (probably for a very short time) until they can be processed.
  2. User B and User C are working with a shared database, however ixmp4 supports doing that (e.g. both via an ixmp4 API to a server process that talks to a local database on its own machine; or each connecting directly to the same database server). They each try to clone the same (model name, scenario name, version=1). Each gets a distinct (model name, scenario name, 2) or (model name, scenario name, 3).
  3. User D and User E, working with a shared database, both access the same (model name, scenario name, version), and try to run some code like:
    with run.acquire_lock():
        run.checkpoints.create("Create run")
        run.optimization....  # Add, remove, or modify some data
        run.checkpoints.create("Add data")
    
    If User D's operation starts when User E's operation arrives, then run.acquire_lock() blocks for a specific, configurable amount of time to wait for User D's operation to conclude. If as a result of User D's changes (e.g. suppose they remove a member from an IndexSet) User E's changes (e.g. adding elements to a Parameter using that set member as a key-value) fail, then upon such failure the Run is not locked, and their changes are discarded.
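
For case 3, the configurable wait could be a plain retry loop around lock acquisition. A sketch, assuming a backend lock call that raises an "is locked" exception (the names follow the locking API discussed later in this thread):

import time

def lock_with_timeout(backend, run_id: int, timeout: float = 30.0, poll: float = 0.5):
    # Retry backend.runs.lock() until it succeeds or the deadline passes.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return backend.runs.lock(run_id)
        except Run.IsLocked:
            if time.monotonic() >= deadline:
                raise  # give up after the configured wait
            time.sleep(poll)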

In general:

  • I think we should look to and follow the standard library's semantics, for example the sqlite3 module's connection context manager.
  • It would be good to consider whether words like 'transact(ion)', 'commit', 'checkpoint', etc. have existing usage and meanings (a) in ixmp, message_ix etc. or (b) in likely users' mind ('commit' e.g. because many people may also be Git users; fewer are RDBMS users); and consider whether the meaning in ixmp4/message_ix is different. This can help us avoid inviting potential confusion, as that creates work to steer users clear or help correct their misunderstanding.

codecov bot commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 88.60759% with 72 lines in your changes missing coverage. Please review.

Project coverage is 87.1%. Comparing base (c7ea1f3) to head (7c51f65).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ixmp4/data/db/run/repository.py | 70.9% | 18 Missing ⚠️ |
| ixmp4/data/db/base.py | 94.6% | 7 Missing ⚠️ |
| ixmp4/data/abstract/checkpoint.py | 78.5% | 6 Missing ⚠️ |
| ixmp4/data/api/checkpoint.py | 80.6% | 6 Missing ⚠️ |
| ixmp4/data/db/checkpoint.py | 87.5% | 6 Missing ⚠️ |
| ixmp4/data/abstract/run.py | 78.5% | 3 Missing ⚠️ |
| ixmp4/data/db/meta/repository.py | 78.5% | 3 Missing ⚠️ |
| ixmp4/core/run.py | 91.6% | 2 Missing ⚠️ |
| ixmp4/data/db/iamc/datapoint/repository.py | 81.8% | 2 Missing ⚠️ |
| ixmp4/data/db/iamc/measurand.py | 75.0% | 2 Missing ⚠️ |

... and 14 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #166     +/-   ##
=======================================
+ Coverage   87.0%   87.1%   +0.1%     
=======================================
  Files        230     235      +5     
  Lines       8170    8711    +541     
=======================================
+ Hits        7112    7593    +481     
- Misses      1058    1118     +60     
| Files with missing lines | Coverage Δ |
|---|---|
| ixmp4/core/checkpoints.py | 100.0% <100.0%> (ø) |
| ixmp4/core/exceptions.py | 95.6% <100.0%> (+0.2%) ⬆️ |
| ixmp4/core/iamc/data.py | 91.6% <100.0%> (+0.1%) ⬆️ |
| ixmp4/data/abstract/__init__.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/annotations.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/base.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/iamc/datapoint.py | 92.5% <ø> (ø) |
| ixmp4/data/abstract/iamc/measurand.py | 96.0% <ø> (ø) |
| ixmp4/data/abstract/iamc/timeseries.py | 71.0% <ø> (ø) |
| ixmp4/data/abstract/iamc/variable.py | 83.3% <ø> (ø) |

... and 64 more

@meksor (Contributor, Author) commented Mar 20, 2025

> The PR is adding tables with a column 'transaction_id' and also the name (in the example code snippet) 'checkpoint', which I sense are the same thing.

No! Checkpoints connect a particular run (id) to a transaction; they serve as points to roll back to. But I think this might be irrelevant for the message integration anyway.

> I don't know if that makes it possible to have a different implementation, perhaps one with less performance impact (time, or storage, e.g. by not storing more than ~2 transaction_id associated with a given run_id) or maintenance overhead, but providing that info in case.

The revert logic is always the same: select the relevant objects from the version table at the particular transaction, then delete all current data and insert the old data. For a checkpoint, we get run__id and transaction__id from the checkpoint itself; for the run.transact() feature, we just have to remember the transaction at which the run was locked (currently done in a new column in the run table).
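
A sketch of that revert logic, written against the example tables from the PR description (xy, xy_versions) rather than ixmp4's real schema:

from sqlalchemy import Column, Integer, MetaData, String, Table, delete, insert, select

metadata = MetaData()
xy = Table("xy", metadata, Column("a", String), Column("b", String), Column("c", String))
xy_versions = Table(
    "xy_versions", metadata,
    Column("a", String), Column("b", String), Column("c", String),
    Column("transaction_id", Integer), Column("end_transaction_id", Integer),
    Column("operation_type", Integer),
)

def revert_xy(session, transaction_id: int) -> None:
    # A version row is valid at `transaction_id` if it was written at or
    # before that transaction and not superseded by then.
    valid_rows = select(xy_versions.c.a, xy_versions.c.b, xy_versions.c.c).where(
        xy_versions.c.transaction_id <= transaction_id,
        (xy_versions.c.end_transaction_id.is_(None))
        | (xy_versions.c.end_transaction_id > transaction_id),
    )
    # Delete all current data, then re-insert the historical rows.
    session.execute(delete(xy))
    session.execute(insert(xy).from_select(["a", "b", "c"], valid_rows))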

> Similar but distinct, we did have frequent issues where users would call ixmp.TimeSeries.check_out() (usually as message_ix.Scenario.check_out(), which is identical). Then their code would error or crash. Because they never explicitly called TimeSeries.commit() or TimeSeries.discard(), the database would record that the object (~ixmp4 Run) was "checked out to" them, and not allow any user (them, or another) to change it.

> In other words, if only "atomic transactions" via with run.acquire_lock() are provided, that would be livable. We can add noisy warnings to ixmp.TimeSeries.check_out() during the shim/migration period that say check_out() will not ever be supported with IXMP4Backend and encourage users to switch to using acquire_lock/transact. The 'check_out'/…/'discard' workflow was I guess designed with an eye to interactive use, e.g. in IPython/Jupyter, but with .transact() we can now teach users "best practices" without the code being more verbose.

Ok good, I think this approach is preferable for me as well.

Example cases: 1 and 2 seem doable with the same logic, but that logic is out of scope for this PR.
I think the race condition you describe is theoretically possible, but the chance of it occurring is very small.

3.: Okay, the timeout needs to be implemented; I will need to look into that.

> It would be good to consider whether words like 'transact(ion)', 'commit', 'checkpoint', etc. have existing usage and meanings (a) in ixmp, message_ix etc. or (b) in likely users' mind ('commit' e.g. because many people may also be Git users; fewer are RDBMS users); and consider whether the meaning in ixmp4/message_ix is different. This can help us avoid inviting potential confusion, as that creates work to steer users clear or help correct their misunderstanding.

I agree; unfortunately, all the names I can think of are taken. PostgreSQL even has "checkpoints" as well, so in fact all of these terms ('transact(ion)', 'commit', 'checkpoint') have (sometimes multiple) other meanings.

@meksor (Contributor, Author) commented Mar 24, 2025

Hi, the first version of this feature is available on this branch.
There are still a few open issues which I might try to solve on auxiliary branches and merge into this one.

  • Exchange sqlalchemy-history for sqlalchemy-continuum in order to use its SQL trigger generation functionality. This will (hopefully, presumably) get rid of some of the manual version object generation code (e.g. in bulk_insert_versions) and speed things up enough that the regression is acceptable.
  • Disable versioning for specific platforms, as the SQL trigger generation will only work for postgres. This is tricky, since I assume I need to find a way to run the sqlalchemy mapper in a "reentrant" way. It also means throwing and testing a new exception.
  • Implement versioning and revert logic for optimization data, which is best done by Fridolin or at least with Fridolin's help.

Further, I need clarification on some issues from @danielhuppmann and @khaeru. Details follow.

Regions and Units

As of now, regions and units are tracked as versioned objects but not rolled back with a run. This means the following code will throw a RegionNotFound error.

region1 = platform.regions.create("Region 1")
platform.regions.create("Region 2")

run = platform.runs.create(model, scenario)
data_with_region1 = pd.DataFrame(...) # Has IAMC Data that references "Region 1"
data_with_region2 = pd.DataFrame(...) # Has IAMC Data that references "Region 2"

with run.transact():
    run.iamc.add(data_with_region1)
    run.checkpoints.create("Add data with region 1")

    run.iamc.add(data_with_region2)
    run.checkpoints.create("Add data with region 2")

    run.iamc.remove(data_with_region1)
    run.checkpoints.create("Remove data with region 1")

platform.regions.delete(region1)

with run.transact():
    # Revert to "Add data with region 1"
    run.checkpoints[1].revert() # Will throw an error!!

This seems good to me, because if incorrect regions were deleted, we do not want to implicitly create them again when rolling back a run.
What do you think?

Automatic Rollback

Right now, every time a run exits the transact() context manager, it rolls back to the last checkpoint created. This means that if the user forgets to create a checkpoint after adding data, the run will be empty.
This also adds quite a bit of overhead, especially via the web API.

with run.transact():
    run.iamc.add(test_data_1)
    run.checkpoints.create("Add test data 1")
    run.iamc.add(test_data_2)

# the run will now contain only test_data_1
assert run.iamc.tabulate() == test_data_1

Is that right or should the data stay?

Client-Side Locking Mechanism

The run lock is now implemented in the facade layer using the two backend methods backend.run.lock and backend.run.unlock. The client has to keep track of which locks it owns, as the methods behave as follows:

backend.run.lock(run_id) will check the specified run's lock_transaction column. If the column contains NULL, it is updated to contain the id of the last transaction in the transactions table.
If the column already contains a transaction id, the Run.IsLocked exception is thrown.

backend.run.unlock(run_id) will always set the lock_transaction column to NULL, regardless of the current value.
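
In other words, lock() amounts to a compare-and-set on the run row. A sketch (illustrative; Run and Transaction stand in for the actual ixmp4 models):

from sqlalchemy import func, select, update

def lock(session, run_id: int) -> None:
    last_tx = session.execute(select(func.max(Transaction.id))).scalar_one()
    result = session.execute(
        update(Run)
        .where(Run.id == run_id, Run.lock_transaction.is_(None))
        .values(lock_transaction=last_tx)
    )
    if result.rowcount == 0:
        raise Run.IsLocked(run_id)  # the column already held a transaction id

def unlock(session, run_id: int) -> None:
    # Always clears the column, regardless of the current value.
    session.execute(update(Run).where(Run.id == run_id).values(lock_transaction=None))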

Hence, the transact context manager tries to acquire the lock via backend.run.lock(run_id) and tracks that the operation succeeded in the owns_lock property of the corresponding run facade model.

    @contextmanager
    def transact(self) -> Generator[None, None, None]:
        # Acquire the lock; raises Run.IsLocked if another client holds it.
        self._model = self.backend.runs.lock(self._model.id)
        self.owns_lock = True

        yield

        # Find the latest checkpoint created inside the block.
        checkpoint_df = self.checkpoints.tabulate()
        checkpoint_transaction = int(checkpoint_df["transaction__id"].max())

        assert self._model.lock_transaction is not None

        if checkpoint_transaction > self._model.lock_transaction:
            # Roll back to the last checkpoint created in this block ...
            self.backend.runs.revert(self._model.id, checkpoint_transaction)
        else:
            # ... or, if none was created, to the state at lock time.
            self.backend.runs.revert(self._model.id, self._model.lock_transaction)

        self._model = self.backend.runs.unlock(self._model.id)
        self.owns_lock = False

A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

LockRequired Exception

Since the client now keeps track of which locks it owns, we can throw exceptions in the facade layer when an operation that requires a lock is attempted.
Currently this applies to:

  • run.iamc.add
  • run.iamc.remove
  • run.checkpoints.create()

A list of operations in the optimization part will have to be worked out as well. Can you think of anything else? Did I understand the assignment correctly?
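
For illustration, such a facade-layer guard could look like this (a sketch; LockRequired and owns_lock follow the names used above, the rest is hypothetical):

from functools import wraps

class LockRequired(Exception):
    """Raised when a write operation is attempted without holding the lock."""

def requires_lock(method):
    @wraps(method)
    def guarded(self, *args, **kwargs):
        if not self.run.owns_lock:
            raise LockRequired(self.run.id)
        return method(self, *args, **kwargs)
    return guarded

class IamcFacade:
    def __init__(self, run):
        self.run = run

    @requires_lock
    def add(self, df): ...

    @requires_lock
    def remove(self, df): ...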

Run Meta Indicators/Data

These are not covered by the rollback logic or the LockRequired exception; should they be?

I'm sure as soon as I post this comment I will think of something else, but for now: please advise!

@khaeru (Member) commented Mar 24, 2025

> Regions and Units
>
> As of now regions and units are tracked as versioned objects, but not rolled back with a run.
> […]
> What do you think?

As mentioned, message_ix/current ixmp test suites and usage don't need this, so no opinion.

> Automatic Rollback
> […]
> Is that right or should the data stay?

Stay? This is what I meant above about "following semantics of the standard library". See "How to use the connection context manager" in the docs for the sqlite3 module; this is the behaviour that the current ixmp.TimeSeries.transact() tries to emulate (see the example there in the docs).
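
Concretely, those sqlite3 semantics are: commit on clean exit, roll back on exception.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE xy (a, b, c)")

try:
    with con:  # commits on success, rolls back on exception
        con.execute("INSERT INTO xy VALUES (1, 2, 3)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# The failed block was rolled back; with no exception, the row would stay.
assert con.execute("SELECT COUNT(*) FROM xy").fetchone()[0] == 0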

> Client-Side Locking Mechanism
> […]
> A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

Yes, that's fine. If something gets inadvertently locked by another user (and it shouldn't if we're encouraging users to use the context manager), it's already a great improvement that backend.run.unlock() will be accessible, instead of having to do something like this.

> LockRequired Exception
> […]
> A list of operations in the optimization part will have to be worked out as well. Can you think of anything else? Did I understand the assignment correctly?

This is more for @glatterf42 I think. Just to clarify, one point above was that creating a Run instance at all should be safe against collisions. That is to say, rather than locking an existing Run while certain things are done to it, two simultaneous attempts to create new (not-yet-existing) Runs that want the same or similar identifiers should be safe.

If that's already achieved some other way than the locking mechanism you've written, e.g. by lower-level transaction control in the DB layer, then we're fine.

> Run Meta Indicators/Data
> […]
> These are not covered by the rollback logic or the LockRequired exception; should they be?

I'm not aware of people using these through ixmp/message_ix, so no opinion.

@danielhuppmann (Member)

Thanks @meksor, this is very nice!

> Regions and Units
>
> As of now regions and units are tracked as versioned objects, but not rolled back with a run.
> […]
> What do you think?

This makes sense.

> Automatic Rollback
> […]
> Is that right or should the data stay?

Would it make sense for the transact() context to end by setting a checkpoint, with the checkpoint description passed as an argument to transact()?
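
A sketch of what that might look like (hypothetical signature, not implemented in this PR):

with run.transact("Add test data 1 and 2"):
    run.iamc.add(test_data_1)
    run.iamc.add(test_data_2)

# On clean exit, equivalent to run.checkpoints.create("Add test data 1 and 2");
# on an exception, the run would be rolled back and no checkpoint created.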

> Client-Side Locking Mechanism
> […]
> A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

Agree with @khaeru, that's ok for our use case.

> LockRequired Exception and Run Meta Indicators/Data
> […]
> These are not covered by the rollback logic or the LockRequired exception; should they be?

Please add creation and deletion of meta indicators to the operations that are change-logged and require a lock.

Adding the optimization-items to the change-log-and-locks-logic can be left for a follow-up PR (led by @glatterf42), in my opinion.
