Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bucket successfully deleted from S3 but DB update fails #1524

Conversation

rv0lt
Copy link
Member

@rv0lt rv0lt commented Apr 4, 2024

Read this before submitting the PR

  1. Always create a Draft PR first
  2. Go through sections 1-5 below, fill them in and check all the boxes
  3. Make sure that the branch is updated; if there's an "Update branch" button at the bottom of the PR, rebase or update branch.
  4. When all boxes are checked, information is filled in, and the branch is updated: mark as Ready For Review and tag reviewers (top right)
  5. Once there is a submitted review, implement the suggestions (if reasonable, otherwise discuss) and request an new review.

If there is a field which you are unsure about, enter the edit mode of this description or go to the PR template; There are invisible comments providing descriptions which may be of help.

1. Description / Summary

Fix raising of error when bucket successfully deleted from S3 but DB update fails

--- To reproduce ---

Add a few lines for the application to sleep

  • Line 754 in api/project.py
            import time
            flask.current_app.logger.debug("Sleeping for 15 seconds.")
            time.sleep(15)

Release from the client project_1 and update the database to change the deadline. A testbucket has to exit in Minio

UPDATE projectstatuses SET deadline = CURDATE() - INTERVAL 1 DAY where project_id = 1;

Expire the project from backend container

flask set-available-to-expired

And update the deadline again

1. From the flask command (cronjob)

Add a second waiting time, after the else. In order to be able to turn on the db again and continue for the next projects.

Run
flask set-expired-to-archived

And during the 15 seconds stop. Stop the mariadb container, the error got must be a

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

i.e. The error throw if not from a request

2. From a request

Add another time stop in 774. After the line that catches the sqlerror and before creating the DeletionError

            if flask.request:
                flask.current_app.logger.debug("Sleeping for 15 seconds.")
                time.sleep(15)
                raise DeletionError(

That is, because, if the db is still stopped when trying to throw the exception. There will be a different error thown.

Remember to recreate the testbucket in Minio

From the client, run the command to expire the project, with the db container up.

During the first 15-second wait, stop the container to force the exception.

During the second 15 seconds wait, start the container again to be able to communicate with the client and return the error.

The command will fail in the client, and in the logs it can be seen that the DeletionError was generated.

2. Jira task / GitHub issue

Link to the github issue or add the Jira task ID here.

3. Type of change

What type of change(s) does the PR contain?

Check the relevant boxes below. For an explanation of the different sections, enter edit mode of this PR description template.

  • New feature
    • Breaking: Why / How? Add info here.
    • Non-breaking
  • Database change: Remember the to include a new migration version, or explain here why it's not needed.
  • Bug fix
  • Security Alert fix
    • Package update
      • Major version update
  • Documentation
  • Workflow
  • Tests only

4. Additional information

5. Actions / Scans

Check the boxes when the specified checks have passed.

For information on what the different checks do and how to fix it if they're failing, enter edit mode of this description or go to the PR template.

  • Black
  • Prettier
  • Yamllint
  • Tests
  • CodeQL
  • Trivy
  • Snyk

@rv0lt rv0lt self-assigned this Apr 4, 2024
…cessfully-deleted-from-S3-but-DB-update-fails
Copy link

codecov bot commented Apr 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.84%. Comparing base (8006196) to head (677c5ee).
Report is 8 commits behind head on dev.

Current head 677c5ee differs from pull request most recent head f0d1d85

Please upload reports for the commit f0d1d85 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1524      +/-   ##
==========================================
- Coverage   92.36%   91.84%   -0.52%     
==========================================
  Files          29       29              
  Lines        4804     4807       +3     
==========================================
- Hits         4437     4415      -22     
- Misses        367      392      +25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rv0lt rv0lt marked this pull request as ready for review April 10, 2024 08:57
@rv0lt rv0lt requested a review from a team April 10, 2024 11:55
@valyo
Copy link
Member

valyo commented Apr 10, 2024

@rv0lt : do you have a good way to manually test this? If yes, please, add it to the description

@rv0lt
Copy link
Member Author

rv0lt commented Apr 10, 2024

@rv0lt : do you have a good way to manually test this? If yes, please, add it to the description

Manually? The thing I have been doing is to exec into the flask shell and see what was going on. You mean that?

@valyo
Copy link
Member

valyo commented Apr 10, 2024

Yes, but how to reproduce the situation when the error appears

Copy link
Member

@valyo valyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am getting this traceback:

[2024-04-10 13:15:58,057] commands [ERROR] DeletionError: Project bucket contents were deleted, but they were not deleted from the database. Please contact SciLifeLab Data Centre.
 Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
Traceback (most recent call last):
  File "/code/dds_web/commands.py", line 688, in set_expired_to_archived
    new_status_row, delete_message = archive.archive_project(
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/dds_web/api/project.py", line 492, in archive_project
    RemoveContents().delete_project_contents(project=project, delete_bucket=True)
  File "/code/dds_web/api/project.py", line 773, in delete_project_contents
    raise DeletionError(
          ^^^^^^^^^^^^^^
  File "/code/dds_web/errors.py", line 200, in __init__
    super().__init__((alt_message or "Deletion failed.") if not pass_message else message)
  File "/code/dds_web/errors.py", line 39, in __init__
    resource=flask.request.path or "not applicable",
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/werkzeug/local.py", line 316, in __get__
    obj = instance._get_current_object()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/werkzeug/local.py", line 513, in _get_current_object
    raise RuntimeError(unbound_message) from None
RuntimeError: Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
Traceback (most recent call last):
  File "/usr/local/bin/flask", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/cli.py", line 1050, in main
    cli.main()
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/dds_web/commands.py", line 713, in set_expired_to_archived
    project.project_statuses.append(new_status_row)
                                    ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'new_status_row' where it is not associated with a value

but I still think that this PR is a different taks than the one I (at least) have in mind

@rv0lt
Copy link
Member Author

rv0lt commented Apr 10, 2024

code

What I have been trying to do is to catch the exception for the Deletion. Which is turned into the RuntimeError
due to failed initialization (it can be seen down in the traceback). And try to grep the real error from the traceback.

Then, log it out adding the message of the real error. So it is easy to find (the first line of [ERROR] DeletionError: Project bucket contents were deleted, but they were not deleted from the database. Please contact SciLifeLab Data Centre.

So, it ends up logging the error but instead of the generic Runtime, it tries to be a bit more specific

Ngl, this whole task is a bit mind scratching honestly

I am getting this traceback:

[2024-04-10 13:15:58,057] commands [ERROR] DeletionError: Project bucket contents were deleted, but they were not deleted from the database. Please contact SciLifeLab Data Centre.
 Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
Traceback (most recent call last):
  File "/code/dds_web/commands.py", line 688, in set_expired_to_archived
    new_status_row, delete_message = archive.archive_project(
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/dds_web/api/project.py", line 492, in archive_project
    RemoveContents().delete_project_contents(project=project, delete_bucket=True)
  File "/code/dds_web/api/project.py", line 773, in delete_project_contents
    raise DeletionError(
          ^^^^^^^^^^^^^^
  File "/code/dds_web/errors.py", line 200, in __init__
    super().__init__((alt_message or "Deletion failed.") if not pass_message else message)
  File "/code/dds_web/errors.py", line 39, in __init__
    resource=flask.request.path or "not applicable",
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/werkzeug/local.py", line 316, in __get__
    obj = instance._get_current_object()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/werkzeug/local.py", line 513, in _get_current_object
    raise RuntimeError(unbound_message) from None
RuntimeError: Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
Traceback (most recent call last):
  File "/usr/local/bin/flask", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/cli.py", line 1050, in main
    cli.main()
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/dds_web/commands.py", line 713, in set_expired_to_archived
    project.project_statuses.append(new_status_row)
                                    ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'new_status_row' where it is not associated with a value

but I still think that this PR is a different taks than the one I (at least) have in mind

@valyo
Copy link
Member

valyo commented Apr 10, 2024

I agree it is confusing, and I seem to lack understanding of some of the things involved (on flask and python level).
But my main question is how do you trigger this error in your dev environment, so that you can get an exception in the first place

@rv0lt
Copy link
Member Author

rv0lt commented Apr 10, 2024

I agree it is confusing, and I seem to lack understanding of some of the things involved (on flask and python level). But my main question is how do you trigger this error in your dev environment, so that you can get an exception in the first place

Yes, I am trying to write some instructions. Until now, I was just running "flask shell" in the backend container (it just opens a python shell but that can be conected to the apliocation), and running the code lines one by one. I am trying to think of a way to make it more reproducible than just doing that

@valyo
Copy link
Member

valyo commented Apr 10, 2024

and how do you simulate the DB error?

@rv0lt
Copy link
Member Author

rv0lt commented Apr 10, 2024

and how do you simulate the DB error?

Ah, yes, same. Running one by one the lines of the function with the db container off.
So writting some like this on the flask shell

try:
    try:
     model.query....
    catch **operationError**:
     raise **DeletionError**
catch **RuntimeError**:

Veeeeeery messy I know. That is why I am trying to find a way to document/test this better

@rv0lt
Copy link
Member Author

rv0lt commented Apr 10, 2024

So writting some like this on the flask shell

Just FYI. Reason why we cannot call just RemoveContents for example with the db off. It is because of decorators. They will detect it before executing the function logic. Probably mocking can be used here, will try something like that in a script

@valyo
Copy link
Member

valyo commented Apr 11, 2024

@rv0lt : here is the traceback I had in mind when we initiated this task:
Traceback (most recent call last): File "/usr/local/bin/flask", line 8, in <module> sys.exit(main()) File "/usr/local/lib/python3.10/site-packages/flask/cli.py", line 1050, in main cli.main() File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1659, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func return f(get_current_context(), *args, **kwargs) File "/usr/local/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator return __ctx.invoke(f, *args, **kwargs) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/code/dds_web/commands.py", line 681, in set_expired_to_archived new_status_row, delete_message = archive.archive_project( File "/code/dds_web/api/project.py", line 492, in archive_project RemoveContents().delete_project_contents(project=project, delete_bucket=True) File "/code/dds_web/api/project.py", line 778, in delete_project_contents if isinstance(err, sqlalchemy.exc.OperationalError)UnboundLocalError: local variable 'err' referenced before assignment

If I understand this correctly, the DeletionError was raised and if we had sqlerr instead of err passed to isinstance, then we should have got the proper message printed.
EDIT I managed to trigger 'sqlalchemy.exc.OperationalError' (by changing the query at line 754 to something that will produce db error: 'models.File.query.filter(models.File.unit_id == project.id).delete()') and even with the correct var sqlerr inside the 'DeletionError', we don't get the expected exception raising because of what @rv0lt was explaining all the time: 'DeletionError' inherits from 'LoggedHTTPException', which leads to 'RuntimeError'.

@rv0lt rv0lt requested a review from valyo April 15, 2024 14:46
@rv0lt
Copy link
Member Author

rv0lt commented Apr 15, 2024

Added an explanation to recreate the situation

@valyo
Copy link
Member

valyo commented Apr 17, 2024

It seems this does the trick with RuntimeError: Working outside of request context. in a nice way.

Correct me if I am wrong, but we also want the flask command to register/print the error (what's now alt_message) and to continue running, don't we?

@rv0lt
Copy link
Member Author

rv0lt commented Apr 17, 2024

It seems this does the trick with RuntimeError: Working outside of request context. in a nice way.

Correct me if I am wrong, but we also want the flask command to register/print the error (what's now alt_message) and to continue running, don't we?

I see, the alt_message is more difficult, because this issue can be caused by different operations failures. And thus, we go again to the solution of having to inspect the traceback in running time.

The continue running, true. This will be caught by the outermost except, it can be solved by moving the archive.archive_project to the try block just below, this one continues the loop afterwards

@valyo
Copy link
Member

valyo commented Apr 17, 2024

hm, why should logging the error be difficult. Maybe I am naive but shouldn't something like this work:

        ) as sqlerr:
            alt_message=(
                "Project bucket contents were deleted, but they were not deleted from the "
                "database. Please contact SciLifeLab Data Centre."
                + (
                    "Database malfunction."
                    if isinstance(sqlerr, sqlalchemy.exc.OperationalError)
                    else "."
                )
            )
            if flask.request:
                raise DeletionError(
                    project=project.public_id,
                    message=str(sqlerr),
                    alt_message=alt_message,
                ) from sqlerr
            else:
                flask.current_app.logger.error(f"Error: {alt_message}")

@rv0lt
Copy link
Member Author

rv0lt commented Apr 17, 2024

hm, why should logging the error be difficult. Maybe I am naive but shouldn't something like this work:

        ) as sqlerr:
            alt_message=(
                "Project bucket contents were deleted, but they were not deleted from the "
                "database. Please contact SciLifeLab Data Centre."
                + (
                    "Database malfunction."
                    if isinstance(sqlerr, sqlalchemy.exc.OperationalError)
                    else "."
                )
            )
            if flask.request:
                raise DeletionError(
                    project=project.public_id,
                    message=str(sqlerr),
                    alt_message=alt_message,
                ) from sqlerr
            else:
                flask.current_app.logger.error(f"Error: {alt_message}")

Ah yes, that could work. Only thing that I am thinking is that maybe that generates two alerts? This logger and then the one by the catch in the upper function? Worth researching anyway

@valyo
Copy link
Member

valyo commented Apr 17, 2024

yes, I think I saw an error from commands.py too, when I made a quick try. But this shouldn't be a big issue!?!, as long as we get the "Project bucket contents were deleted, but they were not deleted from the ... blah blah" logged and the command continues.
May be we should have short a discussion, just to be sure that I am not suggesting something completely stupid

Copy link
Member

@valyo valyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the suggested change, I have a question: in commands.py, set_expired_to_archived, inside the main try...except block (lines 657 to 720) there are two more try...except blocks, one for calling archive_project and one for db.session.commit()
Then, in project.py, archive_project calls delete_project_contents inside a try...except block, which in turn has a try...except
Maybe I am just ignorant, but is this really necessary?

dds_web/api/project.py Outdated Show resolved Hide resolved
@rv0lt
Copy link
Member Author

rv0lt commented Apr 19, 2024

Apart from the suggested change, I have a question: in commands.py, set_expired_to_archived, inside the main try...except block (lines 657 to 720) there are two more try...except blocks, one for calling archive_project and one for db.session.commit() Then, in project.py, archive_project calls delete_project_contents inside a try...except block, which in turn has a try...except Maybe I am just ignorant, but is this really necessary?

Yes, I was still checking the changes. That's why I didn't re-requested the review yet! I added the second try-catch to see if it can be avoided to register two different errors (thus generating two alerts)

@rv0lt rv0lt requested a review from valyo April 22, 2024 09:58
SPRINTLOG.md Outdated Show resolved Hide resolved
rv0lt and others added 2 commits May 28, 2024 10:31
@rv0lt rv0lt requested a review from valyo May 28, 2024 08:49
rv0lt and others added 2 commits May 28, 2024 12:50
@rv0lt rv0lt merged commit 06e97d9 into dev May 28, 2024
13 checks passed
@rv0lt rv0lt deleted the DDS-1920-Fix-raising-of-error-when-bucket-successfully-deleted-from-S3-but-DB-update-fails branch May 28, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants