Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError on releasing resources of a Connection when closing the DB. #208

Closed
sunew opened this issue Jun 24, 2018 · 17 comments
Closed

KeyError on releasing resources of a Connection when closing the DB. #208

sunew opened this issue Jun 24, 2018 · 17 comments
Labels

Comments

@sunew
Copy link
Contributor

sunew commented Jun 24, 2018

UPDATE: see this comment for an up to date description: #208 (comment)

When running tests (in plone.restapi) we sometimes get the following KeyError. It is related to timing issues - inserting a time.sleep(2) in the test setup causes the problem to occur more frequently.

It can also be reproduced outside of a testing environment, when zope is shutting down in the middle of ongoing requests. See a comment below for a method of easy reproduction.

ZODB 5.4.0
It does not happen with ZODB 5.3.0

It is related to this commit: 1b9475d

In line 948 we see: c.close(False)

This is part of the traceback:
File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 948, in _release_resources c.close(False)
In 5.3.0 the corresponding line just sets the transaction manager to None:
c.transaction_manager = None

  Set up plone.restapi.testing.PloneRestApiDXLayer:Functional in 0.000 seconds.
  Running:
    1/77 (1.3%) test_comments_add_root (plone.restapi.tests.test_documentation.TestCommenting)/work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/PubCore/ZServerPublisher.py:22: DeprecationWarning: publish_module is deprecated. Please import from ZServer.ZPublisher.Publish.
  from ZPublisher.Publish import publish_module
    8/77 (10.4%) test_documentation_expansion (plone.restapi.tests.test_documentation.TestCommenting)/work/buildout_cache/eggs/zope.deprecation-4.3.0-py2.7.egg/zope/deprecation/deprecation.py:88: DeprecationWarning: isDefaultPage is deprecated. Import from Products.CMFPlone.defaultpage instead
  name)
    22/77 (28.6%) test_documentation_file (plone.restapi.tests.test_documentation.TestDocumentation)/work/buildout_cache/eggs/Products.PortalTransforms-3.1.3-py2.7.egg/Products/PortalTransforms/libtransforms/commandtransform.py:105: DeprecationWarning: os.popen4 is deprecated.  Use the subprocess module.
  cin, couterr = os.popen4(command, 'b')
    32/77 (41.6%) test_documentation_jwt_login (plone.restapi.tests.test_documentation.TestDocumentation) (1.199 s)Traceback (most recent call last):
  File "./bin/test", line 442, in <module>
    '--test-path', '/work/buildout_cache/eggs/repoze.xmliter-0.6-py2.7.egg',
  File "/work/buildout_cache/eggs/collective.xmltestreport-1.3.4-py2.7.egg/collective/xmltestreport/runner.py", line 60, in run
    failed = run_internal(defaults, args, script_parts=script_parts)
  File "/work/buildout_cache/eggs/collective.xmltestreport-1.3.4-py2.7.egg/collective/xmltestreport/runner.py", line 73, in run_internal
    runner.run()
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 189, in run
    self.run_tests()
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 291, in run_tests
    self.skipped, self.import_errors)
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 460, in run_layer
    import_errors)
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 383, in run_tests
    test(result)
  File "/usr/lib/python2.7/unittest/case.py", line 393, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python2.7/unittest/case.py", line 370, in run
    result.stopTest(self)
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 897, in stopTest
    self.testTearDown()
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 821, in testTearDown
    layer.testTearDown()
  File "/work/playground/plone/plone52devel/src/plone.app.testing/plone/app/testing/layers.py", line 288, in testTearDown
    super(PloneTestLifecycle, self).testTearDown()
  File "/work/playground/plone/plone52devel/src/plone.testing/src/plone/testing/z2.py", line 945, in testTearDown
    self['zodbDB'].close()
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 651, in close
    @self._connectionMap
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 518, in _connectionMap
    self.pool.map(f)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 213, in map
    self.all.map(f)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py", line 62, in map
    f(elt)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 659, in _
    conn._release_resources()
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 948, in _release_resources
    c.close(False)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 304, in close
    self.transaction_manager.unregisterSynch(self)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/_manager.py", line 106, in unregisterSynch
    self._synchs.remove(synch)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py", line 51, in remove
    del self.data[id(obj)]
  File "/usr/lib/python2.7/weakref.py", line 86, in __delitem__
    del self.data[key]
KeyError: 139902031255440


@sunew
Copy link
Contributor Author

sunew commented Jun 24, 2018

@pbauer same error as in coredev Jenkins 5.2 pull reguest build here:
https://jenkins.plone.org/job/pull-request-5.2/891/consoleFull

@jmuchemb
Copy link
Member

jmuchemb commented Jun 24, 2018

In #181, I wrote the following comment about the change from c.transaction_manager = None to c.close(False):

I didn't plan to change this at the beginning but while running tests, I discovered that closing a DB left Connections registered to the global transaction manager. Before, these Connections were ignored by patching their afterCompletion/newTransaction.

But the call to unregisterSynch depends on the condition self.opened and self.transaction_manager is not None. Can you debug to see why your connection is in an inconsistent state ? For example, Is it possible that the call to Connection.open was interrupted between self.opened = time.time() and transaction_manager.registerSynch(self) ?

@sunew
Copy link
Contributor Author

sunew commented Jun 25, 2018

@jmuchemb Thanks, I will take a look!

@sunew
Copy link
Contributor Author

sunew commented Jun 28, 2018

Hi @jmuchemb , I did some debugging, and things are beginning to make sense.

Context:
It happens when running tests with plone.testing, using the FUNCTIONAL_TESTING layer.

There are two threads:

  1. There is a thread (MainThread), starting the application (Zope 4.0b5), opening the DB.
  2. A http request (from a testcase) is made to Zope, resulting in a new thread (Dummy-1) to process the request.
  3. The http request succeeds.
  4. The testcase is done, testTeardown is called.
  5. On tear down of the test, the database is closed, with ZODB.DB.DB.close()

In some cases, depending on timing, machine speed, etc., we get the failure described above in the previous comments.

A quick walkthrough (I skip over some bits) of the teardown:
MainThread:
DB.close(), (lines 636-668 ZODB.DB.py):
iterates over connections in the pool,
calls conn._release_resources for each connection.

_release_resources calls connection.close()
close() calls self.transaction_manager.unregisterSynch(self)
and this calls self._synchs.remove(synch)

Meanwhile, with really bad timing ;) Dummy-1 is finishing the request (see 2.) and closing its connection.

t1: MainThread: DB.Close -> get connections in pool and begin the iteration of all the DB connections
t2: Dummy-1: connection.close() and unregisterSynch
t3: MainThread: during close() on the connection of Dummy-1 - The synch will be gone from the transactionmanager._synchs
Causing the KeyError.

To put it short:
The problem happens when the DB is being closed and is closing all the registered connections, while another thread is finishing a request and closing its connection.

Pattern:
1 mainthread registers 7f16efd94550
2 dummy1 registers 7f16efd73110
3 dummy1 unregisters 7f16efd73110
4 mainthread unregisters 7f16efd94550
5 mainthread tries to unregister the connection of dummy1 7f16efd73110

  • and we have failure

Raw debug output:

registerSynch: <Connection at 7f16efd94550> id: 139736490001744 thread:MainThread
synchs before register: [] 

registerSynch: <Connection at 7f16efd73110> id: 139736489865488 thread:Dummy-1
synchs before register: [] 

unregisterSynch: <Connection at 7f16efd73110> id: 139736489865488 thread:Dummy-1
synchs before unregister: [<Connection at 7f16efd73110>] 

unregisterSynch: <Connection at 7f16efd94550> id: 139736490001744 thread:MainThread
synchs before unregister: [<Connection at 7f16efd94550>] 

unregisterSynch: <Connection at 7f16efd73110> id: 139736489865488 thread:MainThread
synchs before unregister: [] 

Traceback (most recent call last):
  File "./bin/test", line 436, in <module>
    '--test-path', '/work/buildout_cache/eggs/zope.globalrequest-1.4-py2.7.egg',
  File "/work/buildout_cache/eggs/collective.xmltestreport-1.3.4-py2.7.egg/collective/xmltestreport/runner.py", line 60, in run
    failed = run_internal(defaults, args, script_parts=script_parts)
  File "/work/buildout_cache/eggs/collective.xmltestreport-1.3.4-py2.7.egg/collective/xmltestreport/runner.py", line 73, in run_internal
    runner.run()
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 189, in run
    self.run_tests()
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 291, in run_tests
    self.skipped, self.import_errors)
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 460, in run_layer
    import_errors)
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 383, in run_tests
    test(result)
  File "/usr/lib/python2.7/unittest/case.py", line 393, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python2.7/unittest/case.py", line 370, in run
    result.stopTest(self)
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 897, in stopTest
    self.testTearDown()
  File "/work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py", line 821, in testTearDown
    layer.testTearDown()
  File "/work/playground/plone/plone52devel/src/plone.app.testing/plone/app/testing/layers.py", line 288, in testTearDown
    super(PloneTestLifecycle, self).testTearDown()
  File "/work/playground/plone/plone52devel/src/plone.testing/src/plone/testing/z2.py", line 950, in testTearDown
    self['zodbDB'].close()
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 651, in close
    @self._connectionMap
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 518, in _connectionMap
    self.pool.map(f)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 213, in map
    self.all.map(f)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py", line 62, in map
    f(elt)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 659, in _
    conn._release_resources()
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 948, in _release_resources
    c.close(False)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 304, in close
    self.transaction_manager.unregisterSynch(self)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/_manager.py", line 120, in unregisterSynch
    self._synchs.remove(synch)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py", line 51, in remove
    del self.data[id(obj)]
  File "/usr/lib/python2.7/weakref.py", line 86, in __delitem__
    del self.data[key]
KeyError: 139736489865488

Stack of Dummy-1

(Pdb)   /work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/PubCore/ZServerPublisher.py(32)__init__()
-> response=b)
  /work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/ZPublisher/Publish.py(405)publish_module()
-> environ, debug, request, response)
  /work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/ZPublisher/Publish.py(291)publish_module_standard()
-> request.close()
  /work/buildout_cache/eggs/Zope-4.0b5-py2.7.egg/ZPublisher/BaseRequest.py(224)close()
-> self.clear()
  /work/buildout_cache/eggs/Zope-4.0b5-py2.7.egg/ZPublisher/HTTPRequest.py(211)clear()
-> BaseRequest.clear(self)
  /work/buildout_cache/eggs/Zope-4.0b5-py2.7.egg/ZPublisher/BaseRequest.py(216)clear()
-> self._held = None
  /work/buildout_cache/eggs/Zope-4.0b5-py2.7.egg/App/ZApplication.py(90)__del__()
-> self._jar.close()
> /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py(304)close()
-> self.transaction_manager.unregisterSynch(self)
  /work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/_manager.py(123)unregisterSynch()

Stack of MainThread

(Pdb)  /work/buildout_cache/eggs/collective.xmltestreport-1.3.4-py2.7.egg/collective/xmltestreport/runner.py(60)run()
-> failed = run_internal(defaults, args, script_parts=script_parts)
  /work/buildout_cache/eggs/collective.xmltestreport-1.3.4-py2.7.egg/collective/xmltestreport/runner.py(73)run_internal()
-> runner.run()
  /work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py(189)run()
-> self.run_tests()
  /work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py(291)run_tests()
-> self.skipped, self.import_errors)
  /work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py(460)run_layer()
-> import_errors)
  /work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py(383)run_tests()
-> test(result)
  /usr/lib/python2.7/unittest/case.py(393)__call__()
-> return self.run(*args, **kwds)
  /usr/lib/python2.7/unittest/case.py(370)run()
-> result.stopTest(self)
  /work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py(897)stopTest()
-> self.testTearDown()
  /work/buildout_cache/eggs/zope.testrunner-4.8.1-py2.7.egg/zope/testrunner/runner.py(821)testTearDown()
-> layer.testTearDown()
  /work/playground/plone/plone52devel/src/plone.app.testing/plone/app/testing/layers.py(288)testTearDown()
-> super(PloneTestLifecycle, self).testTearDown()
  /work/playground/plone/plone52devel/src/plone.testing/src/plone/testing/z2.py(950)testTearDown()
-> self['zodbDB'].close()
  /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py(651)close()
-> @self._connectionMap
  /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py(518)_connectionMap()
-> self.pool.map(f)
  /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py(213)map()
-> self.all.map(f)
  /work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py(62)map()
-> f(elt)
  /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py(659)_()
-> conn._release_resources()
  /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py(948)_release_resources()
-> c.close(False)
  /work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py(304)close()
-> self.transaction_manager.unregisterSynch(self)
> /work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/_manager.py(126)unregisterSynch()
-> self._synchs.remove(synch)

@sunew
Copy link
Contributor Author

sunew commented Jun 28, 2018

I can reproduce it outside of a testing environment.

  1. insert time.sleep(10) somewhere in code run by a http request
  2. start the request from a browser
  3. before the 10 seconds have passed, stop zope (gracefully with "kill `cat var/instance.pid`")

This is the output incl. my debugging statements:

registerSynch: <Connection at 7f419b096810> id: 139919750686736 thread:Dummy-1
synchs before register: [] 

/work/buildout_cache/eggs/zope.deprecation-4.3.0-py2.7.egg/zope/deprecation/deprecation.py:88: DeprecationWarning: isDefaultPage is deprecated. Import from Products.CMFPlone.defaultpage instead
  name)
/work/buildout_cache/eggs/Chameleon-3.2-py2.7.egg/chameleon/tal.py:471: DeprecationWarning: Using len() is deprecated. Use the `length` attribute for the size of the current page, which is what we return now. Use the `sequence_length` attribute for the size of the entire sequence. 
  iterable = list(iterable) if iterable is not None else ()
2018-06-28 23:57:02 INFO SignalHandler Caught signal SIGTERM
2018-06-28 23:57:02 INFO Z2 Shutting down fast
2018-06-28 23:57:02 INFO ZServer closing HTTP to new connections

unregisterSynch: <Connection at 7f419b096810> id: 139919750686736 thread:MainThread
synchs before unregister: [] 

Traceback (most recent call last):
  File "/work/playground/plone/plone52devel/parts/instance/bin/interpreter", line 281, in <module>
    exec(compile(__file__f.read(), __file__, "exec"))
  File "/work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/Zope2/Startup/run.py", line 64, in <module>
    run()
  File "/work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/Zope2/Startup/run.py", line 29, in run
    starter.run()
  File "/work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/Zope2/Startup/starter.py", line 98, in run
    self.shutdown()
  File "/work/buildout_cache/eggs/ZServer-4.0b1-py2.7.egg/ZServer/Zope2/Startup/starter.py", line 103, in shutdown
    db.close()
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 651, in close
    @self._connectionMap
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 518, in _connectionMap
    self.pool.map(f)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 213, in map
    self.all.map(f)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py", line 62, in map
    f(elt)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/DB.py", line 659, in _
    conn._release_resources()
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 948, in _release_resources
    c.close(False)
  File "/work/buildout_cache/eggs/ZODB-5.4.0-py2.7.egg/ZODB/Connection.py", line 304, in close
    self.transaction_manager.unregisterSynch(self)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/_manager.py", line 126, in unregisterSynch
    self._synchs.remove(synch)
  File "/work/buildout_cache/eggs/transaction-2.2.1-py2.7.egg/transaction/weakset.py", line 51, in remove
    del self.data[id(obj)]
  File "/usr/lib/python2.7/weakref.py", line 86, in __delitem__
    del self.data[key]
KeyError: 139919750686736

@sunew
Copy link
Contributor Author

sunew commented Jun 28, 2018

my debug statements:
(transaction/_manager.py)

    def registerSynch(self, synch):
        """ See ITransactionManager.
        """
        import threading
        # if threading.currentThread().getName() == 'Dummy-1':
        #     from pdb import set_trace; set_trace()
        print '\n\x1b[1;33;40m' + \
            'registerSynch: ' + str(synch) + \
            ' id: ' + str(id(synch)) + \
            ' thread:' + threading.currentThread().getName() + \
            '\n' + 'synchs before register: ' + \
            str([ref() for ref in self._synchs.as_weakref_list()]) + '\x1b[0m \n'
        self._synchs.add(synch)
        if self._txn is not None:
            synch.newTransaction(self._txn)

    def unregisterSynch(self, synch):
        """ See ITransactionManager.
        """
        import threading
        # if threading.currentThread().getName() == 'Dummy-1':
        #     from pdb import set_trace; set_trace()
        print '\n\x1b[1;33;40m' + \
            'unregisterSynch: ' + str(synch) + \
            ' id: ' + str(id(synch)) + \
            ' thread:' + threading.currentThread().getName() + \
            '\n' + 'synchs before unregister: ' + \
            str([ref() for ref in self._synchs.as_weakref_list()]) + '\x1b[0m \n'
        # if synch not in self._synchs:
        #     from pdb import set_trace; set_trace()
        self._synchs.remove(synch)

@sunew
Copy link
Contributor Author

sunew commented Jul 4, 2018

@jmuchemb Hi, did you have time to take a look at my results?

@sunew sunew changed the title intermittent KeyError on releasing resources of a Connection KeyError on releasing resources of a Connection when closing the DB. Jul 5, 2018
@jmuchemb
Copy link
Member

jmuchemb commented Jul 6, 2018

So to sum up, Connection.close() is not thread-safe and from your report, it looks legitimate to close a connection at the same time as the DB is being closed. And it worked before because DB.close didn't actually close connections.

It should be noted also that the bug is not limited to unregisterSynch(). We could also have race conditions while processing __onCloseCallbacks or else.

For this kind of issue, the first idea that comes to mind is to protect Connection.close() with a new lock on the Connection object. There may be better solutions so I keep thinking about this. I doubt that reverting my change is correct: leaving connections registered to a transaction was really bad.

@jmuchemb jmuchemb added the bug label Jul 6, 2018
@sunew
Copy link
Contributor Author

sunew commented Jul 6, 2018

@jmuchemb
Your summary: Yes, I agree with your conclusion.

And yes, there could be other similar race conditions. Sorry for the overload of data in my report 😄, but I was reporting while digging deeper.

And I agree with not reverting your commit, it is just making the issue surface, it is not real cause of the issue.

Maybe @jimfulton has an opinion on the protection of Connection.close() ?

@sunew
Copy link
Contributor Author

sunew commented Aug 14, 2018

any news on this one?

sunew added a commit to sunew/transaction that referenced this issue Sep 15, 2018
@sunew
Copy link
Contributor Author

sunew commented Oct 2, 2018

I have investigated a bit more:

We are in a zope4 setting, running the application normally, not as part of a test suite. (The test suite runs mentioned above made the issue surface more frequently).

There is a main thread and a thread per request.

A) A new request thread opens a connection, that registers with the transaction manager (in Connection.open() - calls transaction/_manager.py - registerSynch).
The connection is added to _synchs WeakSet of the transaction_manager.

B) The application is shut down - perhaps with SIGINT or SIGTERM. The request of A) is long running, so is still active while the application begins to shutdown. This means the request thread has not yet closed its connection.
The connection is still in DB pool, while shutdown calls ZODB.DB.close()

C) DB.close() iterates over all connections in the pool (with _connectionMap): For each connection:
conn._release_resources() -> conn.close() -> self.transaction_manager.unregisterSynch(self)

D) But the transaction_manager is a ThreadTransactionManager, a subclass of threading.local, so the _synchs WeakSet is local for each thread - this means the main thread sees a different _synchs than the request thread, causing the keyerror when trying to unregister the connection with the transaction manager from the main thread during db.close()

I'm not sure what a good solution would be.
@jimfulton any ideas or hints?

sunew added a commit that referenced this issue Oct 3, 2018
@sunew
Copy link
Contributor Author

sunew commented Oct 3, 2018

@jimfulton
Copy link
Member

Here's an update. @sunew thanks for the test, that helped a lot. Your initial assessment was correct and mine was wrong. :) I'm pondering a fix.

@jimfulton
Copy link
Member

I've added a fix on https://github.com/zopefoundation/ZODB/tree/sunew-failing-test-for-208 that depends on zopefoundation/transaction#68. When the transaction change is released, then I can release a ZODB fix.

@pbauer
Copy link
Member

pbauer commented Oct 17, 2018

I tested that against Plone and found that the fix also requires changes in Zope where the transaction is used like this (I already added the additional .manager):

@contextmanager
def load_app(module_info):
    app_wrapper, realm, debug_mode = module_info
    # Loads the 'OFS.Application' from ZODB.
    app = app_wrapper()

    try:
        yield (app, realm, debug_mode)
    finally:
        if transaction.manager.manager._txn is not None:
            # Only abort a transaction, if one exists. Otherwise the
            # abort creates a new transaction just to abort it.
            transaction.abort()
        app._p_jar.close()

@jmuchemb
Copy link
Member

That becomes quite invasive, no ?

(actually, on performance side, I'm not fond of all the extra method calls in transaction)

pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 17, 2018
pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 18, 2018
pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 20, 2018
pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 24, 2018
icemac pushed a commit to zopefoundation/Zope that referenced this issue Oct 24, 2018
icemac pushed a commit to zopefoundation/Zope that referenced this issue Oct 24, 2018
pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 26, 2018
pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 28, 2018
pbauer added a commit to plone/buildout.coredev that referenced this issue Oct 29, 2018
@sunew
Copy link
Contributor Author

sunew commented Apr 1, 2019

Sorry for being so late here, but great to see a fix. Thanks!

NexediGitlab pushed a commit to Nexedi/wendelin.core that referenced this issue Jun 2, 2019
…every connection reopen

This continues c7c01ce (bigfile/zodb: ZODB.Connection can migrate
between threads on close/open and we have to care): Until now we were
retrieving zconn.transaction_manager on _ZBigFileH init, and further using
that transaction manager for every connection reopen. However that is
not correct because on every reopen connection can be given new
transaction manager.

We were not practically hitting the bug because until recently ZODB was,
by default, using the same ThreadTransactionManager manager instance as
Connection.transaction_manager for all connections, and not doing all
steps needed to keep _ZBigFileH.transaction_manager in sync to
Connection was forgiven - a particular transaction manager that was used
was TransactionManager instance implicitly associated with current
thread by global threading.Local transaction.manager . However starting
from ZODB 5.5.0 Connection code was changed to remember as
.transaction_manager the particular TransactionManager instance without
any threading.Local games:

    zopefoundation/ZODB@b6ac40f153
    zopefoundation/ZODB#208
    zopefoundation/ZODB#226

Given that we were not syncing properly that broke wendelin.core tests, for
example:

    bigfile/tests/test_filezodb.py::test_bigfile_filezodb_vs_conn_migration Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/home/kirr/src/wendelin/wendelin.core/bigfile/tests/test_filezodb.py", line 401, in T11
        transaction.commit()    # should be nothing
      File "/home/kirr/src/wendelin/venv/z-dev/local/lib/python2.7/site-packages/transaction/_manager.py", line 252, in commit
        return self.manager.commit()
      File "/home/kirr/src/wendelin/venv/z-dev/local/lib/python2.7/site-packages/transaction/_manager.py", line 131, in commit
        return self.get().commit()
      File "/home/kirr/src/wendelin/venv/z-dev/local/lib/python2.7/site-packages/transaction/_transaction.py", line 298, in commit
        self._synchronizers.map(lambda s: s.beforeCompletion(self))
      File "/home/kirr/src/wendelin/venv/z-dev/local/lib/python2.7/site-packages/transaction/weakset.py", line 61, in map
        f(elt)
      File "/home/kirr/src/wendelin/venv/z-dev/local/lib/python2.7/site-packages/transaction/_transaction.py", line 298, in <lambda>
        self._synchronizers.map(lambda s: s.beforeCompletion(self))
      File "/home/kirr/src/wendelin/wendelin.core/bigfile/file_zodb.py", line 671, in beforeCompletion
        assert txn is zconn.transaction_manager.get()
    AssertionError

What is happening here is that one thread used the connection and
ZBigFile/_ZBigFileH associated with it, then the connection was closed
and released to DB pool. Then the connection was reopened but by another
thread and thus with different TransactionManager instance and oops -
_ZBigFileH.transaction_manager is different because it is
TransactionManager instance that was used by the first thread.

Fix it by resyncing _ZBigFileH.transaction_manager on every
connection reopen. No new test as existing tests already cover the
problem when run with ZODB >= 5.5.0 .
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants