Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a way to retrieve raw extensions data #183

Closed
wants to merge 1 commit into from

Conversation

navytux
Copy link
Contributor

@navytux navytux commented Nov 2, 2017

Currently when client has IStorageTransactionMetaData instance (e.g. on
storage iteration) it is possible to read transaction's .user and
.description in raw form, but .extension is always returned unpickled.

This creates several problems:

  • tools like zodb dump [1] cannot dump data exactly as stored on a
    storage. This makes database potentially not bit-to-bit identical to
    its original after restoring from such dump.

  • zodb dump output could be changing from run to run on the same
    database. This comes from the fact that e.g. python dictionaries are
    unordered and so when pickling a dict back to bytes the result could
    be not the same as original.

    ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

To fix we provide a way to retrieve raw extension from transaction
metadata. This is done in backward-compatible way by introducing new
interface

IStorageTransactionInformationRaw

which can be optionally provided by an IStorageTransactionInformation to
indicate that transaction metadata can be also read in raw from.

Then BaseStorage.TransactionRecord ctor is extended with optional
raw_extension argument which if != None is remembered and makes
IStorageTransactionInformationRaw to be also provided by constructed
TransactionRecord instance.

We also adapt FileStorage & friends to provide/use new functionality.

Similarly to e.g. undo, storages should be indicating they will return
iterated transactions provided with raw metadata via implementing

supportsTransactionInformationRaw() -> True

[1] https://lab.nexedi.com/nexedi/zodbtools

@navytux
Copy link
Contributor Author

navytux commented Nov 2, 2017

Related ZEO change: zopefoundation/ZEO#100

@navytux
Copy link
Contributor Author

navytux commented Nov 2, 2017

https://lab.nexedi.com/nexedi/zodbtools/commit/2634c956 shows how new functionality is used in zodb dump.

Please see https://lab.nexedi.com/nexedi/zodbtools/commit/75c03368#04505d86ceff8e0a7a413120968ef91d0842bc29_67_114 about all the hops one needs to jump without this support and still it cannot work fully reliably to dump extension in raw form as stored on disk.

@navytux
Copy link
Contributor Author

navytux commented Nov 17, 2017

Ping anyone? @jimfulton ?

@navytux
Copy link
Contributor Author

navytux commented Jan 11, 2018

@jimfulton, I understand everyone is busy, but it is pity there is no feedback on this at all for 2 months.

@jimfulton
Copy link
Member

Sorry for the slow response. Thanks for squeaking. :)

I imagine that part of the lack of response is that this is an area where we (I) got things wrong fixing it is fraught with subtle opportunities to fail, as I discovered a year or two ago when I tried to clean up transaction metadata.

@jimfulton
Copy link
Member

I appreciate why you want this. The solution seems overly complex to me. It adds a lot of code for a subtle benefit.

I'd like to see this implemented differently. I'd also like @jamadden's thoughts.

Here's an alternative approach.

Change ZODB.Connection.TransactionMetaData to save both extension and a new extension_bytes attribute. Currently, if given bytes, it creates a dict. Change this:

  • save as both extension and extension_bytes, deserializing or serializing as needed, depending on what's passed in.

  • Change callers that now deserialize before creating transaction records to stop doing that, at least for cases in ZODB. This pre-deserialization hasn't been needed for over a year. So for anything that iterates over raw data, the raw data will be preserved.

I would also avoid creating a new interface for now. For now, callers that want extsnion_bytes can do duck typing ro deal with implementations that don't have this. We can even add a note to the current interface. Pragmatically, you can address your needs by implementing this in ZODB proper and NEO.

I think my suggestion addresses your need and leads to a small net reduction in code rather than a significant increase.

Thoughts?

@navytux
Copy link
Contributor Author

navytux commented Jan 14, 2018

@jimfulton thanks for feedback. If I understood you right in essence what you propose is that we extend IStorageTransactionMetaData with optional extension_bytes attribute. If that could be done it would be indeed simpler.

However when I was working on the topic, if I recall correctly, I thought changing IStorageTransactionMetaData would be not good fit because:

  • if we just add new non-optional attribute it will change the interface thus rendering its all older implementers now non-complying to it thus breaking backward compatibility,
  • so far (to my knowledge) there is no "optional attribute" precedent in ZODB,
  • and in terms of interfaces speak optionally providing something extra is optionally implementing some extra interface.

Though practically speaking on my side I would be ok if IStorageTransactionMetaData is extended with such optional extension_bytes attribute with documentation explicitly saying it might be missing. If so the patch could be as follows (https://lab.nexedi.com/kirr/ZODB/commit/d5d43ba0):

diff --git a/src/ZODB/Connection.py b/src/ZODB/Connection.py
index 9468a3b6f..b651f664e 100644
--- a/src/ZODB/Connection.py
+++ b/src/ZODB/Connection.py
@@ -1312,6 +1312,12 @@ def __init__(self, user=u'', description=u'', extension=b''):
             description = description.encode('utf-8')
         self.description = description
 
+        # Provide .extension_bytes if we are given extension in its raw form.
+        # If not - leave created TransactionMetaData instance without
+        # .extension_bytes attribute at all.
+        if isinstance(extension, bytes):
+            self.extension_bytes = extension
+
         if not isinstance(extension, dict):
             extension = _compat.loads(extension) if extension else {}
         self.extension = extension
diff --git a/src/ZODB/FileStorage/FileStorage.py b/src/ZODB/FileStorage/FileStorage.py
index af3857ea5..e86e6ea20 100644
--- a/src/ZODB/FileStorage/FileStorage.py
+++ b/src/ZODB/FileStorage/FileStorage.py
@@ -1988,15 +1988,8 @@ def __next__(self):
 
             if h.status != "u":
                 pos = tpos + h.headerlen()
-                e = {}
-                if h.elen:
-                    try:
-                        e = loads(h.ext)
-                    except:
-                        pass
-
                 result = TransactionRecord(h.tid, h.status, h.user, h.descr,
-                                           e, pos, tend, self._file, tpos)
+                                           h.ext, pos, tend, self._file, tpos)
 
             # Read the (intentionally redundant) transaction length
             self._file.seek(tend)
diff --git a/src/ZODB/fsrecover.py b/src/ZODB/fsrecover.py
index 7bdc47fca..81359a5d6 100644
--- a/src/ZODB/fsrecover.py
+++ b/src/ZODB/fsrecover.py
@@ -143,12 +143,14 @@ def read_txn_header(f, pos, file_size, outp, ltid):
     pos = tpos+(23+ul+dl+el)
     user = f.read(ul)
     description = f.read(dl)
-    if el:
-        try: e = loads(f.read(el))
-        except: e = {}
-    else: e = {}
+    ext = f.read(el)
+    # reset extension to {} if it cannot be unpickled
+    try:
+        loads(ext)
+    except:
+        ext = b''
 
-    result = TransactionRecord(tid, status, user, description, e, pos, tend,
+    result = TransactionRecord(tid, status, user, description, ext, pos, tend,
                                f, tpos)
     pos = tend
 
diff --git a/src/ZODB/interfaces.py b/src/ZODB/interfaces.py
index 113e71824..fb2c785c3 100644
--- a/src/ZODB/interfaces.py
+++ b/src/ZODB/interfaces.py
@@ -546,6 +546,12 @@ class IStorageTransactionMetaData(Interface):
     extension = Attribute(
         "A dictionary carrying a transaction's extended_info data")
 
+    extension_bytes = Attribute(
+        """Transaction's extension data as raw bytes.
+
+        This attribute is optional and is not always present.
+        """)
+
 
     def set_data(ob, data):
         """Hold data on behalf of an object
diff --git a/src/ZODB/tests/IteratorStorage.py b/src/ZODB/tests/IteratorStorage.py
index 48fcbc9ee..de814ca24 100644
--- a/src/ZODB/tests/IteratorStorage.py
+++ b/src/ZODB/tests/IteratorStorage.py
@@ -22,6 +22,7 @@
 from ZODB.tests.MinPO import MinPO
 from ZODB.tests.StorageTestBase import zodb_pickle, zodb_unpickle
 from ZODB.utils import U64, p64, load_current
+from ZODB._compat import loads
 
 import ZODB.blob
 
@@ -49,6 +50,10 @@ def iter_verify(self, txniter, revids, val0):
 
 class IteratorStorage(IteratorCompare):
 
+    # override this in tests for storages that have support for
+    # .extension_bytes in iterated transactions.
+    supportsExtensionBytes = False
+
     def checkSimpleIteration(self):
         # Store a bunch of revisions of a single object
         self._oid = oid = self._storage.new_oid()
@@ -86,11 +91,20 @@ def checkUndoZombie(self):
 
     def checkTransactionExtensionFromIterator(self):
         oid = self._storage.new_oid()
-        revid = self._dostore(oid, data=MinPO(1))
+        ext_dict = {"hello": "world"}
+        revid = self._dostore(oid, data=MinPO(1), extension=ext_dict)
         iter = self._storage.iterator()
         count = 0
         for txn in iter:
-            self.assertEqual(txn.extension, {})
+            self.assertEqual(txn.extension, ext_dict)
+            # missing: not None - something unique - not to allow
+            # implementations to erroneously return None in .extension_bytes .
+            missing = object()
+            ext_bytes = getattr(txn, "extension_bytes", missing)
+            self.assertEqual(self.supportsExtensionBytes, ext_bytes is not missing)
+            if ext_bytes is not missing:
+                ext_dict_fromraw = loads(ext_bytes)
+                self.assertEqual(ext_dict_fromraw, ext_dict)
             count +=1
         self.assertEqual(count, 1)
 
@@ -215,6 +229,9 @@ def compare(self, storage1, storage2):
             eq(txn1.user,        txn2.user)
             eq(txn1.description, txn2.description)
             eq(txn1.extension,  txn2.extension)
+            rawext1 = getattr(txn1, "extension_bytes", None)
+            rawext2 = getattr(txn2, "extension_bytes", None)
+            eq(rawext1, rawext2)
             itxn1 = iter(txn1)
             itxn2 = iter(txn2)
             for rec1, rec2 in zip(itxn1, itxn2):
diff --git a/src/ZODB/tests/StorageTestBase.py b/src/ZODB/tests/StorageTestBase.py
index 1488e2330..275404a94 100644
--- a/src/ZODB/tests/StorageTestBase.py
+++ b/src/ZODB/tests/StorageTestBase.py
@@ -124,7 +124,7 @@ def tearDown(self):
         ZODB.tests.util.TestCase.tearDown(self)
 
     def _dostore(self, oid=None, revid=None, data=None,
-                 already_pickled=0, user=None, description=None):
+                 already_pickled=0, user=None, description=None, extension=None):
         """Do a complete storage transaction.  The defaults are:
 
          - oid=None, ask the storage for a new oid
@@ -149,6 +149,8 @@ def _dostore(self, oid=None, revid=None, data=None,
             t.user = user
         if description is not None:
             t.description = description
+        if extension is not None:
+            t.extension = extension
         try:
             self._storage.tpc_begin(t)
             # Store an object
diff --git a/src/ZODB/tests/testFileStorage.py b/src/ZODB/tests/testFileStorage.py
index 273304817..7f3bc9933 100644
--- a/src/ZODB/tests/testFileStorage.py
+++ b/src/ZODB/tests/testFileStorage.py
@@ -56,6 +56,8 @@ class FileStorageTests(
     ReadOnlyStorage.ReadOnlyStorage
     ):
 
+    supportsExtensionBytes = True
+
     def open(self, **kwargs):
         self._storage = ZODB.FileStorage.FileStorage('FileStorageTests.fs',
                                                      **kwargs)
@@ -77,7 +79,14 @@ def checkLongMetadata(self):
         except POSException.StorageError:
             pass
         else:
-            self.fail("expect long user field to raise error")
+            self.fail("expect long description field to raise error")
+        try:
+            self._dostore(extension={s: 1})
+        except POSException.StorageError:
+            pass
+        else:
+            self.fail("expect long extension field to raise error")
+
 
     def check_use_fsIndex(self):
 
-- 
2.15.1.716.g6967081ed5

Would that be ok?

Thanks,
Kirill

@jimfulton
Copy link
Member

jimfulton commented Jan 14, 2018 via email

@navytux
Copy link
Contributor Author

navytux commented Jan 14, 2018

FTR. I hate reading diffs? You went through the trouble to create changes.
Why not just make a new PR?

Well, some of as are, yes, a bit old style. I've updated the diff in this PR and will comment on you feedback in the diff context.

@navytux
Copy link
Contributor Author

navytux commented Jan 14, 2018

@jimfulton thanks again for feedback. I went through your comments and put my replies there in the diff context.

I strongly suggest not to use only duck-typing and specify interface for the functionality first. Imho duck-typing, while convenient for quick-hacks, hurts the codebase in the longer term. And especially when in ZODB we have ZODB/interfaces.py as the specification it would be a pity to workaround it.

if el:
try: e = loads(f.read(el))
except: e = {}
else: e = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimfulton says:

The original code is flawed. IDK (IDR) why we have a bare except.

To any tests fail wo it?

I don't know whether any test fails without it, but since it is FS recover to me it looks logical that if e.g. extension is corrupt, it is better to drop it, but still try to recover rest of transaction metadata.

In any way I would preserve current semantic within this patch and try to do one step at a time but well.

BTW this code is from you from 2001:

2d0d8c08#diff-b18ad2268d655f3885a3b21941d8f8f9R209


This attribute is optional and is not always present.
""")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimfulton says:

No. I'd avoid this for now.

Why? If we don't expose the functionality via interface it will be very obscure about what is going on when tools start to depend on internal behaviour of some internal implementation.

What is wrong via explicity specifying the interface?

self.assertEqual(self.supportsExtensionBytes, ext_bytes is not missing)
if ext_bytes is not missing:
ext_dict_fromraw = loads(ext_bytes)
self.assertEqual(ext_dict_fromraw, ext_dict)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimfulton says:

Drop the supportsExtensionBytes.

If code wants to work with code that might not support extension_bytes, it
can try/except on the attribute access, at least for now.

This is test code which tests that particular storage provides .extension_bytes as expected. What is expected is set via .supportsExtensionBytes and it is not universally there - i.e. FileStorage provides it while MappingStorage - does not.

BTW, what's up with the camelCase? :)

Maybe my intertia to name it similar to supportsUndo() function.
I'm ok to redo if you insist it should be supports_extension_bytes.

pass
else:
self.fail("expect long extension field to raise error")

Copy link
Contributor Author

@navytux navytux Jan 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimfulton says:

What is this testing?

There is already-existing FileStorage test which checks if a commit with long metadata fails:

https://github.com/zopefoundation/ZODB/blob/de1f24ca/src/ZODB/tests/testFileStorage.py#L67

Since the commits in FileStorage tests are performed via _dostore() and my patch extends _dostore() to also accept extension, I natrally extend this test as well. (I grepped all places where _dostore was used and found it this way).

# .extension_bytes attribute at all.
if isinstance(extension, bytes):
self.extension_bytes = extension

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimfulton says:

I would go further. If the input is bytes, save it, and save the
deserialization, otherwise save the serialization and the dict value.

IOW, save both extension_bytes and extension, serializing or deserializing
as necessary. The choice of the name "extension_bytes" rather than
"raw_extension" was conscious.

While this seem to be possible and maybe I'm wrong, my feeling is that this brings too much magic and we'll be bitten by it later.

My preference would be we get the interface for this functionality right at first and only then improve implementation step by step.

Currently when client has IStorageTransactionMetaData instance (e.g. on
storage iteration) it is possible to read transaction's .user and
.description in raw form, but .extension is always returned unpickled.

This creates several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

To fix we provide a way to retrieve raw extension from transaction
metadata. This is done in backward-compatible way by introducing
new IStorageTransactionMetaDataRaw interface, which may be optionally
provided by an IStorageTransactionMetaData instance, and if so there
will be additional

	.extension_bytes

attribute.

Then ZODB.Connection.TransactionMetaData (the only in-tree
IStorageTransactionMetaData implementer) constructor is extended to see
whether it was given extension in raw form, and if yes, remember it as
.extension_bytes in addition to deserialized .extension dict.

This way storages that now pass raw extension to TransactionMetaData will
automatically provide .extension_bytes in their iterated transactions.

Teach FileStorage & friends to provide/use new functionality.

/helped-by: Jim Fulton.

[1] https://lab.nexedi.com/nexedi/zodbtools
@navytux
Copy link
Contributor Author

navytux commented Apr 5, 2018

@jimfulton, @jamadden I've updated the patch and personally I think it should be now all OK to go:

  • there is no more "optional .extension_bytes attribute" in IStorageTransactionMetaData -> the optional functionality that an instance of IStorageTransactionMetaData might also provide is now represented by IStorageTransactionMetaDataRaw.
  • this way all the interfaces are now clearly defined without anything not covered by zope.interfaces.
  • I've also corrected some minor style issues caught last time by Jim's review.

Once again, the patch is now not complex like it was in the beginning, and the access interface is now all well defined. I understand everyone is busy and getting this "raw" thing could be "that this is an area where we (I) got things wrong fixing it is fraught with subtle opportunities to fail" (#183 (comment)).

However let's please get this resolved.

It's been already 5 months while this patch is here and going through circles creates only frustrations and things like that. ZODB ecosystem is already not big, like it used to be some time ago, and at this point fragmenting it even more, I think, would be not good for anyone.

On my side I care first about getting interfaces well defined and only as a second step about implementation.

If needed I'm ready to open new PR or do whatever it makes sense for people to make reviewing easier.

Thanks beforehand,
Kirill


The patch in this PR is already updated with latest version. Full interdiff to previous patch revision follows:

diff --git a/src/ZODB/Connection.py b/src/ZODB/Connection.py
index b651f664e..d4df30bea 100644
--- a/src/ZODB/Connection.py
+++ b/src/ZODB/Connection.py
@@ -32,7 +32,7 @@
 from transaction.interfaces import ISavepointDataManager
 from transaction.interfaces import IDataManagerSavepoint
 from transaction.interfaces import ISynchronizer
-from zope.interface import implementer
+from zope.interface import implementer, alsoProvides
 
 import transaction
 
@@ -1317,6 +1317,7 @@ def __init__(self, user=u'', description=u'', extension=b''):
         # .extension_bytes attribute at all.
         if isinstance(extension, bytes):
             self.extension_bytes = extension
+            alsoProvides(self, ZODB.interfaces.IStorageTransactionMetaDataRaw)
 
         if not isinstance(extension, dict):
             extension = _compat.loads(extension) if extension else {}
diff --git a/src/ZODB/interfaces.py b/src/ZODB/interfaces.py
index fb2c785c3..e396a2c58 100644
--- a/src/ZODB/interfaces.py
+++ b/src/ZODB/interfaces.py
@@ -540,18 +540,15 @@ class IStorageTransactionMetaData(Interface):
 
     Note that unlike transaction.interfaces.ITransaction, the ``user``
     and ``description`` attributes are bytes, not text.
+
+    An instance of IStorageTransactionMetaData may also optionally provide
+    IStorageTransactionMetaDataRaw with raw form of transaction meta data.
     """
     user = Attribute("Bytes transaction user")
     description = Attribute("Bytes transaction Description")
     extension = Attribute(
         "A dictionary carrying a transaction's extended_info data")
 
-    extension_bytes = Attribute(
-        """Transaction's extension data as raw bytes.
-
-        This attribute is optional and is not always present.
-        """)
-
 
     def set_data(ob, data):
         """Hold data on behalf of an object
@@ -573,6 +570,21 @@ def data(ob):
         See set_data.
         """
 
+class IStorageTransactionMetaDataRaw(IStorageTransactionMetaData):
+    """Provide storage transaction meta data in raw form.
+
+    IStorageTransactionMetaDataRaw is additional interface which may
+    be optionally provided by an instance of IStorageTransactionMetaData.
+    If so, it provides information about raw form of transaction's meta data as
+    saved on storage.
+
+    NOTE as IStorageTransactionMetaData already provides ``user`` and
+    ``description`` as raw bytes, not text, IStorageTransactionMetaDataRaw
+    amends it only for ``extension`` part.
+    """
+
+    extension_bytes = Attribute("Transaction's extension data as raw bytes.")
+
 
 class IStorage(Interface):
     """A storage is responsible for storing and retrieving data of objects.
diff --git a/src/ZODB/tests/IteratorStorage.py b/src/ZODB/tests/IteratorStorage.py
index d87623f72..8a4e82b9b 100644
--- a/src/ZODB/tests/IteratorStorage.py
+++ b/src/ZODB/tests/IteratorStorage.py
@@ -22,6 +22,7 @@
 from ZODB.tests.MinPO import MinPO
 from ZODB.tests.StorageTestBase import zodb_pickle, zodb_unpickle
 from ZODB.utils import U64, p64, load_current
+from ZODB.interfaces import IStorageTransactionMetaDataRaw
 from ZODB._compat import loads
 
 import ZODB.blob
@@ -52,7 +53,7 @@ class IteratorStorage(IteratorCompare):
 
     # override this in tests for storages that have support for
     # .extension_bytes in iterated transactions.
-    supportsExtensionBytes = False
+    supports_extension_bytes = False
 
     def checkSimpleIteration(self):
         # Store a bunch of revisions of a single object
@@ -101,7 +102,9 @@ def checkTransactionExtensionFromIterator(self):
             # implementations to erroneously return None in .extension_bytes .
             missing = object()
             ext_bytes = getattr(txn, "extension_bytes", missing)
-            self.assertEqual(self.supportsExtensionBytes, ext_bytes is not missing)
+            self.assertEqual(self.supports_extension_bytes, ext_bytes is not missing)
+            self.assertEqual(self.supports_extension_bytes,
+                    IStorageTransactionMetaDataRaw.providedBy(txn))
             if ext_bytes is not missing:
                 ext_dict_fromraw = loads(ext_bytes)
                 self.assertEqual(ext_dict_fromraw, ext_dict)
@@ -221,6 +224,7 @@ class IteratorDeepCompare(object):
 
     def compare(self, storage1, storage2):
         eq = self.assertEqual
+        missing = object()
         iter1 = storage1.iterator()
         iter2 = storage2.iterator()
         for txn1, txn2 in zip(iter1, iter2):
@@ -229,8 +233,8 @@ def compare(self, storage1, storage2):
             eq(txn1.user,        txn2.user)
             eq(txn1.description, txn2.description)
             eq(txn1.extension,  txn2.extension)
-            rawext1 = getattr(txn1, "extension_bytes", None)
-            rawext2 = getattr(txn2, "extension_bytes", None)
+            rawext1 = getattr(txn1, "extension_bytes", missing)
+            rawext2 = getattr(txn2, "extension_bytes", missing)
             eq(rawext1, rawext2)
             itxn1 = iter(txn1)
             itxn2 = iter(txn2)
diff --git a/src/ZODB/tests/testFileStorage.py b/src/ZODB/tests/testFileStorage.py
index 7f3bc9933..90493fa16 100644
--- a/src/ZODB/tests/testFileStorage.py
+++ b/src/ZODB/tests/testFileStorage.py
@@ -56,7 +56,7 @@ class FileStorageTests(
     ReadOnlyStorage.ReadOnlyStorage
     ):
 
-    supportsExtensionBytes = True
+    supports_extension_bytes = True
 
     def open(self, **kwargs):
         self._storage = ZODB.FileStorage.FileStorage('FileStorageTests.fs',

NexediGitlab pushed a commit to Nexedi/zodbtools that referenced this pull request May 16, 2018
Since zodbdump start (c0a6299 "zodbdump - Tool to dump content of a
ZODB database   (draft)") and up till now zodbdump output format was not
good. For example user and description transaction properties were
output without proper quoting, which in situation when there would be
fancy characters in there would break the output.

So start the format stabilization:

- user and description are output as quoted, so now they are guaranteed
  to be on one line. The quoting character is always " (instead of e.g.
  smartly quoting either by ' or " as python does) for easier
  compatibility with ZODB implementations in other languages.

- transaction extension is now printed as raw bytes, not as dict.
  The idea here is that `zodb dump`

  * should perform dump of raw data as stored inside ZODB so that later
    `zodb restore` could restore the database identically to the same state.

  * we should dump raw data instead of unpickled ones because generally
    on-disk extension's format can be any raw bytes and this information
    should be preserved.

- transaction status is now also output as quoted to preserve line
  breakage on fancy status codes.

- it is documented that sha1 is not the only allowed hash function that
  might be used.

- in hashonly mode we add trailing " -" to obj string so that it is
  possible to distinguish outputs of `zodb dump` and `zodb dump -hashonly`
  without knowing a-priory the way it was produced.

  The reason to do so is that it would be not good to e.g. by accident
  feed hashonly output to (future) `zodb restore`, which, without having
  a way to see it should not consume object data would read following
  transaction information as raw object data with confusing later
  errors (and a small chance to restore completely different database
  without reporting error at all).

Because ZODB iteration API gives us already unpickled extension and only
that, for now to dump it as raw we get a long road to pickle it back
also caring to try to pickle in stable order.

Hopefully this will be only a fallback because of

zopefoundation/ZODB#183

and next zodbtools patch.

~~~~

For testing purposes (currently only quoting function unit test) py.test
usage is introduced.

The code is also generally polished here and there.
NexediGitlab pushed a commit to Nexedi/zodbtools that referenced this pull request May 16, 2018
…upport is available

This patch test at runtime whether used storage can provide transaction
metadata in raw form and if so used thi form directly without going
through long fragile way of stable pickling extension back to bytes.

Corresponding ZODB support is at

zopefoundation/ZODB#183
NexediGitlab pushed a commit to Nexedi/zodbtools that referenced this pull request May 16, 2018
We start to stabilize output format of `zodb dump`. It is actually now robust and the only thing I would contemplate to potentially change is to also cover transaction metadata by hash checksum. So please take a look at updated format (details in patch 1) to provide feedback because it is likely close to what it  will be in its final form.

We also add a program to generate test database which uses various fancy ZODB features and check `zodb dump` output on it to golden one (patch 3).

To be able to dump transaction metadata in raw form ZODB is patched a bit:

zopefoundation/ZODB#183

and we try to detect whether appropriate support is there at runtime and if yes use it to streamline obtaining transaction extension as raw (patch 2).
Pleae see patch 1 (second half of `zodbdump.py` about what has to be taken on without such support and that it still can't work fully reliably).

/cc @Nexedi
/reviewed-on https://lab.nexedi.com/nexedi/zodbtools/merge_requests/3
jmuchemb pushed a commit that referenced this pull request May 31, 2018
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becames:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

Conversion between 'extension' and 'extension_bytes' is automatic and lazy,
so that these attributes remain writable.

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
jmuchemb added a commit that referenced this pull request May 31, 2018
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becames:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

Conversion between 'extension' and 'extension_bytes' is automatic and lazy,
so that these attributes remain writable.

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
jmuchemb added a commit that referenced this pull request May 31, 2018
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becomes:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

Conversion between 'extension' and 'extension_bytes' is automatic and lazy,
so that these attributes remain writable.

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
jmuchemb added a commit that referenced this pull request Sep 3, 2018
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becomes:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

Conversion between 'extension' and 'extension_bytes' is automatic and lazy,
so that these attributes remain writable.

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
@navytux
Copy link
Contributor Author

navytux commented Dec 6, 2018

@jimfulton, @jamadden, it is sad to see no progress neither here, nor in #207. Personally I think this PR should be ok, but I would accept it to be any way. Whatever way it is the worst would be to remain with current status quo - to not being able to load raw data from the database at all.

Kirill

@jmuchemb
Copy link
Member

jmuchemb commented Dec 6, 2018

#207 is almost ready to be merged. I just have to decide. So busy currently.

NexediGitlab pushed a commit to Nexedi/zodbtools that referenced this pull request Jan 11, 2019
Currently we exercise zodbdump and zodbcommit+zodbdump with non-empty
extensions, which works if ZODB is patched for txn.extension_bytes
support, but fails on pristine ZODB.

Support for txn.extension_bytes cannot get into upstream ZODB for
more than a year:

zopefoundation/ZODB#183
zopefoundation/ZODB#207

and  even if it somehow will make it, it will likely be only in ZODB5,
while we still care to support ZODB4 and ZODB3.

Skipping zodbdump / zodbcommit tests, if a ZODB does not have
txn.extension_bytes support, would result in significant reduction of
zodbtools test coverage, because practically that is the current
situation with all upstream ZODB{3,4,5}. Dropping test coverage for
non-empty extensions is neither a good option.

For those reason, let's rework the tests and test both zodbdump and
zodbcommit with two scenarios:

1. on a test database where transactions extensions are always empty.
   This should work on all ZODB irregardless of whether
   txn.extension_bytes patch is there or not.

2. on a test database where transactions extensions are present.
   This should work if ZODB has txn.extension_bytes support, but if not,
   we can mark this case as xfail, since the failure is expected.

This way we make the testsuite pass irregardless of whether
txn.extension_bytes support is there, and we don't abandon dump/commit
testing coverage.

/helped-by Jérome Perrin <[email protected]>
NexediGitlab pushed a commit to Nexedi/zodbtools that referenced this pull request Jan 31, 2019
until zopefoundation/ZODB#183 gets merged, let's
run also the tests for this, since we have support for this extension.
jmuchemb added a commit that referenced this pull request Aug 29, 2019
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becomes:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

Conversion between 'extension' and 'extension_bytes' is automatic and lazy,
so that these attributes remain writable.

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
jmuchemb added a commit that referenced this pull request Aug 29, 2019
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becomes:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
jmuchemb added a commit that referenced this pull request Sep 23, 2019
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becomes:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
jmuchemb added a commit that referenced this pull request Sep 23, 2019
IStorage implementations used to do this task themselves which leads to code
duplication and sometimes bugs (one was fixed recently in NEO). Like for object
serialization, this should be done by the upper layer (Connection).

This commit also provides a way to get raw extensions data while iterating
over transactions (this is actually the original purpose[2]). So far, extension
data could only be retrieved unpickled, which caused several problems:

- tools like `zodb dump` [1] cannot dump data exactly as stored on a
  storage. This makes database potentially not bit-to-bit identical to
  its original after restoring from such dump.

- `zodb dump` output could be changing from run to run on the same
  database. This comes from the fact that e.g. python dictionaries are
  unordered and so when pickling a dict back to bytes the result could
  be not the same as original.

  ( this problem can be worked-around partly to work reliably for e.g.
    dict with str keys - by always emitting items in key sorted order,
    but it is hard to make it work reliably for arbitrary types )

Both issues make it hard to verify integrity of database at the lowest
possible level after restoration, and make it hard to verify bit-to-bit
compatibility with non-python ZODB implementations.

For this, TransactionMetaData gets a new 'extension_bytes' attribute and
and common usage becomes:

* Application committing a transaction:

  - 'extension' is set with a dictionary
  - the storage gets the bytes via 'extension_bytes'

* Iteration:

  - the storage passes bytes as 'extension' parameter of TransactionMetaData
  - the application can get extension data either as bytes ('extension_bytes')
    or deserialized ('extension'): in the former case, no deserialization
    happens and the returned value is exactly what was passed by the storage

[1] https://lab.nexedi.com/nexedi/zodbtools
[2] #183

Co-Authored-By: Kirill Smelkov <[email protected]>
@jamadden
Copy link
Member

I believe this was resolved with #207

@jamadden jamadden closed this Sep 26, 2019
@navytux navytux deleted the y/rawext branch July 7, 2020 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants