
Unflatten: Stream all unflattening. #377

Open
wants to merge 2 commits into base: 315-lower-memory-usage
Conversation

@kindly (Contributor) commented Feb 9, 2021

Unflatten: Stream all unflattening.

* Uses openpyxl read_only mode.
* Uses ZODB storage to save incoming data into buckets, based on the id_name
  in top-level objects.
* Any object without an id_name is given a random key.
* Runs unflatten separately on each of these buckets.
* For JSON, uses jsonstreams to stream output into both the result JSON and
  the cell_source_map JSON. These are the only two files that are likely to
  be large.
* For XML, uses lxml's xmlfile to stream the unflattened XML data (see the
  sketch below).

* Uses ijson.
* Uses openpyxl write_only mode.
* Stores sheet lines in an embedded BTree ZODB index.

#316
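A minimal sketch of the lxml xmlfile streaming named in the last bullet above; the element names and generate_releases() are illustrative assumptions, not the PR's actual code:

from lxml import etree

def stream_xml(filename, releases):
    # write one <release> element at a time, so the whole document is
    # never held in memory
    with etree.xmlfile(filename, encoding="utf-8") as xf:
        xf.write_declaration()
        with xf.element("releases"):
            for release in releases:
                element = etree.Element("release")
                element.text = str(release)
                xf.write(element)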
kindly requested a review from Bjwebb, February 9, 2021 20:52
@jpmckinney (Contributor) commented:

Nice! I hadn't seen the jsonstreams library before. In OCDS Kit, we use json.iterencode, generators and a custom encoder: https://github.com/open-contracting/ocdskit/blob/master/ocdskit/util.py#L24
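A rough sketch of that generator-streaming idea (a simplified stand-in, not the exact OCDS Kit code): wrapping a generator in a list subclass lets json.JSONEncoder.iterencode() walk it lazily instead of materializing the whole array first.

import json

class StreamingList(list):
    # looks like a list to the encoder, but iterates a generator lazily
    def __init__(self, generator):
        super().__init__()
        self.generator = generator

    def __iter__(self):
        return iter(self.generator)

    def __bool__(self):
        return True  # stop the encoder short-circuiting to "[]"

with open("output.json", "w") as f:
    # chunks are written as they are produced; the full array never exists
    for chunk in json.JSONEncoder().iterencode(StreamingList(iter(range(3)))):
        f.write(chunk)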

@kindly (Contributor, Author) commented Feb 10, 2021

@jpmckinney that approach is interesting and fairly straightforward. For this case I needed one iterator to write to two files (the result and the cell_source_map) and did not want to iterate over the data again, so having a generator for a list value would not have worked. jsonstreams also handles streaming objects, not just arrays.
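A hypothetical sketch of that one-pass, two-file pattern; unflatten_buckets() and the data shapes are made up for illustration:

import jsonstreams

def unflatten_buckets():
    # stand-in for the real unflattening pass
    yield {"id": "1"}, {"main/0/id": {"sheet": "main", "cell": "A2"}}

with jsonstreams.Stream(jsonstreams.Type.object, filename="result.json", indent=4) as result, \
        jsonstreams.Stream(jsonstreams.Type.object, filename="cell_source_map.json", indent=4) as source_map:
    with result.subarray("main") as main:
        for item, cells in unflatten_buckets():
            main.write(item)  # the unflattened item goes to result.json
            for path, cell in cells.items():
                source_map.write(path, cell)  # its cell metadata goes to cell_source_map.json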

I think jsonstreams could be faster, though, so I was thinking of replacing its encoder with ujson, which would be a bit of a hack but seems possible. However, the profile data shows that the work is fairly evenly split between persisting, doing the actual work, and writing, so optimizing just one of these will not make much difference.

Interestingly, just using PyPy instead of CPython raises the memory usage (for my test case, to ~250MB from ~150MB) but more than doubles the speed (to ~50sec from ~110sec). I think the increase in memory is just PyPy overhead, though, and should grow less than proportionally with the size of the file. It also lowers the "doing the work" section's share of the runtime, to ~1/5 from about ~1/3. Without persistence but with streaming, PyPy takes 35sec/900MB and CPython 90sec/700MB.

So overall I think we should encourage PyPy where speed is needed; it most likely makes a larger difference than any set of optimizations in the code itself.

@jpmckinney (Contributor) commented:

What's the difference between persisting and writing? I/O will be the ultimate bottleneck, but optimizing the rest can make a difference, like using PyPy. In my experience (and going by ujson's own metrics), orjson is the fastest encoder. OCDS Kit uses orjson for encoding if it's available, and otherwise uses the standard library: https://github.com/open-contracting/ocdskit/blob/master/ocdskit/util.py#L74-L92
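A simplified stand-in for that fallback pattern (the real version is at the linked URL):

import json

try:
    import orjson

    def json_dumps(data):
        # orjson returns bytes, so decode for callers expecting str
        return orjson.dumps(data).decode()
except ImportError:
    def json_dumps(data):
        return json.dumps(data)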

@kindly (Contributor, Author) commented Feb 12, 2021

Persisting is getting the original data from a spreadsheet onto disk initially, so that it can be sorted by top-level id (release.id, or ocid in the OCDS case). We cannot stream this step, as the same top-level ids exist across sheets.
Writing is then actually writing out the JSON/XML file at the end.
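A simplified sketch of that persisting step, using the ZODB BTree index mentioned in the PR description; iter_sheet_rows() and the storage layout are assumptions for illustration:

import uuid

import transaction
from BTrees.OOBTree import OOBTree
from persistent.list import PersistentList
from ZODB import DB
from ZODB.FileStorage import FileStorage

def persist_rows(iter_sheet_rows, id_name="ocid"):
    # group rows from all sheets on disk by top-level id, so each bucket
    # can later be unflattened on its own
    connection = DB(FileStorage("buckets.fs")).open()
    root = connection.root()
    root["buckets"] = buckets = OOBTree()
    for sheet, row in iter_sheet_rows():
        # objects without the id get a random key, as in the PR description
        key = row.get(id_name) or str(uuid.uuid4())
        if key not in buckets:
            buckets[key] = PersistentList()
        buckets[key].append((sheet, row))
    transaction.commit()
    return buckets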

Yes, orjson is faster, but it will not be compatible with jsonstreams as it produces bytes rather than str. So I was not sure it would be faster than ujson if you then have to convert to str before putting it into jsonstreams. orjson also does not support PyPy and seems harder to compile for Windows users.

@kindly (Contributor, Author) commented Feb 12, 2021

Actually, I tried changing the persistence backend from ZODB to LevelDB and have a private branch for that. It was faster for both flattening and unflattening, and it used orjson to encode and decode the values stored in LevelDB. LevelDB also accepts only bytes for its keys and values.

The reason for not using it, though, was that LevelDB is also hard to compile for Windows, it does not support an in-memory mode, and it would probably be slower if you needed to convert everything to bytes when using the normal JSON encoder.
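A small sketch of that bytes-only pairing; plyvel is one common Python LevelDB binding, though the branch's actual binding isn't stated:

import orjson
import plyvel

db = plyvel.DB("buckets.ldb", create_if_missing=True)
# LevelDB keys and values must be bytes, which matches orjson's bytes output
db.put(b"ocid-1", orjson.dumps({"sheet": "main", "row": {"id": "ocid-1"}}))
row = orjson.loads(db.get(b"ocid-1"))
db.close()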

@Bjwebb (Member) left a comment

toxml in xml_output.py can now be removed, as it's been replaced by your new functions. (I couldn't comment on it directly as it's not in the diff.)

As with the flatten PR, some high-level comments could be good.

index = 0

for sheet, rows in self.get_sub_sheets_lines():
    for row_numbar, row in enumerate(rows):
@Bjwebb (Member) commented:

row_numbar should be row_number

@@ -179,7 +187,103 @@ def decimal_default(o):
raise TypeError(repr(o) + " is not JSON serializable")


# This is to just to make ensure_ascii and default are correct for streaming library
@Bjwebb (Member) commented:

I think this doesn't quite grammar correctly.

Possible change:

This is to just to make sure ensure_ascii and default are correct for streaming library

@@ -502,10 +587,12 @@ def fancy_unflatten(self, with_cell_source_map, with_heading_source_map):
return result, ordered_cell_source_map, heading_source_map


def extract_list_to_error_path(path, input):
def extract_list_to_error_path(path, input, index=None):
@Bjwebb (Member) commented:

I think this function is a bit confusing. The actual implementation is fine once I got my head around it, so maybe it just requires a bit of explanation.

@@ -1,24 +1,24 @@
[
{
@Bjwebb (Member) commented:

I don't like the indentation mismatch here between this opening { and the closing one. All the other examples look okay, so I think this has to do with having a list at the root.

@Bjwebb (Member) commented Mar 5, 2021:

Looks like this is a jsonstreams bug, I can reproduce with:

import jsonstreams

with jsonstreams.Stream(jsonstreams.Type.array, filename='test.json', indent=4) as s:
    s.write({"a": 1, "b": 2})

I will report this upstream. Happy to see this merged as is.
