This repository has been archived by the owner on Jan 22, 2022. It is now read-only.

JSON list as Python generator? #24

Open
Mec-iS opened this issue Feb 23, 2016 · 4 comments

Comments

@Mec-iS
Contributor

Mec-iS commented Feb 23, 2016

I am gathering information on whether we could use a generator instead of loading the full JSON into memory when the manager is called.

Possible algorithm:

  • the manager loads the JSON and creates two generators: one kept as a blueprint, the other consumed by each filtering operation
  • filtering, or any other action, consumes the generator
  • the function returns the resulting filtered output
  • a new generator is copied from the blueprint to serve the next operation
  • optional: create an index (a dictionary mapping JSON value to JSON position in the array) for subsequent function calls, or instead keep a copy of the generator in memory (to avoid rebuilding the generator on every filtering call, see caveat below).
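
The steps above can be sketched with itertools.tee(). This is only an illustration under stated assumptions: BlueprintManager and its filter() method are hypothetical names, not lifter's API, and the sample data is made up.

```python
import itertools


class BlueprintManager:
    """Hypothetical manager used only for illustration; not lifter's API."""

    def __init__(self, iterable):
        # The blueprint is an iterator that is never consumed directly.
        self._blueprint = iter(iterable)

    def filter(self, predicate):
        # tee() splits the blueprint into two independent iterators:
        # one becomes the new blueprint, the other is consumed right away.
        self._blueprint, working = itertools.tee(self._blueprint)
        return [item for item in working if predicate(item)]


# Made-up sample data standing in for objects from a JSON list.
manager = BlueprintManager({"id": i} for i in range(5))
even = manager.filter(lambda obj: obj["id"] % 2 == 0)
large = manager.filter(lambda obj: obj["id"] > 2)
```

Both calls see the full data set. The cost is that once the working branch has been consumed to the end, tee() has buffered every item for the blueprint branch, so memory use ends up comparable to a list().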

I couldn't find any memory- or CPU-friendly way in the standard library to clone or deep-copy a generator. The only candidate is itertools.tee(), but it seems to have downsides for our use case:

  • This answer here points out that creating the generator twice is CPU-intensive, while dumping a copy into a list() can be better if you expect to consume the generator to the end
  • Consider these three cases
  • See the snippet here for creating an iterable class
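
The linked snippet is not reproduced here. As a rough idea of what an iterable class can look like, here is a minimal sketch; the class name ReiterableItems and the sample data are assumptions made for illustration. The point is that __iter__ returns a new generator on every call, so the object can be looped over more than once.

```python
class ReiterableItems:
    """Hypothetical iterable wrapper: every iter() call starts a new pass."""

    def __init__(self, items):
        self._items = items

    def __iter__(self):
        # Being a generator function, __iter__ hands out a fresh
        # generator each time the object is iterated.
        for item in self._items:
            yield item


objects = ReiterableItems([{"id": 1}, {"id": 2}])
first_pass = [obj["id"] for obj in objects]
second_pass = [obj["id"] for obj in objects]
```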

Does this sound like a good idea?

@agateblue
Owner

I just pushed a release (0.2) an hour ago that implements lazy querysets (they will probably get some improvements soon). You can also pass generators to the manager, and they will only be iterated when the queryset data is accessed (note that the resulting data will still be held in memory, though). I think it partially addresses your issue, at least the part about memory usage.

However, once the generator is consumed, lifter won't be able to consume it again.
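
The single-use limitation is plain Python behaviour and can be seen without lifter. In this small sketch, json_objects() is a made-up stand-in for objects decoded from a JSON list:

```python
def json_objects():
    # Stand-in for objects decoded from a JSON list.
    for i in range(3):
        yield {"id": i}


gen = json_objects()
first_pass = [obj["id"] for obj in gen]
# The generator is now exhausted, so a second loop yields nothing.
second_pass = [obj["id"] for obj in gen]
```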

The solution that comes to mind is to allow passing a callable to load(). When the time comes to filter the values, the callable will return a generator. Example:

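def return_json_generator():
    # Build and return a fresh generator on each call
    return generator

# Proposed usage: pass the callable itself, not the result of calling it
manager = lifter.load(return_json_generator)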

This seems easier to implement than the blueprint you suggested.
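
To make the idea concrete, here is a minimal sketch of how a manager could accept a callable. CallableManager is a hypothetical stand-in written for this example, not lifter's implementation, and the sample data is made up.

```python
class CallableManager:
    """Hypothetical stand-in for a manager; not lifter's implementation."""

    def __init__(self, source):
        # source is either a zero-argument callable returning an iterable,
        # or a plain iterable.
        self._source = source

    def _values(self):
        # Calling the source each time yields a fresh, unconsumed generator.
        return self._source() if callable(self._source) else self._source

    def filter(self, predicate):
        return [item for item in self._values() if predicate(item)]


def return_json_generator():
    # Made-up data standing in for objects from a JSON list.
    return ({"id": i} for i in range(5))


manager = CallableManager(return_json_generator)
first = manager.filter(lambda obj: obj["id"] < 2)
second = manager.filter(lambda obj: obj["id"] >= 3)
```

Because the callable is invoked again for every operation, the second filter still sees all the items, at the price of rebuilding the generator each time.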

Once #25 is fixed, performance will also improve (the generator will only be looped over once, regardless of the number of filters/excludes applied).

I'm not really fond of the index, at least for now: the package is still in alpha, and I'd rather not reinvent a whole database system at this point. Also, in your situation, I think any effort you put into reducing the memory footprint of your queries will be wasted if you need to keep an index of your whole dataset in memory.

@Mec-iS
Contributor Author

Mec-iS commented Mar 10, 2016

I was thinking of something like:

#
# pseudocode
#

import copy

def create_generator(json_list):
    for object in json_list:
        yield object

generator = create_generator(JSON)
generator_copy = copy.deepcopy(generator)
while True:
    result = filter(next(generator_copy))

This way you can save memory by using a generator for all the filtering operations you apply. The manager creates the generator; each time a filter operation is required, a deep copy of the generator is made and consumed.
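
One caveat with the pseudocode above: in CPython, copy.deepcopy() does not work on generator objects and raises a TypeError. The following sketch, using made-up sample data, shows the behaviour:

```python
import copy


def create_generator(json_list):
    for item in json_list:
        yield item


generator = create_generator([{"id": 1}, {"id": 2}])

try:
    copy.deepcopy(generator)
    deepcopy_worked = True
except TypeError:
    # Generator objects cannot be copied or pickled.
    deepcopy_worked = False
```

Any real implementation would therefore need another way to get a fresh iterator, such as itertools.tee() or rebuilding the generator from its source.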

@agateblue
Owner

Yes, that's exactly it. The only difference is that you won't even need to deepcopy the generator: instead, you pass the manager a callable that returns a generator, and the manager calls this function whenever it needs a ready-to-loop generator or iterable.

The main advantage over your proposal is that you can call the manager a thousand times if you want, without providing a different copy each time, and it will still work. With your example, after you run result = filter(next(generator_copy)), you would have to feed your manager another copy, which is not really convenient.

@Mec-iS Mec-iS closed this as completed Mar 11, 2016
@agateblue
Owner

I'll leave this open since I still need to implement the callable feature ;)

@agateblue agateblue reopened this Mar 11, 2016