
PG Adapter Presently Only Fakes Streaming #251

Open
danfinlay opened this issue Feb 7, 2015 · 7 comments
@danfinlay
Contributor

I was processing a pretty big dataset the other day, and was naturally feeling pretty grateful that Model supports streaming results, until der_On and I discovered:

The PG Adapter Only Fakes Streaming!

[Image: my shocked face]

It inherits from EventedQueryProcessor, which in turn inherits from EventEmitter, and only fakes a Stream object, without the critical back-pressure handling and memory efficiency of actual streams!
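
For anyone following along, here's a minimal sketch of the distinction (not Model's actual code): an EventEmitter fires events as fast as the driver produces rows, while a real Readable only produces rows as the consumer drains them.

```js
var stream = require('stream');
var EventEmitter = require('events').EventEmitter;

// Fake streaming: 'data' events fire as fast as rows arrive.
// A slow consumer has no way to pause this, so rows pile up in memory.
var fake = new EventEmitter();
setImmediate(function () {
  for (var i = 0; i < 1000000; i++) fake.emit('data', { id: i });
  fake.emit('end');
});

// Real streaming: a Readable in objectMode produces the next row
// only when the consumer has drained the previous ones (back-pressure).
var n = 0;
var real = new stream.Readable({ objectMode: true });
real._read = function () {
  this.push(n < 1000000 ? { id: n++ } : null);
};
```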

You can see an example of how this is presently implemented here.

One good reason this may have been done originally is that the core pg module does not support streams.

Fortunately, there is a module, node-pg-query-stream, that does exactly what we need here.
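
For reference, this is roughly how node-pg-query-stream is used (connection string, table, and query here are made up): client.query() with a QueryStream returns a genuine Readable backed by a server-side cursor.

```js
var pg = require('pg');
var QueryStream = require('pg-query-stream');
var Writable = require('stream').Writable;

pg.connect('postgres://localhost/mydb', function (err, client, done) {
  if (err) { throw err; }

  // The QueryStream pulls rows from a cursor in small batches,
  // only as fast as the destination consumes them.
  var query = new QueryStream('SELECT * FROM users WHERE id > $1', [100]);
  var rows = client.query(query);

  // Any objectMode Writable/Transform works as a destination.
  var sink = new Writable({ objectMode: true });
  sink._write = function (row, enc, cb) {
    console.log(row);
    cb();
  };

  rows.on('end', done);
  rows.pipe(sink);
});
```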

I'm adding this issue to help guide efforts at adding proper streaming to the PG adapter, and am going to be taking a shot myself, although this will definitely be the deepest I've ever delved into Model, so no promises! (Only callbacks)

@mde
Contributor

mde commented Feb 7, 2015

So excited to see this happening!

@danfinlay
Contributor Author

Does the two-year-old streams branch have tests that might be useful here?

@mde
Contributor

mde commented Feb 7, 2015

I believe the only streaming tests we have are here: https://github.com/geddy/model/blob/master/test/integration/adapters/streaming.js

@danfinlay
Contributor Author

Since that's the base SQL adapter using the EventedQueryProcessor, it looks to me like the MySQL adapter isn't using real streams either, even though the mysql module natively supports streams.

One thing that seems a little tricky to me right now is how to swap out the individual database's streaming mechanisms.

I've also always wondered where exactly the SQL responses are converted into Model objects. Seems like it would be pretty cool to set up a through stream that converts the SQL responses into Model objects with no extra work...
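
Something like this sketch, maybe (User.create here just stands in for whatever Model actually uses internally to build an instance from a row):

```js
var Transform = require('stream').Transform;

// A through (Transform) stream that wraps each raw SQL row
// in a model instance as it flows past.
function toModels(Ctor) {
  var t = new Transform({ objectMode: true });
  t._transform = function (row, enc, cb) {
    cb(null, Ctor.create(row)); // hypothetical instantiation hook
  };
  return t;
}

// Usage: rowStream.pipe(toModels(User)).pipe(destination);
```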

@danfinlay
Contributor Author

Converting to real streams with the current architecture is a little strange; posting my thoughts here so others can chime in:

There is currently an innerProcessor that initially makes a minimal query to retrieve an array of IDs before querying for the full models and associations. My current considerations center on whether this is necessary for a streaming adapter, and if so, how it should properly be implemented.

A simple solution would be to make the initial query non-streaming and then stream back the full results, but if we're going to do streaming, we should do it really properly, and that means making no assumptions like "the array of IDs is a reasonable size to return without streaming and hold in memory".

One way would be for the initial query to stream back the IDs, in which case we have to decide how frequently to make the larger association requests.
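
As a strawman, the streamed IDs could be chunked into fixed-size batches, with one association query issued per batch. Everything here (batchIds, the batch size) is made up:

```js
var Transform = require('stream').Transform;

// Buffer streamed IDs into fixed-size batches so each follow-up
// "full models + associations" query stays bounded in memory.
function batchIds(batchSize) {
  var buf = [];
  var t = new Transform({ objectMode: true });
  t._transform = function (row, enc, cb) {
    buf.push(row.id);
    if (buf.length >= batchSize) {
      this.push(buf); // emit one batch of IDs downstream
      buf = [];
    }
    cb();
  };
  t._flush = function (cb) {
    if (buf.length) { this.push(buf); }
    cb();
  };
  return t;
}

// idStream.pipe(batchIds(500)) would feed a stage that runs
// SELECT ... WHERE id IN (...) for each batch and streams the rows out.
```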

My hope is that the initial query isn't necessary if using true streams. I'd be curious for someone who is familiar with this pre-query to chime in on its purpose. I swear you could do most of these SQL queries without a pre-query, but I may be missing something.

@danfinlay
Contributor Author

Oh, I see now: JOINs don't work with limit/offset, because LIMIT and OFFSET count joined rows rather than distinct parent records.

So basically we'd have to implement it both ways:

Requests without a limit or offset can stream directly, but requests with a limit or offset need the IDs first, and so the question of how often to query for subsequent results remains.

It almost makes me think that how many records to fetch at a time when streaming with a limit or offset could be a config variable or request option. That's a pretty obscure knob, but I wouldn't want to bury it as an assumption...
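
Purely hypothetical shape, since neither this option nor this exact call signature exists yet:

```js
// A per-request option with a sane default, rather than a
// hard-coded assumption buried in the adapter:
User.all({ location: 'NotThere' }, {
  limit: 100000,
  stream: true,
  streamBatchSize: 500 // IDs fetched per association query
});
```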

@danfinlay
Contributor Author

What if instead of making a pre-query, we made a subquery?

I'm pretty sure both Postgres and MySQL support this:

Instead of requesting IDs and then making a second request with an IN clause over that array, you simply include the first query inside the IN clause to begin with. There might be an extra step to select just the id field from that inner query, if you had to request other values to filter or sort on.
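
Roughly like this (table and column names invented). One caveat I'd want to verify: MySQL has historically rejected LIMIT inside an IN subquery, so it may need the derived-table wrapper noted at the bottom.

```sql
-- Instead of two round-trips:
--   1) SELECT id FROM users WHERE age > 21 ORDER BY age LIMIT 100 OFFSET 200;
--   2) SELECT ... WHERE users.id IN (<array from step 1>);
-- fold the first query into the IN clause:
SELECT users.*, profiles.*
FROM users
LEFT JOIN profiles ON profiles.user_id = users.id
WHERE users.id IN (
  SELECT id FROM users WHERE age > 21
  ORDER BY age LIMIT 100 OFFSET 200
)
ORDER BY users.age;

-- MySQL workaround: wrap the limited subquery in a derived table,
-- e.g. WHERE users.id IN (SELECT id FROM (...) AS page)
```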
