[QUESTION] How should I paginate my data using DataLoader? #231

Open
Tracked by #297
lffg opened this issue Dec 5, 2019 · 20 comments

lffg commented Dec 5, 2019

I have the following query:

{
  user(id: $id) {
    username
    comments(last: 10) { # How can I handle pagination here?
      id
      body
      author {
        username
      }
      replies(last: 3) { # How can I handle pagination here?
        id
        body
      }
    }
  }
}

How can I handle pagination in those cases? Should I use DataLoader?

Until now, everything I have found on the web suggests using the splice array method, but if I use it my server will eventually run out of memory; the pagination should happen in the SQL query itself, right?

So how am I able to resolve such a GraphQL query against my database? Do you handle these cases in any of your projects?

I am really struggling with this. Is it even possible to use DataLoader in this case? And if I don't use it, my server will suffer badly from the N+1 problem...

@CreatCodeBuild

One question @lffg: are you already using DataLoader in your project? DataLoader has no concept of a database. It's only a utility that maps many calls to a function f(x) into a single call of f2(Array<x>).

Therefore, for the replies case, you just need to implement a batch function f2 that takes an array of (comment id, reply page) pairs and then runs the SQL manually inside that f2.
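A minimal sketch of that shape in TypeScript, with a hypothetical fetchRepliesForComments standing in for the single SQL round-trip (concrete SQL strategies appear in the later comments of this thread):

import DataLoader from 'dataloader';

// Hypothetical key shape: which comment, and how many replies to take.
interface RepliesKey {
  commentId: number;
  last: number;
}

interface Reply {
  id: number;
  body: string;
}

// Stand-in for the single batched SQL query (UNION ALL, LATERAL JOIN,
// or window functions; all three appear later in this thread).
declare function fetchRepliesForComments(
  keys: readonly RepliesKey[],
): Promise<Map<number, Reply[]>>;

// f2: one call that serves a whole batch of f(x) calls.
async function batchReplies(keys: readonly RepliesKey[]): Promise<Reply[][]> {
  const repliesByCommentId = await fetchRepliesForComments(keys);
  return keys.map((key) => repliesByCommentId.get(key.commentId) ?? []);
}

const repliesLoader = new DataLoader<RepliesKey, Reply[], string>(batchReplies, {
  // Keys are objects, so give DataLoader a stable string cache key.
  cacheKeyFn: (key) => `${key.commentId}:${key.last}`,
});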


koistya commented Dec 6, 2019

@lffg Most likely there won't be any UI screens where you would need to fetch the list of users alongside their comments (thinking of user profile page for example, where you would need to fetch comments just for this specific user).

If that's true, you may want to move the "comments" field away from the "user" object higher to the top-level (root) object type.

You may want to introduce a convention in your API design stating that the top-level fields must fetch data directly (from database), instead of using data loaders.

The GraphQL query for fetching a user + his comments may look like this:

query {
  user(username: "john_wick") { ... }
  comments(author: "john_wick", first: 50, after: 100) { ... }
}

# OR, if you like to fetch data by user ID

query {
  user: node(id: "123") { ... on User { ... } }
  comments(authorId: "123", first: 50, after: 100) { ... }
}

Note that the same top-level comments field can be reused for all kinds of scenarios that the API needs to support, for example:

query {
  story(slug: "hello-world-ae3c21") { ... }
  comments(story: "hello-world-ae3c21") { ... }
}

And here is an example of the comments field implementation supporting pagination:

import { GraphQLNonNull, GraphQLString } from 'graphql';
import {
  connectionDefinitions,
  forwardConnectionArgs,
  connectionFromArraySlice,
  cursorToOffset,
} from 'graphql-relay';

import db from '../db';
import { countField } from '../utils';
import { CommentType } from '../types';

export const comments = {
  type: connectionDefinitions({
    name: 'Comment',
    nodeType: CommentType,
    connectionFields: { totalCount: countField },
  }).connectionType,

  args: {
    ...forwardConnectionArgs,
    story: { type: GraphQLString },
    author: { type: GraphQLString },
  },

  async resolve(root, args, ctx) {
    const query = db.table('comments');

    // Optionally, filter by story slug
    if (args.story) {
      query.leftJoin('stories', 'stories.id', 'comments.story_id');
      query.where('stories.slug', '=', args.story);
    }

    // Optionally, filter by author
    if (args.author) {
      query.leftJoin('users', 'users.id', 'comments.author_id');
      query.where('users.username', '=', args.author);
    }

    const limit = args.first === undefined ? 20 : args.first;
    const offset = args.after ? cursorToOffset(args.after) + 1 : 0;

    const data = await query
      .clone()
      .limit(limit)
      .offset(offset)
      .select('comments.*');

    return {
      ...connectionFromArraySlice(data, args, {
        sliceStart: offset,
        arrayLength: offset + data.length,
      }),
      query,
    };
  },
};
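For context, countField above comes from a local utils module that isn't shown. A minimal version of such a helper, an assumption rather than the starter kit's exact code, could run a COUNT(*) on the query object that the resolver spreads into the connection:

import { GraphQLInt, GraphQLNonNull } from 'graphql';

// The resolver above passes the filtered (but un-paginated) `query` along
// inside the connection object, so totalCount can be derived lazily.
export const countField = {
  type: new GraphQLNonNull(GraphQLInt),
  async resolve(connection: { query: any }) {
    const [row] = await connection.query
      .clone()
      .clearSelect()
      .count('* as count');
    return Number(row.count);
  },
};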

See also https://github.com/kriasoft/relay-starter-kit -> api/queries.


lffg commented Dec 6, 2019

That query was just an example... I don't have any page that needs it. I am just trying to experiment with pagination through DataLoader, still with no success.

Nevertheless, thanks @koistya, changing the query might help, although I think I lose some of GraphQL's power by doing so.

If anybody can help me with further information or advice, I would appreciate it.


klausXR commented Jul 26, 2020

I read through the discussion, but it did not become clear to me what the preferred practice would be. I read the readme carefully and found this section, so it appears we are meant to (I do realize this is not a "you can only implement it like so" scenario) fetch the ids first, ordered accordingly, and then fire another request for the actual records with the corresponding ids, which plays well with their Ent pattern. It is not clear to me, however, how the queryLoader is supposed to be implemented. I'm using Sequelize as an ORM, which uses symbols for operators, like [Op.in] for an IN clause, and those don't serialize well. But even apart from that, it is not clear to me how this is supposed to be implemented with another ORM or a query builder.


onderonur commented Aug 16, 2020

I have a little solution for pagination with DataLoader. It's not that advanced, but it gets the job done for me for now.
I have a structure like this:

{
  posts(first: $first, after: $after) {
    id
    title
    content
    comments(last: 10) {
      totalCount
      pageInfo {
        hasNextPage
        endCursor
      }
      edges {
        id
        content
      }
    }
  }
}

So, let's say we have a "Feeds" page like Twitter. It loads posts and shows the comment counts of each post, maybe likes/dislikes, etc. To simplify, let's just say we need the totalCount of each post's comments. If we don't use DataLoader, we will query the db for each post's comment count; if we list 20 posts, there will be 20 commentCount queries, as you know. We can create a loader like commentCountByPostId and use it. It's actually a pretty simple GROUP BY SQL query.
We saved a lot of db round-trips for counting comments, reactions, etc., and each of them costs only one query (a sketch of such a count loader follows below).
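A minimal sketch of that commentCountByPostId loader, assuming Knex and hypothetical table/column names:

import DataLoader from 'dataloader';
import db from '../db'; // a Knex instance (assumption)

// One GROUP BY query answers "how many comments does each post have?"
// for every post id collected in the current tick.
const commentCountByPostId = new DataLoader<number, number>(async (postIds) => {
  const rows = (await db('comments')
    .whereIn('post_id', postIds as number[])
    .groupBy('post_id')
    .select('post_id')
    .count('* as count')) as { post_id: number; count: string }[];

  const counts = new Map(rows.map((r) => [r.post_id, Number(r.count)]));
  return postIds.map((id) => counts.get(id) ?? 0);
});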
But let's say we need the last 10 comments of each post. If we don't use DataLoader, there will be 20 queries, one per post, as you would expect. So how can we use DataLoader here?
What I do is use the object-hash npm package. It creates a hash string from any object, and if two objects are deeply equal their hashes are equal too. I use this hash as my DataLoader cache key. Now we have our parameters array, and hashes to map those parameters to the loader. But how do we implement the SQL query?
This is where it gets a little fuzzy. We have multiple pagination queries and we want to run them with one db hit. What I do is loop over the pagination parameters array, create a raw SQL query string for each of them (using the sql utility function from @prisma/client), and merge these queries with UNION ALL. For each query I also add a cacheKey column showing which cacheKey that row belongs to. It's like this:

const queries = argsArray.map(args => {
    return sql`
        SELECT *, ${objectHash(args)} AS cacheKey
        ...
    `
})

So in the end, we have a query like this:

SELECT *, 'qwerty' as cacheKey
....
UNION ALL
SELECT *, 'asdfg' as cacheKey
.....
UNION ALL
....

When we get the result, I just map the rows to their cacheKeys and return the result to DataLoader (sketched below).
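A sketch of that final mapping step, assuming rows is the flattened UNION ALL result and argsArray is the batch of loader keys:

import objectHash from 'object-hash';

// Whatever the SELECT returned, plus the cacheKey column we added.
type Row = { cacheKey: string } & Record<string, unknown>;

// Regroup the flat rows, then answer in the same order DataLoader
// handed us the keys.
function mapRowsToKeys(rows: Row[], argsArray: readonly object[]): Row[][] {
  const byCacheKey = new Map<string, Row[]>();
  for (const row of rows) {
    const bucket = byCacheKey.get(row.cacheKey);
    if (bucket) bucket.push(row);
    else byCacheKey.set(row.cacheKey, [row]);
  }
  return argsArray.map((args) => byCacheKey.get(objectHash(args)) ?? []);
}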

This method helps me reduce db hits: one for the posts, one for the comment counts, and one for the paginated comments.

I couldn't find an ORM that supports a query like this; I ended up using raw SQL after all my trials. That is a little downside of this method. And even though this is a simple scenario, it can get messy for complex ones.

I'm not a deep-dive GraphQL expert, but this should be a pretty common problem. Yet, let alone answers, there are nearly no questions around; only a few topics and issues. So I don't know if people just don't care how many times they query the db for a simple list, or whether they use some other caching/batching/query method. This issue is one of the biggest problems with GraphQL APIs for me.

@onderonur

A little update: Using UNION ALL and writing SQL queries by hand works fine, but as you can guess it adds a lot of complexity and isn't that flexible. There is an answer about this kind of stuff here that explains everything very well; it even mentions the UNION ALL solution. It links to a Twitter thread, and that's just enlightening.
Over-optimizing and creating complexity is hard to maintain, even if you have nice abstractions.
So, I'll try to keep things simple now. For example, I'll separate totalCount from connections like:

{
    commentsCount # or commentCount, that's up to you
    comments {
        edges { ... }
        pageInfo { ... }
    }
}

If you look at the schema of withspectrum/spectrum or the Admin API of Shopify, they do this too. Putting pagination and counts in exactly the same place and trying to optimize everything increases complexity. Even Relay's connection spec doesn't force you to keep totalCount inside your connections, and even if it did, you could always create your own solution/spec.
Second, let's say we need to list some posts and show some of the comments under each of them. Instead of making clients query for those comments, we can add another field to Post, like featuredComments, and easily query the last 5 comments for each postId using DataLoader. The idea is from the Twitter thread mentioned above: create specialized fields for your client apps and keep things simple (a sketch follows below).
Trying to create too much flexibility brings too much complexity. And complexity decreases flexibility very badly.
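A minimal sketch of that featuredComments idea; names are hypothetical, and the fixed "last 5" lives inside the batch function, so plain post ids work as loader keys:

import DataLoader from 'dataloader';

interface Comment { id: number; content: string; }

// Hypothetical single-round-trip fetch: last 5 comments per post id
// (e.g. via the UNION ALL or LATERAL JOIN tricks discussed in this thread).
declare function lastCommentsForPosts(
  postIds: readonly number[],
  limit: number,
): Promise<Map<number, Comment[]>>;

const featuredCommentsLoader = new DataLoader<number, Comment[]>(async (postIds) => {
  const byPost = await lastCommentsForPosts(postIds, 5);
  return postIds.map((id) => byPost.get(id) ?? []);
});

// Resolver: no pagination arguments for clients to thread through.
const resolvers = {
  Post: {
    featuredComments: (post: { id: number }) => featuredCommentsLoader.load(post.id),
  },
};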


klausXR commented Jan 9, 2021

Note: Consider this more of a brainstorming idea than an actual solution, and do not use it in your own project before investigating the performance impact of the queries.

I would also suggest reading the previous comment; after some digging into this issue, I found that the best solution is to not have to do this at all.

Still, should you end up in a position where you need to accomplish this, the information below might be of some value to you.

You may be able to build the query with an ORM or a query builder (I'm using Sequelize, and its syntax for this query is more complex than the raw SQL counterpart); the example assumes some SQL familiarity.

The example I will be using assumes that you want to fetch and limit/order the comments for a post.

Do also note, the example doesn't really solve any real-world scenario.

post(id: 1) {
  title
  content
  comments(first: 24, orderBy: "createdAt", sortOrder: "desc") {
     text
     author {
       name
     }
  }
}

We will be using two database queries to retrieve the data.

Query 1 - Load the commentIds for the specific post, ordered/limited accordingly
Query 2 - Load the comment entities by using a commentById.loadMany(commentIds)

I will not be covering DataLoader Factory patterns ( there are multiple ways of abstracting this, but we will be using the simplest way possible ).

This example assumes the following database tables

posts - id, title, content
postComments - id, postId, authorId, text

const commentLoader = new Dataloader(...) // Fetches comments by their id's

const postCommentsLoader = new Dataloader(async postIds => {
    const { rows } = await knex.raw(`
        SELECT 
            ARRAY (
                SELECT pc.id FROM "postComments" pc 
                WHERE pc."postId" = p.id 
                ORDER BY pc."order" DESC -- Custom ordering
                LIMIT 24 -- Custom limiting
            ) AS "postCommentIds", 
        p.id AS "postId" 
        FROM "posts" p
        WHERE p.id IN(${postIds.join(',')}) -- Use variables instead, this is prone to sql injection
    `)

    /**
     * Objects in `rows` have this format:
     * [
     *     {
     *         postId: 1,
     *         postCommentIds: [3, 2, 1] // Comment ids
     *     },
     *     ...
     * ]
     */

    const commentIdsByPostId = rows.reduce((acc, { postId, postCommentIds }) => {
        acc[postId] = postCommentIds

        return acc
    }, {})

    return postIds.map(postId => commentIdsByPostId[postId])
})

const resolvers = {
    Post: {
        async comments(post, args, context) {
            const postCommentIds = await context.postCommentsLoader.load(post.id)
            const comments = await context.commentLoader.loadMany(postCommentIds)

            return comments
        }
    }
}

If you think that this is quite some boilerplate - you are correct. Most of this can be abstracted with smarter code organization patterns, but it's still not ideal.

For example, in my application, this can be translated to something similar to this

const commentsById = createDataloader("postComments")
const postCommentsByPostId = createDataloader({
    queryFn: postIds => knex.raw(`sql`).then(({ rows }) => rows),
    key: "postId",
    rowToValue: ({ postComments }) => postComments,
    pipeThrough: postCommentIds => commentsById.loadMany(postCommentIds)
})

const resolvers = {
    Post: {
        async comments(post, args, context) {
            // The chaining is done via the `pipeThrough`, so we must issue a single call here
            const comments = await context.postCommentsByPostId.load(post.id)

            return comments
        }
    }
}

If you read carefully, however, you'd see that this example does not use runtime ordering or filtering. The limit and order parameters are hardcoded inside the query.

After the digging mentioned in the beginning, I have come to the conclusion (like others) that it would be better to provide a specialized query (like the one mentioned in that Twitter thread from the comment above) instead, and that specialized query is the query in my example.

So, what would the solution be and what exactly does a specialized query mean? I interpreted it like so ( albeit the chosen example is not the best fit ).

If you have two ( or more ) use-cases in your application where you need to fetch a different number of comments, you can create different fields for them, with different resolvers that have hard-coded arguments.

So, instead of our initial example with last/order/sortOrder variables, we would have

Post {
  comments: [Comment!]! # last: 24 orderBy: "createdAt" sortOrder: "DESC"
  recentComments: [Comment!] # last: 12 orderBy: "createdAt" sortOrder: "DESC"
}

That way, instead of one field with dynamic variables, you have two fields with hard-coded variables.
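A sketch of what those two hard-coded fields could look like behind the scenes; createPostCommentsLoader is a hypothetical factory that bakes the limit/ordering into each loader's SQL, so the batch keys are plain post ids again:

import DataLoader from 'dataloader';

interface Comment { id: number; text: string; }

// Hypothetical factory: one loader per specialized field.
declare function createPostCommentsLoader(opts: {
  limit: number;
  orderBy: string;
  sort: 'ASC' | 'DESC';
}): DataLoader<number, Comment[]>;

const commentsLoader = createPostCommentsLoader({ limit: 24, orderBy: 'createdAt', sort: 'DESC' });
const recentCommentsLoader = createPostCommentsLoader({ limit: 12, orderBy: 'createdAt', sort: 'DESC' });

const resolvers = {
  Post: {
    comments: (post: { id: number }) => commentsLoader.load(post.id),
    recentComments: (post: { id: number }) => recentCommentsLoader.load(post.id),
  },
};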

If you still need to use the single field method with dynamic variables, according to my findings you can use the union all technique from the previous comment or a modified version of the example above.

Since the only way to provide variables to a Dataloader instance is through the arguments of the load/loadMany methods, a possible solution would be to pass not ids, but objects (similar to the previous solution).

postCommentsByPostId.load({
  postId: 1,
  order: `${args.orderBy} ${args.sortOrder}`,
  limit: args.first,
})
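One caveat worth noting here (my addition, not from the original comment): DataLoader's default cache compares keys with ===, so two deeply-equal objects count as different keys. When keys are objects like this, pass a cacheKeyFn; object-hash, as @onderonur uses above, works well:

import DataLoader from 'dataloader';
import objectHash from 'object-hash';

interface PostCommentsKey { postId: number; order: string; limit: number; }

// The batch function is sketched in the next snippet.
declare function batchPostCommentIds(
  keys: readonly PostCommentsKey[],
): Promise<number[][]>;

const postCommentsByPostId = new DataLoader<PostCommentsKey, number[], string>(
  batchPostCommentIds,
  { cacheKeyFn: (key) => objectHash(key) },
);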

Then, inside the Dataloader batch function, you can grab the first key and construct the query based on its properties

const postCommentsByPostId = new Dataloader(async postIds => {
  // postIds argument contains objects, we only care about the id's
  const ids = postIds.map(({ postId }) => postId)
  
  // We could ( not always ) assume that order/limit will be the same across all `postIds` objects
  const { order, limit } = postIds[0]
     
  // Use sql variables instead, this is also prone to sql injection
  const { rows } = await knex.raw(`
        SELECT 
            ARRAY (
                SELECT pc.id FROM "postComments" pc 
                WHERE pc."postId" = p.id 
                ORDER BY pc.${order} -- Custom ordering
                LIMIT ${limit} -- Custom limiting
            ) AS "postCommentIds", 
        p.id AS "postId" 
        FROM "posts" p
        WHERE p.id IN(${ids.join(',')}) 
  `)
  
  ...
})

const resolvers = {
    Post: {
        async comments(post, args, context) {
            // This is why we assume all properties are the same:
            // they are called from the same resolver and collected in a single tick
            const postCommentIds = await context.postCommentsLoader.load({
              postId: post.id,
              limit: args.first,
              order: `${args.orderBy} ${args.sortOrder}`
            })
            
            const comments = await context.commentLoader.loadMany(postCommentIds)

            return comments
        }
    }
}

Because the Dataloader collects all calls to load that happen within a single tick, I'm assuming (not tested or confirmed) that all arguments passed to the .load method will have the same order/limit properties (this example does not work if they are not all the same, because we only read the first one).
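If that assumption is a concern, the batch function can instead group its keys by (order, limit) and issue one query per group; a sketch (my addition, under the same hypothetical names):

interface Key { postId: number; order: string; limit: number; }

// Hypothetical per-variant query: runs the ARRAY/LIMIT SQL from above for
// one group of keys that all share the same order/limit.
declare function runGroupQuery(keys: readonly Key[]): Promise<number[][]>;

// Each distinct (order, limit) pair gets its own SQL statement, so
// correctness no longer depends on uniform arguments within a tick.
async function batchPostCommentIds(keys: readonly Key[]): Promise<number[][]> {
  const groups = new Map<string, Key[]>();
  for (const key of keys) {
    const variant = `${key.order}|${key.limit}`;
    const bucket = groups.get(variant);
    if (bucket) bucket.push(key);
    else groups.set(variant, [key]);
  }

  const results = new Map<Key, number[]>();
  for (const group of groups.values()) {
    const idsPerKey = await runGroupQuery(group);
    group.forEach((key, i) => results.set(key, idsPerKey[i]));
  }
  return keys.map((key) => results.get(key) ?? []);
}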

There is one better solution that was mentioned in the beginning and it is to design your schemas in a way that you don't end up in a situation where you need to do this.


KDKHD commented Feb 21, 2021

If you are using a newer DB, like MySQL 8.0 or later, then you could try

SELECT * FROM 
(SELECT 
  ROW_NUMBER() OVER(PARTITION BY userId ORDER BY commentId ASC) 
    AS userCommentNum,
  comments.*
FROM comments) AS t1
WHERE t1.userCommentNum > OFFSET AND t1.userCommentNum <= OFFSET + LIMIT

https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-ver15
Pretty simple, and if you use a query builder like TypeORM you can declare a custom helper method to reuse this code.


acro5piano commented Mar 5, 2021

Brainstorming here.

How about UNION ALL with a per-id LIMIT, mixing the solutions from @onderonur and @klausXR?

I didn't test performance on huge data, but it seems to work.

Complete example here:

import 'dotenv/config'
import Dataloader from 'dataloader'
import knex from 'knex'

const db = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL,
  useNullAsDefault: false,
  debug: true,
})

interface PostsByUserIdParams {
  limit: number
  id: number
}

const postsByUserId = new Dataloader(
  (params: readonly PostsByUserIdParams[]) => {
    // Assuming all params limits are the same value
    const limit = params[0].limit

    // Get ids to load
    const ids = params.map((param) => param.id)

    // Here is the magic, this SQL loads each ids with each limit
    // Note that is POC code and could be SQL injection
    const query = ids
      .map((id) => `(select * from posts where user_id = ${id} limit ${limit})`)
      .join(' union all ')

    return db
      .raw(query)
      .then(({ rows }) =>
        ids.map((id) => rows.filter((row: { user_id: number }) => row.user_id === id)),
      )
  },
)

async function main() {
  await db.schema.dropTableIfExists('posts')
  await db.schema.createTable('posts', (t) => {
    t.increments('id')
    t.integer('user_id')
  })

  await db('posts').insert([
    { user_id: 1 }, { user_id: 1 },
    { user_id: 2 }
  ])

  const user1 = postsByUserId.load({ limit: 10, id: 1 })
  const user2 = postsByUserId.load({ limit: 10, id: 2 })
  const user3 = postsByUserId.load({ limit: 10, id: 3 })
  const user4 = postsByUserId.load({ limit: 10, id: 4 })
  const user5 = postsByUserId.load({ limit: 10, id: 5 })

  console.log('user1', await user1)
  console.log('user2', await user2)
}

main()

Logs:

(screenshot: knex debug output showing a single UNION ALL query)

@onderonur

After nearly a year, I can agree that using DataLoader with paginated queries can be a requirement and that UNION ALL can solve it. But making it flexible, with code that is easy to maintain, can be a little hard.
As I wrote in my previous comment, I would highly suggest solving this problem at the schema level, if you can, of course. Creating special fields for your pages/screens/clients is a much easier way than making everything generic and heavily optimized so clients can reach everything with giant queries.
Of course, this might not be possible every time. But it's a nice and easy way to solve this kind of stuff most of the time.

@acro5piano

@onderonur Thanks for sharing your opinion. Yes, it's pretty complicated and hard to implement.

@KDKHD

KDKHD commented Mar 9, 2021

(Replying to @acro5piano's UNION ALL example above.)

UNION works alright, but its performance is worse than using a PARTITION BY query.

P.S. UNIONs are not well supported by Sequelize; use a query builder instead.

@acro5piano

@KDKHD
Thanks for pointing that out. I don't have performance benchmarks yet, but this project actually uses UNION to paginate related tables:

https://lighthouse-php.com/

Yeah, ORM support is limited for this operation. Do you have good example code for PARTITION BY? I'm curious about it.


KDKHD commented Mar 16, 2021

(Replying to @acro5piano's reply above.)

Hi, this is a quick example. Imagine your database stores information about different sensors in your house: each row in your sensors table represents a sensor and has a location column. Let's say you want to get the first 5 sensors for each location (this is where you paginate). Your PARTITION BY query would look like this:

  const offset = 0
  const limit = 5

  const query = `SELECT * FROM (
    SELECT 
      ROW_NUMBER() 
        OVER (PARTITION BY sensorLocation 
        ORDER BY id ASC) - 1 AS partition_index, 
      t.* 
    FROM "sensors" t) AS numbered
  WHERE partition_index >= ${offset} AND partition_index < ${offset + limit}`

When you want to get the next 5, you just change:

  const offset = 5
  const limit = 5

in TypeORM you can add prototypes like this:

import { SelectQueryBuilder } from "typeorm";

interface paginateArgs {
  ORDER_BY: string;
  DIRECTION: "ASC" | "DESC";
  PARTITION_BY?: string;
}

declare module "typeorm" {
  interface SelectQueryBuilder<Entity> {
    paginate(ARGS: paginateArgs): SelectQueryBuilder<Entity>;
  }
}

// Paginate data: add a zero-based ROW_NUMBER() column aliased as "r".
SelectQueryBuilder.prototype.paginate = function <Entity>(
  this: SelectQueryBuilder<Entity>,
  ARGS: paginateArgs
): SelectQueryBuilder<Entity> {
  const { ORDER_BY, DIRECTION, PARTITION_BY } = ARGS;
  this.addSelect(
    `ROW_NUMBER() OVER (${
      PARTITION_BY ? `PARTITION BY ${PARTITION_BY}` : ""
    } ORDER BY ${ORDER_BY} ${DIRECTION}) - 1`,
    "r"
  );
  return this;
};

and use it like this (screenshot omitted):
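The screenshot didn't survive; a hypothetical usage consistent with the helper above (dataSource and the Comment entity are assumptions, not from the post) might be:

import { DataSource } from "typeorm";

declare const dataSource: DataSource; // assumption
declare const Comment: any; // hypothetical entity class

async function firstPagePerUser() {
  // Number the comments per user, newest first.
  const rows = await dataSource
    .getRepository(Comment)
    .createQueryBuilder("c")
    .paginate({ ORDER_BY: "c.createdAt", DIRECTION: "DESC", PARTITION_BY: "c.userId" })
    .getRawMany();

  // The helper only adds the ROW_NUMBER() column; here we filter client-side
  // for brevity (a real version would wrap the builder in a subquery and
  // filter on `r` in SQL, as in the raw query earlier).
  return rows.filter((row: any) => row.r >= 0 && row.r < 5);
}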

This is a pretty vague description; I might do a proper write-up about this, since it seems like a big issue.

@acro5piano

@KDKHD Thank you for the code. The solution to use PARTITION BY and over is much simpler!!


alexojegu commented Apr 25, 2021

For these cases I am using a lateral join:

SELECT e0.id, e1.* FROM gallery AS e0 LEFT JOIN LATERAL (
    SELECT e2.* FROM photo AS e2 WHERE e2.gallery_id = e0.id LIMIT ?
) AS e1 ON e0.id = e1.gallery_id WHERE e0.id IN (?)

I am using mikro-orm, which internally uses knex, but unfortunately knex doesn't support lateral joins, so I have to write these queries in raw SQL.

@YarekTyshchenko

For my data model the right thing seems to be to split all queries into two types: search queries that fetch IDs of entities, and entity-fetch queries that fetch the actual entities. This allows implementing search-type GraphQL fields with multiple predicates and paging support, while still getting the benefit of DataLoaders for further queries down the graph that may fetch the same items when going back up the graph.

Example using C# graphql-dotnet:

this.Field<PagedCollectionType<SensorGroup, SensorGroupGraphType>>()
    .Name("sensorGroups")
    .PaginationArguments()
    .ResolveAsync(async context =>
    {
        // Get a paginated list of SensorGroup IDs
        var sensorGroupIds = await sensorGroupService
            .GetPagedSensorGroupIdsAsync(
                context.PaginationOptions());

        // Fetch the sensor groups by IDs with data loader
        return sensorGroupDataLoader
            .LoadAsync(sensorGroupIds)
            .Then(sensorGroups => sensorGroups.AsPagedCollection(
                sensorGroupIds.AsOptions(),
                sensorGroupIds.Count));
    });

The search isn't cached by a DataLoader because it's very unlikely that you'd have two identical search fields in the same query. The DB query for the search can return data straight from the index.


cuongvuong-phoenix commented Feb 5, 2022

For data whose DDL looks like this:

CREATE TABLE author (
    id 		SERIAL PRIMARY KEY,
    name        VARCHAR NOT NULL,
    created_at 	TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE book (
    id 		SERIAL PRIMARY KEY,
    name        VARCHAR NOT NULL,
    created_at 	TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    author_id 	INTEGER NOT NULL REFERENCES author(id) ON DELETE CASCADE
);

I can come up with 3 solutions:

  1. Naively run a transaction and iterate over the ids, setting the LIMIT and OFFSET for each query, then UNION ALL the results. Remember that the N+1 problem arises because, without DataLoader batching the ids, we pay a separate round-trip (or connection) for each query, and that overhead is too much to deal with. With this solution we only need one round-trip to run a transaction containing multiple queries.

  2. Another solution could be using the ROW_NUMBER() window function and PARTITION like the following:

    SELECT *
    FROM (
        SELECT
            id, name, created_at, author_id,
            row_number() OVER (PARTITION BY author_id ORDER BY id) AS author_row_num
        FROM book
        WHERE author_id = ANY($1)
    ) t
    WHERE t.author_row_num <= $2 AND t.author_row_num > $3;

    Note: $2 is LIMIT + OFFSET and $3 is OFFSET

  3. The above solution seems better than the first one; however, the PARTITION approach numbers all the books of the selected authors before the final WHERE condition filters the result. If even one author_id has millions of books, that query will be very costly. Therefore, a better way is the LATERAL JOIN query below (see the DataLoader wiring sketched after this list):

    SELECT t.*
    FROM
        author a
        JOIN LATERAL (
            SELECT b.id, b.name, b.created_at, b.author_id
            FROM book b
            WHERE b.author_id = a.id
            ORDER BY b.id
            LIMIT $1
            OFFSET $2
        ) t
        ON TRUE
    WHERE a.id = ANY($3);
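A sketch of wiring option 3 into a DataLoader, assuming node-postgres (pg) and the tables above, with a fixed page for the whole batch to keep it short (object keys plus a cacheKeyFn would generalize it to per-author limit/offset):

import DataLoader from 'dataloader';
import { Pool } from 'pg';

// Connection settings come from the usual PG* env vars (assumption).
const pool = new Pool();

interface Book { id: number; name: string; created_at: Date; author_id: number; }

const LIMIT = 10;
const OFFSET = 0;

const booksByAuthorId = new DataLoader<number, Book[]>(async (authorIds) => {
  const result = await pool.query(
    `SELECT t.*
     FROM author a
     JOIN LATERAL (
         SELECT b.id, b.name, b.created_at, b.author_id
         FROM book b
         WHERE b.author_id = a.id
         ORDER BY b.id
         LIMIT $1 OFFSET $2
     ) t ON TRUE
     WHERE a.id = ANY($3)`,
    [LIMIT, OFFSET, authorIds as number[]],
  );
  const rows = result.rows as Book[];

  // Regroup the flat rows by author, preserving the batch's key order.
  const byAuthor = new Map<number, Book[]>();
  for (const row of rows) {
    const bucket = byAuthor.get(row.author_id);
    if (bucket) bucket.push(row);
    else byAuthor.set(row.author_id, [row]);
  }
  return authorIds.map((id) => byAuthor.get(id) ?? []);
});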


KDKHD commented Mar 5, 2022

I've been checking out postgraphile, which builds pretty compact queries.

This is my gql query:

{
  galleries {
    nodes {
      galleryItems(offset: 2, first: 3) {
        nodes {
          mediaId
        }
      }
    }
  }
}

This is the generated PSQL query (I don't know exactly how this query works or how efficient it is compared to the PARTITION BY query that I shared earlier):

with __local_0__ as (
  select to_json(
    (
      json_build_object(
        '__identifiers'::text,
        json_build_array(__local_1__."gallery_id"),
        '@galleryItems'::text,
        (
          with __local_2__ as (
            select to_json(
              (
                json_build_object(
                  '__identifiers'::text,
                  json_build_array(__local_3__."gallery_items_id"),
                  'mediaId'::text,
                  (__local_3__."media_id")
                )
              )
            ) as "@nodes"
            from (
              select __local_3__.*
              from "public"."gallery_items" as __local_3__
              where (
                __local_3__."gallery_id" = __local_1__."gallery_id"
              ) and (TRUE) and (TRUE)
              order by __local_3__."gallery_items_id" ASC
              limit 3 offset 2
            ) __local_3__
          ),
          __local_4__ as (
            select json_agg(
              to_json(__local_2__)
            ) as data
            from __local_2__
          )
          select json_build_object(
            'data'::text,
            coalesce(
              (
                select __local_4__.data
                from __local_4__
              ),
              '[]'::json
            )
          )
        )
      )
    )
  ) as "@nodes"
  from (
    select __local_1__.*
    from "public"."galleries" as __local_1__
    where (TRUE) and (TRUE)
    order by __local_1__."gallery_id" ASC
  ) __local_1__
),
__local_5__ as (
  select json_agg(
    to_json(__local_0__)
  ) as data
  from __local_0__
)
select coalesce(
  (
    select __local_5__.data
    from __local_5__
  ),
  '[]'::json
) as "data"

and this is the PSQL result:

[{"@nodes":{"__identifiers" : [1], "@galleryItems" : {"data" : [{"@nodes":{"__identifiers" : [3], "mediaId" : 3}}, {"@nodes":{"__identifiers" : [4], "mediaId" : 4}}, {"@nodes":{"__identifiers" : [5], "mediaId" : 5}}]}}}, {"@nodes":{"__identifiers" : [2], "@galleryItems" : {"data" : [{"@nodes":{"__identifiers" : [10], "mediaId" : 3}}, {"@nodes":{"__identifiers" : [11], "mediaId" : 4}}, {"@nodes":{"__identifiers" : [12], "mediaId" : 5}}]}}}]

Which, I guess, gets converted through some post-processing into the expected GraphQL response. I wish I had known about postgraphile while I was trying to build my own GraphQL + DataLoader + pagination implementation, which is why I am sharing this here. Also, this query could give some inspiration on how to use DataLoader + pagination better.


KDKHD commented Mar 5, 2022

(Quoting my ROW_NUMBER() example from earlier in the thread.)

Here is an example prisma implementation of the custom PARTITION BY helper function I mentioned above: https://github.com/KDKHD/GraphQL/blob/44185c080e3c54e5cdda9bdc2be92e24424ea3ac/utils/queryHelpers.ts#L30

saihaj mentioned this issue Mar 11, 2022