graphql: Updated mutation rewriting to fix OOM issue #5536

harshil-goel · 2020-05-28T15:28:52Z

Fixes GRAPHQL-451
Fixes #5400

I also wrote a small benchmark to see the profile difference.

On the current release branch, Benchmark fails for 100 elements in the innermost loop. (Starts to fail at 20 elements). The memory profile is for 19 case

Benchmark3LevelDeep5-12             1036           1077422 ns/op
Benchmark3LevelDeep19-12               1        41056766621 ns/op
signal: terminated

Showing nodes accounting for 11704.49MB, 95.88% of 12207.24MB total
Dropped 10 nodes (cum <= 61.04MB)
Showing top 10 nodes out of 31
      flat  flat%   sum%        cum   cum%
 2730.32MB 22.37% 22.37%  8434.67MB 69.10%  encoding/json.mapEncoder.encode
 1914.68MB 15.68% 38.05%  1914.68MB 15.68%  reflect.mapiterinit
 1527.52MB 12.51% 50.56%  1527.52MB 12.51%  reflect.copyVal
 1286.11MB 10.54% 61.10%  3952.80MB 32.38%  reflect.Value.MapKeys
 1182.70MB  9.69% 70.79%  9617.37MB 78.78%  encoding/json.Marshal
 1105.79MB  9.06% 79.85%  1866.95MB 15.29%  github.com/dgraph-io/dgraph/graphql/resolve.squashFragments
  976.04MB  8.00% 87.84%   976.04MB  8.00%  internal/reflectlite.Swapper
  347.59MB  2.85% 90.69%   347.59MB  2.85%  github.com/dgraph-io/dgraph/graphql/resolve.squashIntoList
  328.08MB  2.69% 93.38%   328.08MB  2.69%  github.com/dgraph-io/dgraph/graphql/resolve.squashIntoObject.func1
  305.67MB  2.50% 95.88%   305.67MB  2.50%  strings.(*Builder).grow

As you can see, in the PR, even 100 elements in the innermost loop work. The memory profile (for 19) is attached along.

[Decoder]: Using assembly version of decoder
Benchmark3LevelDeep5-12             4693            229221 ns/op
Benchmark3LevelDeep19-12            1591            834201 ns/op
Benchmark3LevelDeep100-12            286           3938028 ns/op
Benchmark3LevelDeep100-12            343           3321058 ns/op               
Benchmark3LevelDeep1000-12            36          32947170 ns/op               
Benchmark3LevelDeep10000-12            3         337876067 ns/op

howing nodes accounting for 297.55MB, 70.16% of 424.10MB total
Dropped 47 nodes (cum <= 2.12MB)
Showing top 10 nodes out of 48
      flat  flat%   sum%        cum   cum%
   58.01MB 13.68% 13.68%   270.05MB 63.68%  github.com/dgraph-io/dgraph/graphql/resolve.rewriteObject
   53.51MB 12.62% 26.30%    72.01MB 16.98%  github.com/dgraph-io/dgraph/graphql/resolve.addInverseLink
   37.51MB  8.84% 35.14%    44.01MB 10.38%  github.com/dgraph-io/dgraph/graphql/resolve.xidQuery
   28.50MB  6.72% 41.86%    86.01MB 20.28%  encoding/json.mapEncoder.encode
      27MB  6.37% 48.23%   142.01MB 33.49%  github.com/dgraph-io/dgraph/graphql/resolve.mutationsFromFragments
   25.50MB  6.01% 54.24%    25.50MB  6.01%  reflect.mapiterinit
   19.50MB  4.60% 58.84%    19.50MB  4.60%  github.com/dgraph-io/dgraph/graphql/schema.(*fieldDefinition).Type
      18MB  4.24% 63.09%       18MB  4.24%  github.com/dgraph-io/dgraph/graphql/resolve.newFragment
      15MB  3.54% 66.62%    94.52MB 22.29%  github.com/dgraph-io/dgraph/graphql/resolve.asXIDReference
      15MB  3.54% 70.16%    15.50MB  3.66%  fmt.Sprintf

This change is

Docs Preview:

pawanrawal

I get the gist of the change, we are doing 2 mutations instead of 1. The first one creates the nested deep mutation object if it doesn't exist and then we create the actual objects assuming that the deep nested object already exists. We should add some comments mentioning this.

Looks like a nice change! Also looks like earlier the number of mutations would now be at max M + N where M is the number of top level objects and N the number of deep objects whereas we were earlier doing a lot more.

Also, would be good to have some benchmarking tests which try to add random data with 100, 1000, 10000 entities being created for both the xid and the other case.

Reviewed 9 of 9 files at r1.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @abhimanyusinghgaur, @harshil-goel, and @MichaelJCompton)

graphql/resolve/add_mutation_test.yaml, line 30 at r1 (raw file):

-
  name: "Add deep mutation with variables"

This is fine but it's not the best place for doing load testing. We should also have a benchmark test which tests the time and allocations for adding say 100, 1000 authors and remove this after.

graphql/resolve/add_mutation_test.yaml, line 1169 at r1 (raw file):

  error:
    message: |-
      failed to rewrite mutation payload because duplicate XID found: S1

Was this incorrect before?

graphql/resolve/add_mutation_test.yaml, line 1755 at r1 (raw file):

      }
    }
  dgquery: |-

This might better be situated up there with dgmutation.

graphql/resolve/mutation.go, line 234 at r1 (raw file):

		if req.Query != "" && len(mutResp.GetJson()) != 0 {
			if err := json.Unmarshal(mutResp.GetJson(), &result); err != nil {

result would have the response of the final mutation, is that going to be OK.

MichaelJCompton

Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @abhimanyusinghgaur, @harshil-goel, and @pawanrawal)

graphql/resolve/add_mutation_test.yaml, line 30 at r1 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

This is fine but it's not the best place for doing load testing. We should also have a benchmark test which tests the time and allocations for adding say 100, 1000 authors and remove this after.

Yeah, I think in general we should have something like all the simple examples just in yaml, then the complex ones should be benchmarks elsewhere.

Also, it might be worth having some larger tests that mint this sort of example up dynamically and just test that there's the right number of queries, right number of mutations, etc.

graphql/resolve/add_mutation_test.yaml, line 100 at r1 (raw file):

  dgquerysec: |-
    query {
      PostSecret4 as PostSecret4(func: eq(PostSecret.title, "ps1")) @filter(type(PostSecret)) {

I'm a bit confused here. How does the first and second query thing work? I'f we've just queried for it, why do we need to go back again, wouldn't we have the answer?

graphql/resolve/add_mutation_test.yaml, line 125 at r1 (raw file):

      }
    }
  dgmutations:

So we run dgquery and dgmutations as a single upsert? So, I'm expecting 1 case in the dgmutations for every possible object creation in the mutation - that's every id reference, xid reference and everything that's definitely a new object?

The we rerun the same queries and use the result in a single upsert. Questions

why do we need the condition on the final upsert? Shouldn't the intial case ensure success?
why do we need the second set of queries? Don't we know the id's already from the first queries?

graphql/resolve/add_mutation_test.yaml, line 1169 at r1 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

Was this incorrect before?

Nice - it's removing a duplicate!

graphql/resolve/add_mutation_test.yaml, line 1367 at r1 (raw file):

          "uid":"_:City1"
        }
      cond: "@if(eq(len(District3), 1))"

I really like the way this has made the mutations simpler. But I don't understand why the condition is still needed. It looks to me like the first mutation is set up to ensure that it's there.

graphql/resolve/mutation.go, line 216 at r1 (raw file):

	result := make(map[string]interface{})
	req := &dgoapi.Request{}

is req used outside the loop? Looks like you could just move it inside with the same instantiation as before.

graphql/resolve/mutation.go, line 217 at r1 (raw file):

	req := &dgoapi.Request{
		Query:     dgraph.AsString(upsert.Query),
		CommitNow: true,

why was that still CommitNow: true ... doesn't look right

graphql/resolve/mutation.go, line 219 at r1 (raw file):

	for _, upsert := range upserts {
		if len(upsert.Mutations) == 0 {

when can this happen?

graphql/resolve/mutation.go, line 238 at r1 (raw file):

						schema.GQLWrapf(err, "Couldn't unmarshal response from Dgraph mutation")),
					resolverFailed
			}

I think I understand why you've got the two sets of query now. It's so you can use this cute loop pattern and just run two upserts without having any other logic.

I think that's kinda nice, but it does look to me like another spot that'd be worth thinking about - is it faster to rerun those queries, or to pull out the data from the first query and do it in code. Though, those queries should be super fast because they are always ID or @id lookups.

graphql/resolve/mutation.go, line 250 at r1 (raw file):

	}

	if mutResp != nil {

when can mutResp be nil?

graphql/resolve/mutation_rewriter.go, line 234 at r1 (raw file):

		return mrw.handleMultipleMutations(m)
	}

I think it's time to squash these things together. Can we replace the below by a generic function that handles both cases? What's below looks like one of the cases of the multiple.

What was the original reason for the split

graphql/resolve/mutation_rewriter.go, line 257 at r1 (raw file):

	}

	return []*UpsertMutation{upsert}, schema.GQLWrapf(err, "failed to rewrite mutation payload")

one reason I'm worried about the two cases is that this doesn't look the same as the other case ... why can't this have creations as well?

graphql/resolve/mutation_rewriter.go, line 1007 at r1 (raw file):

Quoted 5 lines of code…

				if xid != nil && !atTopLevel {
					frags = &mutationRes{firstPass: []*mutationFragment{newFragment(val)}}
				} else {
					frags = &mutationRes{secondPass: []*mutationFragment{newFragment(val)}}
				}

so everything in firstPass is the creation steps, everything in secondPass is the reference parts?

graphql/resolve/mutation_rewriter.go, line 1431 at r1 (raw file):

	}

	left = append(left, right...)

is this right? I thought there were reasons why we made sure we didn't write into the existing fragments?

abhimanyusinghgaur

Reviewable status: 6 of 9 files reviewed, 18 unresolved discussions (waiting on @harshil-goel and @pawanrawal)

graphql/resolve/add_mutation_test.yaml, line 1169 at r1 (raw file):

Previously, MichaelJCompton (Michael Compton) wrote…

Nice - it's removing a duplicate!

Yeah! Earlier if there was an error, squashFragments would squash it, and so it would result in multiple errors for the same thing. But, now its better.

graphql/resolve/add_mutation_test.yaml, line 1691 at r2 (raw file):

  gqlmutation: |
    mutation addAuthor($auth: AddLabInput!) {
      addLab(input: [$auth]) {

its a bit confusing, it should be:

mutation addLab($lab: AddLabInput!) {
  addLab(input: [$lab]) {

And same for the variable name in gqlvariables.

graphql/resolve/add_mutation_test.yaml, line 1738 at r2 (raw file):

                        {
                            "ComputerOwner.computers": {
                                "uid": "_:Computer4"

shouldn't it be:

"uid": "uid(Computer4)"

As Computer4 was already created?
Will it work using a blank node from first pass in second pass?

Also, at present we don't have e2e tests for XIDs which are 3-4 level deep.
So, we should add two kind of e2e test cases each with 3-4 levels:

there is no XID in top level object, but all the nested level objects use xid
There is XID at every level.
It can be done using the existing Teacher, Student schema in e2e.

graphql/resolve/mutation.go, line 234 at r1 (raw file):

Previously, pawanrawal (Pawan Rawal) wrote…

result would have the response of the final mutation, is that going to be OK.

Also, as result would contain only the last upsert response, so should we skip the json unmarshalling for rest of the upserts?

abhimanyusinghgaur

Reviewable status: 2 of 13 files reviewed, 18 unresolved discussions (waiting on @harshil-goel, @manishrjain, and @pawanrawal)

MichaelJCompton

Reviewable status: 2 of 13 files reviewed, 17 unresolved discussions (waiting on @harshil-goel, @manishrjain, and @pawanrawal)

harshil added 2 commits May 28, 2020 20:47

graphql: Updated mutation rewriting to fix OOM issue

705f74a

added a big test

2855adb

harshil-goel added the area/graphql Issues related to GraphQL support on Dgraph. label May 28, 2020

harshil-goel requested review from abhimanyusinghgaur and MichaelJCompton May 28, 2020 15:28

fixed e2e tests

058fcdf

pawanrawal reviewed Jun 1, 2020

View reviewed changes

MichaelJCompton suggested changes Jun 2, 2020

View reviewed changes

fixed nested uid issue

e4cf0b3

abhimanyusinghgaur reviewed Jun 3, 2020

View reviewed changes

harshil added 2 commits June 5, 2020 19:14

fixed rewrite issue

653f2e0

fixed commnets

eb77dca

harshil-goel marked this pull request as ready for review June 8, 2020 13:13

harshil-goel requested review from manishrjain and a team as code owners June 8, 2020 13:13

harshil added 15 commits June 8, 2020 19:02

removed print statements

1550a4f

Merge branch 'release/v20.03' into harshil-goel/graphql-oom

078aaea

fixed some tests

816aa73

added a benchmark

4b27416

removed print statement

a986ed0

fixed a bug in benchmark

e7fa2ff

added benchmarks

4a06fef

fixed some tests

705aba6

fixed all tests

9d1e9b7

added a failing test

bfffbbb

refactored code

b1540ca

removed print statements

2e53e78

refactored a little bit

6fb43db

minor refactor

b9aeb25

Merge branch 'release/v20.03' into harshil-goel/graphql-oom

4de8529

abhimanyusinghgaur approved these changes Jun 11, 2020

View reviewed changes

harshil added 2 commits June 12, 2020 14:48

added little bit more comments

63a0f95

added another test

8377217

MichaelJCompton approved these changes Jun 16, 2020

View reviewed changes

removed a error handling to expose panics

6858556

harshil-goel merged commit 5e89162 into release/v20.03 Jun 16, 2020

harshil-goel deleted the harshil-goel/graphql-oom branch June 16, 2020 11:15

harshil-goel added a commit that referenced this pull request Jun 17, 2020

graphql: Updated mutation rewriting to fix OOM issue (#5536)

f39a170

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

graphql: Updated mutation rewriting to fix OOM issue #5536

graphql: Updated mutation rewriting to fix OOM issue #5536

harshil-goel commented May 28, 2020 •

edited by dgraph-bot

Loading

pawanrawal left a comment

MichaelJCompton left a comment

abhimanyusinghgaur left a comment

abhimanyusinghgaur left a comment

MichaelJCompton left a comment

graphql: Updated mutation rewriting to fix OOM issue #5536

graphql: Updated mutation rewriting to fix OOM issue #5536

Conversation

harshil-goel commented May 28, 2020 • edited by dgraph-bot Loading

pawanrawal left a comment

Choose a reason for hiding this comment

MichaelJCompton left a comment

Choose a reason for hiding this comment

abhimanyusinghgaur left a comment

Choose a reason for hiding this comment

abhimanyusinghgaur left a comment

Choose a reason for hiding this comment

MichaelJCompton left a comment

Choose a reason for hiding this comment

harshil-goel commented May 28, 2020 •

edited by dgraph-bot

Loading