Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove _predicate_ from Dgraph. #3262

Merged
merged 11 commits into from
May 9, 2019
Merged

Conversation

martinmr
Copy link
Contributor

@martinmr martinmr commented Apr 5, 2019

From now on, expand queries will rely solely on the type of the node.
This change removes all mentions of the _predicate_ internal pred that
was previously being used to serve this type of queries.

There is also another small change to allow the value of a list variable
to be used as a list of predicates in expand queries.

Also, some of the expand tests in the query package were simplified so that the
output of the test queries is not as long.

Most of the changes due to removing or replacing tests. These changes
are due to one of the following reasons.

  1. Tests that were checking for the state of _predicate_ and are no
    longer valid.
  2. Tests that were doing checks that should be done inside of the query
    package. They have been removed if there was an existing query that was
    doing the same or a new test has been added if no similar test existed.
  3. Some of the expand tests inside the query package were removed
    because there are tests verifying the expand functionality using types.
    Since types are now the only way to make expand queries work, these
    tests became obsolete.

This change is Reviewable

From now on, expand queries will rely solely on the type of the node.
This change removes all mentions of the _predicate_ internal pred that
was previously being used to serve this type of queries.

There is also another small change to allow the value of a list variable
to be used as a list of predicates in expand queries.

Most of the changes due to removing or replacing tests. These changes
are due to one of the following reasons.

1. Tests that were checking for the state of _predicate_ and are no
longer valid.
2. Tests that were doing checks that should be done inside of the query
package. They have been removed if there was an existing query that was
doing the same or a new test has been added if no similar test existed.
3. Some of the expand tests inside the query package were removed
because there are tests verifying the expand functionality using types.
Since types are now the only way to make expand queries work, these
tests became obsolete.
@martinmr martinmr requested a review from a team April 5, 2019 22:30
dgraph/cmd/bulk/schema.go Outdated Show resolved Hide resolved
dgraph/cmd/zero/oracle.go Outdated Show resolved Hide resolved
@martinmr
Copy link
Contributor Author

martinmr commented Apr 8, 2019

Ran some benchmarks comparing the performance of the live and bulk loaders when _predicate_ is enabled/disabled.

For the bulkloader (using the 21 million dataset), I observed an improvement but it was not drastic. The number of edges went down from 120 million to 99 million. We do not observe a 50% reduction in the number of edges because there are nodes that contain more than an edge for a given predicate.

During the mapping phase, it seems that the cost of adding a new _predicate_ edge is exercised when the first edge is added and repeated edges are more of a no-op. The reduced number of edges speeds up the reduce phase but only by the same percent in the reduction of total edges.

In average, the version without _predicate_ took around 1m45s to complete the bulk load while the current version took around 2m. The reduction in edges is slightly bigger than the reduction in total time.

For the live loader I used the smaller 1 million dataset. I could not observe any performance differences between the two versions. My hypothesis is that the process is throttled by transactions (which are not involved when the bulk loader is used) so the reduction in edges does not result in a reduction in the total time.

Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 31 files reviewed, 2 unresolved discussions (waiting on @golangcibot)


dgraph/cmd/bulk/schema.go, line 40 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

File is not goimports-ed (from goimports)

Done.


dgraph/cmd/zero/oracle.go, line 336 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

File is not goimports-ed (from goimports)

Done.

@martinmr
Copy link
Contributor Author

martinmr commented Apr 9, 2019

Re did the bulk loader benchmark, this time without any indices in the schema. This time I observe the 50% reduction although the time savings are not 50% but are nonetheless noticeable (1m5s vs 42s).

@martinmr
Copy link
Contributor Author

martinmr commented Apr 9, 2019

I ran the live loader benchmarks. Without _predicate_, the running time (with the 21 million dataset) is 3m40s. With _predicate_ is 4m45s. I see a decrease in the running time but not by 50%, just like in the bulk loader.

Copy link
Contributor

@codexnull codexnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 27 of 30 files at r1, 4 of 4 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @golangcibot)

Copy link
Contributor

@codexnull codexnull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pretty big changeset, but I think I looked at everything. :lgtm:

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @golangcibot)

@martinmr martinmr requested a review from manishrjain April 11, 2019 17:56
Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wao! This PR is a dream come true.

This PR also has a real potential of breaking transactions and queries. So, before you merge this, please do these things:

  1. Run Jepsen bank tests overnight. That means, 20 times or more, each test running for half an hour.
  2. Try out as many expand all queries as you can, manually with the 21 million dataset. Do this irrespective of whether you think unit tests cover all the cases or not.

Make the outcome of this verification process part of your PR commit message (when you squash and merge).

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @golangcibot and @manishrjain)

@martinmr martinmr merged commit d7f7d5f into master May 9, 2019
@martinmr martinmr deleted the martinmr/remove-underscore-predicate branch May 9, 2019 19:15
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
From now on, expand queries will rely solely on the type of the node.
This change removes all mentions of the _predicate_ internal pred that
was previously being used to serve this type of queries.

There is also another small change to allow the value of a list variable
to be used as a list of predicates in expand queries.

Most of the changes due to removing or replacing tests. These changes
are due to one of the following reasons.

1. Tests that were checking for the state of _predicate_ and are no
longer valid.
2. Tests that were doing checks that should be done inside of the query
package. They have been removed if there was an existing query that was
doing the same or a new test has been added if no similar test existed.
3. Some of the expand tests inside the query package were removed
because there are tests verifying the expand functionality using types.
Since types are now the only way to make expand queries work, these
tests became obsolete.

Verified jepsen tests work and that the 21 million dataset works as expected
by manually adding types to certain nodes. Types and type definitons
will be added to the entire dataset later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants