Cost of each mutation grows as more mutations are in a transaction #3046

Closed
mooncake4132 opened this issue Feb 20, 2019 · 3 comments
Labels
  • area/performance: Performance related issues.
  • kind/enhancement: Something could be better.
  • priority/P1: Serious issue that requires eventual attention (can wait a bit).
  • status/accepted: We accept to investigate/work on it.
  • status/needs-attention: This issue needs more eyes on it; more investigation might be required before accepting/rejecting it.

Comments

@mooncake4132

I originally asked this on slack, but it might be more useful to track it as an issue.

Every few days our application will need to insert up to 3 million predicates (this number may grow) into the database. To assess Dgraph's performance, I wrote the little Python script below to benchmark the time it takes to insert 1000, 10000, 30000, 50000, and 100000 predicates. Results are as follows:

Updated schema in 1.824007272720337 seconds.
Mutating 1000 N-Quads took 0.0899970531463623 seconds.
Mutating 10000 N-Quads took 1.6726512908935547 seconds.
Mutating 30000 N-Quads took 11.846931219100952 seconds.
Mutating 50000 N-Quads took 27.030992031097412 seconds.
Mutating 100000 N-Quads took 111.02126455307007 seconds.

The growth in time is a bit worrying. Why does inserting 100 thousand predicates take roughly 70× as long as inserting 10 thousand?
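For reference, the measured growth is close to quadratic: going from 10000 to 100000 N-Quads (10×) costs about 111.02 / 1.67 ≈ 66× more time, and going from 30000 to 100000 (3.3×) costs about 111.02 / 11.85 ≈ 9.4×, against 3.3² ≈ 11 for exactly quadratic scaling.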

Here's the script:

#!/usr/bin/env python3
import time

import pydgraph


# Connect to the alpha and start from an empty database.
client_stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(client_stub)
client.alter(pydgraph.Operation(drop_all=True))

schema = """
test: string @index(fulltext) @lang .
"""
start_time = time.time()
client.alter(pydgraph.Operation(schema=schema))
print('Updated schema in {} seconds.'.format(time.time() - start_time))

for n in (1_000, 10_000, 30_000, 50_000, 100_000):
    # Build n N-Quads and commit them all in a single transaction.
    rdf = '\n'.join('<_:node_{}> <test> "test" .'.format(i) for i in range(n))
    transaction = client.txn()
    start_time = time.time()
    transaction.mutate(set_nquads=rdf, commit_now=True)
    print('Mutating {} N-Quads took {} seconds.'.format(n, time.time() - start_time))

Initially, I thought it was because of the fulltext index, so I also tried without @index(fulltext). Here are the results:

Updated schema in 0.004003763198852539 seconds.
Mutating 1000 N-Quads took 0.07899928092956543 seconds.
Mutating 10000 N-Quads took 1.236546277999878 seconds.
Mutating 30000 N-Quads took 7.040283203125 seconds.
Mutating 50000 N-Quads took 16.69643545150757 seconds.
Mutating 100000 N-Quads took 59.379029989242554 seconds.

It's slightly better, but the time growth is still worrying.

Any guidance is appreciated.

Configurations:

  • Running in Docker on Windows.
  • One Zero and one Alpha.

    Dgraph version   : v1.0.11
    Commit SHA-1     : b2a09c5
    Commit timestamp : 2018-12-17 09:50:56 -0800
    Branch           : HEAD
    Go version       : go1.11.1
@manishrjain manishrjain added the investigate Requires further investigation label Feb 20, 2019
@codexnull
Contributor

Thanks for the report and for providing the test script. We confirmed that transaction time does grow more than linearly with transaction size, and we will dig deeper to find improvements.

In the meantime, we suggest clients use transactions of around 1,000 N-Quads each and rely on concurrency instead to increase throughput.
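
For illustration, here is a minimal sketch of that batched, concurrent approach, adapted from the script above. It assumes the pydgraph client can be shared across worker threads (the underlying gRPC channel is thread-safe), and BATCH_SIZE and WORKERS are hypothetical knobs to tune rather than measured recommendations:

#!/usr/bin/env python3
from concurrent.futures import ThreadPoolExecutor

import pydgraph

BATCH_SIZE = 1000  # transaction size suggested above
WORKERS = 8        # hypothetical concurrency level; tune for your setup

client_stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(client_stub)

def mutate_batch(rdf):
    # Each batch is its own transaction, committed immediately.
    client.txn().mutate(set_nquads=rdf, commit_now=True)

def batches(n):
    # Split n N-Quads into chunks of BATCH_SIZE.
    for start in range(0, n, BATCH_SIZE):
        yield '\n'.join('<_:node_{}> <test> "test" .'.format(i)
                        for i in range(start, min(start + BATCH_SIZE, n)))

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(mutate_batch, batches(100_000)))

Note that with commit_now=True each batch commits independently, so a failure in one batch does not roll back the others; an application that needs all-or-nothing semantics would have to handle retries itself.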

@manishrjain manishrjain removed the investigate Requires further investigation label Feb 21, 2019
@mooncake4132
Author

Thanks for confirming. We can definitely split the mutations into different transactions.

I'll let you decide if you want to close this issue or leave it open for tracking.

@campoy campoy added area/performance Performance related issues. and removed optimization labels May 31, 2019
@campoy campoy added status/accepted We accept to investigate/work on it. status/needs-attention This issue needs more eyes on it, more investigation might be required before accepting/rejecting it priority/P2 Somehow important but would not block a release. kind/enhancement Something could be better. labels Sep 13, 2019
@lgalatin lgalatin added priority/P1 Serious issue that requires eventual attention (can wait a bit) and removed priority/P2 Somehow important but would not block a release. labels Apr 6, 2020
@minhaj-shakeel
Contributor

GitHub issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.
