Inconsistent bulk loader failures #5361
Comments
Hey @dzygmundfelt, thanks for reporting this issue. How much data were you trying to insert (data size and RDF count)? It would also be very helpful if you could share a memory profile with us when usage is at its peak.
@ashish-goswami The data is 37GB in size, with about 425 million N-Quads. I'll see about setting up some memory profiling and rerunning.
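As a reference for setting that up: assuming the bulk loader exposes Go's standard net/http/pprof handlers on the address given by --http (localhost:8000 in the command in this issue), a minimal sketch for capturing a heap profile might look like the following. The endpoint and output filename are illustrative assumptions, not something this issue confirms.

```go
// heapdump.go: download a heap profile from a process that serves the
// standard net/http/pprof handlers. The URL below assumes the bulk loader's
// --http=localhost:8000 address exposes /debug/pprof; adjust if it does not.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Fetch the binary (gzipped protobuf) heap profile, which can later be
	// inspected with: go tool pprof heap.pb.gz
	resp, err := http.Get("http://localhost:8000/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```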
I've seen similar behavior running the bulk loader in v1.2.2 and v20.03.1. However, I'm able to run the reduce phase to completion without the process running out of memory using v2.0.0-rc1. The memory stats below were captured from a bulk process whose reduce phase eventually crashed due to OOM.
Based on the information given (the issue is not present in v2.0.0-rc1 but is present in v20.03.0), I took a log of the changes in the bulk loader between those two releases. Here they are:
Most of these commits are minor changes and I don't expect them to affect memory usage much. I think commit f7d0371 might be the cause. I'll look at it to see if I can spot any changes that might have caused the memory increase. @balajijinnah, can you look at the commit as well? You probably have better context on this issue.
Hey @dzygmundfelt, could you provide the number of predicates and their counts?
2 of the 4 items in the heap profile below can be optimized
This is fixed via #5537. It's part of v20.03.3 and v1.2.5.
@danielmai I have read all the source code of the bulk loader in both v1.1.1 and v20.03.1. I agree with @martinmr that commit f7d0371 caused the memory increase. I have tested v1.1.1, v1.2.1, v1.2.2, v20.03.1, and master: since v1.2.2, the bulk loader consumes much more memory during the reduce stage and takes more time to load the dataset.
The toList function does not return an int anymore, so the bug is no longer relevant, but thanks for pointing it out. I'll let @balajijinnah answer questions about the commit since, as its author, he has more context.
Hey @xiangzhao632, yep, that commit will increase memory usage. The main reason we brought in that change was to bulk load large datasets; the heap-based method does not work well with big datasets. Regarding performance, we're tossing around new ideas to improve it (e.g. parallel sorting). I will update you once I land there.
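To make the trade-off concrete, here is a minimal, self-contained sketch (not Dgraph's actual code) of the two strategies being discussed: a heap-based k-way merge that holds only one entry per input stream in memory, versus buffering a whole batch and sorting it, which is easier to parallelize but scales memory with the batch size.

```go
// Sketch of heap-based merging vs. sort-based batching over pre-sorted
// map-output streams. Names and types are illustrative only.
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

type entry struct {
	key    string
	source int // which map-output stream the entry came from
}

type minHeap []entry

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].key < h[j].key }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// mergeWithHeap streams sorted inputs and emits keys in order while holding
// only len(streams) entries in memory at any time.
func mergeWithHeap(streams [][]string) []string {
	h := &minHeap{}
	pos := make([]int, len(streams))
	for i, s := range streams {
		if len(s) > 0 {
			heap.Push(h, entry{key: s[0], source: i})
			pos[i] = 1
		}
	}
	var out []string
	for h.Len() > 0 {
		e := heap.Pop(h).(entry)
		out = append(out, e.key)
		if pos[e.source] < len(streams[e.source]) {
			heap.Push(h, entry{key: streams[e.source][pos[e.source]], source: e.source})
			pos[e.source]++
		}
	}
	return out
}

// mergeWithSort buffers every key and sorts the whole batch: simpler and
// amenable to parallel sorting, but memory grows with the batch size.
func mergeWithSort(streams [][]string) []string {
	var all []string
	for _, s := range streams {
		all = append(all, s...)
	}
	sort.Strings(all)
	return all
}

func main() {
	streams := [][]string{{"a", "d"}, {"b", "c"}}
	fmt.Println(mergeWithHeap(streams)) // [a b c d]
	fmt.Println(mergeWithSort(streams)) // [a b c d]
}
```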
Thank you @balajijinnah, I want to share some points:
At the reduce stage, it costs too much memory. We have 390GB of RDF files, and our 256GB-RAM machine crashed due to OOM in the reduce stage even when …
GitHub issues have been deprecated.
What version of Dgraph are you using?
v1.1.1, v1.2.2, v20.03.0
Have you tried reproducing the issue with the latest release?
No. Latest is currently v20.03.1.
What is the hardware spec (RAM, OS)?
Two different machines running Ubuntu 16.04: one with 4 CPUs/30GB RAM, the other with 8 CPUs/64GB RAM.
Steps to reproduce the issue (command/config used to run Dgraph).
dgraph bulk -f {directory with rdf files} -s {schema file} --map_shards=2 --reduce_shards=1 --http=localhost:8000 --zero=localhost:5080 --format=rdf
Expected behaviour and actual result.
On both aforementioned machines, I tried running v1.2.2 and v20.03.0 with the same result: after successfully completing the MAP phase, the REDUCE phase failed at an edge count of ~98.5M (note that this edge count was consistent across the failures on both versions and both machines). Over a short period of time, the bulk loader would ramp up its memory usage to the entirety of the machine's RAM, then freeze and crash.
When I downgraded the Dgraph version to v1.1.1, the MAP and REDUCE phases completed successfully, and I was able to run a new Dgraph cluster on top of the resulting p directory.