HDFS-15937. Reduce memory used during datanode layout upgrade (#2838)
sodonnel merged 3 commits into apache:trunk
Conversation
Hexiaoqiao left a comment:
Thanks @sodonnel for your work. It makes sense to me. I have left some nit comments, please take a look.
...fs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
```java
compare(a.src, b.src).
compare(a.dst, b.dst).
compare(asrc.getName(), bsrc.getName()).
compare(asrc, bsrc).
```
Is the comparator duplicated between compare(asrc.getName(), bsrc.getName()) and compare(asrc, bsrc)?
For UnixFileSystem, the implementation is as follows. Not sure if it is the same for other FileSystem instances.

```java
public int compare(File f1, File f2) {
    return f1.getPath().compareTo(f2.getPath());
}
```
This method already had the two compares, but your comment prompted me to look at it more closely.
What it seems to be doing is sorting by the block filename, then by the src path, then by the dst path.
E.g., if we had this as the src list:

```
/a/blk1
/b/blk2
/c/blk1
```

The natural sort order by path is as above. However, the method would return:

```
/a/blk1
/c/blk1
/b/blk2
```

This puts duplicate blocks beside each other in the list, even though they are in different paths. Later in the method that ordering is used to detect duplicate blocks and handle them.
Therefore I think we need the chained compares as they are. However, rather than a.src.getName() I could use a.blockFile(), which should be slightly more efficient.
I will push another commit with that change.
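The chained ordering described above can be sketched as a plain Java comparator. This is an illustrative stand-in, not the actual DataStorage code: the real class uses Guava's ComparisonChain, and the `LinkArgs` and `blockFile()` names here are simplified assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BlockSort {
    // Simplified stand-in for the PR's LinkArgs: src/dst paths for one hardlink.
    public static class LinkArgs {
        public final String src;
        public final String dst;
        public LinkArgs(String src, String dst) { this.src = src; this.dst = dst; }
        // Block filename, e.g. "blk1" for "/a/blk1".
        public String blockFile() {
            return src.substring(src.lastIndexOf('/') + 1);
        }
    }

    // Sort by block filename first, then src path, then dst path.
    // Sorting by filename first places duplicate blocks from different
    // subdirs next to each other, so a later pass can detect them.
    public static final Comparator<LinkArgs> BY_BLOCK_THEN_PATH = (a, b) -> {
        int c = a.blockFile().compareTo(b.blockFile());
        if (c != 0) return c;
        c = a.src.compareTo(b.src);
        if (c != 0) return c;
        return a.dst.compareTo(b.dst);
    };

    public static void main(String[] args) {
        List<LinkArgs> list = new ArrayList<>(Arrays.asList(
            new LinkArgs("/a/blk1", "/d/blk1"),
            new LinkArgs("/b/blk2", "/d/blk2"),
            new LinkArgs("/c/blk1", "/d/blk1")));
        list.sort(BY_BLOCK_THEN_PATH);
        for (LinkArgs l : list) {
            System.out.println(l.src);
        }
        // Prints /a/blk1, /c/blk1, /b/blk2 — the two blk1 entries are adjacent.
    }
}
```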
...fs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java
Thanks @sodonnel and @jojochuang for your detailed comments. It makes sense to me. +1 from my side.
Thanks @Hexiaoqiao! @jojochuang, have you got any more comments, or do you want to take another look before I commit this?
…buted by Stephen O'Donnell (apache#2838) (cherry picked from commit 4c567fc) Change-Id: Ibc35f055c41a454ee19c17878bd2b3ce54510209
When the datanode block layout is upgraded from -56 (256x256) to -57 (32x32), we have found the datanode uses a lot more memory than usual.
For each volume, the blocks are scanned and a list is created holding a series of LinkArgs objects. Each object contains a File object for the block source and destination. The File object stores the full path as a string, e.g.:

```
/data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta
/data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825
```

This string is repeated for every block and meta file on the DN, and much of it is identical each time, leading to a large amount of memory being used.
If we change what LinkArgs stores, and ensure we reuse the same File object for repeated src and dest paths, we can save most of the memory without reworking the logic of the code.
The current logic works through the source paths recursively, so the src path object can easily be reused.
For the destination path, there are only 32x32 (1024) distinct paths, so we can simply cache them in a HashMap and look up the reusable object each time.
I tested locally by generating 100k block files and attempting the layout upgrade. A heap dump showed the 100k blocks using about 140MB of heap, which is close to 1.5GB per 1M blocks.
After the change outlined above, the same 100k blocks used about 20MB of heap, so about 200MB per million blocks.
A general DN sizing recommendation is 1GB of heap per 1M blocks, so the upgrade should be able to happen within the pre-upgrade heap.