HADOOP-18596. Distcp -update to use modification time while checking for file skip. #5308
Changes from 5 commits
5d5228d
ee9a856
d23f13b
f094248
8c427bd
4ff7f36
d64e6b6
0f63b45
58d8f84
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -631,14 +631,39 @@ hadoop distcp -update -numListstatusThreads 20 \ | |
| Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation | ||
| on a large directory tree (the limit is 40 threads). | ||
|
|
||
| When `DistCp -update` is used with object stores, | ||
| generally only the modification time and length of the individual files are compared, | ||
| not any checksums. The fact that most object stores do have valid timestamps | ||
| for directories is irrelevant; only the file timestamps are compared. | ||
| However, it is important to have the clock of the client computers close | ||
| to that of the infrastructure, so that timestamps are consistent between | ||
| the client/HDFS cluster and that of the object store. Otherwise, changed files may be | ||
| missed/copied too often. | ||
| When `DistCp -update` is used with object stores, generally only the | ||
| modification time and length of the individual files are compared, not any | ||
| checksums if the checksum algorithm between the two stores is different. | ||
|
|
||
| * The `distcp -update` between two object stores with different checksum | ||
| algorithms compares the modification times of source and target files along | ||
| with the file size to determine whether to skip the file copy. The behavior | ||
| is controlled by the property `distcp.update.modification.time`, which is | ||
| set to true by default. If the source file is more recently modified than | ||
| the target file, it is assumed that the content has changed, and the file | ||
| should be updated. | ||
| We need to ensure that there is no clock skew between the machines. | ||
| The fact that most object stores do have valid timestamps for directories | ||
| is irrelevant; only the file timestamps are compared. However, it is | ||
| important to have the clock of the client computers close to that of the | ||
| infrastructure, so that timestamps are consistent between the client/HDFS | ||
| cluster and that of the object store. Otherwise, changed files may be | ||
| missed/copied too often. | ||
|
|
||
| * `distcp.update.modification.time` can be used alongside the checksum check | ||
| in stores with the same checksum algorithm as well. If set to true, we check | ||
| both modification time and checksum between the files, but if this property | ||
|
||
| is set to false we only compare the checksum between the files to determine | ||
| if we should skip the copy or not. | ||
|
|
||
| To turn off the modification time check, set this in your core-site.xml: | ||
|
|
||
| ```xml | ||
| <property> | ||
| <name>distcp.update.modification.time</name> | ||
| <value>false</value> | ||
| </property> | ||
| ``` | ||
|
|
||
| **Notes** | ||
|
|
||
|
|
||
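The skip rules described in the two bullets of the diff can be sketched as a small decision function. This is only a reading of the documentation text, not the DistCp implementation: the class `UpdateSkipSketch`, the `FileInfo` holder, and the `canSkip` method are hypothetical names, checksums are modelled as plain strings, and the branch for "different algorithms with the property set to false" is an assumption, since the text does not say what happens in that case.

```java
// Hypothetical sketch of the -update skip decision as the doc text describes it.
// None of these names come from the Hadoop code base.
final class UpdateSkipSketch {

    /** Minimal stand-in for the file attributes the doc text mentions. */
    static final class FileInfo {
        final long length;
        final long modificationTime;     // milliseconds since the epoch
        final String checksumAlgorithm;  // e.g. "MD5", "CRC32C" (illustrative)
        final String checksum;

        FileInfo(long length, long modificationTime,
                 String checksumAlgorithm, String checksum) {
            this.length = length;
            this.modificationTime = modificationTime;
            this.checksumAlgorithm = checksumAlgorithm;
            this.checksum = checksum;
        }
    }

    /**
     * Returns true if the copy can be skipped.
     * useModTime models the property distcp.update.modification.time.
     */
    static boolean canSkip(FileInfo source, FileInfo target, boolean useModTime) {
        // Files of different length always need copying.
        if (source.length != target.length) {
            return false;
        }
        boolean sourceNewer = source.modificationTime > target.modificationTime;
        boolean comparable =
            source.checksumAlgorithm.equals(target.checksumAlgorithm);

        if (!comparable) {
            // Different checksum algorithms: rely on modification time.
            // ASSUMPTION: with the property off, only the length is compared.
            return !useModTime || !sourceNewer;
        }
        boolean checksumsMatch = source.checksum.equals(target.checksum);
        // Same algorithm: checksum only, or checksum plus modification time.
        return useModTime ? (checksumsMatch && !sourceNewer) : checksumsMatch;
    }

    public static void main(String[] args) {
        FileInfo source = new FileInfo(1024L, 2_000L, "MD5", "abc");
        FileInfo target = new FileInfo(1024L, 1_000L, "CRC32C", "xyz");
        // Source is newer and checksums are not comparable.
        System.out.println("skip with mod time check: " + canSkip(source, target, true));
        System.out.println("skip without mod time check: " + canSkip(source, target, false));
    }
}
```

The sketch also shows why clock skew matters: the `sourceNewer` comparison is only meaningful if both timestamps come from clocks that agree closely.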