Conversation

@jianghuazhu
Contributor

What changes were proposed in this pull request?

When executing the ldb scan command against a large database, a single very large output file is generated, which is hard to work with. This PR adds a new option that controls the maximum number of records written to each output file.
Details:
HDDS-10568

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10568

How was this patch tested?

@jianghuazhu
Contributor Author

The new command can be combined with --pre-file-records, using:
./bin/ozone debug ldb --db=/xxxx/xxxx/xxxx/om.db scan --column_family=fileTable --out=/xxxx/xxxx/xxxx/om/data/tmp_fileTable --pre-file-records=1000000

Effective file result set: (screenshot of the generated output files)

Contributor

@xichen01 xichen01 left a comment

Thanks for working on this; a few comments to handle.

return displayTable(iterator, dbColumnFamilyDef, out, schemaV3);
while (iterator.get().isValid()) {
try (PrintWriter out = new PrintWriter(new BufferedWriter(
new PrintWriter(fileName + fileSuffix, UTF_8.name())))) {
Contributor

If preFileRecords is not specified, we'd better make the filename the same as the previous filename (without fileSuffix)

batch = new ArrayList<>(batchSize);
sequenceId++;
}
if ((preFileRecords > -1) && (count >= preFileRecords)) {
Contributor

It seems ldb will generate an unlimited number of empty files if preFileRecords is zero.
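One way to avoid that corner case is to treat only positive values as enabling per-file splitting. The following is an illustrative sketch, not the PR's actual code; the helper name is hypothetical:

```java
class SplitGuard {
  // Only a positive value enables per-file splitting; zero or negative
  // values mean "no per-file limit", so no empty files are ever created.
  static boolean splittingEnabled(long preFileRecords) {
    return preFileRecords > 0;
  }
}
```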

@CommandLine.Option(names = {"--pre-file-records"},
description = "The number of print records per file.",
defaultValue = "-1")
private long preFileRecords;
Contributor

Nit: suggest --max-records-per-file

Contributor Author

Thanks for your comment and review, @xichen01 .
I will update soon.

Contributor

Please also rename preFileRecords to recordsPerFile.

(pre means "before")

@jianghuazhu
Contributor Author

Can you help review this PR again, @xichen01 ?
Thanks.

@jianghuazhu
Contributor Author

Can you help review this PR, @kerneltime @errose28?
Thanks.

}
fileSuffix++;
}
} else {
Contributor

Perhaps we can simplify this if... else
Like:

//...
String fileNameXXX = preFileRecords > 0 ? fileName + fileSuffix++ : fileName;
//...
new PrintWriter(fileNameXXX, UTF_8.name())

Contributor

@errose28 errose28 left a comment

Thanks for the improvement @jianghuazhu. I think the idea is solid, since just using split on a stdout stream may produce individual files that are not valid JSON. Let's add some tests to TestLDBCli to make sure all the corner cases around the various flag combinations are covered.

private int threadCount;

@CommandLine.Option(names = {"--max-records-per-file"},
description = "The number of print records per file.",
Contributor

Suggested change
description = "The number of print records per file.",
description = "The number of records to print per file.",

Comment on lines 312 to 316
if ((preFileRecords > 0) && (count >= preFileRecords)) {
break;
}
Contributor

What's the expected behavior when this new --max-records-per-file flag is used without --out? Right now it looks like stdout is considered "one file", so this flag overrides the --length option:

# The DB here has many more than 3 entries
$ ./ozone debug ldb --db=om.db scan --column_family=fileTable -l3 --max-records-per-file=2 | jq '.[].keyName' | wc -l 
2

$ ./ozone debug ldb --db=om.db scan --column_family=fileTable -l2 --max-records-per-file=3 | jq '.[].keyName' | wc -l
2

Maybe we should disallow --max-records-per-file without --out.
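If that restriction is adopted, a minimal validation sketch could look like the following. This is a hypothetical helper, not code from the PR; the parameter names are illustrative:

```java
class ScanOptionCheck {
  // Hypothetical check: reject --max-records-per-file when no --out file
  // is given, since stdout cannot be split into multiple files.
  static void validate(String out, long maxRecordsPerFile) {
    if (maxRecordsPerFile > 0 && out == null) {
      throw new IllegalArgumentException(
          "--max-records-per-file requires --out to be specified");
    }
  }
}
```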

Contributor

-l is also broken with this new option, and I got a bit of a surprise trying to test this 😄 I would have expected 5 files here, not 57 thousand.

$ ./ozone debug ldb --db=om.db scan --column_family=fileTable -l10 --max-records-per-file=2 --out=foo
^C
$ ls -l | grep foo | wc -l                                                                           
57343

Contributor Author

Thank you for your comment and review.
I will update soon.

Contributor Author

-l is also broken with this new option, and I got a bit of a surprise trying to test this 😄 I would have expected 5 files here, not 57 thousand.

$ ./ozone debug ldb --db=om.db scan --column_family=fileTable -l10 --max-records-per-file=2 --out=foo
^C
$ ls -l | grep foo | wc -l                                                                           
57343

When --out is not set, all records are output to stdout.
When --max-records-per-file and -l are both set, --max-records-per-file takes precedence.

@adoroszlai
Contributor

/pending

@github-actions github-actions bot left a comment

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

Please note that the PR will be closed after 21 days of inactivity from now. (But can be re-opened anytime later...)

/pending

@jianghuazhu
Contributor Author

Sorry, I was busy with other work for a while.
I will get back to this issue soon.

@jianghuazhu
Contributor Author

The unit tests run fine on my local machine (screenshot of the test results).

Can you help me look at this PR again, @errose28 @xichen01 .
Thanks.

@jianghuazhu
Contributor Author

jianghuazhu commented Apr 29, 2024

https://github.com/apache/ozone/actions/runs/8849501009/job/24349282946?pr=6420
In CI, it reports that file-related operations such as mkdirs() cannot be performed.
Could this be related to our testing environment?
@errose28 @kerneltime @adoroszlai, do you know the reason?

@adoroszlai
Contributor

In CI, it reports that file-related operations such as mkdirs() cannot be performed.

M B RV: Exceptional return value of java.io.File.mkdirs() ignored in org.apache.hadoop.ozone.debug.DBScanner.displayTable(ManagedRocksIterator, DBColumnFamilyDefinition, boolean)  At DBScanner.java:[line 232]

https://spotbugs.readthedocs.io/en/latest/bugDescriptions.html#rv-method-ignores-exceptional-return-value-rv-return-value-ignored-bad-practice

// If there are no parent directories, create them
File file = new File(fileName);
File parentFile = file.getParentFile();
if (!parentFile.exists()) {
  parentFile.mkdirs();
}

File.mkdirs() returns false if it could not create the directories. This return value needs to be handled (e.g. exit with error message).
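A sketch of handling that return value could look like this. The helper name is illustrative, not from the PR:

```java
import java.io.File;
import java.io.IOException;

class ParentDirs {
  // mkdirs() returns false when it could not create the directories,
  // so check the result instead of ignoring it.
  static void ensureParentDirs(String fileName) throws IOException {
    File parentFile = new File(fileName).getParentFile();
    if (parentFile != null && !parentFile.exists() && !parentFile.mkdirs()) {
      throw new IOException("Failed to create directory: " + parentFile);
    }
  }
}
```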

M D NP: Possible null pointer dereference in org.apache.hadoop.ozone.debug.TestLDBCli.testScanWithRecordsPerFile() due to return value of called method  Dereferenced at TestLDBCli.java:[line 286]

https://spotbugs.readthedocs.io/en/latest/bugDescriptions.html#np-possible-null-pointer-dereference-due-to-return-value-of-called-method-np-null-on-some-path-from-return-value

File tmpDir1 = new File(scanDir1);
for (File tmpFile : tmpDir1.listFiles()) {

File.listFiles() returns null if the path is not a directory. Try checking whether tmpDir1 is a directory first. The same applies to tmpDir2 below.
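The null check can be sketched as below; this is an illustrative helper rather than the test's actual code:

```java
import java.io.File;

class DirCount {
  // listFiles() returns null when the path is not a directory (or on an
  // I/O error), so the result must be null-checked before iterating.
  static int countEntries(File dir) {
    File[] files = dir.listFiles();
    if (files == null) {
      throw new IllegalStateException(dir + " is not a directory");
    }
    return files.length;
  }
}
```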

@jianghuazhu
Contributor Author

ci: https://github.com/jianghuazhu/ozone/actions/runs/8876637070
I've updated it.
Can you help me look at this PR again, @errose28 @xichen01 @adoroszlai ?
Thanks.

@adoroszlai
Contributor

/ready

@github-actions github-actions bot dismissed their stale review April 29, 2024 12:16

Blocking review request is removed.

@github-actions github-actions bot removed the pending label Apr 29, 2024

private boolean withinLimit(long i) {
return limit == -1L || i < limit;
return recordsPerFile > 0 || limit == -1L || i < limit;
Contributor

If recordsPerFile > 0 is true, the subsequent conditions are short-circuited, including i < limit, so the limit is effectively ignored. This is not the expected behavior.

Contributor Author

Thanks @xichen01 for the comment and review.
When recordsPerFile>0, it means that --max-records-per-file has taken effect, and --limit should be ignored at this time.

Contributor

The --limit option limits the total number of records, while --max-records-per-file limits the maximum number of records in each individual file.

Such as:
ozone debug ldb ... --limit 10 --max-records-per-file 1 --out result.txt

This command should generate 10 files, like result.txt0, result.txt1, ..., each containing 1 record.
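The intended interaction can be modeled as follows. This is an illustrative sketch of the semantics described above, not the PR's code: the global limit caps the total record count, while the per-file cap only decides when to roll over to the next file suffix.

```java
import java.util.ArrayList;
import java.util.List;

class SplitModel {
  // Returns the record count of each output file for a scan over
  // totalRecords records, honoring --limit (negative = unlimited) and
  // --max-records-per-file (non-positive = everything in one file).
  static List<Long> fileSizes(long totalRecords, long limit, long maxPerFile) {
    long total = (limit >= 0) ? Math.min(totalRecords, limit) : totalRecords;
    long perFile = (maxPerFile > 0) ? maxPerFile : total;
    List<Long> sizes = new ArrayList<>();
    for (long written = 0; written < total; written += perFile) {
      sizes.add(Math.min(perFile, total - written));
    }
    return sizes;
  }
}
```

With --limit 10 and --max-records-per-file 1 over a larger table, this model yields 10 files of 1 record each, matching the behavior described above.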

Contributor Author

I'll update later.

@adoroszlai
Contributor

/pending "I'll update later"

@github-actions github-actions bot left a comment

Marking this issue as un-mergeable as requested.

Please use /ready comment when it's resolved.

Please note that the PR will be closed after 21 days of inactivity from now. (But can be re-opened anytime later...)

"I'll update later"

@github-actions

Thank you very much for the patch. I am closing this PR temporarily as there was no activity recently and it is waiting for response from its author.

It doesn't mean that this PR is not important or ignored: feel free to reopen the PR at any time.

It only means that attention of committers is not required. We prefer to keep the review queue clean. This ensures PRs in need of review are more visible, which results in faster feedback for all PRs.

If you need ANY help to finish this PR, please contact the community on the mailing list or the slack channel.

@github-actions github-actions bot closed this Jul 21, 2024
@adoroszlai
Contributor

Continued in #7467.
