Added Time-Specific Metrics #819

Closed
wants to merge 7 commits into from

@guljain (Contributor) commented Jul 6, 2022

Added time-related metrics for the following functions:

  1. Open : Present in GoogleHadoopFileSystem class with unit test in GoogleHadoopFileSystemIntegrationTest class.
  2. Rename : Present in GoogleHadoopFileSystem class with unit test in GoogleHadoopFileSystemIntegrationTest class.
  3. Delete : Present in GoogleHadoopFileSystem class with unit test in GoogleHadoopFileSystemIntegrationTest class.
  4. Create : Present in GoogleHadoopFileSystem class with unit test in GoogleHadoopFileSystemIntegrationTest class.
  5. Hsync : Present in GoogleHadoopOutputStream class with unit test in GoogleHadoopOutputStreamTest class.
  6. Hflush : Present in GoogleHadoopOutputStream class with unit test in GoogleHadoopOutputStreamTest class.
  7. Close : Present in GoogleHadoopOutputStream class with unit test in GoogleHadoopOutputStreamTest class.
  8. Seek : Present in GoogleHadoopFSInputStream class with unit test in GoogleHadoopFSInputStreamIntegrationTest class.


/** Time Statistics which are collected in GCS. */
@InterfaceStability.Unstable
public enum GhfsTimeStatistic {
Contributor

I am not sure creating a mutable enum is the right thing to do. Why not create a class instead? Also, making this a singleton makes testing harder.

Contributor Author

For the plugin GhfsTimeStatisticPlugin, I needed references to loop over the various metrics, which is why a mutable enum was used. Also, by class, do you mean an implementation like https://github.com/LucaCanali/hadoop/blob/s3aAndHDFSTimeInstrumentation/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ATimeInstrumentation.java ?

Contributor

I think mutable enums are an anti-pattern, since they behave like global variables, which are hard to test. I was curious why we can't achieve the same functionality with a Java class instead of a Java enum.

Contributor Author

Having an enum makes the metrics look more organized, in my opinion. I agree with the mutable enum concern, so I have mapped each constant to an AtomicLong that stores the metric's value. Does that seem like a good alternative?

Contributor

This is practically a global mutable variable and testing this will be difficult. Hence I do not think this is the right thing to do.

Contributor

Please open a bug for this and assign to me. Also add a TODO comment in the code with the bug number.

@@ -496,14 +496,21 @@ public String getScheme() {

@Override
public FSDataInputStream open(Path hadoopPath, int bufferSize) throws IOException {
long startTime = System.nanoTime();
Contributor

Why not create a lambda for timing the methods instead of duplicating the duration-tracking logic?

Contributor Author

I don't understand how to use a lambda here. In the best case, each function would still gain two lines (for startTime and endTime) along with a try-finally.

Contributor

I have shared sample code with you.

Contributor

Does this mean wrapping the entire method body inside a lambda, or moving the existing contents of the method into another method and calling that from the lambda?

Clearly we cannot wrap the public method call itself!
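
For readers following this thread, here is a minimal sketch of what the lambda suggestion could look like. It is not the sample code the reviewer shared, which does not appear in this conversation. TimedOperation, IoCallable, ExampleFileSystem, and openMicros are hypothetical names invented for illustration and are not part of the connector. The public method keeps its signature, and only its body moves into the lambda, so the try-finally lives in one place.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (an assumption for illustration, not connector code). */
final class TimedOperation {

  /** Like java.util.concurrent.Callable, but limited to IOException. */
  @FunctionalInterface
  interface IoCallable<T> {
    T call() throws IOException;
  }

  private TimedOperation() {}

  /**
   * Runs body and adds its elapsed time, in microseconds, to totalMicros.
   * The finally block records the time whether body returns or throws.
   */
  static <T> T time(AtomicLong totalMicros, IoCallable<T> body) throws IOException {
    long startNanos = System.nanoTime();
    try {
      return body.call();
    } finally {
      long elapsedNanos = System.nanoTime() - startNanos;
      totalMicros.addAndGet(TimeUnit.NANOSECONDS.toMicros(elapsedNanos));
    }
  }
}

/** Stand-in for a file system class, used only to show the call site. */
final class ExampleFileSystem {
  final AtomicLong openMicros = new AtomicLong();

  // The public signature is unchanged; the former method body becomes the lambda.
  public String open(String path) throws IOException {
    return TimedOperation.time(openMicros, () -> "stream for " + path);
  }
}
```

With a helper of this shape, each instrumented method gains one wrapping call rather than its own startTime, endTime, and try-finally.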

@arunkumarchacko
Contributor

/gcbrun


/** Time Statistics capturing total time taken by a function in micro seconds */
@InterfaceStability.Unstable
public enum GhfsTimeStatistic {
Contributor

Any plans to capture read/write time?

Contributor Author

I believe that is captured in the hflush, hsync, and close statistics from GoogleHadoopOutputStream. If not, can you point to specific functions?


public void hsync_increment_timeStatistics() throws IOException {
Path objectPath = new Path(ghfs.getUri().resolve("/dir/object2.txt"));
FileSystem.Statistics statistics = new FileSystem.Statistics(ghfs.getScheme());
GoogleHadoopOutputStream fout =
Contributor

Consider reducing code duplication in this file.

Contributor Author

Created a single function to test all 3 statistics

instrumentation.fileCreated();
return response;
} finally {
GhfsTimeStatistic.incrementTimeElapsedMicrosec(
Contributor

I think that these stats should be FS instance-specific, not global for JVM.
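
One way to address both this comment and the earlier mutable-enum concern, sketched under assumptions: keep the enum as a set of immutable keys, and hold the counters in an object that each file system instance owns. TimeStatisticKey and InstanceTimeStatistics are hypothetical names, and this is not the design that was eventually merged. The enum can still be iterated with values(), which covers the looping need mentioned for the plugin.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Immutable keys only; the enum holds no counters. Names are illustrative. */
enum TimeStatisticKey {
  OPEN, CREATE, DELETE, RENAME
}

/** Hypothetical holder; each file system instance would create its own. */
final class InstanceTimeStatistics {
  // The map is fully populated in the constructor and never modified
  // structurally afterwards, so concurrent get() calls are safe.
  private final Map<TimeStatisticKey, LongAdder> totalsMicros =
      new EnumMap<>(TimeStatisticKey.class);

  InstanceTimeStatistics() {
    for (TimeStatisticKey key : TimeStatisticKey.values()) {
      totalsMicros.put(key, new LongAdder());
    }
  }

  /** Adds an elapsed duration, in microseconds, to the total for key. */
  void addElapsedMicros(TimeStatisticKey key, long micros) {
    totalsMicros.get(key).add(micros);
  }

  /** Returns the accumulated microseconds recorded for key on this instance. */
  long totalMicros(TimeStatisticKey key) {
    return totalsMicros.get(key).sum();
  }
}
```

Because nothing is static, a test can construct a fresh InstanceTimeStatistics and assert on it without resetting JVM-wide state.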

*
* @return current map of minimums
*/
private Map<String, Long> minimums() {
Contributor

nit: Please move private methods below public methods.


@arunkumarchacko
Contributor

/gcbrun

arunkumarchacko pushed a commit to arunkumarchacko/hadoop-connectors that referenced this pull request Sep 5, 2022
arunkumarchacko added a commit that referenced this pull request Sep 22, 2022
* Added Time-Specific Metrics

Original PR: #819

Co-authored-by: guljain <[email protected]>
This pull request was closed.
5 participants