
DLO part 3 #2

Closed
wants to merge 21 commits into from

Conversation

teamurko (Owner) commented Jun 4, 2024

Summary

[Issue] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

anjagruenheid (Collaborator) commented:

Adding a comment based on our discussion: right now, the cost and gain are calculated in different metrics, so we'll need to figure out how we can eventually combine them. Additionally, the gain has a temporal component that is not currently modeled in this PR, i.e., the cost is usually incurred once for (re-)writing the data, while the gain is realized through repeated accesses by users over time.

```java
int maxNumCommits =
    (int)
        Math.min(
            1000L,
```

How did we arrive at maxNumCommits of 1000?

teamurko (Owner, Author) replied:

It's a heuristic; I'll add comments on how it was chosen. I suspect this parameter will need to be tuned, so maybe I should make it configurable.
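As a sketch of making the heuristic configurable, the cap could be read from a properties map, falling back to the current default of 1000. The property key `dlo.max-num-commits` and the helper below are hypothetical, not part of this PR:

```java
import java.util.Map;

public class CommitCapConfig {
    // Current heuristic default; expected to need per-deployment tuning.
    static final long MAX_NUM_COMMITS_DEFAULT = 1000L;

    // Hypothetical helper: read the cap from a properties map so the
    // heuristic can be tuned without a code change.
    static int maxNumCommits(Map<String, String> props, long candidateCommits) {
        long cap = Long.parseLong(
            props.getOrDefault("dlo.max-num-commits", Long.toString(MAX_NUM_COMMITS_DEFAULT)));
        return (int) Math.min(cap, candidateCommits);
    }
}
```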


```java
DataCompactionConfig.DataCompactionConfigBuilder configBuilder = DataCompactionConfig.builder();

long targetBytesSize = calculateTargetBytesSize();
```

This looks like a constant value in the method's implementation; do we need a method, or would a constant do?

teamurko (Owner, Author) replied:

Initially I thought to have a different target size depending on table size; since we're using a constant in our deployment, we can make it a constant for now.

```java
// DataCompactionConfig.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
configBuilder.maxConcurrentFileGroupRewrites(
    Math.min(
        50,
```

What's the rationale behind 50? Introduce a constant?

teamurko (Owner, Author) replied:

will do
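A minimal sketch of what the named constant could look like; the constant and method names are assumptions, only the value 50 comes from the snippet above:

```java
public class RewriteLimits {
    // Hypothetical named constant replacing the bare 50 above.
    static final int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT = 50;

    // Cap the number of concurrent file-group rewrites by the estimated
    // number of file groups, so small tables don't over-allocate.
    static int maxConcurrentFileGroupRewrites(long estimatedFileGroups) {
        return (int) Math.min(MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT, estimatedFileGroups);
    }
}
```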

```java
        / estimatedNumTasksPerFileGroup));

// don't split large files
configBuilder.maxByteSizeRatio(10);
```

ditto?

```java
    return (double) (numFiles - totalSizeBytes / targetBytesSize);
}

/** Calculate the cost of compaction in seconds. */
```

Please add a descriptive comment on the computation math.

```java
}

/** Calculate the gain of compaction in terms of number of files reduced. */
private double calculateCompactionGain() {
```
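For illustration, the gain formula earlier in the diff counts how many files compaction would eliminate if all bytes were repacked into target-sized files. A standalone sketch with the inputs passed in explicitly (the real method reads them from the table's state):

```java
public class CompactionGainExample {
    // Gain = current file count minus the ideal post-compaction file count
    // (total bytes divided into target-sized files, integer division).
    static double calculateCompactionGain(long numFiles, long totalSizeBytes, long targetBytesSize) {
        return (double) (numFiles - totalSizeBytes / targetBytesSize);
    }

    public static void main(String[] args) {
        // 1000 small files holding 10 GiB, with a 512 MiB target:
        // the ideal count is 20 files, so the gain is 980.0.
        System.out.println(
            calculateCompactionGain(1000L, 10L * 1024 * 1024 * 1024, 512L * 1024 * 1024));
    }
}
```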
sumedhsakdeo commented Jun 18, 2024

Is this the right layering, with gain and cost computed as part of OpenHouseDataLayoutGenerator? I thought OpenHouseDataLayoutGenerator would generate the strategy, and a separate module would execute the optimization strategy; that module should compute the gain/score and stamp it with the selected strategy in table properties.

I'm trying to understand the layering. A description of it in the code would help, maybe in a README.

teamurko (Owner, Author) replied:

It's a bit variable, but let me add a README on how it's currently designed to work.

…or statistics/query logs (linkedin#109)

## Summary

This is part 2 of a new feature, the data layout optimization library: strategy generation. Added a data source interface/implementation. This PR builds on top of linkedin#108.

The following 3 components will be added eventually:
1) DLO library that has primitives for generating data layout
optimization strategies
2) App that generates strategies for all tables
3) Scheduling of the app

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
teamurko (Owner, Author) commented Jun 21, 2024

> Adding a comment based on our discussion: right now, the cost and gain are calculated in different metrics, so we'll need to figure out how we can eventually combine them. Additionally, the gain has a temporal component that is not currently modeled in this PR, i.e., the cost is usually incurred once for (re-)writing the data, while the gain is realized through repeated accesses by users over time.

@anjagruenheid I thought to do score = cost / gain. Maybe we can integrate over time: in most cases the query time window is relative to the current time, and if the layout doesn't change, the gain should decrease over time, so the integral should be finite. This is for query optimization; for storage optimization, it should be static, right?

anjagruenheid (Collaborator) replied:

> Adding a comment based on our discussion: right now, the cost and gain are calculated in different metrics, so we'll need to figure out how we can eventually combine them. Additionally, the gain has a temporal component that is not currently modeled in this PR, i.e., the cost is usually incurred once for (re-)writing the data, while the gain is realized through repeated accesses by users over time.
>
> @anjagruenheid I thought to do score = cost / gain. Maybe we can integrate over time: in most cases the query time window is relative to the current time, and if the layout doesn't change, the gain should decrease over time, so the integral should be finite. This is for query optimization; for storage optimization, it should be static, right?

It's always temporal, I think. And yes, cost/gain is the basis, i.e., if that's below 1, it's a no-brainer to do it. By temporal, I mean that the cost might be higher right now (value > 1), but the more we access the files, the more we gain. Say every time I access the files right now, I have to access 5 files. When I do maintenance, I'll only have 3 files left, so the cost for rewriting = 3. The access gain is 2. So in a snapshot, 3/2 = 1.5 > 1 => the cost outweighs the gain. But as soon as I access the files a second time, 3/4 => the gain outweighs the cost. So 'a second time' would be the temporal aspect here. If you have information (access statistics), you can compute the gain over a time window and weigh that against the cost of the rewrite. Does that make sense?
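The arithmetic in this discussion can be sketched directly: the rewrite cost is paid once, while the access gain repeats per access, so the score improves as accesses accumulate. The names below are illustrative, not the PR's API:

```java
public class CompactionScoreExample {
    // score = one-time rewrite cost divided by accumulated access gain;
    // below 1.0, the gain outweighs the cost.
    static double score(double rewriteCost, double gainPerAccess, int numAccesses) {
        return rewriteCost / (gainPerAccess * numAccesses);
    }

    public static void main(String[] args) {
        // 5 files compacted to 3: rewrite cost 3, access gain 2 per access.
        System.out.println(score(3.0, 2.0, 1)); // 1.5  -> cost outweighs gain
        System.out.println(score(3.0, 2.0, 2)); // 0.75 -> gain outweighs cost
    }
}
```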

cbb330 and others added 3 commits July 8, 2024 16:41
## Summary
problem: Swagger UI returns 401 when requesting api-docs via a browser
<img width="989" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/7f0c2f35-676f-4999-b11d-c9601c0969a7">

solution: expose the Swagger UI to unauthenticated access (just like the existing, non-UI endpoint `/v3/api-docs`)
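A minimal Spring Security sketch of the described fix, assuming a Spring Boot 2.x setup; the matcher paths and config class name are assumptions, not the PR's actual code:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

@Configuration
public class SwaggerSecurityConfig extends WebSecurityConfigurerAdapter {
  @Override
  protected void configure(HttpSecurity http) throws Exception {
    http.authorizeRequests()
        // Exempt the UI and its assets, mirroring the existing /v3/api-docs exemption.
        .antMatchers("/swagger-ui/**", "/swagger-ui.html", "/v3/api-docs/**")
        .permitAll()
        .anyRequest()
        .authenticated();
  }
}
```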

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [x] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

Swagger-ui is beneficial for browsing API configuration for features
like, what are the required client headers for a given API endpoint.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

I manually ran tests by configuring
infra/recipes/docker-compose/oh-only/docker-compose.yml to also start
the jobs rest service, then
```bash
➜ JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_282-msft.jdk/Contents/Home ./gradlew clean build -x test -x javadoc
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down --rmi all
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up
```

then querying each endpoint:
tables
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/b5411c96-0f52-4010-96fc-ce4db8e34ac5">
housetables
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/6a2dda20-f501-4298-be5a-4919ed4a1075">
jobs
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/eb7a6cb8-1a28-4d23-b353-e2c6e0ca54a3">

> No tests added or updated. Please explain why. If unsure, please feel
free to ask for help.

When attempting to add unit tests, using e.g. MockMvcBuilder, the swagger endpoint returns 404 for our services. I believe this is because the service unit tests use a mocked version of the controller, but none of our controllers specify swagger. Swagger is configured at Spring application start, so it would require a different method of testing than using a mocked controller. I tried to use a Tomcat/H2 local server but was still getting 404s.

I would be open to continuing to try unit tests with some pointers in the right direction for testing a Spring Boot application server with swagger configured.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary
The snapshot expiration job skips removing files, as we wanted to localize file removal to a single job. The job still traverses the file tree, though, which is expensive and unnecessary. As a result of this change, it updates the snapshot list in the metadata without traversing the file tree.
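A minimal sketch of the metadata-only selection this change enables, with snapshots modeled as plain (id, timestamp) pairs. The real job uses Iceberg's table metadata; the point is only that expired snapshot IDs can be computed without walking the file tree:

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotExpiryExample {
    static final class Snapshot {
        final long id;
        final long timestampMillis;

        Snapshot(long id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    // Select snapshots older than the cutoff purely from metadata; file
    // deletion is left to the separate files-removal job.
    static List<Long> expiredSnapshotIds(List<Snapshot> snapshots, long cutoffMillis) {
        List<Long> expired = new ArrayList<>();
        for (Snapshot s : snapshots) {
            if (s.timestampMillis < cutoffMillis) {
                expired.add(s.id);
            }
        }
        return expired;
    }
}
```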

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

No change in the job's effect is expected; it's purely an optimization.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary

Deprecate the getAllTables api in favor of the searchTables api. The getAllTables api causes latency issues; it will come back with paging support.
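For illustration only, paged listing could look like the sketch below; the page shape and names are assumptions (this PR only deprecates getAllTables, it does not add paging):

```java
import java.util.ArrayList;
import java.util.List;

public class PagedListingExample {
    // Hypothetical page result: a slice of table names plus the next offset,
    // or -1 when there are no more pages.
    static final class Page {
        final List<String> tables;
        final int nextOffset;

        Page(List<String> tables, int nextOffset) {
            this.tables = tables;
            this.nextOffset = nextOffset;
        }
    }

    // Return one page of results; callers pass the returned nextOffset back
    // in to fetch the following page, bounding per-request latency.
    static Page listPage(List<String> allTables, int offset, int pageSize) {
        int end = Math.min(offset + pageSize, allTables.size());
        return new Page(new ArrayList<>(allTables.subList(offset, end)),
            end < allTables.size() ? end : -1);
    }
}
```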

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
jiang95-dev and others added 9 commits July 12, 2024 10:49
…#135)

## Summary

The [PR](linkedin#127) did not clean up all the tests related to the getAllTables api (not sure why the build succeeded). This PR removes those tests so the GitHub build succeeds.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary

The build failed because of the missing `khttp` library that is required by the `springdoc-openapi` plugin. This library was removed from the maven2 repo in 2022, so it's not clear why the previous build succeeded (maybe GitHub caches some gradle libraries). The `springdoc-openapi` plugin has removed its usage of `khttp` since `1.5.0`; I'm upgrading the version to `1.6.0`, which is the latest that Gradle 6 supports.

More details in this issue:
springdoc/springdoc-openapi-gradle-plugin#92.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
…inkedin#140)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

The tables test fixtures fat jar ships with a version of OSS Hadoop that's incompatible with a newer version of li-hadoop (specifically 2.10.0.780). When li-openhouse was upgraded to this version of li-hadoop, unit tests depending on the tables test fixtures fat jar started to fail when starting a test HDFS cluster.

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [x] Some other form of testing like staging or soak time in
production. Please explain.

Unpacked the jar locally after changes and verified that all hadoop
classes are relocated
<img width="1431" alt="image"
src="https://github.com/user-attachments/assets/6c61ac46-0209-4630-99b7-4b6f14a66736">

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Vikram Bohra <[email protected]>
@@ -58,3 +58,45 @@ scripts/python/env
# Ignore autogenerated dummy tokens for local docker-compose cluster
infra/recipes/docker-compose/oh-hadoop-spark/openhouse.token
infra/recipes/docker-compose/oh-hadoop-spark/u_tableowner.token


Is there a merge problem in this branch?


nvm, wrong pr

teamurko closed this on Aug 2, 2024

teamurko (Owner, Author) commented Aug 2, 2024

See linkedin#116
