
DLO part 3 #2

Closed
wants to merge 21 commits into from

Conversation

teamurko (Owner) commented Jun 4, 2024

Summary

[Issue] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

anjagruenheid (Collaborator) commented:

Adding a comment based on our discussion: right now, the cost and gain are calculated in different metrics, so we'll need to figure out how we can eventually combine them. Additionally, the gain has a temporal component that is not currently modeled in this PR, i.e., the cost is usually incurred once for (re-)writing the data, while the gain is realized through repeated accesses by users over time.

```java
int maxNumCommits =
    (int)
        Math.min(
            1000L,
```

How did we arrive at maxNumCommits of 1000?

teamurko (Owner, Author) replied:

It's a heuristic; I'll add comments on how it was chosen. I suspect this parameter will need to be tuned, so maybe I should make it configurable.
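As a sketch of making the heuristic configurable, the cap could be read from a properties map, falling back to the current default of 1000. The property key `dlo.max-num-commits` and the helper below are hypothetical, not part of this PR:

```java
import java.util.Map;

public class CommitCapConfig {
    // Current heuristic default; expected to need per-deployment tuning.
    static final long MAX_NUM_COMMITS_DEFAULT = 1000L;

    // Hypothetical helper: read the cap from a properties map so the
    // heuristic can be tuned without a code change.
    static int maxNumCommits(Map<String, String> props, long candidateCommits) {
        long cap = Long.parseLong(
            props.getOrDefault("dlo.max-num-commits", Long.toString(MAX_NUM_COMMITS_DEFAULT)));
        return (int) Math.min(cap, candidateCommits);
    }
}
```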


```java
DataCompactionConfig.DataCompactionConfigBuilder configBuilder = DataCompactionConfig.builder();

long targetBytesSize = calculateTargetBytesSize();
```

This looks like a constant value in the method's implementation; do we need a method, or would a constant do?

teamurko (Owner, Author) replied:

Initially I thought to have a different target size depending on table size; since we're using a constant in our deployment, we can make it a constant for now.

```java
// DataCompactionConfig.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT
configBuilder.maxConcurrentFileGroupRewrites(
    Math.min(
        50,
```

What's the rationale behind 50? Introduce a constant?

teamurko (Owner, Author) replied:

will do
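A minimal sketch of what the named constant could look like; the constant and method names are assumptions, only the value 50 comes from the snippet above:

```java
public class RewriteLimits {
    // Hypothetical named constant replacing the bare 50 above.
    static final int MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT = 50;

    // Cap the number of concurrent file-group rewrites by the estimated
    // number of file groups, so small tables don't over-allocate.
    static int maxConcurrentFileGroupRewrites(long estimatedFileGroups) {
        return (int) Math.min(MAX_CONCURRENT_FILE_GROUP_REWRITES_DEFAULT, estimatedFileGroups);
    }
}
```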

```java
        / estimatedNumTasksPerFileGroup));

// don't split large files
configBuilder.maxByteSizeRatio(10);
```

ditto?

```java
    return (double) (numFiles - totalSizeBytes / targetBytesSize);
}

/** Calculate the cost of compaction in seconds. */
```

Please add a descriptive comment on the computation math.

```java
}

/** Calculate the gain of compaction in terms of number of files reduced. */
private double calculateCompactionGain() {
```
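For illustration, the gain formula earlier in the diff counts how many files compaction would eliminate if all bytes were repacked into target-sized files. A standalone sketch with the inputs passed in explicitly (the real method reads them from the table's state):

```java
public class CompactionGainExample {
    // Gain = current file count minus the ideal post-compaction file count
    // (total bytes divided into target-sized files, integer division).
    static double calculateCompactionGain(long numFiles, long totalSizeBytes, long targetBytesSize) {
        return (double) (numFiles - totalSizeBytes / targetBytesSize);
    }

    public static void main(String[] args) {
        // 1000 small files holding 10 GiB, with a 512 MiB target:
        // the ideal count is 20 files, so the gain is 980.0.
        System.out.println(
            calculateCompactionGain(1000L, 10L * 1024 * 1024 * 1024, 512L * 1024 * 1024));
    }
}
```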
sumedhsakdeo commented Jun 18, 2024

Is this the right layering, with gain and cost computed as part of OpenHouseDataLayoutGenerator? I thought OpenHouseDataLayoutGenerator would generate the strategy, and a separate module would execute the optimization strategy; that module should compute the gain/score and stamp it with the selected strategy in table properties.

I'm trying to understand the layering. A description of it in the code would help, maybe in a README.

teamurko (Owner, Author) replied:

It's a bit variable, but let me add a README on how it's currently designed to work.

…or statistics/query logs (linkedin#109)

## Summary

This is part 2 of a new feature, the data layout optimization library: strategy generation. Added a data source interface/implementation. This PR builds on top of linkedin#108.

The following 3 components will be added eventually:
1) DLO library that has primitives for generating data layout
optimization strategies
2) App that generates strategies for all tables
3) Scheduling of the app

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
teamurko (Owner, Author) commented Jun 21, 2024

> Adding a comment based on our discussion: right now, the cost and gain are calculated in different metrics, so we'll need to figure out how we can eventually combine them. Additionally, the gain has a temporal component that is not currently modeled in this PR, i.e., the cost is usually incurred once for (re-)writing the data, while the gain is realized through repeated accesses by users over time.

@anjagruenheid I thought to do score = cost / gain. Maybe we can integrate over time: in most cases the query time window is relative to the current time, and if the layout doesn't change, the gain should decrease over time, so the integral should be finite. This is for query optimization; for storage optimization, it should be static, right?

anjagruenheid (Collaborator) replied:

> Adding a comment based on our discussion: right now, the cost and gain are calculated in different metrics, so we'll need to figure out how we can eventually combine them. Additionally, the gain has a temporal component that is not currently modeled in this PR, i.e., the cost is usually incurred once for (re-)writing the data, while the gain is realized through repeated accesses by users over time.
>
> @anjagruenheid I thought to do score = cost / gain. Maybe we can integrate over time: in most cases the query time window is relative to the current time, and if the layout doesn't change, the gain should decrease over time, so the integral should be finite. This is for query optimization; for storage optimization, it should be static, right?

It's always temporal, I think. And yes, cost/gain is the basis, i.e., if that's below 1, it's a no-brainer to do it. By temporal, I mean that the cost might be higher right now (value > 1), but the more we access the files, the more we gain. Say every time I access the files right now, I have to access 5 files. When I do maintenance, I'll only have 3 files left, so the cost for rewriting = 3. The access gain is 2. So in a snapshot, 3/2 = 1.5 > 1 => the cost outweighs the gain. But as soon as I access the files a second time, 3/4 => the gain outweighs the cost. So 'a second time' would be the temporal aspect here. If you have information (access statistics), you can compute the gain over a time window and weigh that against the cost of the rewrite. Does that make sense?
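The arithmetic in this discussion can be sketched directly: the rewrite cost is paid once, while the access gain repeats per access, so the score improves as accesses accumulate. The names below are illustrative, not the PR's API:

```java
public class CompactionScoreExample {
    // score = one-time rewrite cost divided by accumulated access gain;
    // below 1.0, the gain outweighs the cost.
    static double score(double rewriteCost, double gainPerAccess, int numAccesses) {
        return rewriteCost / (gainPerAccess * numAccesses);
    }

    public static void main(String[] args) {
        // 5 files compacted to 3: rewrite cost 3, access gain 2 per access.
        System.out.println(score(3.0, 2.0, 1)); // 1.5  -> cost outweighs gain
        System.out.println(score(3.0, 2.0, 2)); // 0.75 -> gain outweighs cost
    }
}
```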

cbb330 and others added 3 commits July 8, 2024 16:41
## Summary
problem: Swagger UI returns 401 when requesting api-docs via a browser
<img width="989" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/7f0c2f35-676f-4999-b11d-c9601c0969a7">

solution: expose the Swagger UI to unauthenticated access (just like the existing, non-UI endpoint `/v3/api-docs`)
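A minimal Spring Security sketch of the described fix, assuming a Spring Boot 2.x setup; the matcher paths and config class name are assumptions, not the PR's actual code:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;

@Configuration
public class SwaggerSecurityConfig extends WebSecurityConfigurerAdapter {
  @Override
  protected void configure(HttpSecurity http) throws Exception {
    http.authorizeRequests()
        // Exempt the UI and its assets, mirroring the existing /v3/api-docs exemption.
        .antMatchers("/swagger-ui/**", "/swagger-ui.html", "/v3/api-docs/**")
        .permitAll()
        .anyRequest()
        .authenticated();
  }
}
```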

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [x] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

Swagger-ui is beneficial for browsing API configuration for features
like, what are the required client headers for a given API endpoint.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

I manually ran tests by configuring
infra/recipes/docker-compose/oh-only/docker-compose.yml to also start
the jobs rest service, then
```bash
➜ JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_282-msft.jdk/Contents/Home ./gradlew clean build -x test -x javadoc
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down --rmi all
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up
```

then querying each endpoint:
tables
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/b5411c96-0f52-4010-96fc-ce4db8e34ac5">
housetables
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/6a2dda20-f501-4298-be5a-4919ed4a1075">
jobs
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/eb7a6cb8-1a28-4d23-b353-e2c6e0ca54a3">

> No tests added or updated. Please explain why. If unsure, please feel
free to ask for help.

When attempting to add unit tests, using e.g. MockMvcBuilder, the swagger endpoint returns 404 for our services. I believe this is because the service unit tests use a mocked version of the controller, but none of our controllers specify swagger. Swagger is configured at Spring application start, so it would require a different method of testing than using a mocked controller. I tried to use a Tomcat/H2 local server but was still getting 404s.

I would be open to continuing to try unit tests with some pointers in the right direction for testing a Spring Boot application server with swagger configured.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary
The snapshot expiration job skips removing files, as we wanted to localize file removal to a single job. The job still traverses the file tree, though, which is expensive and unnecessary. As a result of this change, it updates the snapshot list in the metadata without traversing the file tree.
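A minimal sketch of the metadata-only selection this change enables, with snapshots modeled as plain (id, timestamp) pairs. The real job uses Iceberg's table metadata; the point is only that expired snapshot IDs can be computed without walking the file tree:

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotExpiryExample {
    static final class Snapshot {
        final long id;
        final long timestampMillis;

        Snapshot(long id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    // Select snapshots older than the cutoff purely from metadata; file
    // deletion is left to the separate files-removal job.
    static List<Long> expiredSnapshotIds(List<Snapshot> snapshots, long cutoffMillis) {
        List<Long> expired = new ArrayList<>();
        for (Snapshot s : snapshots) {
            if (s.timestampMillis < cutoffMillis) {
                expired.add(s.id);
            }
        }
        return expired;
    }
}
```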

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

No change in the job's effect is expected; it's purely an optimization.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary

Deprecate the getAllTables api in favor of the searchTables api. The getAllTables api causes latency issues; it will come back with paging support.
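For illustration only, paged listing could look like the sketch below; the page shape and names are assumptions (this PR only deprecates getAllTables, it does not add paging):

```java
import java.util.ArrayList;
import java.util.List;

public class PagedListingExample {
    // Hypothetical page result: a slice of table names plus the next offset,
    // or -1 when there are no more pages.
    static final class Page {
        final List<String> tables;
        final int nextOffset;

        Page(List<String> tables, int nextOffset) {
            this.tables = tables;
            this.nextOffset = nextOffset;
        }
    }

    // Return one page of results; callers pass the returned nextOffset back
    // in to fetch the following page, bounding per-request latency.
    static Page listPage(List<String> allTables, int offset, int pageSize) {
        int end = Math.min(offset + pageSize, allTables.size());
        return new Page(new ArrayList<>(allTables.subList(offset, end)),
            end < allTables.size() ? end : -1);
    }
}
```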

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
jiang95-dev and others added 9 commits July 12, 2024 10:49
…#135)

## Summary

The [PR](linkedin#127) did not clean up all the tests related to the getAllTables api (not sure why the build succeeded). This PR removes those tests so the GitHub build succeeds.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary

The build failed because of the missing `khttp` library that is required by the `springdoc-openapi` plugin. This library was removed from the maven2 repo in 2022, so it's not clear why the previous build succeeded (maybe GitHub caches some gradle libraries). The `springdoc-openapi` plugin has removed its usage of `khttp` since `1.5.0`; I'm upgrading the version to `1.6.0`, which is the latest that Gradle 6 supports.

More details in this issue:
springdoc/springdoc-openapi-gradle-plugin#92.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
…inkedin#140)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

The tables test fixtures fat jar ships with a version of OSS Hadoop that's incompatible with a newer version of li-hadoop (specifically 2.10.0.780). When li-openhouse was upgraded to this version of li-hadoop, unit tests depending on the tables test fixtures fat jar started to fail when starting a test HDFS cluster.

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [x] Some other form of testing like staging or soak time in
production. Please explain.

Unpacked the jar locally after changes and verified that all hadoop
classes are relocated
<img width="1431" alt="image"
src="https://github.com/user-attachments/assets/6c61ac46-0209-4630-99b7-4b6f14a66736">

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Vikram Bohra <[email protected]>
@@ -58,3 +58,45 @@ scripts/python/env
# Ignore autogenerated dummy tokens for local docker-compose cluster
infra/recipes/docker-compose/oh-hadoop-spark/openhouse.token
infra/recipes/docker-compose/oh-hadoop-spark/u_tableowner.token


Is there a merge problem in this branch?


nvm, wrong pr

teamurko closed this on Aug 2, 2024

teamurko (Owner, Author) commented Aug 2, 2024

See linkedin#116
