Enhance GetDocuments API by adding bulk retrieval #931

Merged on Jul 19, 2024 (4 commits)

Conversation

kokodak
Member

@kokodak kokodak commented Jul 16, 2024

What this PR does / why we need it:

This PR implements a bulk retrieval operation for the GetDocuments API to enhance performance.

The specific tasks accomplished include:

  • Addition of the FindDocInfosByKeys() method to the Database interface
  • Implementation of the FindDocInfosByKeys() logic in mongo.Client and memory.DB respectively
  • Addition of test code for mongo and in-memory implementations
  • Replacement of DB queries used in GetDocument API and GetDocuments API

While the query to retrieve DocInfos has been reduced from N times to once when calling the GetDocuments API, there still remains an issue where packs.BuildDocumentForServerSeq() is called N times.
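As a rough illustration of the query-count difference described above, the following sketch uses a toy in-memory store that counts the queries it serves. The `countingStore` type, the trimmed `DocInfo` struct, and the simplified method signatures are hypothetical stand-ins, not the actual Yorkie types.

```go
package main

import "fmt"

// DocInfo is a hypothetical, trimmed-down stand-in for document metadata.
type DocInfo struct {
	Key string
}

// countingStore is a toy store that counts how many queries it serves.
type countingStore struct {
	docs    map[string]DocInfo
	queries int
}

// newStore builds a store holding one document per key.
func newStore(keys ...string) *countingStore {
	s := &countingStore{docs: map[string]DocInfo{}}
	for _, key := range keys {
		s.docs[key] = DocInfo{Key: key}
	}
	return s
}

// FindDocInfoByKey serves exactly one key per query.
func (s *countingStore) FindDocInfoByKey(key string) (DocInfo, bool) {
	s.queries++
	info, ok := s.docs[key]
	return info, ok
}

// FindDocInfosByKeys serves every requested key in a single query.
// Keys without a matching document are skipped.
func (s *countingStore) FindDocInfosByKeys(keys []string) []DocInfo {
	s.queries++
	infos := make([]DocInfo, 0, len(keys))
	for _, key := range keys {
		if info, ok := s.docs[key]; ok {
			infos = append(infos, info)
		}
	}
	return infos
}

func main() {
	keys := []string{"a", "b", "c"}

	// One query per key: N queries for N keys.
	perKey := newStore(keys...)
	for _, key := range keys {
		perKey.FindDocInfoByKey(key)
	}

	// One query for all keys.
	bulk := newStore(keys...)
	bulk.FindDocInfosByKeys(keys)

	fmt.Println("per-key queries:", perKey.queries, "bulk queries:", bulk.queries)
}
```

The per-document build step that remains in the real code path is not modeled here; only the metadata lookup is.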

However, this logic seems to be related to CRDT or logical clock functionalities, which I do not fully understand yet, so I could not work on it. Therefore, I did not remove the TODO comment regarding the N+1 issue.

Which issue(s) this PR fixes:

Fixes #921

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Additional documentation:


Checklist:

  • Added relevant tests or not required
  • Didn't break anything

Summary by CodeRabbit

  • New Features

    • Enhanced document retrieval by adding a method to find document information based on given keys.
    • Introduced a new field to specify whether to include snapshots in document requests.
  • Bug Fixes

    • Improved efficiency of document information retrieval, addressing the N+1 query problem.
  • Documentation

    • Updated OpenAPI specifications for improved readability and consistency.


coderabbitai bot commented Jul 16, 2024

Walkthrough

The new method FindDocInfosByKeys was introduced to the DB struct in database.go, enabling the retrieval of multiple documents based on given keys. A corresponding test function, RunFindDocInfosByKeysTest, was also added to verify the functionality. These enhancements aim to improve the performance of the GetDocuments API by facilitating efficient bulk data queries.

Changes

File | Change Summary
server/backend/database/memory/database.go | Added FindDocInfosByKeys method to retrieve documents based on given keys.
server/backend/database/testcases/testcases.go | Added RunFindDocInfosByKeysTest to test the FindDocInfosByKeys method by creating documents with specified keys and verifying the retrieval process.
server/documents/documents.go | Revised GetDocumentSummary and GetDocumentSummaries to use the new FindDocInfosByKeys method for improved bulk retrieval efficiency.
server/rpc/admin_server.go | Updated GetDocuments function to include a new parameter for the include_snapshot flag to enhance data retrieval options.
api/yorkie/v1/admin.proto | Added include_snapshot field to GetDocumentsRequest for optional snapshot inclusion in responses.

Assessment against linked issues

Objective Addressed Explanation
Implement DB Query for GetDocuments API to improve performance (#921)

In the realm of keys and docs they spin,
Where queries dance and tests begin,
Performance soared, the code refined,
In FindDocInfosByKeys, success we find.
🐇🚀


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b468f8b and eb1425d.

Files selected for processing (8)
  • server/backend/database/database.go (1 hunks)
  • server/backend/database/memory/database.go (1 hunks)
  • server/backend/database/memory/database_test.go (1 hunks)
  • server/backend/database/mongo/client.go (1 hunks)
  • server/backend/database/mongo/client_test.go (1 hunks)
  • server/backend/database/testcases/testcases.go (1 hunks)
  • server/documents/documents.go (2 hunks)
  • test/sharding/mongo_client_test.go (1 hunks)
Additional comments not posted (8)
server/backend/database/memory/database_test.go (1)

51-53: Approval of new test case addition.

The addition of the RunFindDocInfosByKeys test is aligned with the PR's objectives to enhance the GetDocuments API performance. This test ensures the new bulk retrieval method works as expected in the memory database implementation.

server/backend/database/mongo/client_test.go (1)

66-68: Approval of new test case addition.

The addition of the RunFindDocInfosByKeys test is aligned with the PR's objectives to enhance the GetDocuments API performance. This test ensures the new bulk retrieval method works as expected in the MongoDB database implementation.

test/sharding/mongo_client_test.go (1)

75-77: Approval of new test case addition.

The addition of the RunFindDocInfosByKeys test is aligned with the PR's objectives to enhance the GetDocuments API performance. This test ensures the new bulk retrieval method works as expected in the MongoDB client with sharded database configuration.

server/documents/documents.go (2)

100-100: Approval of updated GetDocumentSummary function.

The simplification of the document retrieval process in GetDocumentSummary by using FindDocInfoByKey is a positive change, enhancing the efficiency and maintainability of the code.


123-148: Approval of updated GetDocumentSummaries function.

The update to GetDocumentSummaries to use FindDocInfosByKeys for bulk document retrieval is a significant improvement. This change effectively addresses the N+1 problem and enhances the performance of the API.

server/backend/database/database.go (1)

167-172: New method FindDocInfosByKeys added to Database interface

The addition of FindDocInfosByKeys to the Database interface is a key enhancement for supporting bulk document retrieval. The method signature correctly takes a context, a project ID, and a slice of document keys, which is consistent with the interface's pattern for similar methods.

  • Correctness: The method signature is correct and aligns with Go's conventions for interfaces.
  • Performance: This method supports bulk operations, which should improve performance as noted in the PR objectives.
  • Maintainability: The method is clearly defined and fits well with the existing structure of the interface.
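For readers who want to see the shape being reviewed, here is a minimal sketch of such an interface method with a toy in-memory implementation. Only the parameter list (context, project ID, slice of keys) comes from the review above; the `ProjectID` type, the trimmed `DocInfo` struct, the return type, and `memoryDB` are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// ProjectID and DocInfo are hypothetical, simplified types for this sketch.
type ProjectID string

type DocInfo struct {
	ProjectID ProjectID
	Key       string
}

// Database shows only the bulk lookup; the return type is an assumption.
type Database interface {
	FindDocInfosByKeys(ctx context.Context, projectID ProjectID, keys []string) ([]*DocInfo, error)
}

// memoryDB is a toy implementation backed by a slice.
type memoryDB struct {
	docs []*DocInfo
}

// Compile-time check that memoryDB satisfies the interface.
var _ Database = (*memoryDB)(nil)

func (d *memoryDB) FindDocInfosByKeys(ctx context.Context, projectID ProjectID, keys []string) ([]*DocInfo, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	// Build a set of wanted keys so each document is checked in O(1).
	wanted := make(map[string]struct{}, len(keys))
	for _, key := range keys {
		wanted[key] = struct{}{}
	}
	var infos []*DocInfo
	for _, doc := range d.docs {
		if doc.ProjectID != projectID {
			continue
		}
		if _, ok := wanted[doc.Key]; ok {
			infos = append(infos, doc)
		}
	}
	return infos, nil
}

// findKeys is a small helper that returns only the keys that were found.
func findKeys(db Database, projectID ProjectID, keys []string) ([]string, error) {
	infos, err := db.FindDocInfosByKeys(context.Background(), projectID, keys)
	if err != nil {
		return nil, err
	}
	found := make([]string, 0, len(infos))
	for _, info := range infos {
		found = append(found, info.Key)
	}
	return found, nil
}

func main() {
	db := &memoryDB{docs: []*DocInfo{
		{ProjectID: "p1", Key: "a"},
		{ProjectID: "p2", Key: "a"},
		{ProjectID: "p1", Key: "b"},
	}}
	found, err := findKeys(db, "p1", []string{"a", "b", "missing"})
	if err != nil {
		panic(err)
	}
	fmt.Println(found)
}
```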
server/backend/database/mongo/client.go (1)

766-793: Review of the new method FindDocInfosByKeys.

This method aims to fetch multiple documents based on their keys, which aligns with the PR's objective to enhance performance by reducing the number of database queries. Here are a few observations and suggestions:

  1. Error Handling: The method correctly handles potential errors from the MongoDB operations, which is crucial for robustness.
  2. Efficiency: Using the $in operator with the keys array is efficient for fetching multiple documents in a single query.
  3. Filter Construction: The method constructs a filter to exclude documents marked as removed, which is a good practice for data integrity.

However, consider the following improvements:

  • Logging: Adding logging before and after the MongoDB operations could help in debugging and monitoring the performance of this method.
  • Testing: Ensure that there are comprehensive tests covering various scenarios, including cases with large numbers of keys, no keys, and keys that do not match any documents.

Overall, the implementation looks solid and should contribute positively to the system's performance.
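The filter described in points 2 and 3 can be sketched with plain Go maps, without the MongoDB driver. The field names `project_id`, `key`, and `removed_at`, and the use of `$exists` to detect removed documents, are assumptions for illustration; the review only states that `$in` is used and that removed documents are excluded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildDocInfosFilter returns a MongoDB-style filter document that matches
// every non-removed document in the project whose key is in keys.
// Field names here are hypothetical.
func buildDocInfosFilter(projectID string, keys []string) map[string]any {
	return map[string]any{
		"project_id": projectID,
		// $in matches any document whose key equals one of the listed values.
		"key": map[string]any{"$in": keys},
		// Assumed convention: a removed document carries a removed_at field.
		"removed_at": map[string]any{"$exists": false},
	}
}

func main() {
	filter := buildDocInfosFilter("p1", []string{"doc-a", "doc-b"})
	encoded, err := json.MarshalIndent(filter, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(encoded))
}
```

A single `$in` filter of this kind replaces one lookup per key with one lookup per request, which is the efficiency gain the review refers to.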

server/backend/database/testcases/testcases.go (1)

96-129: Review of RunFindDocInfosByKeysTest Function

  1. Context Setup: The function correctly sets up the test context and activates a client. This is a standard setup for database-related tests.

  2. Document Creation Simulation: The function simulates the creation of documents by attempting to find document information for a set of keys. However, it does not actually create any documents but only checks if they can be retrieved, assuming they exist. This might be misleading as the name suggests creation but it only checks existence.

  3. Bulk Retrieval and Validation: The bulk retrieval using FindDocInfosByKeys is correctly implemented. The function checks if the keys of the retrieved documents match the expected keys using assert.ElementsMatch, which is appropriate for unordered comparisons.

  4. Length Check: The function also checks if the number of retrieved documents matches the number of requested keys using assert.Len. This is a good practice to ensure that no documents are missing or unexpectedly added.

  5. Error Handling: The function properly checks for errors after each database operation, which is crucial for identifying issues early in the test.

  6. Test Isolation: Each test run is isolated using t.Run, which is good for separating test cases and identifying which specific test fails if there are multiple failures.

Suggestions:

  • Consider actually creating the documents in the database before trying to retrieve them. This would make the test more comprehensive and realistic.
  • Add more detailed assertions to check the contents of the retrieved documents, not just their keys.

server/backend/database/memory/database.go (outdated, resolved)
@sejongk
Contributor

sejongk commented Jul 17, 2024

Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization.
Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
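One possible reading of the $or idea, sketched with plain Go maps: each document contributes one clause pairing its ID with its own server sequence bound, and the clauses are combined into a single filter. The `docSeq` type and the field names `doc_id` and `server_seq` are assumptions for illustration, not the schema used by the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// docSeq pairs a document ID with the server sequence to build it for.
// Both the type and its fields are hypothetical.
type docSeq struct {
	DocID     string
	ServerSeq int64
}

// buildBulkSeqFilter combines one clause per document into a single $or
// filter. It reports false for an empty input, because MongoDB rejects
// an $or with an empty clause array.
func buildBulkSeqFilter(targets []docSeq) (map[string]any, bool) {
	if len(targets) == 0 {
		return nil, false
	}
	clauses := make([]map[string]any, 0, len(targets))
	for _, target := range targets {
		clauses = append(clauses, map[string]any{
			"doc_id":     target.DocID,
			"server_seq": map[string]any{"$lte": target.ServerSeq},
		})
	}
	return map[string]any{"$or": clauses}, true
}

func main() {
	filter, ok := buildBulkSeqFilter([]docSeq{
		{DocID: "doc-1", ServerSeq: 10},
		{DocID: "doc-2", ServerSeq: 42},
	})
	if !ok {
		panic("no targets")
	}
	encoded, err := json.MarshalIndent(filter, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(encoded))
}
```

Note that such a filter returns every matching row per document, so the caller would still need to group results by document and pick the closest one for each.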

@krapie krapie requested review from sejongk and devleejb July 17, 2024 04:02
@krapie krapie added the enhancement 🌟 New feature or request label Jul 17, 2024
@kokodak
Member Author

kokodak commented Jul 17, 2024

Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?

@sejongk
I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

@sejongk
Contributor

sejongk commented Jul 17, 2024

@sejongk I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

Sure. If you have any suggestions about this, please let me know.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between eb1425d and d847567.

Files selected for processing (2)
  • server/backend/database/memory/database.go (1 hunks)
  • server/backend/database/testcases/testcases.go (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • server/backend/database/memory/database.go
  • server/backend/database/testcases/testcases.go

@hackerwins
Member

hackerwins commented Jul 19, 2024

Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?

@sejongk
I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

@kokodak @sejongk

DocumentSummaries, which is in the response from GetDocuments API, contains both time-related metadata about Document and its content, snapshot. Unlike metadata, retrieving snapshot requires loading the document into memory, which can be relatively resource-intensive.

Document List Page in CodePair, which uses this API, only uses the time-related metadata and not snapshot.

https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0

Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?

@sejongk
Contributor

sejongk commented Jul 19, 2024

Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
@sejongk
I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

@kokodak @sejongk

DocumentSummaries, which is in the response from GetDocuments API, contains both time-related metadata about Document and its content, snapshot. Unlike metadata, retrieving snapshot requires loading the document into memory, which can be relatively resource-intensive.

Document List Page in CodePair, which uses this API, only uses the time-related metadata and not snapshot.

https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0

Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?

Thanks for your suggestion. I believe this suggested method is somewhat related to #597.

@kokodak
Member Author

kokodak commented Jul 19, 2024

I have reviewed all the comments provided.

Currently, I have completed the implementation of bulk query methods for DB.FindClosestSnapshotInfo() and DB.FindChangesBetweenServerSeqs(), which are used in BuildDocumentForServerSeq().

However, I am facing some issues and need help with the following:

  1. Although the bulk query operations are implemented, I am having difficulty writing test cases. Creating good test scenarios is challenging. Could I get some help with this?

  2. I generally understand the context of @hackerwins's comment, but I am a bit unclear about the exact meaning of "snapshot" since the term is used in several places in the code.
    If the request value for include snapshot is false, does it mean that DB.FindClosestSnapshotInfo() should be called with includeSnapshot set to false, or does it mean that packs.BuildDocumentForServerSeq() should not be executed at all? (I am inclined to believe it's the latter.)

    • 2-a. If the latter is correct, should we still keep the bulk query code mentioned in point 1, to handle cases where include snapshot is true?

  3. I also agree that passing only the minimal information needed to render the screen is a good idea. However, if it turns out that snapshots will never be used in the GetDocuments API, we might consider configuring the code to exclude snapshots without adding an option to the API request. What are your thoughts on this? Should we still include the option in the request for flexibility?

    • 3-a. If we decide to include the option, we will need to coordinate with the front-end regarding the changes in the API structure. How should we approach this discussion?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d847567 and 327e91d.

Files ignored due to path filters (1)
  • api/yorkie/v1/admin.pb.go is excluded by !**/*.pb.go
Files selected for processing (6)
  • api/docs/yorkie/v1/admin.openapi.yaml (56 hunks)
  • api/docs/yorkie/v1/resources.openapi.yaml (64 hunks)
  • api/docs/yorkie/v1/yorkie.openapi.yaml (49 hunks)
  • api/yorkie/v1/admin.proto (1 hunks)
  • server/documents/documents.go (2 hunks)
  • server/rpc/admin_server.go (1 hunks)
Files skipped from review due to trivial changes (3)
  • api/docs/yorkie/v1/admin.openapi.yaml
  • api/docs/yorkie/v1/resources.openapi.yaml
  • api/docs/yorkie/v1/yorkie.openapi.yaml
Files skipped from review as they are similar to previous changes (1)
  • server/documents/documents.go
Additional comments not posted (2)
api/yorkie/v1/admin.proto (1)

120-120: LGTM! The include_snapshot field is correctly added.

The new field include_snapshot is correctly defined and integrated within the GetDocumentsRequest message.

server/rpc/admin_server.go (1)

228-228: LGTM! But verify the function usage in the codebase.

The code changes are approved.

However, ensure that all function calls to GetDocuments match the new signature.

@hackerwins hackerwins self-requested a review July 19, 2024 16:48
Member

@hackerwins hackerwins left a comment

Thanks for your contribution. 👍

@hackerwins hackerwins changed the title from "Implement bulk retrieval operation for GetDocuments API to enhance performance" to "Enhance GetDocuments API by adding bulk retrieval" Jul 19, 2024
@hackerwins hackerwins merged commit a4ce314 into yorkie-team:main Jul 19, 2024
4 checks passed
@kokodak
Member Author

kokodak commented Jul 19, 2024

DocumentSummaries, which is in the response from GetDocuments API, contains both time-related metadata about Document and its content, snapshot. Unlike metadata, retrieving snapshot requires loading the document into memory, which can be relatively resource-intensive.

Document List Page in CodePair, which uses this API, only uses the time-related metadata and not snapshot.

https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0

Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?

Based on the discussions with @hackerwins and @sejongk about the ideas in the comment above, we have decided to implement the option to include or exclude snapshots in the API request.

As a result, the GetDocuments API request specification has changed, which can be reviewed in this commit.

Consequently, by adding the include_snapshot field with a value of false in the CodePair code, we can expect performance improvements in the GetDocuments API.
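A minimal sketch of how such a flag could gate the expensive step: metadata is always copied, and the snapshot is built only when requested. The `DocInfo` and `DocumentSummary` structs are simplified, and `snapshotBuilder` is a hypothetical stand-in for the per-document build step (packs.BuildDocumentForServerSeq() in the discussion above); this is not the merged implementation.

```go
package main

import "fmt"

// DocInfo and DocumentSummary are simplified, hypothetical types.
type DocInfo struct {
	Key       string
	UpdatedAt string
}

type DocumentSummary struct {
	Key       string
	UpdatedAt string
	Snapshot  string
}

// snapshotBuilder stands in for the expensive step that loads a document
// into memory to produce its snapshot.
type snapshotBuilder func(info DocInfo) (string, error)

// summarize always copies the metadata and calls build only when
// includeSnapshot is true.
func summarize(infos []DocInfo, includeSnapshot bool, build snapshotBuilder) ([]DocumentSummary, error) {
	summaries := make([]DocumentSummary, 0, len(infos))
	for _, info := range infos {
		summary := DocumentSummary{Key: info.Key, UpdatedAt: info.UpdatedAt}
		if includeSnapshot {
			snapshot, err := build(info)
			if err != nil {
				return nil, fmt.Errorf("build snapshot for %q: %w", info.Key, err)
			}
			summary.Snapshot = snapshot
		}
		summaries = append(summaries, summary)
	}
	return summaries, nil
}

func main() {
	infos := []DocInfo{
		{Key: "doc-a", UpdatedAt: "2024-07-19T00:00:00Z"},
		{Key: "doc-b", UpdatedAt: "2024-07-18T00:00:00Z"},
	}
	builds := 0
	build := func(info DocInfo) (string, error) {
		builds++
		return "{}", nil
	}
	for _, include := range []bool{false, true} {
		builds = 0
		summaries, err := summarize(infos, include, build)
		if err != nil {
			panic(err)
		}
		fmt.Println("include:", include, "summaries:", len(summaries), "builds:", builds)
	}
}
```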

hackerwins pushed a commit to yorkie-team/codepair that referenced this pull request Jul 26, 2024
There was an issue where a document's updatedAt showed another
document's updatedAt. Specifically, the updatedAt in the document list
was being reversed and reflecting a different document's value.

This issue occurred during the bulk retrieval of document lists using
yorkie-team/yorkie#931. In this process, there was no guarantee that the
order of the keys passed to the DB query matches the order of the
documents in the query result.
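
One way to guard against this kind of mismatch is to re-align the bulk result with the requested key order before pairing it with anything positional. The sketch below is an assumption about how a caller might do that, with a hypothetical simplified `DocInfo`; it is not necessarily the fix applied in the referenced commit.

```go
package main

import "fmt"

// DocInfo is a hypothetical, simplified document record.
type DocInfo struct {
	Key       string
	UpdatedAt string
}

// orderByKeys returns infos rearranged to follow the order of keys.
// Keys with no matching document are skipped.
func orderByKeys(keys []string, infos []DocInfo) []DocInfo {
	byKey := make(map[string]DocInfo, len(infos))
	for _, info := range infos {
		byKey[info.Key] = info
	}
	ordered := make([]DocInfo, 0, len(infos))
	for _, key := range keys {
		if info, ok := byKey[key]; ok {
			ordered = append(ordered, info)
		}
	}
	return ordered
}

func main() {
	keys := []string{"doc-a", "doc-b", "doc-c"}
	// A bulk query may return rows in any order.
	result := []DocInfo{
		{Key: "doc-c", UpdatedAt: "2024-07-03"},
		{Key: "doc-a", UpdatedAt: "2024-07-01"},
		{Key: "doc-b", UpdatedAt: "2024-07-02"},
	}
	for _, info := range orderByKeys(keys, result) {
		fmt.Println(info.Key, info.UpdatedAt)
	}
}
```

Looking results up by key rather than by index avoids attaching one document's updatedAt to another document.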
minai621 pushed a commit to minai621/codepair that referenced this pull request Jul 28, 2024
Labels
enhancement 🌟 New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Implement DB Query for GetDocuments API to improve performance
4 participants