
Conversation

@piyush-zlai
Contributor

@piyush-zlai piyush-zlai commented Oct 4, 2024

Summary

Add a DynamoDB KV store implementation that we can leverage for our Stripe monitoring PoC. This PR only includes the KV store implementation and tests. Wiring this up to our Play service to create the relevant tables and start storing the actual model metadata and time series data will be part of a follow-up.

Our existing KV store API covers some of what we need on the monitoring front but has a few limitations:

  • No list API - we need this to enumerate all the datasets in tables like the Model table. (We could squeeze this into the get call, but that feels contrived: we'd need an arbitrary fixed partition key (e.g. "1") and then issue a range query to pull the entries. Leveraging DynamoDB's scan behind a separate 'list' KV store API felt cleaner.)
  • Create doesn't take params - our DynamoDB create call needs parameters such as whether the table is time-sorted and what the read/write capacity units are. I've extended the create API to accept these params (see the sketch below).
    Open to feedback here. The KV store API is still a bit in flux; once we integrate a couple more cloud provider KV stores (Bigtable, Cosmos DB) we'll converge on something more generic and settled.
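For concreteness, here is a rough sketch of the extended API surface (a sketch only; field and type names are assumptions, and the real definitions live in online/src/main/scala/ai/chronon/online/Api.scala):

import scala.concurrent.Future
import scala.util.Try

// Hypothetical value wrapper for list results; the real type lives in Api.scala.
case class ListValue(keyBytes: Array[Byte], valueBytes: Array[Byte])

case class ListRequest(dataset: String, props: Map[String, Any] = Map.empty)
case class ListResponse(request: ListRequest,
                        values: Try[Seq[ListValue]],
                        resultProps: Map[String, Any] = Map.empty)

trait KVStore {
  def create(dataset: String): Unit
  // New overload: accepts table properties (e.g. is-time-sorted, capacity units).
  def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)
  // New list API, backed in DynamoDB by a scan; not implemented by default.
  def list(request: ListRequest): Future[ListResponse] =
    throw new NotImplementedError("list is not supported by this KV store")
}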

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new key-value store interface using Amazon DynamoDB, allowing for data creation, retrieval, and pagination.
    • Added support for listing datasets with new request and response types in the KVStore API.
    • Enhanced data retrieval capabilities with optional time range parameters.
    • Expanded metrics tracking with a new environment type for the key-value store.
  • Bug Fixes

    • Improved error handling during table creation and data operations.
  • Tests

    • Implemented a comprehensive test suite for the new DynamoDB key-value store functionalities, including table creation, data insertion, pagination, and time series data handling.
  • Chores

    • Updated build configuration to integrate AWS support while preserving existing structures.
    • Added a new entry to the .gitignore file for generated test files.

@coderabbitai
Contributor

coderabbitai bot commented Oct 4, 2024

Walkthrough

The changes in this pull request involve modifications to several files to enhance the project's functionality, particularly with the integration of Amazon DynamoDB. A new project definition for cloud_aws has been added to the build configuration, along with a key-value store implementation using DynamoDB. Additionally, a test suite for this implementation has been introduced. The API has been updated to support new request and response types, and a constant for metrics tracking has been added.

Changes

File Path | Change Summary
.gitignore | Added entry for /cloud_aws/dynamodb-local-metadata.json.
build.sbt | Added new project cloud_aws with dependencies on api and online. Updated root to aggregate cloud_aws.
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala | Implemented DynamoDBKVStoreImpl class with methods for creating tables, retrieving, and storing items in DynamoDB.
cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala | Added test suite for DynamoDBKVStoreImpl covering table creation, data insertion/retrieval, and pagination.
online/src/main/scala/ai/chronon/online/Api.scala | Updated GetRequest case class to include endTsMillis. Added ListRequest and ListResponse case classes, and new methods in KVStore trait.
online/src/main/scala/ai/chronon/online/Metrics.scala | Added constant val KVStore = "kv_store" to Environment enumeration for metrics tracking.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant KVStore
    participant DynamoDB

    User->>KVStore: create(dataset, props)
    KVStore->>DynamoDB: Create table with dataset
    DynamoDB-->>KVStore: Table created
    KVStore-->>User: Success

    User->>KVStore: multiGet(requests)
    KVStore->>DynamoDB: Retrieve items
    DynamoDB-->>KVStore: Return items
    KVStore-->>User: Items retrieved

    User->>KVStore: multiPut(keyValueDatasets)
    KVStore->>DynamoDB: Batch write items
    DynamoDB-->>KVStore: Write success
    KVStore-->>User: Items stored

Possibly related PRs

  • Docker init #20: Both PRs touch .gitignore; this one adds an entry for a DynamoDB-generated metadata file, while #20 adds an entry for .db files. Both manage ignored files in a project that involves DynamoDB.

🐰 In the code, a new tale unfolds,
With DynamoDB, our data holds.
A key-value store, robust and bright,
In tests and metrics, we take flight!
So hop along, let's code away,
With every change, we cheer and play!
🐇



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🧹 Outside diff range and nitpick comments (11)
online/src/main/scala/ai/chronon/online/Metrics.scala (1)

42-42: LGTM! Consider future metric implementations.

The addition of the KVStore constant to the Environment enumeration is appropriate and aligns with the PR objective of implementing a DynamoDB key-value store. The naming convention is consistent with other entries in the enumeration.

As this is part of a WIP PR, here are some considerations for future implementation:

  1. Ensure that you implement corresponding metrics using this new environment constant when you add the actual key-value store operations (see the sketch after this list).
  2. Consider adding comments to document the purpose and usage of this new environment, especially if it introduces any unique metrics or behaviors compared to existing environments.
  3. Update any relevant documentation or README files to reflect this new metric environment.
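As a loose illustration of point 1, a store implementation might tag its metrics with the new environment. This is only a sketch: the Context construction is an assumption, and increment/withSuffix mirror calls quoted elsewhere in this review.

// Hypothetical: a metrics context scoped to the new KVStore environment.
val metricsContext: Metrics.Context = Metrics.Context(environment = Metrics.Environment.KVStore)

def timed[T](op: String)(thunk: => T): T = {
  val ctx = metricsContext.withSuffix(op)
  try { val result = thunk; ctx.increment("success"); result }
  catch { case e: Exception => ctx.increment("failures"); throw e }
}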
build.sbt (2)

181-190: LGTM with suggestions: cloud_aws project structure

The new cloud_aws project is well-structured and includes appropriate dependencies for DynamoDB implementation. However, consider the following suggestions:

  1. If JSON handling is needed in the main code, consider moving Circe dependencies to compile scope:

    libraryDependencies ++= Seq(
      "io.circe" %% "circe-core" % circeVersion,
      "io.circe" %% "circe-generic" % circeVersion,
      "io.circe" %% "circe-parser" % circeVersion
    )
  2. Specify a version for the AWS SDK to ensure consistency across builds:

    val awsSdkVersion = "2.25.35"
    libraryDependencies += "software.amazon.awssdk" % "dynamodb" % awsSdkVersion
  3. Consider adding ScalaTest for unit testing:

    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.19" % Test

These changes will enhance the project's maintainability and testing capabilities.


Line range hint 1-256: Summary: DynamoDB KV store implementation progress

The changes to build.sbt successfully integrate the new cloud_aws project for DynamoDB implementation. This aligns well with the PR objectives of adding a key-value store using DynamoDB. The project structure and dependencies are appropriate for the task.

However, to fully meet the PR objectives, please consider the following next steps:

  1. Implement the actual DynamoDB key-value store logic in the cloud_aws project.
  2. Add unit tests as mentioned in your PR checklist.
  3. Ensure integration with existing CI processes.
  4. Perform integration testing.
  5. Update relevant documentation.

These steps will help complete the implementation and make the PR ready for final review.

cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala (3)

59-59: Consider using a dynamic port for the DynamoDB server

Hardcoding the DynamoDB server port to 8000 may lead to port conflicts if the port is already in use. Using a dynamically assigned port can prevent such conflicts and make the tests more robust.

Modify the setup to use an available port:

- server = ServerRunner.createServerFromCommandLineArgs(Array("-inMemory", "-port", "8000"))
+ val port = 8000 // You can use a method to find an available port dynamically
+ server = ServerRunner.createServerFromCommandLineArgs(Array("-inMemory", "-port", port.toString))

Alternatively, implement a utility to find an available port at runtime.
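For example, a minimal port-finder (a hypothetical helper, not part of this PR):

import java.net.ServerSocket

// Ask the OS for any free ephemeral port, then release it for the server to bind.
// Note: there is a small race between closing the probe socket and the server
// binding the port, which is usually acceptable in tests.
def freePort(): Int = {
  val socket = new ServerSocket(0)
  try socket.getLocalPort
  finally socket.close()
}

server = ServerRunner.createServerFromCommandLineArgs(Array("-inMemory", "-port", freePort().toString))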


12-16: Use a consistent testing framework

Mixing JUnit annotations with ScalaTest matchers can lead to confusion and potential integration issues. Consider using a single testing framework consistently throughout the test class to ensure clarity and maintainability.

Option 1: Use ScalaTest annotations and fixtures.

-import org.junit.After
-import org.junit.Before
-import org.junit.Test
+import org.scalatest.BeforeAndAfter
+import org.scalatest.funsuite.AnyFunSuite

-class DynamoDBKVStoreTest {
+class DynamoDBKVStoreTest extends AnyFunSuite with BeforeAndAfter {

- @Before
+ before {

- @After
+ after {

- @Test
- def testCreatePKeyOnlyTable(): Unit = {
+ test("testCreatePKeyOnlyTable") {

Option 2: Use JUnit matchers instead of ScalaTest matchers.

-import org.scalatest.matchers.must.Matchers.be
-import org.scalatest.matchers.should.Matchers.convertToAnyShouldWrapper
+import org.junit.Assert._

...

- putResults.length shouldBe putReqs.length
+ assertEquals(putReqs.length, putResults.length)

Choose the framework that aligns best with your project's standards.


115-115: Review the use of fixed timeouts with Await.result

Using a fixed timeout of 1.second in Await.result may cause tests to become flaky if operations take longer than expected, especially under heavy load or on slower machines. Consider increasing the timeout or using asynchronous testing constructs provided by the testing framework.

Options to address this:

  • Increase the timeout duration to a higher value, such as 5.seconds or 10.seconds.
  • Use ScalaTest's AsyncFunSuite or similar to handle asynchronous code without blocking (see the sketch below).
  • Refactor the code to avoid using Await.result where possible.

Example of increasing the timeout:

- val putResults = Await.result(kvStore.multiPut(putReqs), 1.second)
+ val putResults = Await.result(kvStore.multiPut(putReqs), 5.seconds)

Also applies to: 149-149, 180-180
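A sketch of the non-blocking alternative using ScalaTest's ScalaFutures (the patience values here are assumptions):

import org.scalatest.concurrent.ScalaFutures._
import org.scalatest.time.{Seconds, Span}

// whenReady polls the future up to the configured patience instead of blocking
// on a single hard Await deadline.
implicit val patience: PatienceConfig = PatienceConfig(timeout = Span(10, Seconds))

whenReady(kvStore.multiPut(putReqs)) { putResults =>
  putResults.length shouldBe putReqs.length
}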

online/src/main/scala/ai/chronon/online/Api.scala (2)

57-58: Add Scaladoc comments for ListRequest and ListResponse

To enhance code documentation and maintainability, consider adding Scaladoc comments to the new ListRequest and ListResponse case classes. This will help other developers understand their purpose and how to use them.
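For example (a sketch; field meanings are inferred from this PR's description):

/** Request to enumerate entries stored in a dataset.
  *
  * @param dataset name of the table to list
  * @param props   implementation-specific options, e.g. a page-size limit or a
  *                continuation key carried over from a previous page
  */
case class ListRequest(dataset: String, props: Map[String, Any])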


71-73: Clarify the default implementation of list method

The list method currently throws a NotImplementedError. If subclasses are expected to override this method, consider adding an @throws annotation or a comment indicating that this method should be implemented by subclasses. This will provide clearer guidance to developers implementing the trait.
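One way to make that contract explicit (a sketch):

/** Lists entries in a dataset. Not supported by default; stores that can
  * enumerate keys (e.g. via a DynamoDB scan) should override this.
  */
@throws[NotImplementedError]("if the underlying store does not support listing")
def list(request: ListRequest): Future[ListResponse] =
  throw new NotImplementedError("list is not supported by this KV store")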

cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (3)

280-295: Enhance exception messages for better debugging

The exception messages in extractTimedValues and extractListValues are generic and may not provide sufficient context during debugging.

Include the dataset name or additional context in the exception messages:

 if (responseMap.isEmpty)
-  throw new Exception("Empty response returned from DynamoDB")
+  throw new Exception(s"Empty response returned from DynamoDB for dataset: ${request.dataset}")

// Similarly for other exceptions

Also applies to: 297-310


99-101: Implement retry logic for transient DynamoDB exceptions

When a ProvisionedThroughputExceededException occurs, the code logs the error but doesn't attempt to retry the operation, which could be successful on a subsequent attempt due to capacity recovery.

Implement exponential backoff and retry logic to handle transient exceptions:

import akka.pattern.retry
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Retries the operation up to `retries` times on failure. This overload of
// akka.pattern.retry retries immediately; the delayed/backoff overloads
// additionally require an implicit Scheduler.
def withRetry[T](operation: => T, retries: Int = 3): T = {
  implicit val ec: ExecutionContext = ExecutionContext.global
  Await.result(retry(() => Future(operation), retries), 30.seconds)
}

Apply this pattern to DynamoDB operations where transient failures are expected.

Also applies to: 139-151, 165-177, 216-228, 256-268


186-186: Avoid magic numbers; define a constant for default list limit

The default list limit is hardcoded as 100, which can be unclear and error-prone.

Define a constant for the default list limit:

 val listLimit =
-  request.props.get(DynamoDBKVStoreConstants.listLimit).map(p => p.asInstanceOf[Int]).getOrElse(100)
+  request.props.get(DynamoDBKVStoreConstants.listLimit).map(p => p.asInstanceOf[Int]).getOrElse(defaultListLimit)

// Add to DynamoDBKVStoreConstants
+ val defaultListLimit = 100
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d998f8d and 2f5c5e8.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • build.sbt (2 hunks)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
  • online/src/main/scala/ai/chronon/online/Metrics.scala (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🔇 Additional comments (4)
build.sbt (1)

54-54: LGTM: Appropriate inclusion of cloud_aws project

The addition of cloud_aws to the root project's aggregation is consistent with the PR objectives of implementing DynamoDB support. This change maintains the existing structure and correctly integrates the new AWS-related project.

online/src/main/scala/ai/chronon/online/Api.scala (1)

23-24: Imports for ListRequest and ListResponse are appropriate

The added import statements for ListRequest and ListResponse are necessary for the new functionalities introduced.

cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2)

329-329: Confirm default end timestamp behavior in queries

In the buildQueryRequest method, endTs defaults to the current system time if not provided. This could lead to unintended results if the client's clock is skewed.

Ensure that defaulting to System.currentTimeMillis() aligns with the expected behavior and doesn't introduce time-related bugs. If necessary, consider requiring endTsMillis to be explicitly provided.
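If you opt to require it, the default could be replaced with a fail-fast check along these lines (a sketch):

// Hypothetical: reject range queries that omit an explicit end timestamp
// rather than silently defaulting to the (possibly skewed) wall clock.
val endTs: Long = request.endTsMillis.getOrElse(
  throw new IllegalArgumentException(
    s"endTsMillis must be provided for range queries on ${request.dataset}"))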


55-57: Address the TODO for tuning capacity units

The default read and write capacity units are set to 10L, and there is a TODO comment to tune these values.

Review the expected read and write throughput for your application and adjust defaultReadCapacityUnits and defaultWriteCapacityUnits accordingly to optimize cost and performance.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (1)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1)

281-285: Define 'valueBytes' as a constant

The string "valueBytes" is used multiple times in the code. Defining it as a constant in DynamoDBKVStoreConstants can prevent typos and make future modifications easier.

Example:

// In DynamoDBKVStoreConstants
val valueBytesColumn = "valueBytes"

And update the code accordingly:

- "valueBytes" -> AttributeValue.builder.b(SdkBytes.fromByteArray(valueBytes)).build
+ valueBytesColumn -> AttributeValue.builder.b(SdkBytes.fromByteArray(valueBytes)).build
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 2f5c5e8 and fb96f00.

📒 Files selected for processing (2)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala
🧰 Additional context used
📓 Learnings (1)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (3)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-07T14:37:55.680Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

@piyush-zlai piyush-zlai changed the title from "[wip] Add DynamoDB kv store implementation" to "Add DynamoDB kv store implementation" on Oct 7, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2)

34-59: LGTM: Well-organized constants, but TODO needs addressing

The DynamoDBKVStoreConstants object is well-structured and provides clear, descriptive constants for DynamoDB configuration. This approach enhances maintainability.

However, there's a TODO comment on lines 56-58 regarding tuning the default capacity units. It would be beneficial to address this before finalizing the implementation.

Would you like assistance in determining appropriate default values for read and write capacity units?


231-235: Improve error handling for unimplemented bulkPut method

While it's good to document unimplemented methods, consider providing a more informative error message instead of using ???. This will help developers understand that the feature is planned but not yet implemented.

Consider updating the implementation to:

override def bulkPut(sourceOfflineTable: String, destinationOnlineDataSet: String, partition: String): Unit = {
  throw new NotImplementedError("bulkPut is not yet implemented for DynamoDB. This feature is planned for future development.")
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fb96f00 and 52addef.

📒 Files selected for processing (1)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (5)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-07T15:49:27.648Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:223-242
Timestamp: 2024-10-07T15:39:43.507Z
Learning: In `DynamoDBKVStoreImpl.scala`, handling `ProvisionedThroughputExceededException` covers the scenarios for `ThrottlingException` in the `handleDynamoDbOperation` method.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-07T14:37:55.680Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
🔇 Additional comments (8)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (8)

1-32: LGTM: Imports and package declaration are well-organized

The imports are comprehensive and correctly grouped, covering all necessary AWS SDK components and Scala utilities required for the implementation.


61-65: LGTM: Well-structured class declaration with good practices

The DynamoDBKVStoreImpl class is well-defined, extending the KVStore interface and using dependency injection for the DynamoDbClient. The inclusion of a metrics context demonstrates good observability practices.


66-123: LGTM: Robust implementation of create method

The create method is well-implemented with proper error handling, metrics tracking, and use of a waiter to ensure table creation is complete. It correctly handles different property types and provides informative logging.


125-165: LGTM: Well-implemented multiGet method with proper concurrency

The multiGet method efficiently handles both simple get requests and query requests. It makes good use of Future for asynchronous processing and leverages helper methods for building requests and extracting values.


167-202: LGTM: Well-implemented list method with pagination support

The list method effectively implements pagination using DynamoDB's scan operation. It correctly handles continuation keys for subsequent requests and uses Future for asynchronous processing. The default limit of 100 is a reasonable choice.
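As a usage illustration, a client could drain a dataset page by page along these lines (a sketch; the continuation-key prop name follows the constants described in this review, while ListValue and the timeout are assumptions):

import scala.concurrent.Await
import scala.concurrent.duration._

// Keep scanning until the store stops returning a continuation key.
def listAll(store: KVStore, dataset: String): Seq[ListValue] = {
  @annotation.tailrec
  def loop(props: Map[String, Any], acc: Seq[ListValue]): Seq[ListValue] = {
    val resp = Await.result(store.list(ListRequest(dataset, props)), 10.seconds)
    val page = acc ++ resp.values.getOrElse(Seq.empty)
    resp.resultProps.get("continuation-key") match {
      case Some(key) => loop(Map("continuation-key" -> key), page)
      case None      => page
    }
  }
  loop(Map.empty, Seq.empty)
}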


204-229: LGTM: Efficient implementation of multiPut method

The multiPut method effectively handles multiple put operations using Future for asynchronous processing. It correctly uses Future.sequence for concurrent processing of all put requests and leverages a helper method for handling DynamoDB operations.


237-322: LGTM: Well-structured and reusable helper methods

The private helper methods are well-organized and promote code reuse. The handleDynamoDbOperation method provides comprehensive error handling, while the methods for extracting values and building requests are clear and focused. This structure enhances the overall maintainability of the code.


1-323: Overall: Excellent implementation of DynamoDB-based key-value store

This implementation of a DynamoDB-based key-value store is well-structured, efficient, and adheres to best practices. Key strengths include:

  1. Proper use of asynchronous processing with Future.
  2. Comprehensive error handling and metrics tracking.
  3. Well-organized constants and helper methods promoting maintainability.
  4. Efficient implementation of core operations (create, multiGet, list, multiPut).

Minor improvements could be made by addressing the TODO comments and providing a more informative error for the unimplemented bulkPut method. Overall, this is a solid foundation for a DynamoDB-based key-value store.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (4)
online/src/main/scala/ai/chronon/online/Api.scala (3)

56-58: New case classes added for listing functionality

The addition of ListRequest and ListResponse case classes is a good design choice for implementing the new listing functionality. These classes provide a flexible structure for both input parameters and return values.

Consider adding scaladoc comments to these new case classes to explain their purpose and the meaning of each field, especially the props and resultProps maps. This will improve code maintainability and make it easier for other developers to understand and use these classes correctly.


69-70: New create method added with properties support

The addition of the create(dataset: String, props: Map[String, Any]) method enhances the KVStore trait by allowing the creation of datasets with associated properties. The default implementation maintains backward compatibility by calling the existing create(dataset) method.

Consider adding a scaladoc comment to this new method explaining its purpose and the potential use of the props parameter. While the default implementation doesn't use props, it's important to document that subclasses (like DynamoDBKVStoreImpl) may utilize this parameter for specific implementations.


71-74: New list method added to KVStore trait

The addition of the list(request: ListRequest) method to the KVStore trait is a good step towards implementing listing functionality. The default implementation throwing a NotImplementedError ensures that subclasses must provide their own implementation if they support listing.

Consider adding a scaladoc comment to this new method explaining its purpose and noting that it's not implemented by default. This will help other developers understand that they need to override this method in their KVStore implementations if they want to support listing functionality.

cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1)

34-59: Consider enhancing DynamoDBKVStoreConstants with ScalaDoc and more specific types.

While the constants are well-defined, adding ScalaDoc comments would improve documentation. Also, consider using more specific types for some constants:

object DynamoDBKVStoreConstants {
  /** Read capacity units to configure DynamoDB table with */
  val readCapacityUnits: String = "read-capacity"

  /** Write capacity units to configure DynamoDB table with */
  val writeCapacityUnits: String = "write-capacity"

  /** Indicates if this table is meant to be time sorted in Dynamo */
  val isTimedSorted: String = "is-time-sorted"

  /** Limit of max number of entries to return in a list call */
  val listLimit: String = "limit"

  /** Continuation key to help with list pagination */
  val continuationKey: String = "continuation-key"

  /** Name of the partition key column to use */
  val partitionKeyColumn: String = "keyBytes"

  /** Name of the time sort key column to use */
  val sortKeyColumn: String = "timestamp"

  // TODO: tune these
  val defaultReadCapacityUnits: Long = 10L
  val defaultWriteCapacityUnits: Long = 10L
}

This change improves code readability and maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 52addef and 26347b9.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • build.sbt (2 hunks)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
  • online/src/main/scala/ai/chronon/online/Metrics.scala (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • online/src/main/scala/ai/chronon/online/Metrics.scala
🚧 Files skipped from review as they are similar to previous changes (2)
  • .gitignore
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala
🧰 Additional context used
📓 Learnings (2)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (5)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-07T15:49:27.648Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:223-242
Timestamp: 2024-10-07T15:39:43.507Z
Learning: In `DynamoDBKVStoreImpl.scala`, handling `ProvisionedThroughputExceededException` covers the scenarios for `ThrottlingException` in the `handleDynamoDbOperation` method.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-07T14:37:55.680Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
online/src/main/scala/ai/chronon/online/Api.scala (3)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-07T15:21:50.787Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/Api.scala`, the default implementation of the `create` method (`def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)`) doesn't leverage the `props` parameter, but subclasses like `DynamoDBKVStoreImpl` use the `props` parameter in their overridden implementations.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:46-50
Timestamp: 2024-10-07T15:17:18.494Z
Learning: When adding new parameters with default values to Scala case classes like `GetRequest`, existing usages don't need updating if backward compatibility is intended.
🔇 Additional comments (7)
build.sbt (4)

Line range hint 1-1: LGTM: Scala version update

The update to Scala version 2.12.18 is a good practice for keeping dependencies current. This minor version update is unlikely to introduce breaking changes and is consistent with the version specified in the inThisBuild section.


Line range hint 1-190: Overall assessment: build.sbt changes

The modifications to build.sbt effectively integrate the new cloud_aws project into the build configuration. The Scala version update and root project aggregation changes are straightforward and beneficial. The cloud_aws project definition is well-structured, though there are suggestions regarding dependency scopes and inclusion.

To further improve the configuration:

  1. Review the necessity of Spark dependencies in the cloud_aws project.
  2. Consider adjusting the scope of Circe dependencies if they're needed in the main code.
  3. Run the suggested verification commands to ensure proper integration and dependency management.

These changes provide a solid foundation for the DynamoDB key-value store implementation. Great job on maintaining build consistency while introducing new functionality!


54-54: LGTM: Root project aggregation updated

The addition of cloud_aws to the root project aggregation is consistent with the PR objectives and ensures the new AWS-related functionality will be included in the build process.

To ensure the new project is properly integrated, you may want to run an sbt command that switches to the cloud_aws project and displays its name. If it succeeds, the project is properly integrated into the build.


181-190: LGTM with suggestions: cloud_aws project definition

The new cloud_aws project definition aligns well with the PR objectives. Good job including DynamoDBLocal for testing purposes. However, there are a couple of points to consider:

  1. Circe dependencies are currently scoped to "test". If JSON handling is required in the main code, consider removing the test scope.
  2. Spark dependencies are included, but it's not clear if they're necessary for a DynamoDB implementation. If they're not used, consider removing them to keep the project lean.

To ensure the dependencies are correctly added and to check for any conflicts, inspect the dependency report for the cloud_aws project. This will help verify that all required dependencies are present and there are no version conflicts.

online/src/main/scala/ai/chronon/online/Api.scala (2)

23-24: New imports added for ListRequest and ListResponse

The new imports for ListRequest and ListResponse are correctly placed and correspond to the new case classes added to the KVStore object. This change is consistent with the introduction of the new listing functionality.


46-50: Approved: Enhanced GetRequest with endTsMillis parameter

The addition of the endTsMillis parameter to GetRequest improves the functionality by allowing range scans. The default value of None maintains backward compatibility, so existing usages don't need to be updated. This change is well-documented with an explanatory comment.
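Roughly, the updated case class looks like this (a sketch; afterTsMillis is assumed from the pre-existing API):

case class GetRequest(keyBytes: Array[Byte],
                      dataset: String,
                      afterTsMillis: Option[Long] = None,
                      // New: optional upper bound enabling range scans;
                      // None preserves the old behavior.
                      endTsMillis: Option[Long] = None)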

cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1)

1-32: LGTM: Imports and package declaration look good.

The imports are well-organized and cover all the necessary dependencies for DynamoDB operations and Scala utilities. The package declaration is correct for the project structure.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (2)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2)

58-60: Consider addressing the TODO comment for tuning default capacity units.

The TODO comment suggests that the default read and write capacity units may need tuning. It would be beneficial to revisit these values based on expected usage patterns and performance requirements.


227-231: Create a GitHub issue to track the implementation of bulkPut.

The bulkPut method is currently unimplemented, with a TODO comment explaining the requirements. To ensure this doesn't get overlooked, it would be beneficial to create a GitHub issue to track this task.

Would you like me to create a GitHub issue for implementing the bulkPut method? The issue could include the following points:

  • Implement bulkPut for DynamoDB store
  • Transform Parquet data to Amazon's Ion format
  • Implement table swapping mechanism (as bulkLoad only writes to new tables)
  • Consider performance implications and best practices for bulk loading in DynamoDB
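If such an issue is filed, one starting point could be DynamoDB's S3 import API. A very rough sketch under heavy assumptions (the bucket layout, key schema, and the preceding Parquet-to-Ion conversion step are all hypothetical):

import software.amazon.awssdk.services.dynamodb.model._

// Import previously converted Ion data from S3 into a *new* table; a table
// swap would still be needed afterwards, as noted above.
def bulkImport(s3Bucket: String, s3Prefix: String, destTable: String): ImportTableResponse = {
  val request = ImportTableRequest.builder
    .s3BucketSource(S3BucketSource.builder.s3Bucket(s3Bucket).s3KeyPrefix(s3Prefix).build)
    .inputFormat(InputFormat.ION)
    .tableCreationParameters(
      TableCreationParameters.builder
        .tableName(destTable)
        .keySchema(KeySchemaElement.builder.attributeName("keyBytes").keyType(KeyType.HASH).build)
        .attributeDefinitions(
          AttributeDefinition.builder.attributeName("keyBytes").attributeType(ScalarAttributeType.B).build)
        .billingMode(BillingMode.PAY_PER_REQUEST)
        .build)
    .build
  dynamoDbClient.importTable(request)
}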
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 3c6711b and aefe14a.

📒 Files selected for processing (1)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (9)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-07T14:37:55.680Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-07T15:49:27.648Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Leveraging DynamoDB batch writes for performance optimization in `multiPut` will be addressed in a future review.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Errors and logging are handled at a granular level in `handleDynamoDbOperation` by passing the metrics context and updating relevant metric values based on the invoking operation.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-08T16:18:45.669Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
🔇 Additional comments (2)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2)

63-116: LGTM! The create method implementation is robust and well-structured.

The create method properly handles different scenarios, including error cases, and uses a waiter to ensure the table is fully created. The code also demonstrates good practices in logging and metrics tracking.


1-329: Overall, the DynamoDBKVStoreImpl is well-implemented and robust.

The implementation provides a comprehensive interface for interacting with DynamoDB as a key-value store. It demonstrates good practices in error handling, asynchronous operations, and code organization. The use of constants, helper methods, and clear method implementations contributes to the overall quality of the code.

Some minor improvements could be made:

  1. Implementing batch writes for the multiPut method to improve performance.
  2. Refining error handling in some methods to use more functional approaches.
  3. Extracting some logic into separate methods for improved readability.

These suggestions aside, the current implementation is solid and ready for use, with clear TODO comments for future enhancements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (8)
build.sbt (2)

181-190: LGTM: cloud_aws project definition

The new cloud_aws project definition is well-structured and aligns with the PR objectives. It includes appropriate dependencies for working with DynamoDB and handling JSON data. The inclusion of DynamoDBLocal for testing is a good practice.

Consider moving the Circe dependencies to compile scope if they're needed in the main code, not just for testing. If they're only used in tests, the current configuration is correct.


Line range hint 13-13: LGTM: Version updates for Scala and Spark

The updates to Scala (2.12.18) and Spark (3.5.1) versions are appropriate and align with the project's goal of using recent stable versions. These minor version updates should provide bug fixes and performance improvements without introducing breaking changes.

Consider updating the comment on line 7 to reflect the current date, as it currently mentions "July 24" which might be outdated.

Also applies to: 24-24

cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (6)

36-61: Constants are well-defined, but default capacity units may need tuning.

The DynamoDBKVStoreConstants object provides a clear set of constants for configuring DynamoDB tables and operations. The naming and documentation are good, promoting code readability and maintainability.

However, there's a TODO comment regarding tuning the default read and write capacity units:

// TODO: tune these
val defaultReadCapacityUnits = 10L
val defaultWriteCapacityUnits = 10L

Consider benchmarking the application's performance with these default values and adjust them if necessary. Once tuned, remove the TODO comment.


63-116: Solid implementation of table creation with good error handling.

The create methods in DynamoDBKVStoreImpl are well-implemented, providing flexibility in table creation with or without additional properties. The code handles various scenarios, including existing tables and potential errors during creation.

Good practices observed:

  1. Use of a waiter to ensure table creation is complete.
  2. Proper error handling and logging.
  3. Flexible configuration of read/write capacity units and sort keys.

Consider adding a specific catch block for ResourceInUseException to differentiate between table already existing and other creation errors. This would allow for more precise metric tracking:

catch {
  case _: ResourceInUseException =>
    logger.info(s"Table: $dataset already exists")
    metricsContext.increment("create.already_exists")
  case e: Exception =>
    logger.error(s"Error creating Dynamodb table: $dataset", e)
    metricsContext.increment("create.failures")
    throw e
}

118-158: Well-implemented multiGet method with proper concurrency.

The multiGet method effectively handles both simple get requests and query requests based on timestamps. The use of Future for asynchronous processing is appropriate and allows for concurrent execution of requests.

Positive aspects:

  1. Clear separation of get and query requests.
  2. Proper use of Future for concurrent processing.
  3. Utilization of helper methods for building requests and extracting results.

To improve readability, consider extracting the logic for creating get and query requests into separate methods:

private def createGetItemRequest(req: KVStore.GetRequest): GetItemRequest = {
  val keyAttributeMap = primaryKeyMap(req.keyBytes)
  GetItemRequest.builder.key(keyAttributeMap.asJava).tableName(req.dataset).build
}

private def createQueryRequest(req: KVStore.GetRequest): QueryRequest = buildQueryRequest(req)

// Usage in multiGet
val getItemRequestPairs = getLookups.map(req => (req, createGetItemRequest(req)))
val queryRequestPairs = queryLookups.map(req => (req, createQueryRequest(req)))

This refactoring would make the main multiGet method more concise and easier to follow.


160-196: Effective implementation of paginated listing with proper error handling.

The list method provides a solid implementation for paginated retrieval of items from a DynamoDB table. It correctly handles continuation keys and uses the ScanRequest API appropriately.

Positive aspects:

  1. Proper handling of pagination using continuation keys.
  2. Use of Future for asynchronous processing.
  3. Correct error handling through handleDynamoDbOperation.

Consider improving the handling of the listLimit parameter:

val listLimit = request.props.get(DynamoDBKVStoreConstants.listLimit) match {
  case Some(value: Int) if value > 0 => value
  case Some(value: String) =>
    Try(value.toInt).filter(_ > 0).getOrElse(100)
  case _ => 100
}

This change ensures that only positive integer values are used for the limit and provides more robust parsing of string values.


198-225: Functional multiPut implementation, but potential for performance improvement.

The multiPut method correctly implements multiple put operations for DynamoDB. It handles each put request individually and uses proper error handling through the handleDynamoDbOperation method.

Positive aspects:

  1. Correct implementation of individual put operations.
  2. Proper use of Future for concurrent processing.
  3. Logging of the number of rows being processed.

Consider implementing batch writes using DynamoDB's BatchWriteItem operation for improved performance, especially when dealing with a large number of items:

import software.amazon.awssdk.services.dynamodb.model.{BatchWriteItemRequest, WriteRequest, PutRequest => DynamoPutRequest}

def batchedMultiPut(keyValueDatasets: Seq[KVStore.PutRequest]): Future[Seq[Boolean]] = {
  val batchSize = 25 // DynamoDB allows up to 25 items per batch write
  val batches = keyValueDatasets.grouped(batchSize).toSeq

  Future.traverse(batches) { batch =>
    Future {
      val writeRequests = batch.map { req =>
        val item = buildAttributeMap(req.keyBytes, req.valueBytes) ++
          req.tsMillis.map(ts => Map(sortKeyColumn -> AttributeValue.builder.n(ts.toString).build)).getOrElse(Map.empty)
        WriteRequest.builder().putRequest(DynamoPutRequest.builder().item(item.asJava).build()).build()
      }

      val batchWriteRequest = BatchWriteItemRequest.builder()
        .requestItems(Map(batch.head.dataset -> writeRequests.asJava).asJava)
        .build()

      handleDynamoDbOperation(metricsContext.withSuffix("batchedmultiput"), batch.head.dataset) {
        dynamoDbClient.batchWriteItem(batchWriteRequest)
      }
      // TODO: inspect unprocessedItems on the response to handle partial
      // success; for now report one boolean per item in the batch.
      batch.map(_ => true)
    }
  }.map(_.flatten)
}

This approach would likely provide better performance for large numbers of put operations.


227-329: Well-structured helper methods, but bulkPut needs implementation.

The private helper methods in this class are well-implemented and promote code reuse. They handle various aspects of DynamoDB operations, from building requests to extracting values from responses.

Positive aspects:

  1. Centralized error handling and metrics in handleDynamoDbOperation.
  2. Clear and concise methods for building attribute maps and requests.
  3. Proper extraction of values from DynamoDB responses.

The bulkPut method is currently not implemented:

override def bulkPut(sourceOfflineTable: String, destinationOnlineDataSet: String, partition: String): Unit = ???

This method seems crucial for bulk data loading operations. Would you like assistance in implementing this method or creating a GitHub issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between aefe14a and e9f3b38.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • build.sbt (2 hunks)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/Api.scala (3 hunks)
  • online/src/main/scala/ai/chronon/online/Metrics.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • .gitignore
  • cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala
  • online/src/main/scala/ai/chronon/online/Api.scala
  • online/src/main/scala/ai/chronon/online/Metrics.scala
🧰 Additional context used
📓 Learnings (1)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (9)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Leveraging DynamoDB batch writes for performance optimization in `multiPut` will be addressed in a future review.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-07T14:37:55.680Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-07T15:49:27.648Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Errors and logging are handled at a granular level in `handleDynamoDbOperation` by passing the metrics context and updating relevant metric values based on the invoking operation.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-08T16:18:45.669Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
🔇 Additional comments (4)
build.sbt (2)

54-54: LGTM: cloud_aws project added to root aggregate

The addition of cloud_aws to the root project's aggregate list is consistent with the PR objectives. This change ensures that the new DynamoDB key-value store implementation will be included in the build process along with other existing projects.


Line range hint 1-254: Overall assessment: LGTM

The changes to the build.sbt file are well-aligned with the PR objectives of adding a DynamoDB key-value store implementation. The new cloud_aws project is properly integrated into the build, and the dependency management is appropriate. The version updates for Scala and Spark are reasonable. The minor suggestions provided earlier can be addressed to further improve the configuration, but they don't impede the functionality or correctness of the build.

cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2)

1-34: LGTM: Imports and package declaration are appropriate.

The imports cover all necessary classes from the project's online package, AWS SDK for DynamoDB, and Scala standard libraries. There are no unnecessary imports, which is good for maintaining a clean and efficient codebase.


1-329: Overall, a robust and well-implemented DynamoDB-based key-value store.

This implementation of DynamoDBKVStoreImpl provides a comprehensive set of operations for interacting with DynamoDB as a key-value store. The code is well-structured, with good error handling, proper use of the AWS SDK, and appropriate concurrency management using Future.

Key strengths:

  1. Clear separation of concerns with well-defined methods for different operations.
  2. Effective use of helper methods to promote code reuse and readability.
  3. Proper error handling and metrics tracking through the handleDynamoDbOperation method.
  4. Flexible configuration options for table creation and querying.

Areas for potential improvement:

  1. Performance optimization for multiPut using batch writes.
  2. Implementation of the bulkPut method for bulk data loading.
  3. Minor refinements in error handling and parameter parsing.

Overall, this is a solid foundation for a DynamoDB-based key-value store, with room for some performance enhancements and feature completions.
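To make the error-handling point concrete, here is a rough sketch of what a handleDynamoDbOperation-style wrapper can look like. The MetricsContext trait below is a stand-in for the project's actual metrics API, and the logging is simplified to println; this is not the PR's implementation.

```scala
import scala.util.{Failure, Success, Try}

// Stand-in for the project's metrics API (assumption, not the real type).
trait MetricsContext { def increment(metric: String): Unit }

// Sketch only: run a DynamoDB call, record success/failure metrics,
// and rethrow so the caller decides how to surface the error.
def handleDynamoDbOperation[T](ctx: MetricsContext, dataset: String)(op: => T): T =
  Try(op) match {
    case Success(result) =>
      ctx.increment("success") // count successful calls
      result
    case Failure(e) =>
      ctx.increment("failure") // count failures before rethrowing
      println(s"DynamoDB operation failed for dataset $dataset: ${e.getMessage}")
      throw e
  }
```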

Collaborator

@chewy-zlai chewy-zlai left a comment

Overall, LGTM

Collaborator

@chewy-zlai chewy-zlai left a comment

(meant to approve)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (2)

36-61: Consider load testing to tune default capacity units.

The DynamoDBKVStoreConstants object is well-defined with necessary configuration options. However, there's a TODO comment about tuning the default read and write capacity units.

// TODO: tune these
val defaultReadCapacityUnits = 10L
val defaultWriteCapacityUnits = 10L

Consider conducting load tests to determine appropriate values for these defaults based on your expected workload.
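If it helps while tuning, a small sketch of how per-table overrides could be read from props so these defaults only apply when a caller doesn't specify capacity; the "read-capacity" / "write-capacity" key names here are hypothetical, not the constants defined in this PR:

```scala
// Sketch: resolve capacity units from props, falling back to the defaults above.
// Key names "read-capacity" / "write-capacity" are placeholders.
def capacityUnits(props: Map[String, Any]): (Long, Long) = {
  def longProp(key: String, default: Long): Long = props.get(key) match {
    case Some(v: Long)   => v
    case Some(v: Int)    => v.toLong
    case Some(v: String) => v.toLong
    case _               => default
  }
  (longProp("read-capacity", defaultReadCapacityUnits),
   longProp("write-capacity", defaultWriteCapacityUnits))
}
```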


63-116: Create methods look good, consider minor refactoring for clarity.

The implementation of the create methods is solid, with proper error handling and metric tracking. The method for extracting capacity units from props is well-implemented.

For improved readability, consider extracting the table creation logic into a separate private method. This would make the create method more concise and easier to understand at a glance.
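As an illustration only, the extracted helper could look roughly like this. It leans on the partition/sort-key layout described in this PR, but the method name, parameter list, and the partitionKeyColumn / sortKeyColumn constants are assumptions about the implementation:

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.dynamodb.model._

// Sketch of a private helper that assembles the CreateTableRequest so that
// create() itself stays short. Binary partition key, numeric sort key.
private def buildCreateTableRequest(dataset: String,
                                    isTimeSorted: Boolean,
                                    readCapacity: Long,
                                    writeCapacity: Long): CreateTableRequest = {
  val keySchema =
    Seq(KeySchemaElement.builder().attributeName(partitionKeyColumn).keyType(KeyType.HASH).build()) ++
      (if (isTimeSorted)
         Seq(KeySchemaElement.builder().attributeName(sortKeyColumn).keyType(KeyType.RANGE).build())
       else Seq.empty)

  val attributes =
    Seq(AttributeDefinition.builder().attributeName(partitionKeyColumn).attributeType(ScalarAttributeType.B).build()) ++
      (if (isTimeSorted)
         Seq(AttributeDefinition.builder().attributeName(sortKeyColumn).attributeType(ScalarAttributeType.N).build())
       else Seq.empty)

  CreateTableRequest.builder()
    .tableName(dataset)
    .keySchema(keySchema.asJava)
    .attributeDefinitions(attributes.asJava)
    .provisionedThroughput(
      ProvisionedThroughput.builder().readCapacityUnits(readCapacity).writeCapacityUnits(writeCapacity).build())
    .build()
}
```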

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between a99acd8 and 54809b9.

📒 Files selected for processing (1)
  • cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (9)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Leveraging DynamoDB batch writes for performance optimization in `multiPut` will be addressed in a future review.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-07T14:37:55.680Z
Learning: In the codebase, the `KVStore` implementation provides an implicit `ExecutionContext` in scope, so it's unnecessary to import another.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-07T15:49:27.648Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Errors and logging are handled at a granular level in `handleDynamoDbOperation` by passing the metrics context and updating relevant metric values based on the invoking operation.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:67-111
Timestamp: 2024-10-07T15:04:30.069Z
Learning: In the `DynamoDBKVStoreImpl` class, the `props` parameter is kept as `Map[String, Any]` to handle binary properties without serialization, allowing flexibility in storing different types of data.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-08T16:18:45.669Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within `DynamoDBKVStoreTest.scala` is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.
🔇 Additional comments (3)
cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala (3)

1-34: Imports and package declaration look good.

The imports cover all necessary classes and utilities for DynamoDB operations and Scala concurrent programming. There are no unused imports, and the package declaration is appropriate.


1-333: Overall, a solid implementation with room for optimization.

The DynamoDBKVStoreImpl class provides a comprehensive implementation of a DynamoDB-based key-value store. The code is well-structured, follows good practices, and covers all necessary operations.

Main strengths:

  1. Thorough implementation of create, multiGet, list, and multiPut operations.
  2. Good use of Scala's Future for asynchronous processing.
  3. Proper error handling and metric tracking in most areas.

Areas for potential improvement:

  1. Performance optimization in the multiPut method by implementing batch writes.
  2. More consistent error handling across helper methods, particularly in value extraction.
  3. Minor refactoring for improved readability and code reuse.

These suggested improvements are mostly optimizations and refinements. The current implementation is solid and should work well for most use cases.


201-228: 🛠️ Refactor suggestion

Consider implementing batch writes for improved performance.

The current implementation of multiPut uses individual put operations for each item. While this approach works, it may not be the most efficient for large numbers of items. DynamoDB supports batch write operations, which can significantly improve performance.

Consider implementing a batched version of the multiPut method using DynamoDB's BatchWriteItem operation. This would involve:

  1. Grouping the requests into batches of up to 25 items (DynamoDB's limit for batch writes).
  2. Creating BatchWriteItemRequest objects for each batch.
  3. Handling partial success scenarios, where some items in a batch might fail to write.

Here's a skeleton of how this could look:

import software.amazon.awssdk.services.dynamodb.model.{BatchWriteItemRequest, WriteRequest, PutRequest => DynamoPutRequest}

def batchedMultiPut(keyValueDatasets: Seq[KVStore.PutRequest]): Future[Seq[Boolean]] = {
  val batchSize = 25 // DynamoDB's per-request limit for batch writes
  // Group by dataset first so each BatchWriteItemRequest targets a single table;
  // note this means results are not returned in input order.
  val batches = keyValueDatasets.groupBy(_.dataset).values.flatMap(_.grouped(batchSize)).toSeq

  Future.traverse(batches) { batch =>
    Future {
      val writeRequests = batch.map { req =>
        val item = buildAttributeMap(req.keyBytes, req.valueBytes) ++
          req.tsMillis.map(ts => Map(sortKeyColumn -> AttributeValue.builder.n(ts.toString).build)).getOrElse(Map.empty)
        WriteRequest.builder().putRequest(DynamoPutRequest.builder().item(item.asJava).build()).build()
      }

      val batchWriteRequest = BatchWriteItemRequest.builder()
        .requestItems(Map(batch.head.dataset -> writeRequests.asJava).asJava)
        .build()

      handleDynamoDbOperation(metricsContext.withSuffix("batchedmultiput"), batch.head.dataset) {
        dynamoDbClient.batchWriteItem(batchWriteRequest)
      }
      // Placeholder: report success for every item in the batch. Real code should
      // inspect the response's unprocessedItems to detect partial failures.
      batch.map(_ => true)
    }
  }.map(_.flatten)
}

This approach would likely provide better performance for large numbers of put operations.
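One follow-on note on the partial-success handling mentioned above: BatchWriteItem reports throttled or failed writes through unprocessedItems on the response rather than by throwing, so callers typically need a bounded retry loop along these lines (a sketch; dynamoDbClient is the impl's existing client, and the fixed retry cap and missing backoff are simplifications):

```scala
import software.amazon.awssdk.services.dynamodb.model.{BatchWriteItemRequest, BatchWriteItemResponse}

// Sketch: resubmit unprocessed items up to maxAttempts; production code
// should add exponential backoff between attempts.
def writeWithRetries(request: BatchWriteItemRequest, maxAttempts: Int = 3): Boolean = {
  var attempt = 0
  var pending = request.requestItems()
  while (!pending.isEmpty && attempt < maxAttempts) {
    val response: BatchWriteItemResponse =
      dynamoDbClient.batchWriteItem(BatchWriteItemRequest.builder().requestItems(pending).build())
    pending = response.unprocessedItems()
    attempt += 1
  }
  pending.isEmpty // true only if everything was eventually written
}
```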

⛔ Skipped due to learnings
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:167-229
Timestamp: 2024-10-14T14:05:23.413Z
Learning: Leveraging DynamoDB batch writes for performance optimization in `multiPut` will be addressed in a future review.

piyush-zlai added a commit that referenced this pull request Oct 18, 2024
## Summary
PR overview - [Video
Link](https://drive.google.com/file/d/1Rei6upL2OiAls2jX7mCERaJGai3Noo1j/view?usp=drive_link)
This PR builds on #33 and
#43. We register the relevant
model / join / groupby / staging query artifacts at the 'app' docker
container startup by using the MetadataUploader and hitting the Dynamo
endpoints using the KV store. We also extend the API to stop returning
mocked data for the list and search calls and start returning real
registered models + enriched responses (so the model object includes
details on the Joins, GroupBys and features).

There were a few broken pieces along the way that I fixed while working
through the integration (e.g. the metadata walker code was missing
handling for models, and the api.thrift enum for model type needed to
start at index 0).

## Checklist
- [X] Added Unit Tests
- [X] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Bringing up the container and curling:
```
$ curl "http://localhost:9000/api/v1/search?term=1&limit=20"
{"offset":0,"items":[{"name":"risk.transaction_model.v1","join":{"name":"risk.user_transactions.txn_join","joinFeatures":[],"groupBys":[{"name":"risk.transaction_events.txn_group_by_user","features":["transaction_amount_count_1h","transaction_amount_count_1d","transaction_amount_count_30d","transaction_amount_count_365d","transaction_amount_sum_1h"]},{"name":"risk.transaction_events.txn_group_by_merchant","features":["transaction_amount_count_1h","transaction_amount_count_1d","transaction_amount_count_30d","transaction_amount_count_365d","transaction_amount_sum_1h"]},{"name":"risk.user_data.user_group_by","features":["account_age","account_balance","credit_score","number_of_devices","country","account_type","preferred_language"]},{"name":"risk.merchant_data.merchant_group_by","features":["account_age","zipcode","is_big_merchant","country","account_type","preferred_language"]}]},"online":false,"production":false,"team":"risk","modelType":"XGBoost"}]}
```


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced a new `DynamoDBMonitoringStore` for improved model data
retrieval.
- Enhanced `Model` structure to include joins and groupings for better
data organization.
- Expanded `parseTeam` method to handle `Model` instances in metadata
processing.
- Updated `ModelController` and `SearchController` to utilize real data
from `DynamoDBMonitoringStore`.
	- Introduced new methods for managing datasets in the `MetadataStore`.
- Improved handling of list values in the DynamoDB key-value store
implementation.
- Added support for AWS services in the application through new Docker
configurations.
- Enhanced the application's Docker setup for better build and runtime
environments.
- Modified the application's metadata loading process to include a new
section for DynamoDB.
- **Bug Fixes**
- Corrected handling of `Either` types in test cases to prevent runtime
errors.
- **Documentation**
	- Updated configuration files to support new DynamoDB module.
- **Tests**
- Added comprehensive unit tests for `DynamoDBKVStoreImpl` and
`DynamoDBMonitoringStore`.
- Enhanced test coverage for `ModelController` and `SearchController`
with mocked dependencies.
- Introduced new tests for the functionality of
`DynamoDBMonitoringStore`.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
nikhil-zlai pushed a commit that referenced this pull request Oct 24, 2024
nikhil-zlai pushed a commit that referenced this pull request Oct 24, 2024
This was referenced Oct 24, 2024
@coderabbitai coderabbitai bot mentioned this pull request Nov 6, 2024
@coderabbitai coderabbitai bot mentioned this pull request Nov 26, 2024
@coderabbitai coderabbitai bot mentioned this pull request Dec 22, 2024
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
@coderabbitai coderabbitai bot mentioned this pull request May 6, 2025
chewy-zlai pushed a commit that referenced this pull request May 16, 2025
chewy-zlai pushed a commit that referenced this pull request May 16, 2025