
Conversation

Collaborator

@chewy-zlai chewy-zlai commented Oct 15, 2024

Summary

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Updated user permissions for the DynamoDB service.
    • Enhanced data generation with improved schema clarity and batch uploading to DynamoDB.
    • Introduced functionality for reading Parquet files and uploading processed data to DynamoDB.
    • Added a new dependency for AWS SDK.
  • Bug Fixes

    • Improved error handling for AWS credentials during DynamoDB client initialization.
  • Refactor

    • Renamed environment variables for better clarity in the app service.
    • Updated service dependencies for improved application structure.

Contributor

coderabbitai bot commented Oct 15, 2024

Walkthrough

The changes in this pull request modify the docker-init project: compose.yaml, generate_anomalous_data.py, requirements.txt, and the Dockerfile are updated, and a new script, parquet_to_dynamo.py, is introduced. compose.yaml changes the user context for the DynamoDB service, adjusts service dependencies, and renames an environment variable. generate_anomalous_data.py improves DynamoDB initialization, renames schema fields for clarity, reduces the amount of sample data generated, and improves the data upload process. A dependency on botocore is added to requirements.txt.

Changes

docker-init/compose.yaml
  - Changed user from dynamodblocal to root for the dynamo service.
  - Added a depends_on entry on dynamo to the app service.
  - Renamed the environment variable DYNAMO_ENDPOINT to AWS_ENDPOINT_URL_DYNAMODB in the app service.
  (see the compose sketch after this table)

docker-init/generate_anomalous_data.py
  - Updated DynamoDB initialization to use a try-except block.
  - Renamed schema fields for clarity (e.g., account_balance to user_account_balance).
  - Reduced sample data generation from 20,000 to 5,000 rows.
  - Enhanced the data upload process with a table existence check and batch uploads.

docker-init/requirements.txt
  - Added dependency: botocore==1.31.62.

docker-init/Dockerfile
  - Added the new script parquet_to_dynamo.py to the runtime image.
  - Specified Python 3.8 for installing dependencies.

docker-init/parquet_to_dynamo.py
  - Introduced functionality for reading Parquet files, transforming the data, and uploading it to DynamoDB.
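
A rough sketch of what the described compose.yaml changes might look like (the service layout, image tag, and endpoint URL below are assumptions based on this summary and the review comments, not the actual file):

services:
  dynamo:
    image: amazon/dynamodb-local:2.5.2
    user: root                                        # was: dynamodblocal
    volumes:
      - "./docker/dynamodb:/home/dynamodblocal/data"
    working_dir: /home/dynamodblocal
  app:
    depends_on:
      - dynamo                                        # new explicit dependency
    environment:
      AWS_ENDPOINT_URL_DYNAMODB: http://dynamo:8000   # renamed from DYNAMO_ENDPOINT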

Possibly related PRs

Suggested reviewers

  • piyush-zlai

Poem

In the world of code, we hop and play,
With Dynamo's changes, we brighten the day.
From dynamodblocal to root we soar,
New names and samples, oh what a score!
With botocore added, we're ready to go,
Hopping through data, watch our project grow! 🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7baff52 and 96d7405.

📒 Files selected for processing (3)
  • docker-init/compose.yaml (2 hunks)
  • docker-init/generate_anomalous_data.py (3 hunks)
  • docker-init/requirements.txt (1 hunks)
🧰 Additional context used
🔇 Additional comments (3)
docker-init/requirements.txt (1)

Line range hint 1-9: LGTM, pending resolution of the botocore dependency.

The requirements file looks good overall, with appropriately pinned versions for all dependencies. Please address the concern regarding the botocore dependency as mentioned in the previous comment. Once that's resolved, these changes can be approved.

docker-init/compose.yaml (2)

52-52: Explicit dependency on DynamoDB service: Approved

Adding an explicit dependency on the 'dynamo' service is good practice: Compose will not start the 'app' container until the 'dynamo' container has started, which helps avoid ordering problems. Note that a plain depends_on only guarantees start order, not readiness; if the app needs DynamoDB to be accepting connections, pair it with a healthcheck and condition: service_healthy, as sketched below.
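
For instance (a sketch only; the healthcheck command assumes a curl binary is available inside the DynamoDB Local image, which may not be the case):

services:
  dynamo:
    image: amazon/dynamodb-local:2.5.2
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:8000 || exit 1"]   # any HTTP response means the port is up
      interval: 5s
      retries: 10
  app:
    depends_on:
      dynamo:
        condition: service_healthy   # wait for the healthcheck, not just container start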


54-54: Environment variable renamed: Approved with verification needed

Renaming 'DYNAMO_ENDPOINT' to 'AWS_ENDPOINT_URL_DYNAMODB' is a good change as it aligns with AWS SDK naming conventions. This standardization can improve code readability and maintainability.

Please ensure that this change is reflected in all relevant application code and documentation. Run the following script to check for any remaining occurrences of the old variable name:
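
A minimal version of such a check, assuming ripgrep is available and the command is run from the repository root:

#!/bin/bash
# Description: Look for leftover references to the old variable name (illustrative check).
rg --line-number 'DYNAMO_ENDPOINT' || echo "No remaining occurrences of DYNAMO_ENDPOINT found."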

- "./docker/dynamodb:/home/dynamodblocal/data"
working_dir: /home/dynamodblocal
user: dynamodblocal
user: root
Contributor

@coderabbitai coderabbitai bot Oct 15, 2024


⚠️ Potential issue

Security concern: DynamoDB service running as root

Changing the user context to root for the DynamoDB service poses significant security risks. Running containers with root privileges can potentially allow unrestricted access to the host system if the container is compromised.

Consider the following alternatives:

  1. Identify the specific permissions needed and create a custom user with only those permissions.
  2. Use the least privileged user that still allows the service to function correctly.
  3. If file permission issues are the reason for this change, consider modifying the permissions of the mounted volumes instead.

Example:

dynamo:
  user: dynamodb
  volumes:
    - "./docker/dynamodb:/home/dynamodblocal/data:rw"

Collaborator Author


DynamoDB Local has permission issues if it doesn't run as root. I haven't been able to identify a way to modify the permissions of the mounted volume to allow the user dynamodblocal to write to it.

Contributor


@chewy-zlai, I understand that you're encountering permission issues with DynamoDB Local when it doesn't run as root. A common solution is to adjust the permissions of the mounted volume so that the dynamodblocal user inside the container has the necessary write access.

Here are some steps you could try:

  1. Determine the UID and GID of the dynamodblocal user inside the Docker image:

    Run the following commands to find the user and group IDs:

    docker run --rm amazon/dynamodb-local:2.5.2 id -u dynamodblocal
    docker run --rm amazon/dynamodb-local:2.5.2 id -g dynamodblocal

    Suppose the output is:

    1000
    1000
    

    This means the dynamodblocal user has a UID and GID of 1000.

  2. Adjust the ownership of the mounted directory on the host:

    Change the ownership of ./docker/dynamodb to match the UID and GID of the dynamodblocal user:

    sudo chown -R 1000:1000 ./docker/dynamodb
  3. Set appropriate permissions on the directory:

    Ensure that the directory has the necessary read/write permissions:

    chmod -R u+rwx ./docker/dynamodb
  4. Keep the user as dynamodblocal in your Docker Compose file:

    user: dynamodblocal

By aligning the ownership and permissions, the dynamodblocal user inside the container should be able to write to the mounted volume without requiring root privileges. This approach enhances security by avoiding running the container as root.

Let me know if this helps or if you need further assistance.


✏️ Learnings added
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/compose.yaml:12-12
Timestamp: 2024-10-15T19:03:27.898Z
Learning: DynamoDB Local may have permission issues when not running as root due to write permissions on the mounted volume. Adjusting the host directory's ownership and permissions can allow the `dynamodblocal` user inside the container to access it without running as root.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Collaborator Author


That command didn't work:
"$ docker run --rm amazon/dynamodb-local:2.5.2 id -u dynamodblocal
Unable to find image 'amazon/dynamodb-local:2.5.2' locally
2.5.2: Pulling from amazon/dynamodb-local
25f617ca51f7: Pull complete
88169dbe8767: Pull complete
4f4fb700ef54: Pull complete
b510243a5cd9: Pull complete
Digest: sha256:d7ebddeb60fa418bcda218a6c6a402a58441b2a20d54c9cb1d85fd5194341753
Status: Downloaded newer image for amazon/dynamodb-local:2.5.2
Error: Could not find or load main class id
Caused by: java.lang.ClassNotFoundException: id
"

Contributor


@chewy-zlai, thank you for pointing that out. The issue occurs because the amazon/dynamodb-local:2.5.2 image has a default entrypoint that runs the DynamoDB application, so when we try to execute id, it doesn't recognize the command.

To resolve this, we need to override the entrypoint to allow us to run shell commands inside the container. Please try the following commands:

docker run --rm --entrypoint "" amazon/dynamodb-local:2.5.2 id -u dynamodblocal
docker run --rm --entrypoint "" amazon/dynamodb-local:2.5.2 id -g dynamodblocal

These commands override the default entrypoint, enabling you to execute the id command directly. This should output the UID and GID of the dynamodblocal user.

Once you have the UID and GID, you can adjust the ownership and permissions of the mounted directory on your host machine:

sudo chown -R [UID]:[GID] ./docker/dynamodb
chmod -R u+rwx ./docker/dynamodb

Replace [UID] and [GID] with the numbers obtained from the previous commands.

Let me know if this helps or if you have any questions!


✏️ Learnings added
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/compose.yaml:12-12
Timestamp: 2024-10-15T19:24:11.223Z
Learning: When executing shell commands inside the `amazon/dynamodb-local` Docker image, you need to override the entrypoint using `--entrypoint ""` to prevent it from running the default Java application.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/compose.yaml:12-12
Timestamp: 2024-10-15T19:03:28.155Z
Learning: DynamoDB Local may have permission issues when not running as root due to write permissions on the mounted volume. Adjusting the host directory's ownership and permissions can allow the `dynamodblocal` user inside the container to access it without running as root.

Collaborator Author


Is there a way to change the owner in the Dockerfile, so that the change will persist on other host machines?

Contributor


@chewy-zlai, yes, you can modify the Dockerfile to change the ownership of directories, which can help resolve permission issues across different host machines. By adjusting the ownership inside the Docker image, you ensure that the dynamodblocal user within the container has the necessary permissions, without requiring manual changes on the host system.

Here's how you can do it:

  1. Create a custom Dockerfile for your DynamoDB Local service:

    # Use the base image from Amazon
    FROM amazon/dynamodb-local:2.5.2
    
    # Switch to root user to change ownership
    USER root
    
    # Create the data directory and adjust ownership and permissions
    RUN mkdir -p /home/dynamodblocal/data && \
        chown -R dynamodblocal:dynamodblocal /home/dynamodblocal/data
    
    # Switch back to the dynamodblocal user
    USER dynamodblocal
    
    # Set the working directory
    WORKDIR /home/dynamodblocal
    
    # Expose the default DynamoDB Local port
    EXPOSE 8000
    
    # Set the default command to run DynamoDB Local
    CMD ["-jar", "DynamoDBLocal.jar", "-sharedDb", "-dbPath", "./data"]

    This Dockerfile does the following:

    • Uses the official DynamoDB Local image as the base.
    • Switches to the root user to modify ownership.
    • Creates the /home/dynamodblocal/data directory and changes its owner to dynamodblocal.
    • Switches back to the dynamodblocal user for running the application.
  2. Update your docker-compose.yaml to build and use this custom image:

    services:
      dynamo:
        build:
          context: ./path_to_dynamo_dockerfile
          dockerfile: Dockerfile
        container_name: dynamodb-local
        ports:
          - "8000:8000"
        volumes:
          - "./docker/dynamodb:/home/dynamodblocal/data"
        working_dir: /home/dynamodblocal
        user: dynamodblocal

    Make sure to replace ./path_to_dynamo_dockerfile with the actual relative path to your custom Dockerfile.

    This setup ensures that:

    • The dynamodblocal user inside the container has ownership of the /home/dynamodblocal/data directory.
    • The service runs as dynamodblocal, enhancing security by not running as root.
    • Other developers or environments using your Docker configuration won't need to manually adjust permissions on their host machines.

Note on Mounted Volumes:

When you mount a host directory into a container, the permissions and ownership are governed by the host's filesystem. Changing ownership inside the Dockerfile affects files and directories within the image and the container's filesystem, but not the host-mounted volumes.

To mitigate permission issues with mounted volumes:

  • Use Named Volumes: Instead of mounting a host directory, use a Docker named volume, which Docker manages internally.

    services:
      dynamo:
        build:
          context: ./path_to_dynamo_dockerfile
          dockerfile: Dockerfile
        container_name: dynamodb-local
        ports:
          - "8000:8000"
        volumes:
          - dynamodb_data:/home/dynamodblocal/data
        working_dir: /home/dynamodblocal
        user: dynamodblocal
    
    volumes:
      dynamodb_data:

    This approach avoids host filesystem permission issues since Docker handles the volume's filesystem.

  • Adjust Permissions on Host Directory: If you need to use a host directory, you might still need to adjust its permissions to ensure the container user can access it. However, by setting the ownership in the Dockerfile and ensuring the container runs as dynamodblocal, the need for manual host adjustments is minimized.

Let me know if this solution works for you or if you have any further questions!


🧠 Learnings used
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/compose.yaml:12-12
Timestamp: 2024-10-15T19:03:28.155Z
Learning: DynamoDB Local may have permission issues when not running as root due to write permissions on the mounted volume. Adjusting the host directory's ownership and permissions can allow the `dynamodblocal` user inside the container to access it without running as root.

Contributor

@piyush-zlai piyush-zlai left a comment


Chewy for this one, I think we need to ingest a df with the format:

private val baseRowSchema = StructType(
(including the time field)

We could skip the key_json and value_json fields as they're typically only used when we're exploring the data with a query engine like Trino and want to see the values without having to crack open the Avro.

One of the steps in either @nikhil-zlai 's Drift job PR or a subsequent one will be to extend the Drift job to write out this key_bytes, value_bytes shape of the data. See this example for an existing job that does it - https://github.com/zipline-ai/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupByUpload.scala#L233
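
For illustration, a rough sketch of that kind of row schema (the field names and types below are assumptions drawn from this comment, not the actual baseRowSchema in the Chronon codebase):

import org.apache.spark.sql.types._

// Illustrative only; see the GroupByUpload.scala link above for a real example.
val exampleRowSchema = StructType(Seq(
  StructField("key_bytes", BinaryType, nullable = false),   // Avro-encoded key
  StructField("value_bytes", BinaryType, nullable = false), // Avro-encoded value
  StructField("key_json", StringType, nullable = true),     // optional, for ad-hoc inspection (e.g. via Trino)
  StructField("value_json", StringType, nullable = true),   // optional, same purpose
  StructField("ts", LongType, nullable = false)             // event time
))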

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (2)
docker-init/Dockerfile (1)

Line range hint 6-11: Consider updating Python and numpy versions.

While specifying Python 3.8 for numpy 1.22.0 due to security concerns is a good practice, there are a few points to consider:

  1. Python 3.8 will reach end-of-life in October 2024. It might be beneficial to use a more recent version of Python.
  2. The security concern with numpy might be outdated. Newer versions of numpy might have addressed these issues and provide better security.
  3. Pinning to a specific numpy version might conflict with other dependencies.

Consider the following actions:

  1. Upgrade to a more recent Python version (e.g., 3.9 or 3.10) if possible.
  2. Review the current numpy security status and upgrade if a newer secure version is available.
  3. Use a dependency management tool like pip-compile or poetry to manage package versions and their interdependencies more effectively.

To help verify the current numpy version and check for updates, please run:

#!/bin/bash
# Description: Check numpy version and available updates

# Test 1: Display current numpy version in requirements.txt
echo "Current numpy version:"
grep numpy docker-init/requirements.txt

# Test 2: Check for the latest numpy version
echo -e "\nLatest numpy version:"
pip3.8 index versions numpy | grep numpy

docker-init/parquet_to_dynamo.py (1)

72-76: Consider adjusting the batch size for optimal performance

The current batch size is set to 1000. Depending on the size of your data and network conditions, you might experience improved performance with a different batch size.

Experiment with different batch sizes to find the optimal setting. Additionally, you may consider implementing error handling within the loop to retry failed batches.
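
A hedged sketch of what per-batch retries could look like (the batch size, retry count, and backoff are arbitrary choices; wr.dynamodb.put_df is the awswrangler call the script already uses):

import time

import awswrangler as wr

def upload_in_batches(panda_df, table_name, batch_size=500, max_retries=3):
    # Upload a Pandas DataFrame to DynamoDB in batches, retrying each failed batch with backoff.
    for start in range(0, len(panda_df), batch_size):
        batch = panda_df.iloc[start:start + batch_size]
        for attempt in range(1, max_retries + 1):
            try:
                wr.dynamodb.put_df(df=batch, table_name=table_name)
                break  # batch succeeded, move on to the next one
            except Exception as e:
                print(f"Batch starting at row {start} failed (attempt {attempt}): {e}", flush=True)
                if attempt == max_retries:
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff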

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 96d7405 and 530b263.

📒 Files selected for processing (2)
  • docker-init/Dockerfile (1 hunks)
  • docker-init/parquet_to_dynamo.py (1 hunks)
🧰 Additional context used
🪛 Ruff
docker-init/parquet_to_dynamo.py

4-4: decimal.Decimal imported but unused

Remove unused import: decimal.Decimal

(F401)


5-5: json imported but unused

Remove unused import: json

(F401)


7-7: pandas imported but unused

Remove unused import: pandas

(F401)

🔇 Additional comments (1)
docker-init/Dockerfile (1)

Line range hint 1-35: Overall, the Dockerfile changes look good, with some areas for improvement.

The modifications to the Dockerfile align well with the PR objectives and enhance the build and runtime environment. The multi-stage build process is maintained, and the addition of the new parquet_to_dynamo.py script supports the goal of converting key-value pairs from Parquet to DynamoDB.

Key points to address:

  1. Clarify the integration and execution of the new parquet_to_dynamo.py script.
  2. Consider updating Python and numpy versions for better long-term maintainability and security.
  3. Ensure all dependencies are properly managed and compatible with the chosen Python version.

To ensure all necessary files are present and permissions are set correctly, please run:

#!/bin/bash
# Description: Verify presence and permissions of key files

# Test 1: Check if all mentioned files exist
echo "Checking for required files:"
for file in docker-init/requirements.txt docker-init/generate_anomalous_data.py docker-init/parquet_to_dynamo.py docker-init/start.sh
do
    if [ -f "$file" ]; then
        echo "$file exists"
    else
        echo "$file is missing"
    fi
done

# Test 2: Verify start.sh permissions
echo -e "\nChecking start.sh permissions:"
ls -l docker-init/start.sh

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
docker-init/parquet_to_dynamo.py (2)

28-69: Adjust the success message for table creation

The current implementation prints "Created DynamoDB Table" even when the table already exists. This can be misleading. Consider moving the success message inside the try block and adding a more general message outside:

try:
    dynamodb.create_table(
        # ... (table creation parameters)
    )
    print("Created DynamoDB Table")
except botocore.exceptions.ClientError as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f"Table {table_name} already exists. Skipping creation.")
    else:
        raise error

print(f"DynamoDB Table {table_name} is ready for use.")

This change ensures that the "Created" message is only printed when a new table is actually created.


82-85: Enhance completion message and remove extra lines

  1. Consider providing more information in the completion message, such as the number of records processed:

    print(f"Completed writing {simplified_df.count()} records from Parquet to DynamoDB table {table_name}")
  2. Remove the extra empty lines at the end of the file. Generally, it's good practice to have a single empty line at the end of a Python file.

These changes will make the script output more informative and adhere to common Python style guidelines.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 530b263 and ef6cc99.

📒 Files selected for processing (1)
  • docker-init/parquet_to_dynamo.py (1 hunks)
🧰 Additional context used
🪛 Ruff
docker-init/parquet_to_dynamo.py

4-4: decimal.Decimal imported but unused

Remove unused import: decimal.Decimal

(F401)


5-5: json imported but unused

Remove unused import: json

(F401)


7-7: pandas imported but unused

Remove unused import: pandas

(F401)


10-10: pyspark.sql.functions.concat imported but unused

Remove unused import: pyspark.sql.functions.concat

(F401)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
docker-init/parquet_to_dynamo.py (5)

1-11: Remove unused import to improve code clarity

The concat function is imported from pyspark.sql.functions but is not used in the code. Remove this unused import to enhance code readability.

Apply this diff to remove the unused import:

-from pyspark.sql.functions import concat, encode, struct, to_json
+from pyspark.sql.functions import encode, struct, to_json
🧰 Tools
🪛 Ruff

7-7: pyspark.sql.functions.concat imported but unused

Remove unused import: pyspark.sql.functions.concat

(F401)


13-18: LGTM! Consider adding error handling for Parquet file reading

The Parquet file processing and DataFrame transformations look good. The use of environment variables for the Parquet folder path is a good practice.

Consider adding error handling for the Parquet file reading operation to gracefully handle cases where the files might not exist or are inaccessible:

import sys  # required for sys.exit below

try:
    parquet_files = spark.read.parquet(os.path.join(os.environ['PARQUET_FOLDER'], "*.parquet"))
except Exception as e:
    print(f"Error reading Parquet files: {str(e)}")
    sys.exit(1)

20-60: Move success message inside try block for accuracy

The DynamoDB table creation logic and error handling look good. However, the success message for table creation is outside the try block, which could be misleading if an error occurs.

Move the success message inside the try block to ensure it's only printed when the table is actually created:

 try:
     dynamodb.create_table(
         # ... (table creation parameters)
     )
+    print("Created DynamoDB Table")
 except botocore.exceptions.ClientError as error:
     if error.response['Error']['Code'] == 'ResourceInUseException':
         # Table already exists
         print("Table " + table_name + " already exists. Skipping")
     else:
         raise error
-print("Created DynamoDB Table")

64-75: Consider optimizing for large datasets, but overall approach is good

The batch upload process to DynamoDB and the error handling look good. The progress printing is a nice touch for monitoring the upload process.

For large datasets, converting the entire Spark DataFrame to a Pandas DataFrame might be inefficient. Consider using Spark's foreachPartition method to process and upload data in parallel:

def upload_partition(partition):
    for row in partition:
        try:
            # awswrangler exposes put_items (plural), which takes a list of dicts
            wr.dynamodb.put_items(items=[row.asDict()], table_name=table_name)
        except Exception as e:
            print(f"Error uploading item: {str(e)}", flush=True)

simplified_df.foreachPartition(upload_partition)

This approach avoids loading the entire dataset into memory and can be more efficient for large datasets.


77-80: Enhance completion message with operation summary

The completion message is clear, but it could be more informative.

Consider adding more detailed information about the operation, such as the number of records processed and the total time taken:

import time

start_time = time.time()
# ... (rest of the code)

total_records = simplified_df.count()
end_time = time.time()
duration = end_time - start_time

print(f"Wrote {total_records} records from Parquet to Dynamo in {duration:.2f} seconds")

This provides more context about the operation's scale and performance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between ef6cc99 and c6dd49f.

📒 Files selected for processing (1)
  • docker-init/parquet_to_dynamo.py (1 hunks)
🧰 Additional context used
🪛 Ruff
docker-init/parquet_to_dynamo.py

7-7: pyspark.sql.functions.concat imported but unused

Remove unused import: pyspark.sql.functions.concat

(F401)

…ifies down to just renaming the ts column timestamp.
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
docker-init/parquet_to_dynamo.py (2)

1-11: Remove unused imports to improve code clarity

The following imports are not used in the code and should be removed:

  • botocore (line 3)
  • concat, encode, struct, to_json from pyspark.sql.functions (line 7)

Removing these unused imports will enhance code readability and maintainability.

Apply this diff to remove the unused imports:

 import awswrangler as wr
 import boto3
-import botocore
 import os

 from pyspark.sql import SparkSession
-from pyspark.sql.functions import concat, encode, struct, to_json
🧰 Tools
🪛 Ruff

3-3: botocore imported but unused

Remove unused import: botocore

(F401)


7-7: pyspark.sql.functions.concat imported but unused

Remove unused import

(F401)


7-7: pyspark.sql.functions.encode imported but unused

Remove unused import

(F401)


7-7: pyspark.sql.functions.struct imported but unused

Remove unused import

(F401)


7-7: pyspark.sql.functions.to_json imported but unused

Remove unused import

(F401)


35-35: Enhance the completion message with more details

The current completion message is concise but could be more informative. Consider adding details such as the number of records processed or the time taken for the operation.

You could modify the print statement as follows:

print(f"Wrote {len(panda_df)} records from Parquet to DynamoDB table {table_name}")

This provides more context about the operation that was completed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c6dd49f and 3047b2a.

📒 Files selected for processing (1)
  • docker-init/parquet_to_dynamo.py (1 hunks)
🧰 Additional context used
🪛 Ruff
docker-init/parquet_to_dynamo.py

3-3: botocore imported but unused

Remove unused import: botocore

(F401)


7-7: pyspark.sql.functions.concat imported but unused

Remove unused import

(F401)


7-7: pyspark.sql.functions.encode imported but unused

Remove unused import

(F401)


7-7: pyspark.sql.functions.struct imported but unused

Remove unused import

(F401)


7-7: pyspark.sql.functions.to_json imported but unused

Remove unused import

(F401)

🔇 Additional comments (1)
docker-init/parquet_to_dynamo.py (1)

13-15: LGTM: Parquet file reading and initial processing

The code for reading Parquet files and renaming the "ts" column to "timestamp" is well-implemented. This ensures consistency in column naming, which is a good practice.
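
For reference, a sketch of the kind of code being described (the session setup and folder handling here are assumptions, not the exact contents of parquet_to_dynamo.py):

import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_to_dynamo").getOrCreate()

# Read every Parquet file from the configured folder and normalize the time column name.
parquet_files = (
    spark.read.parquet(os.path.join(os.environ["PARQUET_FOLDER"], "*.parquet"))
    .withColumnRenamed("ts", "timestamp")
)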

Comment on lines +17 to +18
dynamodb = boto3.client('dynamodb')
table_name = "test-join_drift_batch"
Contributor


🛠️ Refactor suggestion

Consider making the table name configurable

The table name "test-join_drift_batch" is currently hardcoded. To improve maintainability and flexibility, consider making it configurable through an environment variable or a configuration file.

You could modify the code as follows:

table_name = os.environ.get('DYNAMODB_TABLE_NAME', 'test-join_drift_batch')

This allows you to easily change the table name without modifying the code, while still providing a default value.

Comment on lines +22 to +32
panda_df = parquet_files.toPandas()

# Upload data in batches
batch_size = 1000  # Adjust based on your needs
for i in range(0, len(panda_df), batch_size):
    batch = panda_df.iloc[i:i+batch_size]
    try:
        wr.dynamodb.put_df(df=batch, table_name=table_name)
        print(f"Uploaded batch {i//batch_size + 1}/{len(panda_df)//batch_size + 1}", flush=True)
    except Exception as e:
        print(f"Error uploading batch {i + 1}: {str(e)}", flush=True)
Contributor


🛠️ Refactor suggestion

Consider using Spark DataFrame for the entire process

Currently, the code converts the Spark DataFrame to a Pandas DataFrame before uploading to DynamoDB. This conversion might be unnecessary and could be inefficient for large datasets. Consider using the Spark DataFrame throughout the process for better performance and scalability.

You could modify the code to use Spark's foreachPartition method:

import pandas as pd  # needed for the per-row DataFrame wrapper below

def upload_partition(partition):
    for row in partition:
        try:
            wr.dynamodb.put_df(df=pd.DataFrame([row.asDict()]), table_name=table_name)
        except Exception as e:
            print(f"Error uploading row: {str(e)}", flush=True)

parquet_files.foreachPartition(upload_partition)

This approach processes the data in Spark partitions, which can be more efficient for large datasets.
