24 commits
415380a
WIP
ezvz Sep 8, 2024
5f596b1
WIP
ezvz Sep 12, 2024
f61ce7b
WIP
ezvz Sep 12, 2024
6f028dc
Convert map to TDataType
chewys1024 Oct 2, 2024
a48ce40
Changed output schema to just a TDataType
chewys1024 Oct 3, 2024
6ab86c7
Fix lint issues.
chewys1024 Oct 3, 2024
3bbf16f
Make modelParams optional as per coderabbit suggestion
chewys1024 Oct 3, 2024
8d93555
Maintain default value of empty dict for modelParams in the resulting…
chewys1024 Oct 3, 2024
eccdfd5
Merge branch 'main' of https://github.com/zipline-ai/chronon into par…
chewys1024 Oct 7, 2024
bb1e5b8
Merge branch 'main' of https://github.com/zipline-ai/chronon into par…
chewys1024 Oct 8, 2024
e4f582b
First Draft of script to upload parquet to Dynamo. Splits parquet int…
chewys1024 Oct 9, 2024
d919bff
Moves code to upload to Dynamo into generate_anomalous_data directly.
chewys1024 Oct 10, 2024
67d405b
Update dynamo upload to only occur if the table doesn't exist. This w…
chewys1024 Oct 11, 2024
a0f40df
Upload to dynamodb in batches to be kinder to memory usage.
chewys1024 Oct 11, 2024
16e0526
Removes unnecessary spaces.
chewys1024 Oct 11, 2024
b0cc123
Reduce to 5000 samples to keep total build time under 5 minutes.
chewys1024 Oct 11, 2024
96d7405
Merge branch 'main' of https://github.com/zipline-ai/chronon into par…
chewys1024 Oct 15, 2024
530b263
Initial version of KV parquet to DynamoDB
chewys1024 Oct 15, 2024
164746f
Attempt to convert to the right format.
chewys1024 Oct 15, 2024
ef6cc99
Revert generate_anomalous_data as this was accidentally picked up.
chewys1024 Oct 15, 2024
9e1c2c6
Edits suggested by coderabbit
chewys1024 Oct 15, 2024
c6dd49f
Clean back up requirements.txt
chewys1024 Oct 15, 2024
3047b2a
Removes table creation as that will be handled separately. Also simpl…
chewys1024 Oct 17, 2024
f3b6ac7
Drop json fields and ds from parquet
chewys1024 Oct 18, 2024
1 change: 1 addition & 0 deletions docker-init/Dockerfile
@@ -20,6 +20,7 @@ ENV PYSPARK_PYTHON=python3.8

 RUN mkdir -p /app
 COPY docker-init/generate_anomalous_data.py /app/
+COPY docker-init/parquet_to_dynamo.py /app/
 COPY docker-init/start.sh /start.sh
 RUN chmod +x /start.sh
 WORKDIR /app
5 changes: 3 additions & 2 deletions docker-init/compose.yaml
@@ -9,7 +9,7 @@ services:
     volumes:
       - "./docker/dynamodb:/home/dynamodblocal/data"
     working_dir: /home/dynamodblocal
-    user: dynamodblocal
+    user: root
@coderabbitai (bot, Contributor) · Oct 15, 2024:
⚠️ Potential issue

Security concern: DynamoDB service running as root

Changing the user context to root for the DynamoDB service poses significant security risks. Running containers with root privileges can potentially allow unrestricted access to the host system if the container is compromised.

Consider the following alternatives:

  1. Identify the specific permissions needed and create a custom user with only those permissions.
  2. Use the least privileged user that still allows the service to function correctly.
  3. If file permission issues are the reason for this change, consider modifying the permissions of the mounted volumes instead.

Example:

dynamo:
  user: dynamodb
  volumes:
    - "./docker/dynamodb:/home/dynamodblocal/data:rw"

chewy-zlai (Collaborator, Author):

DynamoDB Local has permission issues if it doesn't run as root. I haven't been able to identify a way to modify the permissions of the mounted volume to allow the user dynamodblocal to write to it.

@coderabbitai (bot, Contributor):

@chewy-zlai, I understand that you're encountering permission issues with DynamoDB Local when it doesn't run as root. A common solution is to adjust the permissions of the mounted volume so that the dynamodblocal user inside the container has the necessary write access.

Here are some steps you could try:

  1. Determine the UID and GID of the dynamodblocal user inside the Docker image:

    Run the following commands to find the user and group IDs:

    docker run --rm amazon/dynamodb-local:2.5.2 id -u dynamodblocal
    docker run --rm amazon/dynamodb-local:2.5.2 id -g dynamodblocal

    Suppose the output is:

    1000
    1000
    

    This means the dynamodblocal user has a UID and GID of 1000.

  2. Adjust the ownership of the mounted directory on the host:

    Change the ownership of ./docker/dynamodb to match the UID and GID of the dynamodblocal user:

    sudo chown -R 1000:1000 ./docker/dynamodb
  3. Set appropriate permissions on the directory:

    Ensure that the directory has the necessary read/write permissions:

    chmod -R u+rwx ./docker/dynamodb
  4. Keep the user as dynamodblocal in your Docker Compose file:

    user: dynamodblocal

By aligning the ownership and permissions, the dynamodblocal user inside the container should be able to write to the mounted volume without requiring root privileges. This approach enhances security by avoiding running the container as root.

Let me know if this helps or if you need further assistance.


✏️ Learnings added
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/compose.yaml:12-12
Timestamp: 2024-10-15T19:03:27.898Z
Learning: DynamoDB Local may have permission issues when not running as root due to write permissions on the mounted volume. Adjusting the host directory's ownership and permissions can allow the `dynamodblocal` user inside the container to access it without running as root.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

chewy-zlai (Collaborator, Author):

That command didn't work:
"$ docker run --rm amazon/dynamodb-local:2.5.2 id -u dynamodblocal
Unable to find image 'amazon/dynamodb-local:2.5.2' locally
2.5.2: Pulling from amazon/dynamodb-local
25f617ca51f7: Pull complete
88169dbe8767: Pull complete
4f4fb700ef54: Pull complete
b510243a5cd9: Pull complete
Digest: sha256:d7ebddeb60fa418bcda218a6c6a402a58441b2a20d54c9cb1d85fd5194341753
Status: Downloaded newer image for amazon/dynamodb-local:2.5.2
Error: Could not find or load main class id
Caused by: java.lang.ClassNotFoundException: id
"

@coderabbitai (bot, Contributor):

@chewy-zlai, thank you for pointing that out. The issue occurs because the amazon/dynamodb-local:2.5.2 image has a default entrypoint that runs the DynamoDB application, so when we try to execute id, it doesn't recognize the command.

To resolve this, we need to override the entrypoint to allow us to run shell commands inside the container. Please try the following commands:

docker run --rm --entrypoint "" amazon/dynamodb-local:2.5.2 id -u dynamodblocal
docker run --rm --entrypoint "" amazon/dynamodb-local:2.5.2 id -g dynamodblocal

These commands override the default entrypoint, enabling you to execute the id command directly. This should output the UID and GID of the dynamodblocal user.

Once you have the UID and GID, you can adjust the ownership and permissions of the mounted directory on your host machine:

sudo chown -R [UID]:[GID] ./docker/dynamodb
chmod -R u+rwx ./docker/dynamodb

Replace [UID] and [GID] with the numbers obtained from the previous commands.

Let me know if this helps or if you have any questions!


✏️ Learnings added
Learnt from: chewy-zlai
PR: zipline-ai/chronon#46
File: docker-init/compose.yaml:12-12
Timestamp: 2024-10-15T19:24:11.223Z
Learning: When executing shell commands inside the `amazon/dynamodb-local` Docker image, you need to override the entrypoint using `--entrypoint ""` to prevent it from running the default Java application.




chewy-zlai (Collaborator, Author):

Is there a way to change the owner in the Dockerfile, so that the change will persist on other host machines?

@coderabbitai (bot, Contributor):

@chewy-zlai, yes, you can modify the Dockerfile to change the ownership of directories, which can help resolve permission issues across different host machines. By adjusting the ownership inside the Docker image, you ensure that the dynamodblocal user within the container has the necessary permissions, without requiring manual changes on the host system.

Here's how you can do it:

  1. Create a custom Dockerfile for your DynamoDB Local service:

    # Use the base image from Amazon
    FROM amazon/dynamodb-local:2.5.2
    
    # Switch to root user to change ownership
    USER root
    
    # Create the data directory and adjust ownership and permissions
    RUN mkdir -p /home/dynamodblocal/data && \
        chown -R dynamodblocal:dynamodblocal /home/dynamodblocal/data
    
    # Switch back to the dynamodblocal user
    USER dynamodblocal
    
    # Set the working directory
    WORKDIR /home/dynamodblocal
    
    # Expose the default DynamoDB Local port
    EXPOSE 8000
    
    # Set the default command to run DynamoDB Local
    CMD ["-jar", "DynamoDBLocal.jar", "-sharedDb", "-dbPath", "./data"]

    This Dockerfile does the following:

    • Uses the official DynamoDB Local image as the base.
    • Switches to the root user to modify ownership.
    • Creates the /home/dynamodblocal/data directory and changes its owner to dynamodblocal.
    • Switches back to the dynamodblocal user for running the application.
  2. Update your docker-compose.yaml to build and use this custom image:

    services:
      dynamo:
        build:
          context: ./path_to_dynamo_dockerfile
          dockerfile: Dockerfile
        container_name: dynamodb-local
        ports:
          - "8000:8000"
        volumes:
          - "./docker/dynamodb:/home/dynamodblocal/data"
        working_dir: /home/dynamodblocal
        user: dynamodblocal

    Make sure to replace ./path_to_dynamo_dockerfile with the actual relative path to your custom Dockerfile.

    This setup ensures that:

    • The dynamodblocal user inside the container has ownership of the /home/dynamodblocal/data directory.
    • The service runs as dynamodblocal, enhancing security by not running as root.
    • Other developers or environments using your Docker configuration won't need to manually adjust permissions on their host machines.

Note on Mounted Volumes:

When you mount a host directory into a container, the permissions and ownership are governed by the host's filesystem. Changing ownership inside the Dockerfile affects files and directories within the image and the container's filesystem, but not the host-mounted volumes.

To mitigate permission issues with mounted volumes:

  • Use Named Volumes: Instead of mounting a host directory, use a Docker named volume, which Docker manages internally.

    services:
      dynamo:
        build:
          context: ./path_to_dynamo_dockerfile
          dockerfile: Dockerfile
        container_name: dynamodb-local
        ports:
          - "8000:8000"
        volumes:
          - dynamodb_data:/home/dynamodblocal/data
        working_dir: /home/dynamodblocal
        user: dynamodblocal
    
    volumes:
      dynamodb_data:

    This approach avoids host filesystem permission issues since Docker handles the volume's filesystem.

  • Adjust Permissions on Host Directory: If you need to use a host directory, you might still need to adjust its permissions to ensure the container user can access it. However, by setting the ownership in the Dockerfile and ensuring the container runs as dynamodblocal, the need for manual host adjustments is minimized.

Let me know if this solution works for you or if you have any further questions!




   spark:
     image: bitnami/spark:3.5.2
@@ -49,8 +49,9 @@ services:
       dockerfile: docker-init/Dockerfile
     depends_on:
       - base
+      - dynamo
     environment:
-      - DYNAMO_ENDPOINT=http://dynamo:8000
+      - AWS_ENDPOINT_URL_DYNAMODB=http://dynamo:8000
       - AWS_DEFAULT_REGION=fakeregion
       - AWS_ACCESS_KEY_ID=fakeaccesskey
       - AWS_SECRET_ACCESS_KEY=fakesecretkey
36 changes: 36 additions & 0 deletions docker-init/parquet_to_dynamo.py
@@ -0,0 +1,36 @@
import awswrangler as wr
import boto3
import botocore
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, encode, struct, to_json


# Initialize Spark session
spark = SparkSession.builder.appName("FraudClassificationConversion").getOrCreate()

parquet_files = spark.read.parquet(os.path.join(os.environ['PARQUET_FOLDER'], "*.parquet")).drop("key_json", "value_json", "ds")

dynamodb = boto3.client('dynamodb')
table_name = "test-join_drift_batch"
Comment on lines +15 to +16
@coderabbitai (bot, Contributor):

🛠️ Refactor suggestion

Consider making the table name configurable

The table name "test-join_drift_batch" is currently hardcoded. To improve maintainability and flexibility, consider making it configurable through an environment variable or a configuration file.

You could modify the code as follows:

table_name = os.environ.get('DYNAMODB_TABLE_NAME', 'test-join_drift_batch')

This allows you to easily change the table name without modifying the code, while still providing a default value.




panda_df = parquet_files.toPandas()

# Upload data in batches
batch_size = 1000  # Adjust based on your needs
total_batches = (len(panda_df) + batch_size - 1) // batch_size
for i in range(0, len(panda_df), batch_size):
    batch = panda_df.iloc[i:i + batch_size]
    try:
        wr.dynamodb.put_df(df=batch, table_name=table_name)
        print(f"Uploaded batch {i // batch_size + 1}/{total_batches}", flush=True)
    except Exception as e:
        print(f"Error uploading batch {i // batch_size + 1}: {str(e)}", flush=True)
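The slicing and progress arithmetic in this loop can be checked without Spark or AWS. A minimal sketch, using a plain list in place of the DataFrame and making no upload call (`iter_batches` is a hypothetical helper, not part of the PR):

```python
# AWS-free sketch of the batch-slicing loop above; a plain list stands in
# for the Pandas DataFrame.
def iter_batches(rows, batch_size):
    """Yield (batch_number, total_batches, chunk), covering every row once."""
    total = (len(rows) + batch_size - 1) // batch_size  # ceiling division
    for i in range(0, len(rows), batch_size):
        yield i // batch_size + 1, total, rows[i:i + batch_size]

batches = list(iter_batches(list(range(2500)), 1000))
# 2500 rows with batch_size=1000 -> batches of 1000, 1000, and 500
```

Ceiling division gives an exact batch total whether or not the row count is a multiple of the batch size.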
Comment on lines +20 to +30
@coderabbitai (bot, Contributor):

🛠️ Refactor suggestion

Consider using Spark DataFrame for the entire process

Currently, the code converts the Spark DataFrame to a Pandas DataFrame before uploading to DynamoDB. This conversion might be unnecessary and could be inefficient for large datasets. Consider using the Spark DataFrame throughout the process for better performance and scalability.

You could modify the code to use Spark's foreachPartition method:

import pandas as pd

def upload_partition(partition):
    for row in partition:
        try:
            wr.dynamodb.put_df(df=pd.DataFrame([row.asDict()]), table_name=table_name)
        except Exception as e:
            print(f"Error uploading row: {str(e)}", flush=True)

parquet_files.foreachPartition(upload_partition)

This approach processes the data in Spark partitions, which can be more efficient for large datasets.
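If partitions were written with boto3 directly instead of a per-row `put_df` call, DynamoDB's `BatchWriteItem` API caps each request at 25 items, so rows would first need grouping into chunks of at most 25. A sketch of that grouping step under those assumptions (`group_for_batch_write` is a hypothetical helper; boto3's `Table.batch_writer()` performs this buffering, including retry of unprocessed items, automatically):

```python
# Group an iterable of rows into chunks no larger than DynamoDB's
# 25-item BatchWriteItem limit. No AWS call is made here.
def group_for_batch_write(rows, limit=25):
    """Split an iterable into lists of at most `limit` items."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == limit:
            yield chunk
            chunk = []
    if chunk:  # emit any trailing partial chunk
        yield chunk

chunks = list(group_for_batch_write(range(60)))
# 60 items -> chunks of 25, 25, and 10
```

Streaming the grouping (a generator rather than materializing all chunks) keeps memory bounded per partition, which matches the PR's stated goal of being kinder to memory usage.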



print("Wrote parquet to Dynamo")