WIP Key value parquet to dynamo #46
Changes from all commits: 415380a, 5f596b1, f61ce7b, 6f028dc, a48ce40, 6ab86c7, 3bbf16f, 8d93555, eccdfd5, bb1e5b8, e4f582b, d919bff, 67d405b, a0f40df, 16e0526, b0cc123, 96d7405, 530b263, 164746f, ef6cc99, 9e1c2c6, c6dd49f, 3047b2a, f3b6ac7
```diff
@@ -9,7 +9,7 @@ services:
     volumes:
       - "./docker/dynamodb:/home/dynamodblocal/data"
     working_dir: /home/dynamodblocal
-    user: dynamodblocal
+    user: root
```
**Contributor**

Security concern: DynamoDB service running as root. Changing the user context to root for the DynamoDB service poses significant security risks: running containers with root privileges can allow unrestricted access to the host system if the container is compromised. Consider alternatives such as running the service as a dedicated non-root user, for example:

```yaml
dynamo:
  user: dynamodb
  volumes:
    - "./docker/dynamodb:/home/dynamodblocal/data:rw"
```
**Author** (Collaborator)

DynamoDB Local has permission issues if it doesn't run as root. I haven't been able to identify a way to modify the permissions of the mounted volume to allow the `dynamodblocal` user to write to it.
**Contributor**

@chewy-zlai, I understand that you're encountering permission issues with DynamoDB Local when it doesn't run as root. A common solution is to adjust the permissions of the mounted volume so that the `dynamodblocal` user can write to it. By aligning the ownership and permissions of the host directory with the container user, the `dynamodblocal` user should be able to write to the data directory without the container running as root. Let me know if this helps or if you need further assistance.
**Author** (Collaborator)

That command didn't work.
**Contributor**

@chewy-zlai, thank you for pointing that out. The issue occurs because the image's default entrypoint launches DynamoDB Local directly, so the `id` command is never executed. To resolve this, we need to override the entrypoint so we can run shell commands inside the container. Please try the following commands:

```shell
docker run --rm --entrypoint "" amazon/dynamodb-local:2.5.2 id -u dynamodblocal
docker run --rm --entrypoint "" amazon/dynamodb-local:2.5.2 id -g dynamodblocal
```

These commands override the default entrypoint, enabling you to execute `id` and retrieve the UID and GID of the `dynamodblocal` user. Once you have the UID and GID, you can adjust the ownership and permissions of the mounted directory on your host machine:

```shell
sudo chown -R [UID]:[GID] ./docker/dynamodb
chmod -R u+rwx ./docker/dynamodb
```

Replace `[UID]` and `[GID]` with the values returned by the `id` commands. Let me know if this helps or if you have any questions!
**Author** (Collaborator)

Is there a way to change the owner in the Dockerfile, so that the change will persist on other host machines?
**Contributor**

@chewy-zlai, yes, you can modify the Dockerfile to change the ownership of directories, which can help resolve permission issues across different host machines. By adjusting the ownership inside the Docker image, you ensure that the `dynamodblocal` user owns the data directory in every container built from that image, regardless of the host.

Note on mounted volumes: when you mount a host directory into a container, the permissions and ownership are governed by the host's filesystem. Changing ownership inside the Dockerfile affects files and directories within the image and the container's filesystem, but not host-mounted volumes. To mitigate permission issues with mounted volumes, you still need to align the host directory's ownership with the container user's UID and GID, as described above.

Let me know if this solution works for you or if you have any further questions!
```diff
   spark:
     image: bitnami/spark:3.5.2
```

```diff
@@ -49,8 +49,9 @@ services:
       dockerfile: docker-init/Dockerfile
     depends_on:
       - base
+      - dynamo
     environment:
-      - DYNAMO_ENDPOINT=http://dynamo:8000
+      - AWS_ENDPOINT_URL_DYNAMODB=http://dynamo:8000
       - AWS_DEFAULT_REGION=fakeregion
       - AWS_ACCESS_KEY_ID=fakeaccesskey
       - AWS_SECRET_ACCESS_KEY=fakesecretkey
```
**New file** (36 added lines):

```python
import awswrangler as wr
import boto3
import botocore
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, encode, struct, to_json

# Initialize Spark session
spark = SparkSession.builder.appName("FraudClassificationConversion").getOrCreate()

parquet_files = spark.read.parquet(os.path.join(os.environ['PARQUET_FOLDER'], "*.parquet")).drop("key_json", "value_json", "ds")

dynamodb = boto3.client('dynamodb')
table_name = "test-join_drift_batch"
```
**Contributor** (comment on lines +15 to +16)

🛠️ Refactor suggestion: consider making the table name configurable. The table name `"test-join_drift_batch"` is currently hardcoded. To improve maintainability and flexibility, consider making it configurable through an environment variable or a configuration file:

```python
table_name = os.environ.get('DYNAMODB_TABLE_NAME', 'test-join_drift_batch')
```

This allows you to easily change the table name without modifying the code, while still providing a default value.
```python
panda_df = parquet_files.toPandas()

# Upload data in batches
batch_size = 1000  # Adjust based on your needs
for i in range(0, len(panda_df), batch_size):
    batch = panda_df.iloc[i:i+batch_size]
    try:
        wr.dynamodb.put_df(df=batch, table_name=table_name)
        print(f"Uploaded batch {i//batch_size + 1}/{len(panda_df)//batch_size + 1}", flush=True)
    except Exception as e:
        print(f"Error uploading batch {i//batch_size + 1}: {str(e)}", flush=True)
```
**Contributor** (comment on lines +20 to +30)

🛠️ Refactor suggestion: consider using the Spark DataFrame for the entire process. Currently, the code converts the Spark DataFrame to a Pandas DataFrame before uploading to DynamoDB. This conversion might be unnecessary and could be inefficient for large datasets. Consider using the Spark DataFrame throughout, for example with `foreachPartition`:

```python
import pandas as pd

def upload_partition(partition):
    for row in partition:
        try:
            wr.dynamodb.put_df(df=pd.DataFrame([row.asDict()]), table_name=table_name)
        except Exception as e:
            print(f"Error uploading row: {str(e)}", flush=True)

parquet_files.foreachPartition(upload_partition)
```

This approach processes the data in Spark partitions, which can be more efficient for large datasets.
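One caveat with the per-row `put_df` call sketched above: it issues one write call per record, which forgoes DynamoDB's batching (`BatchWriteItem` accepts at most 25 items per request). A hedged refinement — `chunk_records` is a hypothetical helper, and the `wr`, `pd`, `partition`, and `table_name` names in the comment are assumed from the surrounding script — would group each partition into chunks first:

```python
def chunk_records(records, chunk_size=25):
    """Split a list of row dicts into chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

# Inside the hypothetical partition uploader, each chunk would become one
# DataFrame passed to a single put_df call, e.g.:
#   for chunk in chunk_records([row.asDict() for row in partition]):
#       wr.dynamodb.put_df(df=pd.DataFrame(chunk), table_name=table_name)

chunks = chunk_records([{"id": n} for n in range(60)])
print([len(c) for c in chunks])  # → [25, 25, 10]
```

The chunk size of 25 matches the DynamoDB batch-write limit, so each upload maps to at most one underlying batch request.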
```python
print("Wrote parquet to Dynamo")
```