This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Duplicate S3 access log request IDs and glue deduping #32

Open
mikeplem opened this issue May 4, 2023 · 0 comments

mikeplem commented May 4, 2023

We are testing version 6.0.0 of the tool with Glue 3.0 and have noticed that some access log records are being deduplicated when files are converted to Hive/Parquet format.

An example from our access logs:

9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY BATCH.DELETE.OBJECT f1683055189142x766494105173435800/IMG_1138.jpeg - 204 - - - - - - - - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -
9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8 BUCKET_NAME [02/May/2023:19:19:55 +0000] 52.12.241.113 IAM_ARN_HERE 1C25EZNCB2HBMQQY REST.POST.MULTI_OBJECT_DELETE - "POST /BUCKET_NAME/?delete HTTP/1.1" 200 - 305 - 29 - "-" "-" - Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg== SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader s3.amazonaws.com TLSv1.2 - -

The Athena query output is as follows:

#	bucket_owner	bucket	time	remote_ip	requester	request_id	operation	key	request_uri	http_status	error_code	bytes_sent	object_size	total_time	turnaround_time	referrer	user_agent	version_id	host_id	signature_version	cipher_suite	authentication_type	host_header	tls_version	year	month	day
1	9d306c3478cf2e54f72d7f972c2e7090d30324572b78f1901d30a5d89e33cbe8	BUCKET_NAME	2023-05-02 19:19:55.000	52.12.241.113	IAM_ARN_HERE	1C25EZNCB2HBMQQY	REST.POST.MULTI_OBJECT_DELETE		POST /BUCKET_NAME/?delete HTTP/1.1	200		305		29					Pad2oayPBK9Yqw9/BWjhgn84fAsgRK7OTjjRFTy8Nuzlr27Ou+InFTsEf3eJsOaOkr2jw9xLBUa6d1tHwjf+xg==	SigV2	ECDHE-RSA-AES128-GCM-SHA256			TLSv1.2 - -	2023	05	02

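This is concerning because the Athena output contains only the REST.POST.MULTI_OBJECT_DELETE record, so it does not show which file was deleted. Both log entries share the request ID 1C25EZNCB2HBMQQY, and it appears the second S3 access log entry overwrites the first when the file is converted.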

  • Is this expected?
  • Is there a way to log both access log entries?
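
To make the suspected behavior concrete, here is a minimal plain-Python sketch. It assumes the conversion keeps one row per request_id and that later rows replace earlier ones; I have not confirmed this against the tool's source. The `dedupe` helper and the composite key of request_id, operation, and key are hypothetical, used only for illustration.

```python
# Two records reduced to the fields relevant here, taken from the log lines above.
records = [
    {
        "request_id": "1C25EZNCB2HBMQQY",
        "operation": "BATCH.DELETE.OBJECT",
        "key": "f1683055189142x766494105173435800/IMG_1138.jpeg",
    },
    {
        "request_id": "1C25EZNCB2HBMQQY",
        "operation": "REST.POST.MULTI_OBJECT_DELETE",
        "key": "-",
    },
]


def dedupe(rows, key_fields):
    """Keep the last row seen for each distinct combination of key_fields.

    Hypothetical helper: it mimics a "later row overwrites earlier row"
    deduplication, which is an assumption about the conversion job.
    """
    latest = {}
    for row in rows:
        latest[tuple(row[field] for field in key_fields)] = row
    return list(latest.values())


# Assumed current behavior: request_id alone is the dedup key,
# so the BATCH.DELETE.OBJECT row (with the object key) is lost.
by_request_id = dedupe(records, ["request_id"])

# Possible alternative: a composite key keeps both rows.
by_composite = dedupe(records, ["request_id", "operation", "key"])

print(len(by_request_id), len(by_composite))
```

If the job does behave this way, widening the dedup key as in `by_composite` would keep both entries while still collapsing exact repeats of the same log line.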

Thank you for taking the time to look into this.
