Enhanced discovery through ElasticSearch #5

Open
paulu-aws opened this issue May 21, 2020 · 7 comments

Comments

@paulu-aws
Contributor

No description provided.

@paulu-aws paulu-aws created this issue from a note in 2020 Roadmap (To do) May 21, 2020
@martyn-swift

Hi @paulu-aws, what would the architecture look like for this? Do you have a diagram?

@paulu-aws
Contributor Author

paulu-aws commented Dec 12, 2022

@martyn-swift, originally I had in mind an S3 trigger on the data-lake bucket that would invoke a Lambda function to insert records into Elasticsearch. However, the cheat code for this is Quilt: https://quiltdata.com/
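
For readers who still want to see the original idea, here is a minimal sketch of the S3-trigger-to-Lambda half of it, under stated assumptions. The document shape and the `index_doc` callable are hypothetical; in practice `index_doc` would wrap whatever Elasticsearch client you use. Only the S3 notification fields (`eventName`, `s3.bucket.name`, `s3.object.key`) come from the standard event format.

```python
from urllib.parse import unquote_plus


def s3_event_to_docs(event):
    """Turn an S3 notification event into one search document per new object."""
    docs = []
    for record in event.get("Records", []):
        # Only index object-created events; skip deletes and anything else.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        docs.append({
            "_id": f"{bucket}/{key}",  # hypothetical document id scheme
            "bucket": bucket,
            "key": key,
            "size": record["s3"]["object"].get("size"),
            "event_time": record.get("eventTime"),
        })
    return docs


def make_handler(index_doc):
    """Build a Lambda handler around a hypothetical index_doc(doc_id, body) callable."""
    def handler(event, context):
        docs = s3_event_to_docs(event)
        for doc in docs:
            body = {k: v for k, v in doc.items() if k != "_id"}
            index_doc(doc["_id"], body)
        return {"indexed": len(docs)}
    return handler
```

Keeping the event parsing separate from the indexing call makes the handler easy to test locally without a cluster.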

Full disclosure: I don't work for them, and this is my own opinion, not my employer's. Quilt is an awesome product for this kind of use case. I'm not sure I'd try to implement my own indexing and Elasticsearch architecture if Quilt were an option.

@martyn-swift

@paulu-aws, does the Glue job write to S3 in large batches? Would the job trigger per object? Is EventBridge an alternative for batching up S3 PUT events?

@paulu-aws
Contributor Author

@martyn-swift, part of the beauty of Glue is how DynamicFrames abstract away this kind of detail, UNLESS you really want to control it. The format and format_options parameters (block size, for example) of glueContext.write_dynamic_frame.from_options() give you some control over how the write to S3 behaves. However, I'd advise against trying to outsmart it. The DynamicFrame is going to do a much more efficient job of mapping writes across your DPUs in parallel into S3 than anything someone might cook up. Live the dream. Let the DynamicFrame do its job. That lets you focus on enrolling more datasets and less on plumbing.
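
As a rough illustration of the knobs mentioned above, the sketch below builds the arguments for `write_dynamic_frame.from_options()` and passes them through. The helper names, bucket, and prefix are hypothetical. `blockSize` and `compression` are the Parquet format options as I understand them from the Glue documentation; treat the exact option names as an assumption and check them against your Glue version.

```python
def s3_write_kwargs(bucket, prefix, block_size_bytes=128 * 1024 * 1024):
    """Keyword arguments for glueContext.write_dynamic_frame.from_options()."""
    return {
        "connection_type": "s3",
        "connection_options": {"path": f"s3://{bucket}/{prefix}"},
        "format": "parquet",
        # Assumed option names: blockSize is the Parquet row-group size in bytes.
        "format_options": {"compression": "snappy", "blockSize": block_size_bytes},
    }


def write_to_lake(glue_context, frame, bucket, prefix):
    """Write a DynamicFrame to S3 and return the options that were used."""
    kwargs = s3_write_kwargs(bucket, prefix)
    # Glue decides how the write is spread across workers; we only pass options.
    glue_context.write_dynamic_frame.from_options(frame=frame, **kwargs)
    return kwargs
```

Inside a Glue job this would be called as `write_to_lake(glueContext, dyf, "my-bucket", "curated/")`, where `glueContext` and `dyf` are the job's own context and frame.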

@paulu-aws
Contributor Author

@martyn-swift, I'll also mention that if you are trying to plug the Glue job into an event-driven architecture, it's probably not a good idea to rely on S3 triggers as the message bus. S3 triggers describe write behavior on an S3 bucket, not logical processing steps. Multi-file outputs, failed writes, object versions, etc. all become challenges when you rely on S3 triggers as an eventing mechanism. Better options, from most to least complex (IMO), would be Amazon EventBridge, Amazon MQ, AWS Step Functions, and Amazon SNS. All of those services can be called directly from inside your Glue job using Python (or Java) APIs to send messages over the duration of your job. For example, directly after the write_dynamic_frame call you'll know the files are written, along with the bucket, key name, format, and options used, and you can pass that along to your preferred messaging bus.
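
To make the "pass that along" step concrete, here is a hedged sketch that sends one event to EventBridge once the write has returned. The `Source` and `DetailType` values, the function names, and the shape of `Detail` are all hypothetical. The client is passed in, so inside a Glue job you would supply `boto3.client("events")`.

```python
import json


def dataset_written_entry(bucket, prefix, fmt, format_options, bus_name="default"):
    """Build a single EventBridge entry describing a completed write."""
    return {
        "Source": "datalake.glue",        # hypothetical source name
        "DetailType": "DatasetWritten",   # hypothetical detail type
        "Detail": json.dumps({
            "bucket": bucket,
            "prefix": prefix,
            "format": fmt,
            "format_options": format_options,
        }),
        "EventBusName": bus_name,
    }


def announce_write(events_client, bucket, prefix, fmt, format_options):
    """Send the entry with put_events and fail loudly if it was rejected."""
    entry = dataset_written_entry(bucket, prefix, fmt, format_options)
    response = events_client.put_events(Entries=[entry])
    # put_events reports rejected entries in the response instead of raising.
    if response.get("FailedEntryCount", 0):
        raise RuntimeError(f"EventBridge rejected the event: {response}")
    return entry
```

Because the event is emitted once per logical write rather than once per object, downstream consumers do not have to reassemble multi-file outputs.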

@martyn-swift

martyn-swift commented Jan 18, 2023

@paulu-aws, can you give me an example of the SNS call? Is it using boto3, with a call after the job.commit()?

@paulu-aws
Contributor Author

@martyn-swift,

The SNS call would look like any other Boto3 call you might see in the API basics examples. You just need to make sure your Glue job's execution role has IAM permissions to publish to the SNS topic. You may eventually find yourself peppering several .publish() calls to SNS topics throughout your Glue job's write operations to provide updates or metrics.
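
A minimal sketch of such a call, assuming the execution role is allowed `sns:Publish` on the topic. The function name, message shape, and topic ARN are hypothetical; `publish()` with `TopicArn`, `Subject`, and `Message` is the standard Boto3 SNS client call.

```python
import json


def publish_job_update(sns_client, topic_arn, job_name, status, metrics=None):
    """Publish a small JSON status message and return the SNS message id."""
    message = {"job": job_name, "status": status, "metrics": metrics or {}}
    response = sns_client.publish(
        TopicArn=topic_arn,
        # SNS subjects are limited to 100 characters, so truncate defensively.
        Subject=f"Glue job {job_name}: {status}"[:100],
        Message=json.dumps(message),
    )
    return response["MessageId"]
```

In a Glue job you would create the client once with `boto3.client("sns")` and call `publish_job_update(sns, topic_arn, "my-job", "WRITE_COMPLETE", {"rows": 1000})` at the points where an update is useful.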

Just keep in mind that doing this will start you down a path of blending business workflow state with your data engineering code, and those two are probably best kept separate. As a one-off or short-term solution, SNS inside the Glue job is fine. But anything else really deserves a more sophisticated business workflow framework like AWS Step Functions. Here is a quick Step Functions state machine I mocked up in a few minutes that triggers a Glue workflow and publishes to SNS topics WITHOUT requiring any changes to the Glue job itself.

[Image: mock-up of the state machine described above]
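
Since the image itself is not reproduced here, the following is only a guess at what such a state machine could look like, not the author's actual mock-up. It builds an Amazon States Language definition that runs a Glue job (a single job via the `glue:startJobRun.sync` integration, swapped in for the Glue workflow mentioned above) and then publishes to SNS on success or failure. The state names, job name, and topic ARNs are hypothetical.

```python
import json


def glue_then_notify_definition(job_name, success_topic_arn, failure_topic_arn):
    """Return an Amazon States Language definition as a JSON string."""
    def notify(topic_arn, text, **transition):
        # SNS publish task; 'transition' is either End=True or Next="...".
        return {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": topic_arn, "Message": text},
            **transition,
        }

    definition = {
        "Comment": "Run a Glue job, then notify via SNS (illustrative sketch)",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # .sync makes the state wait until the job run finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "Next": "NotifySuccess",
            },
            "NotifySuccess": notify(success_topic_arn, f"{job_name} succeeded", End=True),
            "NotifyFailure": notify(failure_topic_arn, f"{job_name} failed", Next="JobFailed"),
            "JobFailed": {"Type": "Fail", "Error": "GlueJobFailed"},
        },
    }
    return json.dumps(definition, indent=2)
```

The notifications live entirely in the workflow layer, so the Glue script needs no SNS code or extra IAM permissions of its own.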
