Improve performance of AWS script #3

Open
simonw opened this issue Jan 29, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@simonw
Owner

simonw commented Jan 29, 2022

This works now! Would be good to make it faster though.

Originally posted by @simonw in #2 (comment)

simonw added the enhancement label Jan 29, 2022
@simonw
Owner Author

simonw commented Jan 29, 2022

It takes 2hr45m right now. That's a long time, especially if I want to run it every day! Feels like a poor use of GitHub Actions resources.

@simonw
Owner Author

simonw commented Jan 29, 2022

Some options:

  • Use threads or processes to run some of the tasks in parallel - not sure how many vCPUs GitHub Actions gives me though so this may not make much of a difference
  • Check the version first and only run the crawl if it has changed since last time. This would definitely be worthwhile.
  • Dig into the Python implementation of awscli and see if I can call help while avoiding the overhead of starting up a fresh process for every single page
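
The second option (skip the crawl when the version has not changed) could look roughly like the sketch below. It assumes the version is read from `aws --version` and that the last crawled version is kept in a small text file; the function names and the state file are hypothetical, not part of the existing script.

```python
import subprocess
from pathlib import Path


def installed_aws_version(command=("aws", "--version")):
    """Return the version line printed by `aws --version`."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # Fall back to stderr in case the CLI prints its version there.
    return (result.stdout or result.stderr).strip()


def version_changed(current, state_file):
    """Return True if `current` differs from the version recorded in state_file."""
    path = Path(state_file)
    previous = path.read_text().strip() if path.exists() else None
    return previous != current


def record_version(current, state_file):
    """Store `current` as the last version that was crawled successfully."""
    Path(state_file).write_text(current + "\n")
```

Recording the version only after a crawl finishes means a failed run gets retried the next day instead of being skipped.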

@simonw
Owner Author

simonw commented Jan 29, 2022

Worth considering: I'm currently using the aws CLI that ships with the GitHub Actions worker.

According to https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md that's currently AWS CLI 2.4.13. It gets bumped pretty often: the commit history at https://github.com/actions/virtual-environments/commits/main/images/linux/Ubuntu2004-Readme.md shows a bump every few days.

But the release history on https://pypi.org/project/awscli/#history shows daily releases of AWS CLI - so actually I should update to the latest version using pip install -U rather than relying on the built-in one.
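
A minimal sketch of that upgrade step, driven from Python so it uses the same interpreter as the script. The helper names are hypothetical. One thing to check first: as far as I know the `awscli` package on PyPI is the 1.x line, so this would switch away from the 2.x build that ships with the worker.

```python
import subprocess
import sys


def pip_upgrade_command(package, python=sys.executable):
    """Build the argument list for `python -m pip install --upgrade <package>`."""
    return [python, "-m", "pip", "install", "--upgrade", package]


def upgrade(package):
    """Run the upgrade and raise CalledProcessError if pip fails."""
    subprocess.run(pip_upgrade_command(package), check=True)
```

Calling `upgrade("awscli")` before the crawl would pick up the newest release on every run.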

@simonw
Owner Author

simonw commented Feb 13, 2022

https://superfastpython.com/threadpoolexecutor-in-python/ looks useful.
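
A rough sketch of the thread pool approach, assuming each page is produced by running one command and capturing its output. The function names and the `max_workers` default are placeholders, not what the script currently does.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(args):
    """Run one command and return its captured stdout as text."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout


def run_all(commands, max_workers=4):
    """Run every command in a thread pool, returning outputs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `commands` in its results.
        return list(pool.map(run_command, commands))
```

Each thread spends its time waiting on a child process, so the GIL should not be the bottleneck here. `os.cpu_count()` reports how many vCPUs the worker has, which is a reasonable starting point for `max_workers`.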

Might also be interesting to try doing this with asyncio and https://docs.python.org/3/library/asyncio-eventloop.html#running-subprocesses
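
The asyncio version might look something like this. It uses `asyncio.create_subprocess_exec` with a semaphore to cap how many processes run at once; the function names and the `limit` default are assumptions for illustration.

```python
import asyncio


async def run_command_async(args, semaphore):
    """Run one command, waiting for a free slot, and return its stdout as text."""
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(
            *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
        return stdout.decode()


async def run_all_async(commands, limit=4):
    """Run all commands with at most `limit` running concurrently."""
    semaphore = asyncio.Semaphore(limit)
    tasks = [run_command_async(args, semaphore) for args in commands]
    # gather returns results in the same order as the tasks passed in.
    return await asyncio.gather(*tasks)
```

Usage would be `outputs = asyncio.run(run_all_async(commands))`, where `commands` is a list of argument lists.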
