
Speed up listdir #19

Open

wants to merge 2 commits into master
Conversation

leibowitz

Instead of fetching all objects starting with the first letter of the prefix (which is time-consuming due to the sheer number of objects in the buckets), use the whole prefix. For instance, for the sentry-sdk package, the prefix to use would be:
sentry-sdk instead of just s
It also changes any - in the package name to _, since all packages with dashes in their names appear to be stored with underscores instead.
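For reference, a minimal sketch of the prefix derivation described above (the function name listPrefixForPackage is illustrative, not the actual go-minipypi code):

```go
package main

import (
	"fmt"
	"strings"
)

// listPrefixForPackage builds the S3 key prefix for a package lookup.
// Instead of truncating to the first letter (the previous behaviour),
// it uses the whole package name, with dashes mapped to underscores,
// since packages with dashes in their names appear to be stored with
// underscores on the bucket.
func listPrefixForPackage(pkg string) string {
	return strings.ReplaceAll(pkg, "-", "_")
}

func main() {
	fmt.Println(listPrefixForPackage("sentry-sdk")) // sentry_sdk
}
```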

@emil79

emil79 commented Dec 15, 2021

Gosh, OK, this is surprising to me. I can't fathom why we would only use the first letter in the search. Which makes me worried: could there actually be a reason for this?

@leibowitz
Author

It might be due to some package names having PascalCase but being saved on disk in camelCase. That's the only reason I can think of, but when I looked at the list of package names in one project as an example, everything was working fine. I guess we will know where it breaks after we update... 🤷 💥

@emil79

emil79 commented Dec 15, 2021

@msf do you remember - was there a particular reason to just use the first character?

@leibowitz
Author

leibowitz commented Dec 16, 2021

Should we just give this a try and see how it plays out? I think it's pretty easy to revert. My only concern is that I don't know how the binary gets built and pushed to S3.
It seems there's a branch called circleci-upload-arm64 that builds both amd64 and arm64, but it doesn't seem to work due to an old SSH key being used to fetch from GitHub.

@emil79

emil79 commented Dec 16, 2021

@leibowitz yes, OK; as it's arguably pretty broken right now, I'd say just go for it. We may have to be responsive, though, with any problems caused (hopefully none, but we haven't changed this in so long). I can't remember how it is built and packaged, though (maybe manually?)

@simongibbons

Hello, totally didn't realise I was still subscribed to this repo. Hope you are all doing well!

I think this is because package names in PyPI are case insensitive (i.e. if you have Django and django in your requirements file, both should install the same package). Moving to the prefix search means that this is no longer a drop-in replacement for PyPI.

A simple thing to do here might be to use this trick on the first 2-3 characters of the path (i.e. make 4/8 list requests instead of 2, which should cut down the number of keys that get returned in each by a lot).

@leibowitz
Author

Thanks @simongibbons for the insight. From what I saw in our requirements.txt, we use the correct names (spelled properly in terms of case sensitivity).
It would definitely break if a package name is not spelled properly, but we can always work around that by updating the package name in the requirements.txt file. That sounds like a minor and solvable issue compared with the timeouts, which are blocking and where the only workaround is to increase the timeout to larger and larger values.

@leibowitz
Author

leibowitz commented Dec 16, 2021

The other "better" option would be to store all the files lowercase on S3, and lowercase the request when looking up the prefix and the package. But that would require renaming all the package names on S3, which would take a while.
That would need to be done with a script.

The only issue is that it would be a breaking change, as it might break existing go-minipypi deployments (if done before pushing a new version). And it will break once a new version with that change is released, so there's a window of time where things will be broken no matter what.
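A rough sketch of what such a rename script could look like, assuming the aws-sdk-go v1 API; the bucket name is a placeholder and there is no dry-run, retry, or URL-encoding handling:

```go
package main

import (
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := "my-packages-bucket" // placeholder
	svc := s3.New(session.Must(session.NewSession()))

	err := svc.ListObjectsPages(&s3.ListObjectsInput{Bucket: aws.String(bucket)},
		func(page *s3.ListObjectsOutput, lastPage bool) bool {
			for _, obj := range page.Contents {
				key := aws.StringValue(obj.Key)
				lower := strings.ToLower(key)
				if lower == key {
					continue // already lowercase
				}
				// Copy to the lowercased key, then delete the original.
				if _, err := svc.CopyObject(&s3.CopyObjectInput{
					Bucket:     aws.String(bucket),
					CopySource: aws.String(bucket + "/" + key),
					Key:        aws.String(lower),
				}); err != nil {
					log.Printf("copy %s: %v", key, err)
					continue
				}
				if _, err := svc.DeleteObject(&s3.DeleteObjectInput{
					Bucket: aws.String(bucket),
					Key:    aws.String(key),
				}); err != nil {
					log.Printf("delete %s: %v", key, err)
				}
			}
			return true // keep paginating
		})
	if err != nil {
		log.Fatal(err)
	}
}
```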

@emil79

emil79 commented Dec 17, 2021

Thanks @simongibbons!

@leibowitz I think that we should try what Simon suggests, since we don't want to rename all the files on S3.

  • We make 4 requests instead of 2. If the first characters are "AB", we make prefix requests for
    • AB
    • Ab
    • aB
    • ab

If this is still too slow, we can try 8 requests on the first three characters: ABC, ABc, AbC, etc. (a sketch of generating these variants is below)
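A sketch of how those case-variant prefixes could be generated (an illustrative helper, not existing code); the 2^n growth is part of why going past the first two or three characters stops being attractive:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// casePrefixes returns every upper/lower case combination of the first n
// characters of name, e.g. n=2 and "ab" -> ["AB", "Ab", "aB", "ab"].
// Non-letter characters (like "-") produce duplicate entries, which could
// be deduplicated before issuing the requests.
func casePrefixes(name string, n int) []string {
	if len(name) < n {
		n = len(name)
	}
	head := []rune(strings.ToLower(name[:n]))
	variants := []string{""}
	for _, r := range head {
		var next []string
		for _, v := range variants {
			next = append(next, v+string(unicode.ToUpper(r)), v+string(r))
		}
		variants = next
	}
	return variants
}

func main() {
	fmt.Println(casePrefixes("sentry-sdk", 2)) // [SE Se sE se]
}
```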

@leibowitz
Author

I disagree. I would rather keep it simple and do the query as the package name is specified in requirements.txt, and fail when it doesn't match. The failures will be quite obvious, and we can amend the package name in requirements.txt to fix the build on CircleCI.

Doing 4/8 requests per package name is not going to help with the slow-requests issue, imho.

@leibowitz
Author

leibowitz commented Dec 17, 2021

What I'm trying to say is that this boils down to how ListObjectsPages works. As the docs mention (and the name suggests), it iterates through the list of objects, 1000 elements at a time.
In some cases, we have multiple thousands of packages sharing the same first/second letters. To fetch the whole list, multiple requests are triggered. AFAIK the requests use the StartAfter parameter to query the next page, so they can only be made in sequence. In other words, the more objects it has to go through, the longer it will take, and there's no way around it (at least none that I know of).

The best "solution", imho, is not to do multiple requests, and to fetch only the packages matching the complete name (prefix).
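For context, this is roughly what the paginated listing looks like with the aws-sdk-go v1 ListObjectsPages call; each page is one sequential round trip of up to 1000 keys, so a narrow Prefix (the full package name) keeps the page count low. The bucket name and prefix here are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	svc := s3.New(session.Must(session.NewSession()))

	// Each callback invocation corresponds to one sequential request of up
	// to 1000 keys; the SDK only issues the next request after the current
	// page has been processed.
	err := svc.ListObjectsPages(&s3.ListObjectsInput{
		Bucket: aws.String("my-packages-bucket"), // placeholder
		Prefix: aws.String("sentry_sdk"),
	}, func(page *s3.ListObjectsOutput, lastPage bool) bool {
		for _, obj := range page.Contents {
			fmt.Println(aws.StringValue(obj.Key))
		}
		return true // request the next page, if any
	})
	if err != nil {
		log.Fatal(err)
	}
}
```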

@msf

msf commented Dec 17, 2021 via email

@msf

msf commented Dec 17, 2021 via email

@msf

msf commented Dec 17, 2021 via email

@msf

msf commented Dec 17, 2021 via email

@leibowitz
Author

thanks @msf 👍

@emil79

emil79 commented Jan 3, 2022

@leibowitz I'm concerned that this will break things, since Python authors will assume package names are case insensitive; this applies both to requirements.txt and also to packages' setup.py, where they may reference other packages.

I believe in our case we can fix the speed of go-minipypi by moving some unneeded old versions out of the packages bucket (I shared this with you on Notion). Would it be worth trying that first? (Long term, we'll move everything to CodeArtifact, so we don't need a permanent fix for this.)

@leibowitz
Author

leibowitz commented Jan 5, 2022

@emil79 removing packages from the S3 bucket we use is, as you pointed out, only a temporary fix. And it is independent of this change; in other words, it can be done with or without this change.

And I agree that moving to CodeArtifact is the way forward. As I said before, this was my attempt at fixing the issues we are facing while using go-minipypi in its current form, while we make a migration plan. The fact that S3 buckets are case sensitive, but PyPI isn't, means this idea of using an S3 bucket is always going to be flawed. My suggestion is to bite the bullet now, and not have to come back to this issue ever again. If someone has issues using go-minipypi, it will be apparent what the problem is when a package is not being resolved (and it is easily fixed by changing the case in the name).

I could even add a log message when a package name lookup returns a 404, telling the user to double-check that the package name is spelled properly (with regard to case).

That, plus bumping the major version number and highlighting the breaking change.

But if you still feel against merging this change as-is, that's fine. I just don't want to add more requests, as we discussed, as I feel that just repeats the original issue (due to how the S3 ListObjects API works).

To be fair, that makes me think we could try a hybrid approach: first look up the full name as-is, and if that fails, do the lookup with 1 or 2 characters. That should give us the best of both approaches. What do you think?
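A rough outline of that hybrid lookup (hypothetical helper names; listKeys stands in for the paginated listing sketched earlier):

```go
package main

import "strings"

// listKeys is a hypothetical stand-in for the paginated S3 listing shown
// earlier; it returns every key under the given prefix.
var listKeys func(prefix string) ([]string, error)

// findPackageKeys tries the exact package name first and only falls back to
// the old short-prefix listing when the exact lookup returns nothing.
func findPackageKeys(pkg string) ([]string, error) {
	exact := strings.ReplaceAll(pkg, "-", "_")
	keys, err := listKeys(exact)
	if err != nil || len(keys) > 0 {
		return keys, err
	}
	// Fallback: the original behaviour, listing by the first character
	// (or two) of the requested name, to tolerate case mismatches.
	return listKeys(exact[:1])
}
```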
