
Speed up listdir #19

Open

wants to merge 2 commits into master
Conversation

leibowitz

Instead of fetching all objects starting with the first letter of the prefix (which is time-consuming due to the sheer number of objects in the buckets), use the whole prefix. For instance, for the sentry-sdk package, the prefix to use would be:
sentry-sdk instead of just s
It also changes any - in the package name to _, since all packages with dashes in their names appear to be stored with underscores instead.
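For reference, a minimal sketch of the prefix derivation described above (the function name listPrefixForPackage is illustrative, not the actual go-minipypi code):

```go
package main

import (
	"fmt"
	"strings"
)

// listPrefixForPackage builds the S3 key prefix for a package lookup.
// Instead of truncating to the first letter (the previous behaviour),
// it uses the whole package name, with dashes mapped to underscores,
// since packages with dashes in their names appear to be stored with
// underscores on the bucket.
func listPrefixForPackage(pkg string) string {
	return strings.ReplaceAll(pkg, "-", "_")
}

func main() {
	fmt.Println(listPrefixForPackage("sentry-sdk")) // sentry_sdk
}
```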

@emil79

emil79 commented Dec 15, 2021

Gosh, OK, this is surprising to me. I can't fathom why we would only use the first letter in the search. Which makes me worried: could there actually be a reason for this?

@leibowitz
Author

It might be due to some package names having PascalCase but being saved on disk in camelCase. That's the only reason I can think of, but when I looked at the list of package names in one project as an example, everything was working fine. I guess we will know where it breaks after we update... 🤷 💥

@emil79

emil79 commented Dec 15, 2021

@msf do you remember - was there a particular reason to just use the first character?

@leibowitz
Author

leibowitz commented Dec 16, 2021

Should we just give this a try and see how it plays out? I think it's pretty easy to revert. My only concern is that I don't know how the binary gets built and pushed to S3.
It seems there's a branch called circleci-upload-arm64 that builds both amd64 and arm64, but it doesn't seem to work due to an old SSH key being used to fetch from GitHub.

@emil79

emil79 commented Dec 16, 2021

@leibowitz yes, OK; as it's arguably pretty broken right now, I'd say just go for it. We may have to be responsive, though, with any problems caused (hopefully none, but we haven't changed this in so long). I can't remember how it is built and packaged, though (maybe manually?)

@simongibbons

Hello, totally didn't realise I was still subscribed to this repo. Hope you are all doing well!

I think this is because package names in PyPI are case insensitive (i.e. if you have Django and django in your requirements file, both should install the same package). Moving to the prefix search means that this is no longer a drop-in replacement for PyPI.

A simple thing to do here might be to use this trick on the first 2-3 characters of the path (i.e. make 4/8 list requests instead of 2, which should cut down the number of keys that get returned in each by a lot).

@leibowitz
Author

Thanks @simongibbons for the insight. From what I saw in our requirements.txt, we use the correct names (spelled properly in terms of case sensitivity).
It would definitely break if a package name is not spelled properly, but we can always work around that by updating the package name in the requirements.txt file. That sounds like a minor and solvable issue compared with the timeouts, which are blocking and where the only workaround is to increase the timeout to larger and larger values.

@leibowitz
Author

leibowitz commented Dec 16, 2021

The other "better" option would be to store all the files lowercase on S3, and lowercase the request when looking up the prefix and the package. But that would require renaming all the package names on S3, which would take a while.
That would need to be done with a script.

The only issue is that it would be a breaking change, as it might break existing go-minipypi deployments (if done before pushing a new version). And it will break once a new version with that change is released, so there's a window of time where things will be broken no matter what.
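A rough sketch of what such a rename script could look like, assuming the aws-sdk-go v1 API; the bucket name is a placeholder and there is no dry-run, retry, or URL-encoding handling:

```go
package main

import (
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := "my-packages-bucket" // placeholder
	svc := s3.New(session.Must(session.NewSession()))

	err := svc.ListObjectsPages(&s3.ListObjectsInput{Bucket: aws.String(bucket)},
		func(page *s3.ListObjectsOutput, lastPage bool) bool {
			for _, obj := range page.Contents {
				key := aws.StringValue(obj.Key)
				lower := strings.ToLower(key)
				if lower == key {
					continue // already lowercase
				}
				// Copy to the lowercased key, then delete the original.
				if _, err := svc.CopyObject(&s3.CopyObjectInput{
					Bucket:     aws.String(bucket),
					CopySource: aws.String(bucket + "/" + key),
					Key:        aws.String(lower),
				}); err != nil {
					log.Printf("copy %s: %v", key, err)
					continue
				}
				if _, err := svc.DeleteObject(&s3.DeleteObjectInput{
					Bucket: aws.String(bucket),
					Key:    aws.String(key),
				}); err != nil {
					log.Printf("delete %s: %v", key, err)
				}
			}
			return true // keep paginating
		})
	if err != nil {
		log.Fatal(err)
	}
}
```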

@emil79

emil79 commented Dec 17, 2021

Thanks @simongibbons!

@leibowitz I think that we should try what Simon suggests, since we don't want to rename all the files on S3.

  • We make 4 requests instead of 2. If the first characters are "AB", we make prefix requests for
    • AB
    • Ab
    • aB
    • ab

If this is still too slow, we can try 8 requests on the first three characters: ABC, ABc, AbC, etc. (a sketch of generating these variants is below)
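A sketch of how those case-variant prefixes could be generated (an illustrative helper, not existing code); the 2^n growth is part of why going past the first two or three characters stops being attractive:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// casePrefixes returns every upper/lower case combination of the first n
// characters of name, e.g. n=2 and "ab" -> ["AB", "Ab", "aB", "ab"].
// Non-letter characters (like "-") produce duplicate entries, which could
// be deduplicated before issuing the requests.
func casePrefixes(name string, n int) []string {
	if len(name) < n {
		n = len(name)
	}
	head := []rune(strings.ToLower(name[:n]))
	variants := []string{""}
	for _, r := range head {
		var next []string
		for _, v := range variants {
			next = append(next, v+string(unicode.ToUpper(r)), v+string(r))
		}
		variants = next
	}
	return variants
}

func main() {
	fmt.Println(casePrefixes("sentry-sdk", 2)) // [SE Se sE se]
}
```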

@leibowitz
Author

I disagree. I would rather keep it simple and do the query as the package name is specified in requirements.txt, and fail when it doesn't match. The failures will be quite obvious, and we can amend the package name in requirements.txt to fix the build on CircleCI.

Doing 4/8 requests per package name is not going to help with the slow-requests issue, imho.

@leibowitz
Author

leibowitz commented Dec 17, 2021

What I'm trying to say is that this boils down to how ListObjectsPages works. As the docs mention (and the name suggests), it iterates through the list of objects, 1000 elements at a time.
In some cases, we have multiple thousands of packages sharing the same first/second letters. To fetch the whole list, multiple requests are triggered. AFAIK the requests use the StartAfter parameter to query the next page, so they can only be made in sequence. In other words, the more objects it has to go through, the longer it will take, and there's no way around it (at least none that I know of).

The best "solution", imho, is not to do multiple requests, and to fetch only the packages matching the complete name (prefix).
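For context, this is roughly what the paginated listing looks like with the aws-sdk-go v1 ListObjectsPages call; each page is one sequential round trip of up to 1000 keys, so a narrow Prefix (the full package name) keeps the page count low. The bucket name and prefix here are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	svc := s3.New(session.Must(session.NewSession()))

	// Each callback invocation corresponds to one sequential request of up
	// to 1000 keys; the SDK only issues the next request after the current
	// page has been processed.
	err := svc.ListObjectsPages(&s3.ListObjectsInput{
		Bucket: aws.String("my-packages-bucket"), // placeholder
		Prefix: aws.String("sentry_sdk"),
	}, func(page *s3.ListObjectsOutput, lastPage bool) bool {
		for _, obj := range page.Contents {
			fmt.Println(aws.StringValue(obj.Key))
		}
		return true // request the next page, if any
	})
	if err != nil {
		log.Fatal(err)
	}
}
```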

@msf

msf commented Dec 17, 2021 via email

@msf

msf commented Dec 17, 2021 via email

@msf

msf commented Dec 17, 2021 via email

@msf

msf commented Dec 17, 2021 via email

@leibowitz
Author

thanks @msf 👍

@emil79

emil79 commented Jan 3, 2022

@leibowitz I'm concerned that this will break things, since Python authors will assume package names are case insensitive; this applies both to requirements.txt and also to packages' setup.py, where they may reference other packages.

I believe in our case we can fix the speed of go-minipypi by moving some unneeded old versions out of the packages bucket (I shared this with you on Notion). Would it be worth trying that first? (Long term, we'll move everything to CodeArtifact, so we don't need a permanent fix for this.)

@leibowitz
Author

leibowitz commented Jan 5, 2022

@emil79 removing packages from the S3 bucket we use is, as you pointed out, only a temporary fix. And it is independent of this change; in other words, it can be done with or without this change.

And I agree that moving to CodeArtifact is the way forward. As I said before, this was my attempt at fixing the issues we are facing while using go-minipypi in its current form, while we make a migration plan. The fact that S3 buckets are case sensitive, but PyPI isn't, means this idea of using an S3 bucket is always going to be flawed. My suggestion is to bite the bullet now, and not have to come back to this issue ever again. If someone has issues using go-minipypi, it will be apparent what the problem is when a package is not being resolved (and it is easily fixed by changing the case in the name).

I could even add a log message when a package name lookup returns a 404, telling the user to double-check that the package name is spelled properly (with regard to case).

That, plus bumping the major version number and highlighting the breaking change.

But if you still feel against merging this change as-is, that's fine. I just don't want to add more requests, as we discussed, as I feel that just repeats the original issue (due to how the S3 ListObjects API works).

To be fair, that makes me think we could try a hybrid approach: first look up the full name as-is, and if that fails, do the lookup with 1 or 2 characters. That should give us the best of both approaches. What do you think?
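A rough outline of that hybrid lookup (hypothetical helper names; listKeys stands in for the paginated listing sketched earlier):

```go
package main

import "strings"

// listKeys is a hypothetical stand-in for the paginated S3 listing shown
// earlier; it returns every key under the given prefix.
var listKeys func(prefix string) ([]string, error)

// findPackageKeys tries the exact package name first and only falls back to
// the old short-prefix listing when the exact lookup returns nothing.
func findPackageKeys(pkg string) ([]string, error) {
	exact := strings.ReplaceAll(pkg, "-", "_")
	keys, err := listKeys(exact)
	if err != nil || len(keys) > 0 {
		return keys, err
	}
	// Fallback: the original behaviour, listing by the first character
	// (or two) of the requested name, to tolerate case mismatches.
	return listKeys(exact[:1])
}
```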
