Speed up listdir #19
Gosh, ok this is surprising to me. I can't fathom why we would only use the first letter in the search, which makes me worried: could there actually be a reason for this? |
It might be due to some package names having PascalCase but being saved on disk with camelCase. That's the only reason I can think of, but when looking at the list of package names in one project as an example, everything was working fine. I guess we will know where it breaks after we update... 🤷 💥 |
@msf do you remember - was there a particular reason to just use the first character? |
should we just give this a try and see how it plays out? I think it's pretty easy to revert. My only concern is I don't know how the binary gets built and pushed to s3 |
@leibowitz yes ok, as arguably it's pretty broken right now, I'd say just go for it. We may have to be responsive though with any problems caused (hopefully none, but we haven't changed this in so long). I can't remember how it is built + packaged though (maybe manually?) |
Hello, totally didn't realise I was still subscribed to this repo. Hope you are all doing well! I think this is because package names in PyPI are case insensitive. (i.e. if you have A simple thing to do here might be to use this trick on the first 2-3 characters of the path (i.e. make 4/8 list requests instead of 2, which should cut down the number of keys that get returned in each by a lot). |
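To make the request count concrete, here is a minimal sketch of the case-variant idea. `case_variants` is a hypothetical helper written for illustration, not code from go-minipypi, and `Django` is just an example name. Characters that have no case (digits, `-`, `_`) contribute a single form, so the number of prefixes is at most 2^k for the first k characters.

```python
from itertools import product


def case_variants(name: str, k: int) -> list[str]:
    """Return every upper/lower-case spelling of the first k characters of name."""
    head = name[:k]
    # For each character, collect its distinct lower/upper forms.
    # Digits and punctuation yield a single form.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in head]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for k in (1, 2, 3):
        variants = case_variants("Django", k)
        print(k, len(variants), variants)
```

Each returned prefix would cost one list request, which is where the 2, 4 and 8 figures come from.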
Thanks @simongibbons for the insight. From what I saw in our requirements.txt, we use the correct names (spelled properly in terms of case sensitivity) |
the other "better" option would be to store all the files lowercase on s3, and change the request to lowercase when looking for the prefix and the package. But that would require renaming all the package names on s3, which would take a while. The only issue is that it would be a breaking change, as it might break existing go-minipypi (if done before pushing a new version). And it will break once a new version with that change in is released, so there's a window of time where things will be broken no matter what. |
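A rough sketch of the lowercase option, assuming a plain Python list stands in for the bucket listing. `rename_plan` and `lookup_prefix` are hypothetical names and no S3 API is called here; the file names are made up for the example.

```python
def rename_plan(keys: list[str]) -> dict[str, str]:
    """Map each existing key to its lowercase form, skipping keys already lowercase."""
    plan = {}
    for key in keys:
        lowered = key.lower()
        if lowered != key:
            plan[key] = lowered
    return plan


def lookup_prefix(package: str) -> str:
    """Prefix to request once every key in the bucket is stored lowercase."""
    return package.lower()


if __name__ == "__main__":
    keys = ["Django-3.2.tar.gz", "requests-2.26.0.tar.gz"]
    print(rename_plan(keys))
    print(lookup_prefix("Django"))
```

The breaking window described above is the time between applying the rename plan and every client sending lowercase prefixes.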
Thanks @simongibbons ! @leibowitz I think that we should try what simon suggests, since we don't want to rename all the files on S3.
If this is still too slow, we can try 8 requests: AAA, AAa, AaA, etc |
I disagree. I would rather keep it simple, and do the query as the package name is specified in the requirements.txt, and fail when it doesn't match. The failures will be quite obvious, and we can amend the package name in the requirements.txt to fix the build on CircleCI. Doing 4/8 requests per package name is not going to help with the slow requests issue imho |
What I'm trying to say is this boils down to how The best "solution" imho is to not do multiple requests, and fetch only the packages matching the complete name (prefix) |
yes, the indexing on s3 is purely prefix based, and the pip installer was case
insensitive and did two queries for a package, changing the casing..
at the time we didn't have that many packages, so searching for the first letter
(twice, for casing!) was simple, cacheable and easy.. and because i was
reverse engineering the pip behavior and not studying it carefully, i didn't
want to go into the complications of "what if the package is called xYz"
(that is, weird casings of characters..), which would mean caching
quadratic combinations of lookups.. and caching all those s3 list queries..
but quite honestly, it was simplicity and doing it quickly, without thinking
or knowing too much about pip behavior
in essence, it was something that worked just fine for just 2
--
Miguel Mascarenhas Filipe
|
btw, nowadays there are pip proxies/caches that are super easy to
use/maintain, my minipypi is "old tech" now
|
another option is to have a long running go-minipypi server running and
update the url on circleci configs to use that instead of a shortlived
instance
that way all builds share the s3 lookup cache
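
A minimal sketch of what sharing the lookup cache could look like, assuming list results are memoized per prefix inside one long-lived process. `PrefixCache` and its `list_keys` callback are hypothetical stand-ins, not the cache go-minipypi actually uses.

```python
from typing import Callable


class PrefixCache:
    """Remember the result of each list query, keyed by prefix."""

    def __init__(self, list_keys: Callable[[str], list[str]]) -> None:
        self._list_keys = list_keys  # hypothetical stand-in for the S3 list call
        self._cache: dict[str, list[str]] = {}
        self.backend_calls = 0  # how many times the backend was actually queried

    def get(self, prefix: str) -> list[str]:
        # Only the first request for a prefix reaches the backend.
        if prefix not in self._cache:
            self.backend_calls += 1
            self._cache[prefix] = self._list_keys(prefix)
        return self._cache[prefix]


if __name__ == "__main__":
    cache = PrefixCache(lambda prefix: [prefix + "-1.0-py3-none-any.whl"])
    cache.get("s")  # first build: goes to the backend
    cache.get("s")  # second build: served from memory
    print(cache.backend_calls)
```

This sketch never expires entries, so newly uploaded packages would stay invisible until the process restarts.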
|
at unbabel we host our own pypi server and it is quite painless, speaking
against my creation, maybe that would be a more robust fix?
https://github.com/pypiserver/pypiserver
but hey, I do like my creation.. and using a simple s3 bucket as a wheel
repository.. But please be aware that go-minipypi was my first Go program,
written initially on a saturday with a tcpdump and `netcat -l` to see what
pip install -r xx was asking for.. and simply solving for that.
I was lazy and didn't check how wheel files should be uploaded or how to
handle the nuances of pip.. and those nuances might be different nowadays..
I guess I'm secretly happy that my code is still being used :-)
hugs to everyone, covid free and full of good spirits!
cheers
|
thanks @msf 👍 |
@leibowitz I'm concerned that this will break things, since Python authors will assume package names are case insensitive - this applies both to requirements.txt and also to packages' setup.py where they may reference other packages. I believe in our case we can fix the speed of go-minipypi by moving some unneeded old versions out of the packages bucket. (I shared this with you on Notion). Would it be worth trying that first? (Long term, we'll move everything to Code Artifact, so we don't need a permanent fix for this.) |
@emil79 removing packages from the s3 bucket we use is, as you pointed out, only a temporary fix. It is also independent of this change: it can be done with or without it. And I agree that moving to Code Artifact is the way forward.
As I said before, this was my attempt at fixing the issues we are facing with go-minipypi in its current form in the meantime, while we make a migration plan. The fact that S3 keys are case sensitive, but PyPI isn't, means this idea of using an s3 bucket is always going to be flawed.
My suggestion is to bite the bullet now, and not have to come back to this issue ever again. If someone has issues using go-minipypi, it will be apparent what the problem is when a package is not being resolved (and easily fixed by changing the case in the name). I could even add a log message when a package name lookup returns a 404, telling the user to double-check that the name of the package is spelled properly (with regard to case). That, plus bumping the major version number and highlighting the breaking change.
But if you are still against merging this change as-is, that's fine. I just don't want to add more requests, as we discussed, as I feel this is just repeating the original issue (due to how the S3 ListObjects API works).
To be fair, that makes me think we could try a hybrid approach. First, look up the full name as-is, and if that fails, do the lookup with 1 or 2 characters. That should give us the best of both approaches. What do you think? |
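The hybrid approach proposed above could be sketched roughly as follows. `find_package`, `case_variants` and the `list_keys` callback are hypothetical names for illustration; `list_keys(prefix)` is assumed to return every key starting with that prefix, the way a prefix-based listing would.

```python
from itertools import product
from typing import Callable


def case_variants(head: str) -> list[str]:
    """All upper/lower-case spellings of a short prefix."""
    choices = [sorted({ch.lower(), ch.upper()}) for ch in head]
    return ["".join(combo) for combo in product(*choices)]


def find_package(name: str, list_keys: Callable[[str], list[str]], k: int = 1) -> list[str]:
    """Try the exact name first; fall back to short case-variant prefixes."""
    exact = list_keys(name)
    if exact:
        return exact  # one request, the common case
    wanted = name.lower()
    matches = []
    # Fallback: list by the first k characters in every casing,
    # then compare the keys case-insensitively.
    for prefix in case_variants(name[:k]):
        for key in list_keys(prefix):
            if key.lower().startswith(wanted):
                matches.append(key)
    return matches


if __name__ == "__main__":
    bucket = ["Django-3.2.tar.gz", "docutils-0.17.tar.gz"]  # made-up listing

    def list_keys(prefix: str) -> list[str]:
        return [key for key in bucket if key.startswith(prefix)]

    print(find_package("Django", list_keys))
    print(find_package("django", list_keys))
```

Correctly cased names cost a single request; only mismatched names pay for the extra listings.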
Instead of fetching all objects starting with the first letter of the prefix (which is time consuming due to the sheer number of objects in the bucket), use the whole prefix. For instance, for the sentry-sdk package, the prefix to use would be `sentry-sdk` instead of just `s`.
It also changes any `-` in the package name to `_`, as it seems that all packages with dashes in their name are stored with underscores instead.
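
As an illustration of the change described here, a small sketch comparing the two prefixes. `listing_prefix` and `old_prefix` are hypothetical helpers, the file names are made up, and the sketch assumes the dash replacement is applied before the prefix is sent.

```python
def old_prefix(package: str) -> str:
    """Previous behaviour: list by the first character only."""
    return package[:1]


def listing_prefix(package: str) -> str:
    """New behaviour: full name as prefix, with dashes replaced by underscores."""
    return package.replace("-", "_")


def list_keys(keys: list[str], prefix: str) -> list[str]:
    """Stand-in for a prefix-based bucket listing."""
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    keys = [
        "sentry_sdk-1.5.0-py2.py3-none-any.whl",
        "setuptools-59.0.1-py3-none-any.whl",
        "six-1.16.0-py2.py3-none-any.whl",
    ]
    print(list_keys(keys, old_prefix("sentry-sdk")))
    print(list_keys(keys, listing_prefix("sentry-sdk")))
```

With the single-letter prefix every key starting with `s` comes back; with the full prefix only the requested package's files do.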