Implement list interface for filedb #3329
Conversation
pipecd-bot left a comment
```go
	data []interface{}
}

func (it *Iterator) Next(dst interface{}) error {
```
`dst` is unused in `Next`.
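For illustration, a minimal sketch of how `Next` could consume `dst`, assuming the iterator keeps decoded entities in `data` and tracks its position with a hypothetical `index` field (the sentinel error here is an assumption, not the real datastore API):

```go
package filedb

import (
	"errors"
	"reflect"
)

// ErrIteratorDone is a hypothetical sentinel for this sketch.
var ErrIteratorDone = errors.New("iterator: no more elements")

type Iterator struct {
	data  []interface{} // decoded entities, assumed to be pointers to models
	index int           // hypothetical cursor over data
}

// Next copies the current element into dst and advances the cursor.
// It assumes dst and the stored elements are pointers to the same model type.
func (it *Iterator) Next(dst interface{}) error {
	if it.index >= len(it.data) {
		return ErrIteratorDone
	}
	dv := reflect.ValueOf(dst)
	if dv.Kind() != reflect.Ptr || dv.IsNil() {
		return errors.New("dst must be a non-nil pointer")
	}
	dv.Elem().Set(reflect.ValueOf(it.data[it.index]).Elem())
	it.index++
	return nil
}
```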
```go
parts, err := f.backend.List(ctx, dpath)
if err != nil {
	f.logger.Error("failed to find entities",
		zap.String("kind", kind),
		zap.Error(err),
	)
	return nil, err
}

if objects == nil {
	objects = make(map[string][][]byte, len(parts))
}
for _, obj := range parts {
	id := filepath.Base(obj.Path)

	data, err := f.fetch(ctx, obj.Path)
```
Oops. I was thinking that the List function would also return the data of the file objects, so that we could have their contents without any extra requests.
This way, a lot of requests (number_of_shards * (1 + number_of_objects)) will be made in a short time, and I don't think that is realistic.
Do we have any better idea for that problem?
I feel you, the (n+1) problem, right. Tbh, I thought the same at first, but when I read the place where we already use this filestore List interface (ref: planpreview cleaner) I got the same surprise as you have now 😄
I looked into the list APIs of the filestores we support (GCS, S3 and MinIO), and it looks like getting an object by key is the only way to fetch its raw data; the list APIs only return the attributes and metadata, which we can then use to fetch the content. I will investigate more, but in the worst case I think we have 2 points to rely on:
- the number of objects in the hot storage of each kind is expected to be small enough; if necessary, we could add some kind of middleware that fetches the object parts in parallel (see the sketch below)
- cache storage in the API layer will reduce the number of times we have to list directly.
Wdyt?
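For illustration, a rough sketch of that parallel-fetch middleware using errgroup to cap in-flight requests so we don't burst the file store. The `FileDB` and `ObjectAttrs` shapes and the `fetch` signature follow the diff above but are assumptions:

```go
package filedb

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Assumed minimal shapes for this sketch, mirroring the diff above.
type ObjectAttrs struct{ Path string }

type FileDB struct {
	fetch func(ctx context.Context, path string) ([]byte, error)
}

// fetchParts fetches the content of each listed object part with at most
// maxInFlight concurrent requests against the file store.
func (f *FileDB) fetchParts(ctx context.Context, parts []ObjectAttrs, maxInFlight int) ([][]byte, error) {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxInFlight) // cap concurrency to stay under provider rate limits

	data := make([][]byte, len(parts))
	for i, obj := range parts {
		i, obj := i, obj // capture loop variables for the goroutine
		g.Go(func() error {
			d, err := f.fetch(ctx, obj.Path)
			if err != nil {
				return err
			}
			data[i] = d
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return data, nil
}
```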
Thanks for your explanation.

> the number of objects in the hot storage of each kind is expected to be small enough; if necessary, we could add some kind of middleware that fetches the object parts in parallel

I'm afraid this way is not appropriate, since we would make a burst of requests against external services and could hit their rate limits.

> cache storage in the API layer will reduce the number of times we have to list directly.

Yes, that could be a workable approach. I think it is time to think about that before continuing the implementation.
What cache solution do you have in mind?
@nghialv thank you so much for your comment 🙏 I added logic that checks whether the raw data of an object part has been updated, based on the etag value returned by the List objects request. For now, we only fetch the object part stored under the given path when no version of it is found in filedb.cache. PTAL when you have time 🙌
Thank you. Let me take a look.
/hold
/hold cancel
pkg/datastore/filedb/filedb.go (Outdated)
```go
cdata, err := f.cache.Get(obj.Etag)
if err == nil {
	objects[id] = append(objects[id], cdata.([]byte))
	continue
}
```
How about grouping and storing all models of a kind using HashCache (https://github.com/pipe-cd/pipecd/blob/master/pkg/cache/rediscache/hashcache.go)?
Then we can fetch them all at once to check, instead of calling the cache for every object like this.
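For illustration, a rough sketch of the grouping idea; the hash-cache method names below are assumptions for this sketch, not the actual rediscache API (see the linked hashcache.go for the real one):

```go
// hashCache is a hypothetical stand-in for the rediscache HashCache.
type hashCache interface {
	GetAll(key string) (map[string]string, error)
	Put(key, field, value string) error
}

// listCached returns every cached model of a kind in a single round trip,
// keyed by object ID, instead of issuing one cache Get per object.
func listCached(hc hashCache, kind string) (map[string][]byte, error) {
	fields, err := hc.GetAll("HASHKEY:" + kind) // one request for the whole kind
	if err != nil {
		return nil, err
	}
	out := make(map[string][]byte, len(fields))
	for id, v := range fields {
		out[id] = []byte(v)
	}
	return out, nil
}
```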
And I think we need to design the key name carefully instead of directly using the etag value, to avoid conflicts with other keys in Redis. In other places we prefix the name.
I feel you. In that case, it's possible that some objects are updated while all the others are not, meaning we still have to loop to check which ones are updated. There is also a minor point to care about: since I store the whole object content as the value for the etag key, grouping everything into a single value and using GetAll is a bit dangerous when there are a lot of objects. Of course, the downside of the current implementation is that we may send a bunch of requests to the cache just to check whether each object is updated based on its etag. Wdyt about this trade-off 🤔
> And I think we need to design the key name carefully instead of directly using the etag value, to avoid conflicts with other keys in Redis. In other places we prefix the name.

Nice catch, let me address it 🙆♂️ Tbh, I made the key id_shard_etag at first but felt it was overdone; let me add the etag_ prefix to this key, as in other places.
> it's possible that some objects are updated while all the others are not

Yes, it is. For the outdated ones, we call the cache HSET to update them after fetching directly from the file store.

> grouping everything into a single value and using GetAll is a bit dangerous when there are a lot of objects

I see. By storing with HashCache, I mean the field key is the object ID instead of the etag; the etag is included in the value together with the object content. That way the number of entries will not increase when an object is updated.
When storing by etag, the number of entries in the cache will grow quickly whenever an object is updated. In that case we would need a TTL or something similar to deal with that problem.
Wdyt?
How about this:
- a normal cache (not HashCache)
- key: `entity_id_shard`
- value: `{etag: etag_value, data: data}`

HashCache only helps us reduce the number of cache fetch requests, but the downside is that we need two loops to decide what should be updated, and the value under the hash key can become too large since we store the whole object content.
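For illustration, a minimal sketch of this scheme, assuming a plain key/value cache with `Get`/`Put` methods and JSON-encoded entries (the `cachedObject` layout and the `getOrFetch` helper are hypothetical):

```go
package filedb

import (
	"context"
	"encoding/json"
)

// Assumed minimal shape for this sketch.
type FileDB struct {
	cache interface {
		Get(key string) (interface{}, error)
		Put(key string, value interface{}) error
	}
	fetch func(ctx context.Context, path string) ([]byte, error)
}

// cachedObject is the value stored per entity: the etag observed when the
// entry was written, plus the raw object content.
type cachedObject struct {
	Etag string `json:"etag"`
	Data []byte `json:"data"`
}

// getOrFetch reuses the cached copy while its etag still matches the one
// returned by the file store's List call, and refreshes it otherwise.
func (f *FileDB) getOrFetch(ctx context.Context, key, etag, path string) ([]byte, error) {
	if raw, err := f.cache.Get(key); err == nil {
		if b, ok := raw.([]byte); ok {
			var co cachedObject
			if json.Unmarshal(b, &co) == nil && co.Etag == etag {
				return co.Data, nil // still fresh, no file store request needed
			}
		}
	}
	// Cache miss or outdated etag: fetch from the file store and refresh.
	data, err := f.fetch(ctx, path)
	if err != nil {
		return nil, err
	}
	if b, err := json.Marshal(cachedObject{Etag: etag, Data: data}); err == nil {
		f.cache.Put(key, b) // best effort; a failed cache write is not fatal
	}
	return data, nil
}
```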
What I am concerned about with that way is that it still requires N requests to our Redis.
But it is definitely simpler, so let's apply it for now and see how it goes. 👍
> What I am concerned about with that way is that it still requires N requests to our Redis.

I feel you, then why not both 🤔
I mean we can make our cache a bit more complicated to handle this case. We would have:

a HashCache as you drafted, with:
- hash key: `List_{kind}`
- field key: `entityId_shard`
- field value: `etagValue`

a cache to store the object data, with:
- key: `etag_entityId` (I mean `etag_` is the prefix, not the value of the etag)
- value: `{etag: etagValue, data: data}`

And whenever we update the etagValue in the HashCache, we update the cache storing the object content as well. Wdyt 👀
Oops, I forgot that we would still need a separate cache request to get the content. Please forget the above suggestion 🙏
Updated, PTAL when you have time 😉
```go
}

func makeKey(shard datastore.Shard, id string) string {
	return fmt.Sprintf("filedb_object_%s_%s", id, shard)
```
Let's follow our key name convention.
https://github.com/pipe-cd/pipecd/blob/master/pkg/app/server/unregisteredappstore/store.go#L103
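For illustration, the fix might look something like the following sketch; the exact prefix format is an assumption here, and the linked store.go is the authoritative example of the convention:

```go
// Sketch only: shard and ID are joined under an explicit namespace prefix
// so filedb entries cannot collide with other keys in Redis (prefix format
// assumed, not the repo's actual convention).
func makeKey(shard datastore.Shard, id string) string {
	return fmt.Sprintf("FILEDB:OBJECT:%s:%s", shard, id)
}
```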
Sure 👍 Addressed by 2b6d569 🙏
Co-authored-by: Le Van Nghia <[email protected]>
Here you go.
The following issues will be created once this gets merged. If you want me to skip creating the issue, you can use …

1. Implement filterable interface for each collection. (pipecd/pkg/datastore/filedb/filter.go, lines 21 to 24 in 8a13375)

This was created by the todo plugin since "TODO:" was found in 8a13375 when #3329 was merged. cc: @khanhtc1202.
Here you go!
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: