Improve S3 Performance with More Variability in Cookbook Object Naming #130
Labels
Status: To be prioritized
Indicates that product needs to prioritize this issue.
Triage: Confirmed
Indicates an issue has been confirmed as described.
Triage: Try Reproducing
Indicates that this issue needs to be reproduced.
Type: Bug
Does not work as expected.
HelpSpot 18569
An example of the current URL format for the cookbook file objects we store on S3:
The portion of the key we are interested in is "opscode-platform-production-data/organization-3eca786473a44f68a91c1279ce6c845b"
We would get much better performance out of the gate if the bucket/org key instead used a reversed key suffix for the organization.
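As a sketch (in Ruby, since the customer's fix quoted below is a Ruby `.reverse!`), reversing the organization segment turns the low-entropy `organization-` prefix into a high-entropy one; the helper name here is hypothetical, but the input/output pair is the one the customer gives later in this thread:

```ruby
# Reverse the organization segment of the S3 key so the GUID's
# varying tail becomes the first bytes S3 uses when partitioning
# its key index. (Hypothetical helper name; the scheme itself is
# the one described in this issue.)
def reversed_org_key(org_segment)
  org_segment.reverse
end

reversed_org_key("organization-6213e346bef545b988d155a568d93d3e")
# => "e3d39d865a551d889b545feb643e3126-noitazinagro"
```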
Our customer recommends we try the "not reversed" version if the first get() fails (during the transition to the new storage system, while background S3 workers move things around).
More customer discussion on the issue follows:
I came across this message on the AWS support forums: https://forums.aws.amazon.com/thread.jspa?threadID=96847. I find it surprising that you didn't prefix the cookbook segment keys with more semi-random data to spread them across multiple S3 storage nodes; this is S3 101 when dealing with massive numbers of objects.
I don't really understand your point about S3 storage nodes. It is
transparent from an API point of view. S3 uses the first few bytes of the
keys to spread them across storage nodes. It's true that hitting different
storage nodes might result in inconsistencies as the data spreads, but I
suggest you read this document:
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
(I have successfully optimized a high-profile image-sharing service just by
adding ".reverse!" in the bit of code where they compute the S3 key for
the objects they were storing... their scheme was very similar to yours.)
"Amazon S3 maintains an index of object key names in each AWS region.
Object keys are stored lexicographically across multiple partitions in the
index. That is, Amazon S3 stores key names in alphabetical order. The key
name dictates which partition the key is stored in. Using a sequential
prefix, such as timestamp or an alphabetical sequence, increases the
likelihood that Amazon S3 will target a specific partition for a large
number of your keys, overwhelming the I/O capacity of the partition. If you
introduce some randomness in your key name prefixes, the key names, and
therefore the I/O load, will be distributed across more than one partition."
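To make the quoted point concrete, here is a small Ruby illustration using the key values from this thread: every cookbook key shares the literal `organization-` prefix, so they sort into the same neighborhood of S3's lexicographic index, whereas the reversed forms already diverge at the very first byte:

```ruby
keys = %w[
  organization-3eca786473a44f68a91c1279ce6c845b
  organization-6213e346bef545b988d155a568d93d3e
]

# Every key starts with the same 13-byte literal prefix, so S3's
# lexicographic index clusters them together...
keys.map { |k| k[0, 13] }.uniq    # => ["organization-"]

# ...but the reversed forms differ immediately at byte 0, spreading
# the keys across index partitions.
keys.map { |k| k.reverse[0] }     # => ["b", "e"]
```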
My solution to that was simply to store objects for
"organization-6213e346bef545b988d155a568d93d3e"
using "e3d39d865a551d889b545feb643e3126-noitazinagro", and to try the "not
reversed" version if the first get() failed (during the transition to the
new storage system, while background S3 workers were converting every key).
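The migration fallback described above could look something like this sketch; `store` and its `get` method are stand-ins for whatever S3 client wrapper is in use, not a real API:

```ruby
# Try the reversed key first; fall back to the original ("not
# reversed") key while background workers are still rewriting old
# objects. `store.get` is a hypothetical client call assumed to
# return nil when the object is missing.
def fetch_cookbook_object(store, org_segment, rest_of_key)
  store.get("#{org_segment.reverse}/#{rest_of_key}") ||
    store.get("#{org_segment}/#{rest_of_key}")
end
```

Once the migration finishes, the second `get` never fires, so steady-state reads cost a single request.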
Send me a beer when you get 10x S3 performance bump (no kidding)