Expand How to tune for disk usage #25562
@@ -135,20 +135,40 @@ PUT index
--------------------------------------------------
// CONSOLE
[float]
=== Watch your shard size

Larger shards are more efficient at storing data. It is not possible to give a precise recommendation for how large shards should be, and very large shards come with drawbacks (for example, longer recovery times), but shard sizes in the 20-30 GB range are generally achievable.
[float]
=== Disable `_all`

The <<mapping-all-field,`_all`>> field indexes the value of all fields of a
document and can use significant space. If you never need to search against all
fields at the same time, it can be disabled.
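As a minimal sketch of this setting, assuming the index name `index` from the snippet above and a hypothetical mapping type `my_type`, disabling `_all` at index creation might look like:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "my_type": {
      "_all": {
        "enabled": false
      }
    }
  }
}
--------------------------------------------------
// CONSOLE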
[float]
=== Disable `_source`

The <<mapping-source-field,`_source`>> field stores the original JSON body of the document. If you don't need access to it you can disable it. However, APIs that need access to `_source`, such as update and reindex, won't work.
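A sketch of disabling `_source`, again assuming a hypothetical mapping type `my_type` (remember that update and reindex will stop working on such an index):

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "my_type": {
      "_source": {
        "enabled": false
      }
    }
  }
}
--------------------------------------------------
// CONSOLE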
[float]
=== Use `best_compression`

The `_source` and stored fields can easily take a non-negligible amount of disk
space. They can be compressed more aggressively by using the `best_compression`
<<index-codec,codec>>.
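The codec is a static index setting, so it is set at index creation time. As a sketch:

[source,js]
--------------------------------------------------
PUT index
{
  "settings": {
    "index.codec": "best_compression"
  }
}
--------------------------------------------------
// CONSOLE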
[float]
=== Force Merge

Elasticsearch stores data in segments. Segments make up a Lucene index - a shard in Elasticsearch. The <<indices-forcemerge,`_forcemerge` API>> can be used to reduce the number of segments per shard. In many cases, the number of segments can be reduced to one per shard by setting `max_num_segments=1`.
Member: Perhaps turn this around a bit: Elasticsearch stores data in shards. Shards are Lucene indices and are composed of segments. Segments are the actual files on disk, etc.

Contributor (Author): Makes sense, how about: "Indices in Elasticsearch are stored in one or more shards. Each shard is a Lucene index and made up of one or more segments - the actual files on disk. Larger segments are more efficient for storing data. The <<indices-forcemerge,

Member: That sounds good to me.
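As a sketch of the call described above, forcing a merge down to a single segment per shard (index name hypothetical):

[source,js]
--------------------------------------------------
POST index/_forcemerge?max_num_segments=1
--------------------------------------------------
// CONSOLE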
[float]
=== Shrink Index

The <<indices-shrink-index,Shrink API>> allows you to reduce the number of shards in an index. Together with the Force Merge API above, this can significantly reduce the number of shards and segments of an index.
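A sketch of a shrink, assuming a hypothetical source index `index` and target `shrunk-index`; shrinking also requires the source index to be made read-only first (and its shards co-located on one node, a step omitted here):

[source,js]
--------------------------------------------------
PUT index/_settings
{
  "index.blocks.write": true
}

POST index/_shrink/shrunk-index
{
  "settings": {
    "index.number_of_shards": 1
  }
}
--------------------------------------------------
// CONSOLE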
[float]
=== Use the smallest numeric type that is sufficient
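The body of this section is not shown in this hunk. As a sketch of the idea named by the heading, a mapping could pick `byte` rather than `long` for a field whose values fit in a small range (field and type names hypothetical):

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "my_type": {
      "properties": {
        "day_of_month": { "type": "byte" }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE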
- I don't think we should specify a size here as it depends on too many factors. With fast replica recovery coming, that is another mitigating factor (#22484) to one of the drawbacks that you mention.

- @jasontedor Makes sense. Do you think we should mention an upper range, e.g. 50 GB?

- It's a good question but I don't think so. We are still fighting the "30 GB" heap recommendation; too many people see that number and think it's the magical number where they should set their heap, without enough consideration for all the factors involved. Instead, I think that the verbiage is good, but we should avoid enshrining specific numbers.