[DOCS] Add introduction to Elasticsearch. #43075
Merged
Commits (4):

* 4ce2dc7 [DOCS] Add introduction to Elasticsearch. (debadair)
* c3eb934 [DOCS] Incorporated review comments. (debadair)
* e62e863 [DOCS] Minor edits to add an abbreviated title and cross refs. (debadair)
* 15add27 [DOCS] Added sizing tips & link to quantitative sizing video. (debadair)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,244 @@ | ||||||
[[elasticsearch-intro]]
= You know, for search (and analysis)
[partintro]
--
{es} is the distributed search and analytics engine at the heart of
the {stack}. {ls} and {beats} facilitate collecting, aggregating, and
enriching your data and storing it in {es}. {kib} enables you to
interactively explore, visualize, and share insights into your data and to
manage and monitor the stack. {es} is where the indexing, search, and
analysis magic happens.
{es} provides real-time search and analytics for all types of data. Whether you
have structured or unstructured text, numerical data, or geospatial data,
{es} can efficiently store and index it in a way that supports fast searches.
You can go far beyond simple data retrieval and aggregate information to discover
trends and patterns in your data. And as your data and query volume grows, the
distributed nature of {es} enables your deployment to grow seamlessly right
along with it.
While not _every_ problem is a search problem, {es} offers speed and flexibility
to handle data in a wide variety of use cases:
* Add a search box to an app or website
* Store and analyze logs, metrics, and security event data
* Use machine learning to automatically model the behavior of your data in real
time
* Automate business workflows using {es} as a storage engine
* Manage, integrate, and analyze spatial information using {es} as a geographic
information system (GIS)
* Store and process genetic data using {es} as a bioinformatics research tool
We’re continually amazed by the novel ways people use search. Whether
your use case is similar to one of these, or you're using {es} to tackle a new
problem, the way you work with your data, documents, and indices in {es} is
the same.
--
[[documents-indices]]
== Data in: documents and indices
{es} is a distributed document store. Instead of storing information as rows of
columnar data, {es} stores complex data structures that have been serialized
as JSON documents. When you have multiple {es} nodes in a cluster, stored
documents are distributed across the cluster and can be accessed immediately
from any node.
When a document is stored, it is indexed and fully searchable in near
real-time--within 1 second. {es} uses a data structure called an
inverted index that supports very fast full-text searches. An inverted index
lists every unique word that appears in any document and identifies all of the
documents each word occurs in.
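A toy version makes the idea concrete. This is an illustrative sketch, not how {es} or Lucene actually store data on disk: each term maps to the set of document IDs it occurs in, so a term lookup immediately yields every matching document.

```python
from collections import defaultdict

# Toy inverted index (illustrative only): map each term to the set of
# document IDs that contain it.
docs = {
    1: "quick brown fox",
    2: "quick red fox",
    3: "slow brown bear",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Looking up a term returns every document it occurs in.
print(sorted(index["quick"]))   # [1, 2]
print(sorted(index["brown"]))   # [1, 3]
```

Real inverted indices also store positions and frequencies per term, which is what enables phrase queries and relevance scoring.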
An index can be thought of as an optimized collection of documents and each
document is a collection of fields, which are the key-value pairs that contain
your data. By default, {es} indexes all data in every field and each indexed
field has a dedicated, optimized data structure. For example, text fields are
stored in inverted indices, and numeric and geo fields are stored in BKD trees.
The ability to use the per-field data structures to assemble and return search
results is what makes {es} so fast.
{es} also has the ability to be schema-less, which means that documents can be
indexed without explicitly specifying how to handle each of the different fields
that might occur in a document. When dynamic mapping is enabled, {es}
automatically detects and adds new fields to the index. This default
behavior makes it easy to start indexing and exploring your data--just start
indexing documents and {es} will detect and map booleans, floating point and
integer values, dates, and strings to the appropriate {es} datatypes.
Ultimately, however, you know more about your data and how you want to use it
than {es} can. You can define rules to control dynamic mapping and use custom
mappings to take full control of how fields are stored and indexed.
Defining your own mappings enables you to:
* Distinguish between full-text string fields and exact value string fields
* Perform language-specific text analysis
* Optimize fields for partial matching
* Use custom date formats
* Use data types such as `geo_point` and `geo_shape` that cannot be automatically
detected
It’s often useful to index the same field in different ways for different
purposes. For example, you might want to index a string field as both a text
field for full-text search and as a keyword field for sorting or aggregating
your data. Or, you might choose to use more than one language analyzer to
process the contents of a string field that contains user input.
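A custom mapping along these lines could look like the following sketch, built here as a plain Python dict. The field names are illustrative, not from this document; the multi-field pattern (a `text` field with a `keyword` sub-field) is the standard way to index one string both for full-text search and for exact-value sorting.

```python
import json

# Hypothetical explicit mapping for an "employee" index (field names are
# made up for illustration). "last_name" is indexed twice: as analyzed
# text for full-text search, and as a keyword sub-field for sorting and
# aggregations. The date format and geo_point type are set explicitly
# because dynamic mapping could not infer them reliably.
mapping = {
    "mappings": {
        "properties": {
            "last_name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "hire_date": {"type": "date", "format": "yyyy-MM-dd"},
            "office": {"type": "geo_point"},
        }
    }
}

print(json.dumps(mapping, indent=2))
```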
The analysis chain that is applied to a full-text field during indexing is also
used at search time. When you query a full-text field, the query text undergoes
the same analysis before the terms are looked up in the index.
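The symmetry matters: if index-time and search-time analysis diverged, query terms would not line up with indexed terms. A toy analyzer (lowercasing plus whitespace tokenization, far simpler than a real {es} analyzer) shows the principle:

```python
# Toy analysis chain: lowercase, then split on whitespace. The key point
# from the text is that the SAME chain runs at index time and at query
# time, so the terms compared in the index line up exactly.
def analyze(text: str) -> list[str]:
    return text.lower().split()

indexed_terms = analyze("The Quick Brown Fox")
query_terms = analyze("QUICK fox")

# Every query term matches, despite the differing capitalization,
# because both sides were analyzed identically.
print(all(t in indexed_terms for t in query_terms))  # True
```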
[[search-analyze]]
== Information out: search and analyze
While you can use {es} as a document store and retrieve documents and their
metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.
{es} provides a simple, coherent REST API for managing your cluster and indexing
and searching your data. For testing purposes, you can easily submit requests
directly from the command line or through the Developer Console in {kib}. From
your applications, you can use the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python,
or Ruby.
[float]
[[search-data]]
=== Searching your data
The {es} REST APIs support structured queries, full text queries, and the
ability to combine them into more complex queries. Structured queries are
similar to the types of queries you can construct in SQL. For example, you
could search the `gender` and `age` fields in your `employee` index and sort the
matches by the `hire_date` field. Full-text queries find all documents that
match the query string and return them sorted by _relevance_--how good a match
they are for your search terms.
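A combined query along the lines of the employee example could be sketched as the following Query DSL request body, built here as a Python dict. The `about` field and the literal values are assumptions for illustration; the `bool`, `match`, and `range` constructs and the `sort` clause are standard Query DSL.

```python
import json

# Sketch of a combined query: a full-text "match" clause scored for
# relevance, a structured "range" filter (similar to a SQL WHERE), and
# results sorted by hire_date. Field names and values are illustrative.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"about": "search analytics"}}],
            "filter": [{"range": {"age": {"gte": 30}}}],
        }
    },
    "sort": [{"hire_date": {"order": "desc"}}],
}

print(json.dumps(query, indent=2))
```

Sent as the body of a search request against the `employee` index, this would return only documents matching both clauses, ordered by hire date.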
In addition to searching for individual terms, you can perform phrase searches,
similarity searches, prefix searches, and get autocomplete suggestions.
Have geospatial or other numerical data that you want to search? {es} indexes
non-textual data in highly optimized data structures that support
high-performance geo and numerical queries.
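For instance, a geo query might look like this sketch of a `geo_distance` request body (a standard Query DSL query type; the `location` field name and coordinates are assumptions for illustration):

```python
import json

# Hypothetical geo query: match documents whose "location" field
# (a geo_point) lies within 10 km of the given coordinate.
geo_query = {
    "query": {
        "geo_distance": {
            "distance": "10km",
            "location": {"lat": 52.37, "lon": 4.89},
        }
    }
}

print(json.dumps(geo_query, indent=2))
```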
You can access all of these search capabilities using {es}'s
comprehensive JSON-style query language (Query DSL). You can also
construct SQL-style queries to search and aggregate data natively inside
{es}, and JDBC and ODBC drivers enable a broad range of third-party
applications to interact with {es} via SQL.
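A SQL-style request body could be sketched as below; the table and column names are invented for illustration. Such a body is posted to the {es} SQL endpoint, which translates the statement into native queries and aggregations:

```python
import json

# Sketch of an {es} SQL request body (column/table names are made up).
# The familiar SELECT/WHERE/ORDER BY shape is translated by {es} into
# native search and aggregation operations.
sql_request = {
    "query": (
        "SELECT last_name, hire_date FROM employee "
        "WHERE age > 30 ORDER BY hire_date DESC LIMIT 10"
    )
}

print(json.dumps(sql_request))
```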
[float]
[[analyze-data]]
=== Analyzing your data
{es} aggregations enable you to build complex summaries of your data and gain
insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:
* How many needles are in the haystack?
* What is the average length of the needles?
* What is the median length of the needles, broken down by manufacturer?
* How many needles were added to the haystack in each of the last six months?
You can also use aggregations to answer more subtle questions, such as:
* What are your most popular needle manufacturers?
* Are there any unusual or anomalous clumps of needles?
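The first set of needle questions maps naturally onto standard aggregation types. The following request-body sketch (field names are assumptions) pairs each question with an aggregation: `value_count`, `avg`, `percentiles` nested under `terms`, and `date_histogram`:

```python
import json

# Sketch of an aggregations-only request body for the "needle" questions.
# Field names (needle_id, length_mm, manufacturer, added_at) are made up.
aggs_request = {
    "size": 0,  # return aggregations only, no individual hits
    "aggs": {
        # How many needles are in the haystack?
        "needle_count": {"value_count": {"field": "needle_id"}},
        # What is the average length of the needles?
        "avg_length": {"avg": {"field": "length_mm"}},
        # Median length, broken down by manufacturer (50th percentile
        # nested inside a terms bucket per manufacturer).
        "median_length_by_maker": {
            "terms": {"field": "manufacturer"},
            "aggs": {
                "median": {
                    "percentiles": {"field": "length_mm", "percents": [50]}
                }
            },
        },
        # Needles added per month.
        "added_per_month": {
            "date_histogram": {
                "field": "added_at",
                "calendar_interval": "month",
            }
        },
    },
}

print(json.dumps(aggs_request, indent=2))
```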
Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time.
Your reports and dashboards update as your data changes so you can take action
based on the latest information.
What’s more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same
data, in a single request. And because aggregations are calculated in the
context of a particular search, you’re not just displaying a count of all
four-star hotels, you’re displaying a count of the four-star hotels
that match your users' search criteria.
[float]
[[more-features]]
==== But wait, there’s more
Want to automate the analysis of your time-series data? You can use the machine
learning features to create accurate baselines of normal behavior in your data
and identify anomalous patterns. With machine learning, you can detect:
* Anomalies related to temporal deviations in values, counts, or frequencies
* Statistical rarity
* Unusual behaviors for a member of a population
And the best part? You can do this without having to specify algorithms, models,
or other data science-related configurations.
[[scalability]]
== Scalability and resilience: clusters, nodes, and shards
{es} is built to be always available and to scale with your needs. It does this
by being distributed by nature. You can add servers (nodes) to a cluster to
increase capacity, and {es} automatically distributes your data and query load
across all of the available nodes. There's no need to overhaul your application:
{es} knows how to balance multi-node clusters to provide scale and high
availability. The more nodes, the merrier.
How does this work? Under the covers, an {es} index is really just a logical
grouping of one or more physical shards, where each shard is actually a
self-contained index. By distributing the documents in an index across multiple
shards, and distributing those shards across multiple nodes, {es} can ensure
redundancy to protect against hardware failures and benefit from increased
query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
{es} automatically migrates shards to rebalance the cluster.
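The rule that assigns a document to a shard can be sketched as below. {es} routes by `hash(_routing) % number_of_primary_shards` (using a murmur3 hash of the routing value, which defaults to the document ID); this sketch substitutes Python's built-in `hash` purely for illustration:

```python
# Sketch of shard routing: shard = hash(routing) % number_of_primary_shards.
# {es} uses murmur3 on the routing value (default: the document ID);
# Python's hash() stands in here for illustration only.
def route(doc_id: str, num_primary_shards: int) -> int:
    return hash(doc_id) % num_primary_shards

# Each document maps to exactly one primary shard, and the mapping is
# stable -- it only changes if the primary shard count changes, which is
# why that count is fixed when an index is created.
shards_used = {route(f"doc-{i}", 3) for i in range(100)}
print(shards_used)  # some subset of {0, 1, 2}
```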
There are two types of shards: primaries and replicas. Each document in an index
belongs to one primary shard. A replica shard is a copy of a primary shard.
Replicas provide redundant copies of your data to protect against hardware
failure and provide increased capacity to serve read requests
like searching or retrieving a document.
The number of primary shards in an index is fixed at the time that an index is
created, but the number of replica shards can be changed at any time, without
interrupting indexing or query operations.
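In settings terms, that contrast could be sketched as follows (the index name in the comment is hypothetical; `number_of_shards` and `number_of_replicas` are the standard index settings):

```python
import json

# Primary count is set once, in the body used to create the index:
create_body = {
    "settings": {
        "number_of_shards": 3,      # fixed once the index is created
        "number_of_replicas": 1,    # can be changed at any time
    }
}

# Replica count can be raised later, e.g. via an update to the index
# settings (PUT /my-index/_settings), to scale reads without reindexing:
update_body = {"index": {"number_of_replicas": 2}}

print(json.dumps(create_body))
print(json.dumps(update_body))
```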
There are a number of performance considerations and trade-offs with respect
to shard size and the number of primary shards configured for an index. The more
shards, the more overhead there is simply in maintaining those indices. The
larger the shard size, the longer it takes to move shards around when {es}
needs to rebalance a cluster. Querying lots of small shards makes the processing
per shard faster, but more queries means more overhead, so querying a smaller
number of larger shards might be faster. In short...it depends. The best way
to determine the optimal configuration for your use case is through testing
with your own data and queries.
[float]
[[disaster-ccr]]
=== In case of disaster
For performance reasons, the nodes within a cluster need to be on the same
network. Balancing shards in a cluster across nodes in different data centers
simply takes too long. But high-availability architectures demand that you avoid
putting all of your eggs in one basket. In the event of a major outage in one
location, servers in another location need to be able to take over. Seamlessly.
The answer? Cross-cluster replication (CCR).
CCR provides a way to automatically synchronize indices from your primary cluster
to a secondary remote cluster that can serve as a hot backup. If the primary
cluster fails, the secondary cluster can take over. You can also use CCR to
create secondary clusters to serve read requests in geo-proximity to your users.
Cross-cluster replication is active-passive. The index on the primary cluster is
the active leader index and handles all write requests. Indices replicated to
secondary clusters are read-only followers.
[float]
[[admin]]
=== Care and feeding
As with any enterprise system, you need tools to secure, manage, and
monitor your {es} clusters. Security, monitoring, and administrative features
that are integrated into {es} enable you to use {kib} as a control center for
managing a cluster. Features like data rollups and index lifecycle
management help you intelligently manage your data over time.