fix build, titles, formatting
polyfractal committed May 30, 2014
1 parent 3f0c9b2 commit 6be6384
Showing 12 changed files with 46 additions and 47 deletions.
3 changes: 0 additions & 3 deletions 300_Aggregations/05_overview.asciidoc
@@ -1,7 +1,4 @@
-
== Elasticsearch offers more than just search
-
-
Up until this point, this book has been dedicated to search. With search,
we have a query and we wish to find a subset of documents which
match the query. We are looking for the proverbial needle(s) in the
14 changes: 7 additions & 7 deletions 300_Aggregations/15_concepts_buckets.asciidoc
@@ -1,5 +1,5 @@

-=== High-level concepts
+== High-level concepts

Like the query DSL, aggregations have a _composable_ syntax: independent units
of functionality can be mixed and matched to provide the custom behavior that
@@ -14,12 +14,12 @@ _Metrics_:: Statistics calculated on the documents in a bucket.
That's it! Every aggregation is simply a combination of one or more buckets
and zero or more metrics. To translate into rough SQL terms:

-[source]
-----
+[source,sql]
+--------------------------------------------------
SELECT COUNT(color) <1>
FROM table
GROUP BY color <2>
-----
+--------------------------------------------------
<1> `COUNT(color)` is equivalent to a metric
<2> `GROUP BY color` is equivalent to a bucket
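In the aggregation DSL, that SQL sketch translates roughly to a `terms` bucket
(an illustrative sketch; the index and field names are placeholders):

[source,js]
--------------------------------------------------
GET /table/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : { "field" : "color" }
        }
    }
}
--------------------------------------------------

Each bucket in the response carries a `doc_count`, which plays the role of
`COUNT(color)` in the SQL version.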

@@ -29,7 +29,7 @@ to `COUNT()`, `SUM()`, `MAX()`, etc

Let's dig into both of these concepts and see what they entail.

-==== Buckets
+=== Buckets

A bucket is simply a collection of documents that meet a certain criterion.

@@ -51,7 +51,7 @@ partition documents in many different ways (by hour, by most popular terms, by
age ranges, by geographical location, etc). But fundamentally they all operate
on the same principle: partitioning documents based on a criterion.

-==== Metrics
+=== Metrics

Buckets allow us to partition documents into useful subsets, but ultimately what
we want is some kind of _metric_ calculated on those documents in each bucket.
@@ -63,7 +63,7 @@ which are calculated using the document values. In practical terms, metrics all
you to calculate quantities such as the average salary, or the maximum sale price,
or the 95th percentile for query latency.

-==== Combining the two
+=== Combining the two

An aggregation is a combination of buckets and metrics. An aggregation may have
a single bucket, or a single metric, or one of each. It may even have multiple
4 changes: 2 additions & 2 deletions 300_Aggregations/20_basic_example.asciidoc
@@ -1,6 +1,6 @@
// This section feels like you're worrying too much about explaining the syntax, rather than the point of aggs. By this stage in the book, people should be used to the ES api, so I think we can assume more. I'd change the emphasis here and state that intention: we want to find out what the most popular colours are. To do that we'll use a "terms" agg, which counts up every term in the "color" field and returns the 10 most popular.
// Step two: Add a query, to show that the aggs are calculated live on the results from the user's query.
-=== Aggregation Test-drive
+== Aggregation Test-drive

We could spend the next few pages defining the various aggregations
and their syntax, but aggregations are truly best learned by example.
@@ -56,7 +56,7 @@ GET /cars/transactions/_search?search_type=count <1>

// Add the search_type=count thing as a sidebar, so it doesn't get in the way
<1> Because we don't care about search results, we are going to use the `count`
-<<search-type,`search_type`>, which will be faster.
+<<search-type,search_type>>, which will be faster.
<2> Aggregations are placed under the top-level `"aggs"` parameter (the longer `"aggregations"`
will also work if you prefer that)
<3> We then name the aggregation whatever we want -- "popular_colors" in this example
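Pieced together, the request the callouts describe looks roughly like the
following sketch (the body of the block is elided in this diff, so this is a
reconstruction from the callouts, not the verbatim snippet):

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "popular_colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}
--------------------------------------------------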
6 changes: 3 additions & 3 deletions 300_Aggregations/28_bucket_metric_list.asciidoc
@@ -1,6 +1,6 @@
// I'd limit this list to the metrics and rely on the obvious. You don't need to explain what min/max/avg etc are. Then say that we'll discusss these more interesting metrics in later chapters: cardinality, percentiles, significant terms. The buckets I'd mention under the relevant section, eg Histo & Range, etc

-=== Available Buckets and Metrics
+== Available Buckets and Metrics

There are a number of different buckets and metrics. The reference documentation
does a great job describing the various parameters and how they affect
@@ -9,7 +9,7 @@ link to the reference docs and provide a brief description. Skim the list
so that you know what is available, and check the reference docs when you need
exact parameters.

-==== Buckets
+=== Buckets

- http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-global-aggregation.html[Global]: includes all documents in your index
- http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html[Filter]: only includes documents that match
@@ -32,7 +32,7 @@ exact parameters.
- http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geohashgrid-aggregation.html[Geohash Grid]: partitions documents according to
what geohash grid they fall into

-==== Metrics
+=== Metrics

- Individual statistics: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-min-aggregation.html[Min], http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-max-aggregation.html[Max], http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-avg-aggregation.html[Avg], http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-sum-aggregation.html[Sum]
- http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-stats-aggregation.html[Stats]: calculates min/mean/max/sum/count of documents in bucket
5 changes: 2 additions & 3 deletions 300_Aggregations/30_histogram.asciidoc
@@ -16,8 +16,7 @@ The histogram works by specifying an interval. If we were histogram'ing sale
prices, you might specify an interval of 20,000. This would create a new bucket
every $20,000. Documents are then sorted into buckets.

-Since you've already seen a few examples of aggregations, we'll go straight to a
-nested example. For our dashboard, we want a bar chart of car sale prices, but we
+For our dashboard, we want a bar chart of car sale prices, but we
also want to know the top selling make per price range. This is easily accomplished
using a `terms` bucket nested inside the `histogram`:
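The request body is elided in this diff; a sketch of it, with field names
assumed from the surrounding text, might look roughly like:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "price": {
         "histogram": {
            "field": "price",
            "interval": 20000
         },
         "aggs": {
            "make": {
               "terms": {
                  "field": "make",
                  "size": 1
               }
            }
         }
      }
   }
}
--------------------------------------------------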

@@ -54,7 +53,7 @@ top make per price range
As you can see, our query is built around the "price" aggregation, which contains
a `histogram` bucket. This bucket requires a numeric field to calculate
buckets on, and an interval size. The interval defines how "wide" each bucket
-is. An interval of 20000 means we will have ranges [0-20000, 20000-40000, etc]
+is. An interval of 20000 means we will have ranges `[0-20000, 20000-40000, ...]`

Next, we define a nested bucket inside of the histogram. This is a `terms` bucket
over the "make" field. There is also a new "size" parameter, which defines how
23 changes: 13 additions & 10 deletions 300_Aggregations/35_date_histogram.asciidoc
@@ -1,5 +1,5 @@

-=== Looking at time
+== Looking at time

If search is the most popular activity in Elasticsearch, building date
histograms must be the second most popular. Why would you want to use a date
@@ -30,14 +30,15 @@ Technically, yes. A regular `histogram` bucket will work with dates. However,
it is not calendar-aware. With the `date_histogram`, you can specify intervals
such as `1 month`, which knows that February is shorter than December. The
`date_histogram` also has the advantage of being able to work with timezones,
-such as displaying a graph in the timezone of the user rather than the server.
+which allows you to customize graphs to the timezone of the user, not the server.
The regular histogram will interpret dates as numbers, which means you must specify
intervals in terms of milliseconds. And the aggregation doesn't know about
calendar intervals, which makes it largely useless for dates.
****

-Our first example will build a simple line chart: how many cars were sold each month?
+Our first example will build a simple line chart to answer the question:
+how many cars were sold each month?

[source,js]
--------------------------------------------------
@@ -65,7 +66,8 @@ per month. This will give us the number of cars sold in each month. An additio
dates are simply represented as a numeric value. This tends to make UI designers
grumpy, however, so a prettier format can be specified using common date formatting.

-The response is both expected and a little surprising:
+The response is both expected and a little surprising (see if you can spot
+the "surprise"):

[source,js]
--------------------------------------------------
@@ -117,19 +119,20 @@ The response is both expected and a little surprising:
The aggregation is represented in full. As you can see, we have buckets
which represent months, a count of docs in each month, and our pretty "key_as_string".

-==== Returning empty buckets
+=== Returning empty buckets

Notice something odd about that last response?

-Yep, that's right. We are missing months! By default, the `date_histogram`
-and (`histogram` too, for that matter) only returns buckets which have a non-zero
+Yep, that's right. We are missing a few months! By default, the `date_histogram`
+(and `histogram` too) only returns buckets which have a non-zero
document count.

This means your histogram will be a minimal response. Often, this is not the
behavior you actually want. For many applications, you would like to dump the
response directly into a graphing library without doing any post-processing.

-There are two additional parameters we can set which will provide this behavior:
+Essentially, we want buckets even if they have a count of zero. There are two
+additional parameters we can set which will provide this behavior:

[source,js]
--------------------------------------------------
@@ -153,7 +156,7 @@ GET /cars/transactions/_search?search_type=count
--------------------------------------------------
// SENSE: 300_Aggregations/35_date_histogram.json
<1> This parameter forces empty buckets to be returned
-<2> While this parameter forces the entire year to be returned
+<2> This parameter forces the entire year to be returned

The two additional parameters will force the response to return all months in the
year, regardless of their doc count. The `min_doc_count` is very understandable:
@@ -171,7 +174,7 @@ minimum value or _after_ the maximum value.
The `extended_bounds` parameter does just that. Once you add those two settings,
you'll get a response that is easy to plug straight into your graphing libraries.
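Concretely, the two settings sit alongside the other `date_histogram`
parameters, roughly like this (a sketch; the field name and bounds dates are
assumed for illustration):

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "month",
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0,
            "extended_bounds" : {
                "min" : "2014-01-01",
                "max" : "2014-12-31"
            }
         }
      }
   }
}
--------------------------------------------------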

-==== Extended Example
+=== Extended Example

Just like we've seen a dozen times already, buckets can be nested in buckets for
more sophisticated behavior. For illustration, we'll build an aggregation
6 changes: 3 additions & 3 deletions 300_Aggregations/40_scope.asciidoc
@@ -1,5 +1,5 @@

-=== Scoping Aggregations
+== Scoping Aggregations

With all of the aggregation examples given so far, you may have noticed that we
omitted a `query` from the search request. The entire request was
@@ -136,11 +136,11 @@ by adding a search bar. This allows the user to search for terms and see all
of the graphs (which are powered by aggregations, and thus scoped to the query)
update in real-time. Try that with Hadoop!

-<TODO> Maybe add two screenshots of a Kibana dashboard that changes considerably
+//<TODO> Maybe add two screenshots of a Kibana dashboard that changes considerably
when the search changes?


-==== Global Bucket
+=== Global Bucket

You'll often want your aggregation to be scoped to your query. But sometimes
you'll want to search for some subset of data, but aggregate across _all_ of
8 changes: 4 additions & 4 deletions 300_Aggregations/45_filtering.asciidoc
@@ -1,11 +1,11 @@

-=== Filtering Aggregations
+== Filtering Queries and Aggregations

A natural extension to aggregation scoping is filtering. Because the aggregation
operates in the context of the query scope, any filter applied to the query
will also apply to the aggregation.

-==== Filtered Query
+=== Filtered Query
If we want to find all cars over $10,000 and also calculate the average price
for those cars, we can simply use a `filtered` query:
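The request body is elided in this diff; a minimal sketch of it (field names
assumed) combines a `filtered` query with an `avg` metric:

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "filtered": {
            "filter": {
                "range": {
                    "price": {
                        "gte": 10000
                    }
                }
            }
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }
        }
    }
}
--------------------------------------------------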

@@ -36,7 +36,7 @@ query like we discussed in the last section. The query (which happens to includ
a filter) returns a certain subset of documents, and the aggregation operates
on those documents.

-==== Filter bucket
+=== Filter bucket

But what if you would like to filter just the aggregation results? Imagine we
are building the search page for our car dealership. We want to display
@@ -91,7 +91,7 @@ Since the `filter` bucket operates like any other bucket, you are free to nest
other buckets and metrics inside. All nested components will "inherit" the filter.
This allows you to filter selective portions of the aggregation as required.

-==== Post Filter
+=== Post Filter

So far, we have a way to filter both the search results and aggregations (a
`filtered` query), as well as filtering individual portions of the aggregation
18 changes: 6 additions & 12 deletions 300_Aggregations/50_sorting_ordering.asciidoc
@@ -1,5 +1,5 @@

-=== Sorting multi-value buckets
+== Sorting multi-value buckets

Multi-value buckets -- like the `terms`, `histogram` and `date_histogram` --
dynamically produce many buckets. How does Elasticsearch decide what order
@@ -12,7 +12,7 @@ criteria: price, population, frequency.
But sometimes you'll want to modify this sort order, and there are a few ways to
do it depending on the bucket.

-==== Intrinsic sorts
+=== Intrinsic sorts

These sort modes are "intrinsic" to the bucket; they operate on data that the bucket
generates such as `doc_count`. They share the same syntax but differ slightly
@@ -47,7 +47,7 @@ one of several values:
- `_key`: Sort by the numeric value of each bucket's key (conceptually similar to `_term`).
Works only with `histogram` and `date_histogram`
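For example, an intrinsic sort by ascending document count might look roughly
like this (a sketch; the field name is illustrative):

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order" : {
                "_count" : "asc"
              }
            }
        }
    }
}
--------------------------------------------------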

-==== Sorting by a metric
+=== Sorting by a metric

Often, you'll find yourself wanting to sort based on a metric's calculated value.
For our car sales analytics dashboard, we may want to build a bar chart of
@@ -86,14 +86,8 @@ the name of the metric. Some metrics, however, emit multiple values. The
`extended_stats` metric is a good example: it provides half a dozen individual
metrics.

-[INFO]
-.Applicable buckets
-====
-Metric-based sorting works with `terms`, `histogram` and `date_histogram`
-====
-
-If you want to sort on a multi-value metric, you just need to use the fully-qualified
-dot path:
+If you want to sort on a multi-value metric, you just need to use the
+dot-path to the metric of interest:

[source,js]
--------------------------------------------------
Expand Down Expand Up @@ -122,7 +116,7 @@ GET /cars/transactions/_search?search_type=count
In this example we are sorting on the variance of each bucket, so that colors
with the least variance in price will appear before those that have more variance.
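The dot-path order clause described here might look roughly like the following
sketch (the metric name `stats` and field names are assumed for illustration):

[source,js]
--------------------------------------------------
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order" : {
                "stats.variance" : "asc"
              }
            },
            "aggs": {
                "stats" : {
                    "extended_stats" : { "field" : "price" }
                }
            }
        }
    }
}
--------------------------------------------------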

-==== Sorting based on "deep" metrics
+=== Sorting based on "deep" metrics

In the prior examples, the metric was a direct child of the bucket. An average
price was calculated for each term. It is possible to sort on "deeper" metrics,
2 changes: 2 additions & 0 deletions 304_Approximate_Aggregations.asciidoc
@@ -1 +1,3 @@
+
+== Approximate Aggregations (todo)
TODO
2 changes: 2 additions & 0 deletions 305_Significant_Terms.asciidoc
@@ -1 +1,3 @@
+
+== Significant Terms (todo)
TODO
2 changes: 2 additions & 0 deletions 306_Practical_Considerations.asciidoc
@@ -1 +1,3 @@
+
+== Practical Considerations (todo)
TODO
