fix: update docs about partitioning. #3446

Open · wants to merge 11 commits into base: latest
113 changes: 48 additions & 65 deletions api/add_dimension.md

# add_dimension()

Add an additional partitioning dimension to a Timescale hypertable.

<Highlight type="cloud" header="These instructions are for self-hosted TimescaleDB deployments" button="Try Timescale Cloud">
Best practice is not to use additional dimensions. Timescale Cloud transparently provides seamless storage scaling,
in terms of both storage capacity and available storage IOPS/bandwidth.
</Highlight>

You can only execute this `add_dimension` command on an empty hypertable. To convert a normal table to a hypertable,
call [create hypertable][create_hypertable].

The column you select as the dimension can use either:

- Interval (range) partitions: for example, a second range partition.
- [hash partitions][hash-partition]: to enable parallelization across multiple disks.

<Highlight type="note">

This page describes the generalized hypertable API introduced in [TimescaleDB v2.13.0][rn-2130].

For information about the deprecated interface, see [add_dimension(), deprecated interface][add-dimension-old].
</Highlight>

### Hash partitions

Every distinct item in hash partitioning is hashed to one of *N* buckets. By default,
TimescaleDB uses flexible range intervals to manage chunk sizes. The main purpose of hash
partitioning is to enable parallelization across multiple disks within the same time
interval.
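As a rough sketch of the bucketing idea, PostgreSQL's built-in `hashtext` function can illustrate how a value maps to one of *N* buckets. This is an illustration only: TimescaleDB's internal hash function differs, but the principle is the same.

```sql
-- Illustration only: map a value to one of 4 buckets.
-- TimescaleDB uses its own internal hash, not hashtext.
SELECT mod(abs(hashtext('device_42')), 4) AS bucket;
```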

### Parallelizing disk I/O

Parallel I/O helps in the following scenarios:

- Two or more concurrent queries should be able to read from different disks in parallel.
- A single query should be able to use query parallelization to read from multiple disks in parallel.

To enable parallel I/O, choose one of the following options:

- **RAID**: use a RAID setup across multiple physical disks, and expose a single logical disk to the hypertable.
  That is, use a single tablespace.

  Best practice is to use RAID when possible, because RAID supports both the concurrent-query and single-query
  scenarios, and *no spatial partitioning is required*.

- **Multiple tablespaces**: for each physical disk, add a separate tablespace to the database. TimescaleDB allows you to add
  multiple tablespaces to a *single* hypertable. Under the hood, a hypertable's chunks are spread across the tablespaces
  associated with that hypertable.

  The multiple-tablespace approach supports only the concurrent-query scenario.
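A minimal sketch of the multiple-tablespace approach, assuming tablespaces `disk1` and `disk2` have already been created, each on its own physical disk:

```sql
-- Assumes tablespaces disk1 and disk2 already exist, one per physical disk.
SELECT attach_tablespace('disk1', 'conditions');
SELECT attach_tablespace('disk2', 'conditions');
```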

When using hash partitions, best practice is to have at least one hash partition per disk.

Set the number of partitions to a multiple of the number of disks. For example, the number of
partitions P = N × Pd, where N is the number of disks and Pd is the number of partitions per
disk. This enables you to add more disks later and move partitions from existing disks to the new disk.
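For example, with hypothetical numbers: N = 4 disks and Pd = 2 partitions per disk gives P = 8 hash partitions.

```sql
-- 4 disks × 2 partitions per disk = 8 hash partitions.
SELECT add_dimension('conditions', by_hash('device_id', 8));
```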


TimescaleDB does *not* benefit from a very large number of hash
partitions, such as the number of unique items you expect in the partition
field. A very large number of hash partitions leads both to poorer
per-partition load balancing (the mapping of items to partitions using
hashing), and to much increased planning latency for some types of
queries.
partitioning on `device_id`.

```sql
SELECT create_hypertable('conditions', by_range('time'));
SELECT add_dimension('conditions', by_hash('location', 2));
SELECT add_dimension('conditions', by_range('time_received', INTERVAL '1 day'));
SELECT add_dimension('conditions', by_hash('device_id', 2));
SELECT add_dimension('conditions', by_hash('device_id', 2), if_not_exists => true);
```

In a multi-node example for distributed hypertables with a cluster
of one access node and two data nodes, configure the access node for
access to the two data nodes. Then, convert table `conditions` to
a distributed hypertable with just range partitioning on column `time`,
with two partitions (as the number of the attached data nodes).
SELECT add_data_node('dn1', host => 'dn1.example.com');
SELECT add_data_node('dn2', host => 'dn2.example.com');
SELECT create_distributed_hypertable('conditions', 'time');
SELECT add_dimension('conditions', by_hash('location', 2));
```

[create_hypertable]: /api/:currentVersion:/hypertable/create_hypertable/
76 changes: 38 additions & 38 deletions api/dimension_info.md
# Dimension Builders

You call [`create_hypertable`][create_hypertable] and [`add_dimension`][add_dimension] to specify the dimensions to
partition a hypertable on. TimescaleDB supports partitioning [`by_range`][by-range] and [`by_hash`][by-hash]. You can
partition `by_range` on its own.

Hypertables must always have a primary range dimension, followed by an arbitrary number of additional dimensions that
can be either range or hash. Typically, this is just one hash dimension.
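A minimal sketch, using a hypothetical `metrics` table: the primary dimension is a range on `time`, with one additional hash dimension.

```sql
SELECT create_hypertable('metrics', by_range('time'));     -- primary range dimension
SELECT add_dimension('metrics', by_hash('device_id', 4));  -- additional hash dimension
```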


<Highlight type="tip">
For incompatible data types such as `jsonb`, you can specify a function to
the `partition_func` argument of the dimension builder to extract a compatible
data type. See the examples below.
</Highlight>

Dimension builders were introduced in TimescaleDB 2.13.
Contributor:
I am confused by the section on "dimension builders". I know it was in the docs before, but it seems the concept is never really introduced. What is a dimension builder and why do I care about it?

My understanding is that it is just a set of functions you use together with create_hypertable() and add_dimension(), but those pages don't even mention the concept or make any reference to this section although the "builder" functions are used in those sections.

I guess am wondering why we need a separate section on dimension builders when it is not mentioned elsewhere?

If we are sticking with this concept of dimension builders, I think this section needs a better introduction to what they are and why they are useful, and when to use them.

But, if it was up to me, I would just take what is useful here and merge it with the sections on create_hypertable and add_dimension. I would probably skip the concept of "dimension builder" and just explain how to use the functions by_range() and by_hash()

Contributor Author:
Thanks for this comment, it really helps. I have had another look over this, and from what I understand this is talking about the _timescaledb_internal.dimension_info type returned when you call create_hypertable or add_dimension. How about as you suggest I add any useful information to those functions, and move the rest of the info to https://docs.timescale.com/api/latest/informational-views/ as a dimension_info object?

My head hurts.




## Partition Function

If you do not set a custom partitioning function, TimescaleDB calls PostgreSQL's internal hash function for the
given type. Use a custom partitioning function for value types that do not have a native PostgreSQL hash
function.

You can specify a custom partitioning function for both
range and hash partitioning. A partitioning function should take an
`anyelement` argument as the only parameter and return a positive
`integer` hash value. This hash value is _not_ a partition identifier, but rather the
inserted value's position in the dimension's key space, which is then divided across
the partitions.
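A sketch of such a function, under the assumption that hashing the value's text representation is acceptable for your type. The function and column names are hypothetical.

```sql
-- Hypothetical custom partitioning function: hash the value's text form.
CREATE OR REPLACE FUNCTION my_partition_hash(val anyelement)
RETURNS integer AS $$
  SELECT abs(hashtext(val::text));
$$ LANGUAGE SQL IMMUTABLE;

-- Use it for a column type without a native PostgreSQL hash function.
SELECT add_dimension('conditions',
       by_hash('payload', 4, partition_func => 'my_partition_hash'));
```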


## by_range()

Create a by-range dimension builder that can be used with
[`create_hypertable`][create_hypertable] and [`add_dimension`][add_dimension].

### Required Arguments

information.

### Notes

Specify `partition_interval` based on the type of the column to be partitioned:

- `TIMESTAMP`, `TIMESTAMPTZ`, or `DATE`: specify `partition_interval` either as an `INTERVAL` type
or an integer value in *microseconds*.

- Another integer type: specify `partition_interval` as an integer that reflects the column's
  underlying semantics. For example, if this column stores milliseconds since the UNIX epoch, specify `partition_interval` in milliseconds.
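For example, for a hypothetical `events` table whose `ts_ms` column stores milliseconds since the UNIX epoch, a one-day interval is expressed in milliseconds:

```sql
-- 1 day = 24 × 60 × 60 × 1000 = 86400000 milliseconds.
SELECT create_hypertable('events', by_range('ts_ms', 86400000));
```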

The partition type and default value for each column type are:

| Column Type | Partition Type | Default value |
|------------------------------|------------------|---------------|
The simplest usage is to partition on a time column:
SELECT create_hypertable('my_table', by_range('time'));
```

In this case, the dimension builder can be excluded since by default,
`create_hypertable` assumes that a single provided column
is range partitioned by time.

If you have a table with a non-time column containing the time, such as
a JSON column, add a partition function to extract the time.

```sql
CREATE TABLE my_table (
SELECT create_hypertable('my_table', by_range('data', '1 day', 'get_time'));
```
A *dimension builder*: an opaque type `_timescaledb_internal.dimension_info` holding the dimension
information.

[create_hypertable]: /api/:currentVersion:/hypertable/create_hypertable/
[add_dimension]: /api/:currentVersion:/hypertable/add_dimension/
[dimension_builders]: /api/:currentVersion:/hypertable/dimension_info/
[by-range]: /api/:currentVersion:/hypertable/dimension_info/#by_range
[by-hash]: /api/:currentVersion:/hypertable/dimension_info/#by_hash

2 changes: 1 addition & 1 deletion use-timescale/page-index/page-index.js
module.exports = [
excerpt: "Change the schema of a hypertable",
},
{
title: "Index data",
href: "indexing",
excerpt: "Create an index on a hypertable",
},
13 changes: 7 additions & 6 deletions use-timescale/schema-management/indexing.md
products: [cloud, mst, self_hosted]
keywords: [hypertables, indexes]
---

# Index data

You can use an index on your database to speed up read operations. You can
create an index on any combination of columns, as long as you include the `time`
use this command:
CREATE INDEX ON conditions (time DESC);
```

When you create a hypertable with [`create_hypertable`][create_hypertable], and you
specify an optional hash partition in addition to time, such as a `location`
column, an additional index is created on the optional column and time. For
example:
```sql
SELECT create_hypertable('conditions', by_range('time'), create_default_indexes => false);
```

The [`by_range`][by-range] [dimension builder][dimension_builders] was introduced in TimescaleDB v2.13.


## Best practices for indexing

If you have sparse data with columns that are often NULL, you can add a clause
to the index, saying `WHERE column IS NOT NULL`. This prevents the index from
indexing NULL data, which can lead to a more compact and efficient index. For
example:
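A sketch of such a partial index, assuming a `conditions` table with an often-NULL `humidity` column:

```sql
-- Index only rows where humidity is present; NULL rows are skipped.
CREATE INDEX ON conditions (time DESC, humidity) WHERE humidity IS NOT NULL;
```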
to perform indexing transactions on an individual chunk.
[create_hypertable]: /api/:currentVersion:/hypertable/create_hypertable/
[about-index]: /use-timescale/:currentVersion:/schema-management/about-indexing/
[create-index]: https://docs.timescale.com/api/latest/hypertable/create_index/
[by-range]: /api/:currentVersion:/hypertable/dimension_info/#by_range
[dimension_builders]: /api/:currentVersion:/hypertable/dimension_info/