add some docs in Chapter Table_design (#20)

* add table_design, add materialized_view, bitmap_index, bloomfilter_index. * update bitmap index * update toc * add release note1.19.0 * update release notes with 1.19.1
StarRocks · Nov 2, 2021 · 88bd3aa · 88bd3aa
1 parent 826e339
commit 88bd3aa
Show file tree

Hide file tree

Showing 7 changed files with 349 additions and 0 deletions.
diff --git a/TOC.md b/TOC.md
@@ -15,5 +15,11 @@
   + [Data model](/table_design/Data_model.md)
   + [Data distribution](/table_design/Data_distribution.md)
   + [Sort key and shortkey index](/table_design/Sort_key.md)
+  + [Materialized view](/table_design/Materialized_view.md)
+  + [Bitmap indexing](/table_design/Bitmap_index.md)
+  + [Bloomfilter indexing](/table_design/Bloomfilter_index.md)
 + Development
   + [Build in docker](/development/Build_in_docker.md)
++ Release Notes
+  + [v1.19](release_notes/release-1.19.md)
+
diff --git a/assets/3.6.1-1.png b/assets/3.6.1-1.png
diff --git a/assets/3.7.1.png b/assets/3.7.1.png
diff --git a/release_notes/release-1.19.md b/release_notes/release-1.19.md
@@ -0,0 +1,54 @@
+# StarRocks version 1.19
+
+## v1.19.0
+
+Release date: Octorber 22, 2021
+
+### New Feature
+
+* Implement Global Runtime Filter, which can enable runtime filter for shuffle join.
+* CBO Planner is enabled by default, improved colocated join, bucket shuffle, statistical information estimation, etc.
+* [Experimental Function] Primary Key model release: To better support real-time/frequent update feature, StarRocks has added a new table type: primary key model. The model supports Stream Load, Broker Load, Routine Load and also provides a second-level synchronization tool for MySQL data based on Flink-cdc.
+* [Experimental Function] Support write function for external tables. Support writing data to another StarRocks cluster table by external tables to solve the read/write separation requirement and provide better resource isolation.
+
+### Improvement
+
+#### StarRocks
+
+* Performance optimization.
+  * count distinct int statement
+  * group by int statement
+  * or statement
+* Optimize disk balance algorithm. Data can be automatically balanced after adding disks to a single machine.
+* Support partial column export.
+* Optimize show processlist to show specific SQL.
+* Support multiple variable settings in SET_VAR .
+* Improve the error reporting information, including table_sink, routine load, creation of materialized view, etc.
+
+#### StarRocks-DataX Connector
+
+* Support setting interval flush StarRocks-DataX Writer.
+
+### Bugfix
+
+* Fix the issue that the dynamic partition table cannot be created automatically after the data recovery operation is completed. [# 337](https://github.com/StarRocks/starrocks/issues/337)
+* Fix the problem of error reported by row_number function after CBO is opened.
+* Fix the problem of FE stuck due to statistical information collection
+* Fix the problem that set_var takes effect for session but not for statements.
+* Fix the problem that select count(*) returns abnormality on the Hive partition external table.
+
+## v1.19.1
+
+Release date: November 2, 2021
+
+### Improvement
+
+* Optimize the performance of `show frontends`. [# 507](https://github.com/StarRocks/starrocks/pull/507) [# 984](https://github.com/StarRocks/starrocks/pull/984)
+* Add monitoring of slow queries. [# 502](https://github.com/StarRocks/starrocks/pull/502) [# 891](https://github.com/StarRocks/starrocks/pull/891)
+* Optimize the fetching of Hive external metadata to achieve parallel fetching.[# 425](https://github.com/StarRocks/starrocks/pull/425) [# 451](https://github.com/StarRocks/starrocks/pull/451)
+
+### BugFix
+
+* Fix the problem of Thrift protocol compatibility, so that the Hive external table can be connected with Kerberos. [# 184](https://github.com/StarRocks/starrocks/pull/184) [# 947](https://github.com/StarRocks/starrocks/pull/947) [# 995](https://github.com/StarRocks/starrocks/pull/995) [# 999](https://github.com/StarRocks/starrocks/pull/999)
+* Fix several bugs in view creation. [# 972](https://github.com/StarRocks/starrocks/pull/972) [# 987](https://github.com/StarRocks/starrocks/pull/987)[# 1001](https://github.com/StarRocks/starrocks/pull/1001)
+* Fix the problem that FE cannot be upgraded in grayscale. [# 485](https://github.com/StarRocks/starrocks/pull/485) [# 890](https://github.com/StarRocks/starrocks/pull/890)
diff --git a/table_design/Bitmap_index.md b/table_design/Bitmap_index.md
@@ -0,0 +1,66 @@
+# Bitmap Indexing
+
+StarRocks supports bitmap-based indexing that significantly speeds up queries.
+
+## Principle
+
+### What is  Bitmap
+
+Bitmap is an array of one-bit elements with 0 and 1 value. Those elements can be set and cleared. Bitmap can be used in the following scenarios:
+
+* Use two long types to represent the gender of 16 students. 0 represents female and 1 represents male.
+* A bitmap indicates whether there is a null value in a set of data, where 0 represents the element is not null and 1 represents null.
+* A bitmap indicates quarters (Q1, Q2, Q3, Q4), where 1 represents Q4 and 0 represents the other quarters.
+
+### What is  Bitmap index
+
+![Bitmap Index](/assets/3.6.1-1.png)
+
+Bitmap can only represent an array of columns with two values. When the values of the columns are of multiple enumeration types, such as quarters (Q1, Q2, Q3, Q4) and system platforms (Linux, Windows, FreeBSD, MacOS), it is not possible to encode them in a single Bitmap. In this case, you can create a Bitmap for each value and a dictionary of the actual enumerated values.
+
+As shown above, there are 4 rows of data in the `platform` column, and the possible values are `Android` and `iOS`. StarRocks will first build a dictionary for the `platform` column, and then map `Android` and `iOS` to int. `Android` and `iOS` are encoded as 0 and 1 respectively.Since `Android` appears in rows 1, 2 and 3, the bitmap is 0111; `iOS` appears in row 4, the bitmap is 1000.
+
+If there is a SQL query against the table containing the `platform` column, e.g. `select xxx from table where Platform = iOS`, StarRocks will first look up the dictionary to find the `iOS` whose encoding value is 1, and then go to the bitmap index to find out that 1 corresponds to a bitmap of 1000. As a result, StarRocks will only read the 4th row of data as it meets the query condition.
+
+## Suitable scenarios
+
+### Non-prefix filtering
+
+Referring to [shortkey index](/table_design/Sort_key.md), StarRocks can quickly filter the first few columns by shortkey indexing. However, for the columns that are in the middle or the end, shortkey indexing doesn’t work. Instead, users can create a bitmap index for filtering.
+
+### Multi-Column Filtering
+
+Since Bitmap can perform bitwise operations quickly, users can consider creating a bitmap index for each column in a multi-column filtering scenario.
+
+## How to use
+
+### Create an Index
+
+Create a bitmap index for the `site_id` column on table1.
+
+~~~ SQL
+CREATE INDEX index_name ON table1 (site_id) USING BITMAP COMMENT 'balabala';
+~~~
+
+### View an index
+
+Show the index under the specified `table_name`.
+
+~~~ SQL
+SHOW INDEX FROM example_db.table_name;
+~~~
+
+### Delete an index
+
+Delete an index with the specified name from a table.
+
+~~~ SQL
+DROP INDEX index_name ON [db_name.]table_name;
+~~~
+
+## Notes
+
+1. For the duplicate model, all columns can be bitmap indexed; for the aggregate model, only the key column can be Bitmap indexed.
+2. Bitmap indexes should be created on columns that have enumerated values, a large number of duplicate values, and a low base. These columns are used for equivalence queries or can be converted to equivalence queries.
+3. Bitmap indexes are not supported for Float, Double, or Decimal type columns.
+4. To see whether a query hits the Bitmap index, check the its profile information.
diff --git a/table_design/Bloomfilter_index.md b/table_design/Bloomfilter_index.md
@@ -0,0 +1,65 @@
+# Bloom Filter Indexing
+
+## Principle
+
+### What is Bloom Filter
+
+Bloom Filter is a data structure used to determine whether an element is in a collection. The advantage is that it is more space and time efficient, and the disadvantage is that it has a certain rate of misclassification.
+
+![bloomfilter](/assets/3.7.1.png)
+
+Bloom Filter is composed of a bit array and a number of hash functions. The bit array is initially set to 0. When an element is inserted, the hash functions (number of n) compute on the element and obtain the slots (number of n)The corresponding number of slots in the bit array is set to 1.
+
+To confirm whether an element is in the set, the system will calculate the Hash value based on the hash functions. If the hash values in the bloom filter have at least one 0, the element does not exist. When all the corresponding slots in the bit are 1, the existence of the element cannot be confirmed. This is because the number of slots in the bloom filter is limited, and it is possible that all the slots calculated from this element are the same as the slots calculated from another existing element. Therefore, in the all-1 case, we need to go back to the source to confirm the existence of the element.
+
+### What is Bloom Filter Indexing
+
+When creating a table in StarRocks, you can specify the columns to be indexed by BloomFilter through `PROPERTIES{"bloom_filter_columns"="c1,c2,c3"}`. BloomFilter can quickly confirm whether a certain value exists in a column when querying. If the bloom filter determines that the specified value does not exist in the column, there is no need to read the data file. If it is an all-1 situation, it needs to read the data block to confirm whether the target value exists. In addition, bloom filter indexes cannot determine which specific row of data has the specified value.
+
+## Suitable scenarios
+
+A bloom filter index can be built when the following conditions are met.
+
+1. Bloom Filter is suitable for non-prefix filtering.
+2. The query will be frequently filtered according to the column, and most of the query conditions are `in` and `=`.
+3. Unlike Bitmap, BloomFilter is suitable for columns with a high base number.
+
+## How to use
+
+### Create an index
+
+When creating a table, use `bloom_filter_columns`.
+
+~~~ SQL
+PROPERTIES ( "bloom_filter_columns"="k1,k2,k3" )
+~~~
+
+### View an index
+
+View the bloom filter indexes under the `table_name`.
+
+~~~ SQL
+SHOW CREATE TABLE table_name;
+~~~
+
+### Delete an Index
+
+Deleting an index means removing the index column from the `bloom_filter_columns` property:
+
+~~~ SQL
+ALTER TABLE example_db.my_table SET ("bloom_filter_columns" = "");
+~~~
+
+### Modify an index
+
+Modifying an index means modifying the `bloom_filter_columns` property.
+
+~~~SQL
+ALTER TABLE example_db.my_table SET ("bloom_filter_columns" = "k1,k2,k3");
+~~~
+
+## Notes
+
+* Bloom Filter indexing is not supported for Tinyint, Float, Double type columns.
+* Bloom Filter indexing only has an accelerating effect on `in` and `=` filter queries.
+* To see whether a query hits the bloom filter index, check its profile information.
diff --git a/table_design/Materialized_view.md b/table_design/Materialized_view.md
@@ -0,0 +1,158 @@
+# Materialized View
+
+## Termnology
+
+1. Duplicate model: The data model stores detailed data in StarRocks. The model can be specified during table build, and the data in the metric columns will not be aggregated.
+2. Base table: The table created by `CREATE TABLE` in StarRocks.
+3. Materialized Views table: Pre-calculated data set that contains results of a query. Abbreviated as MV.
+
+## Usage Scenarios
+
+In a practical business scenario, there are usually two coexisting use cases: 1) aggregation analysis of fixed dimensions and 2) analysis of arbitrary dimensions of the original detailed data.
+
+For example, consider an E-commerce business, the order data contains the following dimensions:
+
+* `item_id`
+* `sold_time`
+* `customer_id`
+* `price`
+
+The two mentioned use cases are both valid since the users need to:
+
+1. get the sales number of a certain item on a certain day, so we need to aggregate price on the `item_id` and `sold_time` dimensions;
+2. analyze the transaction details of a certain item by a certain person on a certain day.
+
+In the existing StarRocks data model, if you create only one table with aggregation model, for example a table with `item_id`, `sold_time`, `customer_id`, `sum(price)`, you are not able to analyze detailed data since the aggregation drops some information of the data. If only a duplicate model is built, you can run queries on all the dimension, but the queries will not be accelerated because of lack of Rollup support. If you build a table with the aggregation model and a table with the duplicate model at the same time, you can get both the performance and can run queries on all the dimensions, but the two tables are not related to each other, so you need to choose the analysis table manually. It doesn’t offer good flexibility and usability.
+
+## How to use
+
+Queries that use aggregation functions such as `sum` and `count` can be executed more efficiently in tables that already contain aggregated data. You would want the improved efficiency when querying large amounts of data. The data in the table is materialized in a storage node and can be kept consistent with the base table in incremental updates. After a user creates a MVs table, the query optimizer will select and query the most efficient MVs table instead of using the base table. Since the data in MVs tables is typically much smaller than the base table, queries are more efficient.
+
+### **Build a materialized view：**
+
+~~~sql
+CREATE MATERIALIZED VIEW materialized_view_name
+AS SELECT id, SUM(clicks) AS sum_clicks
+FROM  database_name.base_table
+GROUP BY id ORDER BY id’
+~~~
+
+The creation of a materialized view is currently an asynchronous operation. The command for materialized view creation returns immediately, but the creation may still be running. You can use the `DESC "base_table_name" ALL` command to see the current materialized view of the base table. You can use the `SHOW ALTER TABLE MATERIALIZED VIEW FROM "database_name"` command to check the status of the current and the historical materialized views.
+
+* **Restrictions:**
+
+  * The `partition column` in the base table must be one of  the `group by` columns of the created materialized view
+  * Currently, only single-table materialized views are supported. No multi-table joins.
+  * Aggregation is not supported for key columns, but only for value columns.The type of aggregation operator cannot be changed.
+  * The materialized view must contain at least one key column.
+  * Expression calculation is not supported.
+  * Queries a specified materialized view is not supported.
+
+### **DROP MATERIALIZED VIEW**
+
+`DROP MATERIALIZED VIEW [IF EXISTS] [db_name]. <mv_name>`
+
+### **View materialized views:**
+
+* View all materialized views under this database
+
+`SHOW MATERIALIZED VIEW [IN|FROM db_name]`
+
+* View the table structure of the specified materialized view
+
+`DESC table_name all`
+( `DESC/DESCRIBUE mv_name` is no longer supported)
+
+* View the processing progress of a materialized view
+
+`SHOW ALTER MATERIALIZED VIEW FROM db_name`
+
+* Cancel the materialized view being created
+
+`CANCEL ALTER MATERIALIZED VIEW FROM db_name.table_name`
+
+* Confirm which materialized views were hit by the query
+
+Users do not need to know the existence of the materialized view when running the queries, which means the name of the materialized view does not have to be explicitly specified. The query optimizer can automatically determine whether the query can be routed to the appropriate materialized view. Whether the query plan is rewritten or not, it can be seen with `explain SQL`, which can be executed on the MySQL client:
+
+~~~SQL
+Explain SQL:
+
+0:OlapScanNode
+TABLE: lineorder4
+PREAGGREGATION: ON
+partitions=1/1
+RollUptable: lineorder4
+tabletRatio=1/1
+tabletList=15184
+cardinality=119994608
+avgRowSize=26.375498
+numNodes=1
+tuple ids: 0
+~~~
+
+The `RollUptable` field indicates which materialized view was hit. If the `PREAGGREGATION` field is `On`, the query will be faster. If the `PREAGGREGATION` field is `Off`, the reason will be shown. For example:
+
+~~~ TEXT
+PREAGGREGATION: OFF. Reason: The aggregate operator does not match.
+~~~
+
+This indicates the physical view cannot be used in the StarRocks storage engine because the aggregation function of the query does not match the one defined in the physical view, and requires on-site aggregation.
+
+### **Importing Data**
+
+Incremental imports to the base table are applied to all associated MVs tables. Data cannot be queried until the import to the base table and all MVs tables are all complete. StarRocks ensures that the data is consistent between the base and MVs tables. There is no data difference between querying the base table and the MVs table.
+
+## Notes
+
+### **Materialized view function support**
+
+The materialized view must be an aggregation of a single table. Only the following [aggregation functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions) are supported currently.
+
+* COUNT
+* MAX
+* MIN
+* SUM
+* PERCENTILE_APPROX
+* HLL_UNION
+* Performs HLL aggregation on duplicate data and uses HLL functions to analyze querieddata. (This is mainly used for quick and non-exact de-duplication calculations. To use `HLL_UNION aggregation` for duplicate data, you need to call the `hll_hash` function first to transform the original data.)
+
+~~~SQL
+create materialized view dt_uv as
+    select dt, page_id, HLL_UNION(hll_hash(user_id))
+    from user_view
+    group by dt, page_id;
+select ndv(user_id) from user_view; #The query can hit the materialized view
+~~~
+
+* `HLL_UNION` aggregation operator is not supported for `DECIMAL`/`BITMAP`/`HLL`/`PERCENTILE` type columns currently.
+
+* BITMAP_UNION
+
+* Users can perform BITMAP aggregation on duplicate data and use the BITMAP function to analyze the data. It is mainly used to quickly calculate the exact de-duplication of `count(distinct)`. To use `BITMAP_UNION` aggregation on the duplicate data, you need to call the `to_bitmap function` first to convert the original data.
+
+~~~SQL
+create materialized view dt_uv as
+    select dt, page_id, bitmap_union(to_bitmap(user_id))
+    from user_view
+    group by dt, page_id;
+select count(distinct user_id) from user_view; #Query can hit the materialized view
+~~~
+
+* Currently, only `TINYINT`/`SMALLINT`/`INT`/`BITINT` types are supported, and the stored content must be a positive integer (including 0).
+
+### **Intelligent routing of materialized views**
+
+There is no need to explicitly specify a MV’s name during the query, StarRocks can intelligently route to the best MV based on the query SQL. The rules for MV selection are as follows.
+
+1. Select the MV that contains all query columns
+2. Select the most matching MV by the column defined in the query’s sorting and filtering condition.
+3. Select the most matching MV by the column defined in the query’s joining condition.
+4. Select the MV with the smallest number of rows
+5. Select the MV with the smallest number of columns
+
+### **Other restrictions**
+
+1. The RollUp table model must accommodate the type of the base table (Use aggregate model for aggregate tables and duplicate model for duplicate tables.).
+2. The `Delete` operation is not allowed if a key in the `where` condition does not exist in a RollUp table. In this case, users can delete the materialized view first, then perform the `delete` operation, and finally re-add the materialized view.
+3. If the materialized view contains columns of the `REPLACE` aggregation, it must contain all key columns.