Skip to content

Commit 907e24e

Browse files
nineinchnickmosabua
authored andcommitted
Document using statistics in the Faker connector
1 parent eedf7db commit 907e24e

File tree

4 files changed

+109
-7
lines changed

4 files changed

+109
-7
lines changed

docs/src/main/sphinx/connector/faker.md

Lines changed: 102 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,15 @@ The following table details all general configuration properties:
5050
* - `faker.locale`
5151
- Default locale for generating character-based data, specified as a IETF BCP
5252
47 language tag string. Defaults to `en`.
53+
* - `faker.sequence-detection-enabled`
54+
- If true, when creating a table using existing data, columns with the number
55+
of distinct values close to the number of rows are treated as sequences.
56+
Defaults to `true`.
57+
* - `faker.dictionary-detection-enabled`
58+
- If true, when creating a table using existing data, columns with a low
59+
number of distinct values are treated as dictionaries, and get
60+
the `allowed_values` column property populated with random values.
61+
Defaults to `true`.
5362
:::
5463

5564
The following table details all supported schema properties. If they're not
@@ -66,6 +75,15 @@ set, values from corresponding configuration properties are used.
6675
them, in any table of this schema.
6776
* - `default_limit`
6877
- Default number of rows in a table.
78+
* - `sequence_detection_enabled`
79+
- If true, when creating a table using existing data, columns with the number
80+
of distinct values close to the number of rows are treated as sequences.
81+
Defaults to `true`.
82+
* - `dictionary_detection_enabled`
83+
- If true, when creating a table using existing data, columns with a low
84+
number of distinct values are treated as dictionaries, and get
85+
the `allowed_values` column property populated with random values.
86+
Defaults to `true`.
6987
:::
7088

7189
The following table details all supported table properties. If they're not set,
@@ -82,6 +100,15 @@ values from corresponding schema properties are used.
82100
`null` in the table.
83101
* - `default_limit`
84102
- Default number of rows in the table.
103+
* - `sequence_detection_enabled`
104+
- If true, when creating a table using existing data, columns with the number
105+
of distinct values close to the number of rows are treated as sequences.
106+
Defaults to `true`.
107+
* - `dictionary_detection_enabled`
108+
- If true, when creating a table using existing data, columns with a low
109+
number of distinct values are treated as dictionaries, and get
110+
the `allowed_values` column property populated with random values.
111+
Defaults to `true`.
85112
:::
86113

87114
The following table details all supported column properties.
@@ -245,7 +272,7 @@ operation](sql-read-operations) statements to generate data.
245272
To define the schema for generating data, it supports the following features:
246273

247274
- [](/sql/create-table)
248-
- [](/sql/create-table-as)
275+
- [](/sql/create-table-as), see also [](faker-statistics)
249276
- [](/sql/drop-table)
250277
- [](/sql/create-schema)
251278
- [](/sql/drop-schema)
@@ -317,3 +344,77 @@ CREATE TABLE generator.default.customer (
317344
group_id INTEGER WITH (allowed_values = ARRAY['10', '32', '81'])
318345
);
319346
```
347+
348+
(faker-statistics)=
349+
### Using existing data statistics
350+
351+
The Faker connector automatically sets the `default_limit` table property, and
352+
the `min`, `max`, and `null_probability` column properties, based on statistics
353+
collected by scanning existing data read by Trino from the data source. The
354+
connector uses these statistics to be able to generate data that is more similar
355+
to the original data set, without using any of that data:
356+
357+
358+
```sql
359+
CREATE TABLE generator.default.customer AS
360+
SELECT *
361+
FROM production.public.customer
362+
WHERE created_at > CURRENT_DATE - INTERVAL '1' YEAR;
363+
```
364+
365+
Instead of using range, or other predicates, tables can be sampled,
366+
see [](tablesample).
367+
368+
When the `SELECT` statement doesn't contain a `WHERE` clause, a shorter notation
369+
can be used:
370+
371+
```sql
372+
CREATE TABLE generator.default.customer AS TABLE production.public.customer;
373+
```
374+
375+
The Faker connector detects sequence columns, which are integer column with the
376+
number of distinct values almost equal to the number of rows in the table. For
377+
such columns, Faker sets the `step` column property to 1.
378+
379+
Sequence detection can be turned off using the `sequence_detection_enabled`
380+
table, or schema property or in the connector configuration file, using the
381+
`faker.sequence-detection-enabled` property.
382+
383+
The Faker connector detects dictionary columns, which are columns of
384+
non-character types with the number of distinct values lower or equal to 1000.
385+
For such columns, Faker generates a list of random values to choose from, and
386+
saves it in the `allowed_values` column property.
387+
388+
Dictionary detection can be turned off using the `dictionary_detection_enabled`
389+
table, or schema property or in the connector configuration file, using
390+
the `faker.dictionary-detection-enabled` property.
391+
392+
For example, copy the `orders` table from the TPC-H connector with
393+
statistics, using the following query:
394+
395+
```sql
396+
CREATE TABLE generator.default.orders AS TABLE tpch.tiny.orders;
397+
```
398+
399+
Inspect the schema of the table created by the Faker connector:
400+
```sql
401+
SHOW CREATE TABLE generator.default.orders;
402+
```
403+
404+
The table schema should contain additional column and table properties.
405+
```
406+
CREATE TABLE generator.default.orders (
407+
orderkey bigint WITH (max = '60000', min = '1', null_probability = 0E0, step = '1'),
408+
custkey bigint WITH (allowed_values = ARRAY['153','662','1453','63','784', ..., '1493','657'], null_probability = 0E0),
409+
orderstatus varchar(1),
410+
totalprice double WITH (max = '466001.28', min = '874.89', null_probability = 0E0),
411+
orderdate date WITH (max = '1998-08-02', min = '1992-01-01', null_probability = 0E0),
412+
orderpriority varchar(15),
413+
clerk varchar(15),
414+
shippriority integer WITH (allowed_values = ARRAY['0'], null_probability = 0E0),
415+
comment varchar(79)
416+
)
417+
WITH (
418+
default_limit = 15000
419+
)
420+
```

docs/src/main/sphinx/sql/select.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1039,6 +1039,7 @@ ORDER BY regionkey FETCH FIRST ROW WITH TIES;
10391039
(5 rows)
10401040
```
10411041

1042+
(tablesample)=
10421043
## TABLESAMPLE
10431044

10441045
There are multiple sample methods:

plugin/trino-faker/src/main/java/io/trino/plugin/faker/FakerConfig.java

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -80,7 +80,7 @@ public boolean isSequenceDetectionEnabled()
8080
@ConfigDescription(
8181
"""
8282
If true, when creating a table using existing data, columns with the number of distinct values close to
83-
the number of rows will be treated as sequences""")
83+
the number of rows are treated as sequences""")
8484
public FakerConfig setSequenceDetectionEnabled(boolean value)
8585
{
8686
this.sequenceDetectionEnabled = value;
@@ -96,7 +96,7 @@ public boolean isDictionaryDetectionEnabled()
9696
@ConfigDescription(
9797
"""
9898
If true, when creating a table using existing data, columns with a low number of distinct values
99-
will have the allowed_values column property populated with random values""")
99+
are treated as dictionaries, and get the allowed_values column property populated with random values""")
100100
public FakerConfig setDictionaryDetectionEnabled(boolean value)
101101
{
102102
this.dictionaryDetectionEnabled = value;

plugin/trino-faker/src/main/java/io/trino/plugin/faker/FakerConnector.java

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -131,14 +131,14 @@ public List<PropertyMetadata<?>> getSchemaProperties()
131131
SchemaInfo.SEQUENCE_DETECTION_ENABLED,
132132
"""
133133
If true, when creating a table using existing data, columns with the number of distinct values close to
134-
the number of rows will be treated as sequences""",
134+
the number of rows are treated as sequences""",
135135
null,
136136
false),
137137
booleanProperty(
138138
SchemaInfo.DICTIONARY_DETECTION_ENABLED,
139139
"""
140140
If true, when creating a table using existing data, columns with a low number of distinct values
141-
will have the allowed_values column property populated with random values""",
141+
are treated as dictionaries, and get the allowed_values column property populated with random values""",
142142
null,
143143
false));
144144
}
@@ -163,14 +163,14 @@ public List<PropertyMetadata<?>> getTableProperties()
163163
TableInfo.SEQUENCE_DETECTION_ENABLED,
164164
"""
165165
If true, when creating a table using existing data, columns with the number of distinct values close to
166-
the number of rows will be treated as sequences""",
166+
the number of rows are treated as sequences""",
167167
null,
168168
false),
169169
booleanProperty(
170170
TableInfo.DICTIONARY_DETECTION_ENABLED,
171171
"""
172172
If true, when creating a table using existing data, columns with a low number of distinct values
173-
will have the allowed_values column property populated with random values""",
173+
are treated as dictionaries, and get the allowed_values column property populated with random values""",
174174
null,
175175
false));
176176
}

0 commit comments

Comments
 (0)