@@ -50,6 +50,15 @@ The following table details all general configuration properties:
5050* - `faker.locale`
5151 - Default locale for generating character-based data, specified as an IETF BCP
5252 47 language tag string. Defaults to `en`.
53+ * - `faker.sequence-detection-enabled`
54+ - If true, when creating a table using existing data, columns with the number
55+ of distinct values close to the number of rows are treated as sequences.
56+ Defaults to `true`.
57+ * - `faker.dictionary-detection-enabled`
58+ - If true, when creating a table using existing data, columns with a low
59+ number of distinct values are treated as dictionaries, and get
60+ the `allowed_values` column property populated with random values.
61+ Defaults to `true`.
5362:::
5463
5564The following table details all supported schema properties. If they're not
@@ -66,6 +75,15 @@ set, values from corresponding configuration properties are used.
6675 them, in any table of this schema.
6776* - ` default_limit `
6877 - Default number of rows in a table.
78+ * - `sequence_detection_enabled`
79+ - If true, when creating a table using existing data, columns with the number
80+ of distinct values close to the number of rows are treated as sequences.
81+ Defaults to `true`.
82+ * - `dictionary_detection_enabled`
83+ - If true, when creating a table using existing data, columns with a low
84+ number of distinct values are treated as dictionaries, and get
85+ the `allowed_values` column property populated with random values.
86+ Defaults to `true`.
6987:::
7088
7189The following table details all supported table properties. If they're not set,
@@ -82,6 +100,15 @@ values from corresponding schema properties are used.
82100 ` null ` in the table.
83101* - ` default_limit `
84102 - Default number of rows in the table.
103+ * - `sequence_detection_enabled`
104+ - If true, when creating a table using existing data, columns with the number
105+ of distinct values close to the number of rows are treated as sequences.
106+ Defaults to `true`.
107+ * - `dictionary_detection_enabled`
108+ - If true, when creating a table using existing data, columns with a low
109+ number of distinct values are treated as dictionaries, and get
110+ the `allowed_values` column property populated with random values.
111+ Defaults to `true`.
85112:::
86113
87114The following table details all supported column properties.
@@ -245,7 +272,7 @@ operation](sql-read-operations) statements to generate data.
245272To define the schema for generating data, it supports the following features:
246273
247274- [](/sql/create-table)
248- - [](/sql/create-table-as)
275+ - [](/sql/create-table-as), see also [](faker-statistics)
249276- [](/sql/drop-table)
250277- [](/sql/create-schema)
251278- [](/sql/drop-schema)
@@ -317,3 +344,77 @@ CREATE TABLE generator.default.customer (
317344 group_id INTEGER WITH (allowed_values = ARRAY['10', '32', '81'])
318345);
319346```
347+
348+ (faker-statistics)=
349+ ### Using existing data statistics
350+
351+ The Faker connector automatically sets the `default_limit` table property, and
352+ the `min`, `max`, and `null_probability` column properties, based on statistics
353+ collected by scanning existing data read by Trino from the data source. The
354+ connector uses these statistics to generate data that resembles the original
355+ data set, without using any of that data:
356+
357+
358+ ```sql
359+ CREATE TABLE generator.default.customer AS
360+ SELECT *
361+ FROM production.public.customer
362+ WHERE created_at > CURRENT_DATE - INTERVAL '1' YEAR;
363+ ```
364+
365+ Instead of filtering with range or other predicates, you can sample the
366+ source table, see [](tablesample).
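
As a minimal sketch of the sampling approach, the following statement reuses
the illustrative `production.public.customer` source table from the previous
example. The `BERNOULLI` method and the 10 percent sample size are arbitrary
assumptions, so adjust both to the data source:

```sql
-- Assumption: a 10 percent Bernoulli sample is enough to collect statistics
CREATE TABLE generator.default.customer AS
SELECT *
FROM production.public.customer TABLESAMPLE BERNOULLI (10);
```

Because row counts are collected from the data that is read, a sampled source
likely results in a proportionally smaller `default_limit` value.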
367+
368+ When the `SELECT` statement doesn't contain a `WHERE` clause, a shorter notation
369+ can be used:
370+
371+ ```sql
372+ CREATE TABLE generator.default.customer AS TABLE production.public.customer;
373+ ```
374+
375+ The Faker connector detects sequence columns, which are integer columns with the
376+ number of distinct values almost equal to the number of rows in the table. For
377+ such columns, Faker sets the `step` column property to 1.
378+
379+ Sequence detection can be turned off using the `sequence_detection_enabled`
380+ table or schema property, or in the connector configuration file using the
381+ `faker.sequence-detection-enabled` property.
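
For illustration, a sketch of turning off sequence detection for a single
table by setting the table property in the `CREATE TABLE AS` statement. The
source table name is the same illustrative one used earlier in this section:

```sql
-- Assumption: production.public.customer is an illustrative source table
CREATE TABLE generator.default.customer
WITH (sequence_detection_enabled = false)
AS
SELECT *
FROM production.public.customer;
```

With this setting, integer columns in the new table are not expected to get
the `step` column property from sequence detection.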
382+
383+ The Faker connector detects dictionary columns, which are columns of
384+ non-character types with the number of distinct values lower than or equal to
385+ 1000. For such columns, Faker generates a list of random values to choose from,
386+ and saves it in the `allowed_values` column property.
387+
388+ Dictionary detection can be turned off using the `dictionary_detection_enabled`
389+ table or schema property, or in the connector configuration file using
390+ the `faker.dictionary-detection-enabled` property.
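
As a sketch of the schema-level option, the following statements create a
schema with dictionary detection turned off and then copy a table into it.
The schema name `no_dictionaries` is a hypothetical name chosen for this
example. Because unset table properties fall back to the corresponding schema
properties, tables created in this schema should inherit the setting:

```sql
-- Assumption: no_dictionaries is a hypothetical schema name
CREATE SCHEMA generator.no_dictionaries
WITH (dictionary_detection_enabled = false);

-- Tables in this schema use the schema-level setting unless they override it
CREATE TABLE generator.no_dictionaries.orders AS TABLE tpch.tiny.orders;
```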
391+
392+ For example, copy the ` orders ` table from the TPC-H connector with
393+ statistics, using the following query:
394+
395+ ```sql
396+ CREATE TABLE generator.default.orders AS TABLE tpch.tiny.orders;
397+ ```
398+
399+ Inspect the schema of the table created by the Faker connector:
400+ ```sql
401+ SHOW CREATE TABLE generator.default.orders;
402+ ```
403+
404+ The table schema should contain additional column and table properties:
405+ ```
406+ CREATE TABLE generator.default.orders (
407+ orderkey bigint WITH (max = '60000', min = '1', null_probability = 0E0, step = '1'),
408+ custkey bigint WITH (allowed_values = ARRAY['153','662','1453','63','784', ..., '1493','657'], null_probability = 0E0),
409+ orderstatus varchar(1),
410+ totalprice double WITH (max = '466001.28', min = '874.89', null_probability = 0E0),
411+ orderdate date WITH (max = '1998-08-02', min = '1992-01-01', null_probability = 0E0),
412+ orderpriority varchar(15),
413+ clerk varchar(15),
414+ shippriority integer WITH (allowed_values = ARRAY['0'], null_probability = 0E0),
415+ comment varchar(79)
416+ )
417+ WITH (
418+ default_limit = 15000
419+ )
420+ ```