- column to column dataflow: the data of the target column comes from the source column (fdd).
- column to resultset (mainly the select list): the number of rows in the resultset is impacted by the source column (fdr).
- resultset to resultset: the number of rows in a source table in the from clause impacts the number of rows in the target select list (fdr).
- select list
- where clause
- function (case expression)
- group by (aggregate function)
- from clause
- handling of select * (not finished yet)
SELECT a.empName "eName"
FROM scott.emp a
The data of the target column "eName" comes from scott.emp.empName, so we have a dataflow relation like this:
scott.emp.empName -> fdd -> "eName"
The result generated by the select list is called a resultset. It is like a virtual table that includes columns and rows.
SELECT a.empName "eName"
FROM scott.emp a
WHERE sal > 1000
The total number of rows in the select list is impacted by the value of the column sal in the where clause, so we have a dataflow relation like this:
sal -> fdr -> resultset.PseudoRows
As you can see, we introduced a new pseudo column, PseudoRows, to represent the number of rows in the resultset.
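As a minimal sketch of this idea, assuming a simple in-memory model (the ResultSet class below is an illustration, not the data structure of any actual tool), a resultset and its PseudoRows pseudo column could be represented like this:

# A minimal sketch (assumed model) of a resultset and its PseudoRows pseudo column.
class ResultSet:
    def __init__(self, name, columns):
        self.name = name
        self.columns = list(columns)             # real columns from the select list
        self.pseudo_rows = name + ".PseudoRows"  # pseudo column: the row count

rs = ResultSet("resultset", ['"eName"'])
# sal impacts the row count rather than any column value, so the edge type is fdr:
edge = ("scott.emp.sal", "fdr", rs.pseudo_rows)
print(" -> ".join(edge))  # scott.emp.sal -> fdr -> resultset.PseudoRows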
During dataflow analysis, functions play a key role. A function accepts arguments, which are usually columns, and generates a resultset that may be a scalar value or a set value.
select round(salary) as sal from scott.emp
The relations of the round function in the above SQL:
scott.emp.salary -> fdd -> round(salary) -> fdd -> sal
SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
FROM scott.emp
WHERE city = 'NYC'
GROUP BY deptno
Since SUM() is an aggregate function, the deptno column in the group by clause will be treated as an implicit argument of the SUM() function. However, the deptno column doesn't directly contribute a value to the SUM() function as the column SAL does, so the relation type is fdr:
scott.emp.deptno -> fdr -> SUM(SAL) -> fdd -> sal_sum
The columns in the having clause have the same relation type as the columns in the group by clause, as mentioned above.
The value of the SUM() function is also affected by the total number of rows in the table scott.emp, so there is a relation like this:
scott.emp.pseudoRows -> fdr -> SUM(SAL) -> fdd -> sal_sum
The above rules apply to all aggregate functions, such as the COUNT() function in the SQL above.
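To make these aggregate rules concrete, here is a minimal Python sketch; the aggregate_edges function and the tuple representation are assumptions for illustration only:

def aggregate_edges(func, args, group_by_cols, table, alias):
    # Arguments contribute values directly (fdd); group by columns and the
    # table's pseudoRows only impact the result (fdr); the call result then
    # flows into its alias (fdd).
    call = func + "(" + ",".join(args) + ")"
    edges = [(arg, "fdd", call) for arg in args]
    edges += [(col, "fdr", call) for col in group_by_cols]
    edges.append((table + ".pseudoRows", "fdr", call))
    edges.append((call, "fdd", alias))
    return edges

for edge in aggregate_edges("SUM", ["SAL"], ["scott.emp.deptno"], "scott.emp", "sal_sum"):
    print(" -> ".join(edge))
# SAL -> fdd -> SUM(SAL)
# scott.emp.deptno -> fdr -> SUM(SAL)
# scott.emp.pseudoRows -> fdr -> SUM(SAL)
# SUM(SAL) -> fdd -> sal_sum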
If the resultset of a subquery or CTE is used in the from clause of the upper-level statement, the impact of the lower-level resultset will be transferred to the upper level.
WITH
cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel)
AS
(
SELECT EmployeeID, FirstName, LastName, ManagerID, 1 -- resultset1
FROM Employees
WHERE ManagerID IS NULL
)
SELECT
FirstName + ' ' + LastName AS FullName, EmpLevel -- resultset2
FROM cteReports
In the CTE, there is an impact relation:
Employees.ManagerID -> fdr -> resultset1.pseudoRows
Since cteReports is used in the from clause of the upper-level statement, the impact carries on like this:
Employees.ManagerID -> fdr -> resultset1.pseudoRows -> fdd -> resultset2.pseudoRows
If we choose to ignore the intermediate resultset, the end-to-end dataflow is:
Employees.ManagerID -> fdr -> resultset2.pseudoRows
Every relation in the SQL is picked up by the tool and connected together to show the whole dataflow chain. Sometimes we only need to see the end-to-end relation and ignore all the intermediate relations.
If we need to convert a fully chained dataflow to an end-to-end dataflow, we may consider the following rules:
- A single dataflow chain with mixed relation types fdd and fdr:
A -> fdd -> B -> fdr -> C -> fdd -> D
The rule is: if any fdr relation appears in the chain, the relation from A -> D will be considered as type fdr; otherwise, the end-to-end relation of A -> D is fdd.
- If there are multiple chains from A -> D:
A -> fdd -> B1 -> fdr -> C1 -> fdd -> D
A -> fdd -> B2 -> fdr -> C1 -> fdd -> D
A -> fdd -> B3 -> fdd -> C3 -> fdd -> D
The final relation should take the type of the fdd chain, if any exists.
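A minimal Python sketch of these two rules, assuming each chain is represented simply as the list of its relation types (an illustration, not the tool's internal representation):

def chain_type(chain):
    # Rule 1: any fdr in the chain makes the whole end-to-end chain fdr.
    return "fdr" if "fdr" in chain else "fdd"

def end_to_end_type(chains):
    # Rule 2: among multiple chains between the same endpoints,
    # a pure fdd chain wins if one exists.
    return "fdd" if any(chain_type(c) == "fdd" for c in chains) else "fdr"

print(chain_type(["fdd", "fdr", "fdd"]))  # fdr
print(end_to_end_type([
    ["fdd", "fdr", "fdd"],  # A -> B1 -> C1 -> D
    ["fdd", "fdr", "fdd"],  # A -> B2 -> C1 -> D
    ["fdd", "fdd", "fdd"],  # A -> B3 -> C3 -> D
]))  # fdd: the pure fdd chain decides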
select
case when a.kamut=1 and b.teur IS null
then 'no locks'
when a.kamut=1
then b.teur
else 'locks'
end teur
from tbl a left join TT b on (a.key=b.key)
During dataflow analysis, a case expression is treated as a function, and the columns used inside it are treated like the arguments of a function. So for the above SQL, the following relations are discovered:
tbl.kamut -> fdd -> teur
TT.teur -> fdd -> teur
Columns in the join condition also affect the number of rows in the resultset of the select list, just like columns in the where clause do. So the following relations will be discovered in the above SQL:
tbl.key -> fdr -> resultset.PseudoRows
TT.key -> fdr -> resultset.PseudoRows
create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
WHERE sal > 1000
From this query, you can see how the column sal in the where clause impacts the number of rows in the top-level view vEmp:
scott.emp.sal -> fdr -> resultset1.PseudoRows -> fdr -> vEmp.PseudoRows
So, from an end-to-end point of view, there will be an fdr relation between the column sal and the view vEmp like this:
scott.emp.sal -> fdr -> vEmp.PseudoRows
alter table t2 rename to t3;
We also use PseudoRows to represent the relation when renaming a table; the relation type is fdd.
t2.PseudoRows -> fdd -> t3.PseudoRows
create or replace stage exttable_part_stage
url='s3://load/encrypted_files/'
credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z')
encryption=(type='AWS_SSE_KMS' kms_key_id = 'aws/key');
create external table exttable_part(
date_part date as to_date(split_part(metadata$filename, '/', 3)
|| '/' || split_part(metadata$filename, '/', 4)
|| '/' || split_part(metadata$filename, '/', 5), 'YYYY/MM/DD'),
timestamp bigint as (value:timestamp::bigint),
col2 varchar as (value:col2::varchar))
partition by (date_part)
location=@exttable_part_stage/logs/
auto_refresh = true
file_format = (type = parquet);
The data of the external table exttable_part comes from the stage exttable_part_stage:
exttable_part_stage (url='s3://load/encrypted_files/') -> fdd -> exttable_part(date_part,timestamp,col2)
CREATE EXTERNAL TABLE dataset.CsvTable OPTIONS (
format = 'CSV',
uris = ['gs://bucket/path1.csv', 'gs://bucket/path2.csv']
);
The data of the external table dataset.CsvTable comes from the CSV files gs://bucket/path1.csv and gs://bucket/path2.csv:
file (uri='gs://bucket/path1.csv') -> fdd -> dataset.CsvTable
file (uri='gs://bucket/path2.csv') -> fdd -> dataset.CsvTable
CREATE TABLE masteTable
(
masterColumn varchar(3) PRIMARY KEY
);
CREATE TABLE foreignTable
(
foreignColumn1 varchar(3) NOT NULL,
foreignColumn2 varchar(3) NOT NULL,
FOREIGN KEY (foreignColumn1) REFERENCES masteTable(masterColumn),
FOREIGN KEY (foreignColumn2) REFERENCES masteTable(masterColumn)
);
The data flow is:
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn1
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn2
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
The data flow is:
file (/tmp/pv_2008-06-08_us.txt) -> fdd -> page_view(date,country)
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
The data flow is:
pv_gender_sum(*) -> fdd -> file ('/tmp/pv_gender_sum')
The meaning of the letters in fdd and fdr: f: dataflow, d: data value, r: record set.
The first letter is always f, the second letter represents the source column, the third letter represents the target column, and the fourth is reserved.
- fdd: data of the source column will be used in the target column.
- fdr: data of the source column will impact the number of rows in the resultset of the select list, or will impact the result value of an aggregate function.
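As a small illustrative sketch (the describe function below is hypothetical), the letter scheme can be decoded programmatically:

LETTERS = {"f": "dataflow", "d": "data value", "r": "record set"}

def describe(code):
    # code[0] is always f (dataflow); code[1] describes the source side and
    # code[2] the target side; a fourth letter is reserved.
    return "%s: source is a %s, target is a %s" % (
        LETTERS[code[0]], LETTERS[code[1]], LETTERS[code[2]])

print(describe("fdd"))  # dataflow: source is a data value, target is a data value
print(describe("fdr"))  # dataflow: source is a data value, target is a record set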