ChEMBL relational database ER diagram #1306

bioada · 2023-01-02T17:10:39Z

Hello,

Unfortunately, the ER diagram does not follow the Normal Form rules and has many errors. It means the resultsets from the JOIN queries are not reliable.

Best,
Saed

bioada · 2023-01-11T15:08:41Z

Before I explain any specific normalization issue with the ChEMBL ER diagram, I define a few terms for clarification.

Normalization:
A process of organizing the data in database to avoid data redundancy and insertion, update and deletion anomaly.

First Normal Form (1NF):
There should be only one value in an intersection between a row (entity/record/tuple) and a column (attribute/field)

2NF:
1NF and no partial dependencies. In a table with a combined primary keys, the non-prime attributes should have dependency to all keys in the primary keys.

3NF:
2NF and no transitive dependencies. There should not be a dependency between non-prime attributes

4NF:
3NF and no multivalued dependencies. There should not be more than one many-to-many relationship in a table.

5NF:
4NF and by joining the normalized tables we should not lose or gain rows

bioada · 2023-01-11T15:30:03Z

Uniqueness of Rows:
Before even starting the process of normalization, we need to make sure every row in a table is unique. This uniqueness MUST be based on a set attributes (columns/fields) and NOT based on a proxy attribute such as unique integer or string keys. Unfortunately, this is a mistake in many relational databases. Let me explain it with an example.

We have a table called Winner with two columns (FirstName and LastName) and the following records:
Michael, Jordan
Mohammad, Ali
Michael, Jordan
Tim, Clark
Milos, Vesna
Michael, Jordan

Now, to make it unique for add another column called WinnerID as integer and we make it the primary key.

1, Michael, Jordan
2, Mohammad, Ali
3, Michael, Jordan
4, Tim, Clark
5, Milos, Vesna
6, Michael, Jordan

Now, if the winner number is 6 and we call the winner "Michael Jordan", the question is which "Michael Jordan"?
If you say we send the WinnerID to every individual, it means you have a hidden knowledge about every person uniqueness but your table does not know.

WinnerID is a proxy unique key and is nothing to do with a real person and should not be created before defining a set of non-proxy attributes as a unique key (candidate key).

bioada · 2023-01-11T15:41:32Z

Problem:No unique key(s)

Table: target_dictionary

Proxy Columns:
tid and chembl_id are proxy unique keys and cannot be used for confirming uniqueness of rows in the table.

A combined key of the following columns still is not unique:
target_type
pref_name
tax_id
organism
species_group_flag

Conclusion:
target_dictionary does not have unique rows and cannot be included in an ER diagram (a relational database)

Unfortunately, there are many more tables in the ChEMBL database that have the same above issue.

When we proved that all records in target_dictionary table are unique, then we are allowed to use "tid" or "chembl_id" as a proxy primary key. anywhere in the ER diagram.

bioada · 2023-01-13T17:10:16Z

Why there is a table in a relational database only with one column?
table: actiivity_smid
column: smid

BZdrazil assigned eloyfelix and BZdrazil Jan 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ChEMBL relational database ER diagram #1306

ChEMBL relational database ER diagram #1306

bioada commented Jan 2, 2023

bioada commented Jan 11, 2023

bioada commented Jan 11, 2023

bioada commented Jan 11, 2023 •

edited

Loading

bioada commented Jan 13, 2023

ChEMBL relational database ER diagram #1306

ChEMBL relational database ER diagram #1306

Comments

bioada commented Jan 2, 2023

bioada commented Jan 11, 2023

bioada commented Jan 11, 2023

bioada commented Jan 11, 2023 • edited Loading

bioada commented Jan 13, 2023

bioada commented Jan 11, 2023 •

edited

Loading