Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChEMBL relational database ER diagram #1306

Open
bioada opened this issue Jan 2, 2023 · 4 comments
Open

ChEMBL relational database ER diagram #1306

bioada opened this issue Jan 2, 2023 · 4 comments
Assignees

Comments

@bioada
Copy link

bioada commented Jan 2, 2023

Hello,

Unfortunately, the ER diagram does not follow the Normal Form rules and has many errors. It means the resultsets from the JOIN queries are not reliable.

Best,
Saed

@bioada
Copy link
Author

bioada commented Jan 11, 2023

Before I explain any specific normalization issue with the ChEMBL ER diagram, I define a few terms for clarification.

Normalization:
A process of organizing the data in database to avoid data redundancy and insertion, update and deletion anomaly.

First Normal Form (1NF):
There should be only one value in an intersection between a row (entity/record/tuple) and a column (attribute/field)

2NF:
1NF and no partial dependencies. In a table with a combined primary keys, the non-prime attributes should have dependency to all keys in the primary keys.

3NF:
2NF and no transitive dependencies. There should not be a dependency between non-prime attributes

4NF:
3NF and no multivalued dependencies. There should not be more than one many-to-many relationship in a table.

5NF:
4NF and by joining the normalized tables we should not lose or gain rows

@bioada
Copy link
Author

bioada commented Jan 11, 2023

Uniqueness of Rows:
Before even starting the process of normalization, we need to make sure every row in a table is unique. This uniqueness MUST be based on a set attributes (columns/fields) and NOT based on a proxy attribute such as unique integer or string keys. Unfortunately, this is a mistake in many relational databases. Let me explain it with an example.

We have a table called Winner with two columns (FirstName and LastName) and the following records:
Michael, Jordan
Mohammad, Ali
Michael, Jordan
Tim, Clark
Milos, Vesna
Michael, Jordan

Now, to make it unique for add another column called WinnerID as integer and we make it the primary key.

1, Michael, Jordan
2, Mohammad, Ali
3, Michael, Jordan
4, Tim, Clark
5, Milos, Vesna
6, Michael, Jordan

Now, if the winner number is 6 and we call the winner "Michael Jordan", the question is which "Michael Jordan"?
If you say we send the WinnerID to every individual, it means you have a hidden knowledge about every person uniqueness but your table does not know.

WinnerID is a proxy unique key and is nothing to do with a real person and should not be created before defining a set of non-proxy attributes as a unique key (candidate key).

@bioada
Copy link
Author

bioada commented Jan 11, 2023

Problem:No unique key(s)

Table: target_dictionary

Proxy Columns:
tid and chembl_id are proxy unique keys and cannot be used for confirming uniqueness of rows in the table.

A combined key of the following columns still is not unique:
target_type
pref_name
tax_id
organism
species_group_flag

Conclusion:
target_dictionary does not have unique rows and cannot be included in an ER diagram (a relational database)

Unfortunately, there are many more tables in the ChEMBL database that have the same above issue.

When we proved that all records in target_dictionary table are unique, then we are allowed to use "tid" or "chembl_id" as a proxy primary key. anywhere in the ER diagram.

@bioada
Copy link
Author

bioada commented Jan 13, 2023

Why there is a table in a relational database only with one column?
table: actiivity_smid
column: smid

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants