fix issue #19, update promoters metadata (#78)
* removes promoters, fixing issue #19

* updates to ml.bio promoters metadata

* remove the column instance

* remove promoters from dataset list

* remove instance from feature list

Co-authored-by: Trang Le <[email protected]>
lacava and trangdata authored Aug 18, 2020
1 parent 94187a0 commit 223569e
Showing 7 changed files with 87 additions and 149 deletions.
97 changes: 87 additions & 10 deletions datasets/molecular_biology_promoters/metadata.yaml
@@ -1,19 +1,96 @@
#Generated automatically by pmlb/write_metadata.py
dataset: molecular_biology_promoters
description: None yet. See our contributing guide to help us add one.
source: None yet. See our contributing guide to help us add one.
publication: None yet. See our contributing guide to help us add one.
description: >
Title of Database: E. coli promoter gene sequences (DNA)
with associated imperfect domain theory
Past Usage:
(a) biological:
-- Harley, C. and Reynolds, R. 1987.
"Analysis of E. Coli Promoter Sequences."
Nucleic Acids Research, 15:2343-2361.
machine learning:
-- Towell, G., Shavlik, J. and Noordewier, M. 1990.
"Refinement of Approximate Domain Theories by Knowledge-Based
Artificial Neural Networks." In Proceedings of the Eighth National
Conference on Artificial Intelligence (AAAI-90).
(b) attributes predicted: member/non-member of class of sequences with
biological promoter activity (promoters initiate the process of gene
expression).
(c) Results of study indicated that machine learning techniques (neural
networks, nearest neighbor, contributors' KBANN system) performed as
well/better than classification based on canonical pattern matching
(method used in biological literature).
Relevant Information Paragraph:
This dataset has been developed to help evaluate a "hybrid" learning
algorithm ("KBANN") that uses examples to inductively refine preexisting
knowledge. Using a "leave-one-out" methodology, the following errors
were produced by various ML algorithms. (See Towell, Shavlik, &
Noordewier, 1990, for details.)
System Errors Comments
------ ------ --------
KBANN 4/106 a hybrid ML system
BP 8/106 std backprop with one hidden layer
O'Neill 12/106 ad hoc technique from the bio. lit.
Near-Neigh 13/106 a nearest-neighbor algo (k=3)
ID3 19/106 Quinlan's decision-tree builder
Type of domain: non-numeric, nominal (one of A, G, T, C)
-- Note: DNA nucleotides can be grouped into a hierarchy, as shown below:
X (any)
/ \
(purine) R Y (pyrimidine)
/ \ / \
A G T C
Number of Instances: 106
Number of Attributes: 59
-- class (positive or negative)
-- instance name
-- 57 sequential nucleotide ("base-pair") positions
Attribute information:
-- Statistics for numeric domains: No numeric features used.
-- Statistics for non-numeric domains
-- Frequencies: Promoters Non-Promoters
--------- -------------
A 27.7% 24.4%
G 20.0% 25.4%
T 30.2% 26.5%
C 22.1% 23.7%
Attribute #: Description:
============ ============
1 One of {+/-}, indicating the class ("+" = promoter).
2 The instance name (non-promoters named by position in the
1500-long nucleotide sequence provided by T. Record).
3-59 The remaining 57 fields are the sequence, starting at
position -50 (p-50) and ending at position +7 (p7). Each of
these fields is filled by one of {a, g, t, c}.
Code: a=0; c=1; g=2; t=3.
source: >
Creators:
1. promoter instances: C. Harley (CHARLEY '@' McMaster.CA) and R. Reynolds
2. non-promoter instances and domain theory: M. Noordewier
-- (non-promoters derived from work of lab of Prof. Tom Record, University of Wisconsin Biochemistry Department)
publication: >
Harley, C. and Reynolds, R. 1987. "Analysis of E. Coli Promoter Sequences." Nucleic Acids Research, 15:2343-2361.
Towell, G., Shavlik, J. and Noordewier, M. 1990. "Refinement of Approximate Domain Theories by Knowledge-Based Artificial Neural Networks." In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90).
task: classification
target:
type: binary
description: None yet. See our contributing guide to help us add one.
code: None yet. See our contributing guide to help us add one.
description: Positive class indicates a promoter.
code: -:1, +:0 (promoter)
features: # list of features in the dataset
- name: instance
type: categorical
description: null # optional but recommended, what the feature measures/indicates, unit
code: null # optional, coding information, e.g., Control = 0, Case = 1
transform: ~ # optional, any transformation performed on the feature, e.g., log scaled
- name: p-50
type: categorical
- name: p-49
Binary file not shown.
6 changes: 0 additions & 6 deletions datasets/promoters/README.md

This file was deleted.

130 changes: 0 additions & 130 deletions datasets/promoters/metadata.yaml

This file was deleted.

Binary file removed datasets/promoters/promoters.tsv.gz
Binary file not shown.
2 changes: 0 additions & 2 deletions datasets/promoters/summary_stats.csv

This file was deleted.

1 change: 0 additions & 1 deletion pmlb/dataset_lists.py
@@ -156,7 +156,6 @@
'prnn_fglass',
'prnn_synth',
'profb',
'promoters',
'ring',
'saheart',
'satimage',
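
The new metadata description states the nucleotide coding used for the 57 sequence features as `a=0; c=1; g=2; t=3`. A minimal sketch of applying that coding in Python follows. The helper names `encode_sequence` and `decode_sequence` are hypothetical and are not part of pmlb; only the coding table itself comes from the metadata above.

```python
# Coding table copied from the metadata description: a=0; c=1; g=2; t=3.
NUCLEOTIDE_CODES = {"a": 0, "c": 1, "g": 2, "t": 3}
CODE_TO_NUCLEOTIDE = {code: base for base, code in NUCLEOTIDE_CODES.items()}


def encode_sequence(sequence):
    """Map a nucleotide string to a list of integer codes (hypothetical helper)."""
    return [NUCLEOTIDE_CODES[base] for base in sequence.lower()]


def decode_sequence(codes):
    """Map integer codes back to a lowercase nucleotide string (hypothetical helper)."""
    return "".join(CODE_TO_NUCLEOTIDE[int(code)] for code in codes)


if __name__ == "__main__":
    encoded = encode_sequence("tactag")
    print(encoded)
    print(decode_sequence(encoded))
```

Both helpers raise `KeyError` on any symbol outside `{a, c, g, t}`, which seems a reasonable default here because the description says each sequence field is filled by one of those four letters.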
