KN Picard Metrics Aggregation Script #364
base: main

@@ -0,0 +1,276 @@
def read_in(file_name, label):
    started = False
    newlist = []
    wgsfile = file_name
This variable is not used.
    with tfio.gfile.GFile(file_name, "r") as inp:
Is there a reason for using TensorFlow's GFile here? Can this be replaced with Python's built-in open()? (A sketch appears after this function.)
        for line in inp:
            intable = line.startswith(label)
            if intable == True:
You can replace these two lines with a single condition.

Suggested change
            if line.startswith(label):
                started = True
                continue
            if started:
                if line != '\n':
                    newlist.append(line.rstrip('\n'))
            if started and line == '\n':
                return newlist
                break
This break statement is not reachable.
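Pulling the GFile and startswith comments above together, here is a minimal sketch of read_in using the built-in open() (assuming the metrics files are read from a local filesystem; GFile may have been chosen for gs:// paths, in which case this does not apply). It also drops the unreachable break:

def read_in(file_name, label):
    # Collect the lines of the table that follows the `label` line.
    started = False
    newlist = []
    with open(file_name, "r") as inp:
        for line in inp:
            if line.startswith(label):
                started = True
                continue
            if started:
                if line == '\n':   # a blank line ends the table
                    return newlist
                newlist.append(line.rstrip('\n'))
    return newlist   # also covers a file that ends without a trailing blank line
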
def newtable(file_name, label):
The most common convention for naming methods is to name them in lower case with underscores separating words, so this method would become new_table.

In general we use the PEP 8 style guide for formatting our projects (https://peps.python.org/pep-0008/); however, in general you want to stick to the conventions of the project you are working on, as long as you keep the formatting consistent.
    columnnames = read_in(file_name, label)[0].split("\t")
    values = read_in(file_name, label)[1].split("\t")
You are traversing the file twice here, instead of just doing it once and storing the result in a variable. This is probably not a big deal here, since as I understand it this code is just reading Picard metric files, which are small.
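For example, a minimal sketch with a single read (same behaviour otherwise, assuming read_in as defined above):

    lines = read_in(file_name, label)
    columnnames = lines[0].split("\t")
    values = lines[1].split("\t")
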
    dataframe = pd.DataFrame(columnnames).T
    dataframe.columns = columnnames
    dataframe_length = len(dataframe)
    dataframe.loc[dataframe_length] = values
    dataframe = dataframe.drop(0)
    return dataframe
Comment on lines +21 to +26

Maybe I'm missing something but I think this code is equivalent to

Suggested change
    return pd.DataFrame([values], columns = columnnames)

def addrows(file_name, label):
Same here with the function naming convention, and also below.
    onerow = newtable(file_name, label)
    for i in range(len(read_in(file_name, label))):
        nextrow = read_in(file_name, label)[i].split("\t")
As above, here you are traversing the same file in each iteration. In general I/O is a bottleneck in many cases, and we only want to read the files into memory once. You can rewrite this function to read the file content into an array once and then fill in all the rows of the dataframe in one go. (See the sketch after this function.) Feel free to reach out if you need any help in this.

+1. Reading files from disk is orders of magnitude slower than in-memory operations. I'd put my money on this being where you're seeing the biggest slowdowns.
        onerow.loc[i] = nextrow
    onerow = onerow.drop(0)
    return onerow
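For example, a rough sketch of that single-pass rewrite (it assumes read_in as defined above and keeps the 1-based row index that concatrows below relies on):

def addrows(file_name, label):
    lines = read_in(file_name, label)                 # one pass over the file
    rows = [line.split("\t") for line in lines[1:]]   # lines[0] is the header
    frame = pd.DataFrame(rows, columns = lines[0].split("\t"))
    frame.index = range(1, len(rows) + 1)             # restore the original 1-based index
    return frame
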

def concatrows(table, suffix):
    newdata = pd.DataFrame()
    for row in range(1, len(table) + 1):
        addrow = pd.DataFrame(table.loc[row]).T.reset_index(drop=True)
        addrow = addrow.add_suffix(suffix[row-1])
        newdata = pd.concat([newdata, addrow], axis = 1)
Dataframe concatenations as well as creation are expensive operations and should be avoided if possible. You should be able to rewrite this function to avoid these operations (see the sketch after this function). Feel free to reach out if you need any help with this.
    return newdata

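For example, one way to build the single output row without a per-iteration concat (a sketch, assuming suffix holds one suffix per row of table, as in the calls below):

def concatrows(table, suffix):
    # Flatten the table into one row, suffixing each source row's columns,
    # and build a single DataFrame at the end.
    values = {}
    for row in range(1, len(table) + 1):
        for col in table.columns:
            values[col + suffix[row - 1]] = table.at[row, col]
    return pd.DataFrame([values])
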
def window_metrics_mean(table, window):
    dataframe1 = pd.DataFrame(table[0:window]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    dataframe2 = pd.DataFrame(table[window:window*2]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    dataframe3 = pd.DataFrame(table[window*2:window*3]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    dataframe4 = pd.DataFrame(table[window*3:window*4]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    dataframe5 = pd.DataFrame(table[window*4:window*5]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    dataframe6 = pd.DataFrame(table[window*5:window*6]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
Comment on lines +48 to +53

There is a lot of duplicated code here (and in other functions below), which is a sign that the code can benefit from refactoring (https://refactoring.guru/smells/duplicate-code). One simple way to refactor it is to add an inner function inside the window_metrics_mean function (see the sketch after this function).
    bddf1mean = dataframe1.mean()
    bddf2mean = dataframe2.mean()
    bddf3mean = dataframe3.mean()
    bddf4mean = dataframe4.mean()
    bddf5mean = dataframe5.mean()
    bddf6mean = dataframe6.mean()

    bddf1meanvalues = pd.DataFrame(bddf1mean).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window))
    bddf2meanvalues = pd.DataFrame(bddf2mean).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*2))
    bddf3meanvalues = pd.DataFrame(bddf3mean).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*3))
    bddf4meanvalues = pd.DataFrame(bddf4mean).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*4))
    bddf5meanvalues = pd.DataFrame(bddf5mean).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*5))
    bddf6meanvalues = pd.DataFrame(bddf6mean).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*6))

    BDMresult = pd.concat([bddf1meanvalues, bddf2meanvalues, bddf3meanvalues, bddf4meanvalues, bddf5meanvalues, bddf6meanvalues], axis = 1)
Variable names (unless they are constants) should be lower case, separated by underscores.
    return BDMresult

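For example, a sketch of that inner-function refactor (it mirrors the six near-identical blocks above; names are illustrative):

def window_metrics_mean(table, window):
    def window_mean(k):
        # k-th window of rows, converted to numeric as in the blocks above
        chunk = pd.DataFrame(table[window*(k-1):window*k]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
        # per-column mean, dropping CYCLE and tagging columns with the window end
        return pd.DataFrame(chunk.mean()).T.drop('CYCLE', axis = 1).add_suffix('_' + str(window*k))
    return pd.concat([window_mean(k) for k in range(1, 7)], axis = 1)
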
def window_metrics_std(table, window):
    mqdf1 = pd.DataFrame(table[0:window]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
Duplicate code here and below as well.
    mqdf2 = pd.DataFrame(table[window:window*2]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    mqdf3 = pd.DataFrame(table[window*2:window*3]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    mqdf4 = pd.DataFrame(table[window*3:window*4]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    mqdf5 = pd.DataFrame(table[window*4:window*5]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')
    mqdf6 = pd.DataFrame(table[window*5:window*6]).reset_index(drop = True).apply(pd.to_numeric, errors='ignore')

    mqdf1std = mqdf1.std().rename({'MEAN_QUALITY':'SD'})
    mqdf1stdtable = pd.DataFrame(mqdf1std).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window))
    mqdf2std = mqdf2.std().rename({'MEAN_QUALITY':'SD'})
    mqdf2stdtable = pd.DataFrame(mqdf2std).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*2))
    mqdf3std = mqdf3.std().rename({'MEAN_QUALITY':'SD'})
    mqdf3stdtable = pd.DataFrame(mqdf3std).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*3))
    mqdf4std = mqdf4.std().rename({'MEAN_QUALITY':'SD'})
    mqdf4stdtable = pd.DataFrame(mqdf4std).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*4))
    mqdf5std = mqdf5.std().rename({'MEAN_QUALITY':'SD'})
    mqdf5stdtable = pd.DataFrame(mqdf5std).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*5))
    mqdf6std = mqdf6.std().rename({'MEAN_QUALITY':'SD'})
    mqdf6stdtable = pd.DataFrame(mqdf6std).T.drop('CYCLE', axis = 1).add_suffix('_'+str(window*6))

    MQMresult = pd.concat([mqdf1stdtable, mqdf2stdtable, mqdf3stdtable, mqdf4stdtable, mqdf5stdtable, mqdf6stdtable], axis = 1)
    return MQMresult

def sequencing_artifact_metrics(table):
    sadfone = pd.DataFrame(table.loc[1]).T.reset_index(drop = True)
    sadf1 = sadfone.add_suffix("_1").drop(['SAMPLE_ALIAS_1', 'LIBRARY_1', 'WORST_CXT_1',
This can also be refactored to get rid of the duplicate code, but also to store the repeated column-name strings ('SAMPLE_ALIAS', 'LIBRARY', 'WORST_CXT', ...) in one place; see the sketch after this function.
                                           'WORST_CXT_QSCORE_1', 'WORST_PRE_CXT_1',
                                           'WORST_PRE_CXT_QSCORE_1', 'WORST_POST_CXT_1',
                                           'WORST_POST_CXT_QSCORE_1'], axis = 1)
    sadftwo = pd.DataFrame(table.loc[2]).T.reset_index(drop = True)
    sadf2 = sadftwo.add_suffix("_2").drop(['SAMPLE_ALIAS_2', 'LIBRARY_2', 'WORST_CXT_2',
                                           'WORST_CXT_QSCORE_2', 'WORST_PRE_CXT_2',
                                           'WORST_PRE_CXT_QSCORE_2', 'WORST_POST_CXT_2',
                                           'WORST_POST_CXT_QSCORE_2'], axis = 1)
    sadfthree = pd.DataFrame(table.loc[3]).T.reset_index(drop = True)
    sadf3 = sadfthree.add_suffix("_3").drop(['SAMPLE_ALIAS_3', 'LIBRARY_3', 'WORST_CXT_3',
                                             'WORST_CXT_QSCORE_3', 'WORST_PRE_CXT_3',
                                             'WORST_PRE_CXT_QSCORE_3', 'WORST_POST_CXT_3',
                                             'WORST_POST_CXT_QSCORE_3'], axis = 1)
    sadffour = pd.DataFrame(table.loc[4]).T.reset_index(drop = True)
    sadf4 = sadffour.add_suffix("_4").drop(['SAMPLE_ALIAS_4', 'LIBRARY_4', 'WORST_CXT_4',
                                            'WORST_CXT_QSCORE_4', 'WORST_PRE_CXT_4',
                                            'WORST_PRE_CXT_QSCORE_4', 'WORST_POST_CXT_4',
                                            'WORST_POST_CXT_QSCORE_4'], axis = 1)
    sadffive = pd.DataFrame(table.loc[5]).T.reset_index(drop = True)
    sadf5 = sadffive.add_suffix("_5").drop(['SAMPLE_ALIAS_5', 'LIBRARY_5', 'WORST_CXT_5',
                                            'WORST_CXT_QSCORE_5', 'WORST_PRE_CXT_5',
                                            'WORST_PRE_CXT_QSCORE_5', 'WORST_POST_CXT_5',
                                            'WORST_POST_CXT_QSCORE_5'], axis = 1)
    sadfsix = pd.DataFrame(table.loc[6]).T.reset_index(drop = True)
    sadf6 = sadfsix.add_suffix("_6").drop(['SAMPLE_ALIAS_6', 'LIBRARY_6', 'WORST_CXT_6',
                                           'WORST_CXT_QSCORE_6', 'WORST_PRE_CXT_6',
                                           'WORST_PRE_CXT_QSCORE_6', 'WORST_POST_CXT_6',
                                           'WORST_POST_CXT_QSCORE_6'], axis = 1)
    sadfseven = pd.DataFrame(table.loc[7]).T.reset_index(drop = True)
    sadf7 = sadfseven.add_suffix("_7").drop(['SAMPLE_ALIAS_7', 'LIBRARY_7', 'WORST_CXT_7',
                                             'WORST_CXT_QSCORE_7', 'WORST_PRE_CXT_7',
                                             'WORST_PRE_CXT_QSCORE_7', 'WORST_POST_CXT_7',
                                             'WORST_POST_CXT_QSCORE_7'], axis = 1)
    sadfeight = pd.DataFrame(table.loc[8]).T.reset_index(drop = True)
    sadf8 = sadfeight.add_suffix("_8").drop(['SAMPLE_ALIAS_8', 'LIBRARY_8', 'WORST_CXT_8',
                                             'WORST_CXT_QSCORE_8', 'WORST_PRE_CXT_8',
                                             'WORST_PRE_CXT_QSCORE_8', 'WORST_POST_CXT_8',
                                             'WORST_POST_CXT_QSCORE_8'], axis = 1)
    sadfnine = pd.DataFrame(table.loc[9]).T.reset_index(drop = True)
    sadf9 = sadfnine.add_suffix("_9").drop(['SAMPLE_ALIAS_9', 'LIBRARY_9', 'WORST_CXT_9',
                                            'WORST_CXT_QSCORE_9', 'WORST_PRE_CXT_9',
                                            'WORST_PRE_CXT_QSCORE_9', 'WORST_POST_CXT_9',
                                            'WORST_POST_CXT_QSCORE_9'], axis = 1)
    sadften = pd.DataFrame(table.loc[10]).T.reset_index(drop = True)
    sadf10 = sadften.add_suffix("_10").drop(['SAMPLE_ALIAS_10', 'LIBRARY_10', 'WORST_CXT_10',
                                             'WORST_CXT_QSCORE_10', 'WORST_PRE_CXT_10',
                                             'WORST_PRE_CXT_QSCORE_10', 'WORST_POST_CXT_10',
                                             'WORST_POST_CXT_QSCORE_10'], axis = 1)
    sadfeleven = pd.DataFrame(table.loc[11]).T.reset_index(drop = True)
    sadf11 = sadfeleven.add_suffix("_11").drop(['SAMPLE_ALIAS_11', 'LIBRARY_11', 'WORST_CXT_11',
                                                'WORST_CXT_QSCORE_11', 'WORST_PRE_CXT_11',
                                                'WORST_PRE_CXT_QSCORE_11', 'WORST_POST_CXT_11',
                                                'WORST_POST_CXT_QSCORE_11'], axis = 1)
    sadftwelve = pd.DataFrame(table.loc[12]).T.reset_index(drop = True)
    sadf12 = sadftwelve.add_suffix("_12").drop(['SAMPLE_ALIAS_12', 'LIBRARY_12', 'WORST_CXT_12',
                                                'WORST_CXT_QSCORE_12', 'WORST_PRE_CXT_12',
                                                'WORST_PRE_CXT_QSCORE_12', 'WORST_POST_CXT_12',
                                                'WORST_POST_CXT_QSCORE_12'], axis = 1)
    SAMresult = pd.concat([sadf1, sadf2, sadf3, sadf4, sadf5, sadf6, sadf7, sadf8, sadf9, sadf10, sadf11, sadf12], axis = 1)
    return SAMresult

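For reference, a compact version of sequencing_artifact_metrics along those lines might look like this (a sketch; DROPPED_COLUMNS is an illustrative constant name):

DROPPED_COLUMNS = ['SAMPLE_ALIAS', 'LIBRARY', 'WORST_CXT', 'WORST_CXT_QSCORE',
                   'WORST_PRE_CXT', 'WORST_PRE_CXT_QSCORE',
                   'WORST_POST_CXT', 'WORST_POST_CXT_QSCORE']

def sequencing_artifact_metrics(table):
    # One suffixed, column-pruned single-row frame per table row, joined side by side.
    parts = []
    for i in range(1, 13):
        row = pd.DataFrame(table.loc[i]).T.reset_index(drop = True).add_suffix("_" + str(i))
        parts.append(row.drop([c + "_" + str(i) for c in DROPPED_COLUMNS], axis = 1))
    return pd.concat(parts, axis = 1)
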
def windows(table, window):
    if len(table.columns) == 2:
I think this is fine to use in the script, but in general having the output of the function be conditioned on a hardcoded value like this is a sign that the code should be refactored. One way to refactor it is to rewrite this code in an object-oriented manner. Again, I don't think it is necessary here, but for code that is more than a few hundred lines long it's a good idea, and it will provide more readability and robustness. (A lighter-weight alternative is sketched after this function.)
        return window_metrics_mean(table, window)

    if len(table.columns) == 7:
        return window_metrics_std(table, window)

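As a lighter-weight alternative to a full object-oriented rewrite, the dispatch could be made explicit with a mapping (a sketch; the mapping name is illustrative):

WINDOW_METRIC_FUNCTIONS = {
    2: window_metrics_mean,   # mirrors the len(table.columns) == 2 branch above
    7: window_metrics_std,    # mirrors the len(table.columns) == 7 branch above
}

def windows(table, window):
    return WINDOW_METRIC_FUNCTIONS[len(table.columns)](table, window)
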
def mainfile(index):
    allsamplecolumns = ['alignment_summary_metrics', 'base_distribution_by_cycle_table', 'gc_bias_summary_metrics',
                        'insert_size_metrics', 'mean_quality_by_cycle_table',
                        'sequencing_artifact_summary_metrics', 'quality_score_table', 'quality_yield_metrics',
                        'raw_wgs_metrics', 'wgs_metrics']
    table_name = "sample"
    samples = pd.read_csv(io.StringIO(fiss.fapi.get_entities_tsv(project, workspace, 'sample').text), sep='\t')
    samples.rename(columns = {'entity:sample_id':'sample'}, inplace = True)
    # specificcolumns = samples[['alignment_summary_metrics', 'base_distribution_by_cycle_table', 'gc_bias_summary_metrics',
Remove old code here.
    #                            'insert_size_metrics', 'mean_quality_by_cycle_table',
    #                            'sequencing_artifact_summary_metrics', 'quality_score_table', 'quality_yield_metrics',
    #                            'raw_wgs_metrics', 'wgs_metrics']]
    # dropemptyrows = specificcolumns.dropna(axis = 0)
    # dropemptycolumns = dropemptyrows.dropna(axis = 1)
    # files = ! ls

    Dict = {}
Make the variable name more descriptive and lower case.
    for i in index:
        result = []
        for column in allsamplecolumns:
            metric = samples.at[i, column]
            if type(metric) == float and math.isnan(metric):
                continue
            else:
                result.append(metric)
        if result != []:
            key1 = result[0].split('.')[0].split('/')[-1]
            Dict[key1] = result
    return Dict

# for column in allsamplecolumns:
#     for i in index:
#         metric = dropemptycolumns[column][i]
#         if metric == 'NaN':
#             continue
#         else:
#             ! gsutil cp $metric .
#             result.append(metric.split("/")[-1])
# return result

sampledict = mainfile(range(50))
allsamplecolumns = ['alignment_summary_metrics', 'base_distribution_by_cycle_table', 'gc_bias_summary_metrics',
                    'insert_size_metrics', 'mean_quality_by_cycle_table',
                    'sequencing_artifact_summary_metrics', 'quality_score_table', 'quality_yield_metrics',
                    'raw_wgs_metrics', 'wgs_metrics']
def concattables():
    result = pd.DataFrame()
    for k in sampledict.keys():
        listoftables = []
        for v in sampledict.get(k):
            if ('alignment_summary_metrics' in v):
                alignment1 = addrows(v, '## METRICS CLASS')
                suffixes = alignment1['CATEGORY'].tolist()
                alignment = concatrows(alignment1, suffixes)
                listoftables.append(alignment)
            elif ('base_distribution_by_cycle_table' in v):
                base = windows(addrows(v, '## METRICS CLASS'), 50)
                listoftables.append(base)
            elif ('gc_bias_summary_metrics' in v):
                gc = addrows(v, '## METRICS CLASS')
                gcdrop = gc.reset_index(drop = True)
                listoftables.append(gcdrop)
            elif ('insert_size_metrics' in v):
                insert1 = addrows(v, '## METRICS CLASS')
                suffixes = insert1['PAIR_ORIENTATION'].tolist()
                insert = concatrows(insert1, suffixes)
                listoftables.append(insert)
            elif ('mean_quality_by_cycle_table' in v):
                meanquality = windows(addrows(v, '## HISTOGRAM'), 50)
                listoftables.append(meanquality)
            elif ('pre_adapter_summary_metrics' in v):
                sequencing = sequencing_artifact_metrics(addrows(v, '## METRICS CLASS'))
                listoftables.append(sequencing)
            elif ('quality_score_distribution' in v):
                qs1 = addrows(v, '## HISTOGRAM')
                suffixes = qs1['QUALITY'].tolist()
                qs = concatrows(qs1, suffixes)
                listoftables.append(qs)
            elif ('quality_yield_metrics' in v):
                qym = addrows(v, '## METRICS CLASS')
                qymdrop = qym.reset_index(drop = True)
                listoftables.append(qymdrop)
            elif ('raw_wgs_metrics' in v):
                raw = addrows(v, '## METRICS CLASS')
                rawdrop = raw.reset_index(drop = True).add_suffix('raw')
                listoftables.append(rawdrop)
            elif ('wgs_metrics' in v):
                wgs = addrows(v, '## METRICS CLASS')
                wgsdrop = wgs.reset_index(drop = True)
                listoftables.append(wgsdrop)
        newrow = pd.concat(listoftables, axis = 1)
        test = []
        # for i in range(len(newrow.columns)):
        #     for j in range(i + 1, len(newrow.columns)):
        #         if newrow.columns[i] == newrow.columns[j]:
        #             test.append(newrow.columns[i])
        # print(len(test))
        # if len(set(newrow.columns)) != len(newrow.columns):
        #     print('duplicate')

        # duplicates = find_duplicates(newrow)
        # print(duplicates)
        result = result.loc[:, ~result.columns.duplicated()].copy()
        newrow = newrow.loc[:, ~newrow.columns.duplicated()].copy()
        result = pd.concat([result, newrow], ignore_index = True)
    return result
Extra spaces here.
You can be more descriptive with the function name here. You could also consider adding a docstring with a description of the function and potentially inputs/outputs, where you can describe the format of the file_name file, and what the label represents.

Versions of Python > 3.5 also support typing (https://docs.python.org/3/library/typing.html): you can specify the types of the inputs and of the function output. This really helps with readability, especially if you are reading someone else's code.

How detailed you want to be depends on whether this script will be used or extended again by you or other people in the future.
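For example, a typed, documented signature for read_in might look like this (a sketch; the docstring wording is illustrative):

from typing import List, Optional

def read_in(file_name: str, label: str) -> Optional[List[str]]:
    """Return the lines of the table that follows the label marker in a Picard metrics file.

    Args:
        file_name: Path to the Picard metrics file.
        label: Marker line (e.g. '## METRICS CLASS') that precedes the table.

    Returns:
        The table lines with trailing newlines stripped, or None if the label is not found.
    """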