Sourcery Starbot ⭐ refactored sidphbot/Auto-Research #3

Open: wants to merge 1 commit into base: main

SourceryAI

Thanks for starring sourcery-ai/sourcery ✨ 🌟 ✨

Here's your pull request refactoring your most popular Python repo.

If you want Sourcery to refactor all your Python repos and incoming pull requests, install our bot.

Review changes via command line

To manually merge these changes, make sure you're on the main branch, then run:

git fetch https://github.com/sourcery-ai-bot/Auto-Research main
git merge --ff-only FETCH_HEAD
git reset HEAD^

@SourceryAI SourceryAI left a comment

Due to GitHub API limits, only the first 60 comments can be shown.

Comment on lines -76 to +103
- st.sidebar.image(Image.open('logo_landscape.png'), use_column_width = 'always')
- st.title('Auto-Research')
- st.write('#### A no-code utility to generate a detailed well-cited survey with topic clustered sections'
-          '(draft paper format) and other interesting artifacts from a single research query or a curated set of papers(arxiv ids).')
- st.write('##### Data Provider: arXiv Open Archive Initiative OAI')
- st.write('##### GitHub: https://github.com/sidphbot/Auto-Research')
- download_placeholder = st.container()
-
- with st.sidebar.form(key="survey_keywords_form"):
-     session_data = sp.pydantic_input(key="keywords_input_model", model=KeywordsModel)
-     st.write('or')
-     session_data.update(sp.pydantic_input(key="arxiv_ids_input_model", model=ArxivIDsModel))
-     submit = st.form_submit_button(label="Submit")
- st.sidebar.write('#### execution log:')
-
- run_kwargs = {'surveyor':get_surveyor_instance(_print_fn=st.sidebar.write, _survey_print_fn=st.write),
-               'download_placeholder':download_placeholder}
- if submit:
-     if session_data['research_keywords'] != '':
-         run_kwargs.update({'research_keywords':session_data['research_keywords'],
-                            'max_search':session_data['max_search'],
-                            'num_papers':session_data['num_papers']})
-     elif session_data['arxiv_ids'] != '':
-         run_kwargs.update({'arxiv_ids':[id.strip() for id in session_data['arxiv_ids'].split(',')]})
-
-     run_survey(**run_kwargs)
+ st.sidebar.image(Image.open('logo_landscape.png'), use_column_width = 'always')
+ st.title('Auto-Research')
+ st.write('#### A no-code utility to generate a detailed well-cited survey with topic clustered sections'
+          '(draft paper format) and other interesting artifacts from a single research query or a curated set of papers(arxiv ids).')
+ st.write('##### Data Provider: arXiv Open Archive Initiative OAI')
+ st.write('##### GitHub: https://github.com/sidphbot/Auto-Research')
+ download_placeholder = st.container()
+
+ with st.sidebar.form(key="survey_keywords_form"):
+     session_data = sp.pydantic_input(key="keywords_input_model", model=KeywordsModel)
+     st.write('or')
+     session_data.update(sp.pydantic_input(key="arxiv_ids_input_model", model=ArxivIDsModel))
+     submit = st.form_submit_button(label="Submit")
+ st.sidebar.write('#### execution log:')
+
+ run_kwargs = {'surveyor':get_surveyor_instance(_print_fn=st.sidebar.write, _survey_print_fn=st.write),
+               'download_placeholder':download_placeholder}
+ if submit:
+     if session_data['research_keywords'] != '':
+         run_kwargs.update({'research_keywords':session_data['research_keywords'],
+                            'max_search':session_data['max_search'],
+                            'num_papers':session_data['num_papers']})
+     elif session_data['arxiv_ids'] != '':
+         run_kwargs['arxiv_ids'] = [
+             id.strip() for id in session_data['arxiv_ids'].split(',')
+         ]
+
+     run_survey(**run_kwargs)

Lines 76-101 refactored with the following changes:

- s = '{} {}'.format(match.group(2), match.group(3))
+ s = f'{match.group(2)} {match.group(3)}'

Function _parse_author_affil_split refactored with the following changes:
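
The change above swaps `str.format` for an f-string. A minimal sketch of why the two forms are interchangeable here; the regex and the input string are hypothetical and exist only to produce groups 2 and 3, they are not taken from the repository:

```python
import re

# Hypothetical pattern and input, used only so that groups 2 and 3 exist.
match = re.match(r'(\w+) (\w+) (\w+)', 'van der Berg')

# Old style: positional placeholders filled by str.format.
old_style = '{} {}'.format(match.group(2), match.group(3))

# New style: the same expressions evaluated inline in an f-string.
new_style = f'{match.group(2)} {match.group(3)}'

# Both forms build the same string.
assert old_style == new_style
```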

Comment on lines -200 to +201
- else:
-     parts.append(pt)
-     last = pt
+ parts.append(pt)
+ last = pt

Function _remove_double_commas refactored with the following changes:

Comment on lines -213 to +217
- def _collaboration_at_start(names: List[str]) \
-         -> Tuple[List[str], List[List[str]], int]:
+ def _collaboration_at_start(names: List[str]) -> Tuple[List[str], List[List[str]], int]:
      """Perform special handling of collaboration at start."""
      author_list = []

      back_propagate_affiliations_to = 0
-     while len(names) > 0:
+     while names:

Function _collaboration_at_start refactored with the following changes:
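
The loop condition changes from `len(names) > 0` to plain truthiness. A small sketch, assuming `names` is a built-in `list` as the type hint says; the two `drain_*` helpers are hypothetical and only compare the loop conditions:

```python
from typing import List


def drain_with_len(names: List[str]) -> List[str]:
    """Consume the list using the explicit length check."""
    names = list(names)  # work on a copy so the caller's list is untouched
    seen = []
    while len(names) > 0:
        seen.append(names.pop(0))
    return seen


def drain_with_truthiness(names: List[str]) -> List[str]:
    """Consume the list relying on the fact that an empty list is falsy."""
    names = list(names)
    seen = []
    while names:
        seen.append(names.pop(0))
    return seen
```

The equivalence holds for built-in sequences. It would not hold for objects that define truthiness differently, which is worth keeping in mind if `names` is ever something other than a list.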

Comment on lines -237 to +235
- def _enum_collaboration_at_end(author_line: str)->Dict:
+ def _enum_collaboration_at_end(author_line: str) -> Dict:

Function _enum_collaboration_at_end refactored with the following changes:

This removes the following comments (why?):

# Now expect `1) affil1 ', discard if no match

Comment on lines -258 to +256
- log.info('Searching "{}"...'.format(globber))
- log.info('Found: {} pdfs'.format(len(pdffiles)))
+ log.info(f'Searching "{globber}"...')
+ log.info(f'Found: {len(pdffiles)} pdfs')

Function convert_directory refactored with the following changes:

Comment on lines -300 to +298
- log.info('Searching "{}"...'.format(globber))
- log.info('Found: {} pdfs'.format(len(pdffiles)))
+ log.info(f'Searching "{globber}"...')
+ log.info(f'Found: {len(pdffiles)} pdfs')

Function convert_directory_parallel refactored with the following changes:

Comment on lines -314 to +311
- log.error('File conversion failed for {}: {}'.format(pdffile, e))
+ log.error(f'File conversion failed for {pdffile}: {e}')

Function convert_safe refactored with the following changes:

Comment on lines -335 to +332
- raise RuntimeError('No such path: %s' % path)
+ raise RuntimeError(f'No such path: {path}')

Function convert refactored with the following changes:

- for f in files:
-     if 'txt' in f:
-         out.append(os.path.join(root, f))
+ out.extend(os.path.join(root, f) for f in files if 'txt' in f)

Function all_articles refactored with the following changes:
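
Here the append loop becomes a single `extend` over a generator expression. A sketch of the two forms side by side; the `walk_result` data is a hypothetical stand-in for `os.walk()` output and the two `collect_*` functions are illustrative names, not functions from the repository:

```python
import os

# Hypothetical stand-in for os.walk() output: (root, dirs, files) tuples.
walk_result = [
    ('data', [], ['a.txt', 'b.pdf']),
    (os.path.join('data', 'sub'), [], ['c.txt', 'txt_notes.pdf']),
]


def collect_loop(walk):
    """Original form: nested loop with a conditional append."""
    out = []
    for root, dirs, files in walk:
        for f in files:
            if 'txt' in f:
                out.append(os.path.join(root, f))
    return out


def collect_extend(walk):
    """Refactored form: one extend() call per directory."""
    out = []
    for root, dirs, files in walk:
        out.extend(os.path.join(root, f) for f in files if 'txt' in f)
    return out
```

Both forms keep the substring test `'txt' in f`, so a name such as `txt_notes.pdf` is also collected. If only `.txt` files are wanted, `f.endswith('.txt')` would be the stricter check, but that is a behavior change the refactoring does not make.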

Comment on lines -78 to +80
-     log.info('Completed {} articles'.format(i))
+     log.info(f'Completed {i} articles')
  try:
      refs = extract_references(article)
      cites[path_to_id(article)] = refs
  except:
-     log.error("Error in {}".format(article))
+     log.error(f"Error in {article}")

Function citation_list_inner refactored with the following changes:

Comment on lines -103 to +100
- log.info('Calculating citation network for {} articles'.format(len(articles)))
+ log.info(f'Calculating citation network for {len(articles)} articles')

Function citation_list_parallel refactored with the following changes:

Comment on lines -126 to +123
- log.info('Saving to "{}"'.format(filename))
+ log.info(f'Saving to "{filename}"')

Function save_to_default_location refactored with the following changes:

Comment on lines -75 to +83
- if response.status_code == 503:
-     secs = int(response.headers.get('Retry-After', 20)) * 1.5
-     log.info('Requested to wait, waiting {} seconds until retry...'.format(secs))
-
-     time.sleep(secs)
-     return get_list_record_chunk(resumptionToken=resumptionToken)
- else:
+ if response.status_code != 503:
      raise Exception(
-         'Unknown error in HTTP request {}, status code: {}'.format(
-             response.url, response.status_code
-         )
+         f'Unknown error in HTTP request {response.url}, status code: {response.status_code}'
      )
+ secs = int(response.headers.get('Retry-After', 20)) * 1.5
+ log.info(f'Requested to wait, waiting {secs} seconds until retry...')
+
+ time.sleep(secs)
+ return get_list_record_chunk(resumptionToken=resumptionToken)

Function get_list_record_chunk refactored with the following changes:
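
The diff above inverts the condition so the error case exits first (a guard clause) and the retry path is no longer nested. A minimal sketch of that shape, isolated from the network; `FakeResponse` and `retry_delay` are hypothetical names introduced only for illustration, and the sleep and recursive retry are left out:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the HTTP response object used in the diff."""
    status_code: int
    url: str = 'https://example.invalid/oai'
    headers: Dict[str, str] = field(default_factory=dict)


def retry_delay(response: FakeResponse) -> float:
    """Return the wait time for a 503 response; raise for any other status."""
    # Guard clause: handle the unexpected case first and leave.
    if response.status_code != 503:
        raise Exception(
            f'Unknown error in HTTP request {response.url}, '
            f'status code: {response.status_code}'
        )
    # Happy path, now at the top indentation level.
    return int(response.headers.get('Retry-After', 20)) * 1.5
```

With this shape, `retry_delay(FakeResponse(503, headers={'Retry-After': '10'}))` gives `15.0`, and a missing header falls back to the default of 20 seconds scaled by 1.5.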

Comment on lines -90 to +87
- item = elm.find('arXiv:{}'.format(name), OAI_XML_NAMESPACES)
+ item = elm.find(f'arXiv:{name}', OAI_XML_NAMESPACES)

Function _record_element_text refactored with the following changes:

Comment on lines -105 to +115
-     logger.info('Requesting "{}" (costs money!)'.format(filename))
+     logger.info(f'Requesting "{filename}" (costs money!)')
      request = requests.get(url, stream=True)
      response_iter = request.iter_content(chunk_size=chunk_size)
-     logger.info("\t Writing {}".format(outfile))
+     logger.info(f"\t Writing {outfile}")
      with gzip.open(outfile, 'wb') as fout:
-         for i, chunk in enumerate(response_iter):
+         for chunk in response_iter:
              fout.write(chunk)
              md5.update(chunk)
  else:
-     logger.info('Requesting "{}" (free!)'.format(filename))
-     logger.info("\t Writing {}".format(outfile))
+     logger.info(f'Requesting "{filename}" (free!)')
+     logger.info(f"\t Writing {outfile}")

Function download_file refactored with the following changes:
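
Besides the f-strings, this hunk drops the unused `enumerate` index from the chunk loop. A sketch of the write-and-hash loop on its own, using in-memory data instead of a real download; `write_chunks` is a hypothetical helper name and the chunk list stands in for `iter_content()`:

```python
import gzip
import hashlib
import io


def write_chunks(chunks, fout) -> str:
    """Write each chunk to fout and return the MD5 hex digest of the raw bytes."""
    md5 = hashlib.md5()
    for chunk in chunks:  # no index needed, so no enumerate()
        fout.write(chunk)
        md5.update(chunk)
    return md5.hexdigest()


# In-memory stand-in for the gzip output file and the streamed response.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode='wb') as fout:
    digest = write_chunks([b'abc', b'def'], fout)
```

The digest is computed over the uncompressed chunks, so it matches the MD5 of the original payload rather than of the `.gz` file.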

- return os.path.join(DIR_PDFTARS, os.path.basename(filename)) + '.gz'
+ return f'{os.path.join(DIR_PDFTARS, os.path.basename(filename))}.gz'

Function _tar_to_filename refactored with the following changes:

- msg = "MD5 '{}' does not match expected '{}' for file '{}'".format(
-     md5_downloaded, md5_expected, filename
- )
+ msg = f"MD5 '{md5_downloaded}' does not match expected '{md5_expected}' for file '{filename}'"

Function download_check_tarfile refactored with the following changes:

Comment on lines -198 to +201
- if dryrun:
-     logger.info(cmd)
-     return 0
- else:
+ if not dryrun:
      return subprocess.check_call(
          shlex.split(cmd), stderr=None if debug else open(os.devnull, 'w')
      )
+ logger.info(cmd)
+ return 0

Function call refactored with the following changes:
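
The refactored `call` keeps `open(os.devnull, 'w')`, which opens a file handle that is never explicitly closed. A possible follow-up, sketched under the assumption that the signature is `call(cmd, dryrun, debug)` as the names in the diff suggest, is to use the standard `subprocess.DEVNULL` constant instead. This is a suggestion, not part of the PR:

```python
import logging
import shlex
import subprocess

logger = logging.getLogger(__name__)


def call(cmd: str, dryrun: bool = False, debug: bool = False) -> int:
    """Run cmd, or only log it when dryrun is set (signature is assumed)."""
    if not dryrun:
        return subprocess.check_call(
            shlex.split(cmd),
            # subprocess.DEVNULL avoids opening os.devnull by hand.
            stderr=None if debug else subprocess.DEVNULL,
        )
    logger.info(cmd)
    return 0
```

In dry-run mode nothing is executed, so `call('rm -rf some_dir', dryrun=True)` only logs the command and returns `0`.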

Comment on lines -238 to +235
- msg = 'Tarfile from manifest not found {}, skipping...'.format(outname)
+ msg = f'Tarfile from manifest not found {outname}, skipping...'

Function process_tarfile_inner refactored with the following changes:

-     logger.info('Tar file appears processed, skipping {}...'.format(filename))
+     logger.info(f'Tar file appears processed, skipping {filename}...')
      return

- logger.info('Processing tar "{}" ...'.format(filename))
+ logger.info(f'Processing tar "{filename}" ...')

Function process_tarfile refactored with the following changes:

Comment on lines -344 to +343
- logger.info("Indexing {}...".format(name))
+ logger.info(f"Indexing {name}...")

- tarname = os.path.join(DIR_PDFTARS, os.path.basename(name))+'.gz'
+ tarname = f'{os.path.join(DIR_PDFTARS, os.path.basename(name))}.gz'

Function generate_tarfile_indices refactored with the following changes:

Comment on lines -359 to +356
- logger.info("Checking {}...".format(tar))
+ logger.info(f"Checking {tar}...")

Function check_missing_txt_files refactored with the following changes:

- sort = list(reversed(
-     sorted([(k, v) for k, v in missing.items()], key=lambda x: len(x[1]))
- ))
+ sort = list(reversed(sorted(list(missing.items()), key=lambda x: len(x[1]))))

  for tar, names in sort:
-     logger.info("Running {} ({} to do)...".format(tar, len(names)))
+     logger.info(f"Running {tar} ({len(names)} to do)...")

Function rerun_missing refactored with the following changes:
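
The refactored line still wraps `sorted` in `reversed` and `list`. A natural next step would be `sorted(..., reverse=True)`, but the two are not strictly identical: they order ties differently. A small sketch with hypothetical data (the tar names and file lists are made up):

```python
# Hypothetical manifest: tar name -> list of missing files.
missing = {'a.tar': ['x', 'y'], 'b.tar': ['z'], 'c.tar': ['p', 'q']}

# Form used in the PR: ascending stable sort, then reversed.
as_in_pr = list(reversed(sorted(list(missing.items()), key=lambda x: len(x[1]))))

# Shorter alternative: descending stable sort.
with_reverse = sorted(missing.items(), key=lambda x: len(x[1]), reverse=True)

# Both put the largest lists first, but entries of equal length swap places:
# reversing flips the order of ties, while reverse=True keeps their input order.
keys_pr = [tar for tar, _ in as_in_pr]
keys_reverse = [tar for tar, _ in with_reverse]
```

Here `keys_pr` is `['c.tar', 'a.tar', 'b.tar']` and `keys_reverse` is `['a.tar', 'c.tar', 'b.tar']`. For `rerun_missing` the tie order probably does not matter, but that is an assumption about how the result is used.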

Comment on lines -11 to +14
-     return '{}/{}.pdf'.format(ym, n)
+     return f'{ym}/{n}.pdf'
  else:
      ym = n.split('/')[1][:4]
-     return '{}/{}.pdf'.format(ym, n.replace('/', ''))
+     return f"{ym}/{n.replace('/', '')}.pdf"

Function id_to_tarpdf refactored with the following changes:

Comment on lines -299 to +288
- joblib.dump(papers, dump_dir + 'papers_selected_pdf_route.dmp')
+ joblib.dump(papers, f'{dump_dir}papers_selected_pdf_route.dmp')

Function Surveyor.fetch_papers refactored with the following changes:

Comment on lines -338 to +409
file.write(research_sections['conclusion'])
self.survey_print_fn(research_sections['conclusion'])
file.write("")
self.survey_print_fn("")

file.write('REFERENCES')
self.survey_print_fn('REFERENCES')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
for entry in bibentries:
file.write(entry)
self.survey_print_fn(entry)
with open(filename, 'w+') as file:
if query is None:
query = 'Internal(existing) research'
self.survey_print_fn("#### Generated_survey:")
file.write("----------------------------------------------------------------------")
file.write(f"Title: A survey on {query}")
self.survey_print_fn("")
self.survey_print_fn("----------------------------------------------------------------------")
self.survey_print_fn(f"Title: A survey on {query}")
file.write("Author: Auto-Research (github.com/sidphbot/Auto-Research)")
self.survey_print_fn("Author: Auto-Research (github.com/sidphbot/Auto-Research)")
file.write("Dev: Auto-Research (github.com/sidphbot/Auto-Research)")
self.survey_print_fn("Dev: Auto-Research (github.com/sidphbot/Auto-Research)")
file.write("Disclaimer: This survey is intended to be a research starter. This Survey is Machine-Summarized, "+
"\nhence some sentences might be wrangled or grammatically incorrect. However all sentences are "+
"\nmined with proper citations. As All of the text is practically quoted texted, hence to "+
"\nimprove visibility, all the papers are duly cited in the Bibiliography section. as bibtex "+
"\nentries(only to avoid LaTex overhead). ")
self.survey_print_fn("Disclaimer: This survey is intended to be a research starter. This Survey is Machine-Summarized, "+
"\nhence some sentences might be wrangled or grammatically incorrect. However all sentences are "+
"\nmined with proper citations. As All of the text is practically quoted texted, hence to "+
"\nimprove visibility, all the papers are duly cited in the Bibiliography section. as bibtex "+
"\nentries(only to avoid LaTex overhead). ")
file.write("----------------------------------------------------------------------")
self.survey_print_fn("----------------------------------------------------------------------")
file.write("")
self.survey_print_fn("")
file.write('ABSTRACT')
self.survey_print_fn('ABSTRACT')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(research_sections['abstract'])
self.survey_print_fn(research_sections['abstract'])
file.write("")
self.survey_print_fn("")
file.write('INTRODUCTION')
self.survey_print_fn('INTRODUCTION')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(research_sections['introduction'])
self.survey_print_fn(research_sections['introduction'])
file.write("")
self.survey_print_fn("")
for k, v in research_sections.items():
if k not in ['abstract', 'introduction', 'conclusion']:
file.write(k.upper())
self.survey_print_fn(k.upper())
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(v)
self.survey_print_fn(v)
file.write("")
self.survey_print_fn("")
file.write('CONCLUSION')
self.survey_print_fn('CONCLUSION')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
file.write(research_sections['conclusion'])
self.survey_print_fn(research_sections['conclusion'])
file.write("")
self.survey_print_fn("")
self.survey_print_fn("========================XXX=========================")
file.write("========================XXX=========================")
file.close()

file.write('REFERENCES')
self.survey_print_fn('REFERENCES')
self.survey_print_fn("=================================================")
file.write("=================================================")
file.write("")
self.survey_print_fn("")
for entry in bibentries:
file.write(entry)
self.survey_print_fn(entry)
file.write("")
self.survey_print_fn("")
self.survey_print_fn("========================XXX=========================")
file.write("========================XXX=========================")

Function Surveyor.build_doc refactored with the following changes:
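
In the hunk above, every line of the survey is emitted twice: once with `file.write` and once with `self.survey_print_fn`. One way to cut that repetition, sketched here as a suggestion rather than something this PR does, is a small helper that sends each line to both sinks. `make_emitter` and `emit` are hypothetical names, and appending a newline on write is a deliberate deviation from the original, which calls `file.write` without one:

```python
import io
from typing import Callable, List, TextIO


def make_emitter(file: TextIO, print_fn: Callable[[str], None]) -> Callable[[str], None]:
    """Return a function that writes a line to the file and also prints it."""
    def emit(text: str) -> None:
        file.write(text + '\n')  # assumption: one line per call is the intent
        print_fn(text)
    return emit


# In-memory stand-ins for the output file and for survey_print_fn.
buffer = io.StringIO()
shown: List[str] = []

emit = make_emitter(buffer, shown.append)
emit('ABSTRACT')
emit('=====')
```

With such a helper, each `file.write(...)` / `self.survey_print_fn(...)` pair would collapse into a single `emit(...)` call.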

Comment on lines -433 to +422
- res = set([str(sent) for sent in list(res.sents)])
- summtext = ''.join([line for line in res])
+ res = {str(sent) for sent in list(res.sents)}
+ summtext = ''.join(list(res))

Function Surveyor.build_basic_blocks refactored with the following changes:
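
Both the old and the new code deduplicate sentences through a `set`, so the order in which sentences are joined is not guaranteed to follow the source text. If order matters for the summary, an order-preserving deduplication is possible with `dict.fromkeys`. This is a sketch of an alternative, not what the PR does; the sentence list is hypothetical and stands in for `res.sents`:

```python
# Hypothetical sentences standing in for the spaCy `res.sents` iterator.
sentences = ['First point. ', 'Second point. ', 'First point. ']

# Form used in the PR: a set comprehension, so join order is arbitrary.
as_in_pr = ''.join({str(s) for s in sentences})

# Alternative: dict keys keep insertion order, so duplicates are dropped
# while the first occurrence of each sentence stays in place.
ordered = ''.join(dict.fromkeys(str(s) for s in sentences))
```

Note also that `''.join(list(res))` could be written as `''.join(res)`, since `join` accepts any iterable of strings.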

Comment on lines -460 to +449
- res = set([str(sent) for sent in list(res.sents)])
- summtext = ''.join([line for line in res])
- #self.print_fn("abstractive summary type:" + str(type(summary)))
- return summtext
+ res = {str(sent) for sent in list(res.sents)}
+ return ''.join(list(res))

Function Surveyor.abstractive_summary refactored with the following changes:

This removes the following comments (why?):

#self.print_fn("abstractive summary type:" + str(type(summary)))

Comment on lines -485 to +471
- abstext = k + '. ' + v.replace('\n', ' ')
+ abstext = f'{k}. ' + v.replace('\n', ' ')

Function Surveyor.get_corpus_lines refactored with the following changes:
