Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many publications associated with a single author breaks augur export #571

Closed
huddlej opened this issue Jun 25, 2020 · 3 comments · Fixed by #581
Closed

Too many publications associated with a single author breaks augur export #571

huddlej opened this issue Jun 25, 2020 · 3 comments · Fixed by #581
Assignees
Labels
bug Something isn't working priority: high To be resolved before other issues

Comments

@huddlej
Copy link
Contributor

huddlej commented Jun 25, 2020

Current Behavior

Running augur export v2 with recent ncov data can produce a case when an author has more than associated 52 publications. In this case, the index of the author tuple (see traceback below) exceeds the letters provided in the alphabet string.

Traceback (most recent call last):
  File "/home/jhuddles/miniconda3/envs/nextstrain/bin/augur", line 8, in <module>
    sys.exit(main())
  File "/home/jhuddles/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/home/jhuddles/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur/__init__.py", line 74, in run
    return args.__command__.run(args)
  File "/home/jhuddles/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur/export.py", line 22, in run
    return run_v2(args)
  File "/home/jhuddles/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur/export_v2.py", line 867, in run_v2
    set_node_attrs_on_tree(data_json, node_attrs)
  File "/home/jhuddles/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur/export_v2.py", line 509, in set_node_attrs_on_tree
    author_data = create_author_data(node_attrs)
  File "/home/jhuddles/miniconda3/envs/nextstrain/lib/python3.6/site-packages/augur/export_v2.py", line 491, in create_author_data
    node_author_info[node_name]["value"] = author + " {}".format("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[index])
IndexError: string index out of range

Expected behavior

The above error should not happen.

How to reproduce

Download the attached data in ncov_issue.zip and run the following command with augur version 8.0.0 (the latest).

augur export v2 \
  --tree tree.nwk \
  --metadata metadata_adjusted.tsv \
  --node-data branch_lengths.json \
  --auspice-config my_auspice_config.json \
  --output ncov_with_accessions.json

Possible solution

  • Append a numeric value to the authors list instead of an alphabetical value. The index of the author tuple from the current code could work for this.
  • Alternately, use the URL associated with the author in the metadata table or the GISAID accession Either of these two options would also provide more information to the user than an alphabetical character or a number.

The solution should ideally be future-proof enough to not break in a data-dependent manner.

Context

The current author disambiguation approach was introduced in f0e4b1c. The implementation previous to this appears to have used the numeric approach I suggest above, so maybe the GISAID id or URL approach are better?

@huddlej huddlej added the bug Something isn't working label Jun 25, 2020
@huddlej huddlej added the needs triage Needs triage by a Nextstrain team member label Jun 25, 2020
@dpark01
Copy link

dpark01 commented Jun 25, 2020

@cmloreth and I have seen this recently too

@trvrb
Copy link
Member

trvrb commented Jun 25, 2020

How about a simple appending of A, B, C, ..., Z, AA, AB, AC, ..., AZ, BA, BB, etc...? I'd prefer Gemma Clark et al B to Gemma Clark et al 2 for aesthetic reasons.

But just having 1, 2, 3, etc... would be okay as well.

@huddlej
Copy link
Contributor Author

huddlej commented Jun 26, 2020

Thanks, @trvrb! Your suggestion sounds good. I'll try this out. For my own context, the reason why the numeric disambiguation is not preferred is partially because the authors are listed in auspice with the number of strains displayed in the tree like so:

Dave J. Baker et al A (2) Dave J. Baker et al B (1)

If we used numeric disambiguation, these authors would be listed like:

Dave J. Baker et al 1 (2) Dave J. Baker et al 2 (1)

which is much more confusing.

@huddlej huddlej self-assigned this Jun 26, 2020
@huddlej huddlej added priority: high To be resolved before other issues and removed needs triage Needs triage by a Nextstrain team member labels Jun 26, 2020
huddlej added a commit that referenced this issue Jun 27, 2020
Adds a function to convert a zero-indexed article count for a given author to a
alphabetical disambiguation suffix. For example, two articles by Bedford et
al. get disambiguated as "Bedford et al. A" and "Bedford et al. B".

The new function is slightly overkill in that it supports an infinite number of
articles per author. We could easily unroll the first two loops into redundant
code that only works for a fixed number of articles (702), but the idea here is
that we never have to touch this function again.

Fixes #571
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority: high To be resolved before other issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants