SSSOM/TSV parser could ignore trailing tabs in embedded YAML header #566

Closed
gouttegd opened this issue Dec 13, 2024 · 1 comment · Fixed by #567

Comments

@gouttegd
Contributor

gouttegd commented Dec 13, 2024

When a SSSOM/TSV file containing an embedded YAML metadata header is modified in a standard spreadsheet editor (e.g. LibreOffice Calc, Microsoft Excel, or Apple Numbers), the editor, obviously unaware that lines starting with a # are comments, may treat the commented header lines as if they were normal data lines that merely happen to have only one cell. This means that, when writing the file back, the editor will insert trailing tabs at the end of the header lines (so that those lines have the same number of “columns” as the subsequent real data lines).

For example, a file like this one:

# curie_map:
#   FBbt: http://purl.obolibrary.org/obo/FBbt_
#   UBERON: http://purl.obolibrary.org/obo/UBERON_
subject_id   predicate_id      object_id     mapping_justification
FBbt:1234    skos:exactMatch   UBERON:5678   semapv:ManualMappingCuration

is “seen” by the spreadsheet editor as

# curie_map:
# FBbt: http://purl.obolibrary.org/obo/FBbt_
# UBERON: http://purl.obolibrary.org/obo/UBERON_
subject_id predicate_id object_id mapping_justification
FBbt:1234 skos:exactMatch UBERON:5678 semapv:ManualMappingCuration

and will be written back as (extra tabs rendered as “➪” to make them visible):

# curie_map:➪➪➪
#   FBbt: http://purl.obolibrary.org/obo/FBbt_➪➪➪
#   UBERON: http://purl.obolibrary.org/obo/UBERON_➪➪➪
subject_id   predicate_id      object_id     mapping_justification
FBbt:1234    skos:exactMatch   UBERON:5678   semapv:ManualMappingCuration

This behaviour has been observed with all three aforementioned spreadsheet editors.

Those extra tabs make the YAML metadata block invalid as per the YAML syntax rules, and therefore make the entire file an invalid SSSOM/TSV file. This, in turn, makes it impossible to use standard spreadsheet software to edit a SSSOM mapping set.

I think SSSOM tools could implement a quick and almost painless workaround here, to still accept those files despite them being, strictly speaking, invalid. All that is needed is to strip any trailing tabs in the header lines before passing the lines to the YAML parser.
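
To illustrate the idea, here is a minimal sketch in Python of what such a workaround might look like. It is not the change that was eventually merged; the function name `split_header` and the choice to drop one space after the `#` are assumptions made for this example. It only prepares the header text, which a tool would then hand to its YAML parser of choice.

```python
from typing import Iterable, List, Tuple


def split_header(lines: Iterable[str]) -> Tuple[str, List[str]]:
    """Split SSSOM/TSV lines into (YAML header text, TSV body lines).

    Hypothetical helper: trailing tabs and line terminators are stripped
    from each commented header line before the comment marker is removed.
    """
    header: List[str] = []
    body: List[str] = []
    in_header = True
    for line in lines:
        if in_header and line.startswith("#"):
            # Drop the line terminator and any tabs a spreadsheet appended.
            cleaned = line.rstrip("\r\n\t")
            # Remove the '#' marker and, if present, one following space.
            cleaned = cleaned[1:]
            if cleaned.startswith(" "):
                cleaned = cleaned[1:]
            header.append(cleaned)
        else:
            # First non-comment line ends the header; keep the rest as-is.
            in_header = False
            body.append(line)
    return "\n".join(header), body


sample = [
    "# curie_map:\t\t\t\n",
    "#   FBbt: http://purl.obolibrary.org/obo/FBbt_\t\t\t\n",
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n",
    "FBbt:1234\tskos:exactMatch\tUBERON:5678\tsemapv:ManualMappingCuration\n",
]
header_text, body_lines = split_header(sample)
```

Only the header lines are touched, so tabs in the data rows keep their meaning as column separators.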

@matentzn
Collaborator

This is great, thank you!!
