Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BEDv1 specification #570

Merged
merged 9 commits into from
Dec 6, 2021
Merged

Conversation

michaelmhoffman
Copy link
Contributor

@michaelmhoffman michaelmhoffman commented May 19, 2021

Add a Browser Extensible Data (BED) specification.

@michaelmhoffman
Copy link
Contributor Author

michaelmhoffman commented May 19, 2021

[I am editing my first comment to include the original description of this pull request because it is inside baseball technical hts-specs stuff not relevant to most who will find this]

Previous draft was in ga4gh/ga4gh-bed#2.

Building the TeX into PDF involves departures from previous practice in this repository:

  • latexmk (included in TeX Live) instead of rerun.sh because the LaTeX acronym package does not work with rerun.sh (Switch from rerun.sh to latexmk #569). The acronym package automatically handles the expansion of acronyms--for example, ensuring their expansion only on first use and producing a list at the end. I would rather not remove it because we would lose this functionality. Using latexmk also involves removing new/*.fls and new/*.fdb_latexmk as part of the mostlyclean target.

  • I developed the document with lualatex (included in TeX Live) as I usually do instead of pdflatex, because it fixes some warts especially in font selection and is where I understand TeX engine development has been focused for years. If necessary, I could change the font setup to use pdflatex instead.

  • I used the gitinfo2 package (included in TeX Live) to automatically include commit info in the PDF file instead of using the genversion.sh and .ver setup. In many ways this seems cleaner than generating all the *.ver files but gitinfo2 doesn't actually work unless a special .git/gitHeadInfo.gin file is generated. The gitinfo2 documentation suggests that one do this with a post-checkout/-commit/-merge hook. This is how to add it:

    post_hook_file="$(texdoc --list --nointeract --machine gitinfo2/post-xxx-sample.txt | cut -f 3)"
    for hook in checkout commit merge; do
        cp -v "$post_hook_file" ".git/hooks/post-$hook"
    done
    git checkout master

    Alternatively, one can run the post-xxx-sample script in the same place one would run genversion.sh. The other alternative is I could take this out and replace it with genversion.sh as used elsewhere.

This fails the CircleCI build because I am using the footnotehyper package to get footnotes to work well in a table. I'm able to build all PDFs fine on my Ubuntu 20.10 system (which includes TeX Live 2020). It's not present on the Ubuntu 18.04.5 (Tex Live 2015) container used on CircleCI. I was using Ubuntu 20.04 LTS until recently and I think that would also work if you want to stick to an LTS version. Would you be willing to upgrade?

@jmarshall
Copy link
Member

jmarshall commented May 19, 2021

The existing documents try to be somewhat conservative in their package requirements. It's not so much about what is available on the CI's installation but what is available on maintainers' and interested contributors' installations, and avoiding making it unnecessarily difficult for people to build the PDFs locally. e.g. @jkbonfield tends to have an elderly installation 😄 — James, do you have footnotehyper available?

@jmarshall jmarshall added the bed label May 19, 2021
@jmarshall
Copy link
Member

Re gitinfo2: on the inside, this hook script and genversion.sh are very similar. The significant difference is that genversion.sh identifies the most recent commit affecting the particular document's source files (47aaedf), while gitinfo2 identifies the most recent commit overall (i.e., HEAD). This is less than great in a repository containing more than one specification: regenerated gitinfo2-using PDFs will have spurious differences in their recorded git hashes and datestamps due to unrelated changes in other documents.

Hence I'd recommend just going with the existing genversion.sh infrastructure.

[BTW on my complete and fully updated TeX Live (macTeX) 2021 installation, I got Environment adjustwidth undefined. Is this missing a \usepackage[strict]{changepage}?]

@jkbonfield
Copy link
Contributor

This fails to build with me:

------------
Running 'lualatex  -recorder -output-directory="new"  "BEDv1.tex"'
------------
This is LuaTeX, Version 1.0.4 (TeX Live 2017/Debian) 
 restricted system commands enabled.
(./BEDv1.tex
LaTeX2e <2017-04-15>
(using write cache: /nfs/users/nfs_j/jkb/.texlive2017/texmf-var/luatex-cache/ge
neric)(using read cache: /var/lib/texmf/luatex-cache/generic /nfs/users/nfs_j/j
kb/.texlive2017/texmf-var/luatex-cache/generic)
luaotfload | main : initialization completed in 0.160 seconds
Babel <3.18> and hyphenation patterns for 1 language(s) loaded.
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2014/09/29 v1.4h Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size11.clo
luaotfload | db : Font names database not found, generating new one.
luaotfload | db : This can take several minutes; please be patient.(save: /nfs/users/nfs_j
/jkb/.texlive2017/texmf-var/luatex-cache/generic/fonts/otl/lmroman10-regular.lu
a)(save: /nfs/users/nfs_j/jkb/.texlive2017/texmf-var/luatex-cache/generic/fonts
/otl/lmroman10-regular.luc)))
(/usr/share/texlive/texmf-dist/tex/latex/base/fontenc.sty
(/usr/share/texlive/texmf-dist/tex/latex/base/t1enc.def)
(/usr/share/texmf/tex/latex/lm/t1lmr.fd))
(/usr/share/texlive/texmf-dist/tex/latex/geometry/geometry.sty

... snip ...

(/usr/share/texlive/texmf-dist/tex/latex/siunitx/siunitx-abbreviations.cfg)
(/usr/share/texlive/texmf-dist/tex/latex/siunitx/siunitx-binary.cfg)
! Undefined control sequence.
l.57 \Ac
      {BED} is a whitespace-delimited file format, where each~\textbf{file} ...

I get the same failure with pdflatex too though so it's not a lualatex thing.

Is this some capitalization issue? I see \ac{foo} everywhere except at the start of sentences where it is \Ac{foo}, yet in the file the only definition I see is for '\AC'. I thought latex was case sensitive so I don't understand this.

Replacing them with the lowercase version gives me a PDF that builds, but neither xpdf nor evince can open it. Doing a make clean and rebuilding then gave me an error bizarrely, so I'm really not sure what's going on.

! LaTeX Error: Environment adjustwidth undefined.

See the LaTeX manual or LaTeX Companion for explanation.
Type  H <return>  for immediate help.
 ...                                              
                                                  
l.158 \begin{adjustwidth}
                       {-0.5in}{-0.5in}

I'm also getting gitinfo warnings:

Package gitinfo2 Warning: I can't find the file '.git/gitHeadInfo.gin'.
(gitinfo2)                All git metadata has been set to '(None)'.

This is all on an Ubuntu 18.04 machine with the standard installed latex packages. I don't have admin rights on any of our machines, but can ask to get packages installed.

@jkbonfield
Copy link
Contributor

I just noticed John's comment - yes adding \usepackage[strict]{changepage} cures it. Evince can even view it now. :-)

@jmarshall
Copy link
Member

\Ac is a for-the-start-of-sentences variant of \ac, added in acronym 1.42 in Nov 2019. Presumably your TeX installation's edition of acronym.sty is too old to have it.

The gitinfo warnings are because you haven't installed the hook into your git repository or otherwise run the post-xxx-sample script. I don't think we should be requiring people to know how to set up repository hooks, but I left that out of my previous comment because I thought I had already flogged the gitinfo horse quite a bit…

@jkbonfield
Copy link
Contributor

Agreed IMO we shouldn't use gitinfo for the same reasons you mentioned. It's not as useful as the existing infrastructure there.

Our servers are running Ubuntu 18.04.5 (18.0 was released in 2018, with 18.04.5 in Aug 2020). It's a Long Term Support release, supported up to 2023. I could download a newer acronym.sty and install locally, but IMO if we have dependencies on specific versions then we should include them in the hts-specs respository itself so it builds cleanly on all "current" OS distributions, or provide a "bootstrap.sh" or similar that downloads and caches them automatically.

However in this case, the only time that command was used was on BED and UCSC acronyms, both starting with a capital letter, so we'd be better off just using the lowercase variant to avoid the recency dependency.

@michaelmhoffman
Copy link
Contributor Author

michaelmhoffman commented May 27, 2021

My plans:

  • re-add the changepage package (removed in an incorrect attempt to get this working on CircleCI)
  • switch from using gitinfo2 to using genversion.sh for revision info
  • \providecommand for \Ac
  • fix the acronym missing references issue described by @jmarshall in Switch from rerun.sh to latexmk #569.

I have added some strikethrough styling to the pull request description to reflect planned changes.

@jmarshall
Copy link
Member

Re the \Ac thing, you could add something like the following to your preamble and carry on using both \Ac and \ac as appropriate to keep your linter happy:

% work around outdated acronym.sty packages
\providecommand*{\Ac}[1]{\ac{#1}}

(TIL about \providecommand — very handy in just this circumstance!)

@michaelmhoffman
Copy link
Contributor Author

I believe I've fixed the discussed issues. The CircleCI build image still doesn't have footnotehyper though.

! LaTeX Error: File `footnotehyper.sty' not found.

Type X to quit or <RETURN> to proceed,
or enter new name. (Default extension: sty)

Enter file name: 
! Emergency stop.
<read *> 
   
l.17 \usepackage
              [strict]{changepage}
 264 words of node memory still in use:
   2 hlist, 1 rule, 42 glue_spec, 1 write, 1 dir, 1 pdf_colorstack nodes
   avail lists: 2:17,3:1,6:2,7:1,9:2,12:1
!  ==> Fatal error occurred, no output PDF file produced!
Transcript written on BEDv1.log.
Latexmk: Missing input file: 'footnotehyper.sty' from line
  '! LaTeX Error: File `footnotehyper.sty' not found.'
Latexmk: Errors, so I did not complete making targets
Collected error summary (may duplicate other messages):
  pdflatex: Command for 'pdflatex' gave return code 256
Latexmk: Use the -f option to force complete processing,
 unless error was exceeding maximum runs of latex/pdflatex.
Makefile:40: recipe for target 'new/BEDv1.pdf' failed
make: *** [new/BEDv1.pdf] Error 12

Exited with code exit status 2

CircleCI received exit code 2

@jkbonfield @jmarshall Do you want to test that this works on your local TeX installation?

Also, we have yet to get to diff generation and if the build is still broken there, I would consider setting things up so that building a latexdiff PDF does not result in a build failure. Latexdiff doesn't understand everything and there are occasionally things that require manual intervention.

@jmarshall
Copy link
Member

That builds locally for me, though the table of acronyms is badly formatted (I haven't investigated yet).

I have started looking into updating the CI installation, and have been having similar thoughts about having a borked latexdiff not cause complete failure…

@michaelmhoffman
Copy link
Contributor Author

The formatting of the acronym list in the revised version is greatly improved.

@simonbrent
Copy link

The spec document states:

line separator: Either carriage return, line feed, or carriage return followed by line feed. The same
line separator must be used throughout the file.
...
1.3.1 Data lines
Data lines contain feature information. A data line is composed of fields separated by whitespace.
The whitespace must match the regex [[:space:]]+3
....
3
[[:space:]] includes the following characters: space, form-feed, newline, carriage-return, tab, and vertical-tab

It does not make sense that fields can be separated by newline or carriage-return, when those are the designated line separators.
I also think there should be consistency of language here: One refers to line feed and the other to newline (also "carriage return" and "carriage-return") - if the line separator must be one of "\r", "\n", or "\r\n", I think it should be stated as such, rather than leaving it a bit more open to interpretation, given that "line feed" can be taken to mean "\r\n", which would make "carriage return followed by line feed" = "\r\r\n"

@simonbrent
Copy link

simonbrent commented Jun 9, 2021

The spec states:

> chromEnd: End position of the feature on the chromosome or scaffold. chromEnd must be
an integer greater than or equal to the value of chromStart

In a 0-based, half-open coordinate system, shouldn't end be greater than start? Otherwise conversion to a 1-based system would result in a feature with a end greater than its start.

(The same applies to the thickEnd definition)

I see that the definitions are such that insertions can be described as end = start

@fkokocinski
Copy link

A data line is composed of fields separated by whitespace.

I know it would invalidate some existing files, but if we have the chance to specify tab "\t" as a separater going forward it would make the format much more stable, i.e. safer to parse, in my opinion.

@pdl
Copy link

pdl commented Jun 9, 2021

Getting a formal spec for BED would be great and it's really great to have this document as a basis for it. In order to achieve its purpose, the spec will need to address three concerns:

  1. How to represent features as BED
  2. How to determine if a file is conformant
  3. How to parse a conformant file

At the moment the spec feels like it's more descriptive than normative, i.e. that it is indicating the diversity in how information is represented in the files that currently exist and are interpreted by existing tooling, rather than establishing expectations for predictable behaviour which you'd need in order to write tooling in the future. The spec should absolutely be informed by a corpus of real files and hopefully this will be used to underpin the spec with a suite of tests as other specs in this repo are, but I suspect at least some existing documents might be found to be inconsistent with such a spec.

Though lines may use any kind of whitespace as a delimiter between fields, a single tab (\t) should
be used. This is because almost all tools support tabs while some tools do not support other kinds
of whitespace. Also, whitespace within the name field may be used only if the field delimiter is tab
throughout the file.

How is a parser supposed to determine that "the field delimiter is tab throughout the file", if other white space can be parsed as either a separator or as field content? This would lead to ambiguous parsing and also implies the entire file must be read before parsing begins.

The whitespace must match the regex [[:space:]]+

Again, this leads to ambiguity. If a column is empty (as user-defined fields are permitted to be), then how does a parser differentiate between two column separators and one column separator comprised of multiple whitespace characters?

Standardising on tab only as a field separator would solve these and mean that tools such as cut would work as expected on conformant files.

Again, to avoid ambiguity, fields should also not contain \r or \n (name and user-defined fields currently allow this), and it might be wise to specify that \x00..\x1f are not permitted in field data.

Related to this, the handling of null values seems inconsistently specified and documented - this should be clearer:

  • It seems like the first three columns must be present and not empty, but I'm not sure this is explicit
  • The regex for name allows it to be empty, but the text does not (unless delimiter is '\t')
  • A null value for score it seems should be encoded as 0 - the 'default value' according to the spec. It's not clear if 0 is always supposed to be interpreted as being of null content, or whether 'default value' means that parsers should assume that empty string means 0. It's not clear whether scores are arbitrary or whether scores should be scaled to the range 1..1000, and if so whether features with a score of 1000 are the most 'prominent'.
  • For strand, a dot is specified, so presumably this cannot be empty.
  • thickStart, thickEnd say "There is no specified default value"; thickEnd says "If this field is not specified but thickStart is" - does this mean "if the column doesn't exist", or "if the field has an empty string value"?
  • For blockCount, 'Null or empty', is disallowed, implying there is a distinction and these would be represented differently - how?

It's not clear how to distinguish between e.g. a BED6+6 file and a BED12 file.

@andrewyatz
Copy link
Contributor

How is a parser supposed to determine that "the field delimiter is tab throughout the file", if other white space can be parsed as either a separator or as field content? This would lead to ambiguous parsing and also implies the entire file must be read before parsing begins.

From experience, if you look at programs such as bedToBigBed there are explicit command line switches to enforce the use of tabs. Without that kind of change or some form of metadata payload in the bed file (out of scope as this spec is defined) then any whitespace must be assumed

@JspSrs
Copy link

JspSrs commented Jun 9, 2021

Regarding topic 3.2, on RGB color coding, I think a good point would be to add an additional line advising to avoid a Red/Green color scheme.
Something like: itemRgb: Eight or fewer colors should be used as too many colors may slow down visualizations
and are difficult for humans to distinguish.8
Avoid using a simple Red/Green color scheme for positive and negative, but add Yellow or Blue, as in proper scientific color use.

The reason for this is the many colorblind individuals (mostly males, but including females). Off note, colorblindness is quite a spectrum of color nuances due to simple polymorphisms to seeing dar-red and dark-green as the same kind of greyish. And for many years avoiding simple Red/Green use is supported by important journals, including Nature (i.e see their tutorials on article writing and artwork)

A line on using transparency to reduce color intensity seems worthwhile also, when establishing a formal standard. Unfortunately, GA4GH/ClinGen uses Red for deletions and Blue for gains in copy number visualization, while Blue often stands for 'colder' and 'less', while Red often is 'more intens' and 'fire'.

@JspSrs
Copy link

JspSrs commented Jun 9, 2021

Note for [email protected]: the domain is not accepting email. While this lsg-coordinator is the mailto link in the GA4GH call for Public Comment on the BED file format.

_ Hello [email protected],

_We're writing to let you know that the group you tried to contact (LSG-Coordinator) may not exist, or you may not have permission to post messages to the group. A few more details on why you weren't able to post:

  • You might have spelled or formatted the group name incorrectly.
  • The owner of the group may have removed this group.
  • You may need to join the group before receiving permission to post.
  • This group may not be open to posting.

If you have questions related to this or any other Google Group, visit the Help Center at https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsupport.google.com%2Fa%2Fga4gh.org%2Fbin%2Ftopic.py%3Ftopic%3D25838&amp;data=04%7C01%7Cj.saris%40erasmusmc.nl%7C6a89f5c950b84cc4720408d92b59af03%7C526638ba6af34b0fa532a1a511f4ac80%7C0%7C1%7C637588486455819461%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=%2FCVTUGF3kZ40hlrRZ3Xy4Vio8qLmWdrF55kr0mAIuiI%3D&amp;reserved=0.

Thanks,

ga4gh.org admins__

@ZhenyuZ
Copy link

ZhenyuZ commented Jun 9, 2021

Related to sorting in the draft spec

  1. For non-major contigs, how could numeric sorting requirement be implemented? Unlike SAM spec, the bed spec lacks description on how to specify reference genome or contigs in the header
  2. For two features with the same chrom, chromStart and chromEnd, do we need specification on how to sort them?

@michaelmhoffman
Copy link
Contributor Author

I know it would invalidate some existing files, but if we have the chance to specify tab "\t" as a separater going forward it would make the format much more stable, i.e. safer to parse, in my opinion.

@fkokocinski Thank you for your suggestion. I agree it would be nice, but we designed this version of the specification to formalize the existing BED description, which clearly allows other whitespace as a delimiter. There are probably many extant BED files, parsers, and generators that use non-tab whitespace, and it is undesirable for a "v1" specification to declare them non-conforming years after their creation according to a sensible reading of the previous description.

The "Recommended practice for the BED format" section recommends that a single tab be used as this presents many advantages.

@michaelmhoffman
Copy link
Contributor Author

@pdl Thank you for considering this specification so carefully. Some replies below:

Getting a formal spec for BED would be great and it's really great to have this document as a basis for it. In order to achieve its purpose, the spec will need to address three concerns:

1. How to represent features as BED
2. How to determine if a file is conformant
3. How to parse a conformant file

At the moment the spec feels like it's more descriptive than normative, i.e. that it is indicating the diversity in how information is represented in the files that currently exist and are interpreted by existing tooling, rather than establishing expectations for predictable behaviour which you'd need in order to write tooling in the future.

The spec is normative in that it clearly indicates that some files are valid and that others are invalid. I understand why you would say it feels more descriptive, in that, we are quite a bit more flexible about what is valid than what the drafter of a greenfield specification might allow. This is to avoid declaring files and implementations incompatible with this first version of the specification years after they were created, when those files and implementations were created according to reasonable interpretations of the previous BED description.

The spec should absolutely be informed by a corpus of real files and hopefully this will be used to underpin the spec with a suite of tests as other specs in this repo are, but I suspect at least some existing documents might be found to be inconsistent with such a spec.

We have created test files for many corner cases as a separate project. We hope to release a preprint on it later this summer.

Though lines may use any kind of whitespace as a delimiter between fields, a single tab (\t) should
be used. This is because almost all tools support tabs while some tools do not support other kinds
of whitespace. Also, whitespace within the name field may be used only if the field delimiter is tab
throughout the file.

How is a parser supposed to determine that "the field delimiter is tab throughout the file", if other white space can be parsed as either a separator or as field content? This would lead to ambiguous parsing and also implies the entire file must be read before parsing begins.

We wrote the spec to indicate whether files are valid or not, and does not guarantee an unambiguous parsing. I know this is undesirable but it is difficult to avoid without declaring invalid existing files and implementations created under a reasonable interpretation of the previous description. The recommendation is to use tabs only for new files.

Parsers could infer that the delimiter is tab only from the first line, and can then report an error if they run into a line that cannot be parsed later. Unfortunately I believe this could lead to some BED files considered valid under this specification not being parsed. The only way to guarantee that this will happen while still enabling the advantages of using tabs is to use out-of-band information declaring that the only delimiter is tab, as @andrewyatz points out.

Related to this, the handling of null values seems inconsistently specified and documented - this should be clearer:

* It seems like the first three columns must be present and not empty, but I'm not sure this is explicit

The current text says Each~\textbf{data line} contains between~3 and 12~whitespace-delimited~\textbf{field}s. The first 3~\textbf{field}s are mandatory, and the last 9~\textbf{field}s are optional. Do you think this is explicit enough?

* A null value for score it seems should be encoded as 0 - the 'default value' according to the spec. It's not clear if 0 is always supposed to be interpreted as being of null content, or whether 'default value' means that parsers should assume that empty string means 0. It's not clear whether scores are arbitrary or whether scores should be scaled to the range 1..1000, and if so whether features with a score of 1000 are the most 'prominent'.

Will address the default value issue better. The semantics of score are out-of-band information and not discussed in this spec.

@michaelmhoffman
Copy link
Contributor Author

michaelmhoffman commented Jun 25, 2021

Thank you everyone for all your helpful comments! @niujeffrey can you please draft the following changes by making a pull request against michaelmhoffman/hts-specs

  • change the field separator to not include newline or carriage return (thanks @simonbrent)
  • "line feed" -> "newline" (thanks @simonbrent)
  • "carriage-return" -> "carriage return" (thanks @simonbrent)
    - [ ] "form-feed" -> "form feed"
    - [ ] "vertical-tab" -> "vertical tab"
  • define "newline" on first use as '\n' (thanks @simonbrent)
  • clarify that when chromEnd is equal to chromStart represents a zero-length feature, which is a feature between two bases such as an insertion. chromStart=0, chromEnd=0 represents an insertion before the first nucleotide of a chromosome (thanks @simonbrent)
  • specify that fields, including custom fields, can only be empty when a single tab is used as the delimiter (thanks @pdl)
  • specify that field data is ASCII printable characters only---the range '\x20' to '\x7e' (thanks @pdl)
  • name: change regex so that it cannot be empty (thanks @pdl)
  • score: clarify that 0 should be used in BED5+ files where a score attribute of features would be uninformative (thanks @pdl)
  • strand:
    • clarify that . is the default value, and that a parser should treat BED5 files as if they have strand=.
    • explicitly specify that it cannot be empty in a BED6+ file (thanks @pdl)
  • thickEnd: clarify that the field is not specified but thickStart is means BED files that are not BED8+ (thanks @pdl)
  • change "null or empty" to only "empty" (thanks @pdl)
  • add recommendation to use colorblind-friendly color schemes, and especially to avoid red-green color schemes (thanks @JspSrs)
  • sorting:
    • add that arbitrary orderings of chrom are allowed as long as all lines with the same chrom value occur consecutively (thanks @ZhenyuZ)
    • specify that multiple features with the same chrom, chromStart, and chromEnd may appear in any order
  • add a section, before "UCSC track files", discussing information that is supplied out-of-band. This should include
    • which of the first 4-12 fields are standard BED fields and which are custom fields (thanks @pdl)
    • genome assembly used
    • semantics of score, itemRgb, thick vs. thin positions, block vs. non-block positions
    • definitions of custom fields
    • whether tab is the only delimiter between fields (thanks @pdl, thanks @andrewyatz)
  • add acknowledgments to any of the folks thanked in the above checklist, with full name and affiliation

@pdl
Copy link

pdl commented Jun 28, 2021

Thanks, @andrewyatz @michaelmhoffman @niujeffrey for indicating the required out-of-band information. All the changes proposed look great. I realise you don't want to innovate at this stage but in some version of the specification I hope it will be possible to standardise how to convey these things in-band (such as via a structured initial comment line) so that files can be self-contained.

@michaelmhoffman
Copy link
Contributor Author

Thanks @pdl, yes that is certainly something we have discussed. Having this specification makes it easier to identify both out-of-band information that we could formally specify and other warts and ambiguities that we could eliminate in a future specification. For example, one could imagine a "BEDX" specification which specifies files that are all valid BED v1, but also has structured comments with out-of-band info, single tab as the only allowed field separator, and other aspects where there are multiple choices now eliminated.

michaelmhoffman added a commit to michaelmhoffman/hts-specs that referenced this pull request Jun 28, 2021
Addresses public comments received on samtools#570:

- [x] change the field separator to not include newline or carriage return ([thanks](#issuecomment-857452438) @simonbrent)
- [x] "line feed" -> "newline" ([thanks](#issuecomment-857452438) @simonbrent)
- [x] "carriage-return" -> "carriage return" ([thanks](#issuecomment-857452438) @simonbrent)
- [x] define "newline" on first use as `'\n'` ([thanks](#issuecomment-857452438) @simonbrent)
- [x] clarify that when `chromEnd` is equal to `chromStart` represents a zero-length feature, which is a feature between two bases such as an insertion. `chromStart=0`, `chromEnd=0` represents an insertion before the first nucleotide of a chromosome ([thanks](#issuecomment-857460372) @simonbrent)
- [x] specify that fields, including custom fields, can only be empty when a single tab is used as the delimiter ([thanks](#issuecomment-857475922) @pdl)
- [x] specify that field data is ASCII printable characters only---the range `'\x20'` to `'\x7e'` ([thanks](#issuecomment-857475922) @pdl)
- [x] `name`: change regex so that it cannot be empty ([thanks](#issuecomment-857475922) @pdl)
- [x] `score`: clarify that `0` should be used in BED5+ files where a `score` attribute of features would be uninformative ([thanks](#issuecomment-857475922) @pdl)
- [x] `strand`:
  - [x] clarify that `.` is the default value, and that a parser should treat BED5 files as if they have `strand=.`
  - [x] explicitly specify that it cannot be empty in a BED6+ file ([thanks](#issuecomment-857475922) @pdl)
- [x] `thickEnd`: clarify that the field is not specified but `thickStart` is means BED files that are not BED8+ ([thanks](#issuecomment-857475922) @pdl)
- [x] change "null or empty" to only "empty" ([thanks](#issuecomment-857475922) @pdl)
- [x] add recommendation to use colorblind-friendly color schemes, and especially to avoid red-green color schemes ([thanks](#issuecomment-857788414) @JspSrs)
- [x] sorting:
  - [x] add that arbitrary orderings of `chrom` are allowed as long as all lines with the same `chrom` value occur consecutively ([thanks](#issuecomment-857955521) @ZhenyuZ)
  - [x] specify that multiple features with the same `chrom`, `chromStart`, and `chromEnd` may appear in any order
- [x] add a section, before "UCSC track files", discussing information that is supplied out-of-band. This should include
  - [x] which of the first 4-12 fields are standard BED fields and which are custom fields ([thanks](#issuecomment-857475922) @pdl)
  - [x] genome assembly used
  - [x] semantics of `score`, `itemRgb`, thick vs. thin positions, block vs. non-block positions
  - [x] definitions of custom fields
  - [x] whether tab is the only delimiter between fields ([thanks](#issuecomment-857475922) @pdl, [thanks](#issuecomment-857508028) @andrewyatz)
- [x] add acknowledgments to any of the folks thanked in the above checklist, with full name and affiliation

Additionally, define "field separator", remove form feed and vertical tab from valid field separators, specify "data line" instead of "line" in several places, and correct some places where boldface was not used but should be.

Co-authored-by: Michael Hoffman <[email protected]>
@michaelmhoffman
Copy link
Contributor Author

Thank you to all who have provided feedback. Getting this file to work on CircleCI is still in progress (see #576). I have used GitHub Releases to upload revised PDFs of:

latexdiff is not perfect and it has decided that all the tables have changed even where they haven't. Nonetheless the latexdiff should help show the differences from the last version.

@michaelmhoffman
Copy link
Contributor Author

michaelmhoffman commented Sep 1, 2021

I noticed comment line, blank line, and data line are in boldface but they are not defined separately as terms.

@brainstorm
Copy link
Contributor

brainstorm commented Sep 16, 2021

Just a minor observation on terminology: a few of the htslib-supported formats have the concept of "entries" as "records" (lines?) whereas this BED spec refers to those lines as "data lines"?

Would it make sense to substitute "data lines" for "records" or there's some historical reason for the naming that I'm missing here?

@michaelmhoffman
Copy link
Contributor Author

This is a line-based format, and for example has a line separator rather than a record separator. I think it is easier to understand the way it is.

@arq5x
Copy link

arq5x commented Sep 23, 2021

This is a line-based format, and for example has a line separator rather than a record separator. I think it is easier to understand the way it is.

Completely agree.

@brainstorm
Copy link
Contributor

brainstorm commented Sep 23, 2021

This is a line-based format, and for example has a line separator rather than a record separator. I think it is easier to understand the way it is.

Completely agree.

It's alright, while implementing BED support based on this spec I observed that the rest of the API surface referred to "records" across almost all other bioinfo formats. But I understand that users and already existing implementations might be more used to "data lines" (generational and/or historical argument?). My argument was raised on the aim to have some more API term(s) uniformity, so better cognitively across formats.

Personally I don't see a clear line between "record separator" and "line separator" pointed out by @michaelmhoffman since both terms seem quite abstract and interchangeable to me? Maybe stretching it, records and lines are different in the "human-readable" sense, but they are bytes anyway at the end of the day.

Anyways, data lines it is then ;)

michaelmhoffman added a commit to michaelmhoffman/hts-specs that referenced this pull request Sep 23, 2021
@michaelmhoffman
Copy link
Contributor Author

I have addressed the remaining comments from this issue and the peer review committee.

@michaelmhoffman
Copy link
Contributor Author

When merging the Makefile changes in 3cfa33e, will need to set target-specific variable LATEXMK_ENGINE=--lualatex for BEDv1.tex.

Can take a similar approach with a target-specific variable LATEXDIFF_ENGINE=--config LATEX=lualatex for diff/BEDv1.pdf.

@michaelmhoffman
Copy link
Contributor Author

The GA4GH Steering Committee has approved this specification. Thank you to everyone who has contributed to this process.

Further work on this pull request will be on any technical changes required to get this pulled into the master branch of hts-specs. Please file separate issues for any substantive BED specification concerns.

@michaelmhoffman
Copy link
Contributor Author

This fails CircleCI build because the current code requires a later version of TeX Live (see #576).

Additionally, CircleCI never gets this far, but locally I cannot build diff/BEDv1.pdf due to an error. I believe this error comes from a bad interaction between the hyperref package and the --only-changes option to latexdiff. After building a whole change document, latexdiff attempts to use gs to extract only changed pages, but then those pages have PDF bookmarks pointing to pages that no longer exist:

[...]
Output written on BEDv1.pdf (8 pages, 434584 bytes).
Transcript written on BEDv1.log.
Argument "9.53.3\n" isn't numeric in numeric ge (>=) at /bin/latexdiff-vc line 543.
Executing gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sPageList= -sOutputFile="BEDv1-changedpage.pdf" "BEDv1.pdf"GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages .
GPL Ghostscript 9.53.3: ERROR: A pdfmark destination page 8 points beyond the last page 1.

@jmarshall What is the best way forward on these two issues? For me, the ideal situation would be that the CircleCI TeX Live be upgraded so the current code works as-is and that the diff issue be ignored. But you may feel otherwise 😄

@michaelmhoffman
Copy link
Contributor Author

@jmarshall Building BEDv1.pdf disabled by default and build passes. This is ready to merge.

Description of Makefile changes in the message of 575e679. Use make WITH_LUALATEX=1 to include BEDv1.pdf in building and cleaning. Or put WITH_LUALATEX=1 in your GNUmakefile.

Can you please check in a BEDv1.pdf? You can use https://github.com/michaelmhoffman/hts-specs/releases/download/20210923/BEDv1.pdf but it sounds like your process is to build it yourself.

michaelmhoffman and others added 5 commits November 18, 2021 21:41
Source: ga4gh/ga4gh-bed#2, which pulls from michaelmhoffman/ga4gh-bed@7f13c7f

[Rebased onto mainline latexmk infrastructure changes.]
Add BEDv1.pdf to Makefile.
Departures from previous practice in this repository:

- [overrides LATEXMK_ENGINE] I developed the document with `lualatex`
  (included in TeX Live) as I usually do instead of `pdflatex`, because it
  fixes some warts especially in font selection and is where I understand
  TeX engine development has been focused for years. If necessary, I could
  change the font setup to use `pdflatex` instead.

add BED to `MAINTAINERS.md`

add version details from BEDv1.ver rather than using gitinfo2

add fallback `\providecommand` for `\Ac`

hack package `acronym` hyperlink using option `nohyperlinks` instead
make acronym list single-spaced and align to longest acronym
Addresses public comments received on samtools#570:

- [x] change the field separator to not include newline or carriage return ([thanks](#issuecomment-857452438) @simonbrent)
- [x] "line feed" -> "newline" ([thanks](#issuecomment-857452438) @simonbrent)
- [x] "carriage-return" -> "carriage return" ([thanks](#issuecomment-857452438) @simonbrent)
- [x] define "newline" on first use as `'\n'` ([thanks](#issuecomment-857452438) @simonbrent)
- [x] clarify that when `chromEnd` is equal to `chromStart` represents a zero-length feature, which is a feature between two bases such as an insertion. `chromStart=0`, `chromEnd=0` represents an insertion before the first nucleotide of a chromosome ([thanks](#issuecomment-857460372) @simonbrent)
- [x] specify that fields, including custom fields, can only be empty when a single tab is used as the delimiter ([thanks](#issuecomment-857475922) @pdl)
- [x] specify that field data is ASCII printable characters only---the range `'\x20'` to `'\x7e'` ([thanks](#issuecomment-857475922) @pdl)
- [x] `name`: change regex so that it cannot be empty ([thanks](#issuecomment-857475922) @pdl)
- [x] `score`: clarify that `0` should be used in BED5+ files where a `score` attribute of features would be uninformative ([thanks](#issuecomment-857475922) @pdl)
- [x] `strand`:
  - [x] clarify that `.` is the default value, and that a parser should treat BED5 files as if they have `strand=.`
  - [x] explicitly specify that it cannot be empty in a BED6+ file ([thanks](#issuecomment-857475922) @pdl)
- [x] `thickEnd`: clarify that the field is not specified but `thickStart` is means BED files that are not BED8+ ([thanks](#issuecomment-857475922) @pdl)
- [x] change "null or empty" to only "empty" ([thanks](#issuecomment-857475922) @pdl)
- [x] add recommendation to use colorblind-friendly color schemes, and especially to avoid red-green color schemes ([thanks](#issuecomment-857788414) @JspSrs)
- [x] sorting:
  - [x] add that arbitrary orderings of `chrom` are allowed as long as all lines with the same `chrom` value occur consecutively ([thanks](#issuecomment-857955521) @ZhenyuZ)
  - [x] specify that multiple features with the same `chrom`, `chromStart`, and `chromEnd` may appear in any order
- [x] add a section, before "UCSC track files", discussing information that is supplied out-of-band. This should include
  - [x] which of the first 4-12 fields are standard BED fields and which are custom fields ([thanks](#issuecomment-857475922) @pdl)
  - [x] genome assembly used
  - [x] semantics of `score`, `itemRgb`, thick vs. thin positions, block vs. non-block positions
  - [x] definitions of custom fields
  - [x] whether tab is the only delimiter between fields ([thanks](#issuecomment-857475922) @pdl, [thanks](#issuecomment-857508028) @andrewyatz)
- [x] add acknowledgments to any of the folks thanked in the above checklist, with full name and affiliation

Additionally, define "field separator", remove form feed and vertical tab from valid field separators, specify "data line" instead of "line" in several places, and correct some places where boldface was not used but should be.

Co-authored-by: Michael Hoffman <[email protected]>
…0210830 draft]

* Update BEDv1.tex

* polished edits

* Edits in response to public comments 2021-06-28 through 2021-07-09

* Edits in response to public comments 2021-06-28 through 2021-07-09 fix typo

* Edits in reponse to GA4GH PRC

* line edit

* Edits in response to public comments 2021-07-09

* further line edits and footnote fixing

* fix texlint issues

* WIP address uninformative/default/empty issues

* fix empty/uninformative issues

* clarify language on BED fields not being empty

* add special-case `diff/BEDv1.pdf` target that uses lualatex

* fix minor typo

Co-authored-by: Michael Hoffman <[email protected]>
@michaelmhoffman
Copy link
Contributor Author

@jmarshall Is there a chance you could check in BEDv1.pdf soon? GA4GH wants to do some publicity for it and it would be helpful to have a more stable URL? Thanks!

* replace `user` with more specific terms; addresses comment on samtools#570

* move part of `Custom fields` description from recommendation to specification

* add constraint on whitespace in custom fields; addresses comment on samtools#570

* change definition of character and string to use printable characters

* exclude Character and String from comma-separated lists

* define `BED field` and `custom field` as terminology and use them when possible

* clarify definition of BEDn+; addresses comment on samtools#570
…bles

No longer have a hard-coded special command line for this target.

Add `LATEXDIFF_ENGINE` variable to serve the same purpose
as `LATEXMK_ENGINE` but for latexdiff.
in new `Discrete genomic feature data files` heading
@jmarshall
Copy link
Member

This PR has now been rebased to after the change to latexmk on the repository's mainline, squashed more or less appropriately (while preserving commits corresponding to the various dated drafts along the way), and merged. I regenerated BEDv1.pdf rather than use your PDF from September, as the regenerated one contains the correct commit hash as eventually merged.

I used a different method of temporarily preventing CircleCI from attempting to build BEDv1.pdf, so you won't need to define anything special to build it locally. (See 2a34fbf's change to Makefile's PDFS list.)

The BED specification is now available at https://samtools.github.io/hts-specs/, specifically as BEDv1.pdf. Apologies for the delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.