Add BEDv1 specification #570
[I am editing my first comment to include the original description of this pull request because it is inside baseball technical hts-specs stuff not relevant to most who will find this] Previous draft was in ga4gh/ga4gh-bed#2. Building the TeX into PDF involves departures from previous practice in this repository:
This fails the CircleCI build because I am using the |
The existing documents try to be somewhat conservative in their package requirements. It's not so much about what is available on the CI's installation but what is available on maintainers' and interested contributors' installations, and avoiding making it unnecessarily difficult for people to build the PDFs locally. e.g. @jkbonfield tends to have an elderly installation 😄 — James, do you have |
Re gitinfo2: on the inside, this hook script and Hence I'd recommend just going with the existing [BTW on my complete and fully updated TeX Live (macTeX) 2021 installation, I got |
This fails to build with me:
I get the same failure with pdflatex too though so it's not a lualatex thing. Is this some capitalization issue? I see Replacing them with the lowercase version gives me a PDF that builds, but neither xpdf nor evince can open it. Doing a make clean and rebuilding then gave me an error bizarrely, so I'm really not sure what's going on.
I'm also getting gitinfo warnings:
This is all on an Ubuntu 18.04 machine with the standard installed latex packages. I don't have admin rights on any of our machines, but can ask to get packages installed. |
I just noticed John's comment - yes adding |
The gitinfo warnings are because you haven't installed the hook into your git repository or otherwise run the |
Agreed, IMO we shouldn't use gitinfo for the same reasons you mentioned. It's not as useful as the existing infrastructure there. Our servers are running Ubuntu 18.04.5 (18.04 was released in 2018, with 18.04.5 in Aug 2020). It's a Long Term Support release, supported up to 2023. I could download a newer acronym.sty and install locally, but IMO if we have dependencies on specific versions then we should include them in the hts-specs repository itself so it builds cleanly on all "current" OS distributions, or provide a "bootstrap.sh" or similar that downloads and caches them automatically. However in this case, the only time that command was used was on BED and UCSC acronyms, both starting with a capital letter, so we'd be better off just using the lowercase variant to avoid the recency dependency. |
My plans:
I have added some strikethrough styling to the pull request description to reflect planned changes. |
Re the % work around outdated acronym.sty packages
\providecommand*{\Ac}[1]{\ac{#1}} (TIL about |
I believe I've fixed the discussed issues. The CircleCI build image still doesn't have
@jkbonfield @jmarshall Do you want to test that this works on your local TeX installation? Also, we have yet to get to diff generation and if the build is still broken there, I would consider setting things up so that building a latexdiff PDF does not result in a build failure. Latexdiff doesn't understand everything and there are occasionally things that require manual intervention. |
That builds locally for me, though the table of acronyms is badly formatted (I haven't investigated yet). I have started looking into updating the CI installation, and have been having similar thoughts about having a borked latexdiff not cause complete failure… |
The formatting of the acronym list in the revised version is greatly improved. |
The spec document states:
It does not make sense that fields can be separated by newline or carriage-return, when those are the designated line separators. |
I see that the definitions are such that insertions can be described as end = start |
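The zero-length case raised here can be sketched in a few lines. This is a minimal illustration, assuming the usual BED convention of a 0-based `chromStart` and an exclusive `chromEnd`; the function names `feature_length` and `is_insertion_site` are hypothetical and do not come from the specification.

```python
def feature_length(chrom_start: int, chrom_end: int) -> int:
    """Return the number of bases covered by a half-open interval.

    Assumes chromStart is 0-based and chromEnd is exclusive, so the
    length is simply the difference between the two coordinates.
    """
    if chrom_start < 0 or chrom_end < chrom_start:
        raise ValueError("expected 0 <= chromStart <= chromEnd")
    return chrom_end - chrom_start


def is_insertion_site(chrom_start: int, chrom_end: int) -> bool:
    """True when the feature covers no bases, i.e. it sits between two bases."""
    return feature_length(chrom_start, chrom_end) == 0


# chromStart=0, chromEnd=0: a position before the first nucleotide.
print(is_insertion_site(0, 0))
# chromStart=10, chromEnd=11: a single base, not an insertion site.
print(is_insertion_site(10, 11))
```

Under these assumptions a feature with `chromEnd == chromStart` covers no bases, which is what allows it to describe an insertion point.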
I know it would invalidate some existing files, but if we have the chance to specify tab "\t" as a separator going forward it would make the format much more stable, i.e. safer to parse, in my opinion. |
Getting a formal spec for BED would be great and it's really great to have this document as a basis for it. In order to achieve its purpose, the spec will need to address three concerns:
At the moment the spec feels like it's more descriptive than normative, i.e. that it is indicating the diversity in how information is represented in the files that currently exist and are interpreted by existing tooling, rather than establishing expectations for predictable behaviour which you'd need in order to write tooling in the future. The spec should absolutely be informed by a corpus of real files and hopefully this will be used to underpin the spec with a suite of tests as other specs in this repo are, but I suspect at least some existing documents might be found to be inconsistent with such a spec.
How is a parser supposed to determine that "the field delimiter is tab throughout the file", if other white space can be parsed as either a separator or as field content? This would lead to ambiguous parsing and also implies the entire file must be read before parsing begins.
Again, this leads to ambiguity. If a column is empty (as user-defined fields are permitted to be), then how does a parser differentiate between two column separators and one column separator comprised of multiple whitespace characters? Standardising on tab only as a field separator would solve these and mean that tools such as Again, to avoid ambiguity, fields should also not contain Related to this, the handling of null values seems inconsistently specified and documented - this should be clearer:
It's not clear how to distinguish between e.g. a BED6+6 file and a BED12 file. |
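The empty-column ambiguity described above can be reproduced with a short sketch. The example line and the helper names `split_on_tab` and `split_on_whitespace` are made up for illustration; the line is a hypothetical BED4 record followed by two custom fields, the first of which is empty.

```python
import re
from typing import List


def split_on_tab(line: str) -> List[str]:
    """Split on single tabs only, so empty fields are preserved."""
    return line.rstrip("\r\n").split("\t")


def split_on_whitespace(line: str) -> List[str]:
    """Split on runs of spaces or tabs, so empty fields collapse away."""
    return re.split(r"[ \t]+", line.rstrip("\r\n"))


# Hypothetical line: chrom, chromStart, chromEnd, name, an empty custom
# field, and a second custom field.
line = "chr1\t100\t200\tgeneA\t\tnote\n"

print(split_on_tab(line))         # six fields, the fifth is ''
print(split_on_whitespace(line))  # five fields, the empty one is lost
```

With tab as the only separator the empty field survives as an empty string; once runs of whitespace count as one separator, the two adjacent tabs are indistinguishable from a single delimiter and the column count changes.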
From experience, if you look at programs such as bedToBigBed there are explicit command line switches to enforce the use of tabs. Without that kind of change or some form of metadata payload in the bed file (out of scope as this spec is defined) then any whitespace must be assumed |
Regarding topic 3.2, on RGB color coding, I think it would be good to add a line advising against red/green color schemes. The reason is the many colorblind individuals (mostly males, but also some females). Of note, colorblindness is a spectrum, ranging from subtle differences in color perception due to simple polymorphisms to seeing dark red and dark green as the same kind of greyish color. For many years, important journals, including Nature, have advised against plain red/green color schemes (e.g., see their tutorials on article writing and artwork). A line on using transparency to reduce color intensity also seems worthwhile when establishing a formal standard. Unfortunately, GA4GH/ClinGen uses red for deletions and blue for gains in copy number visualization, whereas blue often stands for 'colder' and 'less', and red for 'more intense' and 'fire'. |
Note for [email protected]: the domain is not accepting email, even though this lsg-coordinator address is the mailto link in the GA4GH call for Public Comment on the BED file format. Hello [email protected], We're writing to let you know that the group you tried to contact (LSG-Coordinator) may not exist, or you may not have permission to post messages to the group. A few more details on why you weren't able to post:
If you have questions related to this or any other Google Group, visit the Help Center at https://support.google.com/a/ga4gh.org/bin/topic.py?topic=25838. Thanks, ga4gh.org admins |
Related to sorting in the draft spec
|
@fkokocinski Thank you for your suggestion. I agree it would be nice, but we designed this version of the specification to formalize the existing BED description, which clearly allows other whitespace as a delimiter. There are probably many extant BED files, parsers, and generators that use non-tab whitespace, and it is undesirable for a "v1" specification to declare them non-conforming years after their creation according to a sensible reading of the previous description. The "Recommended practice for the BED format" section recommends that a single tab be used as this presents many advantages. |
@pdl Thank you for considering this specification so carefully. Some replies below:
The spec is normative in that it clearly indicates that some files are valid and that others are invalid. I understand why you would say it feels more descriptive, in that we are quite a bit more flexible about what is valid than the drafter of a greenfield specification might be. This is to avoid declaring files and implementations incompatible with this first version of the specification years after they were created, when those files and implementations were created according to reasonable interpretations of the previous BED description.
We have created test files for many corner cases as a separate project. We hope to release a preprint on it later this summer.
We wrote the spec to indicate whether files are valid or not; it does not guarantee an unambiguous parsing. I know this is undesirable, but it is difficult to avoid without declaring invalid existing files and implementations created under a reasonable interpretation of the previous description. The recommendation is to use tabs only for new files. Parsers could infer from the first line that the delimiter is tab only, and then report an error if they later run into a line that cannot be parsed. Unfortunately, I believe this could lead to some BED files considered valid under this specification not being parsed. The only way to guarantee that this does not happen while still enabling the advantages of using tabs is to use out-of-band information declaring that the only delimiter is tab, as @andrewyatz points out.
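The first-line inference described above might look roughly like the following. This is only a sketch under stated assumptions: the heuristic (a line is treated as tab-delimited when splitting on tab yields at least three fields with integer second and third fields), the skipping of blank and `#` comment lines, the requirement that every data line has the same number of fields, and the names `parse_bed_lines` and `BedParseError` are all illustrative choices, not taken from the specification or from any existing tool.

```python
import re
from typing import Iterable, Iterator, List, Optional


class BedParseError(ValueError):
    """Raised when a data line cannot be parsed with the inferred delimiter."""


def _is_uint(text: str) -> bool:
    return text.isascii() and text.isdigit()


def _looks_like_bed(fields: List[str]) -> bool:
    # Assumed heuristic: at least chrom, chromStart, chromEnd, with
    # non-negative integer coordinates.
    return len(fields) >= 3 and _is_uint(fields[1]) and _is_uint(fields[2])


def _split(line: str, tab_only: bool) -> List[str]:
    return line.split("\t") if tab_only else re.split(r"[ \t]+", line)


def parse_bed_lines(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield fields per data line, inferring the delimiter from the first one."""
    tab_only: Optional[bool] = None
    n_fields = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if tab_only is None:
            # Decide once, from the first data line only.
            tab_only = _looks_like_bed(line.split("\t"))
            n_fields = len(_split(line, tab_only))
        fields = _split(line, tab_only)
        if len(fields) != n_fields or not _looks_like_bed(fields):
            raise BedParseError(f"line {number}: cannot parse {line!r}")
        yield fields


records = list(parse_bed_lines(["# demo\n", "chr1\t0\t10\tmy gene\n",
                                "chr1\t10\t20\tother\n"]))
print(records)
```

A file whose first data line is tab-delimited but whose later lines use spaces would raise `BedParseError` here, which is the failure mode the comment warns about: the file may be valid under the specification and still be rejected.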
The current text says
Will address the default value issue better. The semantics of |
Thank you everyone for all your helpful comments! @niujeffrey can you please draft the following changes by making a pull request against michaelmhoffman/hts-specs
|
Thanks, @andrewyatz @michaelmhoffman @niujeffrey for indicating the required out-of-band information. All the changes proposed look great. I realise you don't want to innovate at this stage but in some version of the specification I hope it will be possible to standardise how to convey these things in-band (such as via a structured initial comment line) so that files can be self-contained. |
Thanks @pdl, yes that is certainly something we have discussed. Having this specification makes it easier to identify both out-of-band information that we could formally specify and other warts and ambiguities that we could eliminate in a future specification. For example, one could imagine a "BEDX" specification which specifies files that are all valid BED v1, but also has structured comments with out-of-band info, single tab as the only allowed field separator, and other aspects where there are multiple choices now eliminated. |
Addresses public comments received on samtools#570:
- [x] change the field separator to not include newline or carriage return ([thanks](#issuecomment-857452438) @simonbrent)
- [x] "line feed" -> "newline" ([thanks](#issuecomment-857452438) @simonbrent)
- [x] "carriage-return" -> "carriage return" ([thanks](#issuecomment-857452438) @simonbrent)
- [x] define "newline" on first use as `'\n'` ([thanks](#issuecomment-857452438) @simonbrent)
- [x] clarify that when `chromEnd` is equal to `chromStart` represents a zero-length feature, which is a feature between two bases such as an insertion. `chromStart=0`, `chromEnd=0` represents an insertion before the first nucleotide of a chromosome ([thanks](#issuecomment-857460372) @simonbrent)
- [x] specify that fields, including custom fields, can only be empty when a single tab is used as the delimiter ([thanks](#issuecomment-857475922) @pdl)
- [x] specify that field data is ASCII printable characters only---the range `'\x20'` to `'\x7e'` ([thanks](#issuecomment-857475922) @pdl)
- [x] `name`: change regex so that it cannot be empty ([thanks](#issuecomment-857475922) @pdl)
- [x] `score`: clarify that `0` should be used in BED5+ files where a `score` attribute of features would be uninformative ([thanks](#issuecomment-857475922) @pdl)
- [x] `strand`:
  - [x] clarify that `.` is the default value, and that a parser should treat BED5 files as if they have `strand=.`
  - [x] explicitly specify that it cannot be empty in a BED6+ file ([thanks](#issuecomment-857475922) @pdl)
- [x] `thickEnd`: clarify that the field is not specified but `thickStart` is means BED files that are not BED8+ ([thanks](#issuecomment-857475922) @pdl)
- [x] change "null or empty" to only "empty" ([thanks](#issuecomment-857475922) @pdl)
- [x] add recommendation to use colorblind-friendly color schemes, and especially to avoid red-green color schemes ([thanks](#issuecomment-857788414) @JspSrs)
- [x] sorting:
  - [x] add that arbitrary orderings of `chrom` are allowed as long as all lines with the same `chrom` value occur consecutively ([thanks](#issuecomment-857955521) @ZhenyuZ)
  - [x] specify that multiple features with the same `chrom`, `chromStart`, and `chromEnd` may appear in any order
- [x] add a section, before "UCSC track files", discussing information that is supplied out-of-band. This should include
  - [x] which of the first 4-12 fields are standard BED fields and which are custom fields ([thanks](#issuecomment-857475922) @pdl)
  - [x] genome assembly used
  - [x] semantics of `score`, `itemRgb`, thick vs. thin positions, block vs. non-block positions
  - [x] definitions of custom fields
  - [x] whether tab is the only delimiter between fields ([thanks](#issuecomment-857475922) @pdl, [thanks](#issuecomment-857508028) @andrewyatz)
- [x] add acknowledgments to any of the folks thanked in the above checklist, with full name and affiliation

Additionally, define "field separator", remove form feed and vertical tab from valid field separators, specify "data line" instead of "line" in several places, and correct some places where boldface was not used but should be.

Co-authored-by: Michael Hoffman <[email protected]>
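The sorting items in the checklist above can be read as a small check. The sketch below assumes that, within one `chrom`, features are ordered by `chromStart` and then `chromEnd`; that within-chromosome ordering is an assumption made for illustration, as is the function name `is_sorted_bed`. The two checklist items themselves only state that equal `chrom` values must be consecutive and that identical intervals may appear in any order.

```python
from typing import Iterable, Optional, Set, Tuple

Feature = Tuple[str, int, int]  # (chrom, chromStart, chromEnd)


def is_sorted_bed(features: Iterable[Feature]) -> bool:
    """Check that chrom blocks are contiguous and ordered within each block."""
    finished: Set[str] = set()
    current: Optional[str] = None
    previous: Optional[Tuple[int, int]] = None
    for chrom, start, end in features:
        if chrom != current:
            if chrom in finished:
                return False  # the same chrom reappears after another one
            if current is not None:
                finished.add(current)
            current, previous = chrom, None
        key = (start, end)
        # Ties are allowed, so identical intervals may come in any order.
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# chr2 before chr1 is fine: the order of chrom values is arbitrary.
print(is_sorted_bed([("chr2", 0, 5), ("chr2", 0, 5), ("chr1", 3, 4)]))
# chr1 split into two separate blocks is not.
print(is_sorted_bed([("chr1", 0, 5), ("chr2", 0, 5), ("chr1", 7, 9)]))
```

Because chromosomes are only required to be contiguous, no particular naming or collation rule for `chrom` is needed for the check to work.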
Thank you to all who have provided feedback. Getting this file to work on CircleCI is still in progress (see #576). I have used GitHub Releases to upload revised PDFs of:
latexdiff is not perfect and it has decided that all the tables have changed even where they haven't. Nonetheless the latexdiff should help show the differences from the last version. |
I noticed comment line, blank line, and data line are in boldface but they are not defined separately as terms. |
Just a minor observation on terminology: a few of the htslib-supported formats have the concept of "entries" as "records" (lines?), whereas this BED spec refers to those lines as "data lines". Would it make sense to replace "data lines" with "records", or is there some historical reason for the naming that I'm missing here? |
This is a line-based format, and for example has a line separator rather than a record separator. I think it is easier to understand the way it is. |
Completely agree. |
It's alright. While implementing BED support based on this spec, I observed that the rest of the API surface refers to "records" across almost all other bioinfo formats. But I understand that users and existing implementations might be more used to "data lines" (a generational and/or historical argument?). I raised the point with the aim of more uniform API terminology, which would be cognitively easier across formats. Personally I don't see the clear line between "record separator" and "line separator" pointed out by @michaelmhoffman, since both terms seem quite abstract and interchangeable to me. Maybe I'm stretching it: records and lines are different in the "human-readable" sense, but they are bytes anyway at the end of the day. Anyway, data lines it is then ;) |
I have addressed the remaining comments from this issue and the peer review committee.
|
When merging the Can take a similar approach with a target-specific variable |
The GA4GH Steering Committee has approved this specification. Thank you to everyone who has contributed to this process. Further work on this pull request will be on any technical changes required to get this pulled into the master branch of hts-specs. Please file separate issues for any substantive BED specification concerns. |
This fails CircleCI build because the current code requires a later version of TeX Live (see #576). Additionally, CircleCI never gets this far, but locally I cannot build
@jmarshall What is the best way forward on these two issues? For me, the ideal situation would be that the CircleCI TeX Live be upgraded so the current code works as-is and that the diff issue be ignored. But you may feel otherwise 😄 |
@jmarshall Building Description of Can you please check in a |
Source: ga4gh/ga4gh-bed#2, which pulls from michaelmhoffman/ga4gh-bed@7f13c7f [Rebased onto mainline latexmk infrastructure changes.]
Add BEDv1.pdf to Makefile.
Departures from previous practice in this repository:
- [overrides LATEXMK_ENGINE] I developed the document with `lualatex` (included in TeX Live) as I usually do instead of `pdflatex`, because it fixes some warts especially in font selection and is where I understand TeX engine development has been focused for years. If necessary, I could change the font setup to use `pdflatex` instead.
add BED to `MAINTAINERS.md`
add version details from BEDv1.ver rather than using gitinfo2
add fallback `\providecommand` for `\Ac`
hack package `acronym` hyperlink using option `nohyperlinks` instead
make acronym list single-spaced and align to longest acronym
…0210830 draft]
* Update BEDv1.tex
* polished edits
* Edits in response to public comments 2021-06-28 through 2021-07-09
* Edits in response to public comments 2021-06-28 through 2021-07-09 fix typo
* Edits in response to GA4GH PRC
* line edit
* Edits in response to public comments 2021-07-09
* further line edits and footnote fixing
* fix texlint issues
* WIP address uninformative/default/empty issues
* fix empty/uninformative issues
* clarify language on BED fields not being empty
* add special-case `diff/BEDv1.pdf` target that uses lualatex
* fix minor typo

Co-authored-by: Michael Hoffman <[email protected]>
@jmarshall Is there a chance you could check in |
* replace `user` with more specific terms; addresses comment on samtools#570
* move part of `Custom fields` description from recommendation to specification
* add constraint on whitespace in custom fields; addresses comment on samtools#570
* change definition of character and string to use printable characters
* exclude Character and String from comma-separated lists
* define `BED field` and `custom field` as terminology and use them when possible
* clarify definition of BEDn+; addresses comment on samtools#570
…bles No longer have a hard-coded special command line for this target. Add `LATEXDIFF_ENGINE` variable to serve the same purpose as `LATEXMK_ENGINE` but for latexdiff.
in new `Discrete genomic feature data files` heading
This PR has now been rebased to after the change to I used a different method of temporarily preventing CircleCI from attempting to build BEDv1.pdf, so you won't need to define anything special to build it locally. (See 2a34fbf's change to Makefile's The BED specification is now available at https://samtools.github.io/hts-specs/, specifically as BEDv1.pdf. Apologies for the delay. |
Add a Browser Extensible Data (BED) specification.
BEDv1.pdf draft 2021-08-31 revision 2a5cd5d
BEDv1.pdf draft 2021-08-30 revision af943fe
BEDv1.pdf draft 2021-06-28 revision 1caa7c6