Consider using autoSQL in bigbed files to present user flexibility in choosing label field. (Was Remote BigBed File JASPAR2022_hg19.bb displays matrix ID and not transcription factor name) #1089

Closed
malcook opened this issue Jan 18, 2022 · 25 comments

@malcook

malcook commented Jan 18, 2022

Due to a change in the design of the bigBed files (as discussed in wassermanlab/JASPAR-UCSC-tracks#11), the Matrix ID is displayed as the label on the glyph when loaded in the IGV browser.

eg: https://user-images.githubusercontent.com/484282/148572447-ecbdbed0-b798-4bc5-824b-122608323bfe.png

This display is less useful to most end users than displaying the TF name.

The TF name is now present in column 7 of the underlying bed file instead of column 4 (as before).

UCSC genome browser accommodates the change by continuing to display the TF name.

IGV does not.

Arguably IGV could be improved by displaying column 7 value if present, otherwise displaying the name column (4).

(Note: I brought this up tangentially in #1085 (comment), which was resolved without addressing this tangent, so I thought I'd give it its own issue...)

(Note: A workaround could be to reformat the bigBed to use IGV's neat ability to display GFF column 9 formatted attribute-value pairs when they appear in column 4; however, you might agree it would be advantageous to use the bigBeds as produced by wassermanlab.)

@jrobinso
Contributor

I'm not sure what you are suggesting. Obviously we can't have a general rule that if column 7 is present in a bigbed file that it is interpreted as a name.

@malcook
Author

malcook commented Jan 18, 2022

I agree it is probably a "bad idea"(tm).

But...

Would you then conclude with me that the agreement between the wassermanlab and UCSC that "the bigbeds contain the TF name as an extra field" was a bad idea, insofar as these files do not conform to the bigBed spec (despite bigBedToBed happily rematerializing them, as below)?

bigBedToBed -chrom=chr1 -start=10001 -end=10005 http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/JASPAR2022_hg19.bb /dev/stdout 
chr1	10001	10018	MA0883.1	328	-	Dmbx1
chr1	10003	10013	MA0599.1	239	+	KLF5
chr1	10003	10015	MA0712.2	275	-	OTX2
chr1	10004	10013	MA0714.1	268	+	PITX3
chr1	10004	10014	MA0467.2	314	-	Crx
chr1	10004	10014	MA0891.1	265	+	GSC2
chr1	10004	10019	MA1574.1	341	-	THRB

FWIW: Consistent with their use of this arguably non-conforming bigBed format, the UCSC track configuration provides a choice to display 'TF Name' (col 7) instead of MatrixID (col 4) only for the 2022 version of this resource, as can be seen in this screenshot:

image

I guess I was suggesting that IGV follow suit somehow, but I understand if you close issue as not really being IGV's.

@jrobinso
Contributor

I don't know that there is a "spec" for bed files; after the first 3 columns anything goes, which makes it somewhat challenging. In some contexts I think this bed file would be referred to as "bed6+", as the first 6 columns are standard.

This is a custom UCSC track, in general I don't have the resources, in the parlance of our times, to do custom tracks and we don't host this file in any event.

It's possible we could do something for this problem using the autoSQL. The solution would (probably) be to add a menu item the user could use to choose among the available columns for the name (in this case they would choose TFNAME). If you don't mind, we can leave this open and I'll rename it accordingly.

@brainstorm
Contributor

There's actually a very recent, official, spec for BED files. It was merged a month ago: samtools/hts-specs#570 ;)

@jrobinso
Contributor

@brainstorm ok, great, might be helpful in the future. I don't see how it helps with this situation, however. In fact the document says that it does not specify a means of identifying the contents of columns 4-12. This information must be supplied "out-of-band". These are the columns I am referring to when I say there's not really a spec, they are not nailed down and you have to know what they mean by other means.

Some information about a BED file can only be supplied unambiguously separately from the data
lines of the BED file. This specification does not contain a means for interchanging this information.
Information that must be supplied out-of-band include:
• Which of the first 4 to 12 fields are standard BED fields and which are custom fields.

@jrobinso changed the title from "Remote BigBed File JASPAR2022_hg19.bb (100GB+) displays matrix ID and not transcription factor name" to "Consider using autoSQL in bigbed files to present user flexibility in choosing label field. (Was Remote BigBed File JASPAR2022_hg19.bb displays matrix ID and not transcription factor name)" on Jan 18, 2022
@maximilianh

maximilianh commented Jan 18, 2022 via email

@jrobinso
Contributor

@maximilianh I don't think it is a bad idea, and I agree it's entirely compatible. The issue here is that IGV uses the "name" field (column 4) as a label, and @malcook would prefer column 7 for this particular bigBed file. I renamed this issue to suggest that the autoSQL might be useful for IGV to present a choice of fields to the user to use for the label. I will look into this possibility when I have time, so I'll leave the issue open. This is not a bigBed or UCSC issue, sorry for any confusion.

@malcook
Author

malcook commented Jan 19, 2022

the solution would be (probably) to add a menu item the user could use to choose available columns for name (in this case they would choose TFNAME). If you don't mind we can leave this open and I'll rename it accordingly

who could ask for anything more?

@malcook
Author

malcook commented Jan 19, 2022

I don't know that there is a "spec" for bed files,

Referring to the samtools BedV1 specification, I see now that the wassermanlab's files might be thought of as "bed6+1" with a single custom field.

I had been looking at https://genome.ucsc.edu/FAQ/FAQformat.html#format1 which purports to define the range for each column and does not refer to custom fields.

@jrobinso
Contributor

@malcook Understood, in practice we deal with the files as they exist. I think the autoSql might be helpful here.

@maximilianh

maximilianh commented Jan 19, 2022 via email

@jrobinso
Contributor

jrobinso commented Jan 19, 2022

From IGV's perspective this is just a bigBed file, so no, the trackDB is not read, and I'm not even sure how it could be.

@jrobinso
Contributor

@maximilianh @malcook Perhaps a general fix for this problem, which maybe you are suggesting, would be to support loading from a track hub rather than directly from the bigBed. Of course, loading directly from bigBed will always be supported.

@malcook
Author

malcook commented Jan 19, 2022

the trackDB is not read and I'm not even sure how it could be

Hmm. Does it seem like I suggested it could? Did you mean to direct this comment to @maximilianh ?

@jrobinso
Contributor

@malcook yes (meant for maximilianh).

@jrobinso
Contributor

Note to self:

bigBedInfo -as http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/JASPAR2022_hg19.bb
version: 4
fieldCount: 7
hasHeaderExtension: yes
isCompressed: yes
isSwapped: 0
extraIndexCount: 0
itemCount: 12,473,778,656
primaryDataSize: 119,887,888,128
primaryIndexSize: 782,301,588
zoomLevels: 10
chromCount: 93
as:
table JASPAR_TFBS
"TFBS predictions for profiles in the JASPAR CORE collections"
(
    string  chrom;      "Reference sequence chromosome or scaffold"
    uint    chromStart; "Start position of feature on chromosome"
    uint    chromEnd;   "End position of feature on chromosome"
    string  name;       "Matrix ID"
    uint    score;      "Score"
    char[1] strand;     "+ or - for strand"
    string  TFName;     "TF name"
)
basesCovered: 2,897,225,363
meanDepth (of bases covered): 46.102859
minDepth: 1.000000
maxDepth: 993.000000
std of depth: 43.105940
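
The field names in the `as:` block are what a "choose label field" menu could be built from. The following is a rough sketch only, not IGV's implementation: it assumes one field declaration per line in the form `<type> <name>; "<comment>"`, and the names `parse_autosql_fields` and `label_candidates` are hypothetical.

```python
import re
from typing import List, Tuple

# autoSql text copied from the bigBedInfo output in this thread.
AUTOSQL = '''table JASPAR_TFBS
"TFBS predictions for profiles in the JASPAR CORE collections"
(
    string  chrom;      "Reference sequence chromosome or scaffold"
    uint    chromStart; "Start position of feature on chromosome"
    uint    chromEnd;   "End position of feature on chromosome"
    string  name;       "Matrix ID"
    uint    score;      "Score"
    char[1] strand;     "+ or - for strand"
    string  TFName;     "TF name"
)'''

# Assumed layout of a field declaration: <type> <name>; "<comment>"
FIELD_RE = re.compile(r'^\s*(\S+)\s+(\w+)\s*;\s*"([^"]*)"', re.MULTILINE)


def parse_autosql_fields(text: str) -> List[Tuple[str, str, str]]:
    """Return (type, name, comment) for each field declared in the table."""
    return FIELD_RE.findall(text)


def label_candidates(text: str) -> List[str]:
    """Names of text-typed fields, other than chrom, that could serve as a label."""
    return [
        name
        for ftype, name, _comment in parse_autosql_fields(text)
        if ftype in ("string", "lstring") and name != "chrom"
    ]


for ftype, name, comment in parse_autosql_fields(AUTOSQL):
    print(f"{name}\t{ftype}\t{comment}")
print("label candidates:", label_candidates(AUTOSQL))
```

Restricting the candidates to text-typed fields is a design assumption of this sketch; a real menu might equally offer numeric columns such as `score`.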

@maximilianh

Track hubs are supported by Ensembl, NCBI and UCSC. So yes, it would be great if IGV had some support for track hubs. A basic version could be very minimal, shortLabel and longLabel, visibility and type are the most important keywords.
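
To illustrate how small a basic reader could be, here is a minimal sketch of parsing a trackDb file into stanzas of key/value settings. It assumes stanzas are separated by blank lines and each setting is a keyword followed by whitespace and a value; it ignores backslash line continuations and hub/genomes files. The stanza content and the function name `parse_trackdb` are hypothetical and are not taken from the actual JASPAR hub.

```python
from typing import Dict, List

# Hypothetical trackDb.txt content, for illustration only.
EXAMPLE = """\
track jaspar2022
shortLabel JASPAR 2022
longLabel JASPAR 2022 TFBS predictions
type bigBed 6 +
visibility pack
bigDataUrl JASPAR2022_hg19.bb
"""


def parse_trackdb(text: str) -> List[Dict[str, str]]:
    """Split trackDb text into blank-line-separated stanzas of settings."""
    stanzas: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment lines do not end a stanza
        if not line:
            # A blank line closes the stanza being built, if any.
            if current:
                stanzas.append(current)
                current = {}
            continue
        parts = line.split(None, 1)  # keyword, then the rest of the line
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        stanzas.append(current)
    return stanzas


for stanza in parse_trackdb(EXAMPLE):
    # The keywords named above are enough to label and draw a track.
    print(stanza.get("shortLabel"), "|", stanza.get("type"), "|", stanza.get("visibility"))
```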

@jrobinso
Contributor

@maximilianh I will do this. Although IGV is not in the same class as the big server-based browsers you mention, it is certainly worth doing. As a quick fix for JASPAR I'm thinking of just defining a "hosted" track in IGV for at least the human and mouse assemblies, using the basic data from the trackDB. I will not copy those 100GB bb files but rather reference them. Anyway, thanks for the tips and help as always.

@maximilianh

maximilianh commented Jan 20, 2022 via email

@jrobinso
Contributor

Thanks @maximilianh. RE "useOneFile", that would be the decision of the track hub creator, correct? I will support it where it's available.

@malcook
Author

malcook commented Jan 20, 2022

defining a "hosted" track in IGV for at least human and mouse
@jrobinso - could you please include zebrafish in any short-term patch solution? That is the use case that drove my initial request.

@malcook
Author

malcook commented May 18, 2022

@jrobinso - I'm still hoping somehow to be able to display the bigBed's column 7 (TFName) as the glyph label in IGV. Any chance of providing such functionality, possibly as a "workaround", in the near term (preferably not requiring reference to remote track hubs)?

@jrobinso
Contributor

@malcook A workaround would be to convert that file to a standard 12 column bed with the name you want in the standard name column. You can do this with a simple script.
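
A minimal sketch of such a script, assuming the input is the tab-separated bed6+1 text that bigBedToBed prints (as in the output quoted earlier in this thread). The function name `swap_name_column` is hypothetical. This version emits the six standard columns with the TF name in column 4 and drops the matrix ID; that is a choice made for the sketch, not something the thread prescribes.

```python
from typing import Iterable, Iterator


def swap_name_column(lines: Iterable[str], label_col: int = 7) -> Iterator[str]:
    """Yield BED6 lines whose name (column 4) is taken from column `label_col` (1-based)."""
    for line in lines:
        line = line.rstrip("\n")
        # Pass blank lines and comment/track/browser header lines through unchanged.
        if not line or line.startswith(("#", "track", "browser")):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) >= label_col:
            fields[3] = fields[label_col - 1]
        # Keep only the six standard columns.
        yield "\t".join(fields[:6])


# One record from the bigBedToBed output shown in this thread.
sample = ["chr1\t10003\t10013\tMA0599.1\t239\t+\tKLF5"]
for out in swap_name_column(sample):
    print(out)
```

For real data, `sample` could be replaced with `sys.stdin` and the output of bigBedToBed piped in. Given the item count reported by bigBedInfo (over 12 billion), restricting the extraction to a region with the `-chrom`, `-start` and `-end` options shown above is likely the practical route.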

@malcook
Author

malcook commented May 26, 2022

snapshot looks good in my hands. Thanks so much!

@jrobinso
Contributor

@malcook I assume you found the "set label field" menu item.
