diff --git a/CRAMv3.pdf b/CRAMv3.pdf index 0a76485e8..e7e6e6c8c 100644 Binary files a/CRAMv3.pdf and b/CRAMv3.pdf differ diff --git a/SAMtags.pdf b/SAMtags.pdf index 2483fe16f..b06eb0f3d 100644 Binary files a/SAMtags.pdf and b/SAMtags.pdf differ diff --git a/SAMtags.tex b/SAMtags.tex index f9b267cde..5180c76b8 100644 --- a/SAMtags.tex +++ b/SAMtags.tex @@ -2,6 +2,7 @@ \usepackage[margin=1in]{geometry} \usepackage{longtable} \usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref} +\usepackage[title]{appendix} \newcommand{\mailtourl}[1]{\href{mailto:#1}{\tt #1}} \newcommand{\tagvalue}[1]{\tt #1} @@ -55,8 +56,9 @@ \section{Standard tags} \hline {\tt AM} & i & The smallest template-independent mapping quality of segments in the rest \\ {\tt AS} & i & Alignment score generated by aligner \\ - {\tt BC} & Z & Barcode sequence \\ + {\tt BC} & Z & Barcode sequence identifying the sample \\ {\tt BQ} & Z & Offset to base alignment quality (BAQ) \\ + {\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\ {\tt CC} & Z & Reference name of the next hit \\ {\tt CG} & B,I & Intended to store the real {\sf CIGAR} if it contains $>$65535 operations\\ {\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM})\\ @@ -64,7 +66,7 @@ \section{Standard tags} {\tt CP} & i & Leftmost coordinate of the next hit \\ {\tt CQ} & Z & Color read base qualities \\ {\tt CS} & Z & Color read sequence \\ - {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features.\\ + {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features \\ {\tt E2} & Z & The 2nd most likely base calls \\ {\tt FI} & i & The index of segment in the template \\ {\tt FS} & Z & Segment suffix \\ @@ -76,32 +78,36 @@ \section{Standard tags} {\tt H1} & i & Number of 1-difference hits (see also {\tt NM}) \\ {\tt H2} & i & Number of 2-difference hits \\ {\tt HI} & i & Query hit index \\ - {\tt IH} 
& i & Number of stored alignments in SAM that contains the query in the current record\\ + {\tt IH} & i & Number of stored alignments in SAM that contains the query in the current record \\ {\tt LB} & Z & Library \\ - {\tt MC} & Z & CIGAR string for mate/next segment\\ + {\tt MC} & Z & CIGAR string for mate/next segment \\ {\tt MD} & Z & String for mismatching positions \\ {\tt MF} & ? & Reserved for backwards compatibility reasons \\ + {\tt MI} & Z & Molecular identifier; a string that uniquely identifies the molecule from which the record was derived \\ {\tt MQ} & i & Mapping quality of the mate/next segment \\ - {\tt NH} & i & Number of reported alignments that contains the query in the current record\\ + {\tt NH} & i & Number of reported alignments that contains the query in the current record \\ {\tt NM} & i & Edit distance to the reference \\ {\tt OC} & Z & Original CIGAR \\ {\tt OP} & i & Original mapping position \\ {\tt OQ} & Z & Original base quality \\ + {\tt OX} & Z & Original unique molecular barcode bases \\ {\tt PG} & Z & Program \\ {\tt PQ} & i & Phred likelihood of the template \\ {\tt PT} & Z & Read annotations for parts of the padded read sequence \\ {\tt PU} & Z & Platform unit \\ - {\tt QT} & Z & Barcode ({\tt BC} or {\tt RT}) phred-scaled base qualities \\ {\tt Q2} & Z & Phred quality of the mate/next segment sequence in the {\tt R2} tag \\ + {\tt QT} & Z & Phred quality of the sample-barcode sequence in the {\tt BC} (or {\tt RT}) tag \\ + {\tt QX} & Z & Quality score of the unique molecular identifier in the {\tt RX} tag \\ {\tt R2} & Z & Sequence of the mate/next segment in the template \\ {\tt RG} & Z & Read group \\ {\tt RT} & Z & Barcode sequence (deprecated; use {\tt BC} instead) \\ + {\tt RX} & Z & Sequence bases of the (possibly corrected) unique molecular identifier \\ {\tt SA} & Z & Other canonical alignments in a chimeric alignment \\ {\tt SM} & i & Template-independent mapping quality \\ {\tt SQ} & ? 
& Reserved for backwards compatibility reasons \\
 {\tt S2} & ? & Reserved for backwards compatibility reasons \\
 {\tt TC} & i & The number of segments in the template \\
- {\tt U2} & Z & Phred probility of the 2nd call being wrong conditional on the best being wrong \\
+ {\tt U2} & Z & Phred probability of the 2nd call being wrong conditional on the best being wrong \\
 {\tt UQ} & i & Phred likelihood of the segment, conditional on the mapping being correct \\
 {\tt X?} & ? & Reserved for end users \\
 {\tt Y?} & ? & Reserved for end users \\
@@ -244,10 +250,44 @@ \subsection{Barcodes}
 \begin{description}
 \item[BC:Z:\tagvalue{sequence}]
-Barcode sequence, with any quality scores stored in the {\tt QT} tag.
-
-\item[QT:Z:\tagvalue{qualities}]
-Phred quality of the barcode sequence in the {\tt BC} (or {\tt RT}) tag. Same encoding as {\sf QUAL}.
+Barcode sequence (identifying the sample/library), with any quality scores (optionally) stored in the {\tt QT} tag.
+The {\tt BC} tag should match the {\tt QT} tag in length.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes and places a hyphen (`{\tt -}') between the barcodes from the same template.
+
+\item[QT:Z:\tagvalue{qualities}]
+Phred quality of the sample-barcode sequence in the {\tt BC} (or {\tt RT}) tag.
+Same encoding as {\sf QUAL}, i.e., Phred score + 33.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with spaces (`{\tt \textvisiblespace}') between the different strings from the same template.
+
+\item[RX:Z:\tagvalue{sequence+}]
+Sequence bases from the unique molecular identifier.
+These could be either corrected or uncorrected. Unlike {\tt MI}, the value may be non-unique in the file.
+Should consist of a sequence of bases.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.
+
+If the bases represent corrected bases, the original sequence can be stored in {\tt OX} (similar to {\tt OQ} storing the original qualities of bases).
+
+\item[QX:Z:\tagvalue{qualities+}]
+Phred quality of the unique molecular identifier sequence in the {\tt RX} tag.
+Same encoding as {\sf QUAL}, i.e., Phred score + 33.
+The qualities here may have been corrected (raw bases and qualities can be stored in {\tt OX} and {\tt BZ} respectively).
+The lengths of the {\tt QX} and the {\tt RX} tags must match.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings.
+
+\item[MI:Z:\tagvalue{str}]
+Molecular Identifier.
+A unique ID within the SAM file for the source molecule from which this read is derived.
+All reads with the same {\tt MI} tag represent the group of reads derived from the same source molecule.
+
+\item[OX:Z:\tagvalue{sequence+}]
+Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the {\tt BZ} tag.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.
+
+\item[BZ:Z:\tagvalue{qualities+}]
+Phred quality of the (uncorrected) unique molecular identifier sequence in the {\tt OX} tag.
+Same encoding as {\sf QUAL}, i.e., Phred score + 33.
+The {\tt OX} tag should match the {\tt BZ} tag in length.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings. \item[RT:Z:\tagvalue{sequence}] Deprecated alternative to {\tt BC} tag originally used at Sanger. @@ -345,4 +385,19 @@ \section{Locally-defined tags} \url{https://github.com/samtools/hts-specs/issues} and/or by sending email to \mailtourl{samtools-devel@lists.sourceforge.net}. +\begin{appendices} +\appendix +\section{SAM Tags History}\label{sec:history} + +This lists the date of each tagged SAM version along with changes that +have been made while that version was current. + +\subsection*{1.5: 23 May 2013 to current} +\begin{itemize} +\item Add UMI-related tags (RX, QX, OX, BZ, MI) and clarified usage of sample barcode tag BC. (August 2017) +\item SAMtags.txt (this file) created with tags from SAMv1 +\end{itemize} + +\end{appendices} + \end{document} diff --git a/SAMv1.pdf b/SAMv1.pdf index 9e7b0be6b..78d21e19a 100644 Binary files a/SAMv1.pdf and b/SAMv1.pdf differ diff --git a/SAMv1.tex b/SAMv1.tex index 4d18039c3..9ccd130ab 100644 --- a/SAMv1.tex +++ b/SAMv1.tex @@ -7,6 +7,7 @@ \usepackage{longtable} \usepackage{makecell} \usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref} +\usepackage[title]{appendix} \makeindex @@ -35,6 +36,10 @@ \section{The SAM Format Specification} information such as mapping position, and variable number of optional fields for flexible or aligner specific information. +This specification is for version 1.5 of the SAM and BAM formats. Each SAM and +BAM file may optionally specify the version being used via the +{\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}. + \subsection{An example}\label{sec:example} Suppose we have the following alignment with bases in lower cases clipped from the alignment. 
Read {\tt r001/1} and {\tt r001/2} @@ -194,14 +199,28 @@ \subsection{The header section} grouped by {\sf QNAME}), and {\tt reference} (alignments are grouped by {\sf RNAME}/{\sf POS}).\\\cline{1-3} \multicolumn{2}{|l}{\tt @SQ} & Reference sequence dictionary. The order of {\tt @SQ} lines defines the alignment sorting order.\\\cline{2-3} - & {\tt SN}* & Reference sequence name. Each {\tt @SQ} line must have a unique {\tt SN} tag. The value of this - field is used in the + & {\tt SN}* & Reference sequence name. +The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines +must be distinct. + The value of this field is used in the alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3} & {\tt LN}* & Reference sequence length. \emph{Range}: {\tt [1,2$^{31}$-1]}\\\cline{2-3} & {\tt AH} & Indicates that this sequence is an alternate locus.% \footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.} The value is the locus in the primary assembly for which this sequence is an alternative, in the format `\emph{chr}{\tt :}\emph{start}{\tt -}\emph{end}', `\emph{chr}' (if known), or `{\tt *}' (if unknown), where `\emph{chr}' is a sequence in the primary assembly. Must not be present on sequences in the primary assembly.\\\cline{2-3} + & {\tt AN} & Alternative reference sequence names. +A comma-separated list of alternative names that tools may use when referring +to this reference sequence.% +\footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}', +tools can ensure that a user's request for any of `MT', `chrMT', `M', +or~`chrM' succeeds and refers to the same sequence. 
+Note the restricted set of characters allowed in an alternative name.} +These alternative names are not used elsewhere within the SAM file; +in particular, they must not appear in alignment records' {\sf RNAME} +or~{\sf RNEXT} fields. +\emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*} +where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3} & {\tt AS} & Genome assembly identifier. \\\cline{2-3} & {\tt M5} & MD5 checksum of the sequence in the uppercase, excluding spaces but including pads (as `*'s).\\\cline{2-3} & {\tt SP} & Species.\\\cline{2-3} @@ -1057,4 +1076,75 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s \end{verbatim} } +\pagebreak + +\begin{appendices} +\appendix +\section{SAM Version History}\label{sec:history} + +This lists the date of each tagged SAM version along with changes that +have been made while that version was current. The key changes +that caused the version number to change are shown in bold. + +Note the auxiliary tags have now moved to their own +specification with its own version numbering.\footnote{ +\href{http://samtools.github.io/hts-specs/SAMtags.pdf}{http://samtools.github.io/hts-specs/SAMtags.pdf}} + +\subsection*{1.5: 23 May 2013 to current} + +\begin{itemize} +\item Add {\tt @SQ AH} header tag. (Mar 2017) +\item Auxiliary tags migrated to SAMtags document. (Sep 2016) +\item Z and H auxiliary tags are permitted to be zero length. (Jun 2016) +\item QNAME limited to 254 bytes (was 255). (Aug 2015) +\item Generalise 0x200 flag bit as filtered-out bit. (Aug 2015) +\item Add {\tt @HD GO} for group order. (Mar 2015) +\item Add {\tt ONT} to the {\tt @RG PL} and {\tt @RG PM} header tags. (Mar 2015) +\item Add meaning to reverse FLAG on unmapped reads. (Mar 2015) +\item Document the {\tt idxstats} .bai elements. (Nov 2014) +\item Addition of CSI index. (Sep 2014) +\item Add {\tt MC} auxiliary tag. (Dec 2013) +\item Add {\tt @PG DS} header field. 
(Dec 2013) +\item Document the BAM EOF byte values. (Dec 2013) +\item Glossary of alignment types. (May 2013) +\item Add {\tt SA:Z} tag; PNEXT/RNEXT points to next read, not + segment. (May 2013) +\item \textbf{Add SUPPLEMENTARY flag bit}. (May 2013) +\end{itemize} + +\subsection*{1.4: 21 April 2011 to May 2013} + +\begin{itemize} +\item Add guide to using sequence annotations ({\tt CT/PT tags}). (Mar 2012) +\item Increase max reference length from $2^{29}$ to $2^{31}$. (Sep + 2011) +\item Add {\tt CO} and {\tt RT} auxiliary tags. (Sep 2011) +\item Clarify {\tt @SQ M5} header tag generation. (Sep 2011) +\item Describe padded alignments and add {\tt CT/PT tags}. (Sep 2011) +\item Add {\tt BC} barcode auxiliary tag. (Sep 2011) +\item Change {\tt FZ} tag from type {\tt H} to type {\tt B,S}. (Aug 2011) +\item Add {\tt @RG FO}, {\tt KS} header fields. (Apr 2011) +\item Add {\tt FZ} auxiliary tag. (Apr 2011) +\item Clarify chaining of PG records. (Apr 2011) +\item \textbf{Add {\tt B} array auxiliary tag type.} (Apr 2011)\ +\item \textbf{Permit IUPAC in SEQ and {\tt MD} auxiliary tag.} (Apr 2011) +\item \textbf{Permit QNAME ``{\tt *}''.} (Apr 2011) +\end{itemize} + +\subsection*{1.3: July 2010 to April 2011} + +\begin{itemize} +\item Re-add {\tt CC} and {\tt CP} auxiliary tags. (Mar 2011) +\item Add CIGAR N intron/skip operator. (Dec 2010) +\item Add {\tt BQ} BAQ tag. (Nov 2010) +\item Add {\tt RG PG} header field. (Nov 2010) +\item Add BAM description and index sections. (Nov 2010) +\item \textbf{Removal of FLAG letters.} (July 2010) +\end{itemize} + +\subsection*{1.0: 2009 to July 2010} + +Initial edition. 
+ +\end{appendices} \end{document} diff --git a/VCFv4.1.pdf b/VCFv4.1.pdf index 747fe6c23..352e91392 100644 Binary files a/VCFv4.1.pdf and b/VCFv4.1.pdf differ diff --git a/VCFv4.1.tex b/VCFv4.1.tex index ab9cabbe6..ce486d6e1 100644 --- a/VCFv4.1.tex +++ b/VCFv4.1.tex @@ -1155,7 +1155,7 @@ \subsubsection{Type encoding} \vspace{0.3cm} -\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the value with all bits set (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value. +\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values. \vspace{0.3cm} \textbf{Floats} are encoded as single-precision (32 bit) in the basic format defined by the IEEE-754-1985 standard. This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example). BCF2 supports the full range of values from -Infinity to +Infinity, including NaN. BCF2 needs to represent missing values for single precision floating point numbers. This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN. 
From the NaN wikipedia entry, we have: diff --git a/VCFv4.2.pdf b/VCFv4.2.pdf index 30ab04b87..fddd8cc83 100644 Binary files a/VCFv4.2.pdf and b/VCFv4.2.pdf differ diff --git a/VCFv4.2.tex b/VCFv4.2.tex index 2409501b4..6360bd092 100644 --- a/VCFv4.2.tex +++ b/VCFv4.2.tex @@ -1172,7 +1172,7 @@ \subsubsection{Type encoding} \vspace{0.3cm} -\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the value with all bits set (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value. +\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values. \vspace{0.3cm} \textbf{Floats} are encoded as single-precision (32 bit) in the basic format defined by the IEEE-754-1985 standard. This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example). BCF2 supports the full range of values from -Infinity to +Infinity, including NaN. BCF2 needs to represent missing values for single precision floating point numbers. This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN. 
From the NaN wikipedia entry, we have: diff --git a/VCFv4.3.pdf b/VCFv4.3.pdf index e6034623e..029c98948 100644 Binary files a/VCFv4.3.pdf and b/VCFv4.3.pdf differ diff --git a/VCFv4.3.tex b/VCFv4.3.tex index 3e3d83c37..9d75b37ff 100644 --- a/VCFv4.3.tex +++ b/VCFv4.3.tex @@ -160,9 +160,9 @@ \subsubsection{Information field format} certain special characters used to define special cases: \begin{itemize} - \item A: The field has one value per alternate allele. - \item R: The field has one value for each possible allele, including the reference. - \item G: The field has one value for each possible genotype (more relevant to the FORMAT tags). + \item A: The field has one value per alternate allele. The values must be in the same order as listed in the ALT column (described in section \ref{data-lines}). + \item R: The field has one value for each possible allele, including the reference. The order of the values must be the reference allele first, then the alternate alleles as listed in the ALT column. + \item G: The field has one value for each possible genotype. The values must be in the same order as prescribed in section \ref{genotype-fields:genotype-ordering} (see \textsc{Genotype Ordering}). \item . (dot): The number of possible values varies, is unknown or unbounded. \end{itemize} @@ -292,6 +292,7 @@ \subsection{Header line syntax} and there must be no tab characters at the end of the line. \subsection{Data lines} +\label{data-lines} All data lines are tab-delimited with no tab character at the end of the line. The last data line must end with a line separator. In all cases, missing values are specified with a dot (`.'). @@ -324,7 +325,7 @@ \subsubsection{Fixed fields} (thus R as a reference base is converted to A in VCF.) - \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. 
Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a an overlapping deletion. If there are no alternative alleles, then the missing value must be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
+ \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or a missing value `.' (no variant) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to an overlapping deletion. If there are no alternative alleles, then the missing value must be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
 \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). If unknown, the missing value must be specified. (Float)
 \item FILTER - filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. 
``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and must not be used as a filter String. If filters have not been applied, then this field must be set to the missing value. (String, no white-space or semi-colons permitted, duplicate values not allowed.) \item INFO - additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only as delimiters for lists of @@ -413,7 +414,8 @@ \subsubsection{Genotype fields} \item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed. - \textsc{Genotype Ordering.} In general case of ploidy P and N alternate alleles (0 is the REF and $1\ldots N$ + \textsc{Genotype Ordering.} \label{genotype-fields:genotype-ordering} + In general case of ploidy P and N alternate alleles (0 is the REF and $1\ldots N$ the alternate alleles), the ordering of genotypes for the likelihoods can be expressed by the following pseudocode with as many nested loops as ploidy:\footnote{Note that we use inclusive \texttt{for} loop boundaries.} \begingroup diff --git a/htsget.md b/htsget.md index 0b042486c..d78514fe3 100644 --- a/htsget.md +++ b/htsget.md @@ -25,12 +25,22 @@ Explicitly this API does NOT: # Protocol essentials -All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output. Successful requests result with HTTP status code 200 and have UTF8-encoded JSON in the response body, with the content-type `application/json`. The server may provide responses with chunked transfer encoding. The client and server may mutually negotiate HTTP/2 upgrade using the standard mechanism. 
+All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output. Successful requests result with HTTP status code 200 and have UTF8-encoded JSON in the response body. The server may provide responses with chunked transfer encoding. The client and server may mutually negotiate HTTP/2 upgrade using the standard mechanism.
+
+The JSON response is an object with the single key `htsget` as described in the [Response JSON fields](#response-json-fields) and [Error Response JSON fields](#error-response-json-fields) sections. This ensures that, apart from whitespace differences, the message always starts with the same prefix. The presence of this prefix can be used as part of a client's response validation.

 Any timestamps that appear in the response from an API method are given as [ISO 8601] date/time format.

 HTTP responses may be compressed using [RFC 2616] `transfer-coding`, not `content-coding`.

+Requests adhering to this specification MAY include an `Accept` header specifying the htsget protocol version they are using:
+
+    Accept: application/vnd.ga4gh.htsget.v0.2rc+json
+
+JSON responses SHOULD include a `Content-Type` header describing the htsget protocol version defining the JSON schema used in the response, e.g.,
+
+    Content-Type: application/vnd.ga4gh.htsget.v0.2rc+json; charset=utf-8
+
 ## Authentication

 Requests to the retrieval API endpoint may be authenticated by means of an OAuth2 bearer token included in the request headers, as detailed in [RFC 6750]. Briefly, the client supplies the header `Authorization: Bearer xxxx` with each HTTPS request, where `xxxx` is a private token. The mechanisms by which clients originally obtain their authentication tokens, and by which servers verify them, are currently beyond the scope of this specification. Servers may honor non-authenticated requests at their discretion.
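
The header negotiation and the `htsget` prefix check described in this hunk could look roughly like the following on the client side. This is a minimal Python sketch using only the standard library, not part of the specification: the helper names `request_headers` and `parse_htsget_body` are hypothetical, and the strict "exactly one key" check is one possible reading of "an object with the single key `htsget`".

```python
import json

# Media type taken from the Accept/Content-Type examples in the text above.
HTSGET_MEDIA_TYPE = "application/vnd.ga4gh.htsget.v0.2rc+json"


def request_headers(token=None):
    """Build request headers (hypothetical helper, not defined by the spec).

    The Accept header is optional (MAY) in the text; the bearer token is
    only added when the caller supplies one.
    """
    headers = {"Accept": HTSGET_MEDIA_TYPE}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def parse_htsget_body(body: bytes) -> dict:
    """Decode a UTF-8 JSON body and return the object stored under 'htsget'.

    Raises ValueError if the body is not valid JSON or is not an object
    whose only key is 'htsget' (assumption: strict single-key check).
    """
    doc = json.loads(body.decode("utf-8"))
    if not isinstance(doc, dict) or set(doc) != {"htsget"}:
        raise ValueError("response is not an object with the single key 'htsget'")
    return doc["htsget"]
```

A client would pass `request_headers(token)` to whatever HTTP library it uses and hand the raw response bytes to `parse_htsget_body`; the same check applies to error responses, which this change also wraps in `htsget`.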
@@ -43,6 +53,12 @@ For errors that are specific to the `htsget` protocol, the response body SHOULD ### Error Response JSON fields + + +
+`htsget` +_object_ + +Container for response object.
`error` @@ -57,6 +73,8 @@ _string_ A message specific to the error providing information on how to debug the problem. Clients MAY display this message to the user.
+
 The following errors types are defined:
@@ -72,8 +90,10 @@ InvalidRange | 400 | The requested range cannot be satisfied
 The error type SHOULD be chosen from this table and be accompanied by the specified HTTP status code. An example of a valid JSON error response is:
 ```json
 {
-   "error": "NotFound",
-   "message": "No such accession 'ENS16232164'"
+   "htsget" : {
+      "error": "NotFound",
+      "message": "No such accession 'ENS16232164'"
+   }
 }
 ```
@@ -198,6 +218,12 @@ Example: `fields=QNAME,FLAG,POS`.
 ## Response JSON fields
+
+
+
+`htsget` +_object_ + +Container for response object.
`format` @@ -240,6 +266,39 @@ _optional hex string_ MD5 digest of the blob resulting from concatenating all of the "payload" data --- the url data blocks.
+
+
+An example of a JSON response is:
+```json
+{
+   "htsget" : {
+      "format" : "BAM",
+      "urls" : [
+         {
+            "url" : "data:application/vnd.ga4gh.bam;base64,QkFNAQ=="
+         },
+         {
+            "url" : "https://htsget.blocksrv.example/sample1234/header"
+         },
+         {
+            "url" : "https://htsget.blocksrv.example/sample1234/run1.bam",
+            "headers" : {
+               "Authorization" : "Bearer xxxx",
+               "Range" : "bytes=65536-1003750"
+            }
+         },
+         {
+            "url" : "https://htsget.blocksrv.example/sample1234/run1.bam",
+            "headers" : {
+               "Authorization" : "Bearer xxxx",
+               "Range" : "bytes=2744831-9375732"
+            }
+         }
+      ]
+   }
+}
+```

 ## Response data blocks
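
The example response above mixes an inline `data:` URI with HTTPS URLs that carry per-block headers. The following Python sketch shows one way a client might turn such a `urls` list into a single payload by concatenating the blocks in order. It is an illustration under assumptions, not the specification's algorithm: `decode_data_uri` and `assemble_blocks` are hypothetical names, and `fetch` is a caller-supplied function (hypothetical signature `fetch(url, headers) -> bytes`) standing in for a real HTTP client.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri: str) -> bytes:
    """Return the payload bytes of a data: URI (base64 or percent-encoded)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:"):
        raise ValueError("not a data: URI")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote_to_bytes(payload)


def assemble_blocks(htsget: dict, fetch) -> bytes:
    """Concatenate all data blocks listed under 'urls', in order.

    `htsget` is the object stored under the top-level 'htsget' key.
    `fetch` is a hypothetical callable (url, headers) -> bytes that performs
    the actual HTTP request, forwarding headers such as Range and
    Authorization exactly as given in the response.
    """
    parts = []
    for block in htsget["urls"]:
        url = block["url"]
        if url.startswith("data:"):
            # Inline block: no network access needed.
            parts.append(decode_data_uri(url))
        else:
            parts.append(fetch(url, block.get("headers", {})))
    return b"".join(parts)
```

Keeping the network call behind `fetch` leaves the ordering and decoding logic testable offline; a production client would more likely stream the blocks to disk than hold them in memory.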