
\section*{Background}
Describing regions of biological sequences is a vital part of genome and protein sequence
annotation, and of related areas such as describing modifications, for example methylation
of nucleotide sequences or glycosylation of proteins.
There are multiple conventions for storing this kind of information in
plain text flat file formats such as GTF, GFF3, GenBank and EMBL,
and in more structured domain-specific XML formats such as the INSDC or UniProt XML.
None of these formats is flexible enough to describe all of genetics or proteomics.
Furthermore, fundamental details are inconsistent: for example, both zero-based and
one-based counting standards exist, a regular source of off-by-one programming
errors which experienced bioinformaticians learn to look out for.
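
As a minimal illustration of the two conventions (the symbols $s$, $e$, $s_0$ and $e_0$
are introduced here only for this example), consider a region whose first and last
residues are numbered $s$ and $e$ under one-based, inclusive counting.
Assuming the zero-based convention is half-open, as in formats such as BED,
the same region would be written as
\[
  [s_0, e_0) = [s - 1,\; e), \qquad e_0 - s_0 = e - s + 1 ,
\]
so the length is unchanged but the start coordinate shifts by one.
For instance, the one-based region $[11, 20]$ corresponds to the zero-based region
$[10, 20)$, and both cover ten residues.
Omitting the shift, or applying it to the end coordinate as well, produces
the kind of off-by-one error mentioned above.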

Although non-trivial, file format interconversion is a common background task
in current script-centric bioinformatics pipelines, often essential for combining
tools supporting different formats or format variants.
As a result of this common need, file format parsing is a particular strength of
community developed open source bioinformatics libraries like BioPerl
\cite{BioPerl2002}, Biopython \cite{Biopython2009}, BioRuby \cite{BioRuby2010}
and BioJava \cite{BioJava2012}. While using such shared libraries can reduce the
programmer time spent dealing with different file formats, adopting semantic
web technologies has even greater potential to simplify data integration tasks.

As part of the Integrated Database Project to integrate life science databases in
Japan, the National Bioscience Database Center (NBDC) and the Database
Center for Life Science (DBCLS) have hosted an annual ``BioHackathon'' series
of meetings bringing together biological database teams, open source programmers,
and domain experts in Semantic Web and Linked Data \cite{BioHack2010,BioHack2011and2012}.
At these meetings it was recognised that failure to standardise how to describe positions
and regions on biological sequences would be an obstacle to the adoption of federated
SPARQL/RDF queries, which have the potential to enable cross-database queries and
analyses. Discussion and prototyping with representatives from major sequence databases
such as UniProt, DDBJ (part of the INSDC partnership with the NCBI and EMBL-Bank),
and major glycomics databases \textit{(TODO - which ones? introduce later? e.g.
 Bacterial Carbohydrate Structure Database (BCSDB), GlycomeDB,
 GLYCOSCIENCES.de,
 Japan Consortium for Glycobiology and Glycotechnology Database (JCGGDB),
 MonosaccharideDB,
 Resource for INformatics of Glycomes at Soka (RINGS),
 and UniCarbKB)}
\textit{(TODO - and PDBj too?)} and assorted open source developers during these meetings
led to the development of the Feature Annotation Location Description Ontology (FALDO).

FALDO has a deliberately narrow scope that does not address general annotation
issues about the meaning of, or evidence for, a location; rather, FALDO is intended to be
used in combination with other relevant ontologies such as the Sequence Ontology
(SO) \cite{SequenceOntology2005} or a local database specific ontology.

The ontology has been designed to be general enough to describe the annotation
of proteins and nucleotides using the various levels of location complexity used
in major databases such as INSDC (DDBJ/ENA/GenBank) and UniProt, their
associated file formats and other generic annotation file formats such as GTF
and GFF3. This includes compound locations which are the combination of
several regions (such as the `join' location string in INSDC), and ambiguous
positions.
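
To make the notion of a compound location concrete, consider the INSDC-style location
string \texttt{join(12..78,134..202)}; the coordinates are purely illustrative and do
not refer to a specific database record.
Under the one-based, inclusive counting used by INSDC, it denotes the union of two regions,
\[
  [12, 78] \cup [134, 202], \qquad (78 - 12 + 1) + (202 - 134 + 1) = 67 + 69 = 136 ,
\]
that is, a single feature covering 136 residues while skipping positions 79 to 133.
A description of such a location therefore has to record each constituent region,
not just the outermost start and end.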

The proposed standard allows us to accurately describe the position of a feature on multiple sequences.
This is expected to be most useful when lifting annotation from one draft assembly version to another.
For example, a gene can start at position $X$ on genome assembly $A$,
while conceptually the same gene starts at position $X'$ on genome assembly $A+1$.