New tutorial for pepxml files

chhh · chhh · commit 672103aaa4de · 2017-07-11T14:02:12.000-04:00
diff --git a/website/content/tutorial/custom-drawing.md b/website/content/tutorial/custom-drawing.md
@@ -1,5 +1,5 @@
 ---
-weight: 55
+weight: 54
 title: Display custom data on 2D map
 summary: "How to display your custom data on Map2D without any coding. You'll need to provide a simple file format."
 menu:
diff --git a/website/content/tutorial/data-access-layer.md b/website/content/tutorial/data-access-layer.md
@@ -1,6 +1,6 @@
 ---
-weight: 53
-title: Using data access library
+weight: 51
+title: Data access library (LC/MS files and simple Peptide ID examples)
 summary: "The data access library provides a relatively rich API to mzML/mzXML files (MS level, polarity, precursor isolation window, instrument data, etc.) and a few other file formats common to the proteomics field, such as PepXML, ProtXML and MzIdentML. In this tutorial will step through parsing some data, using the library as a jar in a simple console window application."
 menu:
   main:
@@ -232,60 +232,3 @@ for (MsmsRunSummary msmsRunSummary : msmsRunSummaries) {
     System.out.printf("Done with MS/MS run summary: %s\n", msmsRunSummary.getBaseName());
 }
 ```
-
-## Parsing huge identification files more efficiently
-Sometimes you might have PepXML files that are many gigabytes in size, this happens when you combine search results from multiple experiments and store them in a single output file. In that case, using `XMLStreamReader` class it is possible to first rewind the input stream to some large structural element of the underlying file, such as `<msms_run_summary>` in PepXML files.  
-You will need to have an idea of how the files are organized for this to work in general though, explore the corresponding XML schemas. The schemas can also be found in the sources of the library in file-specific sub-packages of `umich.ms.fileio.filetypes` in `resources` directories.
-
-```java
-try {
-  // we'll manually iterate over msmsRunSummaries - won't need so much memory
-  // at once for processing large files.
-  JAXBContext ctx = JAXBContext.newInstance(MsmsRunSummary.class);
-  Unmarshaller unmarshaller = ctx.createUnmarshaller();
-
-  XMLInputFactory xif = XMLInputFactory.newFactory();
-  StreamSource ss = new StreamSource(is);
-  XMLStreamReader xsr = xif.createXMLStreamReader(ss);
-
-
-  while (advanceReaderToNextRunSummary(xsr)) {
-    // we've advanced to the next MsmsRunSummary in the file
-    long timeLo = System.nanoTime();
-    JAXBElement<MsmsRunSummary> unmarshalled = unmarshaller
-                                          .unmarshal(xsr, MsmsRunSummary.class);
-    long timeHi = System.nanoTime();
-    System.out.printf("Unmarshalling took %.4fms (%.2fs)\n",
-                      (timeHi-timeLo)/1e6, (timeHi-timeLo)/1e9);
-    MsmsRunSummary runSummary = unmarshalled.getValue();
-    if (runSummary.getSpectrumQuery().isEmpty()) {
-      String msg = String.format("Parsed msms_run_summary was empty for file " +
-          "'%s', summary base_name '%'", uri.toString(), runSummary.getBaseName());
-      System.out.println(msg);
-    }
-  }
-} catch (JAXBException | XMLStreamException e) {
-  // do something with the exception
-}
-
-```
-and here is the meat of it, the code to rewind the `XMLStreamReader` - `advanceReaderToNextRunSummary(XMLStreamReader)`.
-In this case the example assumes we try to parse multiple msms_run_summary tags one by one from the file.
-```java
-
-private static final String TAG_RUN_SUMMARY = "msms_run_summary";
-
-private static boolean advanceReaderToNextRunSummary(XMLStreamReader xsr)
-    throws XMLStreamException {
-  long timeLo = System.nanoTime();
-  do {
-      if (xsr.next() == XMLStreamConstants.END_DOCUMENT)
-          return false;
-  } while (!(xsr.isStartElement() && xsr.getLocalName().equals(TAG_RUN_SUMMARY)));
-
-  long timeHi = System.nanoTime();
-  System.out.printf("Advancing reader took: %.4fms\n", (timeHi-timeLo)/1e6d);
-
-  return true;
-}
-```
diff --git a/website/content/tutorial/developing-first-plugin.md b/website/content/tutorial/developing-first-plugin.md
@@ -1,5 +1,5 @@
 ---
-weight: 52
+weight: 56
 title: Developing the first plugin
 summary: "We will step through developing one complete plugin, which will add support for a new type of files holding LC/MS feature information, which will be viewable as a table and can be overlaid on top of Map 2D view."
 menu:
diff --git a/website/content/tutorial/overlay-peptide-ids-on-map2d.md b/website/content/tutorial/overlay-peptide-ids-on-map2d.md
@@ -1,5 +1,5 @@
 ---
-weight: 54
+weight: 53
 title: Overlay peptide IDs on 2D map
 summary: "Overlaying contents of pepxml files on a 2D map."
 menu:
diff --git a/website/content/tutorial/parsing-pep-ids.md b/website/content/tutorial/parsing-pep-ids.md
@@ -0,0 +1,116 @@
+---
+weight: 52
+title: Parsing pep.xml files
+summary: "The data access library provides parsers for file formats common to the proteomics field, such as PepXML, ProtXML and MzIdentML. In this tutorial I'll show you how to parse a PepXML file."
+menu:
+  main:
+    parent: Tutorials
+    identifier: "Parsing pep xml files"
+---
+
+All the classes responsible for parsing files live in the `umich.ms.fileio.filetypes` package, each in its own sub-package, e.g. `umich.ms.fileio.filetypes.pepxml` for PepXML files. Most of those sub-packages contain a separate `example` package with working examples.
+
+## Parsing identification files (PepXML, ProtXML, MzIdentML)
+The library gives low-level access to file formats that store peptide identifications.
+There is no unifying API here, as the formats are very different. These parsers are not hand-optimized for efficiency, so they might consume quite a bit more memory than they should, but they are also error-resilient.
+
+Working with these files is as simple as making a single call to the `parse(Path)` method
+of the corresponding parser. You get a single data structure that follows the respective
+XML schema for the format. Here's a quick PepXML example:
+
+```java
+Path path = Paths.get("some-path-to.pep.xml");
+// a single call to parse the whole file
+MsmsPipelineAnalysis analysis = PepXmlParser.parse(path);
+```
+
+And that's it. The whole file is parsed and stored in memory. Let's explore
+the contents of the file:
+
+```java
+// iterate over the parsed search results
+List<MsmsRunSummary> runSummaries = analysis.getMsmsRunSummary();
+for (MsmsRunSummary runSummary : runSummaries) {
+    List<SpectrumQuery> spectrumQueries = runSummary.getSpectrumQuery();
+    System.out.printf("Spectrum queries from MS/MS run summary: %s\n",
+                      runSummary.getBaseName());
+    for (SpectrumQuery sq : spectrumQueries) {
+        System.out.printf("Spec ID: [%s], RT: [%.2f], precursor neutral mass: [%.3f]\n",
+                          sq.getSpectrum(), sq.getRetentionTimeSec(), sq.getPrecursorNeutralMass());
+    }
+    System.out.printf("Done with MS/MS run summary: %s\n", runSummary.getBaseName());
+}
+```
+
+## Parsing huge identification files more efficiently
+Sometimes you might have PepXML files that are many gigabytes in size. This happens when you combine search results from multiple experiments and store them in a single output file. In that case, you can use an `XMLStreamReader` to first advance the input stream to some large structural element of the underlying file, such as `<msms_run_summary>` in PepXML files, and then unmarshal just that element.  
+For this to work in general, you need to have an idea of how the files are organized; explore the corresponding XML schemas for insights. The schemas can also be found in the sources of the library, in the `resources` directories of the file-specific sub-packages of `umich.ms.fileio.filetypes`.
+
+```java
+String file = "/path/to/some.pep.xml";
+Path path = Paths.get(file);
+
+try (FileInputStream fis = new FileInputStream(file)) {
+    // we'll manually iterate over msmsRunSummaries - won't need so much memory
+    // at once for processing large files.
+    JAXBContext ctx = JAXBContext.newInstance(MsmsRunSummary.class);
+    Unmarshaller unmarshaller = ctx.createUnmarshaller();
+
+    XMLInputFactory xif = XMLInputFactory.newFactory();
+
+    StreamSource ss = new StreamSource(fis);
+    XMLStreamReader xsr = xif.createXMLStreamReader(ss);
+
+
+    while (advanceReaderToNextRunSummary(xsr)) {
+        // we've advanced to the next MsmsRunSummary in the file
+        long timeLo = System.nanoTime();
+        JAXBElement<MsmsRunSummary> unmarshalled = unmarshaller
+                .unmarshal(xsr, MsmsRunSummary.class);
+        long timeHi = System.nanoTime();
+        System.out.printf("Unmarshalling took %.4fms (%.2fs)\n",
+                          (timeHi-timeLo)/1e6, (timeHi-timeLo)/1e9);
+        MsmsRunSummary runSummary = unmarshalled.getValue();
+        if (runSummary.getSpectrumQuery().isEmpty()) {
+            String msg = String.format("Parsed msms_run_summary was empty for " +
+                        "'%s', summary base_name '%s'",
+                        path.toUri().toString(), runSummary.getBaseName());
+            System.out.println(msg);
+        }
+    }
+}
+```
+
+The secret ingredient here is the `advanceReaderToNextRunSummary(XMLStreamReader)` method, which advances the `XMLStreamReader` to the next element of interest.
+In this case the example assumes we want to parse multiple `<msms_run_summary>` elements from the file one by one.
+```java
+private static boolean advanceReaderToNextRunSummary(XMLStreamReader xsr)
+        throws XMLStreamException {
+    do {
+        if (xsr.next() == XMLStreamConstants.END_DOCUMENT)
+            return false;
+    } while (!(xsr.isStartElement() && xsr.getLocalName().equals("msms_run_summary")));
+    // the reader is now positioned at an <msms_run_summary> start tag
+    return true;
+}
+```
+
+And here are all the import statements for the last example:
+```java
+import umich.ms.fileio.filetypes.pepxml.PepXmlParser;
+import umich.ms.fileio.filetypes.pepxml.jaxb.standard.MsmsPipelineAnalysis;
+import umich.ms.fileio.filetypes.pepxml.jaxb.standard.MsmsRunSummary;
+
+import javax.xml.bind.JAXBContext;
+import javax.xml.bind.JAXBElement;
+import javax.xml.bind.JAXBException;
+import javax.xml.bind.Unmarshaller;
+import javax.xml.stream.XMLInputFactory;
+import javax.xml.stream.XMLStreamConstants;
+import javax.xml.stream.XMLStreamException;
+import javax.xml.stream.XMLStreamReader;
+import javax.xml.transform.stream.StreamSource;
+import java.io.FileInputStream;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+```
diff --git a/website/content/tutorial/setting-up-development-environment.md b/website/content/tutorial/setting-up-development-environment.md
@@ -1,5 +1,5 @@
 ---
-weight: 51
+weight: 55
 title: Setting up development environment
 summary: "This guide will quickly step you through setting up the environment for developing new functionality for BatMass. All the downloads, setting up the IDE and up to building BatMass from scratch."
 menu: