
Commit 672103a

committed
New tutorial for pepxml files
1 parent 60136d7 commit 672103a

6 files changed

+122
-63
lines changed

website/content/tutorial/custom-drawing.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
---
2-
weight: 55
2+
weight: 54
33
title: Display custom data on 2D map
44
summary: "How to display your custom data on Map2D without any coding. You'll need to provide a simple file format."
55
menu:

website/content/tutorial/data-access-layer.md

+2-59
@@ -1,6 +1,6 @@
11
---
2-
weight: 53
3-
title: Using data access library
2+
weight: 51
3+
title: Data access library (LC/MS files and simple Peptide ID examples)
44
summary: "The data access library provides a relatively rich API to mzML/mzXML files (MS level, polarity, precursor isolation window, instrument data, etc.) and a few other file formats common to the proteomics field, such as PepXML, ProtXML and MzIdentML. In this tutorial we will step through parsing some data, using the library as a jar in a simple console window application."
55
menu:
66
main:
@@ -232,60 +232,3 @@ for (MsmsRunSummary msmsRunSummary : msmsRunSummaries) {
232232
System.out.printf("Done with MS/MS run summary: %s\n", msmsRunSummary.getBaseName());
233233
}
234234
```
235-
236-
## Parsing huge identification files more efficiently
237-
Sometimes you might have PepXML files that are many gigabytes in size, this happens when you combine search results from multiple experiments and store them in a single output file. In that case, using `XMLStreamReader` class it is possible to first rewind the input stream to some large structural element of the underlying file, such as `<msms_run_summary>` in PepXML files.
238-
You will need to have an idea of how the files are organized for this to work in general though, explore the corresponding XML schemas. The schemas can also be found in the sources of the library in file-specific sub-packages of `umich.ms.fileio.filetypes` in `resources` directories.
239-
240-
```java
241-
try {
242-
// we'll manually iterate over msmsRunSummaries - won't need so much memory
243-
// at once for processing large files.
244-
JAXBContext ctx = JAXBContext.newInstance(MsmsRunSummary.class);
245-
Unmarshaller unmarshaller = ctx.createUnmarshaller();
246-
247-
XMLInputFactory xif = XMLInputFactory.newFactory();
248-
StreamSource ss = new StreamSource(is);
249-
XMLStreamReader xsr = xif.createXMLStreamReader(ss);
250-
251-
252-
while (advanceReaderToNextRunSummary(xsr)) {
253-
// we've advanced to the next MsmsRunSummary in the file
254-
long timeLo = System.nanoTime();
255-
JAXBElement<MsmsRunSummary> unmarshalled = unmarshaller
256-
.unmarshal(xsr, MsmsRunSummary.class);
257-
long timeHi = System.nanoTime();
258-
System.out.printf("Unmarshalling took %.4fms (%.2fs)\n",
259-
(timeHi-timeLo)/1e6, (timeHi-timeLo)/1e9);
260-
MsmsRunSummary runSummary = unmarshalled.getValue();
261-
if (runSummary.getSpectrumQuery().isEmpty()) {
262-
String msg = String.format("Parsed msms_run_summary was empty for file " +
263-
"'%s', summary base_name '%'", uri.toString(), runSummary.getBaseName());
264-
System.out.println(msg);
265-
}
266-
}
267-
} catch (JAXBException | XMLStreamException e) {
268-
// do something with the exception
269-
}
270-
271-
```
272-
and here is the meat of it, the code to rewind the `XMLStreamReader` - `advanceReaderToNextRunSummary(XMLStreamReader)`.
273-
In this case the example assumes we try to parse multiple msms_run_summary tags one by one from the file.
274-
```java
275-
276-
private static final String TAG_RUN_SUMMARY = "msms_run_summary";
277-
278-
private static boolean advanceReaderToNextRunSummary(XMLStreamReader xsr)
279-
throws XMLStreamException {
280-
long timeLo = System.nanoTime();
281-
do {
282-
if (xsr.next() == XMLStreamConstants.END_DOCUMENT)
283-
return false;
284-
} while (!(xsr.isStartElement() && xsr.getLocalName().equals(TAG_RUN_SUMMARY)));
285-
286-
long timeHi = System.nanoTime();
287-
System.out.printf("Advancing reader took: %.4fms\n", (timeHi-timeLo)/1e6d);
288-
289-
return true;
290-
}
291-
```

website/content/tutorial/developing-first-plugin.md

+1-1
@@ -1,5 +1,5 @@
11
---
2-
weight: 52
2+
weight: 56
33
title: Developing the first plugin
44
summary: "We will step through developing one complete plugin, which will add support for a new type of files holding LC/MS feature information, which will be viewable as a table and can be overlaid on top of Map 2D view."
55
menu:

website/content/tutorial/overlay-peptide-ids-on-map2d.md

+1-1
@@ -1,5 +1,5 @@
11
---
2-
weight: 54
2+
weight: 53
33
title: Overlay peptide IDs on 2D map
44
summary: "Overlaying contents of pepxml files on a 2D map."
55
menu:
+116
@@ -0,0 +1,116 @@
1+
---
2+
weight: 52
3+
title: Parsing pep.xml files
4+
summary: "The data access library provides parsers for file formats common to the proteomics field, such as PepXML, ProtXML and MzIdentML. In this tutorial I'll show you how to parse a PepXML file."
5+
menu:
6+
main:
7+
parent: Tutorials
8+
identifier: "Parsing pep xml files"
9+
---
10+
11+
All the classes responsible for parsing files live in the `umich.ms.fileio.filetypes` package, each in its own sub-package, e.g. `umich.ms.fileio.filetypes.pepxml` for PepXML files. Most of those sub-packages contain a separate `example` package with working examples.
12+
13+
## Parsing identification files (PepXML, ProtXML, MzIdentML)
14+
The library gives low-level access to file formats that store peptide identifications.
15+
There is no unifying API here, as the formats are very different. These parsers are not hand-optimized for efficiency, so they might consume quite a bit more memory than they should, but they are also error-resilient.
16+
17+
Working with these files is as simple as making a single call to the `parse(Path)` method
18+
of the corresponding parser. You get a single data structure that follows the respective
19+
XML schemas for the format. Here's a quick PepXML example:
20+
21+
```java
22+
Path path = Paths.get("some-path-to.pep.xml");
23+
// a single call to parse the whole file
24+
MsmsPipelineAnalysis analysis = PepXmlParser.parse(path);
25+
```
26+
27+
And that's it. The whole file is parsed and stored in memory. Let's explore
28+
the contents of the file:
29+
30+
```java
31+
// iterate over the parsed search results
32+
List<MsmsRunSummary> runSummaries = analysis.getMsmsRunSummary();
33+
for (MsmsRunSummary runSummary : runSummaries) {
34+
List<SpectrumQuery> spectrumQueries = runSummary.getSpectrumQuery();
35+
System.out.printf("Spectrum queries from MS/MS run summary: %s\n",
36+
runSummary.getBaseName());
37+
for (SpectrumQuery sq : spectrumQueries) {
38+
System.out.printf("Spec ID: [%s], RT: [%.2f], precursor neutral mass: [%.3f]\n",
39+
sq.getSpectrum(), sq.getRetentionTimeSec(), sq.getPrecursorNeutralMass());
40+
}
41+
System.out.printf("Done with MS/MS run summary: %s\n", runSummary.getBaseName());
42+
}
43+
```
44+
45+
## Parsing huge identification files more efficiently
46+
Sometimes you might have PepXML files that are many gigabytes in size. This happens when you combine search results from multiple experiments and store them in a single output file. In that case, you can use the `XMLStreamReader` class to first advance the input stream to some large structural element of the underlying file, such as `<msms_run_summary>` in PepXML files, and then unmarshal one such element at a time.
47+
For this to work in general, you need to know how the files are organized; the corresponding XML schemas are a good place to look. The schemas can also be found in the library sources, in the `resources` directories of the file-specific sub-packages of `umich.ms.fileio.filetypes`.
48+
49+
```java
50+
String file = "/path/to/some.pep.xml";
51+
Path path = Paths.get(file);
52+
53+
try (FileInputStream fis = new FileInputStream(file)) {
54+
// we'll manually iterate over msmsRunSummaries - won't need so much memory
55+
// at once for processing large files.
56+
JAXBContext ctx = JAXBContext.newInstance(MsmsRunSummary.class);
57+
Unmarshaller unmarshaller = ctx.createUnmarshaller();
58+
59+
XMLInputFactory xif = XMLInputFactory.newFactory();
60+
61+
StreamSource ss = new StreamSource(fis);
62+
XMLStreamReader xsr = xif.createXMLStreamReader(ss);
63+
64+
65+
while (advanceReaderToNextRunSummary(xsr)) {
66+
// we've advanced to the next MsmsRunSummary in the file
67+
long timeLo = System.nanoTime();
68+
JAXBElement<MsmsRunSummary> unmarshalled = unmarshaller
69+
.unmarshal(xsr, MsmsRunSummary.class);
70+
long timeHi = System.nanoTime();
71+
System.out.printf("Unmarshalling took %.4fms (%.2fs)\n",
72+
(timeHi-timeLo)/1e6, (timeHi-timeLo)/1e9);
73+
MsmsRunSummary runSummary = unmarshalled.getValue();
74+
if (runSummary.getSpectrumQuery().isEmpty()) {
75+
String msg = String.format("Parsed msms_run_summary was empty for " +
76+
"'%s', summary base_name '%s'",
77+
path.toUri().toString(), runSummary.getBaseName());
78+
System.out.println(msg);
79+
}
80+
}
81+
}
82+
```
83+
84+
The secret ingredient here is the code that advances the `XMLStreamReader` to the next element of interest, the `advanceReaderToNextRunSummary(XMLStreamReader)` method.
85+
This example assumes we want to parse multiple `<msms_run_summary>` elements from the file one by one.
86+
```java
87+
private static boolean advanceReaderToNextRunSummary(XMLStreamReader xsr)
88+
throws XMLStreamException {
89+
do {
90+
if (xsr.next() == XMLStreamConstants.END_DOCUMENT)
91+
return false;
92+
} while (!(xsr.isStartElement() && xsr.getLocalName().equals("msms_run_summary")));
93+
94+
return true;
95+
}
96+
```
97+
98+
And here are all the import statements for the last example:
99+
```java
100+
import umich.ms.fileio.filetypes.pepxml.PepXmlParser;
101+
import umich.ms.fileio.filetypes.pepxml.jaxb.standard.MsmsPipelineAnalysis;
102+
import umich.ms.fileio.filetypes.pepxml.jaxb.standard.MsmsRunSummary;
103+
104+
import javax.xml.bind.JAXBContext;
105+
import javax.xml.bind.JAXBElement;
106+
import javax.xml.bind.JAXBException;
107+
import javax.xml.bind.Unmarshaller;
108+
import javax.xml.stream.XMLInputFactory;
109+
import javax.xml.stream.XMLStreamConstants;
110+
import javax.xml.stream.XMLStreamException;
111+
import javax.xml.stream.XMLStreamReader;
112+
import javax.xml.transform.stream.StreamSource;
113+
import java.io.FileInputStream;
114+
import java.nio.file.Path;
115+
import java.nio.file.Paths;
116+
```
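
To try the advancing loop on its own, without the library or JAXB on the classpath, here is a minimal sketch that runs it over a small in-memory XML string using only the JDK's StAX classes. The class `RunSummaryScan`, the helpers `advanceTo` and `countRunSummaries`, and the sample XML are hypothetical names made up for illustration; they are not part of the library. Because nothing is unmarshalled here, the reader simply walks through nested elements as well.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RunSummaryScan {

    // Same loop as advanceReaderToNextRunSummary above, with the tag name as a parameter.
    static boolean advanceTo(XMLStreamReader xsr, String tag) throws XMLStreamException {
        do {
            if (xsr.next() == XMLStreamConstants.END_DOCUMENT)
                return false;
        } while (!(xsr.isStartElement() && xsr.getLocalName().equals(tag)));
        return true;
    }

    // Hypothetical helper: counts <msms_run_summary> start tags found at any depth.
    static int countRunSummaries(String xml) {
        try {
            XMLStreamReader xsr = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                int count = 0;
                while (advanceTo(xsr, "msms_run_summary")) {
                    // The reader is positioned on a start tag, so attributes are readable.
                    System.out.printf("Found run summary, base_name: %s%n",
                            xsr.getAttributeValue(null, "base_name"));
                    count++;
                }
                return count;
            } finally {
                xsr.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep this sketch free of checked exceptions.
            throw new IllegalStateException("Could not read XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document; real pep.xml files carry many more elements and attributes.
        String xml = "<msms_pipeline_analysis>"
                + "<msms_run_summary base_name=\"run1\"><spectrum_query/></msms_run_summary>"
                + "<msms_run_summary base_name=\"run2\"/>"
                + "</msms_pipeline_analysis>";
        System.out.println(countRunSummaries(xml));
    }
}
```

The loop compares only the local name, so it matches the element whether or not the file declares a default namespace. Memory use stays flat because the stream reader never builds a tree of the document; in the full example it is the per-element unmarshalling step that allocates objects.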

website/content/tutorial/setting-up-development-environment.md

+1-1
@@ -1,5 +1,5 @@
11
---
2-
weight: 51
2+
weight: 55
33
title: Setting up development environment
44
summary: "This guide will quickly step you through setting up the environment for developing new functionality for BatMass. All the downloads, setting up the IDE and up to building BatMass from scratch."
55
menu:
