| 1 | +--- |
| 2 | +weight: 52 |
| 3 | +title: Parsing pep.xml files |
| 4 | +summary: "The data access library provides parsers for file formats common to the proteomics field, such as PepXML, ProtXML and MzIdentML. In this tutorial I'll show you how to parse a PepXML file." |
| 5 | +menu: |
| 6 | + main: |
| 7 | + parent: Tutorials |
| 8 | + identifier: "Parsing pep xml files" |
| 9 | +--- |
| 10 | + |
| 11 | +All the classes responsible for parsing files live in the `umich.ms.fileio.filetypes` package, each format in its own subpackage, e.g. `umich.ms.fileio.filetypes.pepxml` for PepXML files. Most of those subpackages contain a separate `example` package with working examples. |
| 12 | + |
| 13 | +## Parsing identification files (PepXML, ProtXML, MzIdentML) |
| 14 | +The library gives low-level access to file formats that store peptide identifications. |
| 15 | +There is no unifying API here, as the formats are very different. These parsers are not hand-optimized for efficiency, so they might consume quite a bit more memory than they should, but they are also error-resilient. |
| 16 | + |
| 17 | +Working with these files is as simple as making a single call to the `parse(Path)` method |
| 18 | +of the corresponding parser. You get a single data structure that follows the |
| 19 | +XML schema of the format. Here's a quick PepXML example: |
| 20 | + |
| 21 | +```java |
| 22 | +Path path = Paths.get("some-path-to.pep.xml"); |
| 23 | +// a single call to parse the whole file |
| 24 | +MsmsPipelineAnalysis analysis = PepXmlParser.parse(path); |
| 25 | +``` |
| 26 | + |
| 27 | +And that's it. The whole file is parsed and stored in memory. Let's explore |
| 28 | +the contents of the file: |
| 29 | + |
| 30 | +```java |
| 31 | +// iterate over the parsed search results |
| 32 | +List<MsmsRunSummary> runSummaries = analysis.getMsmsRunSummary(); |
| 33 | +for (MsmsRunSummary runSummary : runSummaries) { |
| 34 | + List<SpectrumQuery> spectrumQueries = runSummary.getSpectrumQuery(); |
| 35 | + System.out.printf("Spectrum queries from MS/MS run summary: %s\n", |
| 36 | + runSummary.getBaseName()); |
| 37 | + for (SpectrumQuery sq : spectrumQueries) { |
| 38 | + System.out.printf("Spec ID: [%s], RT: [%.2f], precursor neutral mass: [%.3f]\n", |
| 39 | + sq.getSpectrum(), sq.getRetentionTimeSec(), sq.getPrecursorNeutralMass()); |
| 40 | + } |
| 41 | + System.out.printf("Done with MS/MS run summary: %s\n", runSummary.getBaseName()); |
| 42 | +} |
| 43 | +``` |
| 44 | + |
| 45 | +## Parsing huge identification files more efficiently |
| 46 | +Sometimes you might have PepXML files that are many gigabytes in size. This happens when you combine search results from multiple experiments and store them in a single output file. In that case, you can use the `XMLStreamReader` class to first advance the input stream to some large structural element of the underlying file, such as `<msms_run_summary>` in PepXML files, and then unmarshal only that element. |
| 47 | +For this to work in general, you need an idea of how the files are organized; explore the corresponding XML schemas for insight. The schemas can also be found in the library sources, in the `resources` directories of the file-specific subpackages of `umich.ms.fileio.filetypes`. |
| 48 | + |
| 49 | +```java |
| 50 | +String file = "/path/to/some.pep.xml"; |
| 51 | +Path path = Paths.get(file); |
| 52 | + |
| 53 | +try (FileInputStream fis = new FileInputStream(file)) { |
| 54 | + // we'll manually iterate over msmsRunSummaries - won't need so much memory |
| 55 | + // at once for processing large files. |
| 56 | + JAXBContext ctx = JAXBContext.newInstance(MsmsRunSummary.class); |
| 57 | + Unmarshaller unmarshaller = ctx.createUnmarshaller(); |
| 58 | + |
| 59 | + XMLInputFactory xif = XMLInputFactory.newFactory(); |
| 60 | + |
| 61 | + StreamSource ss = new StreamSource(fis); |
| 62 | + XMLStreamReader xsr = xif.createXMLStreamReader(ss); |
| 63 | + |
| 64 | + |
| 65 | + while (advanceReaderToNextRunSummary(xsr)) { |
| 66 | + // we've advanced to the next MsmsRunSummary in the file |
| 67 | + long timeLo = System.nanoTime(); |
| 68 | + JAXBElement<MsmsRunSummary> unmarshalled = unmarshaller |
| 69 | + .unmarshal(xsr, MsmsRunSummary.class); |
| 70 | + long timeHi = System.nanoTime(); |
| 71 | + System.out.printf("Unmarshalling took %.4fms (%.2fs)\n", |
| 72 | + (timeHi-timeLo)/1e6, (timeHi-timeLo)/1e9); |
| 73 | + MsmsRunSummary runSummary = unmarshalled.getValue(); |
| 74 | + if (runSummary.getSpectrumQuery().isEmpty()) { |
| 75 | + String msg = String.format("Parsed msms_run_summary was empty for " + |
| 76 | + "'%s', summary base_name '%s'", |
| 77 | + path.toUri().toString(), runSummary.getBaseName()); |
| 78 | + System.out.println(msg); |
| 79 | + } |
| 80 | + } |
| 81 | +} |
| 82 | +``` |
| 83 | + |
| 84 | +The secret ingredient here is the code that advances the `XMLStreamReader`: the `advanceReaderToNextRunSummary(XMLStreamReader)` method. |
| 85 | +The example assumes we want to parse multiple `<msms_run_summary>` elements from the file one by one. |
| 86 | +```java |
| 87 | +private static boolean advanceReaderToNextRunSummary(XMLStreamReader xsr) |
| 88 | +    throws XMLStreamException { |
| 89 | +  // check the current event first: after unmarshalling, the reader may already sit on the next start tag |
| 90 | +  while (!(xsr.isStartElement() && xsr.getLocalName().equals("msms_run_summary"))) { |
| 91 | +    if (!xsr.hasNext() || xsr.next() == XMLStreamConstants.END_DOCUMENT) |
| 92 | +      return false; |
| 93 | +  } |
| 94 | +  return true; |
| 95 | +} |
| 96 | +``` |
| 97 | + |
| 98 | +And here are all the import statements for the last example: |
| 99 | +```java |
| 100 | +import umich.ms.fileio.filetypes.pepxml.PepXmlParser; |
| 101 | +import umich.ms.fileio.filetypes.pepxml.jaxb.standard.MsmsPipelineAnalysis; |
| 102 | +import umich.ms.fileio.filetypes.pepxml.jaxb.standard.MsmsRunSummary; |
| 103 | + |
| 104 | +import javax.xml.bind.JAXBContext; |
| 105 | +import javax.xml.bind.JAXBElement; |
| 106 | +import javax.xml.bind.JAXBException; |
| 107 | +import javax.xml.bind.Unmarshaller; |
| 108 | +import javax.xml.stream.XMLInputFactory; |
| 109 | +import javax.xml.stream.XMLStreamConstants; |
| 110 | +import javax.xml.stream.XMLStreamException; |
| 111 | +import javax.xml.stream.XMLStreamReader; |
| 112 | +import javax.xml.transform.stream.StreamSource; |
| 113 | +import java.io.FileInputStream; |
| 114 | +import java.nio.file.Path; |
| 115 | +import java.nio.file.Paths; |
| 116 | +``` |
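
To try the skip-ahead technique from the streaming example without the library or JAXB on the classpath, here is a minimal sketch that uses only the JDK's StAX API. The class name `RunSummaryScan`, the helper `baseNames`, and the tiny in-memory document are all made up for illustration; a real pep.xml file carries a namespace and far more content. Nothing consumes the element in this sketch, so the helper calls `next()` before testing the event, otherwise it would stay on the same start tag forever.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the library.
public class RunSummaryScan {

  /** Moves the reader forward to the next msms_run_summary start tag, if any. */
  public static boolean advanceToNextRunSummary(XMLStreamReader xsr)
      throws XMLStreamException {
    while (xsr.hasNext()) {
      if (xsr.next() == XMLStreamConstants.START_ELEMENT
          && "msms_run_summary".equals(xsr.getLocalName())) {
        return true;
      }
    }
    return false; // reached the end of the document
  }

  /** Collects the base_name attribute of every run summary in the XML text. */
  public static List<String> baseNames(String xml) throws XMLStreamException {
    XMLStreamReader xsr = XMLInputFactory.newFactory()
        .createXMLStreamReader(new StringReader(xml));
    List<String> names = new ArrayList<>();
    try {
      while (advanceToNextRunSummary(xsr)) {
        // a null namespace URI means the attribute namespace is not checked
        names.add(xsr.getAttributeValue(null, "base_name"));
      }
    } finally {
      xsr.close();
    }
    return names;
  }

  public static void main(String[] args) throws XMLStreamException {
    // Toy document with two adjacent run summaries (invented content).
    String xml = "<msms_pipeline_analysis>"
        + "<msms_run_summary base_name=\"run_a\">"
        + "<spectrum_query spectrum=\"a.1\"/>"
        + "</msms_run_summary>"
        + "<msms_run_summary base_name=\"run_b\"/>"
        + "</msms_pipeline_analysis>";
    System.out.println(baseNames(xml)); // prints [run_a, run_b]
  }
}
```

Only one element's worth of events is looked at per step, which is why the same pattern keeps memory use low when each element is handed to an unmarshaller instead of just being inspected.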