A discussion of basic data cleaning concepts for quantitative proteomics data and some useful notebook quality control (QC) metrics.
Starting proteomics data tables may have extra information not needed for quantitative use in notebooks and R scripts. R likes rectangular, numerical tables with few or no missing values (NAs or zeros). Typical proteomics protein summary tables have mixed data types (sometimes mixed in the same column) with numerical and text columns. Cells can also be messy. It is not uncommon to have single values or lists in individual cells. These data ills are above and beyond the “wide” and “long” versions of tables related to “tidy” data. R does not have very good native support for text strings (there are packages that help). R tends to treat any text as data labels (also called factors), and factors are internally stored as distinct integers.
I like Excel for table exploration and filtering (data cleaning/wrangling) to prepare simpler quantitative tables for use in notebooks with R. Every experiment and software tool behaves a little differently and any canned algorithm for data prepping will be complicated and fragile. Prepping data in Excel is counter to reproducible data processing, although the prep steps could be automated in a script once the processing steps are known for particular data.
I have an IRS script in my pipeline for multiple-plex TMT experiments that performs many of these data cleaning steps. Some data prep work is still done on the main protein summary with TMT intensities to label samples and identify/verify contaminant proteins, but most of the steps described below would not need to be done in that case.
- Make sure that the sample key and biological group membership are known for the LC runs. Many QC metrics are applied within biological groups rather than across all samples in an experiment.
- Make R-compatible short names that identify the sample and biological group.
- Use underscores as separators. R also accepts periods in variable names, but periods have special meanings in other languages (Python, for example).
- I usually do QC checks on protein-level summaries for the starting data table. There are many ways to summarize peptide-level measurements in bottom-up data into protein relative abundance estimates. Some methods are based on common sense, make use of protein inference, and assume proteins are the sum of their parts. Other summaries are less anchored to reality and may be based on mathematical properties of the summary values (assuming what is mathematically “better” can be defined). Unless the biological subjects are mathematicians, I would avoid mathematical/statistical forms of protein summarization.
- I set up some leading rows for working space at the top of the Excel sheet. This violates typical good table design where the headers are in the first row. The prepped data will be saved in a tab-delimited text file with the headers in the first row after the Excel work has been completed.
- The column headers will usually be in Row 5 or Row 10.
- I add sample/group names in the row above the actual quantitative data column headers. This allows for left-to-right table sorting to group samples. Sorting columns by biological group might need to be done at some point.
- I select the column header row (5 or 10), make it bold, and then turn on column filters.
- I use split views to see the top and the bottom of the table at the same time.
- Identify possible contaminants and exclude those proteins. Use of common contaminant sequence collections during searching catches some but not all contaminants. Also, some common contaminants may not be contaminants in all experiments (every proteomic experiment seems different).
- I usually insert a “ranking” numerical column with an abundance quantity (average spectral counts per sample, average intensity per sample) and will eventually sort descending to order the table by decreasing abundance (and move excluded proteins to the bottom of the table). Many proteomics results tables have some sort of text column to denote (flag) contaminants, decoys, etc.
- I set the values in the “ranking column” to negative values for decoys (-3), standard contaminants (-2), any other proteins to exclude (-1), etc.
- There may be additional keratins to exclude. My tables have protein FASTA description columns, so I can search for “keratin”, “hemoglobin”, etc.
- Blood proteins, for example, are not always contaminants. Hemoglobins in blood preps, however, are usually considered contaminants.
- There may be proteins that were IDed without any quant values. Quantification is a higher bar than identification.
- If I have “ranking” values of zero (suggesting that there were no quant values), those proteins can be flagged, too.
- When proteins to exclude have been determined, I sort descending on the ranking column. Decreasing relative abundance is a more biological way to order the table.
- Excluded (and zero) proteins get moved to the bottom of the table.
- I color in grey the cells for the excluded proteins at the bottom (to make sure that they do not escape from their bottom cells in some later Excel step).
- The fraction of signal associated with contaminants per sample can be useful to compute. Once the table has been re-ordered and contaminants have been labeled, sums of quantities for contaminants, for all proteins, and for non-excluded proteins can be computed. Those are things I put in the extra top rows.
- The total signal of non-contaminant proteins per sample can also be helpful in finding samples that may have had a poor LC-MS experience. Both of these per-sample sums are sketched in R below.
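For reference, the same per-sample sums can be computed in R once the prepped table is loaded. This is a minimal sketch; the data frame name `prot`, the `Flag` column, and the quant column naming are placeholders for whatever the actual table contains:

```r
# prot: data frame with a text "Flag" column (e.g., "contaminant", "decoy", "keep")
# and numeric quant columns; both names are hypothetical placeholders
quant_cols <- grep("^TMT_", names(prot), value = TRUE)  # assumed column naming

quant <- prot[, quant_cols]
quant[is.na(quant)] <- 0                                 # treat empty cells as zero signal

total_signal  <- colSums(quant)                          # all proteins, per sample
contam_signal <- colSums(quant[prot$Flag == "contaminant", , drop = FALSE])
keep_signal   <- colSums(quant[prot$Flag == "keep", , drop = FALSE])

round(100 * contam_signal / total_signal, 2)             # percent contaminant signal per sample
keep_signal                                              # non-contaminant total per sample
```

Samples with unusually high contaminant fractions or unusually low non-contaminant totals are worth flagging for a closer look in the later QC steps.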
Prep work related to missing data can be quick and dirty, or more fancy. The lowest abundance proteins will have the most detection variability and contain the most missing values. These proteins are identifiable, but too much missing data makes them non-quantifiable. The prelude to missing data imputation is to sort the table by decreasing relative abundance and determine an abundance cutoff that excludes the low abundance, non-quantifiable proteins. With some luck, this will remove much of the missing data and change the nature of the missing value imputation problem.
Note that folks talking about left-censored missing data imputation are trying to do data imputation for non-quantifiable proteins. This is the first clue that poor understanding of the data is the most important problem to fix.
- A “ranking” column was added (above).
- I add a column with a count of missing values per protein (assuming missing values are all encoded the same way – like empty cells or zeros). A small R sketch of this counting and the cutoff flagging follows this list.
- Depending on your brain, you can do the counting as missing values or as non-missing values.
- Quick and dirty is to ignore biological groups when counting things.
- Fancy is to count missing values by conditions (biological groups). Fancy requires multiple count columns and more logic for exclusion testing. I will not talk about that to save space. You can read a little more about considering missing values per biological group in this README.
- Figure out a low cutoff value for the “ranking” quantity to define and exclude non-quantifiable proteins. For example, in spectral counting studies (SpC - remember those?), an average SpC of 2.5 across the samples has worked well for data I analyzed. I require 2 peptides per protein, so 2.5 counts is a bit better than the minimum. In two state comparisons, you could have 5 counts in one condition and no counts in the other condition and hit the average of 2.5. This is more accommodating for cases where proteins might not be present in some of the conditions.
- This is a bit of an art form. Data may need some simple normalization to get samples on a more common scale. I try to ignore missingness in the determination of what is quantifiable. Proteins might be absent in some conditions for biological reasons. I favor a more traditional measurement metric like signal-to-noise ratio.
- Check column totals or medians across the samples. Some global scaling normalization may need to be done. Whether to pick the median protein value or the grand total of values as the thing to make the same across samples depends on study design and the nature of the samples. There are also some experiment types (AP/IP studies, for example) where neither of these normalizations would be appropriate. Most bench protocols specify some total amount of material per sample to work with, and LC loading also tends to be a total sample amount. Start with matching grand totals between samples (I use scaling factors to adjust each sample to the experiment-wide average grand total), then check how consistent the median protein quant values are between samples. You can also check median protein quant values per biological group. Samples within groups should have similar compositions and similar median values. Median values can differ between groups if the groups have different compositions. What to do if median intensities are not very similar between samples gets complicated. Options depend on experiment specifics. It may take some trial and error to find what works.
- See what the “ballpark” smallest measured values are in all samples. If the smallest values are somewhat consistent, then a common cutoff that applies to all samples can work. If we assume the smallest values are the noise level, then a sensible cutoff would be 2 to 3 times the noise level.
- Add a column to denote whether proteins are above or below the cutoff. Using a formula so that the cutoff can be adjusted is wise. I also add a column of ones for row counting and use a SUBTOTAL function call (function number 109) in a top cell to count the number of visible rows (proteins). How many quantifiable proteins out of the total number of proteins a sensible cutoff produces depends on the type of proteomics quant method and the proteome depth. Fractionated samples with deeper proteome profiling should have larger fractions of quantifiable proteins compared to single-shot analyses. 80% to 90% quantifiable proteins is a good case. 60% or fewer quantifiable proteins suggests that the quantitative proteomics method may not have been a good choice for the samples under study.
- Check the nature of missing values for a band of proteins just above the cutoff and for the proteins below the cutoff. Typical data has few or no missing values (at the protein level) for high-abundance to medium-abundance proteins. As abundance goes down, missingness goes up. Peptides a little bit above the instrument LC-MS noise level tend to trigger MS2 scans in most runs. It is the peptides very close to the trigger noise level that are variable. Very low abundance proteins will have few detected peptides, and those peptides will be close to the noise levels. There may be a significant number of barely detectable proteins depending on protein ID criteria (a 2-peptide rule implies a modest abundance increase over single-peptide IDs, for example). Thus, there may be a lot of missing values associated with the proteins below the putative cutoff level.
- There are some other missingness metrics that can be computed. The fraction of missing data cells out of total data cells for the proteins below the cutoff is useful. The fraction missing for a band of a similar number of proteins just above the cutoff makes a good contrast. The fraction missing for all proteins above the cutoff is also useful.
- Conditional formatting (coloring) of cells with missing values and then zooming way out on the sheet is a visual way to assess how much and where the missingness is.
- After some detective work and some trial and error, the goal is a column with some call about whether proteins are quantifiable or not. A relative abundance rank cutoff might have been what worked; conditional logic based on counts of missing values could also have worked. After something is decided, the table can be further sorted to move the non-quantifiable proteins below the quantifiable proteins.
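The quick-and-dirty counting and flagging above can also be done in R instead of Excel. A minimal sketch, assuming a numeric matrix `quant` (proteins in rows, samples in columns, missing values encoded as zero or NA) and a hypothetical cutoff value; the real cutoff has to come from the detective work described above:

```r
# quant: numeric matrix of protein quant values (rows = proteins, columns = samples)
quant[is.na(quant)] <- 0                      # make all missing values zeros

missing_per_protein <- rowSums(quant == 0)    # missing value count per protein
ranking <- rowMeans(quant)                    # average quant value (the "ranking" column)

cutoff <- 2.5                                 # hypothetical cutoff (the SpC example above)
quantifiable <- ranking >= cutoff             # quick-and-dirty quantifiable call

table(quantifiable)                           # how many proteins pass the cutoff

# fraction of missing cells below and above the cutoff
mean(quant[!quantifiable, ] == 0)
mean(quant[quantifiable, ] == 0)
```

The fancier per-group version would compute the same counts within each biological group's columns and combine them with whatever exclusion logic fits the study design.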
Make a “quant table” in a new tab. After ordering the table so that excluded proteins are at the bottom, non-quantifiable proteins are below the quantifiable proteins, and the quantifiable proteins make up the upper part of the table, the data for the quantifiable proteins can be copied to a new table in a new tab.
- The first column should be a protein/gene identifier text string (UniProt protein accessions are good). I start the column with its header in Row 5 of the new tab. The quantitative data for the samples are copied next. I capture the sample/group short names in the row above the data column headers along with the data. The data should be just the quantifiable proteins, and the length of the data columns should match the length of the protein labels. The column headers of the data can be replaced with the short sample/group names (I usually save the original column headers above the new column names; this is like switching Rows 4 and 5 for the data columns). Data columns can be sorted left-to-right to group samples by biological condition, or that can be done in the R code.
- The table will still have missing values. Hopefully, they are not a large fraction of cells and the missingness is well dispersed among the cells (but it will still be more concentrated in the lower abundance proteins). How those missing values are replaced depends on whether simple QC checking is being done or statistical testing is being done. Replacement of missing values with a constant, small value often works okay for QC work. A value of about one quarter of the (ballpark) smallest non-zero measured value is something I often use. Replacement can be done in the new tab in Excel or is easy to do in R (see the sketch after this list). NAs and/or zeros will usually give errors in the R scripts.
- Once the column headers are informative for the samples and missing values are taken care of, the finished data table can be selected and saved as a tab-delimited text file. The first column should be the protein labels, then the data columns. The first row should be the column headers. I copy the data table to the clipboard and then paste it into a programmer’s editor to save as a text file. My goal is to make a more R-friendly data table as the starting point for the notebooks. I try to do the Excel steps in ways that are somewhat transparent and try not to delete any rows or columns along the way. There are many other ways to do this.
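The R side of the hand-off is short once the text file exists. A minimal sketch, with a hypothetical file name and the roughly one-quarter-of-the-smallest-value replacement mentioned above:

```r
# read the prepped table (the file name is a hypothetical placeholder)
prot <- read.delim("prepped_quant_table.txt", stringsAsFactors = FALSE)

accessions <- prot[[1]]               # first column: protein labels
quant <- as.matrix(prot[, -1])        # remaining columns: quant values

# replace missing values (zeros or NAs) with a small constant value
quant[is.na(quant)] <- 0
small_value <- 0.25 * min(quant[quant > 0])  # about 1/4 of the smallest non-zero value
quant[quant == 0] <- small_value
```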
These are the data wrangling and cleaning steps that are common in any data science work. Many notebooks start with these steps once they have been figured out. Figuring out how to clean the data is usually not included. That could have been done in a variety of quick-and-dirty programming scripts or notebooks. That exploratory work may be included in the final notebook, may be in another notebook, or may be hidden to make the data analyst look like a genius.
I have spent many years (since about 1986 on the Mac) using Excel. I can do most data exploration, wrangling, and cleaning in Excel much faster than with other programming environments. Excel is just an interactive what-you-see-is-what-you-get vector programming tool. Anything that can be done in Excel can be done in R or pandas (or vice versa).
The important concepts here are that proteomics data needs to be explored because the assumptions and edge cases vary experiment-to-experiment. Data always needs some basic normalizations. Data will always have some missingness to handle. Quantitative proteomics results usually include non-quantifiable proteins, and those have to be discovered and excluded. The signature for non-quantifiable proteins depends on the proteomics method, how well the experiment worked, and many other factors.
Quality control metrics for cleaned data can be more standardized. A few years ago I started making dedicated quality control notebooks for analysis of projects. I used to combine QC and statistical steps in single notebooks. That made the notebooks too long. The intersection in the Venn diagram of folks who want to look at QC notebooks and folks who want to look at statistical testing notebooks is (I am pretty sure) just me. There are some recent QC notebooks (and some discussion of a few of the metrics in the README) in the PXD011691_reanalysis repository. Below are the typical metrics I use in my QC notebooks.
There are often only a few sensible ways to normalize the data. Checking some data characteristics before and after normalization is one of the first things I like to do. I do boxplots of the log10 of the quant values (spectral counts, MS1 feature intensities, TMT intensities, DIA intensities) before and after normalization. Boxplots of the distributions of logged protein relative abundances should be similar (sizes of interquartile ranges, length of whiskers, etc.) and be in horizontal alignment (similar median values). Good data will always have good-looking boxplots. Good-looking boxplots do not imply good data, though.
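A minimal boxplot sketch, assuming `raw` and `normed` are matrices of the same quant values before and after normalization (missing values already replaced so the logs are safe) and `group_colors` is a color per sample; all of these names are placeholders:

```r
# side-by-side boxplots of logged intensities before and after normalization
par(mfrow = c(1, 2))
boxplot(log10(raw), col = group_colors, las = 2,
        main = "Before normalization", ylab = "log10 intensity")
boxplot(log10(normed), col = group_colors, las = 2,
        main = "After normalization", ylab = "log10 intensity")
par(mfrow = c(1, 1))
```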
I mostly use the trimmed mean of M-values (TMM) normalization function from edgeR for normalizing proteomics data. This is just a more robust matched-median scaling method. It does assume that all the sample groups should have some global similarity in proteomes. The TMM scaling factors are informative and should be sensible. Large adjustment factors may mean that some samples did not work well or that the experiment has sample groups that are not very similar in composition.
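A minimal sketch of grand-total (sample loading) scaling followed by TMM, assuming `quant` is the matrix of quant values for the quantifiable proteins with missing values already replaced:

```r
library(edgeR)

# sample loading (SL) normalization: match grand totals across samples
sl_factors <- mean(colSums(quant)) / colSums(quant)
quant_sl <- sweep(quant, 2, sl_factors, FUN = "*")

# TMM factors from edgeR (should be close to 1.0 for well-behaved data)
tmm_factors <- calcNormFactors(quant_sl)
quant_tmm <- sweep(quant_sl, 2, tmm_factors, FUN = "/")

round(tmm_factors, 3)   # sanity check the scaling factors
```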
PCA or MDS clustering plots are the best way to check for batch-like effects (and to sanity check the experiment). Samples in each biological group (biological replicates) should be similar to each other and cluster together. Samples from different biological groups should show some differences and have separation between clusters. Pre- and post-normalization plots may not differ much qualitatively.
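A minimal clustering sketch, assuming `quant_tmm` from above and a factor `groups` giving the biological group of each sample (both placeholders); `plotMDS` comes from limma, which loads with edgeR:

```r
library(limma)

# multidimensional scaling view of the samples, colored by biological group
plotMDS(log2(quant_tmm), col = as.numeric(groups), main = "MDS of samples")

# a base R PCA alternative (samples as rows)
pca <- prcomp(t(log2(quant_tmm)))
plot(pca$x[, 1], pca$x[, 2], col = as.numeric(groups), pch = 16,
     xlab = "PC1", ylab = "PC2", main = "PCA of samples")
```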
Coefficients of variation (CVs) of proteins within biological groups are sensitive to normalization and batch effects. I like to compute the distributions of CVs by biological group and check the median values and distribution shapes before and after normalization. The median CVs have characteristics that depend on the quantitative proteomics technique and sample type. Technical replicates have the lowest CVs, cell cultures are the next lowest, animal models come next, and human samples will have the highest CVs. Human and animal samples tend to have lower CVs for tissue samples that are easier to work with (and/or collect). Biofluids are their own can of worms. Cancer tissue samples can be really atypical.
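A minimal CV sketch using the same placeholder `quant_tmm` matrix and `groups` factor:

```r
# percent CV of each protein within one set of sample columns
cv_percent <- function(mat) 100 * apply(mat, 1, sd) / rowMeans(mat)

# one CV distribution per biological group
cv_list <- lapply(levels(groups), function(g) cv_percent(quant_tmm[, groups == g]))
names(cv_list) <- levels(groups)

sapply(cv_list, median, na.rm = TRUE)    # median CV per group
boxplot(cv_list, las = 2, ylab = "Protein CV (%)", main = "CV distributions by group")
```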
Sample-to-sample scatter plot grids are another important view. The idea is that samples within a biological group should be similar and have “tight” scatter plots. Depending on the number of biological replicates per group, random subsets of samples may be needed to keep the scatter plots large enough to be useful (5 or 6 samples per grid is a practical maximum). Correlation coefficients are not very useful for these scatter plots. The range of values tends to be between 0.8 and 1.0, with 0.9 as the (ballpark) divider between bad and good. Correlation coefficients are not nearly as sensitive as the subjective impression of “tight” versus broad scatter. Average vectors for each group can be compared to each other in scatter plot grids, too. This can be a good preview of the statistical testing results (lots of DE candidates or few DE candidates).
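A minimal scatter plot grid sketch with base R `pairs`, again using the placeholder `quant_tmm` and `groups` objects:

```r
# within-group scatter plot grid (log10 scale) for one biological group
group_to_plot <- levels(groups)[1]
pairs(log10(quant_tmm[, groups == group_to_plot]),
      pch = 16, cex = 0.3, main = paste("Within-group scatter:", group_to_plot))

# group average vectors compared to each other
group_means <- sapply(levels(groups), function(g) rowMeans(quant_tmm[, groups == g]))
pairs(log10(group_means), pch = 16, cex = 0.3, main = "Group average comparisons")
```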
These different QC views are complementary, and some may reveal things that others do not. For example, a sample that does not cluster with the other samples in its group could be an outlier. It may significantly increase the median CV for that group. It may have scatter plots with wider scatter than the other samples in that group have with each other. The set of different metrics should be consistent – all suggesting good samples or all pointing to an outlier case.
The data cleaning steps and QC notebooks are a semi-automated approach to exploratory data analysis. I use the QC notebooks more as templates – making copies and editing them for each project. The markdown cells capture observations about the data, and each project seems to have something unique and interesting about its data. When several projects, with a variety of sample types and experimental designs, have been processed with some consistent set of QC metrics, then a baseline knowledge of what constitutes good data can be established. QC always benefits from a lot of context.
April 17, 2023
Phil Wilmarth
PSR Core, OHSU