Computational Methods to Improve and Validate Peptide Identifications in Proteomics

Author: Lei Wang (Computer scientist)
Category : Machine learning
Languages : en

Book Description
With the rapid development of mass spectrometry technology over the past decade and the recent large-scale proteomics projects, massive and highly redundant tandem mass spectra (MS/MS) are being generated at an unprecedented speed. Hundreds of proteomics studies have been published, yet computational methods that can efficiently identify and analyze this volume of proteomic MS/MS data are still lacking. The thesis aims to provide systematic approaches to studying MS/MS data from three aspects: spectral clustering, spectral library searching, and validation of peptide-spectrum matches (PSMs).

I first introduce a rapid algorithm, accelerated by Locality Sensitive Hashing (LSH) techniques, that reduces the redundancy in proteomics datasets by clustering similar spectra. The proposed method achieves a 7-11x speedup in running time while retaining superior sensitivity and accuracy compared to state-of-the-art spectral clustering algorithms. Besides reducing the number of repeated similar spectra, clustering greatly shortens the time needed for protein database searching, a commonly used technique for peptide identification, because the search runs on consensus spectra, which usually exhibit higher quality than the raw spectra. As a result, more peptide identifications are obtained at the same low false discovery rate (FDR).

The second chapter delves into spectral library searching, a complementary approach to database searching for peptide identification from MS/MS spectra. LSH techniques ensure that similar spectra are placed into the same buckets, whereas spectra with low pairwise similarity are scattered into different buckets. Each input experimental spectrum can then be compared against only a subset of highly similar spectra, thereby avoiding unnecessary similarity computations between the input spectrum and all candidate peptides.
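The bucketing idea described above can be illustrated with a minimal sketch. It assumes random-hyperplane (cosine) LSH over binned spectra, which is one common LSH family and not necessarily the scheme used in the thesis; the function names, bin width, and number of hash bits are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Turn a list of (m/z, intensity) peaks into a fixed-length intensity vector."""
    vec = [0.0] * int(max_mz / bin_width)
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < len(vec):
            vec[idx] += intensity
    return vec

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: the sign of the dot product with the spectrum vector."""
    return tuple(
        sum(v * h for v, h in zip(vec, plane)) >= 0.0 for plane in hyperplanes
    )

def build_buckets(spectra, hyperplanes):
    """Group spectrum names by LSH key; spectra with a small angle tend to collide."""
    buckets = defaultdict(list)
    for name, peaks in spectra.items():
        buckets[lsh_key(bin_spectrum(peaks), hyperplanes)].append(name)
    return buckets
```

A query spectrum would be hashed with the same hyperplanes and compared only against the members of its own bucket. In practice several independent hash tables are typically used so that similar spectra separated by one unlucky hyperplane are still found.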
The identified peptides overlap to a great extent with those reported by existing algorithms. More importantly, the speedup of the proposed algorithm over existing ones grows as the size of the spectral library increases.

Redundancy in large-scale proteomic datasets is exploited to further improve the search results by eliminating false PSMs in a post-processing step. Despite the success of search algorithms in proteomics, peptide identification results usually contain a small fraction of incorrect peptide assignments. The target-decoy approach was introduced in previous work to assess the quality of identifications by searching each spectrum against both target and decoy sequences. I formalize a method that improves peptide identifications by removing false PSMs in a probabilistic post-processing approach. As a result, an FDR as low as 0.8% can be obtained on the remaining PSMs previously reported at the 1% FDR level, and up to 38% more unique peptides can be reported at the expected FDR level.

I anticipate that the computational methods developed in the dissertation can advance the proteomics research field by improving protein identification through database searching and spectral library searching, and by validating the search outputs in a subsequent step. Although the algorithms were evaluated for proteomics studies, they can be extended to small molecules such as natural products, lipids, and glycoconjugates. These algorithms can also be generalized to the identification of experimental MS/MS spectra from a molecule of specific interest in massive omics datasets.
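For readers unfamiliar with the target-decoy approach mentioned in the description, the following is a minimal sketch of the standard FDR estimate (decoy hits divided by target hits above a score threshold). It is not the probabilistic post-processing method of the thesis; the function names and the `(score, is_decoy)` input layout are assumptions made for illustration.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR among PSMs scoring at or above threshold.

    psms: list of (score, is_decoy) pairs (assumed layout).
    Estimate used here: number of decoy PSMs / number of target PSMs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is at or below max_fdr.

    Returns None if no threshold satisfies the bound.
    """
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
    return best
```

For example, with `max_fdr=0.01` the returned cutoff corresponds to the "1% FDR level" at which PSMs are reported; a post-processing step that removes false PSMs would lower the decoy count above that cutoff.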