The complete sequencing of the human genome has opened the door to several new large-scale analysis techniques. One of these is the analysis of protein fragments by means of mass spectrometry (MS), which allows us to infer, detect and quantify (potentially) thousands of proteins in a single biological sample.
Proteins are the essential functional units in our cells and are thus a consequence of both our genetic traits and our (current) environmental influences. Proteomics is the general term encompassing all large-scale approaches to protein analysis, and MS-based analysis techniques have become the predominant proteomics strategy. One of the main goals of clinical research is to identify new biomarkers that can be used to diagnose diseases, measure disease progression, and allow treatment stratification. In practice, the discovery of novel biomarkers by means of proteomics has proven to be significantly more complicated than originally anticipated. Even though proteomics techniques are ideally suited for this task, it is undisputed that current results have not fulfilled the initial high expectations.
The aim of this thesis was to develop an analysis platform specifically designed to tackle the challenges faced by clinical proteomics in a multi-disciplinary research setting. To make this possible and to keep the vast amount of data manageable for the user, the data was organized in a relational database according to the analyzed sample's characteristics, in essence the cell type and its functional state. The development of this tool resulted in the Griss Proteomics Database Engine (GPDE), which is available as free software (http://gpde.sourceforge.net). However, simply making data available for analysis soon proved to be insufficient. During the first phase of the development of the GPDE we realised that several older experiments could no longer be compared to new ones because the accessions of several identified proteins no longer existed in the underlying protein database. To deal with this problem, we added an algorithm to the GPDE that updates the stored protein identifications. At that time, the PRIDE repository and the GPDE were, to our knowledge, the only two resources that contained mechanisms to update protein identifiers. The detailed analysis of this problem led to a research project that investigated and quantified the effect of changing protein identifiers and analyzed the efficacy of the algorithms used by the GPDE and PRIDE. In September 2011 one of the most popular protein sequence databases, the International Protein Index (IPI), was discontinued. We therefore expanded our previous analysis to assess and quantify the effect this unprecedented discontinuation would have on stored as well as newly performed experiments.
In preparation for a new version of the GPDE, able to handle the vast amounts of MS data produced by current mass spectrometers, we developed a Java Application Programming Interface (API) that can handle arbitrarily large MS data files (http://jmzreader.googlecode.com). This component can now be used independently by any software project to access MS data, and can serve as an interface between a repository based on a relational database and MS data stored in flat files. It is hard to estimate how successful the GPDE is outside our own laboratory. The latest release of the GPDE had been downloaded 192 times as of 8 March 2012. This figure suggests that the GPDE is used by other research groups and is a known resource in parts of the community.
As a further outcome of this thesis, the results of our study on the discontinuation of IPI have been taken up by the UniProt consortium to refine their "complete proteome" sets.