Completed Theses 2012

Comparison of protein families for different organisms

The aim of this thesis is to compare the protein sequences of different organisms. Comparative studies on the genomes of different organisms have been shown to provide valuable information about the evolutionary relationships of the analysed organisms and to give insights into the regulation of genes and proteins and the role of highly conserved regions. Sequence comparison between proteins can also hint at the function of newly sequenced proteins. This thesis compares the proteomes of 29 different organisms. Their reference proteomes are available in the EBI database and include members of the three main domains of living organisms: bacteria, eukaryotes and archaea.
The protein sequences were compared with a PSI-BLAST search performed on 13 selected organisms, for all proteins of a specific organism and for membrane proteins and soluble proteins separately. Membrane proteins were predicted using PolyPhobius, a transmembrane helix prediction algorithm based on Hidden Markov Models and homology information. The PSI-BLAST results were then selected using different threshold parameters, including the E-value of the PSI-BLAST hit, the percentage of identity and the alignment length, as well as the overlap in transmembrane helices for membrane proteins. In order to investigate the influence of the reference proteome sets on the results of a homology search, some of the analyses were repeated using the complete proteomes in the UniProt database.
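The threshold-based selection of hits described above can be sketched as follows. This is a minimal illustration and not the code used in the thesis: the `Hit` record, the `filter_hits` function and all cut-off values are hypothetical placeholders, and the transmembrane helix overlap criterion is left out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search hit; the field names are assumptions for this sketch."""
    query: str
    subject: str
    evalue: float        # E-value of the hit (lower is better)
    pct_identity: float  # percentage of identical aligned residues
    aln_length: int      # alignment length in residues


def filter_hits(hits, max_evalue=1e-3, min_identity=30.0, min_length=50):
    """Keep hits that pass all three thresholds (default values are illustrative only)."""
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and h.pct_identity >= min_identity
        and h.aln_length >= min_length
    ]


hits = [
    Hit("P1", "Q1", 1e-20, 85.0, 300),  # passes every threshold
    Hit("P1", "Q2", 1e-20, 25.0, 300),  # identity too low
    Hit("P1", "Q3", 0.5, 90.0, 300),    # E-value too high
    Hit("P1", "Q4", 1e-10, 60.0, 20),   # alignment too short
]
kept = filter_hits(hits)
```

Varying the three keyword arguments corresponds to repeating the analysis at different levels of required similarity.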
In principle, the results confirm the expected relationships between the different organisms, e.g. the proteomes of the mammalian species were more similar to each other than to species of more distant classes. However, the level of similarity between mammalian proteomes was not as high as the results of previous studies suggest, especially for high levels of similarity. The results of the analyses conducted exclusively for soluble proteins do not differ significantly from the results for transmembrane proteins, although for the proteomes of the higher organisms, it appears that transmembrane proteins are slightly more similar to each other than soluble proteins. When compared with the results based on the EBI reference proteomes, the analysis done for the UniProt complete proteomes demonstrates significantly lower levels of similarity for all analysed species. This shows the importance of an adequate reference proteome for a valid comparison of the different protein sequences.

Bachelor thesis
Student: Christiane Gasperi
Supervisor: Burkhard Rost, Edda Kloppmann

HPC full in silico mutagenesis

The purpose of this thesis is to implement and improve a pipeline that is fast enough to calculate millions of single amino acid substitutions and evaluate their effect on protein function. At the beginning, the price for one full human mutation run was above $10,000. With the newly developed pipeline, the price falls below $300. Currently, there is no database that contains all possible nsSNPs for human; therefore, we built GeMuDb (Gene Mutation Database). GeMuDb is the result of the project SNAP Map, which used parallel computing approaches to calculate every possible mutation in human proteins. The database contains a collection of in silico predicted nsSNPs with information about the effect of each SNP and the reliability of the prediction. So far, this is the largest annotation of nsSNPs. This database is useful for several reasons. One is to find active binding sites of proteins or to find correlations between diseases and mutations in proteins. The study of all possible mutations in human will give us insights into human diversity and variation. This knowledge is of considerable value for personalized medicine.
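To give a sense of the scale of a full in silico mutagenesis run, the sketch below enumerates every single amino acid substitution of a sequence. It is an illustration only, not part of the described pipeline: the function name `single_substitutions` is hypothetical, the input is assumed to contain only the 20 standard amino acids, and all 19 alternatives per position are listed regardless of whether a single nucleotide change could produce them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_substitutions(seq):
    """Yield every single amino acid substitution as 'X<pos>Y' (1-based position)."""
    for pos, wild_type in enumerate(seq, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{pos}{mutant}"


variants = list(single_substitutions("MKT"))
print(len(variants))  # 19 substitutions per residue, so 57 for three residues
```

Each residue contributes 19 variants, so the number of predictions grows linearly with the total length of the proteome, which is why throughput and cost per run matter.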

Bachelor thesis
Student: Martin Steinegger
Supervisor: Burkhard Rost

An evaluation of SNP and functional site analysis methods based on structural and evolutionary inference approaches

In this thesis, several SNP effect prediction and functional site prediction methods were evaluated based on their ability to predict SNP effects and functionally important residues. This involved the adaptation of the SNP effect prediction methods to functional site prediction and vice versa. Furthermore, a representative dataset for the evaluation of the prediction of the effect of SNPs and the prediction of functionally important sites was needed. Therefore, an SNP dataset and a functional site dataset were created. Next, the methods were adapted to their new applications, and subsequently the prediction power of each method on both applications was evaluated. To gain more insight into the prediction power of the methods, several subsets were created according to the cell localization of the protein or the region of an SNP within a protein. Finally, an ensemble of several methods was developed for both applications with the aim of outperforming the best method available so far for each prediction task.

Master thesis
Supervisor: Burkhard Rost

Building PSSH2 - new comprehensive database of alignments between protein sequences and tertiary structures

The aim of this thesis is to develop tools for the construction of the PSSH2 (Protein Sequence-to-Structure Homologies) database. This database is intended to replace the PSSH database, which assigns to a UniProt protein sequence its homologous PDB structures, including the pairwise sequence-to-structure alignments. PSSH uses a good but slow alignment method, MaxHom, and the database has to be prefiltered by BLAST, which loses coverage. Because of the rapid growth of the databases, PSSH has become unmanageable and outdated. The plan is to use the new iterative protein search and alignment method, HHblits, for the creation of PSSH2. HHblits is based on HMMs and performs better than the previous methods, including MaxHom and PSI-BLAST. HHblits achieves higher search sensitivity and precision and better alignment quality, and it is much faster.
Methods: First, the performance of HHblits in retrieving related proteins was evaluated in comparison to PSI-BLAST, using the COPS structural classification as gold standard. The parameters of the programs that give the best trade-off between sensitivity and precision were estimated. Second, the method for the PSSH2 construction, which is based on HHblits using the best search parameters found in the evaluation, was designed and implemented.
Results: As the evaluation showed, HHblits achieves the best results using the standard two iterations and other default settings. PSI-BLAST results are similar for all tested numbers of iterations (1-5). At a 20% false detection rate (targets with lower than 30% structural similarity to the query), HHblits retrieves 78% of targets with at least 30% structural similarity to the query and 91% with at least 60% structural similarity. PSI-BLAST finds 74% of proteins with at least 30% structural similarity and 90% with at least 60% structural similarity, at the same false detection rate. The sensitivity of HHblits is higher, especially for more distantly related proteins.
Conclusions: Due to the better performance of HHblits, the new PSSH2 database based on this method is expected to become a comprehensive source of reliable protein sequence-to-structure alignments. It will provide useful information on the homologous structures for each sequence (including those with unknown structure). PSSH2 can be used in different prediction methods that require homology information, e.g. protein structure and function prediction, SNP effect prediction and many more. Once the alignments for each UniProt sequence are calculated, the PSSH2 database is going to be visualized and integrated in the SRS 3D server (or its successor Aquaria).

Bachelor thesis
Student: Maria Kalemanov
Supervisor: Burkhard Rost, Andrea Schafferhans

In-depth comparison of predicted high- and low-impact SNPs from the 1000 Genomes Project

Since the human genome was completely sequenced and assembled in 2002, technologies in this area of research have made incredible progress. Today, the challenge to modern genetics is no longer the sequencing, but the processing of the resulting data. One focus in the analysis of large-scale sequencing data is the determination of differences between individuals at the DNA level, so-called single nucleotide polymorphisms (SNPs). The ultimate goal of this analysis is to measure the effects of SNPs on the organism or, more specifically, on protein function. For a few SNPs this can be done by experiments. However, wet lab experiments are too expensive and time consuming to apply them, for example, in personalized medicine or extensive QTL studies. In-silico predictions offer a potential remedy in this situation. The increase in available data and computational power has led to quite reliable results from current prediction tools. In this work, the convincing prediction performance of SNAP is used as a measure of the effect that single amino acid polymorphisms (SAAPs) have on protein function. This makes it possible to use all amino acid changing mutations in the 1000 Genomes Project data to analyse the effects of sequence- and structure-based mutation properties in detail. The resulting increase in the amount of data improves the statistical significance and the reliability of the findings. Further, dependencies between the properties are also examined. It turned out that different properties depend strongly on each other. In particular, the type of exchange contains information about structure and even conservation. It is also possible to statistically estimate the damaging potential of every type of exchange. Used for prediction purposes, the derived matrix of damaging potentials (damaging matrix) outperforms all other matrix/sequence-based prediction tools and is capable of giving a quick and reliable idea of how a SAAP affects protein function.

Master thesis
Student: Veit Höhn
Supervisor: Burkhard Rost, Marc Offman

Predicting protein function through gene ontology

The fast pace of genomic sequencing has led to ever increasing numbers of proteins that have to be analysed and annotated. Annotation by experiment alone is not sufficient given the amount of time it takes. Therefore, the trend has moved towards annotating proteins computationally on the basis of a variety of factors, such as sequence homology and evolutionary information.
This master’s thesis presents a novel method for predicting protein function using support vector machines targeting Gene Ontology function classes. Here, a profile-based kernel is used to generate feature vectors for each protein in the given dataset. The kernel itself is generated using the evolutionary sequence profiles of the proteins. Different multi-class classification approaches are evaluated, including one-vs-one, one-vs-all and predefined nested dichotomies, the first with a LIBSVM-integrated implementation, the latter two with a WEKA-integrated implementation. However, due to the large dataset, which in turn resulted in very large kernel matrices far beyond the 1 GB mark, the two multi-class approaches run in combination with WEKA could not be completed even with a significant amount of server RAM (limited to 32 GB). The LIBSVM implementation with the one-vs-one method, on the other hand, returned some results, with a best cross-validation accuracy of 48%. The relatively low accuracy may have been influenced by the dataset containing proteins with multiple labels, along with the high level of detail of the feature vectors, which may have introduced too much noise. Nonetheless, support vector machines have otherwise proven to be quite accurate in protein function classification tasks, albeit with smaller datasets.
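The memory problem mentioned above follows from the quadratic growth of a precomputed kernel matrix. The sketch below is a back-of-the-envelope illustration under stated assumptions: a dense matrix of 8-byte floating point values, and a dataset size of 50,000 proteins that is purely hypothetical (the abstract does not state the actual size). It also shows how many binary classifiers the one-vs-one and one-vs-all schemes require; both function names are invented for this example.

```python
def kernel_matrix_bytes(n_samples, bytes_per_value=8):
    """Memory of a dense n x n kernel matrix (8 bytes per value assumes float64)."""
    return n_samples * n_samples * bytes_per_value


def n_binary_classifiers(n_classes, scheme):
    """Number of binary classifiers trained for a multi-class problem."""
    if scheme == "one-vs-one":
        return n_classes * (n_classes - 1) // 2  # one classifier per pair of classes
    if scheme == "one-vs-all":
        return n_classes                         # one classifier per class
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical dataset size, chosen only to illustrate the scaling.
print(kernel_matrix_bytes(50_000) / 1e9)  # 20.0 (GB)
print(n_binary_classifiers(10, "one-vs-one"), n_binary_classifiers(10, "one-vs-all"))  # 45 10
```

Doubling the number of proteins quadruples the size of the matrix, so a toolkit that holds one or more full copies in memory can exhaust tens of gigabytes of RAM quickly.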

Master thesis
Student: Vivien Klose
Supervisor: Burkhard Rost, Christian Schaefer

Automatic protein name recognition

This thesis introduces a new protein name recognition method that focuses on improving performance in full-text articles. The implementation employs conditional random fields (CRFs), dictionaries, the aggregation of bi-directional parsing models, and customary features in named entity recognition (NER). Two novel features are considered with the goal of overcoming the insufficiency of training data and of using long-distance context information. The first is the study and application of gram frequencies specific to biomedical language. Statistics were collected from hundreds of thousands of PubMed documents and from protein and gene names deposited in UniProt. The second is the exploitation of the simple observation that names are often repeated in a paper, typically 2-3 times within an abstract. These features significantly improve prediction results, and the method achieves state-of-the-art performance. In addition, a new web-based text annotation framework is presented. This permits the combination of manual and automatic annotations through active learning algorithms. Human annotation efforts are reduced in order to generate more training data and so improve prediction methods further. Furthermore, the thesis presents a new database designed to serve as a basis for a future name normalization method. In the end, the goal is to establish a mapping between proteins (their amino acid sequences) and the citations they appear in (as names to be recognized and normalized).

Master thesis
Student: Juan Miguel Cejuela
Supervisor: Burkhard Rost

Improvement of DNA- and RNA-Protein Binding Prediction

Polynucleotide-protein interactions play an important role in many essential molecular processes, especially those dealing with the synthesis of proteins. A polynucleotide is either deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Transcription factors are a prominent example of proteins that bind directly to DNA to regulate gene expression. As is known today, there are manifold post-transcriptional modifications made to the RNA, such as alternative splicing, which are initiated by RNA-binding proteins, like the spliceosome, a eukaryotic protein-RNA complex. RNA-protein complexes, e.g. ribosomes, are involved in a multitude of important processes in the cell.
Although there are already various methods that predict polynucleotide binding on a per-residue basis with good performance, none of them distinguishes proteins that bind polynucleotides from those that do not. The approach presented in this work handles this problem by using neural networks together with a clean dataset containing proteins that are definitely not involved in polynucleotide binding. This makes it possible to distinguish between proteins that bind DNA, RNA or neither, combined with a residue-based prediction of the binding. These predictions are meant to serve the needs of experimentalists who want to find new targets involved in those essential molecular processes and then identify the polynucleotide binding regions inside those proteins, all in one tool: SomeNA.

Diploma thesis
Student: Peter Hönigschmid
Supervisor: Burkhard Rost, Edda Kloppmann

Extracting binding residues from the Protein Data Bank

The Protein Data Bank is an archive of 3D structure models for different large biological molecules such as proteins and nucleotides. Many models in the database are complexes that contain several subunits or combine macromolecules with their ligands. With the structural information compiled in the collection, e.g. atom coordinates and molecular linkages, we can examine where and how the chemical compounds interact with each other. These models give clues about the binding affinity between enzymes and ligands, the structure of the catalytic site, the mechanism of the folding process, etc. The most obvious use of this information is data mining, through which predictions on structural, biochemical and medical aspects can be made. The purpose of this thesis is to implement a program that analyzes the structure of 3D models of macromolecules, categorizes different interactions between proteins and various types of other biological molecules, and extracts the binding residues from the Protein Data Bank according to the properties of the different interactions.

Bachelor thesis
Student: Shen Wei
Supervisor: Burkhard Rost, Christian Schaefer

Evaluation of sequence-to-structure alignments

Protein sequence alignment has become one of the most essential tasks in the post-genomic era of biological research. It is widely used in many biological applications such as protein structure and function prediction, protein disorder prediction, protein classification, phylogenetic analysis and database annotation. This thesis describes an in-depth evaluation of the sequence search tools PSI-BLAST, HMMER and HHblits as well as the sequence-structure alignment database HSSP. The evaluation focuses on the quality of sequence-to-structure alignments in order to develop guidelines for constructing an improved sequence-structure alignment database for SRS3D. For protein family definition, COPS, Pfam and CATH are used. Based on the COPS hierarchy of protein structures, a representative test set of 16,667 structurally similar pairs is constructed. For alignment validation, homology model quality and structural alignments are used. Results show that HHblits outperforms all other tested methods when considering alignment accuracy and rate of detection (~70% (HHblits) vs. ~20% (PSI-BLAST) of the dataset). However, the increased detection rate brings with it a significantly increased false positive rate, which needs to be controlled by lowering the E-value threshold to 10^-5 or below.

Master thesis
Student: Benjamin Wellmann
Supervisor: Andrea Schafferhans

Transmembrane protein 3D structure prediction from evolutionary sequence variation

Up to 30% of all human proteins are integral membrane molecules which play vital roles in cell-cell communication, tissue organization and transport. Yet, despite their outstanding relevance as drug targets and considerable advances in experimental structure determination, most membrane protein 3D structures remain unknown. Enabled by a new statistical physics approach and the recent wealth of information from genomic sequencing, we show that evolutionary residue covariation can be used to accurately predict the 3D structure of α-helical membrane proteins from sequence alone. On a set of 25 polytopic membrane proteins with solved structure and up to 487 residues, our de novo protocol achieves Cα-RMSDs between 2.9 and 6.0 Å over at least 80% of the full membrane domain and TM scores of 0.5 - 0.7 for 22 proteins. We also observe that residue coevolution gives a strong signal for known functional sites, interfaces in homomultimers and conformational changes. We then proceed to predict the 3D coordinates of ten medically important membrane proteins of unknown structure and without detectable sequence homology to solved proteins, including the human adiponectin receptor 1 and the elusive MT-ND1 subunit of mitochondrial complex 1. Our results agree well with biochemical knowledge and, in some cases, show surprising similarities to known folds, revealing detailed structural information for these proteins for the first time. Besides exploring the universe of transmembrane protein structures, we expect that our method could be used in hybrid computational-experimental approaches to accelerate structure determination, to identify functional residues and alternative conformations, and to assess the phenotypic consequences of genetic variation.

Master thesis
Student: Thomas Hopf
Supervisor: Burkhard Rost, Chris Sander, Debora Marks

Feature construction and selection for predicting structural change upon point mutation in proteins

This bachelor thesis addresses the problem of obtaining a set of features for machine learning via forward selection. The feature selection was performed to find features suitable for improving the prediction of local structural change in protein sequences due to point mutations. The idea was to use pairs of pentamers with an exchanged amino acid at the middle position to emulate SNPs and their impact on structural change. The datasets containing the information about the pentamers were provided. All tested features were derived from biochemical and physicochemical properties. Logistic regression was chosen as the machine learning algorithm. The feature selection was done via a greedy forward selection and resulted in eight features of different window sizes and difference measures for the exchanged amino acid. The performance estimation showed that the choice of the eight features was not optimal, as extending the feature set with more features raised the performance further. This demonstrates the difficulty of choosing the trade-off between the achievable performance of the feature set and the complexity of obtaining it. Nevertheless, the tested feature set reached a mean AUC that is far better than random, showing that even a small number of purely biochemical and biophysical features can deliver good performance in predicting structural change due to SNPs.
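The greedy forward selection mentioned above can be sketched generically as follows. This is a minimal illustration, not the procedure implemented in the thesis: `greedy_forward_selection` and its `score` callback are hypothetical names, and the toy scoring function stands in for what would in practice be a cross-validated performance measure (for example the AUC of a logistic regression trained on the candidate feature set).

```python
def greedy_forward_selection(candidates, score, max_features=None):
    """Add, one at a time, the feature that most improves score(feature_list).

    Stops when no remaining feature improves the score or when
    max_features (if given) is reached. Returns (selected, best_score).
    """
    selected = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every remaining feature when added to the current set.
        trial_score, trial_feature = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if trial_score <= best:
            break  # no candidate improves on the current set
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best = trial_score
    return selected, best


# Toy scoring function (an assumption for this example): each feature has a fixed gain.
gains = {"hydrophobicity": 3, "volume": 2, "noise": -1}
chosen, final_score = greedy_forward_selection(gains, lambda fs: sum(gains[f] for f in fs))
print(chosen, final_score)  # ['hydrophobicity', 'volume'] 5
```

Because each step only looks one feature ahead, the result depends on the stopping rule; capping the set with `max_features` can end the search before the score has stopped improving, which matches the observation that a larger feature set performed better.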

Bachelor thesis
Student: Yannick Mahlich
Supervisor: Burkhard Rost, Christian Schaefer