Paccanaro, Alberto, Casbon, James A and Saqi, Mansoor A S (2006) Spectral clustering of protein sequences. Nucleic Acids Research, 34 (5).
Full text access: Open
An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide a theoretical justification for why it should improve markedly on local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters we obtain for a given set of proteins is close to the number of superfamilies in that set; that there are fewer singletons; and that the method correctly groups most remote homologs. In our experiments, the quality of the clusters, as quantified by a measure that combines sensitivity and specificity, was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].
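The contrast between a local, threshold-based method and a global, spectral one can be illustrated with a minimal sketch. It is not the authors' pipeline: the 8x8 similarity matrix, its values, the threshold of 0.5 and the function names connected_components and spectral_bipartition are all assumptions made up for this example, and the spectral step is a plain two-way split on the second eigenvector of the normalized Laplacian rather than the procedure described in the paper.

```python
import numpy as np

def connected_components(sim, threshold):
    """Local method: link i and j when sim[i, j] >= threshold, then take components."""
    n = sim.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over thresholded links
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def spectral_bipartition(sim):
    """Global method: split on the sign of the second eigenvector
    of the symmetric normalized Laplacian I - D^(-1/2) W D^(-1/2)."""
    degree = sim.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(sim)) - d_inv_sqrt[:, None] * sim * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues returned in ascending order
    return [int(v > 0) for v in vecs[:, 1]]

# Hypothetical toy data: two families of four proteins each,
# plus one spurious strong similarity between protein 3 and protein 4.
sim = np.full((8, 8), 0.01)
sim[:4, :4] = 0.9
sim[4:, 4:] = 0.9
sim[3, 4] = sim[4, 3] = 0.6
np.fill_diagonal(sim, 0.0)

local_labels = connected_components(sim, threshold=0.5)
global_labels = spectral_bipartition(sim)
print(local_labels)   # the single cross link merges everything into one cluster
print(global_labels)  # two groups; which group is labelled 0 or 1 is arbitrary
```

In this toy case the thresholding approach chains the two families together through a single link, whereas the spectral split uses all pairwise similarities at once and separates proteins 0-3 from proteins 4-7.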
This is a Submitted version. This version's date is: 2006. This item is not peer reviewed.
https://repository.royalholloway.ac.uk/items/790eec5f-291a-7a1f-bafa-5ef2e238b6db/3/
Deposited by Research Information System (atira) on 19-Jun-2013 in Royal Holloway Research Online. Last modified on 19-Jun-2013.