XII CA2B2C - Transfer learning to annotate (a part of) the Protein Universe
Conference
Date: 2022
Publisher and Place of Publication: -
Abstract
Background: The automatic annotation of proteins is still an unresolved problem. For example, as of August 2022, of the 232,000,000 entries in UniProtKB, fewer than 1% have been reviewed by expert curators. State-of-the-art annotation methods in Pfam, the protein family database, are based on hidden Markov models that predict family domains from laboriously hand-crafted sequence alignments. This approach has grown Pfam annotations at a very low rate (<5% over the last 5 years). Alternative proposals based on deep learning (DL) models have appeared recently to accurately predict functional annotations for unaligned amino acid sequences. However, since many Pfam families contain very few sequences, training such models with so few examples is challenging.

Results: We propose to apply Transfer Learning to this task, that is, to take advantage of pre-trained protein embeddings that integrate information from millions of sequences across the complete UniProtKB. Several protein embeddings are now available and ready to use, such as ESM. ESM is based on BERT, a Transformer originally designed for Natural Language Processing that is trained to predict words from their context. ESM draws an analogy between words and amino acids: it learns meaningful encodings for each residue in a self-supervised way by masking some of the residues in a sequence and predicting them as a pretext task. In this way, the output sequence encodes contextual residue information at each position. We obtained the learned ESM representations of the full domain data (17,929 families) from Pfam (1,339,083 seed sequences). We then trained machine learning classifiers on the ESM embeddings to predict the domain of each sequence in the test set (21,293 sequences), which had low homology with the training set. We compared our approach with ProtCNN, which is based on convolutional ResNets: ProtCNN achieved a 27.60% error rate (5,882 errors), while our method achieved a 20.88% error rate (fewer than 4,500 errors).

Conclusions: In this work we used cutting-edge transfer learning techniques to accurately predict protein domains. The results suggest that this approach offers unique predictive advantages and has the potential to become a core component of future protein annotation tools.
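As an illustration of the pipeline described in the Results, the sketch below shows how per-sequence embeddings can be extracted from unaligned sequences with a publicly available ESM checkpoint (via the fair-esm package) and fed to a standard scikit-learn classifier. The toy sequences, family labels and the logistic-regression classifier are illustrative placeholders, not the exact data or classifier used in this work.

```python
# Minimal sketch: embed unaligned sequences with a pre-trained ESM model and fit a
# simple classifier on the mean-pooled per-sequence embeddings.
# Assumes the fair-esm and scikit-learn packages; sequences/labels below are toy examples.
import torch
import esm
from sklearn.linear_model import LogisticRegression

# Load a pre-trained ESM-1b model (33 layers) and its tokenizer.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Toy (name, sequence) pairs standing in for Pfam seed sequences and their family labels.
train_data = [
    ("seq_a", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"),
    ("seq_b", "GSHMSLFDFFKNKGSAATATDRLKLILAKERTLNLPYMEEMRKEIIAVIQKYTKSSD"),
]
labels = ["PF_family_1", "PF_family_2"]

def embed(data):
    """Return one mean-pooled embedding vector per sequence (final-layer representations)."""
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]
    # Average over residue positions, skipping the BOS token and any padding.
    return torch.stack(
        [reps[i, 1:len(seq) + 1].mean(0) for i, (_, seq) in enumerate(data)]
    ).numpy()

X_train = embed(train_data)

# Any standard classifier can be plugged in here; logistic regression is only a placeholder.
clf = LogisticRegression(max_iter=1000).fit(X_train, labels)

# Predict the Pfam domain of a new, unaligned query sequence.
query = [("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSG")]
print(clf.predict(embed(query)))
```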
Keywords
PROTEIN FAMILIES, BIOINFORMATICS, REPRESENTATION LEARNING, DEEP LEARNING