Protein Embeddings
=====================

UniProt provides raw embeddings for UniProtKB/Swiss-Prot and for the reference
proteomes of some model organisms. The embeddings are generated using the
ProtT5 protein language model and stored in the standard HDF5 file format.

Two embeddings files are generated: per-protein embeddings, where a
fixed-length embedding vector is computed for the whole protein sequence, and
per-residue embeddings, where a fixed-length embedding vector is computed for
each single residue.

Note: Protein sequences longer than 12,000 residues are excluded due to GPU
memory limitations (this concerns only a handful of proteins).

Per-protein embeddings:
----------------------------------
This directory contains the following subdirectories, one for each dataset,
where the per-protein.h5 embeddings file resides:

1) uniprot_sprot
   Per-protein embeddings for UniProtKB/Swiss-Prot.

2) UP000006548_3702
   Per-protein embeddings for Arabidopsis thaliana reference proteome.

3) UP000001940_6239
   Per-protein embeddings for Caenorhabditis elegans reference proteome.

4) UP000000625_83333
   Per-protein embeddings for Escherichia coli reference proteome.

5) UP000005640_9606
   Per-protein embeddings for Homo sapiens reference proteome.

6) UP000000589_10090
   Per-protein embeddings for Mus musculus reference proteome.

7) UP000002494_10116
   Per-protein embeddings for Rattus norvegicus reference proteome.

8) UP000464024_2697049
   Per-protein embeddings for SARS-CoV-2 reference proteome.

Per-residue embeddings:
----------------------------------
Since per-residue embeddings can become very large for larger datasets and
longer sequences, they are provided at a different FTP location (and will only
be made available based on user interest).
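
The directory layout described above (one subdirectory per dataset, each
holding a per-protein.h5 or per-residue.h5 file) can be sketched in Python as
follows. This is only an illustration, not an official UniProt tool: the
function names are hypothetical, and the assumption that each embedding is
stored as a top-level HDF5 dataset (one per entry) is not stated in this README
and should be checked against the downloaded file. Reading the file relies on
the third-party h5py package.

```python
import re
from pathlib import PurePosixPath

# Reference proteome subdirectories are named <proteome ID>_<taxonomy ID>,
# e.g. UP000005640_9606; uniprot_sprot does not follow this pattern.
PROTEOME_DIR = re.compile(r"^(UP\d+)_(\d+)$")


def parse_dataset_dir(name):
    """Return (proteome_id, taxonomy_id), or None for non-proteome datasets."""
    match = PROTEOME_DIR.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def embeddings_file(dataset_dir, level="per-protein"):
    """Return the relative path of the HDF5 file inside a dataset subdirectory."""
    if level not in ("per-protein", "per-residue"):
        raise ValueError(f"unknown embedding level: {level!r}")
    return str(PurePosixPath(dataset_dir) / f"{level}.h5")


def describe_embeddings(path, limit=3):
    """Return (key, shape) for the first few datasets in an embeddings file.

    Assumption (not confirmed by this README): embeddings are stored as
    top-level HDF5 datasets, one per entry.
    """
    import h5py  # third-party; imported here so the helpers above work without it

    described = []
    with h5py.File(path, "r") as handle:
        for key, item in handle.items():
            if len(described) >= limit:
                break
            if isinstance(item, h5py.Dataset):
                described.append((key, item.shape))
    return described
```

For example, embeddings_file("UP000005640_9606") gives
UP000005640_9606/per-protein.h5. If the layout assumption holds, a per-protein
entry would be expected to have a one-dimensional shape, and a per-residue
entry a two-dimensional shape (sequence length by vector length).
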
Per-residue embeddings can be accessed from the following location:
https://ftp.ebi.ac.uk/pub/contrib/UniProt/embeddings/current_release

Similar to the per-protein directory, there is one subdirectory for each
dataset, where the per-residue.h5 embeddings file resides.

--------------------------------------------------------------------------------
LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution 4.0 International
(CC BY 4.0) License (https://creativecommons.org/licenses/by/4.0/) to all
copyrightable parts of our databases.

(c) 2002-2024 UniProt Consortium

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by
patents or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.