Universal Protein Resource (UniProt)
====================================

The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.

UniProt Reference Clusters (UniRef)
===================================

The UniProt Reference Clusters (UniRef) provide clustered sets (UniRef100,
UniRef90 and UniRef50 clusters) of sequences from the UniProt Knowledgebase and
selected UniParc records, in order to obtain complete coverage of sequence
space at several resolutions (100%, >90% and >50%) while hiding redundant
sequences (but not their descriptions) from view.

UniRef90
========

UniRef90 clusters are generated from the UniRef100 seed sequences with a 90%
sequence identity threshold using the MMseqs2 algorithm. The seed sequences are
the longest members of the UniRef100 cluster. However, the longest sequence is
not always the most informative. There is often more biologically relevant
information and annotation (name, function, cross-references) available on
other cluster members. All the proteins in each cluster are ranked to
facilitate the selection of a biologically relevant representative for the
cluster.

The proteins are ranked as follows:

  1. quality of annotation: the order of preference is a member from
     UniProtKB/Swiss-Prot, then UniProtKB/TrEMBL, and lastly UniParc
  2. annotation score: prefer entries that have a higher UniProtKB
     Annotation Score
  3. organism: prefer entries from Reference proteomes and Model Organisms
  4. sequence length: the longest sequence is preferred.

As new proteins are added to UniProtKB and UniParc, UniRef cluster memberships
and/or identifiers might change.

UniRef90 cluster titles and identifiers are derived from the representative
UniRef100 entry. The UniRef90 identifier is generated by replacing the
"UniRef100_" prefix of the representative with "UniRef90_".

FTP access
==========

Currently, UniRef90 is available from the UniProt FTP site:

  ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90

The UniRef90 files and their descriptions are as follows:

File Name        File Description
-------------    -----------------------------------------------------------
uniref90.fasta   This file contains all UniRef90 entries in FASTA format.
                 The definition line in the FASTA format includes
                 cluster-specific information such as the cluster name, the
                 number of members and the common taxonomy, as well as the
                 ID of the representative protein. The format is as follows:

                 >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember

                 where:
                 - UniqueIdentifier is the primary accession number of the
                   UniRef cluster.
                 - ClusterName is the name of the UniRef cluster.
                 - Members is the number of UniRef cluster members.
                 - Taxon is the scientific name of the lowest common taxon
                   shared by all UniRef cluster members.
                 - RepresentativeMember is the entry name of the
                   representative member of the UniRef cluster.

                 For example:

                 >UniRef90_P99999 Cytochrome c n=14 Tax=Catarrhini RepID=CYC_HUMAN

uniref90.xml     This file contains all UniRef90 entries in XML format.
                 Each entry is identified by the UniRef identifier, and
                 contains:
                 - cross-reference to the representative UniProtKB or
                   UniParc entry and its sequence
                 - a flag on the cluster member that served as the seed
                   sequence
                 - cross-references to member UniProtKB and/or UniParc
                   entries
                 - cross-references to UniRef50 and UniRef100 entries
                 - member count
                 - common taxon

Document type definition for uniref90.xml
------------------------------------------

]>

--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution (CC BY 4.0) License
(https://creativecommons.org/licenses/by/4.0/) to all copyrightable parts of
our databases.

(c) 2002-2023 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by
patents or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.
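
As an illustration of the uniref90.fasta definition line format described
under the FTP access section, the following Python sketch splits a definition
line into its fields and derives a UniRef90 identifier from a UniRef100
identifier by swapping the prefix. It is not part of any UniProt tooling. The
names UniRefHeader, parse_uniref_header and uniref90_id_from_uniref100 are
hypothetical, and the optional TaxID= field is an assumption: it is not in the
format given above, and the pattern only tolerates it in case a release
includes it.

```python
import re
from typing import NamedTuple, Optional


class UniRefHeader(NamedTuple):
    """Fields of a UniRef FASTA definition line (hypothetical container)."""
    cluster_id: str        # UniqueIdentifier, e.g. UniRef90_P99999
    cluster_name: str      # ClusterName, may contain spaces
    members: int           # n=Members
    taxon: str             # Tax=Taxon, may contain spaces
    tax_id: Optional[str]  # TaxID=..., assumed optional; None when absent
    representative: str    # RepID=RepresentativeMember


# Pattern for:
#   >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
# The TaxID group is an assumption and is skipped when the field is missing.
_HEADER_RE = re.compile(
    r"^>(?P<id>\S+)\s+(?P<name>.*?)\s+n=(?P<members>\d+)"
    r"\s+Tax=(?P<taxon>.*?)(?:\s+TaxID=(?P<taxid>\S+))?"
    r"\s+RepID=(?P<rep>\S+)$"
)


def parse_uniref_header(line: str) -> UniRefHeader:
    """Parse one definition line; raise ValueError if it does not match."""
    match = _HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniRef definition line: {line!r}")
    return UniRefHeader(
        cluster_id=match.group("id"),
        cluster_name=match.group("name"),
        members=int(match.group("members")),
        taxon=match.group("taxon"),
        tax_id=match.group("taxid"),
        representative=match.group("rep"),
    )


def uniref90_id_from_uniref100(uniref100_id: str) -> str:
    """Replace the "UniRef100_" prefix with "UniRef90_"."""
    prefix = "UniRef100_"
    if not uniref100_id.startswith(prefix):
        raise ValueError(f"expected a {prefix} identifier: {uniref100_id!r}")
    return "UniRef90_" + uniref100_id[len(prefix):]


if __name__ == "__main__":
    example = ">UniRef90_P99999 Cytochrome c n=14 Tax=Catarrhini RepID=CYC_HUMAN"
    print(parse_uniref_header(example))
    print(uniref90_id_from_uniref100("UniRef100_P99999"))
```

The cluster name and taxon are matched non-greedily because both may contain
spaces. A cluster name that itself contains the text " n=" followed by digits
would be split incorrectly, so treat this as a starting point rather than a
complete parser.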