Universal Protein Resource (UniProt)
====================================


The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.

UniProt Reference Clusters (UniRef)
===================================

The UniProt Reference Clusters (UniRef) provide clustered sets (UniRef100, UniRef90
and UniRef50 clusters) of sequences from the UniProt Knowledgebase and selected UniParc
records, in order to obtain complete coverage of sequence space at several resolutions
(100%, >90% and >50%) while hiding redundant sequences (but not their descriptions)
from view.

UniRef100
=========

UniRef100 contains all records in the UniProt Knowledgebase and selected UniParc records
(see below). In UniRef100, identical sequences and subfragments are placed into a single
cluster using the CD-HIT algorithm. The longest member of each cluster (the seed sequence)
is used to generate UniRef90. However, the longest sequence is not always the most
informative. There is often more biologically relevant information and annotation
(name, function, cross-references) available on other cluster members. All the
proteins in each cluster are ranked to facilitate the selection of a biologically
relevant representative for the cluster.

The proteins are ranked as follows:

1. quality of annotation: a member from UniProtKB/Swiss-Prot is preferred,
   then UniProtKB/TrEMBL, and lastly UniParc
2. annotation score: prefer entries that have a higher UniProtKB Annotation Score
3. organism: prefer entries from Reference proteomes and Model Organisms
4. sequence length: the longest sequence is preferred.

As new proteins are added to UniProtKB and UniParc, UniRef cluster memberships and/or
identifiers might change.
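
The ranking criteria listed above can be sketched in Python as a sort key. This
is only an illustration: the Member class, its field names and the sample values
are hypothetical, and it assumes the four criteria are applied strictly in order,
each one breaking ties left by the previous one, which this document does not
state explicitly.

```python
from dataclasses import dataclass

# Criterion 1: lower number = preferred source database.
SOURCE_RANK = {"UniProtKB/Swiss-Prot": 0, "UniProtKB/TrEMBL": 1, "UniParc": 2}


@dataclass(frozen=True)
class Member:
    """Hypothetical in-memory view of one cluster member."""
    accession: str
    source: str              # one of the SOURCE_RANK keys
    annotation_score: int    # higher is better (criterion 2)
    is_reference: bool       # reference proteome or model organism (criterion 3)
    length: int              # sequence length (criterion 4)


def rank_key(member):
    """Tuple that sorts the preferred member first (assumed lexicographic order)."""
    return (
        SOURCE_RANK[member.source],   # 1. quality of annotation
        -member.annotation_score,     # 2. higher annotation score first
        not member.is_reference,      # 3. reference/model organisms first
        -member.length,               # 4. longest sequence first
    )


def pick_representative(members):
    """Return the member that ranks best under rank_key."""
    return min(members, key=rank_key)
```

With this key, a short UniProtKB/Swiss-Prot member is chosen over a longer
UniProtKB/TrEMBL or UniParc member, which matches the remark that the longest
sequence is not always the most informative.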

The UniRef100 identifier is generated by placing the "UniRef100_" prefix before the UniProtKB
accession number or UniParc identifier of the representative UniProtKB or UniParc entry.
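
As a minimal sketch of the identifier rule just described, the two helper
functions below (hypothetical names, not part of any UniProt tool) build a
UniRef100 identifier from a representative's accession and recover the
accession again:

```python
PREFIX = "UniRef100_"


def uniref100_id(representative_id: str) -> str:
    """Prepend the UniRef100 prefix to a UniProtKB accession or UniParc identifier."""
    return PREFIX + representative_id


def representative_of(cluster_id: str) -> str:
    """Strip the prefix again; raise if the identifier is not a UniRef100 one."""
    if not cluster_id.startswith(PREFIX):
        raise ValueError(f"not a UniRef100 identifier: {cluster_id!r}")
    return cluster_id[len(PREFIX):]
```

For example, uniref100_id("P99999") gives "UniRef100_P99999".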

UniParc records in UniRef100 
----------------------------
In addition to UniProtKB records, UniRef100 also includes selected UniParc entries that
are not covered by UniProtKB and contain cross-references to the following databases:
    - RefSeq
    - PDB


FTP access
==========

Currently, UniRef100 is available from the UniProt FTP site:

        ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100

The UniRef100 files and their descriptions are as follows:

File Name       File Description
-------------   -----------------------------------------------------------
uniref100.fasta This file contains all UniRef100 entries in FASTA format. 
                The definition line in the FASTA format includes cluster-specific
                information such as the cluster name, number of members and
                common taxonomy, and also the ID of the representative protein.
                The format is as follows: 
                >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
                where:
                - UniqueIdentifier is the primary accession number of the UniRef cluster.
                - ClusterName is the name of the UniRef cluster.
                - Members is the number of UniRef cluster members.
                - Taxon is the scientific name of the lowest common taxon shared 
                  by all UniRef cluster members.
                - RepresentativeMember is the entry name of the representative member 
                  of the UniRef cluster.
                For example:
                >UniRef100_P99999 Cytochrome c n=5 Tax=Hominidae RepID=CYC_HUMAN
                   
uniref100.xml   This file contains all UniRef100 entries in XML format. Each entry is
                identified by the UniRef identifier, and contains:
                - a cross-reference to the representative UniProtKB or UniParc entry
                  and its sequence
                - a flag on the cluster member that served as the seed sequence
                - cross-references to member UniProtKB and/or UniParc entries
                - cross-references to UniRef50 and UniRef90 entries
                - member count
                - common taxon
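
The FASTA definition line format described above can be split into its fields
with a regular expression. The following Python sketch assumes the line follows
exactly the layout given in this document (identifier, cluster name, n=, Tax=,
RepID=); the function name and the keys of the returned dictionary are
illustrative choices, and lines carrying additional fields would need a
correspondingly extended pattern.

```python
import re

# >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
HEADER_RE = re.compile(
    r"^>(?P<id>\S+) "          # UniRef cluster identifier
    r"(?P<name>.*?) "          # cluster name (may contain spaces)
    r"n=(?P<members>\d+) "     # number of cluster members
    r"Tax=(?P<taxon>.*?) "     # lowest common taxon (may contain spaces)
    r"RepID=(?P<rep_id>\S+)"   # entry name of the representative member
    r"\s*$"
)


def parse_definition_line(line: str) -> dict:
    """Return the fields of one UniRef FASTA definition line as a dict."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised definition line: {line!r}")
    fields = match.groupdict()
    fields["members"] = int(fields["members"])
    return fields


if __name__ == "__main__":
    example = ">UniRef100_P99999 Cytochrome c n=5 Tax=Hominidae RepID=CYC_HUMAN"
    print(parse_definition_line(example))
```

Applied to the example line from this document, the cluster name is
"Cytochrome c", the member count is 5, the taxon is "Hominidae" and the
representative is "CYC_HUMAN".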

Document type definition for uniref100.xml   
------------------------------------------
<?xml version="1.0" encoding="ASCII"?>
<!DOCTYPE UniRef100 [
<!ELEMENT UniRef100 (entry+)>
<!ATTLIST UniRef100 
                    xmlns CDATA #FIXED "http://uniprot.org/uniref"
                    xmlns:xsi CDATA #IMPLIED
                    xsi:schemaLocation CDATA #IMPLIED
                    releaseDate    CDATA #IMPLIED
                    version        CDATA #IMPLIED
>

<!-- entry: UniRef100 entry -->
<!ELEMENT entry (name,property*,representativeMember,member*)> 
<!ATTLIST entry  id             ID    #REQUIRED
                 updated        CDATA #IMPLIED 
>

<!-- name: UniRef100 entry's name derived from -->
<!-- representative UniProtKB or UniParc entry --> 
<!ELEMENT name  (#PCDATA)>


<!-- representativeMember: information for representative -->
<!-- UniProtKB or UniParc entry  -->
<!ELEMENT representativeMember (dbReference,sequence)>

<!-- member: a member of the UniRef100 cluster other than the representative -->
<!ELEMENT member (dbReference)>

<!-- dbReference: cross-reference to member UniProtKB or UniParc -->
<!-- entry  -->
<!ELEMENT dbReference (property*)>
<!ATTLIST dbReference
    type CDATA #REQUIRED
    id   CDATA #REQUIRED
> 

<!-- property: properties of cross-references -->
<!ELEMENT property EMPTY>
<!ATTLIST property
    type CDATA #REQUIRED
    value CDATA #REQUIRED
>

<!ELEMENT sequence (#PCDATA ) >
<!ATTLIST sequence
    length CDATA #IMPLIED
    checksum CDATA #IMPLIED
>

]>
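
To illustrate how the element structure defined above maps onto code, the
following Python sketch reads one entry with the standard library's
xml.etree.ElementTree. The embedded sample document is a made-up fragment that
merely follows the DTD: its property type, dbReference types, member name and
sequence are placeholder assumptions, not data copied from a release file.

```python
import xml.etree.ElementTree as ET

# Namespace fixed by the xmlns attribute in the DTD.
NS = {"u": "http://uniprot.org/uniref"}

# Hypothetical sample that follows the DTD; all values are placeholders.
SAMPLE = """<UniRef100 xmlns="http://uniprot.org/uniref">
  <entry id="UniRef100_P99999">
    <name>Cytochrome c</name>
    <property type="member count" value="2"/>
    <representativeMember>
      <dbReference type="UniProtKB ID" id="CYC_HUMAN"/>
      <sequence length="5">MGDVE</sequence>
    </representativeMember>
    <member>
      <dbReference type="UniProtKB ID" id="EXAMPLE_MEMBER"/>
    </member>
  </entry>
</UniRef100>"""


def summarize_entry(entry):
    """Collect the main pieces of one <entry> element into a dict."""
    rep_ref = entry.find("u:representativeMember/u:dbReference", NS)
    sequence = entry.find("u:representativeMember/u:sequence", NS)
    return {
        "id": entry.get("id"),
        "name": entry.findtext("u:name", namespaces=NS),
        "properties": {
            p.get("type"): p.get("value") for p in entry.findall("u:property", NS)
        },
        "representative": rep_ref.get("id"),
        # Remove any whitespace or line breaks inside the sequence text.
        "sequence": "".join(sequence.text.split()),
        "members": [
            ref.get("id") for ref in entry.findall("u:member/u:dbReference", NS)
        ],
    }


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for entry in root.findall("u:entry", NS):
        print(summarize_entry(entry))
```

Loading a whole document with ET.fromstring is only practical for small
fragments like this one. For the full uniref100.xml file, an incremental reader
such as ET.iterparse, clearing each entry element after it has been processed,
would likely be the more suitable approach.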


--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution (CC BY 4.0) License
(https://creativecommons.org/licenses/by/4.0/) to all copyrightable parts of
our databases.

(c) 2002-2023 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.