Bioinformatics

Datasets commonly used in bioinformatics

AlphaFold3 Databases

Protein structure and sequence databases used with AlphaFold3, an updated version of AlphaFold capable of predicting the structure and interactions of biomolecules.

/datasets/bio/alphafold3

BFD/MGnify

BFD/MGnify is a database built for ColabFold by combining the Big Fantastic Database (BFD) with the MGnify database.

/datasets/bio/colabfold/bfd_mgy_colabfold

Big Fantastic Database (BFD)

Big Fantastic Database (BFD) is a protein sequence database. BFD was created by clustering 2.5 billion protein sequences from UniProt/TrEMBL+Swiss-Prot, Metaclust, Soil Reference Catalog and Marine Eukaryotic Reference Catalog. It consists of over 65 million protein families represented as multiple sequence alignments and hidden Markov models. BFD was built using the Uniclust pipeline and is one of the protein sequence databases used with AlphaFold.

/datasets/bio/alphafold/bfd

CheckM

Database associated with CheckM, a tool for assessing the quality of genomes recovered from isolates, single cells, or metagenomes.

/datasets/bio/checkm/

ColabFoldDB

ColabFoldDB is a protein database built for ColabFold by extending BFD/MGnify with additional metagenomic protein catalogs containing eukaryotic proteins, phage catalogs and an updated version of MetaClust.

/datasets/bio/colabfold/colabfold_envdb_202108

Databases for ColabFold

“Databases built in MMseqs2 format to be used with ColabFold. The databases include PDB70 (version 220313), UniRef30 (versions 2103 and 2202), BFD/MGnify and the environmental database ColabFoldDB (version 202108).”

/datasets/bio/colabfold

Dfam

Dfam is a database of transposable element DNA sequence alignments, hidden Markov models (HMMs), consensus sequences, and genome annotations.

/datasets/bio/dfam/

EggNOG - version 5.0

EggNOG is a database of orthology relationships, functional annotation, and gene evolutionary histories. The EggNOG database is used with EggNOG-mapper, a tool for functional annotation of novel sequences.

/datasets/bio/eggnog5-data/

EggNOG - version 6.0

/datasets/bio/eggnog6-data/

EVcouplings databases

EVcouplings is an open-source application for identifying evolutionary couplings and performing structure prediction of protein and RNA molecules.

/datasets/bio/evcouplings

Genomes from NCBI RefSeq database

Complete archaeal, bacterial and viral genomes retrieved from the National Center for Biotechnology Information (NCBI) Reference Sequence Database.

/datasets/bio/ncbi-refseq-genomes

GMAP-GSNAP database (human genome)

The programs GMAP (Genomic Mapping and Alignment Program) and GSNAP (Genomic Short-read Nucleotide Alignment Program) align RNA and DNA sequences from next-generation sequencing data to a genome reference sequence. The GMAP-GSNAP human genomic database available on Unity was built using the human genome assembly GRCh38.p14 (NCBI RefSeq assembly GCF_000001405.40).

/datasets/bio/gmap-gsnap

GTDB

The Genome Taxonomy Database (GTDB) is a genome-based taxonomy for prokaryotic genomes collected from the NCBI RefSeq and GenBank Assembly databases.

/datasets/bio/gtdb/

Illumina iGenomes

The Illumina iGenomes dataset is an assortment of genomes and annotation files (downloaded from UCSC, NCBI, or Ensembl) for commonly analyzed organisms.

/datasets/bio/igenomes

Kraken2

Database for Kraken2, a tool that assigns taxonomic labels to DNA sequences. The database was built from the complete archaeal, bacterial and viral genomes downloaded from the NCBI Reference Sequence Database on July 22, 2024.

/datasets/bio/kraken2
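
As a rough illustration of how this database might be referenced, the sketch below assembles a Kraken2 command line in Python. The read file, report and output names, and thread count are hypothetical, and the sketch assumes the database files sit directly under /datasets/bio/kraken2 rather than in a versioned subdirectory, so check the directory contents first.

```python
import shlex

# Path taken from the catalog entry. Assumption: the Kraken2 database
# files live directly in this directory (not in a subdirectory).
KRAKEN2_DB = "/datasets/bio/kraken2"


def kraken2_command(reads, report="sample.kreport", output="sample.kraken", threads=4):
    """Return the argument list for a single-end Kraken2 classification run."""
    return [
        "kraken2",
        "--db", KRAKEN2_DB,          # database location
        "--threads", str(threads),   # number of worker threads
        "--report", report,          # per-taxon summary report
        "--output", output,          # per-read classification output
        reads,                       # hypothetical input file
    ]


cmd = kraken2_command("reads.fastq")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a system where Kraken2 is installed.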

MGnify

MGnify is a database of non-redundant protein sequences predicted from metagenomic assemblies. MGnify is one of the protein sequence databases that can be used with AlphaFold.

/datasets/bio/alphafold/mgnify

NCBI BLAST databases

National Center for Biotechnology Information (NCBI) databases in the format required by the Basic Local Alignment Search Tool (BLAST) and by the sequence aligner DIAMOND. The collection contains the nucleotide database, the non-redundant Reference Sequence protein database for archaeal and bacterial genomes, the Reference Sequence Prokaryotic Representative Genome Database and the Reference Sequence Eukaryotic Representative Genome Database. NCBI’s BLAST databases are downloaded weekly. See the full details for more information.

/datasets/bio/ncbi-db/
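
A minimal sketch of pointing BLAST+ at this directory is shown here. It assumes the nucleotide database is stored under the conventional name `nt` directly inside /datasets/bio/ncbi-db, which the entry does not confirm; the query file, output file, E-value cutoff, and thread count are likewise hypothetical.

```python
from pathlib import PurePosixPath

# Path taken from the catalog entry.
BLAST_DB_DIR = PurePosixPath("/datasets/bio/ncbi-db")


def blastn_command(query, db_name="nt", out="hits.tsv", threads=4):
    """Return the argument list for a tabular blastn search.

    db_name is an assumption: the actual database names and layout
    inside BLAST_DB_DIR should be checked before use.
    """
    return [
        "blastn",
        "-db", str(BLAST_DB_DIR / db_name),
        "-query", query,
        "-outfmt", "6",               # tab-separated hit table
        "-evalue", "1e-5",            # illustrative cutoff
        "-num_threads", str(threads),
        "-out", out,
    ]


print(" ".join(blastn_command("query.fasta")))
```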

NCBI RefSeq database

Genomic, transcript and protein data retrieved from the National Center for Biotechnology Information (NCBI) Reference Sequence Database.

/datasets/bio/ncbi-refseq-all/

Parameters of AlphaFold

AlphaFold is a deep learning model designed to predict the 3D structure of proteins.

/datasets/bio/alphafold/params

Parameters of Evolutionary Scale Modeling (ESM) models

ESM models are a group of transformer protein language models designed to predict variant effects on protein function, protein sequences from backbone atom coordinates or protein structures from primary sequences.

/datasets/bio/esm/

PDB70

PDB70 is a protein database that contains profile hidden Markov models for a representative set of protein sequences from the Protein Data Bank database filtered with a maximum pairwise sequence identity of 70%. PDB70 can be used with AlphaFold.

/datasets/bio/alphafold/pdb70

PINDER

PINDER, or Protein Interaction Dataset and Evaluation Resource, is a dataset and resource for training and evaluation of protein-protein docking algorithms.

/datasets/bio/pinder

PLINDER

PLINDER, or Protein Ligand Interactions Dataset and Evaluation Resource, is a comprehensive, annotated, high-quality dataset and resource for training and evaluation of protein-ligand docking algorithms.

/datasets/bio/plinder

Protein Data Bank

Protein sequences from the Protein Data Bank in CIF format.

/datasets/bio/colabfold/pdb

Protein Data Bank database in mmCIF format

Protein sequences from the Protein Data Bank in mmCIF format.

/datasets/bio/alphafold/pdb_mmcif

Protein Data Bank database in SEQRES records

Protein sequences from the Protein Data Bank in SEQRES records. SEQRES records contain the amino acid sequence of residues in each chain of the proteins.

/datasets/bio/alphafold/pdb_seqres

Tara Oceans 18S amplicon

18S amplicon sequencing data from Tara Oceans expedition (2009-2013) DNA samples, covering the size fractions that correspond to protists. The sequence files were downloaded from the European Nucleotide Archive under project number PRJEB6610.

/datasets/bio/tara-oceans/18S-amplicon

Tara Oceans MATOU gene catalog

Reference collection of expressed eukaryotic genes, called the Marine Atlas of Tara Oceans Unigenes (MATOU), built from samples collected during the Tara Oceans expedition (2009-2013).

/datasets/bio/tara-oceans/MATOU-gene-catalog

Tara Oceans MGT transcriptomes

Collection of metagenomics-based transcriptomes (MGTs) of eukaryotic marine plankton communities, built from samples collected during the Tara Oceans expedition (2009-2013).

/datasets/bio/tara-oceans/MGT-transcriptomes

Tattabio

An LLM trained on genomic sequences with the goal of creating embeddings for protein sequences.

/datasets/bio/tattabio

Uniclust30

Uniclust30 is a database of annotated protein sequences and alignments. It is built by clustering the sequences in UniProt Knowledgebase (UniProtKB) at the level of 30% pairwise sequence identity. Uniclust30 can be used with AlphaFold.

/datasets/bio/alphafold/uniclust30

UniProtKB

The UniProt Knowledgebase (UniProtKB) is a database of protein sequences consisting of two sections called UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains manually annotated and non-redundant protein sequence records while UniProtKB/TrEMBL contains computationally analyzed and unreviewed protein sequence records.

/datasets/bio/alphafold/uniprot

UniRef100

UniRef100 is a database of protein sequences from UniProtKB and selected UniParc records.

/datasets/bio/uniref100

UniRef30

UniRef30 is a database of protein sequences built for ColabFold by clustering UniRef100 sequences at 30% sequence identity.

/datasets/bio/colabfold/uniref30_2103

UniRef90

UniRef90 is a database of protein sequences from UniProtKB and selected UniParc records. UniRef90 is built by clustering UniRef100 sequences such that each clustered set is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster.

/datasets/bio/alphafold/uniref90
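
The UniRef90 membership criterion described above can be made concrete with a small check. This is a toy sketch only: it takes already-computed identity and overlap fractions relative to the longest sequence in a cluster and does not reproduce the actual UniRef clustering procedure. The function name and arguments are hypothetical.

```python
def meets_uniref90_criterion(identity, overlap, min_identity=0.90, min_overlap=0.80):
    """Return True if a sequence satisfies both cluster thresholds.

    identity and overlap are fractions in [0, 1], measured against the
    longest sequence in the cluster. Both thresholds must be met.
    """
    return identity >= min_identity and overlap >= min_overlap


# A sequence with high identity but too little overlap does not qualify.
print(meets_uniref90_criterion(identity=0.95, overlap=0.85))
print(meets_uniref90_criterion(identity=0.95, overlap=0.70))
```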

Updated databases for ColabFold

“Databases built in MMseqs2 format to be used with ColabFold. The databases include PDB100 (version 230517), UniRef30 (version 2302) and the environmental database ColabFoldDB (version 202108).”

/datasets/bio/colabfold_new
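
Because the entries above are plain directory paths, a short script can confirm which of them are visible from the current machine. The sketch below checks a handful of paths copied from this catalog; the helper name is hypothetical and the list is only a sample, not the full catalog.

```python
from pathlib import Path

# A sample of paths copied from the catalog; extend as needed.
CATALOG_PATHS = [
    "/datasets/bio/alphafold3",
    "/datasets/bio/colabfold",
    "/datasets/bio/kraken2",
    "/datasets/bio/ncbi-db/",
    "/datasets/bio/uniref100",
]


def dataset_status(paths):
    """Map each path to True if it exists as a directory on this system."""
    return {p: Path(p).is_dir() for p in paths}


for path, present in dataset_status(CATALOG_PATHS).items():
    label = "ok" if present else "missing"
    print(f"{label:8}{path}")
```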