AI and ML

Datasets and models commonly used in Artificial Intelligence and Machine Learning research

Aleph-Alpha

infoOur research has produced state-of-the-art multi-modal models (MAGMA), explainability techniques for transformer-based models (AtMan), and a comprehensive evaluation framework for large-scale model assessment
folder_open/datasets/ai/aleph-alpha

Alibaba-NLP

infoGTE-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) model family and ranks No. 1 in both English and Chinese evaluations on the Massive Text Embedding Benchmark (MTEB) as of June 16, 2024
folder_open/datasets/ai/alibaba

Allen AI

infoAllen AI collections
folder_open/datasets/ai/allenai

AlpacaFarm

infoAlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
folder_open/datasets/ai/alpaca-farm

Amass

infoAMASS is a large database of human motion unifying different optical marker-based motion capture datasets by representing them within a common framework and parameterization. AMASS is readily useful for animation, visualization, and generating training data for deep learning
folder_open/datasets/ai/amass

Audioset

infoAudioSet is an ontology and human-labeled dataset for audio event detection. It consists of 2,084,320 ten-second sound clips from YouTube videos labeled with a hierarchical ontology of 632 audio event classes, including human and animal sounds, musical instruments, and everyday environmental noises.
folder_open/datasets/ai/audioset

BAAI

infoEmu3-Gen is a unified multimodal generative model designed for high-quality image generation and visual understanding within a single autoregressive framework. Built on a discrete visual tokenizer, Emu3-Gen supports text-to-image generation, image editing, and multimodal reasoning by modeling images and text as a shared sequence, enabling strong generative fidelity and flexible multimodal interactions
folder_open/datasets/ai/baai

Bigcode

infoBigCode is an open scientific collaboration working on responsible training of large language models for coding applications
folder_open/datasets/ai/bigcode

Biomed Clip

infoBiomedCLIP is a biomedical vision-language foundation model that is pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning
folder_open/datasets/ai/biomed-clip

Blip 2

infoBLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
folder_open/datasets/ai/blip

Bloom

infoBLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources.
folder_open/datasets/ai/bloom

ByteDance

infoByteDance
folder_open/datasets/ai/bytedance

COCO

infoCOCO is a large-scale object detection, segmentation, and captioning dataset
folder_open/datasets/ai/coco
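
The annotation layout COCO popularized can be sketched with a toy COCO-style structure (the field names below match the real JSON format; the image, box, and id values are made up for illustration):

```python
# Index COCO-style annotations by image (toy data, real field layout).
from collections import defaultdict

coco = {
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 50.0, 40.0]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [0.0, 0.0, 30.0, 60.0]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
}

cat_name = {c["id"]: c["name"] for c in coco["categories"]}
anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

for img in coco["images"]:
    labels = [cat_name[a["category_id"]] for a in anns_by_image[img["id"]]]
    print(img["file_name"], labels)  # bbox is [x, y, width, height]
```

The real annotation files are large single JSON documents with exactly these three top-level lists, which is why most loaders (e.g. pycocotools) build indexes like the ones above.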

Code Llama

infoModels for the Code Llama LLM
folder_open/datasets/ai/codellama/

DeepAccident

infoDeepAccident is the first V2X (vehicle-to-everything) autonomous driving simulation dataset containing diverse collision accidents that commonly occur in real-world driving scenarios
folder_open/datasets/ai/deep-accident

DeepSeek

infoDeepSeek-R1-Zero, trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, it naturally develops numerous powerful and intriguing reasoning behaviors
folder_open/datasets/ai/deepseek

DeSTA

infoDeSTA is a recent large audio-language model
folder_open/datasets/ai/desta

Diffa

infoDiffa is an audio LLM built on a diffusion language model
folder_open/datasets/ai/diffa

DINO v2

infoDINOv2 is a self-supervised method for learning visual representations
folder_open/datasets/ai/dinov2

epic-kitchens

infoEpic-Kitchens-100 is a large-scale dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in kitchen environments.
folder_open/datasets/ai/epic-kitchens

Falcon

infoFalcon is a family of large language models, available in 7B, 40B, and 180B parameters, as pretrained and instruction-tuned variants
folder_open/datasets/ai/falcon

Florence

infoFlorence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.
folder_open/datasets/ai/florence

FLUX.1 Kontext

infoFLUX.1 Kontext is an in-context image generation and editing model in the FLUX.1 family from Black Forest Labs
folder_open/datasets/ai/flux

Fomo

infoFOMO-60K is a large-scale dataset of brain MRI scans, including both clinical and research-grade scans. The dataset includes a wide range of sequences, including T1, MPRAGE, T2, T2*, FLAIR, SWI, T1c, PD, DWI, ADC, and more.
folder_open/datasets/ai/fomo

Gemma

infoGemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models
folder_open/datasets/ai/gemma

Genmo

infoGenmo develops open video generation models such as Mochi
folder_open/datasets/ai/genmo

Glm

infoChatGLM. To date, the GLM-4 models have been pre-trained on over ten trillion tokens, mostly in Chinese and English, along with a small corpus covering 24 languages, and are aligned primarily for Chinese and English usage
folder_open/datasets/ai/glm

GPT

infoA large-scale unsupervised language model that generates coherent paragraphs of text and achieves state-of-the-art performance on many language modeling benchmarks
folder_open/datasets/ai/gpt

HiDream-I1

infoHiDream-I1 is an open-source image generation foundation model
folder_open/datasets/ai/hidream

Ibm Granite

infoGranite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters
folder_open/datasets/ai/ibm-granite

Idefics2

infoIdefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs
folder_open/datasets/ai/idefics2

Imagenet 1K

infoImageNet-1K dataset
folder_open/datasets/ai/imagenet/

Inaturalist

infoThe iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories
folder_open/datasets/ai/inaturalist

Infly

infoINF-Retriever-v1 is an LLM-based dense retrieval model developed by INF TECH. It is built upon the gte-Qwen2-7B-instruct model and specifically fine-tuned to excel in retrieval tasks, particularly for Chinese and English data
folder_open/datasets/ai/infly

InternLM

infoInternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques
folder_open/datasets/ai/internlm

Internvl3-8b-hf

infoInternVL3-8B is an open-source multimodal vision-language model optimized for fine-grained visual understanding, multimodal reasoning, and instruction-following. It supports complex tasks including visual question answering, captioning, OCR, and diagram reasoning. Built upon advanced scaling strategies and alignment techniques, InternVL3 bridges the gap to proprietary models like GPT-4V through high-quality pretraining and preference optimization.
folder_open/datasets/ai/internvl

Intfloat

infoA novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps
folder_open/datasets/ai/intfloat

Kinetics

infoKinetics is a collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes, depending on the dataset version.
folder_open/datasets/ai/kinetics

LG

infoLarge Language Models (LLMs) and Large Multimodal Models (LMMs) developed by LG AI Research. EXAONE stands for EXpert AI for EveryONE, a vision that LG is committed to realizing
folder_open/datasets/ai/lg

Linq

infoLinq-Embed-Mistral has been developed by building upon the foundations of the E5-mistral-7b-instruct and Mistral-7B-v0.1 models
folder_open/datasets/ai/linq

Llama2

infoModels for Llama 2 LLM
folder_open/datasets/ai/llama2/

Llama3

infoLlama 3 is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage
folder_open/datasets/ai/llama3

Llama4

infoLlama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture
folder_open/datasets/ai/llama4/

Llava_OneVision

infoLLaVA-OneVision: Easy Visual Task Transfer
folder_open/datasets/ai/llava

LLM-compiler

infoLLM Compiler: Foundation Language Models for Compiler Optimization
folder_open/datasets/ai/llm-compiler

LMSys

infoThe Large Model Systems Organization (LMSYS) develops large models and systems that are open, accessible, and scalable.
folder_open/datasets/ai/lmsys

Lumina

infoLumina-Image 2.0: A Unified and Efficient Image Generative Framework
folder_open/datasets/ai/lumina

Mims

infoTxAgent, an AI agent that leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools to analyze drug interactions, contraindications, and patient-specific treatment strategies
folder_open/datasets/ai/mims

Mixtral

infoMixtral is a sparse Mixture-of-Experts (MoE) large language model from Mistral AI
folder_open/datasets/ai/mixtral/

Monai

infoM3 is a medical visual language model that empowers medical imaging professionals, researchers, and healthcare enterprises by enhancing medical imaging workflows across various modalities.
folder_open/datasets/ai/monai

Moonshot-ai

infoKimi-Audio is an open-source audio foundation model excelling in audio understanding, generation, and conversation
folder_open/datasets/ai/moonshot

Msmarco

infoThe MS MARCO dataset is a large-scale information retrieval benchmark that uses real-world questions from Bing’s search queries to evaluate the performance of machine learning models in generating answers
folder_open/datasets/ai/msmarco

Natural-questions

infoThe Natural Questions corpus is a question answering dataset. Questions consist of real, anonymized, aggregated queries issued to the Google search engine
folder_open/datasets/ai/natural-questions

Nvidia

infoNvidia repository
folder_open/datasets/ai/nvidia

Objaverse

infoObjaverse is a massive dataset of 800K+ annotated 3D objects
folder_open/datasets/ai/objaverse

Openai-whisper

infoWhisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation
folder_open/datasets/ai/whisper

Perplexity AI

infoR1-1776 is a version of DeepSeek-R1 post-trained by Perplexity AI to remove censorship while preserving reasoning ability
folder_open/datasets/ai/perplexity

Phi

infoPhi-3.5-mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-3
folder_open/datasets/ai/phi

Playgroundai

infoA model that generates highly aesthetic images of resolution 1024x1024, as well as portrait and landscape aspect ratios
folder_open/datasets/ai/playgroundai

Pythia

infoPythia is the first LLM suite designed specifically to enable scientific research on LLMs
folder_open/datasets/ai/pythia

Qwen

infoQwen is the first installment of the Qwen large language model series, a comprehensive family encompassing distinct models with varying parameter counts
folder_open/datasets/ai/qwen

Qwen2

infoQwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model
folder_open/datasets/ai/qwen2

Qwen3

infoQwen3 is the latest generation of the Qwen large language model series
folder_open/datasets/ai/qwen3

Rag-sequence-nq

infoRAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever
folder_open/datasets/ai/rag-sequence-nq
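
The non-parametric half of this setup, nearest-neighbor search over a dense vector index, can be sketched with numpy (toy passages and hand-made 4-d vectors standing in for real retriever embeddings and the Wikipedia index):

```python
# Dense retrieval sketch: rank passages by cosine similarity to a query vector.
import numpy as np

passages = ["the eiffel tower is in paris",
            "the great wall is in china",
            "mount fuji is in japan"]
# Hypothetical embeddings; a real system encodes passages with a trained model.
index = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.0, 0.8, 0.2, 0.0],
                  [0.1, 0.0, 0.9, 0.1]])

def top_k(query_vec, index, k=2):
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(scores)[::-1][:k]

hits = top_k(np.array([1.0, 0.0, 0.1, 0.0]), index)
print([passages[i] for i in hits])
```

In RAG proper, the retrieved passages are then fed, together with the question, to the pre-trained seq2seq generator.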

S1-32B

infos1 is a reasoning model fine-tuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview and exhibits test-time scaling via budget forcing.
folder_open/datasets/ai/simplescaling

Stability AI

infoA novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens
folder_open/datasets/ai/stabilityai/

Sft

infoA sentence-transformers model finetuned from sentence-transformers/all-mpnet-base-v2
folder_open/datasets/ai/sft

SlimPajama

infoSlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together
folder_open/datasets/ai/slim-pajama
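
The exact-deduplication step such a pipeline relies on can be sketched in a few lines (a simplification: the real SlimPajama pipeline also performs fuzzy near-duplicate matching, and the whitespace normalization here is illustrative):

```python
# Drop documents whose normalized text hashes to an already-seen digest.
import hashlib

def dedup(docs):
    seen, kept = set(), []
    for doc in docs:
        # Normalize case and whitespace before hashing so trivial variants collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "something else"]
print(dedup(docs))  # the second doc is a near-verbatim repeat and is dropped
```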

T5

infoThe T5 model, short for Text-to-Text Transfer Transformer, is a machine learning model developed by Google
folder_open/datasets/ai/t5

Tulu

infoTülu 3: Pushing Frontiers in Open Language Model Post-Training
folder_open/datasets/ai/tulu

V2X

infoV2X-Sim, a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving
folder_open/datasets/ai/v2x

Video-MAE

infoVideo masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models
folder_open/datasets/ai/opengvlab
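
The masking this pre-training is built on can be sketched with numpy: VideoMAE masks a very high ratio of patches, and in its tube-masking variant the same spatial positions are hidden in every frame (sizes below are illustrative):

```python
# Tube masking sketch: one spatial mask, repeated across all frames.
import numpy as np

rng = np.random.default_rng(0)
frames, patches_per_frame, mask_ratio = 8, 196, 0.9   # e.g. 14x14 patches/frame
n_masked = int(patches_per_frame * mask_ratio)

spatial_mask = np.zeros(patches_per_frame, dtype=bool)
spatial_mask[rng.choice(patches_per_frame, n_masked, replace=False)] = True
tube_mask = np.tile(spatial_mask, (frames, 1))  # identical mask in every frame

print(tube_mask.shape, int(tube_mask.sum()))
```

The encoder then sees only the unmasked patches, and the decoder reconstructs the hidden ones.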

Vit

infoThe Vision Transformer (ViT) model uses the transformer architecture to process image patches for tasks like image classification
folder_open/datasets/ai/vit
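
The patch-processing step the description refers to can be sketched with numpy, using ViT-Base/16-style sizes (the projection weights below are zero placeholders standing in for learned parameters):

```python
# Turn an image into a ViT token sequence: patchify, flatten, project, add [CLS].
import numpy as np

img_size, patch, dim = 224, 16, 768
n_patches = (img_size // patch) ** 2        # 14 * 14 = 196 patches

image = np.zeros((img_size, img_size, 3))
patches = image.reshape(img_size // patch, patch, img_size // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, patch * patch * 3)

proj = np.zeros((patch * patch * 3, dim))   # stand-in for the learned projection
tokens = patches @ proj                     # (196, 768) patch embeddings
tokens = np.concatenate([np.zeros((1, dim)), tokens])  # prepend [CLS] token
print(tokens.shape)
```

The resulting 197-token sequence (plus position embeddings, omitted here) is what the transformer encoder actually consumes.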

Wildchat

infoWildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns
folder_open/datasets/ai/wildchat