Audio Datasets
nyu-dice-lab/wavepulse-radio-raw-transcripts Viewer • Updated Feb 18 • 565M • 1.44k • 8
laion/LAION-DISCO-12M Viewer • Updated Nov 14, 2024 • 12.3M • 256 • 40
laion/LAION-Audio-300M Viewer • Updated Jan 10 • 229M • 28.4k • 54
Video Datasets
nkp37/OpenVid-1M Viewer • Updated Jul 14 • 1.45M • 29k • 241
Koala-36M/Koala-36M-v1 Viewer • Updated Oct 12, 2024 • 36M • 1.08k • 48
OpenGVLab/InternVid-Full Viewer • Updated Jun 5, 2024 • 47.6M • 208 • 16
1x-technologies/world_model_raw_data Updated Apr 20 • 1.57k • 5
Text Datasets
TxT360: Trillion Extracted Text (Running • 131): Explore and analyze the TxT360 dataset for LLM pre-training
CASIA-LM/ChineseWebText2.0 Viewer • Updated Dec 2, 2024 • 2k • 2.33k • 27
HPLT/HPLT2.0_cleaned Viewer • Updated Nov 13 • 9.03B • 35k • 36
TrevorDohm/Pile_Tokenized Viewer • Updated Feb 20, 2024 • 134M • 2.07k
Image Datasets
kakaobrain/coyo-700m Viewer • Updated Aug 30, 2022 • 747M • 3.13k • 152
mlfoundations/datacomp_1b Viewer • Updated Aug 21, 2023 • 1.39B • 12.9k • 35