The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 98
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements Paper • 2210.01970 • Published Sep 30, 2022 • 13
Evaluating the Social Impact of Generative AI Systems in Systems and Society Paper • 2306.05949 • Published Jun 9, 2023 • 9
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus Paper • 2104.08758 • Published Apr 18, 2021
SEAL : Interactive Tool for Systematic Error Analysis and Labeling Paper • 2210.05839 • Published Oct 11, 2022
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 7
Stable Bias: Analyzing Societal Representations in Diffusion Models Paper • 2303.11408 • Published Mar 20, 2023 • 2
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 35