Guilherme Penedo

Machine Learning Research Engineer

All Publications, Projects, and Blogposts

2025

  • 📄 FinePDFs-Edu

    2025
Hynek Kydlíček, Guilherme Penedo, Leandro von Werra
    350B+ highly educational tokens filtered from FinePDFs
  • 🌐 FineWiki

    2025
    Guilherme Penedo
An updated, better-extracted version of Wikipedia in 300+ languages
  • 📄 FinePDFs

    2025
Hynek Kydlíček, Guilherme Penedo, Leandro von Werra
    3T tokens of text data extracted from PDFs sourced from the Web
  • 🚀 SmolLM3: smol, multilingual, long-context reasoner

    2025
Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Guilherme Penedo et al.
    Hugging Face Blog
SmolLM3 is a 3B-parameter language model designed to push the boundaries of small models. It supports dual-mode reasoning, six languages, and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale.
  • 📚 The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

    2025
    Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo et al.
    An 8TB dataset of public domain and openly licensed text
  • 🏆 Open-R1: Olympic coding

    2025
    Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching et al.
    Datasets and models to produce DeepSeek-R1 like reasoning models excelling at competitive programming

2024

  • 🥂 FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language

    2024
    Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan et al.
    Second Conference on Language Modeling (COLM) 2025 Spotlight
    An extension of FineWeb to over 1000 languages
  • 🤏 SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model

    2024
    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo et al.
    Second Conference on Language Modeling (COLM) 2025 Spotlight
SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device
  • 🔍 FineTasks: Finding signal in 100s of multilingual evaluation tasks

    2024
    Hynek Kydlíček, Guilherme Penedo, Clémentine Fourrier, Nathan Habib, Thomas Wolf
    Finding signal in a haystack of 200+ multilingual tasks
  • 📖 Towards Best Practices for Open Datasets for LLM Training

    2024
    Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo et al.
    Best practices for creating and sharing open datasets for LLM training
  • 📚 FineWeb-Edu

    2024
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2024 Spotlight
    A filtered subset of the most educational content from FineWeb
  • 🍷 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    2024
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2024 Spotlight
    A state-of-the-art 15-trillion+ token English dataset for LLM pre-training

2023

  • 🦅 The Falcon Series of Open Language Models

    2023
    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
State-of-the-art (at the time of release) open-source large language models, available in 7B, 40B, and 180B parameter sizes.
  • 📀 The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    2023
    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2023
    Pre-training dataset used to train the Falcon LLMs
  • 🅰️ AlGhafa Evaluation Benchmark for Arabic Language Models

    2023
    Ebtesam Almazrouei, Ruxandra Cojocaru, Michele Baldo, Quentin Malartic, Hamza Alobeidli, Daniele Mazzotta, Guilherme Penedo et al.
    Proceedings of ArabicNLP 2023
    A new multiple-choice evaluation benchmark for Arabic LLMs

2022

  • 🛰️ Artery in Microgravity (AIM): Assembly, integration, and testing for a student payload for the ISS

    2022
    Luis García Mozos, Devonjit Saroya, Yannick Roelvink, Naël dos Santos D'Amore, Stefano Gabetti, Jorge Galván Lobo, Catarina Lobo, Mrunmayi Joshi, Guilherme Penedo et al.
    4th Symposium on Space Educational Activities (SSEA)
    Assembly, integration, and testing for a student payload for the ISS
