Guilherme Penedo

Machine Learning Research Engineer

All Publications, Projects, and Blogposts

2025

  • 📄 FinePDFs-Edu

    2025
Hynek Kydlíček, Guilherme Penedo, Leandro von Werra
    350B+ highly educational tokens filtered from FinePDFs
  • 🌐 FineWiki

    2025
    Guilherme Penedo
An updated, better-extracted version of Wikipedia in 300+ languages
  • 📄 FinePDFs

    2025
Hynek Kydlíček, Guilherme Penedo, Leandro von Werra
    3T tokens of text data extracted from PDFs sourced from the Web
  • 🚀 SmolLM3: smol, multilingual, long-context reasoner

    2025
Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Guilherme Penedo et al.
    Hugging Face Blog
SmolLM3 is a 3B-parameter language model designed to push the boundaries of small models. It supports dual-mode reasoning, six languages, and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale.
  • 📚 The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

    2025
    Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo et al.
    An 8TB dataset of public domain and openly licensed text
  • 🏆 Open-R1: Olympic coding

    2025
    Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching et al.
    Datasets and models to produce DeepSeek-R1 like reasoning models excelling at competitive programming

2024

  • 🥂 FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language

    2024
    Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan et al.
    Second Conference on Language Modeling (COLM) 2025 Spotlight
    An extension of FineWeb to over 1000 languages
  • 🤏 SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model

    2024
    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo et al.
    Second Conference on Language Modeling (COLM) 2025 Spotlight
SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device
  • 🔍 FineTasks: Finding signal in 100s of multilingual evaluation tasks

    2024
    Hynek Kydlíček, Guilherme Penedo, Clémentine Fourrier, Nathan Habib, Thomas Wolf
    Finding signal in a haystack of 200+ multilingual tasks
  • 📖 Towards Best Practices for Open Datasets for LLM Training

    2024
    Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo et al.
    Best practices for creating and sharing open datasets for LLM training
  • 📚 FineWeb-Edu

    2024
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2024 Spotlight
    A filtered subset of the most educational content from FineWeb
  • 🍷 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    2024
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2024 Spotlight
    A state-of-the-art 15-trillion+ token English dataset for LLM pre-training

2023

  • 🦅 The Falcon Series of Open Language Models

    2023
    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
State-of-the-art (at the time of release) open-source large language models, available in 7B, 40B, and 180B parameter sizes.
  • 📀 The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    2023
    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2023
    Pre-training dataset used to train the Falcon LLMs
  • 🅰️ AlGhafa Evaluation Benchmark for Arabic Language Models

    2023
    Ebtesam Almazrouei, Ruxandra Cojocaru, Michele Baldo, Quentin Malartic, Hamza Alobeidli, Daniele Mazzotta, Guilherme Penedo et al.
    Proceedings of ArabicNLP 2023
    A new multiple-choice evaluation benchmark for Arabic LLMs

2022

  • 🛰️ Artery in Microgravity (AIM): Assembly, integration, and testing for a student payload for the ISS

    2022
    Luis García Mozos, Devonjit Saroya, Yannick Roelvink, Naël dos Santos D'Amore, Stefano Gabetti, Jorge Galván Lobo, Catarina Lobo, Mrunmayi Joshi, Guilherme Penedo et al.
    4th Symposium on Space Educational Activities (SSEA)
    Assembly, integration, and testing for a student payload for the ISS
