Guilherme Penedo

Machine Learning Research Engineer

I am a Machine Learning Research Engineer at 🤗 Hugging Face, where I work on creating state-of-the-art datasets to train Large Language Models, such as the industry-standard RefinedWeb, FineWeb, and FineWeb2.

I studied Aerospace Engineering at Instituto Superior Técnico (Lisbon, Portugal), and later obtained a Master's in Aerospace Engineering from ISAE-SUPAERO (Toulouse, France), before changing fields and going into Machine Learning. Back in high school, I represented my country, Portugal, in the International Olympiad in Informatics (IOI), a competitive programming contest.

I am currently based in Paris, France, but I grew up by the beach in Lisbon, Portugal. In total, I have lived in 4 different countries: Portugal 🇵🇹, France 🇫🇷, the United Arab Emirates 🇦🇪, and the Netherlands 🇳🇱.

🔥 News

📜 Selected Publications, Projects, and Blogposts

  • 🏆 Open-R1: Olympic coding

    2025
    Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching et al.
    Datasets and models to produce DeepSeek-R1-like reasoning models that excel at competitive programming
  • 🥂 FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language

    2024
    Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan et al.
    Second Conference on Language Modeling (COLM) 2025 Spotlight
    An extension of FineWeb to over 1000 languages
  • 🍷 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    2024
    Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2024 Spotlight
    A state-of-the-art 15-trillion+ token English dataset for LLM pre-training
  • 📀 The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    2023
    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli et al.
    Advances in Neural Information Processing Systems (NeurIPS) 2023
    Pre-training dataset used to train the Falcon LLMs


🎤 Talks

    No talks yet.