I am a Machine Learning Research Engineer at 🤗 Hugging Face, where I work on creating state-of-the-art datasets to train Large Language Models, such as the industry-standard RefinedWeb, FineWeb, and FineWeb2.
I studied Aerospace Engineering at Instituto Superior Técnico (Lisbon, Portugal), and later obtained a Master's in Aerospace Engineering from ISAE-SUPAERO (Toulouse, France), before changing fields and going into Machine Learning. Back in high school, I represented my country, Portugal, in the International Olympiad in Informatics (IOI), a competitive programming contest.
I am currently based in Paris, France, but I grew up by the beach in Lisbon, Portugal. In total, I have lived in 4 different countries: Portugal 🇵🇹, France 🇫🇷, the United Arab Emirates 🇦🇪, and the Netherlands 🇳🇱.
🔥 News
- Oct 2025: We have just released FineWiki, an updated, better-extracted version of Wikipedia in 300+ languages.
- Oct 2025: I will be attending COLM 2025 in Montreal, Canada, where I will present a spotlight paper: "FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language".
📜 Selected Publications, Projects, and Blogposts
- 🏆 Open-R1: Olympic coding
  2025
  Datasets and models to produce DeepSeek-R1-like reasoning models excelling at competitive programming
- 🥂 FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
  2024
  Second Conference on Language Modeling (COLM) 2025, Spotlight
  An extension of FineWeb to over 1000 languages
- 🍷 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
  2024
  Advances in Neural Information Processing Systems (NeurIPS) 2024, Spotlight
  A state-of-the-art 15-trillion+ token English dataset for LLM pre-training
- 📀 The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  2023
  Advances in Neural Information Processing Systems (NeurIPS) 2023
  Pre-training dataset used to train the Falcon LLMs
🎤 Talks
No talks yet.