Dr. Pedro Holanda - Principal Engineer at DuckDB Labs

DuckDB is a free, open-source database that lets anyone analyze large datasets right on their own computer - now downloaded over 30 million times a month. I've been building its infrastructure since it was a research prototype at CWI, from the CSV engine to Arrow integration to DuckLake. I work on the systems that make analytical databases radically simpler.

Selected work: led DuckLake from its 0.1 pre-production version to 1.0, and built the CSV engine that scored highest among all tested systems on the 2025 Pollock robustness benchmark.

PhD from CWI/Leiden University with publications at VLDB and ICDE. Former Microsoft Research intern. Served as COO of DuckDB Labs, helping build the company from a CWI research spin-out before returning to engineering full-time.

DuckDB by the numbers

38K+ GitHub Stars
30M+ Monthly Downloads

DuckLake by the numbers

2.7K+ GitHub Stars
2M+ Monthly Downloads

About

I joined DuckDB at CWI Amsterdam when the project had two researchers, only a handful of users, and a bet that analytical databases could be radically simpler. I have been building its core infrastructure ever since.

When DuckDB Labs spun out of CWI, I took on the COO role - helping hire the early team, organizing events, shaping the open-source strategy, and building the company while continuing to ship code. That experience gave me a perspective on the full lifecycle of open-source infrastructure that informs every design decision I make today. When the company was ready for dedicated operations leadership, I chose to return to engineering full-time.

As Principal Engineer, I led DuckLake from its 0.1 pre-production version to its 1.0 release - including data inlining that delivers 926× faster queries and 105× faster ingestion versus Iceberg - and I own query-processing performance for the engine. I also mentor new contributors, lead design reviews across subsystems, and help set its technical direction.

Engineering Principles

The future of analytical databases is in-process. Moving data to the database is the wrong abstraction - the database should come to the data. That is the bet I made when I joined DuckDB.

CSV will never die. Instead of replacing messy formats, build systems that handle their full complexity transparently. That philosophy drives the work I do on DuckDB's data ingestion layer.

Databases should meet users where they are. That is why I built DuckDB's zero-copy Arrow integration and ADBC - the best data system works seamlessly with every tool in your stack, not the other way around.

Engineering Contributions

DuckLake

Led the development of DuckDB's integrated data lake format from its 0.1 pre-production version to its 1.0 release. DuckLake is a SQL-native catalog that stores table metadata in a standard SQL database while the data lives in open formats like Parquet on object storage.

CSV Engine

Designed and built DuckDB's parallel CSV parser with automatic type, delimiter, and dialect detection. Scored highest among all tested systems on the Pollock robustness benchmark (2025).
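
To give a feel for what delimiter and type detection involve, here is a toy sketch using only Python's standard library. It is not DuckDB's sniffer: the candidate list, the "every row has the same width" rule, and the helper names (`sniff_delimiter`, `infer_type`, `sniff`) are all assumptions made for illustration, and the type names are simply borrowed as labels.

```python
import csv
import io

# Hypothetical candidate set; a real sniffer considers many more dialect options.
CANDIDATES = [",", ";", "\t", "|"]

def sniff_delimiter(sample: str) -> str:
    """Pick the candidate that splits every row into the same number of columns,
    preferring the one that yields the most columns."""
    best, best_cols = ",", 1
    for delim in CANDIDATES:
        rows = list(csv.reader(io.StringIO(sample), delimiter=delim))
        widths = {len(row) for row in rows if row}
        if len(widths) == 1:  # consistent width across all sampled rows
            cols = widths.pop()
            if cols > best_cols:
                best, best_cols = delim, cols
    return best

def infer_type(values) -> str:
    """Return the narrowest label (integer, float, string) that parses every value."""
    for caster, name in ((int, "BIGINT"), (float, "DOUBLE")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "VARCHAR"

def sniff(sample: str):
    """Detect the delimiter, then infer one type per column from the data rows."""
    delim = sniff_delimiter(sample)
    rows = list(csv.reader(io.StringIO(sample), delimiter=delim))
    header, body = rows[0], rows[1:]
    columns = zip(*body)
    return delim, {name: infer_type(col) for name, col in zip(header, columns)}

# A semicolon-separated file whose quoted field contains a comma.
sample = 'id;price;note\n1;9.50;"a, b"\n2;12;plain\n'
delimiter, schema = sniff(sample)
print(delimiter, schema)
```

The quoted comma in the sample is the kind of detail that trips up naive splitting: splitting on "," gives rows of different widths, so the sketch rejects it and settles on ";".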

Arrow & ADBC Integration

Built the zero-copy integration between DuckDB and Python's data ecosystem via Apache Arrow. ADBC provides a modern connectivity standard that eliminates the serialization overhead of ODBC.

ART Persistent Storage

Designed and implemented the persistent storage layer for DuckDB's Adaptive Radix Tree indexes. Keeps index data durable on disk without sacrificing in-memory lookup speed.

Python Client

Built DuckDB's Python client foundations and the UDF framework that lets users extend the engine in pure Python - no C++ required.

BIGNUM Implementation

Implemented arbitrary-precision integer arithmetic for DuckDB. Handles HUGEINT and DECIMAL types so that financial calculations and scientific measurements stay exact beyond 64-bit limits.
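
As a small illustration of why exactness beyond 64 bits matters, the sketch below contrasts wrapped 64-bit arithmetic and binary floating point with exact arithmetic. It uses Python's built-in arbitrary-precision `int` and the standard `decimal` module, not DuckDB's implementation; `wrap_int64` is a hypothetical helper that simulates two's-complement overflow.

```python
from decimal import Decimal

INT64_MAX = 2**63 - 1

def wrap_int64(x: int) -> int:
    """Simulate what a signed 64-bit register holds after overflow (two's complement)."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

# Exact integer arithmetic keeps growing past the 64-bit limit...
exact = INT64_MAX + 1
# ...while a fixed-width 64-bit integer silently wraps around to a negative value.
wrapped = wrap_int64(INT64_MAX + 1)

# Binary floats cannot represent 0.1 or 0.2 exactly; decimal arithmetic can.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(exact, wrapped)
print(float_sum == 0.3, decimal_sum == Decimal("0.3"))
```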

Open Source Projects

DuckDB

38K+ GitHub Stars

The open-source analytical in-process database. Early core contributor since the research prototype - 580+ merged pull requests.

DuckLake

2.7K+ GitHub Stars

Integrated data lake and catalog format designed to work with DuckDB. Led its development - 180+ merged pull requests.

Career Timeline

2025 - Present

Principal Engineer

DuckDB Labs
2023 - 2025

Software Engineer

DuckDB Labs. Returned to full-time engineering. Built the CSV engine, ART storage, and ADBC integration.
2021 - 2023

Chief Operating Officer

DuckDB Labs. Helped hire the early team, shaped the open-source strategy, and built company operations while continuing to ship code.
2021 - 2022

Post-Doctoral Researcher

CWI Amsterdam. Concurrent with the COO role, during the CWI spin-out.
2019

Research Intern

Microsoft Research, DMX group. JIT-compiled execution engines for SQL Server.
2017 - 2021

PhD in Computer Science

CWI / Leiden University

Talks & Media

I speak on database internals and query-engine design at FOSDEM, EuroPython, and data community events.

Selected Publications

Peer-reviewed work on progressive indexing, database benchmarking, and analytical query processing. The research behind these VLDB and ICDE papers informs what ships in DuckDB and DuckLake today.

2021

Multidimensional Adaptive & Progressive Indexes

Extending progressive indexing to multiple dimensions for faster analytical queries.
2021

Progressive Indexes

Dissertation combining adaptive indexing with workload-driven optimization for interactive analytical queries.
Pedro Holanda
PhD Thesis @ Leiden University/CWI [PDF]
2020

Dissecting DuckDB: The internals of the SQLite for Analytics

A hands-on tutorial on DuckDB internals for analytical workloads.
Pedro Holanda and Mark Raasveldt
SBBD (Tutorial) [PDF] [HANDS-ON]
2019

Progressive Indexes: Indexing for Interactive Data Analysis

Indexes that adapt during query execution, enabling interactive data analysis.
2018

Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing

Identifying common pitfalls in database performance testing.
SIGMOD (DbTest) [PDF] [SOURCE CODE]

Blog Posts

Showing selected posts. Full list on the DuckDB Blog and DuckLake Blog.

Introducing DuckLake's data inlining feature that stores small updates directly in the catalog database, eliminating the small files problem and achieving 926× faster queries and 105× faster ingestion compared to Iceberg.

Introducing the zero-copy integration between DuckDB and Apache Arrow, which became the default way to move data between DuckDB and the Python ecosystem.

Testing DuckDB's CSV parser against the Pollock robustness benchmark - the most adversarial collection of real-world CSV files available.

A benchmark comparison of CSV and Parquet ingestion performance. The results are more nuanced than the conventional wisdom suggests.

DuckDB's implementation of the Arrow Database Connectivity standard, providing a modern alternative to ODBC for high-throughput data transfer.

Get in Touch

I am always up for a conversation about database internals, query engine design, or open-source collaboration. Feel free to reach out.