DuckDB


DuckDB is a high-performance SQL database system optimized for analytical workloads. It can directly query Iceberg tables stored in your analytics buckets, making it ideal for data exploration and complex analytical queries.

Installation

Install DuckDB:

```shell
pip install duckdb
```

The Iceberg extension itself is installed and loaded from within DuckDB (see `install_extension` in the example below).

Connecting to Analytics buckets

Here's a complete example of connecting to your Supabase analytics bucket and querying Iceberg tables:

```python
import duckdb

# Configuration
PROJECT_REF = "your-project-ref"
WAREHOUSE = "your-analytics-bucket-name"
SERVICE_KEY = "your-service-key"

# S3 credentials
S3_ACCESS_KEY = "your-access-key"
S3_SECRET_KEY = "your-secret-key"
S3_REGION = "us-east-1"

# Construct endpoints
S3_ENDPOINT = f"https://{PROJECT_REF}.supabase.co/storage/v1/s3"
CATALOG_URI = f"https://{PROJECT_REF}.supabase.co/storage/v1/iceberg"

# Initialize DuckDB connection
conn = duckdb.connect(":memory:")

# Install and load the Iceberg extension
conn.install_extension("iceberg")
conn.load_extension("iceberg")

# Configure Iceberg catalog with Supabase credentials
conn.execute(f"""
    CREATE SECRET (
        TYPE S3,
        KEY_ID '{S3_ACCESS_KEY}',
        SECRET '{S3_SECRET_KEY}',
        REGION '{S3_REGION}',
        ENDPOINT '{S3_ENDPOINT}',
        URL_STYLE 'virtual'
    );
""")

# Configure the REST catalog
conn.execute(f"""
    ATTACH 'iceberg://{CATALOG_URI}' AS iceberg_catalog (
        TYPE ICEBERG_REST,
        WAREHOUSE '{WAREHOUSE}',
        TOKEN '{SERVICE_KEY}'
    );
""")

# Query your Iceberg tables
result = conn.execute("""
    SELECT * FROM iceberg_catalog.default.events
    LIMIT 10
""").fetchall()

for row in result:
    print(row)

# Complex aggregation example
analytics = conn.execute("""
    SELECT
        event_name,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
    FROM iceberg_catalog.default.events
    GROUP BY event_name
    ORDER BY event_count DESC
""").fetchdf()

print(analytics)
```

Key features with DuckDB

Efficient data exploration

DuckDB pushes column projections and filters down to the Iceberg scan, so it reads only the data you need:

```python
# This only reads the columns you select
events = conn.execute("""
    SELECT event_id, event_name, event_timestamp
    FROM iceberg_catalog.default.events
    WHERE event_timestamp > NOW() - INTERVAL '7 days'
""").fetchdf()
```

Converting to Pandas

Convert results to Pandas DataFrames for further analysis:

```python
df = conn.execute("""
    SELECT * FROM iceberg_catalog.default.events
""").fetchdf()

# Use pandas for visualization or further processing
print(df.describe())
```

Exporting results

Save your analytical results to various formats:

```python
# Export to Parquet
conn.execute("""
    COPY (
        SELECT * FROM iceberg_catalog.default.events
    ) TO 'results.parquet'
""")

# Export to CSV
conn.execute("""
    COPY (
        SELECT event_name, COUNT(*) AS count
        FROM iceberg_catalog.default.events
        GROUP BY event_name
    ) TO 'summary.csv' (FORMAT CSV, HEADER true)
""")
```

Best practices

  • Connection pooling - Reuse connections for multiple queries
  • Partition pruning - Filter by partition columns to improve query performance
  • Column selection - Only select columns you need to reduce I/O
  • Limit results - Use LIMIT during exploration to avoid processing large datasets

Next steps