DuckDB


DuckDB is a high-performance SQL database system optimized for analytical workloads. It can directly query Iceberg tables stored in your analytics buckets, making it ideal for data exploration and complex analytical queries.

Installation

Install DuckDB:

```shell
pip install duckdb
```

The Iceberg extension itself is installed and loaded from within DuckDB (see `install_extension` in the example below).

Connecting to Analytics buckets

Here's a complete example of connecting to your Supabase analytics bucket and querying Iceberg tables:

```python
import duckdb

# Configuration
PROJECT_REF = "your-project-ref"
WAREHOUSE = "your-analytics-bucket-name"
SERVICE_KEY = "your-service-key"

# S3 credentials
S3_ACCESS_KEY = "your-access-key"
S3_SECRET_KEY = "your-secret-key"
S3_REGION = "us-east-1"

# Construct endpoints
S3_ENDPOINT = f"https://{PROJECT_REF}.supabase.co/storage/v1/s3"
CATALOG_URI = f"https://{PROJECT_REF}.supabase.co/storage/v1/iceberg"

# Initialize DuckDB connection
conn = duckdb.connect(":memory:")

# Install and load the Iceberg extension
conn.install_extension("iceberg")
conn.load_extension("iceberg")

# Configure Iceberg catalog with Supabase credentials
conn.execute(f"""
    CREATE SECRET (
        TYPE S3,
        KEY_ID '{S3_ACCESS_KEY}',
        SECRET '{S3_SECRET_KEY}',
        REGION '{S3_REGION}',
        ENDPOINT '{S3_ENDPOINT}',
        URL_STYLE 'virtual'
    );
""")

# Configure the REST catalog
conn.execute(f"""
    ATTACH 'iceberg://{CATALOG_URI}' AS iceberg_catalog (
        TYPE ICEBERG_REST,
        WAREHOUSE '{WAREHOUSE}',
        TOKEN '{SERVICE_KEY}'
    );
""")

# Query your Iceberg tables
result = conn.execute("""
    SELECT * FROM iceberg_catalog.default.events
    LIMIT 10
""").fetchall()

for row in result:
    print(row)

# Complex aggregation example
analytics = conn.execute("""
    SELECT
        event_name,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
    FROM iceberg_catalog.default.events
    GROUP BY event_name
    ORDER BY event_count DESC
""").fetchdf()

print(analytics)
```

Key features with DuckDB

Efficient data exploration

DuckDB pushes column projections and filters down to the Iceberg scan, so it reads only the data you need:

```python
# This only reads the columns you select
events = conn.execute("""
    SELECT event_id, event_name, event_timestamp
    FROM iceberg_catalog.default.events
    WHERE event_timestamp > NOW() - INTERVAL '7 days'
""").fetchdf()
```

Converting to Pandas

Convert results to Pandas DataFrames for further analysis:

```python
df = conn.execute("""
    SELECT * FROM iceberg_catalog.default.events
""").fetchdf()

# Use pandas for visualization or further processing
print(df.describe())
```

Exporting results

Save your analytical results to various formats:

```python
# Export to Parquet
conn.execute("""
    COPY (
        SELECT * FROM iceberg_catalog.default.events
    ) TO 'results.parquet'
""")

# Export to CSV
conn.execute("""
    COPY (
        SELECT event_name, COUNT(*) AS count
        FROM iceberg_catalog.default.events
        GROUP BY event_name
    ) TO 'summary.csv' (FORMAT CSV, HEADER true)
""")
```

Best practices

  • Connection pooling - Reuse connections for multiple queries
  • Partition pruning - Filter by partition columns to improve query performance
  • Column selection - Only select columns you need to reduce I/O
  • Limit results - Use LIMIT during exploration to avoid processing large datasets

Next steps