A Python-based benchmarking tool to compare query performance and cost between Snowflake and Databricks using the TPC-H SF1000 (1TB) dataset.
This benchmark executes all 22 TPC-H queries against both Snowflake and Databricks, measuring execution time, cost (credits/DBUs), and detailed performance metrics. Results are stored in DuckDB for analysis and comparison using dbt-generated views.
The tool supports five benchmarking scenarios:

- **Normal (Sequential)**
  - Executes all 22 queries sequentially on a warm warehouse
  - Warehouse remains running throughout the entire run
  - The first query experiences a cold start; subsequent queries benefit from the warm warehouse
  - Default mode for comprehensive performance testing
- **Coldstart**
  - Warehouse suspended/stopped between each query
  - Each query experiences the full cold start overhead (warehouse startup time)
  - Tests worst-case performance when no cache is available
  - Defaults to queries 1, 3, 5, 10, 18 (configurable)
  - Useful for understanding cold cache performance
- **Concurrent**
  - All 22 queries executed in parallel on the same warehouse
  - Tests warehouse concurrency and resource contention
  - Measures performance degradation under concurrent load
  - Uses multi-cluster warehouses to handle parallel execution
  - Defaults to all 22 queries (configurable)
- **CTAS (Create Table As Select)**
  - Benchmarks write performance by creating tables with different data shapes
  - Tests data ingestion and table creation performance
  - Five variants with different row counts and column widths (one variant is sketched after this list):
    - `narrow_tall`: Many rows, few columns (~6B rows, minimal columns)
    - `standard_tall`: Many rows, standard columns (~6B rows)
    - `medium_wide`: Moderate rows, more columns
    - `very_wide`: Fewer rows, many columns (wide denormalized table)
    - `filtered`: Subset of data with filters applied
  - Tables are automatically dropped after benchmark completion
- **DML (Delete + Insert)**
  - Benchmarks data modification performance using DELETE and INSERT operations
  - Simulates a partition refresh pattern: delete a monthly slice, then re-insert it (sketched after this list)
  - Operations:
    - `delete`: Delete June 1995 lineitem data (~7.5M rows)
    - `insert`: Insert June 1995 lineitem data from the source table (~7.5M rows)
  - Tests real-world data update patterns common in data warehouses
  - Requires a `lineitem_dml` table (a copy of `lineitem`), which is created automatically
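For a concrete picture of the two write scenarios, here is illustrative SQL for one CTAS variant and for the DML partition refresh, wrapped as Python strings. This is a sketch only: the column list, the `l_shipdate` date filter, and the target table names are assumptions, and the statements the tool actually issues may differ per platform.

```python
# Illustrative SQL for the CTAS and DML scenarios described above, as Python
# string constants. Column choices, the date filter, and table names are
# assumptions; the benchmark's real statements may differ per platform.

# CTAS, narrow_tall variant: many rows, few columns.
CTAS_NARROW_TALL = """
CREATE TABLE ctas_narrow_tall AS
SELECT l_orderkey, l_partkey, l_quantity
FROM lineitem
"""

# DML scenario: delete one monthly slice from lineitem_dml, then re-insert it
# from the source lineitem table (the partition-refresh pattern).
DML_DELETE = """
DELETE FROM lineitem_dml
WHERE l_shipdate >= DATE '1995-06-01' AND l_shipdate < DATE '1995-07-01'
"""

DML_INSERT = """
INSERT INTO lineitem_dml
SELECT * FROM lineitem
WHERE l_shipdate >= DATE '1995-06-01' AND l_shipdate < DATE '1995-07-01'
"""
```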
Use `--scenario normal`, `--scenario coldstart`, `--scenario concurrent`, `--scenario ctas`, or `--scenario dml` to run a single scenario, or `--scenario all` (the default) to run all scenarios with a unified run ID.
If you do not have `uv` installed, run `curl -LsSf https://astral.sh/uv/install.sh | sh`. See the uv docs for details and troubleshooting.
uv sync

Note: `uv` will automatically install the Snowflake and Databricks CLI tools for you.
Run the automated setup script from the project root:
uv run setup_config.py

This interactive script will:
- ✓ Prompt for your Snowflake connection name
- ✓ Prompt for Databricks workspace credentials
- ✓ Discover available catalogs and schemas
- ✓ Generate your `.env` file automatically
Note: SQL Warehouses are created and destroyed automatically during benchmark runs. No manual warehouse configuration needed.
Warehouse Naming: Warehouses are named with the pattern `{PREFIX}_{SIZE}_{SCENARIO}_{RUN_ID}` to prevent conflicts when running multiple scenarios. For example: `BENCHMARK_WH_MEDIUM_NORMAL_001` and `BENCHMARK_WH_MEDIUM_COLDSTART_001`.
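For illustration, the naming scheme amounts to a helper along these lines (a sketch only; the function is hypothetical, not the tool's actual code):

```python
def warehouse_name(prefix: str, size: str, scenario: str, run_id: str) -> str:
    """Build a warehouse name following {PREFIX}_{SIZE}_{SCENARIO}_{RUN_ID}."""
    return f"{prefix}_{size}_{scenario}_{run_id}".upper()

# warehouse_name("BENCHMARK_WH", "medium", "normal", "001")
# -> "BENCHMARK_WH_MEDIUM_NORMAL_001"
```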
If you prefer to configure manually, copy `.env.example` to `.env` and set all user-specific environment variables for your environment.
Ensure you have the Snowflake CLI installed and configured.
First, add your Snowflake CLI connection:
snow connection add --connection <your-connection-name>
snow connection test --connection <your-connection-name>

Then configure these variables in `.env`:
- `SNOWFLAKE_CONNECTION` - Your Snowflake CLI connection name (required)
- `SNOWFLAKE_ROLE` - Role to use for the benchmark (default: `BENCHMARK`)
- `SNOWFLAKE_DATABASE` - Database for benchmark tables (default: `BENCHMARK`)
- `SNOWFLAKE_SCHEMA` - Schema for benchmark tables (default: `PUBLIC`)
- `SNOWFLAKE_WAREHOUSE_PREFIX` - Prefix for warehouse names (default: `BENCHMARK_WH`)
  - Full warehouse names will be `{PREFIX}_small`, `{PREFIX}_medium`, `{PREFIX}_xlarge`, with run IDs appended
Example:
export SNOWFLAKE_CONNECTION=my_connection
export SNOWFLAKE_ROLE=BENCHMARK
export SNOWFLAKE_DATABASE=MY_BENCHMARK_DB
export SNOWFLAKE_SCHEMA=PUBLIC
export SNOWFLAKE_WAREHOUSE_PREFIX=MY_WH

IMPORTANT: TPC-H SF1000 Data Requirement
Unlike Snowflake (which provides SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000), Databricks does not include a pre-loaded TPC-H SF1000 (1TB) dataset. You must ETL the TPC-H data into your Databricks catalog before running benchmarks.
The benchmark expects these tables in your catalog/schema:
`customer`, `lineitem`, `nation`, `orders`, `part`, `partsupp`, `region`, `supplier`
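Once the data is loaded and the `DATABRICKS_*` variables below are set, you can optionally sanity-check that all eight tables are visible. A minimal sketch using the `databricks-sql-connector` package; `DATABRICKS_HTTP_PATH` is an extra variable introduced here for illustration only and should point at any running SQL warehouse (the benchmark's own warehouses are created on the fly):

```python
# Optional sanity check (not part of the benchmark tool): verify the eight
# TPC-H tables exist in your catalog/schema. Requires databricks-sql-connector.
import os
from databricks import sql

EXPECTED = {"customer", "lineitem", "nation", "orders",
            "part", "partsupp", "region", "supplier"}

with sql.connect(
    server_hostname=os.environ["DATABRICKS_HOST"].removeprefix("https://"),
    http_path=os.environ["DATABRICKS_HTTP_PATH"],  # assumption: HTTP path of any running SQL warehouse
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn, conn.cursor() as cur:
    cur.execute(
        f"SHOW TABLES IN {os.environ['DATABRICKS_CATALOG']}.{os.environ['DATABRICKS_SCHEMA']}"
    )
    found = {row[1] for row in cur.fetchall()}  # column 1 of SHOW TABLES output is tableName
    missing = EXPECTED - found
    print("Missing tables:", ", ".join(sorted(missing)) or "none")
```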
Configure these variables in .env:
- `DATABRICKS_HOST` - Your Databricks workspace URL (required)
  - Format: `https://dbc-xxxxxxxxx.cloud.databricks.com`
- `DATABRICKS_TOKEN` - Your Databricks personal access token (required)
- `DATABRICKS_CATALOG` - Catalog containing your TPC-H SF1000 tables (required, user-specific)
- `DATABRICKS_SCHEMA` - Schema containing your TPC-H SF1000 tables (required, user-specific)
Note: SQL Warehouses are created and destroyed automatically during benchmark runs. No pre-configuration of warehouses needed.
Example:
export DATABRICKS_HOST=https://dbc-abc123.cloud.databricks.com
export DATABRICKS_TOKEN=dapi_abc123xyz789
export DATABRICKS_CATALOG=your_tpch_catalog
export DATABRICKS_SCHEMA=sf1000

# Run all scenarios with default settings (medium warehouse)
uv run main.py

| Flag | Options | Description |
|---|---|---|
| `--warehouse-size` | `small`, `medium`, `large`, `xl`, `2xl`, `all`, or comma-separated | Warehouse size(s) to use. Automatically maps to platform-specific sizes (see the lookup sketch after this table): • `small`: Snowflake Small / Databricks Small • `medium`: Snowflake Medium / Databricks Small (default) • `large`: Snowflake Large / Databricks Medium • `xl`: Snowflake XLarge / Databricks Large • `2xl`: Snowflake 2XLarge / Databricks XLarge • `all`: run all sizes • comma-separated: e.g., `medium,xl` |
| `--scenario` | `normal`, `coldstart`, `concurrent`, `ctas`, `dml`, `all` | Benchmark scenario to run: • `normal`: sequential queries on a warm warehouse • `coldstart`: warehouse suspended between each query (defaults to queries 1,3,5,10,18 if not specified) • `concurrent`: all queries executed in parallel on the same warehouse • `ctas`: Create Table As Select benchmarks with multiple data shapes • `dml`: DELETE + INSERT operations simulating a partition refresh • `all`: run all scenarios with a unified run ID (default) |
| `--queries` | e.g., `1,2,3` or `1-5` | Specific queries to run (default: all 22 TPC-H queries) |
| `--ctas-variants` | comma-separated list | CTAS variants to run (default: all). Options: `narrow_tall`, `standard_tall`, `medium_wide`, `very_wide`, `filtered` |
| `--snowflake-only` | (flag) | Run only the Snowflake benchmark (skip Databricks) |
| `--databricks-only` | (flag) | Run only the Databricks benchmark (skip Snowflake) |
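For reference, the size mapping described in the table above boils down to a simple lookup (illustrative only; the tool's internal representation may differ):

```python
# Illustrative lookup mirroring the --warehouse-size mapping in the table above.
# The names are the human-readable sizes from the table, not necessarily the
# exact identifiers the tool passes to each platform's API.
WAREHOUSE_SIZE_MAP = {
    "small":  {"snowflake": "Small",   "databricks": "Small"},
    "medium": {"snowflake": "Medium",  "databricks": "Small"},   # default
    "large":  {"snowflake": "Large",   "databricks": "Medium"},
    "xl":     {"snowflake": "XLarge",  "databricks": "Large"},
    "2xl":    {"snowflake": "2XLarge", "databricks": "XLarge"},
}
```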
# Run all scenarios with default medium warehouse
uv run main.py
# Run all scenarios with ALL warehouse sizes (small, medium, large)
uv run main.py --warehouse-size all
# Run with specific warehouse sizes (comma-separated)
uv run main.py --warehouse-size small,large
# Run with large warehouse only
uv run main.py --warehouse-size large
# Run specific queries (all scenarios)
uv run main.py --queries 1,2,3
uv run main.py --queries 1-5
# Run ONLY normal scenario (warm warehouse)
uv run main.py --scenario normal
# Run ONLY cold start scenario (warehouse suspended between queries)
uv run main.py --scenario coldstart
# Run ONLY concurrent scenario (all queries in parallel)
uv run main.py --scenario concurrent
# Explicitly run all scenarios (same as default)
uv run main.py --scenario all
# Run ALL scenarios with ALL warehouse sizes (the full matrix!)
uv run main.py --scenario all --warehouse-size all
# Run cold start with specific queries
uv run main.py --scenario coldstart --queries 1,5,10
# Run concurrent with specific queries
uv run main.py --scenario concurrent --queries 1-10
# Run only Databricks
uv run main.py --databricks-only
# Run only Snowflake
uv run main.py --snowflake-only
# Combine flags: large warehouse, specific queries, all scenarios
uv run main.py --warehouse-size large --queries 1-10 --scenario all
# Run CTAS scenario with all variants
uv run main.py --scenario ctas
# Run CTAS with specific variants (skip very_wide)
uv run main.py --scenario ctas --ctas-variants narrow_tall,standard_tall,medium_wide,filtered
# Run CTAS on multiple warehouse sizes
uv run main.py --scenario ctas --warehouse-size medium,large,xl
# Run CTAS on Snowflake small only, skipping very_wide
uv run main.py --scenario ctas --snowflake-only --warehouse-size small --ctas-variants narrow_tall,standard_tall,medium_wide,filtered
# Run CTAS on Databricks XL only (uses 2xl mapping)
uv run main.py --scenario ctas --databricks-only --warehouse-size 2xl
# Run DML scenario with default medium warehouse
uv run main.py --scenario dml
# Run DML on multiple warehouse sizes
uv run main.py --scenario dml --warehouse-size medium,large,xl
# Run DML on Snowflake only with large warehouse
uv run main.py --scenario dml --snowflake-only --warehouse-size large
# Run DML on Databricks only with XL warehouse
uv run main.py --scenario dml --databricks-only --warehouse-size xl

Both Snowflake and Databricks collect detailed cost and performance metrics in system tables, but this data is not immediately available. After running benchmarks, enrich all unenriched queries in the DuckDB database.
Run at least 1-2 hours after benchmark completion:
uv run enrich.py

This unified enrichment script runs the following in order:
- Snowflake query enrichment - data from `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY`
- Snowflake warehouse usage - data from `SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY`
- Databricks warehouse usage - data from `system.billing.usage`
Snowflake enrichment adds:
- Query compilation time
- Queue time (provisioning + repair + overload)
- Bytes scanned
- Cloud services credits used
- Total elapsed time (server-side)
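For reference, the fields above can be inspected manually with an ACCOUNT_USAGE query along these lines (a sketch; `enrich.py` may issue different SQL, and the `query_tag` filter is an assumption about how benchmark queries are identified):

```python
# Sketch of an ACCOUNT_USAGE lookup covering the fields listed above. Run the
# SQL in a Snowflake worksheet or via `snow sql -q "..."`. The query_tag filter
# is an assumption; enrich.py may match queries differently (e.g., by query_id).
SNOWFLAKE_ENRICHMENT_SQL = """
SELECT query_id,
       compilation_time,
       queued_provisioning_time + queued_repair_time + queued_overload_time AS queued_time_ms,
       bytes_scanned,
       credits_used_cloud_services,
       total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
  AND query_tag ILIKE '%benchmark%'
"""
```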
Databricks enrichment adds:
- Warehouse usage data for DBU cost approximation
- Approximate DBU cost per query (proportionally distributed from warehouse-hour costs)
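The proportional distribution amounts to the following (a sketch of the approximation, not the tool's exact code):

```python
def allocate_dbu_cost(query_runtimes_s: dict[str, float],
                      warehouse_dbus: float) -> dict[str, float]:
    """Approximate per-query DBU cost by splitting the warehouse's metered DBUs
    across queries in proportion to each query's share of total runtime."""
    total = sum(query_runtimes_s.values())
    return {q: warehouse_dbus * t / total for q, t in query_runtimes_s.items()}

# Example: a warehouse metered at 4.0 DBUs ran Q1 for 120 s and Q6 for 40 s
# -> Q1 is charged 3.0 DBUs, Q6 is charged 1.0 DBU.
```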
Important Notes:
- Timing: Snowflake data becomes available after about 45 minutes; Databricks data may take 1-2 hours
- Databricks costs: DBU costs are approximations (proportionally distributed from warehouse-hour costs)
- Snowflake costs: Exact per-query credit tracking
- Permissions required:
  - Snowflake: `SELECT` on `SNOWFLAKE.ACCOUNT_USAGE.*`
  - Databricks: `SELECT` on `system.query.history` and `system.billing.usage`
After running benchmarks, view comparison reports using the dbt-generated views.
Database Schema:
- Database file: `benchmark_results.duckdb` (in the project root)
- Schema: `benchmark_results.main`
- Raw tables: `snowflake_results`, `databricks_results`
- Analysis views: generated by dbt (see below)
# Build/refresh all analysis views
cd common/transformations
uvx dbt build

Command-line:
# Query results for normal scenario (warm warehouse)
duckdb benchmark_results.duckdb -c "SELECT * FROM platform_comparison_normal;"
# Query results for coldstart scenario (suspended warehouse)
duckdb benchmark_results.duckdb -c "SELECT * FROM platform_comparison_coldstart;"
# Query results for concurrent scenario (parallel execution)
duckdb benchmark_results.duckdb -c "SELECT * FROM platform_comparison_concurrent;"
# Query results for DML scenario (delete + insert operations)
duckdb benchmark_results.duckdb -c "SELECT * FROM platform_comparison_dml;"
# Query latest run (all scenarios)
duckdb benchmark_results.duckdb -c "SELECT * FROM platform_comparison_latest;"GUI Tool (Recommended):
Use DBeaver for interactive querying and visualization:
- Download and install DBeaver
- Create a new DuckDB connection
- Point it to `benchmark_results.duckdb` in the project root
- Query the views with a full SQL editor and export capabilities
Available Views:
- `platform_comparison_normal` - Latest normal scenario (sequential queries, warm warehouse)
- `platform_comparison_coldstart` - Latest coldstart scenario (warehouse suspended between queries)
- `platform_comparison_concurrent` - Latest concurrent scenario (all queries in parallel)
- `platform_comparison_dml` - Latest DML scenario (delete + insert operations)
- `platform_comparison_latest` - Latest run with all scenarios combined
See common/transformations/README.md for detailed documentation on the analysis views.
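The same views can also be queried from Python instead of the duckdb CLI; a minimal sketch using the `duckdb` package:

```python
# Minimal sketch: query a dbt-generated comparison view from Python.
# Requires the `duckdb` Python package.
import duckdb

con = duckdb.connect("benchmark_results.duckdb", read_only=True)
rows = con.execute("SELECT * FROM platform_comparison_latest LIMIT 10").fetchall()
for row in rows:
    print(row)
con.close()
```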
After running benchmarks and enriching data, you can launch an interactive React dashboard to explore and compare results visually.
Export the latest results from DuckDB to JSON format:
uv run visualization/update_data.py

This script:
- Queries the `run_summary_agg` and `query_details` dbt views
- Exports formatted data to `visualization/src/data/benchmarkData.json`
- Includes scenario summaries and per-query details with proportional cost allocation
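Conceptually, the export is a DuckDB-to-JSON dump along these lines (a simplified sketch; the real `update_data.py` applies additional formatting and cost allocation, and the JSON keys shown here are assumptions, not the file's actual schema):

```python
# Simplified sketch of the export step: read the two dbt views from DuckDB and
# write them to the JSON file the dashboard loads. Key names are assumptions.
import json
import duckdb

con = duckdb.connect("benchmark_results.duckdb", read_only=True)
summary = con.execute("SELECT * FROM run_summary_agg").fetchdf()   # requires pandas
details = con.execute("SELECT * FROM query_details").fetchdf()

payload = {
    "runSummary": summary.to_dict(orient="records"),
    "queryDetails": details.to_dict(orient="records"),
}

with open("visualization/src/data/benchmarkData.json", "w") as f:
    json.dump(payload, f, indent=2, default=str)  # default=str handles timestamps/decimals
```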
Install dependencies and start the development server:
cd visualization
npm install
npm run dev

The dashboard will open at http://localhost:5173 (or another port if 5173 is in use).
Features:
- Interactive comparison charts for all scenarios
- Query-level performance breakdowns
- Cost comparisons (Snowflake credits vs Databricks DBUs)
- Filterable by scenario and warehouse tier
- Responsive design with detailed tooltips
Note: The dashboard uses static JSON data. Re-run uv run visualization/update_data.py whenever you want to refresh with new benchmark results.
- Python 3.x with the `uv` package manager
- Snowflake account with access to `SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000` (pre-loaded sample data)
- Databricks workspace with the TPC-H SF1000 dataset in Delta Lake
  - Note: TPC-H SF1000 data is NOT pre-loaded in Databricks - you must ETL it yourself
  - Set `DATABRICKS_CATALOG` to the catalog containing the TPC-H tables
  - Required tables: `customer`, `lineitem`, `nation`, `orders`, `part`, `partsupp`, `region`, `supplier`
- `main.py` - Main benchmark execution script
- `snowflake/` - Snowflake benchmark implementation
- `databricks/` - Databricks benchmark implementation
- `common/` - Shared utilities and transformations
  - `transformations/` - dbt models for analysis views
- `tests/` - Test suite for benchmark logic
- `project_plan.md` - Detailed implementation plan
- `CLAUDE.md` - Development guidelines
Run the test suite to verify benchmark logic and ensure code quality:
# Run all tests with test runner script
./run_tests.sh
# Or run manually with pytest
uv run pytest tests/ -v

The test suite includes 34 tests covering:
- Warehouse manager lifecycle (create, destroy, suspend/resume)
- Query executor logic and metrics collection
- Scenario integration (normal, coldstart, all)
- Run type classification (cold, semi-warm, warm)
Tests use mocks and do not require database connections.