An end-to-end deep learning approach for estimating Digital Surface Models (DSMs) and building heights from single-view aerial imagery using convolutional-deconvolutional neural networks.
This implementation is based on the work presented in:
```bibtex
@article{liu2020im2elevation,
  title={IM2ELEVATION: Building Height Estimation from Single-View Aerial Imagery},
  author={Liu, Chao-Jung and Krylov, Vladimir A and Kane, Paul and Kavanagh, Geraldine and Dahyot, Rozenn},
  journal={Remote Sensing},
  volume={12},
  number={17},
  pages={2719},
  year={2020},
  publisher={MDPI},
  doi={10.3390/rs12172719}
}
```

Please cite this journal paper (available as an Open Access PDF) when using this code or dataset.
IM2ELEVATION addresses the challenging problem of estimating building heights and Digital Surface Models (DSMs) from single-view aerial imagery. This is an inherently ill-posed problem that we solve using deep learning techniques.
- End-to-end trainable architecture: Fully convolutional-deconvolutional network that learns direct mapping from RGB aerial imagery to DSM
- Multi-sensor fusion: Combines aerial optical and LiDAR data for training data preparation
- Registration improvement: Novel registration procedure using Mutual Information and Hough transform validation
- State-of-the-art performance: Validated on high-resolution Dublin dataset and popular DSM estimation benchmarks
The model uses a Squeeze-and-Excitation Network (SENet-154) as the encoder backbone with the following components:
```
Input RGB (3 channels) → SENet-154 Encoder → Decoder + Multi-Feature Fusion → Refinement → DSM Output (1 channel)
```
- Encoder (E_senet): SENet-154 pretrained on ImageNet, extracts hierarchical features at 5 different scales
- Decoder (D2): Up-projection blocks with skip connections for spatial resolution recovery
- Multi-Feature Fusion (MFF): Fuses features from all encoder blocks at the same spatial resolution
- Refinement Module (R): Final convolutional layers for output refinement
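A simplified sketch of how these components compose in the forward pass, following the module interfaces of `models/net.py` (the real modules take extra arguments, such as the MFF target size):

```python
import torch
import torch.nn as nn

class IM2ELEVATION(nn.Module):
    """Illustrative composition of the components listed above."""

    def __init__(self, E, D2, MFF, R):
        super().__init__()
        self.E, self.D2, self.MFF, self.R = E, D2, MFF, R

    def forward(self, x):                # x: (B, 3, 440, 440) RGB tiles
        blocks = self.E(x)               # five feature maps at decreasing scales
        x_decoder = self.D2(*blocks)     # up-projection with skip connections
        x_mff = self.MFF(*blocks)        # all encoder scales fused at one resolution
        return self.R(torch.cat((x_decoder, x_mff), dim=1))  # (B, 1, H, W) DSM
```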
- Input Size: 440×440 RGB images
- Output Size: Variable (typically 220×220 or 440×440 for DSM)
- Loss Function: Combined loss with depth, gradient, and surface normal terms
- Training: Adam optimizer with learning rate 0.0001, batch size 2
The model uses a sophisticated multi-term loss function:
```
L_total = L_depth + L_normal + L_dx + L_dy
```

Where:

- L_depth: `log(|output - depth| + 0.5)`, the direct depth estimation loss
- L_normal: `|1 - cos(output_normal, depth_normal)|`, a surface normal consistency term
- L_dx, L_dy: gradient losses in the x and y directions for edge preservation
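A hedged PyTorch sketch of this combined loss, assuming the gradient operator comes from `sobel.py` and is passed in as `get_gradient` (tensor shapes and the exact normal construction may differ slightly from `train.py`):

```python
import torch

def combined_loss(output, depth, get_gradient):
    """Sketch of the multi-term loss described above.

    output, depth: (B, 1, H, W) tensors; get_gradient returns (B, 2, H, W)
    Sobel responses (dx, dy), as provided by sobel.py in this repository.
    """
    cos = torch.nn.CosineSimilarity(dim=1)
    ones = torch.ones_like(depth)

    output_grad, depth_grad = get_gradient(output), get_gradient(depth)
    output_dx, output_dy = output_grad[:, 0:1], output_grad[:, 1:2]
    depth_dx, depth_dy = depth_grad[:, 0:1], depth_grad[:, 1:2]

    # Surface normals built from the gradients: n = (-dx, -dy, 1)
    output_normal = torch.cat((-output_dx, -output_dy, ones), dim=1)
    depth_normal = torch.cat((-depth_dx, -depth_dy, ones), dim=1)

    loss_depth = torch.log(torch.abs(output - depth) + 0.5).mean()
    loss_dx = torch.log(torch.abs(output_dx - depth_dx) + 0.5).mean()
    loss_dy = torch.log(torch.abs(output_dy - depth_dy) + 0.5).mean()
    loss_normal = torch.abs(1 - cos(output_normal, depth_normal)).mean()

    return loss_depth + loss_normal + loss_dx + loss_dy
```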
The IM2ELEVATION framework employs a dataset-agnostic normalization pipeline that works seamlessly with any DSM scale, whether in meters, feet, or any other unit system.
```
Original DSM → [×1000] → [÷100000] → Model Input → Model Output → [×100] → Restored DSM
     ↓            ↓           ↓            ↓             ↓            ↓
 Any Units    Precision   Normalize   Train/Test    Raw Output   Original
              Retention  for Training    Scale                     Units
```
1. Data Loading (`loaddata.py`):

   ```python
   depth = cv2.imread(depth_name, -1)
   depth = (depth * 1000).astype(np.uint16)  # Multiply by 1000 for precision retention
   ```

2. Normalization (`nyu_transform.py`):

   ```python
   depth = self.to_tensor(depth) / 100000  # Divide by 100000 for model training
   ```

3. Net Transformation Effect: `Original_DSM × 1000 ÷ 100000 = Original_DSM ÷ 100`

4. Model Training/Testing:
   - The model learns to predict in the normalized space (`Original_DSM ÷ 100`)
   - All internal computations use this normalized scale

5. Prediction Restoration (`test.py`):

   ```python
   # Universal restoration formula (works for any dataset)
   pred_array = output[j, 0].cpu().detach().numpy() * 100
   ```
- ✅ Dataset Agnostic: Works with Dublin (meters), DFC2019 (any unit), DFC2023 (any unit), or custom datasets
- ✅ No Hardcoding: No need to specify dataset-specific ranges or units
- ✅ Automatic Scaling: Restores predictions to the original DSM scale regardless of input units
- ✅ Consistent Pipeline: Same approach for training, testing, and evaluation phases
```
If Original_DSM = X units (meters, feet, etc.)
After normalization:  X × 1000 ÷ 100000 = X ÷ 100
Model predicts:       X ÷ 100
After restoration:    (X ÷ 100) × 100 = X units (original scale restored)
```
- Precision Retention: ×1000 prevents small floating-point DSM values from being rounded to zero when stored as uint16 integers in image files
- Training Stability: ÷100000 creates normalized values suitable for neural network training
- Perfect Reversibility: ×100 exactly reverses the net ÷100 effect
- Unit Independence: The process works identically regardless of the original measurement units
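A minimal NumPy sketch of the round trip (illustrative values; in the real pipeline these steps live in `loaddata.py`, `nyu_transform.py`, and `test.py` as shown above):

```python
import numpy as np

# Illustrative DSM tile in arbitrary units. Note that values must stay
# below 65.535 after the x1000 step to fit in uint16.
original = np.array([[2.5, 10.0], [0.5, 45.2]], dtype=np.float32)

stored = (original * 1000).astype(np.uint16)     # loaddata.py: precision retention
normalized = stored.astype(np.float32) / 100000  # nyu_transform.py: training scale
restored = normalized * 100                      # test.py: restoration

# The round trip holds up to uint16 quantization (~0.001 units).
assert np.allclose(restored, original, atol=1e-3)
```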
```bash
# Create conda environment with modern dependencies
mamba env create -f tools/environment_setup/environment_modern.yml
conda activate im2elevation

# Install additional packages
pip install pytorch-ssim tensorboard --no-cache-dir
```

```bash
# Training with auto-configuration
./run_train.sh --dataset DFC2019_crp512_bin --epochs 100

# Testing with comprehensive options
./run_test.sh --dataset DFC2019_crp512_bin --batch-size 1

# Complete evaluation pipeline
./run_eval.sh --dataset DFC2019_crp512_bin --force-regenerate
```

```bash
# Training
python train.py --data pipeline_output/dataset_name --csv dataset/train_dataset.csv --epochs 100

# Testing with uint16 conversion (original IM2ELEVATION format)
python test.py --model pipeline_output/dataset_name --csv dataset/test_dataset.csv --uint16-conversion

# Evaluation
python evaluate.py --predictions-dir pipeline_output/dataset_name/predictions --csv-file dataset/test_dataset.csv
```

- `--uint16-conversion`: Enables the original depth format, `depth = (depth * 1000).astype(np.uint16)`
- `--disable-normalization`: Bypasses the normalization pipeline for raw model analysis

- Multi-GPU Compatibility: Seamless checkpoint loading between single- and multi-GPU configurations
- Auto-resume: Intelligent checkpoint detection and resume functionality
- Unified Directory Structure: Consistent `pipeline_output/` organization across all scripts
- Flexible Configuration: Comprehensive command-line options for all training/testing scenarios
The metrics reported during the test phase (end of each epoch) are NOT the same as the final evaluation results. For authentic regression evaluation, always use the dedicated evaluation phase.
- Purpose: Quick progress monitoring during training/testing
- Implementation: PyTorch tensor-based computation with batch processing
- Preprocessing: Applied in the `util.evaluateError()` function
- Averaging: Pixel-weighted averaging across all batches
- Metrics: MSE, RMSE, MAE, SSIM computed per batch, then averaged
- Purpose: Comprehensive, publication-ready evaluation results
- Implementation: NumPy array-based computation with image-by-image processing
- Preprocessing: Consistent with the test phase, but applied image-by-image rather than per batch
- Averaging: Image-weighted averaging (each image contributes equally)
- Metrics: Extended metrics including building-type-specific RMSE and delta metrics
- Different Libraries: PyTorch (test) vs NumPy (evaluation)
- Averaging Methods: Pixel-weighted vs image-weighted
- Processing Pipeline: Batch processing vs individual image processing
- Numerical Precision: Different floating-point handling
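To see why the two averaging schemes disagree, consider a toy example (illustrative only; `util.evaluateError()` and `evaluate.py` contain the actual implementations):

```python
import numpy as np

# Two illustrative sets of per-pixel squared errors of different sizes.
errors_small = np.full(100, 4.0)    # small image (or batch), high error
errors_large = np.full(10000, 1.0)  # large image (or batch), low error

# Pixel-weighted (test phase): every pixel counts equally.
pixel_weighted_mse = np.concatenate([errors_small, errors_large]).mean()

# Image-weighted (evaluation phase): every image counts equally.
image_weighted_mse = np.mean([errors_small.mean(), errors_large.mean()])

print(pixel_weighted_mse)  # ~1.03, dominated by the larger image
print(image_weighted_mse)  # 2.50, both images contribute equally
```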
- Use test phase metrics for training progress monitoring and model comparison during development
- Use evaluation phase metrics for final results, publications, and authentic model assessment
- MSE: Mean Squared Error - Overall prediction accuracy
- RMSE: Root Mean Squared Error - Standard deviation of prediction errors
- MAE: Mean Absolute Error - Average absolute prediction error
- Low-rise RMSE: Buildings 1-15m height
- Mid-rise RMSE: Buildings 15-40m height
- High-rise RMSE: Buildings >40m height
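A minimal sketch of the building-type RMSE computation using ground-truth height bins (bin edges from the list above; `evaluate.py` is the authoritative implementation):

```python
import numpy as np

def binned_rmse(pred, gt, lo, hi):
    """RMSE over pixels whose ground-truth height falls in [lo, hi)."""
    mask = (gt >= lo) & (gt < hi)
    if not mask.any():
        return np.nan
    return np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2))

# Illustrative random data; pred and gt would be per-image DSM arrays in meters.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 60, size=(220, 220))
pred = gt + rng.normal(0, 2, size=gt.shape)

low_rise = binned_rmse(pred, gt, 1, 15)        # buildings 1-15m
mid_rise = binned_rmse(pred, gt, 15, 40)       # buildings 15-40m
high_rise = binned_rmse(pred, gt, 40, np.inf)  # buildings >40m
```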
- δ₁: Percentage of pixels with `max(pred/gt, gt/pred) < 1.25`
- δ₂: Percentage of pixels with `max(pred/gt, gt/pred) < 1.25²`
- δ₃: Percentage of pixels with `max(pred/gt, gt/pred) < 1.25³`
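And a corresponding sketch for the δ accuracy thresholds (assumes strictly positive heights; zero-valued pixels would need masking in practice):

```python
import numpy as np

def delta_metrics(pred, gt, eps=1e-6):
    """Fraction of pixels within the 1.25**k ratio thresholds."""
    ratio = np.maximum(pred / (gt + eps), gt / (pred + eps))
    return tuple(float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3))

d1, d2, d3 = delta_metrics(np.array([10.0, 20.0]), np.array([11.0, 30.0]))
# max(10/11, 11/10) = 1.1 < 1.25, so the first pixel counts toward delta_1;
# max(20/30, 30/20) = 1.5 counts only toward delta_2 and delta_3.
```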
Download the trained SENet-154 model weights: Download Trained Weights
High-resolution dataset captured over central Dublin, Ireland:
- LiDAR point cloud: 2015
- Optical aerial images: 2017
- Coverage: Central Dublin area with building height annotations
- Registration: Mutual Information-based alignment of optical and LiDAR data
- Validation: Hough transform-based validation to detect registration failures
- Adjustment: Interpolation-based correction of failed registration patches
- Augmentation: Standard data augmentation techniques during training
The implementation supports three encoder backbones:
- ResNet-50: `define_model(is_resnet=True, is_densenet=False, is_senet=False)`
- DenseNet-161: `define_model(is_resnet=False, is_densenet=True, is_senet=False)`
- SENet-154 (Recommended): `define_model(is_resnet=False, is_densenet=False, is_senet=True)`
- Multi-GPU Support: DataParallel training on multiple GPUs
- Skip Connections: Enhanced feature propagation from encoder to decoder
- Up-projection Blocks: Learnable upsampling with feature refinement
- Multi-scale Fusion: Features from all encoder levels combined for rich representation
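A hedged sketch combining backbone selection with the multi-GPU support noted above (assumes `define_model` is importable from the training script; adjust the import to the actual entry point):

```python
import torch
from train import define_model  # assumed location of define_model

# Build the recommended SENet-154 variant.
model = define_model(is_resnet=False, is_densenet=False, is_senet=True)

# Wrap for multi-GPU training when several devices are available.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model = model.cuda()
```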
```
IM2ELEVATION/
├── models/
│   ├── net.py           # Main model definition
│   ├── modules.py       # Network components (Encoder, Decoder, MFF, Refinement)
│   ├── senet.py         # SENet backbone implementation
│   ├── resnet.py        # ResNet backbone implementation
│   └── densenet.py      # DenseNet backbone implementation
├── train.py             # Training script
├── test.py              # Testing and evaluation script
├── loaddata.py          # Data loading utilities
├── util.py              # Utility functions for evaluation
├── sobel.py             # Sobel edge detection for gradient loss
├── splitGeoTiff.py      # Geospatial data processing
└── tools/
    ├── environment_setup/  # Environment setup tools
    ├── git_setup/          # Git configuration
    └── pdf_processor/      # Document processing tools
```
- Evaluation Pipeline - Comprehensive guide to the evaluation system and metrics
- Scripts Documentation - Shell scripts usage and features
- GPU Issues & Solutions - Complete guide to GPU-related problems and fixes
- GPU Fixes Summary - Quick summary of recent GPU utilization improvements
- Original Paper - Markdown version of the research paper
- Paper PDF - Full research paper
- Environment Setup - Complete environment installation guide
- Git Setup - Git configuration for the project
- OS: Linux (tested on Ubuntu)
- Memory: 16GB+ RAM recommended
- Storage: 10GB+ free space
- GPU: NVIDIA GPU with CUDA support (recommended)
- Dependencies: Python 3.11, PyTorch 2.4.0, CUDA 12.1
The model achieves state-of-the-art performance on:
- Dublin Dataset: High-resolution validation with 2015 LiDAR and 2017 optical imagery
- Standard Benchmarks: Competitive results on popular DSM estimation datasets
- Evaluation Metrics: MSE, RMSE, MAE, SSIM for comprehensive assessment
We welcome contributions! Please see our environment setup guide in tools/environment_setup/ for development setup instructions.
For questions about the implementation or dataset, please refer to the original paper or create an issue in this repository.
This implementation is forked from speed8928/IMELE with significant enhancements including:
- Enhanced shell script pipeline with auto-configuration
- Original IM2ELEVATION uint16 conversion compatibility
- Multi-GPU checkpoint compatibility and improved error handling
- Comprehensive evaluation system with detailed metrics reporting
Note: This implementation focuses on urban building height estimation. For optimal results, ensure your aerial imagery has similar characteristics to the training data (resolution, viewing angle, urban scenes).