Keypoint Detection


Overview

In this project, I explore three different methods for facial keypoint detection using deep learning:

  1. Direct Coordinate Regression
  2. Transfer Learning: Leveraging ResNet18 and DINO
  3. Heatmap-based Keypoint Localization

Our dataset consists of grayscale facial images paired with annotated keypoints. Each image is of size 224 x 224, containing a single face with 68 annotated keypoints. Each keypoint is represented by its (x, y) coordinate in the 2D image space.

The goal is to predict the locations of all 68 keypoints given only the input facial image.

Below is a sample visualization of the dataset:

Keypoint Visualization

Part 1: Direct Coordinate Regression

Method

I first experimented with the most straightforward approach: directly regressing the coordinates of each facial keypoint. The model consists of a series of convolutional layers to extract features from the input image, followed by fully connected layers to predict the keypoints.

Given an input image, the network outputs a vector $\mathbf{y} \in \mathbb{R}^{2K}$, where $K = 68$ is the number of facial keypoints. Each keypoint is represented by its $(x, y)$ coordinates. I use the Smooth L1 loss between the predicted keypoints $\mathbf{y}$ and the ground truth keypoints $\mathbf{y}^*$:

$$ \mathcal{L}(\mathbf{y}, \mathbf{y}^*) = \frac{1}{2K} \sum_{i=1}^{2K} \begin{cases} 0.5\,(y_i - y_i^*)^2 & \text{if } |y_i - y_i^*| < 1 \\ |y_i - y_i^*| - 0.5 & \text{otherwise} \end{cases} $$

Model Architecture

The CNN architecture consists of multiple convolutional blocks with batch normalization and ELU activations, followed by a dropout layer and fully connected layers. The final output is a flattened vector of size 136 (68 keypoints × 2).
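
As a concrete sketch, a model in this spirit might look like the following PyTorch module; the channel widths, kernel sizes, and FC hidden size are my assumptions, not the exact configuration used:

```python
import torch
import torch.nn as nn

class KeypointCNN(nn.Module):
    """Direct coordinate regression CNN (a sketch; exact widths assumed)."""

    def __init__(self, num_keypoints=68):
        super().__init__()

        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ELU(),
                nn.MaxPool2d(2),  # halve spatial resolution
            )

        self.features = nn.Sequential(
            block(1, 32),     # 224 -> 112
            block(32, 64),    # 112 -> 56
            block(64, 128),   # 56 -> 28
            block(128, 256),  # 28 -> 14
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256 * 14 * 14, 512),
            nn.ELU(),
            nn.Linear(512, num_keypoints * 2),  # 136 outputs
        )

    def forward(self, x):  # x: [B, 1, 224, 224]
        return self.head(self.features(x))
```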

Model Architecture

Training Details

  • Loss Function: Smooth L1 Loss
  • Optimizer: Adam
  • Learning Rate: $1 \times 10^{-3}$
  • Batch Size: 8
  • Epochs: 20
  • Final Model Epoch: 16

Training Curve

The figure below shows the training and validation loss curves during training:

Training Curve Left: Validation Curve. Right: Training Curve. The loss decreases steadily, and the model at epoch 16 was selected for evaluation.

Hyperparameter Analysis

I conducted experiments to evaluate the impact of different hyperparameters on performance:

  • Loss Functions: MSELoss vs. Smooth L1 Loss
  • Activations: ReLU vs. ELU
  • Learning Rates: $1 \times 10^{-3}$ vs. $1 \times 10^{-4}$

The table below summarizes these experiments, reporting the best test-set MSE achieved with each configuration:

| Loss Function | Activation | Learning Rate | MSE (Test Set) |
| --- | --- | --- | --- |
| MSELoss | ReLU | $1 \times 10^{-3}$ | 0.0724 |
| MSELoss | ELU | $1 \times 10^{-3}$ | 0.1551 |
| Smooth L1 | ReLU | $1 \times 10^{-3}$ | 0.1201 |
| Smooth L1 | ELU | $1 \times 10^{-3}$ | 0.0629 |
| MSELoss | ReLU | $1 \times 10^{-4}$ | 0.0758 |
| MSELoss | ELU | $1 \times 10^{-4}$ | 0.0939 |
| Smooth L1 | ReLU | $1 \times 10^{-4}$ | 0.0708 |
| Smooth L1 | ELU | $1 \times 10^{-4}$ | 0.0791 |

Findings:

  • The choice between MSE and Smooth L1 resulted in similar training dynamics, but the results show that Smooth L1 yields slightly better final performance.
  • ReLU and ELU both helped reduce training loss, but ReLU occasionally caused instability on the validation set. ELU offered more stable generalization and was therefore adopted.
  • Changing the learning rate mainly affected the training curve; validation performance remained similar.

Activation Comparison
Red curves use ReLU; blue curves use ELU.

Thus, I use ELU activation and Smooth L1 loss for the final model and in the following parts.

Results and Visualization

The best model achieves a mean squared error of 0.0629 on the test set. The figure below shows the predicted keypoints overlaid on the original image. The red dots represent the predicted keypoints, while the green dots represent the ground truth keypoints.

Keypoint Visualization

As the image shows, the results are not particularly good yet. Keypoints near the center of the face are mostly predicted correctly, but the model cannot accurately locate every individual keypoint.


Part 2: Transfer Learning

To improve performance with limited training data, I adopted transfer learning by leveraging pretrained models—specifically, ResNet18 and DINO. I made light modifications to adapt these models to our facial keypoint detection task. Both models achieved strong performance.

ResNet18

Method

I started with the pretrained resnet18 model from PyTorch. To adapt it for our grayscale facial images and keypoint prediction task, I made two key modifications:

  1. Input Layer: Changed the first convolutional layer to accept a single-channel grayscale image instead of a 3-channel RGB image.
  2. Output Layer: Replaced the final fully connected layer with a new one that outputs a vector of size 136 (68 keypoints × 2).
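
In code, these two changes amount to roughly the following (a sketch using torchvision's current weights API):

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)  # pretrained on ImageNet
# 1. Input layer: accept a 1-channel grayscale image instead of 3-channel RGB
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# 2. Output layer: regress 68 keypoints x 2 coordinates = 136 values
model.fc = nn.Linear(model.fc.in_features, 136)
```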

Training Details

To take advantage of pretrained features, I used a two-stage training strategy:

  • Stage 1 (Epochs 0–4): Freeze all ResNet parameters except the first and last modified layers. Only train these new layers to adapt to the grayscale input and keypoint output.
  • Stage 2 (Epochs 5–N): Unfreeze the entire model and fine-tune all layers jointly on our dataset.
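
A sketch of this freezing schedule, assuming the modified model from above:

```python
# Stage 1 (epochs 0-4): freeze the pretrained backbone,
# train only the replaced input and output layers
for p in model.parameters():
    p.requires_grad = False
for layer in (model.conv1, model.fc):
    for p in layer.parameters():
        p.requires_grad = True

# Stage 2 (epoch 5 onward): unfreeze and fine-tune everything jointly
for p in model.parameters():
    p.requires_grad = True
```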

Training Hyperparameters

  • Loss Function: Smooth L1 Loss
  • Optimizer: Adam
  • Initial Learning Rate: $1 \times 10^{-3}$
  • Learning Rate Scheduler: MultiStepLR (milestones at [10, 20], gamma = 0.5)
  • Batch Size: 8
  • Epochs: 25
  • Final Model Epoch: 24

Training Curve

Below is the loss curve for the ResNet18 transfer learning setup:

ResNet Training Curve Left: Validation Curve. Right: Training Curve.

Results and Visualization

Finally, the best model achieved a mean squared error of 0.0034 on the test set. The figure below shows the predicted keypoints overlaid on the original image. The red dots represent the predicted keypoints, while the green dots represent the ground truth keypoints.

Keypoint Visualization

The transfer learning approach using ResNet yields strong performance overall. Most keypoints are predicted accurately, especially around the eyes, nose, and mouth. However, minor inaccuracies still appear in the facial contour regions, such as the jawline and cheek areas.

DINO

Method

To explore the potential of transformer-based image representations, I experimented with DINO (a self-supervised ViT backbone) as the feature extractor for keypoint regression. The DINO model outputs a tensor of shape [B, N, D], where:

  • B: Batch size
  • N: Number of tokens (patches + CLS)
  • D: Latent dimension

I use the vit_small_patch16_224_dino model, so $N = 197$ and $D = 384$. I tested two strategies for turning the DINO features into keypoint predictions:

Strategy 1: Flatten and Regress

In the first strategy, I flattened the token outputs (excluding the CLS token) into a vector of shape [B, (N-1) × D], and passed it through three fully connected layers to regress the final 68 × 2 keypoint coordinate vector. This mirrors the design in our direct regression baseline.
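
A sketch of this head; the hidden layer sizes are my assumptions, with $N - 1 = 196$ patch tokens and $D = 384$:

```python
import torch.nn as nn

# Strategy 1 head: flatten patch tokens and regress coordinates (a sketch)
flatten_head = nn.Sequential(
    nn.Flatten(),                         # [B, 196, 384] -> [B, 196 * 384]
    nn.Linear(196 * 384, 1024), nn.ELU(),
    nn.Linear(1024, 512), nn.ELU(),
    nn.Linear(512, 68 * 2),               # 136 coordinates
)
```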

Strategy 2: Attention-Style Querying

Reasoning that a transformer deserves more attention, I implemented a more structured approach:

  1. Define a learnable query matrix $Q \in \mathbb{R}^{68 \times D}$, where each row corresponds to one keypoint.
  2. Discard the CLS token from the DINO output and use the remaining tokens as $K, V \in \mathbb{R}^{(N-1) \times D}$.
  3. Compute keypoint-level attention using:

$$ A = \text{softmax} \left( \frac{QK^T}{\sqrt{D}} \right) \in \mathbb{R}^{68 \times (N-1)} $$

  4. Use $A$ to extract a representation for each keypoint from $V$:

$$ F = AV \in \mathbb{R}^{68 \times D} $$

  5. Pass each keypoint feature through a fully connected layer that reduces the dimension ($D \rightarrow 2$) to regress the $(x, y)$ coordinate:

$$ \hat{\mathbf{y}}_k = W F_k + b $$

This approach explicitly separates spatial reasoning for each keypoint and allows query-based learning over the image tokens.
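
A minimal PyTorch sketch of this keypoint-query head (the class and variable names are mine):

```python
import torch
import torch.nn as nn

class KeypointQueryHead(nn.Module):
    """Learnable-query attention over DINO patch tokens (a sketch)."""

    def __init__(self, num_keypoints=68, dim=384):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_keypoints, dim))  # Q: [68, D]
        self.fc = nn.Linear(dim, 2)   # D -> (x, y)
        self.scale = dim ** -0.5      # 1 / sqrt(D)

    def forward(self, tokens):
        # tokens: [B, N, D] from DINO; drop the CLS token -> K = V: [B, N-1, D]
        kv = tokens[:, 1:, :]
        # A = softmax(Q K^T / sqrt(D)): [B, 68, N-1]
        attn = torch.softmax(
            torch.einsum("kd,bnd->bkn", self.query, kv) * self.scale, dim=-1)
        feats = torch.einsum("bkn,bnd->bkd", attn, kv)  # F = A V: [B, 68, D]
        return self.fc(feats)                           # [B, 68, 2]
```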

Results

I tried both strategies, and Strategy 2 consistently outperformed Strategy 1. Moreover, its validation loss kept decreasing, very slowly but steadily, throughout training. Encouraged by this, I trained the second model for 150 epochs.

Training Details

  • Loss Function: Smooth L1 Loss
  • Optimizer: Adam
  • Learning Rate: $1 \times 10^{-4}$
  • Scheduler: MultiStepLR (milestones = [20, 50, 70], gamma = 0.5)
  • Frozen Training Epochs: 10 (train only Q and FC layers)
  • Batch Size: 8
  • Total Epochs: 150
  • Best Model Epoch: 144

Training Curve

DINO Training Curve

Results and Visualization

Finally, the best model achieved a mean squared error of 0.0080 on the test set. The figure below shows the predicted keypoints overlaid on the original image. The red dots represent the predicted keypoints, while the green dots represent the ground truth keypoints.

DINO Prediction Example

Discussion

Despite the simple design, the DINO-based model showed promising results. However, several limitations and opportunities for improvement exist:

  1. Static Query Matrix: The same query matrix $Q$ was used for all images. Intuitively, this is suboptimal as different faces may require adaptive querying strategies.
  2. Shallow Output Mapping: The final projection from $D = 384$ to 2 dimensions was performed with a single fully connected layer. A deeper decoder or hierarchical structure may enhance representational capacity.

I believe DINO has significant potential for keypoint detection. With more carefully crafted architecture and attention mechanisms, it could potentially outperform other methods.

Comparison

While the DINO-based approach requires longer training time and more architectural tuning, it demonstrates strong representational power and generalization ability. In contrast, ResNet-based transfer learning converges much faster and achieves strong results out-of-the-box, making it a more accessible option when resources or time are limited.

In essence, I think models like DINO, which represent unsupervised or self-supervised pretraining approaches, tend to have this characteristic — they can be applied to many downstream tasks, but require careful and sophisticated design to fully unleash their power.


Part 3: Heatmap-based Prediction

Method

Unlike coordinate regression, this method predicts a probability distribution over the image space for each keypoint. Specifically, the model outputs $K = 68$ heatmaps, each representing the likelihood of a keypoint being at a particular position.

Each heatmap is of size $H' \times W'$; the predicted keypoint location is computed by taking the argmax of each heatmap and then mapping it back to the image coordinate space.

Let $H_k(u, v)$ be the predicted heatmap for keypoint $k$ at pixel location $(u, v)$. Then the predicted position $\hat{\mathbf{y}}_k = (x_k, y_k)$ is given by:

$$ (x_k', y_k') = \arg\max_{(u, v)} H_k(u, v) $$

$$ (x_k, y_k) = \left(x_k' \cdot \frac{H}{H'}, \; y_k' \cdot \frac{W}{W'} \right) $$

where $(H, W)$ are the original image dimensions. The scaling factors map the predicted keypoint coordinates back into the original image space.
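
A sketch of this decoding step, assuming $x$ indexes columns (width) and $y$ indexes rows (height):

```python
import torch

def decode_keypoints(heatmaps, img_h=224, img_w=224):
    """Argmax decoding of predicted heatmaps to image coordinates (a sketch).

    heatmaps: [B, K, H', W'] tensor; returns [B, K, 2] (x, y) in image space.
    """
    B, K, Hp, Wp = heatmaps.shape
    flat_idx = heatmaps.flatten(2).argmax(dim=-1)   # [B, K] index into H'*W'
    ys = (flat_idx // Wp).float() * (img_h / Hp)    # row index -> y
    xs = (flat_idx % Wp).float() * (img_w / Wp)     # column index -> x
    return torch.stack([xs, ys], dim=-1)
```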

Ground Truth Heatmap Generation

To train the model, I generate a 2D Gaussian heatmap for each ground truth keypoint. Given a keypoint located at $(x^*_k, y^*_k)$ in the image, its location in the heatmap is: $$ (x_k^{*\prime}, y_k^{*\prime}) = \left(x^*_k \cdot \frac{H'}{H}, \; y^*_k \cdot \frac{W'}{W} \right) $$

I then set $G_k(x_k^{*\prime}, y_k^{*\prime}) = 1$ and all other pixels to 0, and apply a Gaussian filter with standard deviation $\sigma$ to obtain a smooth heatmap.
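
A sketch of this heatmap generation using scipy's Gaussian filter (the function name and index clipping are my additions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_gt_heatmap(x, y, img_size=224, hm_size=224, sigma=2.33):
    """Gaussian ground-truth heatmap for one keypoint (a sketch)."""
    # map the keypoint from image space to heatmap space
    xp = int(np.clip(round(x * hm_size / img_size), 0, hm_size - 1))
    yp = int(np.clip(round(y * hm_size / img_size), 0, hm_size - 1))
    hm = np.zeros((hm_size, hm_size), dtype=np.float32)
    hm[yp, xp] = 1.0                      # single peak at the keypoint
    return gaussian_filter(hm, sigma=sigma)
```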

Model Architecture

I adopt a standard U-Net architecture to predict 68 heatmaps at full resolution, i.e. $H' = H = 224$ and $W' = W = 224$.

  • Down Module:

    • $2 \times 2$ max pooling with stride 2 (to reduce spatial size)
    • Two $3 \times 3$ convolutions (padding = 1), each followed by:
      • Batch Normalization
      • ELU activation
    • A dropout layer is added after the first convolution
  • Up Module:

    • $2 \times 2$ transposed convolution (stride = 2) to upsample
    • Two $3 \times 3$ convolutions with batch norm and ELU
    • Skip connections are used to concatenate features from the corresponding down path

The model performs 4 downsampling steps and 4 upsampling steps. With each downsampling, the spatial resolution is halved and the feature dimension is doubled.
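
A sketch of the Down and Up modules as described above (the dropout probability is my assumption):

```python
import torch
import torch.nn as nn

class Down(nn.Module):
    """Down module: 2x2 max pool, then two 3x3 conv-BN-ELU blocks,
    with dropout after the first convolution (a sketch)."""

    def __init__(self, in_ch, out_ch, p_drop=0.1):
        super().__init__()
        self.block = nn.Sequential(
            nn.MaxPool2d(2),                              # halve spatial size
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ELU(),
            nn.Dropout2d(p_drop),                         # dropout after first conv
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ELU(),
        )

    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Up module: 2x2 transposed conv upsampling, concatenate the skip
    connection, then two 3x3 conv-BN-ELU blocks (a sketch)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)
        self.block = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, 3, padding=1),  # skip doubles channels
            nn.BatchNorm2d(out_ch), nn.ELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ELU(),
        )

    def forward(self, x, skip):
        return self.block(torch.cat([self.up(x), skip], dim=1))
```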

UNet Architecture

Training Details

  • Ground Truth Heatmap Sigma $\sigma$: 2.33
  • Loss Function: Cross Entropy Loss
  • Optimizer: Adam
  • Learning Rate: $1 \times 10^{-4}$
  • Batch Size: 8
  • Epochs: 60
  • Final Model Epoch: 40
  • Hidden Channels: 32 → 64 → 128 → 256 → 512 → 256 → 128 → 64 → 32

Training Curve

UNet Training Curve Left: Validation Curve. Right: Training Curve.

The validation loss is already quite low by around epoch 14, but the training loss was still decreasing, so I continued training to see whether performance would improve further.

Results and Visualization

The MSE on the test set is 0.0036. The figure below shows the predicted keypoints overlaid on the original image. The red dots represent the predicted keypoints, while the green dots represent the ground truth keypoints.

Keypoint Visualization

Below are heatmaps for some keypoints; the first row shows the predicted heatmaps, and the second row shows the ground truth heatmaps.

Keypoint Heatmaps

Full Heatmap
Sum over all keypoints' heatmaps

Discussion

While heatmap-based keypoint prediction has shown promising performance, training this type of model effectively can be non-trivial. Through some experimentation, I discovered several important factors that greatly influence training stability and accuracy:

1. The Choice of Gaussian Sigma

The standard deviation $\sigma$ used to generate the ground truth heatmaps has a significant impact on training dynamics.

  • A larger $\sigma$ results in a wider area around the keypoint having non-zero values. This provides more informative gradients to the network, especially when the predicted location is close but not perfectly aligned with the ground truth.
  • A small $\sigma$, in contrast, leads to very narrow peaks. When the model’s prediction is slightly off, it receives little to no gradient feedback, making learning difficult.

The figure below shows the difference between two heatmaps generated with different $\sigma$ values:

Sigma Comparison

The middle heatmap is generated with $\sigma = 2.33$; the right one with $\sigma = 1.0$.

2. Loss Function Choice and Heatmap Modeling

We evaluated three different loss functions for heatmap supervision:

  • MSELoss
  • BCEWithLogitsLoss (includes sigmoid internally)
  • CrossEntropyLoss (includes softmax internally)

Among them, CrossEntropyLoss demonstrated the fastest convergence. The validation loss dropped sharply within the first few epochs, outperforming the other loss functions in early-stage training.

In contrast, BCEWithLogitsLoss and MSELoss resulted in more stable but slower convergence. While the loss decreased consistently, it took longer to reach comparable validation performance.

Loss
Validation curves of the three loss functions. The pink curve is CrossEntropy, the green curve is MSE, and the blue curve is BCE.

I hypothesize that CrossEntropyLoss works better because it implicitly applies a softmax over each heatmap, aligning well with the actual nature of the problem—single-label classification, where each heatmap ideally contains a single peak corresponding to one keypoint location.
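
As a sketch of this view (my formulation; the exact training code may differ), each heatmap can be treated as a distribution over its $H' \times W'$ positions, with the normalized ground-truth heatmap as a soft target:

```python
import torch.nn.functional as F

def heatmap_ce_loss(pred, gt):
    """Cross entropy over spatial positions (a sketch of this interpretation).

    pred: raw logits [B, K, H, W]; gt: Gaussian target heatmaps [B, K, H, W].
    """
    B, K, H, W = pred.shape
    logp = F.log_softmax(pred.view(B, K, H * W), dim=-1)  # softmax per heatmap
    target = gt.view(B, K, H * W)
    target = target / target.sum(dim=-1, keepdim=True)    # normalize to sum to 1
    return -(target * logp).sum(dim=-1).mean()
```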

In comparison, sigmoid-based methods (like BCEWithLogitsLoss) treat each pixel independently and do not enforce a global competition across pixels. This could lead to heatmaps with multiple high-probability regions, which may confuse the argmax-based keypoint extraction step.

Therefore, CrossEntropyLoss—through softmax normalization—encourages spatial exclusivity in each heatmap and better guides the network toward sharp, peaked responses.

Comparison and Summary

We summarize the performance of all three methods on the test set using the mean squared error (MSE) as the evaluation metric:

| Method | Test MSE ↓ | Notes |
| --- | --- | --- |
| ResNet18 (Transfer Learning) | 0.0034 | Best performance, easiest to train, fast convergence |
| UNet (Heatmap-based) | 0.0036 | More accurate on facial contours, but harder to train |
| DINO (Transfer Learning) | 0.0080 | Good performance, but not fully explored yet |
| Direct Regression | 0.0629 | Simple and fast, but limited accuracy |

Key Takeaways

  • Transfer learning methods achieved the best overall performance with minimal tuning effort. Both end up performing well: the supervised pretrained ResNet converges quickly, while DINO takes more time to train but shows the potential to perform even better.

  • Heatmap-based prediction came close in performance and even outperformed ResNet in localizing facial contours such as the jawline. However, it required careful tuning of parameters like the heatmap sigma, loss function formulation, and training stability.

Keypoint Visualization

The first row shows the results of ResNet, and the second row shows the results of the heatmap-based UNet. The UNet model is clearly better at predicting the contour of the face.

Final Remarks

In conclusion, transfer learning using pretrained CNNs remains a strong baseline for structured vision tasks like keypoint detection. Meanwhile, heatmap-based approaches, though more complex, can provide higher localization fidelity when tuned properly. In the future, I should further explore transformer-based features (e.g., from DINO) and perhaps combine them with heatmap decoding strategies for potentially stronger performance.
