Leveraging Self-Supervision for
Cross-Domain Crowd Counting

¹Computer Vision Laboratory, EPFL  ²Tencent AI Lab

Abstract

State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, these data-driven approaches require large amounts of annotated data to achieve good performance, which prevents them from being deployed in emergencies during which data annotation is either too costly or cannot be obtained fast enough.

One popular solution is to use synthetic data for training. Unfortunately, due to domain shift, the resulting models generalize poorly to real imagery. We remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to distinguish upside-down real images from regular ones, and we give it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning purposes. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting methods without any extra computation at inference time.

Cross-Domain Crowd Counting

Crowd Counting (CC) is the task of estimating the number of people present in an image, and it plays an important role in many practical applications such as video surveillance and traffic control. At the same time, most of the techniques developed for CC require large and diverse datasets for training.

Collecting such data is often impossible, so less data-intensive methods are needed. One possible solution is cross-domain methods, which learn from both synthetic and real-world images and thereby improve the final accuracy of the model.
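To make the link between crowd density and crowd count concrete, here is a minimal sketch in Python with NumPy. It assumes the common convention that each annotated head contributes a Gaussian of unit mass to the density map, so that summing the map gives the count. The function names, kernel size, and sigma are illustrative assumptions and are not taken from our implementation.

```python
import numpy as np

def gaussian_kernel(size: int = 15, sigma: float = 4.0) -> np.ndarray:
    """Return a 2D Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def density_from_points(points, height, width, size=15, sigma=4.0):
    """Stamp one unit-mass Gaussian per annotated head position (row, col)."""
    pad = size // 2
    # Work on a padded canvas so kernels near the border fit while stamping.
    padded = np.zeros((height + 2 * pad, width + 2 * pad), dtype=np.float64)
    kernel = gaussian_kernel(size, sigma)
    for r, c in points:
        # In padded coordinates the kernel is centered on (r + pad, c + pad).
        padded[r:r + size, c:c + size] += kernel
    # Crop back to the image; mass of heads close to the border is partly lost.
    return padded[pad:-pad, pad:-pad]

# Hypothetical annotations: three heads, all far from the image border.
heads = [(20, 30), (40, 50), (41, 52)]
density = density_from_points(heads, height=64, width=96)
count = density.sum()  # integrating the density map recovers the head count
print(round(count, 6))
```

A counting network is trained to regress such a map from the image, and its predicted count is obtained the same way, by summing the predicted density.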

Uncertainty Estimation and Masksembles

The ability of deep neural networks to produce useful predictions is now abundantly clear but assessing the reliability of these predictions remains a challenge. The goal of Uncertainty Estimation (UE) is to produce a measure of confidence for model predictions.

In this work, we use a Masksembles layer, which pre-generates a set of binary masks before training and uses them to drop out the network's weights in a controllable manner during both training and inference. Masksembles thus offers a number of configurable parameters that allow one to span the whole spectrum of methods between the Single Model, MC-Dropout, and Ensembles approaches.
Masksembles transformation from Single Model to Ensembles
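The idea of fixed, pre-generated masks can be illustrated with a small NumPy sketch. This is a simplification under stated assumptions: the masks here are drawn independently at random with a hypothetical `keep_prob`, whereas the actual Masksembles mask generation controls how much the masks overlap, and the "network" is a toy linear model. The helper names `make_masks` and `masked_predictions` are made up for this example.

```python
import numpy as np

def make_masks(n_masks: int, n_features: int,
               keep_prob: float = 0.5, seed: int = 0) -> np.ndarray:
    """Generate a fixed set of binary masks once, before any training."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_masks, n_features)) < keep_prob).astype(np.float64)

def masked_predictions(x: np.ndarray, weights: np.ndarray,
                       masks: np.ndarray) -> np.ndarray:
    """Toy linear model: one scalar prediction per mask for a single input x."""
    # Row i of (masks * weights) is the weight vector with mask i applied.
    return (masks * weights) @ x

masks = make_masks(n_masks=4, n_features=8)

rng = np.random.default_rng(1)
x = rng.normal(size=8)        # hypothetical input features
weights = rng.normal(size=8)  # hypothetical trained weights

preds = masked_predictions(x, weights, masks)  # shape (4,)
mean_prediction = preds.mean()
uncertainty = preds.std()  # disagreement between the masked sub-models
print(mean_prediction, uncertainty)
```

Because the masks are fixed rather than resampled at every forward pass, the same sub-models are used at training and inference time. With masks that are all ones, the sub-models coincide and the sketch reduces to a single model with zero disagreement.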

Self-Supervised Learning for Crowd Counting

Self-Supervised Learning is a machine learning approach that allows learning from unlabelled data. In our case, we implement it as a two-step procedure:
1) Given a large-scale dataset of synthetic images with generated crowd labels, we train the network with an incorporated Masksembles layer, which enables uncertainty estimation.
2) Given the model trained on synthetic data, we run inference on real data and use the predicted uncertainties to keep only the most confident predictions, add them to the training dataset as pseudo labels, and then retrain the model.

By running this procedure for a predefined number of iterations or until convergence, we obtain a model that has absorbed information from both synthetic (images + labels) and real (images only) data and significantly outperforms other approaches.
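The iterative procedure described above can be sketched as follows. This is a schematic outline rather than our training code: `train_fn` and `predict_fn` are hypothetical callables supplied by the user, the uncertainty is assumed to be one scalar per image, and `keep_fraction` and `n_rounds` are illustrative parameters.

```python
import numpy as np

def select_confident(uncertainties: np.ndarray,
                     keep_fraction: float = 0.3) -> np.ndarray:
    """Return indices of the most confident (lowest-uncertainty) predictions."""
    n_keep = max(1, int(len(uncertainties) * keep_fraction))
    return np.argsort(uncertainties)[:n_keep]

def self_training(train_fn, predict_fn, synth_x, synth_y, real_x,
                  n_rounds: int = 3, keep_fraction: float = 0.3):
    """Train on synthetic data, then repeatedly add confident pseudo labels."""
    # Step 1: train on labeled synthetic data only.
    model = train_fn(synth_x, synth_y)
    for _ in range(n_rounds):
        # Step 2: predict on unlabeled real data, with one uncertainty per image.
        preds, uncertainties = predict_fn(model, real_x)
        idx = select_confident(uncertainties, keep_fraction)
        # Keep only the most confident predictions as pseudo labels and retrain.
        x = np.concatenate([synth_x, real_x[idx]])
        y = np.concatenate([synth_y, preds[idx]])
        model = train_fn(x, y)
    return model

# With uncertainties [0.9, 0.1, 0.5, 0.2] and keep_fraction=0.5,
# the two lowest-uncertainty samples (indices 1 and 3) are selected.
print(select_confident(np.array([0.9, 0.1, 0.5, 0.2]), keep_fraction=0.5))
```

In this sketch the pseudo-labeled set is rebuilt from scratch in every round, so an image that was confidently but wrongly labeled early on can still be dropped later. Whether to accumulate pseudo labels across rounds instead is a design choice the sketch does not settle.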

Results

In this section, we introduce the benchmark datasets used in our experiments: 1) the large-scale synthetic dataset GCC is used for training, while 2) ShanghaiTech, 3) UCF CC 50, and 4) WorldExpo’10 are used for testing, following the same experimental protocols as in earlier [work]. We compare our approach (OURS) to state-of-the-art methods, namely 1) Cycle-GAN, 2) SE Cycle-GAN, 3) SE Cycle-GAN (JT), 4) SE+FD, and 5) GP. "No Adapt" denotes the protocol in which the model is trained only on synthetic data and applied to real data without any adaptation.

Results 1: For ShanghaiTech (Part A and B) and UCF CC 50, we consistently outperform all other adaptation methods by a significant margin in terms of both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Moreover, on ShanghaiTech Part B our method comes very close to the supervised learning baseline (MAE 11.4 versus 11.0).
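For reference, the two error measures reported in the tables can be computed from per-image predicted and ground-truth counts as in the short sketch below. The counts in the example are made up for illustration and do not come from any of the benchmarks.

```python
import numpy as np

def mae(pred_counts, true_counts) -> float:
    """Mean Absolute Error between predicted and ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    return float(np.mean(np.abs(pred - true)))

def rmse(pred_counts, true_counts) -> float:
    """Root Mean Square Error between predicted and ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

# Hypothetical per-image counts for three test images.
predicted = [105, 48, 310]
ground_truth = [100, 50, 300]
print(mae(predicted, ground_truth), rmse(predicted, ground_truth))
```

Both measures are lower-is-better. RMSE penalizes large per-image errors more heavily than MAE, which is why it is usually read as an indicator of robustness.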

ShanghaiTech Part A

Model               MAE     RMSE
No Adapt            160.0   216.5
Cycle-GAN           143.3   204.3
SE Cycle-GAN        123.4   193.4
SE Cycle-GAN (JT)   119.6   189.1
SE+FG               129.3   187.6
GP                  121.0   181.0
OURS                109.2   168.1
Supervised          76.3    144.2

ShanghaiTech Part B

Model               MAE     RMSE
No Adapt            22.8    30.6
Cycle-GAN           25.4    39.7
SE Cycle-GAN        19.9    28.3
SE Cycle-GAN (JT)   16.4    25.8
SE+FG               16.9    24.7
GP                  12.8    19.2
OURS                11.4    17.3
Supervised          11.0    17.1

UCF CC 50

Model               MAE     RMSE
No Adapt            487.2   689.0
Cycle-GAN           404.6   548.2
SE Cycle-GAN        373.4   528.8
SE Cycle-GAN (JT)   370.2   512.0
GP                  355.0   505.0
OURS                336.5   486.1
Supervised          259.3   407.2

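Results 2: As with the previous results, our method performs very well on both the UCF-QNRF dataset and the scenes of WorldExpo’10. OURS achieves the lowest error among the adaptation methods on UCF-QNRF, in terms of both MAE and RMSE, and on the WorldExpo’10 average. The only per-scene exception in the tables below is Scene 4, where SE Cycle-GAN reports 17.0 against 19.4 for OURS.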

UCF-QNRF

Model               MAE     RMSE
No Adapt            275.5   458.5
Cycle-GAN           257.3   400.6
SE Cycle-GAN        230.4   384.5
SE Cycle-GAN (JT)   225.9   385.7
SE+FG               221.2   390.2
GP                  210.0   351.0
OURS                198.3   332.9
Supervised          198.3   332.9

WorldExpo’10

Model               Scene1   Scene2   Scene3   Scene4   Scene5   Average
No Adapt            4.4      87.2     59.1     51.8     11.7     42.8
Cycle-GAN           4.4      69.6     49.9     29.2     9.0      32.4
SE Cycle-GAN        4.3      59.1     43.7     17.0     7.6      26.3
SE Cycle-GAN (JT)   4.2      49.6     41.3     19.8     7.2      24.4
GP                  -        -        -        -        -        20.4
OURS                4.0      31.9     23.5     19.4     4.2      16.6
Supervised          2.7      18.2     14.3     16.1     4.5      11.2

Results 3: Qualitative comparison. These visualizations show how close the density maps predicted by our method are to the ground-truth density labels. As one can see, OURS accurately recovers dense crowds and correctly localizes people in sparse regions, which yields precise final crowd count estimates.

[Figure: two qualitative examples, each with three panels: Input Image, Ground Truth, Estimated Density.]

BibTeX

@inproceedings{liu2022leveraging,
  title={Leveraging Self-Supervision for Cross-Domain Crowd Counting},
  author={Liu, Weizhe and Durasov, Nikita and Fua, Pascal},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5341--5352},
  year={2022}   
}