Is Label Smoothing Truly Incompatible with Knowledge Distillation:
An Empirical Study

ICLR 2021
Zhiqiang Shen1    Zechun Liu12    Dejia Xu3    Zitian Chen4    Kwang-Ting Cheng2    Marios Savvides1
1Carnegie Mellon University
2Hong Kong University of Science and Technology
3Peking University
4UMass Amherst
[PyTorch Code]


Recently, Müller et al. proposed a new standpoint: teachers trained with label smoothing distill inferior students compared to teachers trained with hard labels, even though label smoothing improves the teacher's accuracy. The authors found that label smoothing tends to "erase" information contained in the intra-class relations across individual examples, which indicates that the relative information between logits is erased to some extent when the teacher is trained with label smoothing. We present a novel connection between label smoothing and this idea of "erasing" relative information. We expose the fact that the negative effects of erasing relative information actually occur only on semantically different classes. Intuitively, those classes are easy to classify because they have obvious discrepancies, so the negative effects during distillation are fairly moderate. On semantically similar classes, interestingly, we observe that the erasing phenomenon can push the two clusters away from each other and actually enlarge the distance between class cluster centers, which makes the two categories easier to classify. These classes are difficult to distinguish under the traditional training procedure, so in general the benefits of using label smoothing on teachers outweigh the disadvantages in knowledge distillation.


Z. Shen, Z. Liu, D. Xu, Z. Chen,
K. Cheng, M. Savvides.

Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study.
ICLR, 2021.

[Paper] | [Bibtex] | [Code (Distillation)] | [Code (Stability Metric)] (For distillation code, you can refer to our MEAL V2 project: simply remove the discriminator loss and choose proper teachers/students.)

This paper aims to address:

1. Does label smoothing in teacher networks suppress the effectiveness of knowledge distillation?

2. What will actually determine the performance of a student in knowledge distillation?

3. When does label smoothing actually lose its effectiveness for learning deep neural networks?

Our observations: (i) Long-Tailed Distribution; (ii) More #Classes.

Qualitative analysis

Predicted label distribution from teachers

Each predicted distribution has one major value (the bars in Fig. (1)) representing the model's prediction for the category, while the other small values (i.e., the minor predictions in Fig. (2)) indicate that the input image is somewhat similar to those other categories. We can observe in this figure that a model trained with label smoothing generates more softened distributions, but the relations across different classes are still preserved.
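As a concrete illustration of why the distributions are softened but the ordering is preserved, here is the standard label-smoothing target construction (a generic sketch, not code from the paper's release): with smoothing factor ε, the target keeps 1 − ε on the true class and spreads ε uniformly over all K classes.

```python
import numpy as np

def smooth_targets(labels, num_classes, eps=0.1):
    """Label-smoothed targets: (1 - eps) on the true class,
    plus a uniform floor of eps / num_classes on every class."""
    onehot = np.eye(num_classes)[labels]
    return (1.0 - eps) * onehot + eps / num_classes

# A 5-class example with true class 2: the major value stays largest,
# every class receives a small uniform mass, and the row still sums to 1.
t = smooth_targets(np.array([2]), num_classes=5, eps=0.1)
# t[0] is [0.02, 0.02, 0.92, 0.02, 0.02]
```

The uniform floor is what softens the teacher's output, while the argmax (and hence the coarse class relations) is untouched.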

Quantitative analysis

Stability metric [Code]

Our motivation for this metric: if label smoothing erases relative information within a class, the variance of intra-class probabilities will decrease accordingly, so we can use this variance to monitor the degree of erasing:
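A minimal sketch of one way such a variance-based measure could be computed (the function name and details here are ours for illustration, not the released Stability Metric code): for each class, take the per-dimension variance of the predicted probability vectors over that class's examples, then average.

```python
import numpy as np

def intra_class_variance(probs, labels, num_classes):
    """Average per-dimension variance of predicted probability vectors
    within each class; a lower value suggests more of the relative
    (intra-class) information has been 'erased'."""
    variances = []
    for c in range(num_classes):
        class_probs = probs[labels == c]           # (n_c, K) predictions
        if len(class_probs) > 1:
            variances.append(class_probs.var(axis=0).mean())
    return float(np.mean(variances))

# Toy check: tightly clustered predictions score lower than diverse ones.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
diverse = rng.dirichlet(np.ones(3), size=150)         # high-variance predictions
clustered = rng.dirichlet(np.ones(3) * 50.0, size=150)  # near-identical predictions
v_diverse = intra_class_variance(diverse, labels, 3)
v_clustered = intra_class_variance(clustered, labels, 3)
```

Here `v_clustered` comes out smaller than `v_diverse`, mimicking how a label-smoothed teacher's intra-class predictions collapse toward each other.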

Metric evaluation

The variances of models trained with label smoothing are always lower than those of models trained without it. Training for more epochs and incorporating data augmentation techniques like CutMix can dramatically increase the stability (i.e., lower the variance), which means relative information is erased significantly by more data augmentation and longer training.

Testing curve along training

From the visualization we found two interesting phenomena: on the training set, the loss of teacher networks trained with label smoothing is much higher than that of networks trained without it, while on the validation set the accuracy is comparable or even slightly better (the boost on CUB is greater than that on ImageNet-1K, as shown in Table 2).

What circumstances indeed will make label smoothing less effective?

Long-Tailed Distribution.

The weight shrinkage effect (regularization) enabled by label smoothing is no longer effective in the long-tailed recognition setting and can further impair performance.

More #Classes.

This is another circumstance we found to impair the effectiveness of label smoothing.

What is a better teacher in knowledge distillation?

Better supervision is crucial for distillation: (i) Higher accuracy; (ii) Better stability.
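For reference, the standard distillation objective that this supervision feeds into (temperature-softened KL divergence between teacher and student predictions, following Hinton et al.) can be sketched as follows; this is a generic NumPy illustration, not the paper's released PyTorch training code:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened predictions,
    scaled by T^2 as is conventional in knowledge distillation."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T)

# Matching logits give zero loss; a mismatched student gives a positive one.
teacher = np.array([[4.0, 1.0, 0.5]])
loss_match = kd_loss(teacher, teacher)
loss_mismatch = kd_loss(np.array([[0.5, 1.0, 4.0]]), teacher)
```

A higher-accuracy, more stable teacher shapes the `p_t` term here, which is exactly the supervision signal the student matches.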

Our Other Work on Knowledge Distillation

Feel free to check them out if interested:

Zhiqiang Shen, Zechun Liu, Jie Qin, Lei Huang, Kwang-Ting Cheng, and Marios Savvides. "S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration." CVPR (2021).

Zhiqiang Shen and Marios Savvides. "MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks." NeurIPS Workshop (2020).

Zhiqiang Shen, Zhankui He, and Xiangyang Xue. "MEAL: Multi-Model Ensemble via Adversarial Learning." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 4886-4893. 2019.

*Website template from Swapping Autoencoder.