Distributed Information Systems Laboratory LSIR

Knowledge Distillation for On-Device Short Text Classification

Project Details


Laboratory: LSIR (Master Project Proposal)


This master project can be done either at EPFL LSIR or at the company Privately in the EPFL Innovation Park.


Recently, large-scale pre-trained language models such as BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) have been used effectively as base models for building task-specific natural language understanding (NLU) models via fine-tuning (e.g., for text classification). However, these pre-trained models are expensive to serve at runtime (e.g., BERT-large contains 24 transformer layers with 340 million parameters, and GPT-2 contains 48 transformer layers with 1.5 billion parameters), which makes them impractical to deploy on devices such as mobile phones. The model of choice for mobile phones is the Convolutional Neural Network (CNN), a fast and lightweight architecture.
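To give a concrete sense of how lightweight such a student can be, here is a minimal sketch of the forward pass of a Kim-style text CNN (embedding, 1D convolution, max-over-time pooling, linear classifier) in NumPy. All sizes, names (`cnn_forward`, `vocab_size`, etc.) and random weights are illustrative assumptions, not part of the proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a tiny text CNN (illustrative only)
vocab_size, embed_dim, num_filters, kernel_size, num_classes = 1000, 16, 8, 3, 4
seq_len = 12

embedding = rng.normal(size=(vocab_size, embed_dim)) * 0.1
filters = rng.normal(size=(num_filters, kernel_size, embed_dim)) * 0.1
W_out = rng.normal(size=(num_filters, num_classes)) * 0.1

def cnn_forward(token_ids):
    """Embed -> 1D convolution -> ReLU -> max-over-time pooling -> linear."""
    x = embedding[token_ids]                                   # (seq_len, embed_dim)
    # Slide a window of size kernel_size over the sequence
    windows = np.stack([x[i:i + kernel_size]
                        for i in range(seq_len - kernel_size + 1)])
    conv = np.einsum('wke,fke->wf', windows, filters)          # (n_windows, num_filters)
    pooled = np.maximum(conv, 0).max(axis=0)                   # max over time
    return pooled @ W_out                                      # class logits

logits = cnn_forward(rng.integers(0, vocab_size, size=seq_len))
```

With these sizes the whole model has roughly 16k parameters, dominated by the embedding table, versus hundreds of millions for the teacher.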

Knowledge distillation is the process of transferring the knowledge from a (set of) large, cumbersome model(s) to a lighter, easier-to-deploy single model without significant loss in performance (Hinton et al., 2015). The small model can produce comparable results and, in some cases, can even be made capable of replicating the results of the cumbersome model. We refer to the cumbersome model as the Teacher Network and the new small model as the Student Network.
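The standard recipe from Hinton et al. (2015) trains the student on a weighted sum of two terms: a cross-entropy against the teacher's temperature-softened output distribution, and a cross-entropy against the ground-truth labels. A minimal NumPy sketch follows; the function name `distillation_loss` and the default `temperature`/`alpha` values are illustrative assumptions:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-target cross-entropy (against the teacher)
    and hard-label cross-entropy (against the ground truth)."""
    # Soft targets: teacher and student distributions at temperature T
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()
    # Hard targets: standard cross-entropy with the true labels
    p = softmax(student_logits)
    hard_loss = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    # The T^2 factor keeps soft-target gradients comparable across temperatures
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss
```

A higher temperature exposes more of the teacher's "dark knowledge" (the relative probabilities it assigns to wrong classes), which is precisely the signal the student cannot learn from one-hot labels alone.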

In this project, the candidate will exploit recent advances in knowledge distillation for NLU (Liu et al., 2019) to train a novel student CNN model from a much larger pre-trained teacher language model on short text classification tasks. We would also like to extend the knowledge distillation method to the multi-task learning setting.


  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.


The candidate should have programming experience, ideally in Python. Previous experience with machine learning and natural language processing is a plus.

30% Theory, 30% Implementation, 40% Research and Experiments


Send me your CV at remi.lebret@epfl.ch.

Contact: Rémi Lebret
Useful Links

Project Guidelines