Student-teacher offline reinforcement learning process. A teacher model is first trained on clean data. The teacher is then frozen and used to guide student-teacher learning on both clean and noisy data. The trained student model serves as the final dialogue policy model, with its classifier parameters ϕcls copied directly from the teacher's classifier parameters θcls.
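
The three stages in this caption can be sketched in plain Python as a rough illustration. Everything here is hypothetical: `ToyPolicy`, `step`, `run_pipeline`, and the numeric "batches" are placeholders, not the paper's models or objective. The sketch assumes the classifier copy happens when the student is created and that only the student's encoder is updated afterwards; the caption does not state either point, and the signal the frozen teacher provides to the student is omitted because it is not specified here.

```python
import copy


class ToyPolicy:
    """Hypothetical stand-in for a dialogue policy: encoder weights plus a classifier head."""

    def __init__(self):
        self.encoder = [0.0, 0.0]
        self.cls = [0.0, 0.0]
        self.frozen = False

    def step(self, delta, train_cls=True):
        # Placeholder for a parameter update; the real training objective is not modeled.
        if self.frozen:
            raise RuntimeError("frozen model cannot be updated")
        self.encoder = [w + delta for w in self.encoder]
        if train_cls:
            self.cls = [w + delta for w in self.cls]


def run_pipeline(clean_batches, noisy_batches):
    # Stage 1: train the teacher on clean data only.
    teacher = ToyPolicy()
    for batch in clean_batches:
        teacher.step(delta=batch)

    # Stage 2: freeze the teacher before the student-teacher stage.
    teacher.frozen = True

    # Stage 3: the student's classifier (phi_cls) is a copy of the teacher's (theta_cls).
    student = ToyPolicy()
    student.cls = copy.deepcopy(teacher.cls)

    # Assumption: only the student's encoder is updated, on clean and noisy data.
    for batch in list(clean_batches) + list(noisy_batches):
        student.step(delta=batch, train_cls=False)

    return teacher, student


teacher, student = run_pipeline(clean_batches=[0.1, 0.2], noisy_batches=[0.3])
```

After the run, `student.cls` holds the same values as `teacher.cls` but is a separate list, and any further call to `teacher.step` raises an error because the teacher is frozen.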