MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production

While sign language translation has made significant strides, there is still no viable solution for the reverse direction: generating sign sequences directly from spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production that eases communication between sign and non-sign language users. The framework converts multimodal spoken data (speech or text) into continuous sign keypoint sequences. In particular, a sequence diffusion model generates sign predictions step by step, conditioned on text or speech audio embeddings extracted by pretrained models such as CLIP and HuBERT. Moreover, by formulating a joint embedding space for text, audio, and sign, we bind data from the three modalities and leverage their semantic consistency to provide informative feedback signals for training the diffusion model. This embedding-consistency learning strategy reduces the reliance on triplet sign language data and enables continued model refinement even when the audio modality is missing. Experiments on the How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in producing signs from both speech and text data.
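To make the two key ideas above concrete, here is a minimal, illustrative sketch (not the authors' released code) of a DDPM-style training step for keypoint sequences conditioned on a spoken-language embedding, plus a simple cosine alignment term standing in for the embedding-consistency objective. All names (`T`, `KEYPOINT_DIM`, `denoise_fn`, the dummy denoiser) are hypothetical placeholders; a real system would use a learned conditional denoiser fed with CLIP/HuBERT features.

```python
import numpy as np

T = 100                   # number of diffusion steps (assumed)
KEYPOINT_DIM = 137 * 2    # e.g. 137 2-D body/hand/face keypoints per frame
SEQ_LEN = 16              # frames in the sign sequence

betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal fraction

def q_sample(x0, t, noise):
    """Forward process: corrupt clean keypoints x0 to diffusion step t."""
    a = alphas_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def denoise_fn(x_t, t, cond):
    """Stand-in for the conditional denoiser: a real model would be a
    network that predicts the noise given x_t, t, and the text/speech
    embedding `cond`. Here it just broadcasts a scalar for illustration."""
    return np.zeros_like(x_t) + cond.mean()

def diffusion_loss(x0, cond, rng):
    """Standard epsilon-prediction MSE at a random diffusion step."""
    t = rng.integers(T)
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, noise)
    eps_hat = denoise_fn(x_t, t, cond)
    return float(np.mean((eps_hat - noise) ** 2))

def consistency_loss(e_a, e_b):
    """Toy embedding-consistency term: push two modality embeddings
    (e.g. text and audio of the same utterance) toward cosine similarity 1."""
    ca = e_a / np.linalg.norm(e_a)
    cb = e_b / np.linalg.norm(e_b)
    return 1.0 - float(ca @ cb)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((SEQ_LEN, KEYPOINT_DIM))  # clean keypoint sequence
e_text = rng.standard_normal(512)                  # text embedding (dummy)
e_audio = rng.standard_normal(512)                 # audio embedding (dummy)

loss = diffusion_loss(x0, e_text, rng)
c_loss = consistency_loss(e_text, e_audio)
print(loss, c_loss)
```

Because the consistency term only needs pairs of spoken-modality embeddings, it can keep supplying a training signal even when one modality (e.g. audio) is absent, which mirrors the paper's motivation for reducing dependence on full text-audio-sign triplets.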

How2Sign

Text-to-Sign

Text: Let me demonstrate you this on my back because it’s a lot easier.

WebP Image


Text: Right now, winter ties are probably the more popular way to go.

WebP Image


Text: I have got some leather mittens here.

WebP Image


Audio-to-Sign

Text: And I’m actually going to lock my wrists when I pike.

WebP Image


Text: The rudder is the vertical stabilizer.

WebP Image


Text: There’s the orange portal that we came out of and that’s this test chamber.

WebP Image


Text: So, we’ve got to find a way to get to the exit.

WebP Image

Citation

Please consider citing our paper if it helps your research.

@inproceedings{ma2024ms2sl,
  title={MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production},
  author={Ma, Jian and Wang, Wenguan and Yang, Yi and Zheng, Feng},
  booktitle={ACL},
  year={2024}
}