Code-Mixed Speech Synthesis for Swiss Voice Assistant

Abstract

Most text-to-speech (TTS) synthesis models are designed to generate speech in a single, fixed language. However, foreign words frequently occur in text (code mixing), and they lead to pronunciation errors in the output of a monolingual TTS system. In this work, we present a solution to this issue. We develop a data-driven pipeline that is able to produce speech from multilingual text. In contrast to most methods used in related work, we do not rely on the availability of a phonetic transcription. Instead, the speech is generated directly from text. The pipeline consists of two main parts: a language identification model and the actual TTS architecture. Our language identification method detects the language of an input text at the word level and provides this information to the TTS model. We propose two extensions of a current state-of-the-art sequence-to-sequence TTS architecture that add multilingual functionality to the model. After generating suitable training datasets, we conduct several experiments and a user study, which yields mean opinion scores that are about one point higher than those reported in previous work.
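The two-stage idea described in the abstract (word-level language identification whose output is passed to the TTS model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the thesis uses a learned language identification model, whereas this toy uses a small hand-written lexicon, and the names `LEXICON`, `tag_words`, and `char_level_labels` are hypothetical. The sketch only shows how per-word language tags could be expanded into per-character labels for a text-based (non-phonetic) TTS input.

```python
from typing import List, Tuple

# Hypothetical toy lexicons standing in for a learned language identifier.
LEXICON = {
    "de": {"spiele", "das", "lied", "von", "bitte"},
    "en": {"play", "the", "song", "yesterday", "by"},
}


def tag_words(text: str, default_lang: str = "de") -> List[Tuple[str, str]]:
    """Assign a language tag to every whitespace-separated word."""
    tagged = []
    for word in text.split():
        key = word.lower().strip(".,!?")
        # Use the first lexicon containing the word, else the default language.
        lang = next(
            (code for code, vocab in LEXICON.items() if key in vocab),
            default_lang,
        )
        tagged.append((word, lang))
    return tagged


def char_level_labels(
    tagged: List[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Expand word-level tags into one language label per input character.

    A separating space receives the label of the word that follows it.
    """
    chars: List[str] = []
    langs: List[str] = []
    for i, (word, lang) in enumerate(tagged):
        if i > 0:
            chars.append(" ")
            langs.append(lang)
        chars.extend(word)
        langs.extend([lang] * len(word))
    return chars, langs


if __name__ == "__main__":
    tagged = tag_words("Spiele Yesterday von den Beatles")
    chars, langs = char_level_labels(tagged)
    for ch, lang in zip(chars, langs):
        print(repr(ch), lang)
```

In a sequence-to-sequence model, such per-character labels could for instance be mapped to language embeddings and combined with the character embeddings; whether the two extensions proposed in the thesis work this way is not stated in the abstract.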


Jonas Stehli

Master's Thesis

Status:

Completed
