Speech Synthesis Models for Germanic Low-Resource Languages

Duration

18 months

Status

Completed

MTC Team

Julian Mäder, Dr. Pelin Dogan, Elizabeth Salesky, Luca Campanella, Viturin Züst, Philippe Goetschmann, Agon Serifi, Philipp Rimle, Dr. Severin Klingler, Prof. Thomas Hofmann

Collaborators

Gert von Manteuffel (SRF)

Voice assistants are gaining importance as a human-computer interface in our lives. Millions of users already use them to check the weather, listen to the news, play music, or control their smart homes. However, none of today's assistants speaks or understands the various dialects of Switzerland. This is a barrier to the widespread use of voice assistants in German-speaking Switzerland, because users cannot communicate in their true mother tongue. While Switzerland's NLP community is actively researching Swiss German speech recognition, there is little research on Swiss German text-to-speech synthesis, even though it plays a crucial role in voice assistants. Beyond voice assistants, such synthesis models could also be applied in many other contexts, such as podcasts, article narration, and automatic radio show generation.

Goals

At MTC, we started the Swiss Voice project to explore the technical possibilities of Swiss German voice assistants, with a focus on text-to-speech models. Swiss German is a low-resource language and a dialect continuum for which little data and no standardized written form are available. This makes building text-to-speech models challenging, since state-of-the-art models for languages such as English and German rely on large data sets of aligned text and speech. The Swiss Voice project addresses these challenges in the following steps.

Building a dataset for Swiss German speech synthesis. We are introducing the first annotated parallel corpus of spoken Swiss German across different dialects, together with High German reference transcripts. This data set will enable the NLP community to research and build powerful Swiss German speech synthesis models.

Building a system for Swiss German speech synthesis. We are building a system that translates High German text to Swiss German speech in different dialects, based on our data set and deep learning models for machine translation and speech synthesis.

Building the first voice assistant that understands and speaks Swiss German. Our models will be integrated into an existing voice assistant infrastructure to power sample question-answering applications for news, weather forecasts, and location-based assistance.

Outcomes

The SwissDial Data Set

As part of this project, we collected around 3 hours of voice recordings for each of 8 different dialects (AG, BE, BS, GR, LU, SG, VS, ZH), together with Swiss German and High German transcripts. To foster further research, we have made this data set, the most extensive parallel corpus of high-quality spoken Swiss German, available to the research community. Researchers can use it to explore speech synthesis methods and other technologies dealing with Swiss German text, such as dialect identification and machine translation, or to study linguistic properties of the various dialects.

Data set download
Data set paper
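Because every sentence in the corpus exists in High German and in several dialects, one parallel corpus can supply training pairs for both stages of a translate-then-synthesize system. The sketch below illustrates this under stated assumptions: the `ParallelEntry` schema, the lowercase dialect codes, and the `audio/<dialect>/<id>.wav` file layout are hypothetical and do not describe the published SwissDial file format, and the example sentences are made up.

```python
from dataclasses import dataclass, field

# Dialect codes from the text above; lowercase keys are an assumption.
DIALECTS = ("ag", "be", "bs", "gr", "lu", "sg", "vs", "zh")


@dataclass
class ParallelEntry:
    """One sentence of a parallel corpus (hypothetical schema)."""
    sentence_id: int
    high_german: str
    # Swiss German transcript per dialect code; a dialect may be missing.
    swiss_german: dict[str, str] = field(default_factory=dict)


def tts_pairs(entries, dialect, audio_root="audio"):
    """Return (Swiss German transcript, audio path) pairs for one dialect.

    These pairs would train the speech synthesis stage. The path layout
    audio_root/<dialect>/<sentence_id>.wav is a hypothetical convention.
    """
    if dialect not in DIALECTS:
        raise ValueError(f"unknown dialect: {dialect}")
    pairs = []
    for entry in entries:
        text = entry.swiss_german.get(dialect)
        if text:  # skip sentences not recorded in this dialect
            pairs.append((text, f"{audio_root}/{dialect}/{entry.sentence_id}.wav"))
    return pairs


def translation_pairs(entries, dialect):
    """Return (High German, Swiss German) pairs for the translation stage."""
    if dialect not in DIALECTS:
        raise ValueError(f"unknown dialect: {dialect}")
    return [
        (entry.high_german, entry.swiss_german[dialect])
        for entry in entries
        if dialect in entry.swiss_german
    ]


# Illustrative, made-up entries.
entries = [
    ParallelEntry(1, "Guten Morgen", {"zh": "Guete Morge"}),
    ParallelEntry(2, "Gute Nacht", {"be": "Guet Nacht"}),
]
```

For example, `tts_pairs(entries, "zh")` returns the Zurich German transcript with its audio path, while `translation_pairs(entries, "be")` returns the matching High German/Bernese German sentence pair.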

The Swiss Voice REST API

At the core of this project, we built a text-to-speech system for different Swiss German dialects. It first translates High German text into Swiss German text using machine translation methods, then converts the Swiss German text into Swiss German speech using a neural speech synthesis method. In addition to the voice assistant prototype, we deployed a Swiss Voice REST API that takes High German text as input and produces audio in all 8 supported dialects. The easy-to-use API makes it possible to apply our technology stack in many different contexts, such as voice assistants, article narration, podcast generation, or individualized radio moderation.
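The two-stage cascade described above (High German text to Swiss German text to Swiss German audio) can be sketched as follows. This is a minimal sketch, not the project's implementation: the `Translator` and `Synthesizer` interfaces, the `SwissVoicePipeline` class, and the stub components are hypothetical placeholders for the actual machine translation and neural speech synthesis models, and no real API endpoint is implied.

```python
from typing import Protocol

DIALECTS = ("ag", "be", "bs", "gr", "lu", "sg", "vs", "zh")


class Translator(Protocol):
    """Hypothetical interface for the machine translation stage."""
    def translate(self, high_german: str, dialect: str) -> str: ...


class Synthesizer(Protocol):
    """Hypothetical interface for the speech synthesis stage."""
    def synthesize(self, swiss_german: str, dialect: str) -> bytes: ...


class SwissVoicePipeline:
    """Cascade: High German text -> Swiss German text -> Swiss German audio."""

    def __init__(self, translator: Translator, synthesizer: Synthesizer):
        self.translator = translator
        self.synthesizer = synthesizer

    def speak(self, high_german: str, dialect: str) -> bytes:
        if dialect not in DIALECTS:
            raise ValueError(f"unsupported dialect: {dialect}")
        # Stage 1: translate into the written form of the target dialect.
        swiss_german = self.translator.translate(high_german, dialect)
        # Stage 2: synthesize audio from the dialect text.
        return self.synthesizer.synthesize(swiss_german, dialect)

    def speak_all(self, high_german: str) -> dict[str, bytes]:
        """Produce audio for every supported dialect from one input text."""
        return {d: self.speak(high_german, d) for d in DIALECTS}


# Stub components standing in for real models (placeholders only).
class EchoTranslator:
    def translate(self, high_german: str, dialect: str) -> str:
        return f"[{dialect}] {high_german}"


class BytesSynthesizer:
    def synthesize(self, swiss_german: str, dialect: str) -> bytes:
        return swiss_german.encode("utf-8")  # fake "audio"
```

Keeping the stages behind separate interfaces means each model could be trained on its own slice of the parallel corpus and replaced independently; `speak_all` mirrors the behavior of returning audio in all 8 dialects for one High German input.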

The Swiss Voice Assistant

We created the first voice assistant that speaks 8 different Swiss German dialects, using state-of-the-art neural speech synthesis. Thanks to a collaboration with recapp AG, which provides speech recognition services for Swiss German, our voice assistant can understand Swiss German as well. The jointly developed prototype answers everyday requests, such as reading the news headlines and weather forecasts, in any of the 8 dialects currently available.

Presentation at the Swiss Text Analytics Conference 2021

Publications

Public Open-Source Repositories

SwissDial Dataset

An annotated parallel corpus of spoken Swiss German across 8 major dialects (AG, BE, BS, GR, LU, SG, VS, ZH). The dataset includes around 3 hours of high-quality audio per dialect together with Swiss German and High German transcripts.

Swiss German Data Collection Tool

We collaborated with Fachhochschule Nordwestschweiz (FHNW) to add gamification features to their voice collection platform.