Sakriani Sakti
The development of advanced spoken language technologies based on automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has enabled computers to learn either how to listen or how to speak. Many applications and services are now available, but they still support fewer than 100 languages. Nearly 7000 living languages, spoken by 350 million people, remain uncovered. This is because such systems are commonly built with machine learning trained in a supervised fashion, which requires a large amount of paired speech and corresponding transcription. In this talk, we will introduce a semi-supervised learning mechanism based on a machine speech chain framework. First, we describe the primary machine speech chain architecture, which learns not only to listen or speak but also to listen while speaking. The framework enables ASR and TTS to teach each other given unpaired data. After that, we describe the use of the machine speech chain for code-switching and cross-lingual ASR and TTS in several languages, including low-resourced ethnic languages. Finally, we describe the recent multimodal machine chain, which mimics overall human communication by listening while speaking and visualizing. With the support of image captioning and image production models, the framework enables ASR and TTS to improve their performance using an image-only dataset.
Sakriani Sakti is currently an associate professor at the Japan Advanced Institute of Science and Technology (JAIST), Japan; an adjunct associate professor at the Nara Institute of Science and Technology (NAIST), Japan; a visiting research scientist at the RIKEN Center for Advanced Intelligence Project (RIKEN AIP), Japan; and an adjunct professor at the University of Indonesia. She received the DAAD-Siemens Program Asia 21st Century Award in 2000 to study Communication Technology at the University of Ulm, Germany, and received her MSc degree in 2002. During her thesis work, she worked with the Speech Understanding Department, DaimlerChrysler Research Center, Ulm, Germany. She then worked as a researcher at ATR Spoken Language Communication (SLC) Laboratories, Japan, in 2003-2009, and at the NICT SLC Groups, Japan, in 2006-2011, which established multilingual speech recognition for speech-to-speech translation. While working with ATR and NICT, she continued her studies (2005-2008) with the Dialog Systems Group, University of Ulm, Germany, and received her Ph.D. degree in 2008. She was actively involved in international collaboration activities such as the Asian Pacific Telecommunity Project (2003-2007) and various speech-to-speech translation research projects, including A-STAR and U-STAR (2006-2011). In 2011-2017, she was an assistant professor at the Augmented Human Communication Laboratory, NAIST, Japan. She also served as a visiting scientific researcher at INRIA Paris-Rocquencourt, France, in 2015-2016, under the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation. In 2018-2021, she was a research associate professor at NAIST and a research scientist at RIKEN AIP, Japan. She is a member of JNS, SFN, ASJ, ISCA, IEICE, and IEEE.
Furthermore, she is currently a committee member of the IEEE SLTC (2021-2023) and an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020-2023). She was a board member of Spoken Language Technologies for Under-resourced languages (SLTU) and the general chair of SLTU2016. She was also the general chair of the "Digital Revolution for Under-resourced Languages (DigRevURL)" Workshop, held as an Interspeech Special Session in 2017, and of DigRevURL Asia in 2019. She served on the organizing committee of the Zero Resource Speech Challenge in 2019 and 2020. She was also involved in creating the joint ELRA and ISCA Special Interest Group on Under-resourced Languages (SIGUL) and has served on the SIGUL Board since 2018. Last year, in collaboration with UNESCO and ELRA, she also served on the organizing committee of the International Conference "Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide". Her research interests lie in deep learning and graphical model frameworks, statistical pattern recognition, zero-resourced speech technology, multilingual speech recognition and synthesis, spoken language translation, social-affective dialog systems, and cognitive communication.