PSYchoacoustic features for PHOnation prediction
Psychoacoustic features for phonation prediction (PSYPHON) aims to create psychoacoustic-based predictors of phonation by means of Machine Learning (ML) techniques. One of the main outcomes of PSYPHON will be a software tool that predicts various types of phonation (e.g., modal, breathy, and creaky) based on psychoacoustic models (roughness, loudness, etc.). Other outcomes are the creation of psychoacoustic models using ML, the identification of psychoacoustic features related to different phonations, advancing our knowledge of the role of phonation in several languages, and helping to bridge the gap between under- and sufficiently-resourced languages.
There are ~6,000 languages for which Human Language Technologies (HLT) such as speech recognition systems are in early stages of development at best. In an increasingly HLT-dependent world, speakers of these languages are lagging behind those of sufficiently resourced languages. Some under-resourced languages such as Vietnamese and Thai are spoken by large populations (90M and 70M, respectively). Many of these languages use phonation (voice qualities such as modal, breathy, and creaky) to distinguish between units of sound (i.e., phonemic contrast). The speech recognition market in the Asia-Pacific region (where many of these languages are spoken) is expected to grow at the highest rate in the next three years. PSYPHON is a multidisciplinary project that aims at improving automatic detection of phonemic contrast by means of psychoacoustic models for better speech recognition systems. Psychoacoustic models map the non-linear relationships between physical quantities (frequency, pressure level, etc.) and their sensory counterparts (pitch, loudness, etc.). They also provide units (mel, sone, etc.), scales, and Just-Noticeable Differences (JNDs) that allow disparate phenomena to be compared consistently.
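As an illustration of such non-linear mappings, the following minimal sketch converts frequency to pitch in mel and loudness level to loudness in sone. The formulas are common textbook approximations chosen here for illustration (O'Shaughnessy's mel formula and the standard phon-to-sone rule, which applies at 40 phon and above); they are not necessarily the models PSYPHON will implement, and the function names are hypothetical.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Map frequency (Hz) to pitch (mel) with O'Shaughnessy's approximation.

    Assumed formula for illustration; other mel definitions exist.
    """
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def phon_to_sone(loudness_level_phon: float) -> float:
    """Map loudness level (phon) to loudness (sone).

    Uses the doubling-per-10-phon rule, which holds only from 40 phon upward.
    For a 1 kHz pure tone, the level in phon equals the level in dB SPL.
    """
    if loudness_level_phon < 40.0:
        raise ValueError("the 2**((L - 40) / 10) rule applies only at or above 40 phon")
    return 2.0 ** ((loudness_level_phon - 40.0) / 10.0)


# The same 100 Hz step is a much larger pitch step at low frequencies
# than at high frequencies, which is the non-linearity the text refers to.
low_step_mel = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step_mel = hz_to_mel(4100.0) - hz_to_mel(4000.0)
print(f"100 -> 200 Hz:   {low_step_mel:.1f} mel")
print(f"4000 -> 4100 Hz: {high_step_mel:.1f} mel")

# A 10 phon increase corresponds to a doubling of loudness in sone.
print(f"40 phon = {phon_to_sone(40.0):.1f} sone, 50 phon = {phon_to_sone(50.0):.1f} sone")
```

Expressing both steps in mel (or both levels in sone) is what makes them directly comparable, which is the property the project relies on.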
Researchers usually rely on predictive models that relate phonation to acoustic features such as spectral tilt (the slope of the speech power spectral density) or to articulatory features (e.g., open quotient: the fraction of a glottal cycle during which the glottis is open). However, there are many ways in which speakers can produce a given phonation. Articulatory variations correlate differently with acoustic and physiological features, so the performance of these predictors varies with the predominant articulation within a group of speakers. This variation hinders comparison across languages, since researchers often choose the models that best suit their data. For example, one researcher may measure spectral tilt as the magnitude difference between the 2nd and 1st harmonics, whereas another uses the difference between the 4th and 2nd harmonics; these two spectral tilt measures cannot be compared directly. Speakers using phonemic contrast may differ physiologically in the way they produce it, but ultimately, their interlocutors are able to distinguish different phonations when necessary. Thus, perceptual features of speech at the listener’s end, as opposed to unprocessed acoustic or physiological features at the speaker’s end, could be suitable for predicting phonation.
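The comparability problem can be made concrete with a small sketch. It synthesizes a harmonic signal with made-up harmonic amplitudes (an assumption for illustration, not real speech data) and computes two spectral tilt measures in the conventional H1-H2 and H2-H4 form. All names and parameter values are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 8000      # Hz (assumed)
F0 = 100.0              # fundamental frequency in Hz (assumed)
N_SAMPLES = 800         # 0.1 s, i.e. ten whole periods of F0
HARMONIC_AMPLITUDES = [1.0, 0.5, 0.3, 0.125]  # made-up amplitudes of harmonics 1..4


def synthesize(amplitudes, f0, sample_rate, n_samples):
    """Sum of cosines at integer multiples of f0."""
    return [
        sum(a * math.cos(2.0 * math.pi * (k + 1) * f0 * n / sample_rate)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]


def harmonic_level_db(signal, harmonic, f0, sample_rate):
    """Level (dB re amplitude 1) of one harmonic, by projecting the signal
    onto a complex exponential at that harmonic's frequency."""
    omega = 2.0 * math.pi * harmonic * f0 / sample_rate
    projection = sum(x * cmath.exp(-1j * omega * n) for n, x in enumerate(signal))
    amplitude = 2.0 * abs(projection) / len(signal)
    return 20.0 * math.log10(amplitude)


signal = synthesize(HARMONIC_AMPLITUDES, F0, SAMPLE_RATE, N_SAMPLES)
h1, h2, h4 = (harmonic_level_db(signal, k, F0, SAMPLE_RATE) for k in (1, 2, 4))

h1_minus_h2 = h1 - h2   # tilt measure used by one hypothetical study
h2_minus_h4 = h2 - h4   # tilt measure used by another hypothetical study
print(f"H1-H2 = {h1_minus_h2:.2f} dB")
print(f"H2-H4 = {h2_minus_h4:.2f} dB")
```

Both values are in dB and both describe the same signal, yet they differ because they sample different regions of the spectrum, so a value reported with one measure says little about the other.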
We will use Recurrent Neural Networks (RNNs), an ML technique, to address the aforementioned problems. Specifically, RNNs will be used to build predictors of psychoacoustic features, upon which an ML-based classifier of phonation segments will also be built. As illustrated in the figure on the right, we will create acoustic stimuli and train RNNs to predict psychoacoustic features (pitch, roughness, sharpness, etc.). As ground truth, we will use subjective responses to those stimuli which are already reported in the literature (chiefly in ). Additionally, we will compile, curate, and annotate different corpora of languages with phonemic contrast. Time series of psychoacoustic features will be computed from these corpora and used for training an RNN classifier of phonation. RNNs are especially well suited for modeling temporal variations in data series: in our case, the length and contents of speech segments are in general unknown, and psychoacoustic features vary over the course of the stimuli.
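To show why a recurrent model fits segments of unknown length, here is a minimal sketch of the forward pass of a simple (Elman-type) RNN classifier written with NumPy. It is not the project's architecture: the cell type, layer sizes, and feature set are assumptions, the weights are random and untrained, and the input is random data standing in for psychoacoustic time series.

```python
import numpy as np

PHONATION_CLASSES = ["modal", "breathy", "creaky"]  # classes named in the text
N_FEATURES = 4   # assumed: e.g. pitch, loudness, roughness, sharpness per frame
N_HIDDEN = 8     # assumed hidden-state size

rng = np.random.default_rng(0)
# Untrained, randomly initialized parameters (placeholders for learned weights).
W_in = rng.normal(scale=0.1, size=(N_HIDDEN, N_FEATURES))
W_rec = rng.normal(scale=0.1, size=(N_HIDDEN, N_HIDDEN))
b_hidden = np.zeros(N_HIDDEN)
W_out = rng.normal(scale=0.1, size=(len(PHONATION_CLASSES), N_HIDDEN))
b_out = np.zeros(len(PHONATION_CLASSES))


def classify_segment(features: np.ndarray) -> np.ndarray:
    """Return class probabilities for one segment.

    features has shape (n_frames, N_FEATURES); n_frames may differ per segment.
    """
    hidden = np.zeros(N_HIDDEN)
    for frame in features:
        # The same weights are reused at every frame, so any length is accepted.
        hidden = np.tanh(W_in @ frame + W_rec @ hidden + b_hidden)
    logits = W_out @ hidden + b_out
    exp_logits = np.exp(logits - logits.max())  # numerically stable softmax
    return exp_logits / exp_logits.sum()


# Two stand-in segments of different durations (random data, not real speech).
short_segment = rng.normal(size=(12, N_FEATURES))
long_segment = rng.normal(size=(57, N_FEATURES))

for name, segment in (("short", short_segment), ("long", long_segment)):
    probabilities = classify_segment(segment)
    print(name, dict(zip(PHONATION_CLASSES, probabilities.round(3))))
```

Because the hidden state summarizes everything seen so far, the classifier needs no fixed segment length; in practice a gated cell (LSTM or GRU) and a training loop in a deep learning framework would replace this hand-written forward pass.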
PSYPHON will provide new insights into phonation as a contrastive cue, improve current prediction methods, and support the development of HLT in under-resourced languages where phonation is a key component. Because psychoacoustic features have well-defined units, audibility thresholds, and JNDs, the use of predictors based on psychoacoustics eases the comparison between different corpora and phenomena.
In contrast with current predictors, our proposal focuses on the reception/feedback end of the speech chain. This focus seems appropriate for describing and studying phonation, since perceptual quantities are more closely related to the phonemic classification made by listeners than acoustic features, which may be correlated with the specific articulatory means used by speakers.
To ease the management of PSYPHON, we have divided the project into five Work Packages (WPs) distributed in time as illustrated in Fig. 2. WP1 tackles the issues identified in (A), including acquiring, selecting, annotating, and evaluating acoustic recordings of speech. When acquiring new corpora, we will also perform glottography and record other articulatory features for completeness. In WP2, we will work on the issues discussed in (B), i.e., the implementation and evaluation of psychoacoustic models. Tasks related to ML-based phonation classification (development, training, and evaluation of RNNs, ablation studies, etc.) will be conducted in WP3. This WP requires the corpora and psychoacoustic models from WP1 and WP2. For completeness, we will perform perceptual tests to assess the classification accuracy of native listeners and compare it with the performance of our classifier. Dissemination and demonstration tasks are contemplated in WP4. Besides research publications, all the software created in the context of PSYPHON (psychoacoustic models, phonation classifier, etc.) will be placed in public repositories licensed as Attribution-ShareAlike 4.0 International. Finally, administrative tasks (e.g., overall coordination, objective achievement, quality assurance, and punctuality of reports and demonstrations) are included in WP5.
Signal processing, phonetics, and psychoacoustics are fields where Prof. Villegas has substantial research experience. His team will be in charge of providing computational models of psychoacoustic features and extracting their time series from recorded speech. He will lead WP2, WP4, and WP5. Prof. Markov has ample experience in speech signal processing, statistical modeling, and machine learning. His team will be in charge of the tasks related to WP3: phonation classification based on time series of psychoacoustic features, ablation studies, etc. He will also support the development of WP2. Prof. Lee is an expert in the phonetics and phonology of understudied languages, with special interest in those where phonemic contrast is used, such as Xitsonga (South Africa), Burmese (Myanmar), and Zhuang (China). He will be in charge of WP1, i.e., providing curated and annotated corpora to train, develop, and evaluate ML systems. WP4 will be conducted collaboratively by all team members.
Julián Villegas: http://onkyo.u-aizu.ac.jp/
Konstantin Markov: https://www.u-aizu.ac.jp/
Seunghun Lee: https://researchers.icu.ac.jp
Project title: "PSYPHON: Psychoacoustic features for phonation prediction".
This project is supported by JSPS KAKENHI Grant Number 20K11956.