Will Styler gave a colloquium at Michigan State University entitled “Ask an Algorithm: Using Machine Learning to study Human Speech.” His research specialization is in acoustic phonetics and speech perception, with a strongly technical and computational approach. His main focus has been on the perception of vowel quality and vowel nasality in the face of limited context and high variability, but he is also interested in natural language processing and other forms of extracting meaningful signal from incredible noise.
Machine learning, the use of nuanced computer models to analyze and predict data, has a long history in speech recognition and natural language processing, but has largely been limited to more applied engineering tasks.  This talk will describe two more research-focused applications of machine learning in the study of speech perception and production.  

For speech perception, we'll examine the difficult problem of identifying acoustic cues to a complex phonetic contrast, in this case, vowel nasality.  Here, by training machine learning algorithms on acoustic measurements, we can more directly measure the informativeness of the various acoustic features to the contrast.  This by-feature informativeness data was then used to generate hypotheses about human cue usage, and then to model the observed human patterns of perception, showing that these models were able to predict not only the utilized cue, but also the subtle patterns of perception arising from changes to less informative features.  
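As an illustration only (this is not the talk's actual pipeline, cue set, or data), the feature-informativeness idea can be sketched by training a deliberately simple per-feature classifier on fabricated acoustic measurements and ranking features by how well each one alone separates oral from nasal tokens.  The feature names (A1-P0, duration, F1 bandwidth) and all distributions below are invented for the example:

```python
# Hypothetical sketch: estimate how informative each acoustic feature is
# for an oral/nasal vowel contrast by scoring a one-feature threshold
# classifier per feature.  All feature names and values are fabricated.
import random

random.seed(0)

FEATURES = ["A1-P0", "duration", "F1_bandwidth"]

def synth_token(nasal):
    # In this toy data, A1-P0 strongly separates the classes;
    # duration and F1 bandwidth only weakly.
    return {
        "A1-P0": random.gauss(-2.0 if nasal else 5.0, 2.0),
        "duration": random.gauss(130 if nasal else 120, 25),
        "F1_bandwidth": random.gauss(95 if nasal else 80, 40),
    }

# 200 synthetic tokens per class (0 = oral, 1 = nasal).
data = [(synth_token(nasal), nasal) for nasal in (0, 1) for _ in range(200)]

def feature_accuracy(feat):
    # "Train" a one-feature classifier: threshold at the midpoint of the
    # class means, then score it on the same data (a rough informativeness proxy).
    means = {c: sum(x[feat] for x, y in data if y == c) / 200 for c in (0, 1)}
    thresh = (means[0] + means[1]) / 2
    nasal_is_high = means[1] > means[0]  # which side of the threshold is "nasal"
    correct = sum(((x[feat] > thresh) == nasal_is_high) == bool(y)
                  for x, y in data)
    return correct / len(data)

# Rank features by how well each one alone predicts the contrast.
ranking = sorted(FEATURES, key=feature_accuracy, reverse=True)
print(ranking)  # A1-P0 should rank as most informative
```

A real study would use held-out data and a richer model, but the ranking step captures the core idea: the classifier's per-feature performance becomes a direct, quantitative measure of each cue's informativeness.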

For speech production, we'll focus on data from Electromagnetic Articulography (EMA), which provides position data for the articulators with high temporal and spatial resolution, and discuss our ongoing efforts to identify and characterize pause postures (specific vocal tract configurations at prosodic boundaries, cf. Katsika et al. 2014) in the speech of 7 speakers of American English.  Here, the lip aperture trajectories of 800+ individual pauses were gold-standard annotated by a member of the research team and then subjected to principal component analysis.  These analyses were then used to train a support vector machine (SVM) classifier, which achieved 94% classification accuracy in cross-validation tests, with a Cohen's kappa of 0.88 for machine-to-annotator agreement, suggesting the potential for improvements in speed, consistency, and objective characterization of gestures.  
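The curve-classification pipeline can likewise be sketched on fabricated data: synthetic "lip aperture" trajectories are reduced with principal component analysis (computed here by power iteration), and the two pause types are then separated along the first component.  The talk used an SVM classifier; for brevity this sketch substitutes a simple threshold on the PC1 scores, and the trajectory shapes are invented, not EMA measurements:

```python
# Hedged sketch: PCA over synthetic trajectory curves, then a simple
# linear decision on the first principal component (an SVM stand-in).
import math
import random

random.seed(1)
N_SAMPLES, N_POINTS = 60, 50  # 60 fabricated curves per class, 50 time points

def trajectory(pause_posture):
    # Toy shapes: a "pause posture" curve has a sustained dip in lip
    # aperture mid-trajectory; other pauses stay comparatively flat.
    depth = 0.8 if pause_posture else 0.1
    return [1.0 - depth * math.sin(math.pi * t / (N_POINTS - 1))
            + random.gauss(0, 0.05) for t in range(N_POINTS)]

curves = [(trajectory(label), label)
          for label in (0, 1) for _ in range(N_SAMPLES)]

# Center the curves and find the first principal component by power iteration.
mean = [sum(c[i] for c, _ in curves) / len(curves) for i in range(N_POINTS)]
X = [[c[i] - mean[i] for i in range(N_POINTS)] for c, _ in curves]

pc = [random.gauss(0, 1) for _ in range(N_POINTS)]
for _ in range(50):
    # Multiply by X^T X without forming the covariance matrix explicitly.
    proj = [sum(x[i] * pc[i] for i in range(N_POINTS)) for x in X]
    pc = [sum(s * x[i] for s, x in zip(proj, X)) for i in range(N_POINTS)]
    norm = math.sqrt(sum(v * v for v in pc))
    pc = [v / norm for v in pc]

scores = [sum(x[i] * pc[i] for i in range(N_POINTS)) for x in X]
labels = [y for _, y in curves]

# Separate the classes at the midpoint of their mean PC1 scores.
m0 = sum(s for s, y in zip(scores, labels) if y == 0) / N_SAMPLES
m1 = sum(s for s, y in zip(scores, labels) if y == 1) / N_SAMPLES
thresh = (m0 + m1) / 2
pred = [int((s > thresh) == (m1 > m0)) for s in scores]
accuracy = sum(p == y for p, y in zip(pred, labels)) / len(labels)
print(f"PC1 separation accuracy: {accuracy:.2f}")
```

The design choice mirrors the described pipeline: PCA turns each whole curve into a handful of scores, and a classifier over those scores can then stand in for the human annotator's judgment of whether a pause posture is present.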

These methods of modeling feature importance and classifying curves using machine learning both demonstrate concrete approaches that are applicable to a variety of questions in phonetics, and potentially to linguistics in general.