Duration and speed of speech events: A selection of methods

Keywords

speech timing
Polish
English
speech technology

How to Cite

Gibbon, D., Klessa, K., & Bachan, J. (2014). Duration and speed of speech events: A selection of methods. Lingua Posnaniensis, 56(1), 59–83. https://doi.org/10.2478/linpo-2014-0004

Abstract

The study of speech timing, i.e. the duration and speed or tempo of speech events, has increased in importance over the past twenty years, in particular in connection with increased demands for accuracy, intelligibility and naturalness in speech technology, with applications in language teaching and testing, and with the study of speech timing patterns in language typology. However, the methods used in such studies are very diverse, and so far there is no accessible overview of these methods. Since the field is too broad for us to provide an exhaustive account, we have made two choices: first, to provide a framework of paradigmatic (classificatory), syntagmatic (compositional) and functional (discourse-oriented) dimensions for duration analysis; and second, to provide worked examples of a selection of methods associated primarily with these three dimensions. Some of the methods which are covered are established state-of-the-art approaches (e.g. the paradigmatic Classification and Regression Trees, CART, analysis), others are discussed in a critical light (e.g. so-called ‘rhythm metrics’). A set of syntagmatic approaches applies to the tokenisation and tree parsing of duration hierarchies, based on speech annotations, and a functional approach describes duration distributions with sociolinguistic variables. Several of the methods are supported by a new web-based software tool for analysing annotated speech data, the Time Group Analyser.

