What is articulatory speech synthesis? A plain explanation

Articulatory speech synthesis: synthetic speech and vocal tract models

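Audio sample: the German sentence "Lea und Doreen mögen Bananen"

("Lea and Doreen like bananas"), reproduced with a consonant-vowel coarticulation model
from the fundamental frequency and segment durations of a naturally spoken utterance.^[1]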

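Articulatory synthesis, also called articulatory speech synthesis (Japanese: 調音合成 or 調音音声合成), is a computational technique for synthesizing speech based on models of the human vocal tract and the articulation processes that take place in it. The shape of the vocal tract can be controlled in a number of ways, usually by changing the positions of articulators such as the tongue, jaw, and lips. Speech is generated by digitally simulating the flow of air through this representation of the vocal tract.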
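As a rough illustration of how a vocal tract shape maps to sound, the sketch below treats the tract as a chain of short, lossless cylindrical tubes and finds its resonances (formant candidates). This is a simplified frequency-domain calculation, not the time-domain airflow simulation used by any synthesizer named in this article. The function names, the physical constants, and the ideal boundary conditions (closed glottis, zero pressure at the lips) are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 350.0  # m/s; assumed value for warm, humid air
AIR_DENSITY = 1.14      # kg/m^3; assumed value


def glottis_to_lip_flow_ratio(areas_cm2, section_length_m, freq_hz):
    """Return D = U_glottis / U_lips for a lossless tube chain with P = 0 at the lips.

    areas_cm2 lists cross-sectional areas from glottis to lips; every
    section is assumed to have the same length (hypothetical layout).
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    # Accumulated chain (ABCD) matrix from glottis to lips, starting as identity.
    a, b, c, d = 1 + 0j, 0j, 0j, 1 + 0j
    for area in areas_cm2:
        z = AIR_DENSITY * SPEED_OF_SOUND / (area * 1e-4)  # characteristic impedance
        cs = math.cos(k * section_length_m)
        sn = math.sin(k * section_length_m)
        sa, sb, sc, sd = cs, 1j * z * sn, 1j * sn / z, cs
        # Right-multiply the accumulated matrix by this section's matrix.
        a, b, c, d = (a * sa + b * sc, a * sb + b * sd,
                      c * sa + d * sc, c * sb + d * sd)
    return d.real  # D is purely real when there are no losses


def resonances(areas_cm2, section_length_m, f_max=4000.0, step=1.0):
    """Find frequencies where D crosses zero, i.e. where 1/D peaks."""
    def d_at(f):
        return glottis_to_lip_flow_ratio(areas_cm2, section_length_m, f)

    found = []
    f_prev = step / 2.0  # offset grid so round-number roots fall between samples
    d_prev = d_at(f_prev)
    f = f_prev + step
    while f <= f_max:
        d = d_at(f)
        if d_prev * d < 0.0:
            lo, hi = f_prev, f
            for _ in range(50):  # refine the bracketed root by bisection
                mid = 0.5 * (lo + hi)
                if d_at(lo) * d_at(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            found.append(0.5 * (lo + hi))
        f_prev, d_prev = f, d
        f += step
    return found


if __name__ == "__main__":
    # Uniform 17.5 cm tube split into 10 sections of 5 cm^2 each.
    print(resonances([5.0] * 10, 0.0175))
```

For the uniform tube in the usage line, the result should be close to the textbook quarter-wavelength values of 500, 1500, 2500, and 3500 Hz. Changing individual entries of the area list, as a moving tongue or jaw would, shifts those resonances.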

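Mechanical talking heads

See also: Speech synthesis § History

There is a long history of attempts to build mechanical "talking heads".^[2] Gerbert of Aurillac (d. 1003), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294) are all said to have built speaking heads (Wheatstone 1837^[citation needed]). Historically confirmed speech synthesis, however, begins with Christian Kratzenstein (1723–1795)^[3] and Wolfgang von Kempelen (1734–1804); Kempelen published an account of his research in 1791^[4] (see also Dudley & Tarnoczy (1950)).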

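Electronic vocal tracts

The first electronic analog vocal tracts were static, such as those of Dunn (1950), Stevens, Kasowski & Fant (1953), and Fant (1960). Rosen (1958) built a dynamic vocal tract (DAVO), which Dennis (1963) later attempted to control by computer. Dennis et al. (1964)^[citation needed], Hiki et al. (1968)^[citation needed], and Baxter & Strong (1969) also described analog vocal-tract hardware.

The first computer simulation was carried out by Kelly & Lochbaum (1962). Later digital computer simulations were made by, for example, Nakata & Mitsuoka (1965), Matsui (1968), and Mermelstein (1971)^[citation needed]. Honda, Inoue & Ogawa (1968) made an analog computer simulation.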

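The Haskins and Maeda models

The first software articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. Known as ASY (Articulatory Synthesis)^[5], it was a computational model of speech production based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and their colleagues. Another well-known and frequently used model is that of Shinji Maeda, which uses a factor-based approach to control tongue shape.^[citation needed]^[clarification needed]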

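Modern models

Recent progress in speech production imaging, articulatory control modeling, and tongue biomechanics modeling is changing the way articulatory synthesis is performed.^[6] One example is the Haskins CASY model (Configurable Articulatory Synthesis)^[9], designed by Philip Rubin, Mark Tiede,^[7] and Louis Goldstein,^[8] which matches midsagittal vocal tract outlines to actual magnetic resonance imaging (MRI) data and uses MRI data to construct a 3D model of the vocal tract. A full 3D articulatory synthesis model has been described by Olov Engwall.^[10]^[11] A geometrically based^[citation needed] 3D articulatory speech synthesizer has been developed by Peter Birkholz (see VocalTractLab^[12]). The ArtiSynth project,^[13] headed by Sidney Fels^[14] at the University of British Columbia, provides a 3D biomechanical modeling toolkit for the human vocal tract and upper airway. Biomechanical modeling of articulators such as the tongue has been pioneered by a number of scientists, including Reiner Wilhelms-Tricarico,^[15] Yohan Payan^[16] and Jean-Michel Gerard,^[17] and Jianwu Dang^[18] and Kiyoshi Honda.^[19]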

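Commercial models

One of the few commercial articulatory speech synthesis systems was a NeXT-based system developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, Canada, where much of the original research was conducted. After the demise of the various incarnations of NeXT (founded by Steve Jobs in 1985 and acquired by Apple Computer in 1997), the Trillium software was released under the GNU General Public License and continues as Gnuspeech.^[20] The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts^[24] (the Tube Resonance Model, TRM^[25]), controlled by René Carré's^[21] "Distinctive Region Model" (DRM).^[22]^[23]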
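To give a feel for what a waveguide (tube) model computes, the sketch below shows a single step of the classic Kelly-Lochbaum formulation: each change in cross-sectional area between adjacent tube sections partly reflects and partly transmits the travelling pressure waves. It is a generic textbook illustration, not the Tube Resonance Model or any Gnuspeech code; the function names and the sign convention are assumptions made for the example.

```python
def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction.

    areas lists cross-sectional areas from glottis to lips. With this
    convention, a junction into a narrower section gives r > 0 and a
    junction into a wider section gives r < 0.
    """
    return [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]


def scatter(forward_in, backward_in, r):
    """Scatter pressure waves at one junction.

    forward_in arrives from the glottis side and backward_in from the
    lip side. Returns (forward_out, backward_out), the waves leaving
    the junction toward the lips and toward the glottis respectively.
    """
    forward_out = (1.0 + r) * forward_in - r * backward_in
    backward_out = r * forward_in + (1.0 - r) * backward_in
    return forward_out, backward_out


if __name__ == "__main__":
    areas = [1.0, 3.0]  # hypothetical areas in cm^2: narrow section, then wide
    (r,) = reflection_coefficients(areas)
    print(r, scatter(1.0, 0.0, r))
```

A time-domain simulation would apply `scatter` at every junction on each sample and shift the waves one section along the tube, with additional reflections at the glottis and lips. The formulas keep both pressure and volume flow continuous across the junction, which is the physical constraint the reflection coefficient is derived from.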

Footnotes

References

Baxter, Brent; Strong, William J. (1969), “WINDBAG—a vocal-tract analog speech synthesizer”, Journal of the Acoustical Society of America 45: 309(A), doi:10.1121/1.1971456
Birkholz, P.; Jackel, D.; Kröger, B.J. (2007), “Simulation of losses due to turbulence in the time-varying vocal system”, IEEE Transactions on Audio, Speech, and Language Processing 15: 1218–1225
Birkholz, P.; Jackel, D.; Kröger, B.J. (2006), “Construction and control of a three-dimensional vocal tract model”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006) (Toulouse, France): 873–876
Coker, C. H. (1968), “Speech synthesis with a parametric articulatory model”, Proc. Speech. Symp., Kyoto, Japan, paper A-4
Coker, C. H. (1976). “A model for articulatory dynamics and control”. Proceedings of the IEEE 64 (4): 452–460. doi:10.1109/PROC.1976.10154.
Coker, C. H.; Fujimura, O. (1966). “Model for the specification of the vocal tract area function”. Journal of the Acoustical Society of America 40: 1271. doi:10.1121/1.2143456.
Dennis, Jack B. (1963), “Computer control of an analog vocal tract”, Journal of the Acoustical Society of America 35: 1115(A)
Dudley, Homer; Tarnoczy, Thomas H. (1950). “The speaking machine of Wolfgang von Kempelen”. Journal of the Acoustical Society of America 22 (2): 151–66. doi:10.1121/1.1906583.
Dunn, Hugh K. (1950). “Calculation of vowel resonances, and an electrical vocal tract”. Journal of the Acoustical Society of America 22 (6): 740–53. doi:10.1121/1.1906681.
Engwall, O. (2003), “Combining MRI, EMA & EPG measurements in a three-dimensional tongue model”, Speech Communication 41: 303–329, doi:10.1016/S0167-6393(02)00132-2
Fant, C. Gunnar M (1960), Acoustic theory of speech production, The Hague: Mouton
Fant, Gunnar (1970), Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, Mouton/Walter de Gruyter, ISBN 9789027916006
Gariel, M. (1879). “Machine parlante de M. Faber”. J. Physique Théorique et Appliquée 8: 274–5. doi:10.1051/jphystap:018790080027401.
Gerard, J.M.; Wilhelms-Tricarico, R.; Perrier, P.; Payan, Y. (2003). “A 3D dynamical biomechanical tongue model to study speech motor control”. Recent Research Developments in Biomechanics 1: 49–64.
Henke, W. L. (1966), “Dynamic Articulatory Model of Speech Production Using Computer Simulation”, Unpublished doctoral dissertation, MIT, Cambridge, MA.
Honda, Takashi; Inoue, Seiichi; Ogawa, Yasuo (1968), Kohasi, Y., ed., “A hybrid control system of a human vocal tract simulator”, Reports of the 6th International Congress on Acoustics (Tokyo: International Council of Scientific Unions): 175–8
Kelly, John L.; Lochbaum, Carol (1962), “Speech synthesis”, Proceedings of the Speech Communications Seminar, paper F7 (Stockholm, Speech Transmission Laboratory, Royal Institute of Technology)

Kempelen, Wolfgang R. Von (1791), Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine, Wien: J. B. Degen
Maeda, Shinji (1988), “Improved articulatory models”, Journal of the Acoustical Society of America 84 (Sup. 1): S146, doi:10.1121/1.2025845
Maeda, Shinji (1990), “Compensatory articulation during speech: evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model”, in W. J. Hardcastle & A. Marchal, ed., Speech Production and Speech Modelling, Dordrecht: Kluwer Academic, pp. 131–149
Matsui, Eiichi (1968), Kohasi, Y., ed., “Computer-simulated vocal organs”, Reports of the 6th International Congress on Acoustics (Tokyo: International Council of Scientific Unions): 151–4
Mermelstein, Paul. (1969), Walker, D. E., ed., “Computer simulation of articulatory activity in speech production”, Proceedings of the International Joint Conference on Artificial Intelligence, Washington, D.C., 1969 (New York: Gordon & Breach)
Mermelstein, P. (1973). “Articulatory model for the study of speech production”. Journal of the Acoustical Society of America 53 (4): 1070–1082. doi:10.1121/1.1913427. PMID 4697807.
Mrayati, M.; Carré, R.; Guérin, B. (1988), “Distinctive regions and modes: a new theory of speech production”, Speech Communication 7 (3): 257–286, October 1988, doi:10.1016/0167-6393(88)90073-8
Mrayati, M.; Carré, R.; Guérin, B. (1990), “Distinctive regions and modes: articulatory-acoustic-phonetic aspects: A reply to Boë and Perrier's comments”, Speech Communication 9 (3): 231–238, June 1990, doi:10.1016/0167-6393(90)90059-I
Nakata, Kazuo; Mitsuoka, T. (1965). “Phonemic transformation and control aspects of synthesis of connected speech”. J. Radio Res. Labs. 12: 171–86.
Paget, R. (1930), Human Speech, New York: Harcourt
Rahim, M.; Goodyear, C.; Kleijn, W.; Schroeter, J.; Sondhi, M. (1993). “On the use of neural networks in articulatory speech synthesis”. Journal of the Acoustical Society of America 93 (2): 1109–1121. doi:10.1121/1.405559.
Rosen, George (1958). “Dynamic analog speech synthesizer”. Journal of the Acoustical Society of America 30 (3): 201–9. doi:10.1121/1.1909541.
Rubin, P. E.; Baer, T.; Mermelstein, P. (1981). “An articulatory synthesizer for perceptual research”. Journal of the Acoustical Society of America 70 (2): 321–328. doi:10.1121/1.386780.
Rubin, P.; Saltzman, E.; Goldstein, L.; McGowan, R.; Tiede, M.; Browman, C. (1996), “CASY and extensions to the task-dynamic model”, Proceedings of the 1st ESCA Tutorial and Research Workshop on Speech Production Modeling - 4th Speech Production Seminar: 125–128
Stevens, Kenneth N.; Kasowski, S.; Fant, C. Gunnar M. (1953). “An electrical analog of the vocal tract”. Journal of the Acoustical Society of America 25 (4): 734–42. doi:10.1121/1.1907169.

External links

“Smithsonian Speech Synthesis History Project (SSSHP) 1986-2002”. Archived from the original on October 3, 2013. Retrieved May 28, 2014.

Introduction to Articulatory Speech Synthesis
Simulated singing with the singing robot Pavarobotti or a description from the BBC on how the robot synthesized the singing.

[1] Birkholz, Peter (2013). “Modeling Consonant-Vowel Coarticulation for Articulatory Speech Synthesis”. PLOS ONE 8 (4): e60603. Bibcode: 2013PLoSO...860603B. doi:10.1371/journal.pone.0060603. PMC 3628899. PMID 23613734.

[2] Rubin, Philip; Vatikiotis-Bateson, Eric (1998–2006), Talking Heads, Haskins Laboratories

[3] Paget 1930

[4] Kempelen 1791

[5] Articulatory Synthesis, Haskins Laboratories

[6] “15th ICPhS - Barcelona 2003 - Programme”, The 15th International Congress of Phonetic Sciences, Barcelona, 2003 (International Phonetic Association), archived from the original on 2007-05-22.

[7] Mark Tiede, Haskins Laboratories

[8] Louis M. Goldstein, Haskins Laboratories

[9] CASY, Haskins Laboratories

[10] Olov Engwall, Sweden: Royal Institute of Technology (KTH), http://www.speech.kth.se/~olov/

[11] Engwall 2003

[12] Peter Birkholz, VocalTractLab, http://www.vocaltractlab.de/, "An articulatory speech synthesizer and tool to visualize and explore the mechanism of speech production with regard to articulation, acoustics, and control."

[13] ArtiSynth, Canada: University of British Columbia, "A 3D Biomechanical Modeling Toolkit for Physical Simulation of Anatomical Structures"

[14] Sidney Fels, Canada: University of British Columbia, http://www.ece.ubc.ca/~ssfels/

[15] Reiner Wilhelms-Tricarico, Haskins Laboratories

[16] Yohan Payan, TIMC-IMAG, http://www-timc.imag.fr/Yohan.Payan/

[17] http://www-timc.imag.fr/gmcao/en-fiches-projets/modele-langue.htm, TIMC-IMAG

[18] Intelligent Information Processing Laboratory (Dang Lab), JAIST, http://iipl.jaist.ac.jp/dang-lab/en/

[19] Honda, Kiyoshi (Spring 2004), “生体イメージングによる音声生成機構の観測” [Observation of speech production mechanisms by biological imaging], ATR Journal (51)

[20] Gnuspeech, GNU Project, Free Software Foundation (FSF)

[21] René Carré, Dynamique Du Langage, CNRS

[22] Mrayati, Carre & Guerin 1988

[23] Mrayati, Carre & Guerin 1990

[24] Hill, David; Manzara, Leonard; Schock, Craig (1995), “Real-time articulatory speech-synthesis-by-rules”, Proc. AVIOS Symposium: 27–44

[25] Manzara, Leonard, “The Tube Resonance Model Speech Synthesizer”, 49th Meeting of the Acoustical Society of America (ASA), poster
