
Compact, open-source, software speech synthesizer

eSpeakNG
Original author(s): Jonathan Duddington
Developer(s): Alexander Epaneshnikov et al.
Initial release: February 2006
Stable release: 1.51[1] / 2 April 2022
Repository: github.com/espeak-ng/espeak-ng
Written in: C
Operating systems: Linux, Windows, macOS, FreeBSD
Type: Speech synthesizer
License: GPLv3
Website: github.com/espeak-ng/espeak-ng

eSpeak is a free and open-source, cross-platform, compact software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG (Next Generation) is a continuation of the original developer’s project with more feedback from native speakers.

Because of its small size and many languages, eSpeakNG is included in the NVDA[1] open-source screen reader for Windows, as well as on Android,[2] Ubuntu[3] and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016[4] and was used by Google Translate for 27 languages in 2010;[5] 17 of these were subsequently replaced by proprietary voices.[6]

The quality of the language voices varies greatly. In eSpeakNG’s predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia.[7] Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English.[8] On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007.[9] Development on Speak continued until version 1.14, when it was renamed to eSpeak.

Development of eSpeak continued from 1.16 (there was no 1.15 release)[9] with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. The 1.24.02 version of eSpeak was the first version to be version-controlled using Subversion,[10] with separate source and binary downloads made available on SourceForge.[9] From eSpeak 1.27, eSpeak was updated to use the GPLv3 license.[10] The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS.[11] The last development release of eSpeak was 1.48.15 on 16 April 2015.[12]

eSpeak uses the Usenet scheme to represent phonemes with ASCII characters.[13]

On 25 June 2010,[14] Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.

On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak.[15][16]

On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the eight months since the last eSpeak development release. This evolved into discussions of continuing development of eSpeak in Jonathan’s absence.[17][18] The result was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.

On 11 December 2015, the espeak-ng fork was started.[19] The first release of espeak-ng was 1.49.0 on 10 September 2016,[20] containing significant code cleanup, bug fixes, and language updates.

eSpeakNG can be used as a command-line program, or as a shared library.
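When used as a shared library, eSpeakNG exposes a C API declared in espeak-ng/speak_lib.h. The following minimal sketch (assuming the espeak-ng development headers are installed; link with -lespeak-ng) speaks a string through that API:

    /* minimal sketch: speak a string through the eSpeakNG library */
    #include <string.h>
    #include <espeak-ng/speak_lib.h>

    int main(void)
    {
        const char *text = "Hello world";
        /* initialise for direct audio playback; returns the sample rate, or -1 on error */
        if (espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0) < 0)
            return 1;
        espeak_SetVoiceByName("en");                /* select the English voice */
        espeak_Synth(text, strlen(text) + 1, 0, POS_CHARACTER, 0,
                     espeakCHARS_AUTO, NULL, NULL); /* queue the text for synthesis */
        espeak_Synchronize();                       /* block until speech has finished */
        espeak_Terminate();
        return 0;
    }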

It supports Speech Synthesis Markup Language (SSML).
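From the command line, SSML handling is enabled with the -m option, which interprets the markup instead of reading the tags aloud. An illustrative invocation (the supported SSML subset varies by version):

    espeak-ng -m '<speak>Hello <break time="500ms"/> world</speak>'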

Language voices are identified by the language’s ISO 639-1 code. They can be modified by “voice variants”. These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, “af” is the Afrikaans voice. “af+f2” is the Afrikaans voice modified with the “f2” voice variant which changes the formants and the pitch range to give a female sound.
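From the command line a voice is selected with the -v option, for example:

    espeak-ng -v af "hallo"        # default Afrikaans voice
    espeak-ng -v af+f2 "hallo"     # the same voice with the f2 female variant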

eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.

Phonetic representations can be included within text input by enclosing them in double square brackets. For example: espeak-ng -v en "Hello [[w3:ld]]" will say "Hello world" in English.

[Audio sample: "eSpeakNG intro", spoken by eSpeakNG in English; duration 7 seconds.]

eSpeakNG can be used as a text-to-speech translator in different ways, depending on which text-to-speech translation steps the user wants to use.

Step 1: text-to-phoneme translation

There are many languages (notably English) which do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.

  1. Input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into zi@r0ks for pronunciation), as illustrated after this list.
  2. Pronunciation phonemes are synthesized into sound (e.g. zi@r0ks is voiced as zi@r0ks, in a monotone way).
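The first step can be observed on its own from the command line: the -x option writes the phoneme mnemonics to stdout, and -q suppresses the audio output. For example:

    espeak-ng -q -x "xerox"        # prints the phoneme string instead of speaking it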

To add intonation to speech, prosody data are necessary (e.g. stress of syllables, falling or rising pitch of the fundamental frequency, pauses, etc.), together with other information that allows the synthesis of more human, non-monotonous speech. For example, in eSpeakNG format a stressed syllable is marked with an apostrophe: z'i@r0ks, which produces more natural, intonated speech.

For comparison, two samples without and with prosody data:

  1. DIs Iz m0noUntoUn spi:tS is spoken in a monotone way.
  2. DIs Iz 'Int@n,eItI2d sp'i:tS is spoken in an intonated way.

If eSpeakNG is used only to generate prosody data, that prosody data can be used as input for MBROLA diphone voices.
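For example, from the command line (an illustrative invocation, assuming an MBROLA voice such as mb-en1 is installed), the --pho option writes the MBROLA phoneme data to stdout instead of speaking:

    espeak-ng -v mb-en1 -q --pho "Hello world"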

Step 2: sound synthesis from prosody data

eSpeakNG provides two different types of formant speech synthesis, using two different approaches: its own eSpeakNG synthesizer and a Klatt synthesizer:[21]

  1. The eSpeakNG synthesizer creates voiced speech sounds such as vowels and sonorant consonants by additive synthesis, adding together sine waves to make the total sound (a toy sketch follows this list). Unvoiced consonants such as /s/ are made by playing recorded sounds,[22] because they are rich in harmonics, which makes additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it also produces sounds by subtractive synthesis: starting from generated noise, which is rich in harmonics, it applies digital filters and enveloping to shape the frequency spectrum and sound envelope needed for the particular consonant (s, t, k) or sonorant (l, m, n) sound.
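The additive approach can be illustrated with a toy C program (a sketch only, not eSpeakNG's actual code; the formant frequencies, bandwidth, fundamental pitch, and amplitudes are assumed values for a vowel roughly like /a/). It sums sine waves at the harmonics of a fundamental, weighting each harmonic by how close it lies to a formant peak:

    /* toy additive synthesis: a static vowel-like sound, not eSpeakNG's algorithm */
    #include <math.h>
    #include <stdio.h>
    #include <stdint.h>

    #define RATE 22050
    static const double TWO_PI = 6.283185307179586;

    /* weight of one harmonic: bell curves centred on assumed formant peaks of /a/ */
    static double weight(double f)
    {
        const double formant[3] = { 700.0, 1200.0, 2600.0 };
        double w = 0.0;
        for (int k = 0; k < 3; k++) {
            double d = (f - formant[k]) / 150.0;   /* assumed formant bandwidth */
            w += exp(-d * d);
        }
        return w;
    }

    int main(void)
    {
        const double f0 = 130.0;                   /* assumed fundamental pitch, Hz */
        for (int i = 0; i < RATE / 2; i++) {       /* half a second of audio */
            double t = (double)i / RATE, s = 0.0;
            for (int h = 1; h * f0 < 3000.0; h++)  /* add together the sine waves */
                s += weight(h * f0) * sin(TWO_PI * h * f0 * t);
            int16_t sample = (int16_t)(s * 4000.0);
            fwrite(&sample, sizeof sample, 1, stdout);  /* raw 16-bit mono PCM */
        }
        return 0;
    }

The raw output can be auditioned with, for example, aplay -f S16_LE -r 22050 -c 1 on Linux.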

For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours, passes this to the MBROLA program using the PHO file format, and captures the audio that MBROLA outputs. That audio is then handled by eSpeakNG.
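A PHO file is plain text: each line names a phoneme, gives its duration in milliseconds, and optionally lists (position %, pitch Hz) pairs describing the pitch contour across the phoneme. An illustrative fragment with hypothetical values for the word "hello":

    ; phoneme  duration(ms)  (position %, pitch Hz) pairs
    _    50
    h    60
    @    80   50 120
    l    70
    @U   180  20 130  80 110
    _    50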

eSpeakNG performs text-to-speech synthesis for the following languages:[23]

  1. Afrikaans[24]
  2. Albanian[25]
  3. Amharic
  4. Ancient Greek
  5. Arabic[note 1]
  6. Aragonese[26]
  7. Armenian (Eastern Armenian)
  8. Armenian (Western Armenian)
  9. Assamese
  10. Azerbaijani
  11. Bashkir
  12. Basque
  13. Belarusian
  14. Bengali
  15. Bishnupriya Manipuri
  16. Bosnian
  17. Bulgarian[26]
  18. Burmese
  19. Cantonese[26]
  20. Catalan[26]
  21. Cherokee
  22. Chinese (Mandarin)
  23. Croatian[26]
  24. Czech
  25. Chuvash
  26. Danish[26]
  27. Dutch[26]
  28. English (American)[26]
  29. English (British)
  30. English (Caribbean)
  31. English (Lancastrian)
  32. English (New York City)[note 5]
  33. English (Received Pronunciation)
  34. English (Scottish)
  35. English (West Midlands)
  36. Esperanto[26]
  37. Estonian[26]
  38. Finnish[26]
  39. French (Belgian)[26]
  40. French (Canada)
  41. French (France)
  42. Georgian[26]
  43. German[26]
  44. Greek (Modern)[26]
  45. Greenlandic
  46. Guarani
  47. Gujarati
  48. Hakka Chinese[note 3]
  49. Haitian Creole
  50. Hawaiian
  51. Hebrew
  52. Hindi[26]
  53. Hungarian[26]
  54. Icelandic[26]
  55. Indonesian[26]
  56. Ido
  57. Interlingua
  58. Irish[26]
  59. Italian[26]
  60. Japanese[note 4][27]
  61. Kannada[26]
  62. Kazakh
  63. Klingon
  64. Kʼicheʼ
  65. Konkani[28]
  66. Korean
  67. Kurdish[26]
  68. Kyrgyz
  69. Quechua
  70. Latin
  71. Latgalian
  72. Latvian[26]
  73. Lingua Franca Nova
  74. Lithuanian
  75. Lojban[26]
  76. Luxembourgish
  77. Macedonian
  78. Malay[26]
  79. Malayalam[26]
  80. Maltese
  81. Manipuri
  82. Māori
  83. Marathi[26]
  84. Nahuatl (Classical)
  85. Nepali[26]
  86. Norwegian (Bokmål)[26]
  87. Nogai
  88. Oromo
  89. Papiamento
  90. Persian[26]
  91. Persian (Latin alphabet)[note 2]
  92. Polish[26]
  93. Portuguese (Brazilian)[26]
  94. Portuguese (Portugal)
  95. Punjabi[29]
  96. Pyash (a constructed language)
  97. Quenya
  98. Romanian[26]
  99. Russian[26]
  100. Russian (Latvia)
  101. Scottish Gaelic
  102. Serbian[26]
  103. Setswana
  104. Shan (Tai Yai)
  105. Sindarin
  106. Sindhi
  107. Sinhala
  108. Slovak[26]
  109. Slovenian
  110. Spanish (Spain)[26]
  111. Spanish (Latin American)
  112. Swahili[24]
  113. Swedish[26]
  114. Tamil[26]
  115. Tatar
  116. Telugu
  117. Thai
  118. Turkmen
  119. Turkish[26]
  120. Uyghur
  121. Ukrainian
  122. Urarina
  123. Urdu
  124. Uzbek
  125. Vietnamese (Central Vietnamese)[26]
  126. Vietnamese (Northern Vietnamese)
  127. Vietnamese (Southern Vietnamese)
  128. Welsh

Notes

  1. Currently, only fully diacritized Arabic is supported.
  2. Persian written using English (Latin) characters.
  3. Currently, only Pha̍k-fa-sṳ is supported.
  4. Currently, only Hiragana and Katakana are supported.
  5. Currently unreleased; it must be built from the latest source code.

Footnotes

  1. "Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/nvda". GitHub.
  2. "eSpeak TTS for Android".
  3. "espeak-ng package : Ubuntu". Launchpad. 21 December 2023.
  4. "Download voices for Immersive Reader, Read Mode, and Read Aloud".
  5. Google blog, "Giving a voice to more languages on Google Translate", May 2010.
  6. Google blog, "Listen to us now", December 2010.
  7. "eSpeak Speech Synthesizer". espeak.sourceforge.net.
  8. "eSpeak: Speech Synthesizer". espeak.sourceforge.net.
  9. "eSpeak: Speech synthesis - Browse /espeak at SourceForge.net".
  10. "eSpeak: speech synthesis / Code / Browse Commits". sourceforge.net.
  11. "eSpeak: Downloads".
  12. http://espeak.sourceforge.net/test/latest.html
  13. van Leussen, Jan-Wilem; Tromp, Maarten (26 July 2007). "Latin to Speech". p. 6. CiteSeerX 10.1.1.396.7811.
  14. "Build: Allow portaudio 18 and 19 to be switched easily. · rhdunn/espeak@63daaec". GitHub.
  15. "espeakedit: Fix argument processing for unicode argv types · rhdunn/espeak@61522a1". GitHub.
  16. "Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/nvda". GitHub.
  17. "[Espeak-general] Taking ownership of the espeak project and its future | eSpeak: speech synthesis". sourceforge.net.
  18. "[Espeak-general] Vote for new main espeak developer | eSpeak: speech synthesis". sourceforge.net.
  19. "Rebrand the espeak program to espeak-ng".
  20. "Release 1.49.0 · espeak-ng/espeak-ng". GitHub.
  21. Klatt, Dennis H. (1980). "Software for a cascade/parallel formant synthesizer" (PDF). Journal of the Acoustical Society of America, 67(3), March 1980.
  22. "espeak-ng". GitHub.
  23. "eSpeak NG Text-to-Speech". GitHub. 13 February 2022.
  24. Butgereit, L., & Botha, A. (2009, May). "Hadeda: The noisy way to practice spelling vocabulary using a cell phone". In The IST-Africa 2009 Conference, Kampala, Uganda.
  25. Hamiti, M., & Kastrati, R. (2014). "Adapting eSpeak for converting text into speech in Albanian". International Journal of Computer Science Issues (IJCSI), 11(4), 21.
  26. Kayte, S., & Gawali, D. B. (2015). "Marathi Speech Synthesis: A review". International Journal on Recent and Innovation Trends in Computing and Communication, 3(6), 3708-3711.
  27. Pronk, R. (2013). "Adding Japanese language synthesis support to the eSpeak system". University of Amsterdam.
  28. Mohanan, S., Salkar, S., Naik, G., Dessai, N. F., & Naik, S. (2012). "Text Reader for Konkani Language". Automation and Autonomous System, 4(8), 409-414.
  29. Kaur, R., & Sharma, D. (2016). "An Improved System for Converting Text into Speech for Punjabi Language using eSpeak". International Research Journal of Engineering and Technology, 3(4), 500-504.