===================================================
FestCat: Speech Synthesis in Catalan using Festival
http://www.talp.upc.edu/festcat

Antonio Bonafonte
TALP Research Center
Barcelona, November 2007
===================================================

1. WHAT?
2. WHO?
3. TERMS AND CONDITIONS
4. REQUIREMENTS
5. INSTALLATION
6. EXECUTION
7. THANKS

===================================================

1. WHAT?

"Festival parla català" ("Festival speaks Catalan")

The FestCat package consists of a library that analyzes Catalan text
and the data needed to extend Festival so that it can speak Catalan.

This project was originally developed by the TALP Center,
Universitat Politecnica de Catalunya, Barcelona.

    http://www.talp.upc.edu/festcat

There are two components:

(1) Linguistic data and code to extend Festival for Catalan:
    dictionaries, tokenizer, LTS rules, POS tagger data, etc.
    This includes two folders:

        dicts/upc      (mostly dictionaries)
        upc_catalan    (mostly code)

(2) Voices: speaker-dependent data.
    There is one folder for each voice:

        voices/catalan/upc_ca_'speaker-name'

Several voices have already been developed. Check the web page to
get the latest downloads.


2. WHO?

Most of the code and data has been developed specifically for this
project by the TALP Research Center at UPC:

    www.talp.upc.edu/festcat
    www.talp.upc.edu
    www.upc.edu

The dictionaries are a significant exception. Their main source is
the Catalan lexicon provided by the FreeLing project, also developed
by the TALP Research Center and others. Please visit the FreeLing web
site for more information:

    http://garraf.epsevg.upc.es/freeling/

The lexicon has been enriched in the following ways:

- Phonetic transcriptions have been generated automatically using
  the TALP phonetic transcription toolkit.

- New word forms have been added, using frequent words found in our
  corpus and words found in our 'speech' data, to ensure better
  coverage when designing the voices.


3. TERMS AND CONDITIONS

All the code and linguistic resources are provided under the LGPL
license (see the COPYING file).


4. REQUIREMENTS

You need a working Festival system. Check your Linux distribution or
the Festival home page:

    http://www.cstr.ed.ac.uk/projects/festival/

We have been working with version 2.1 (November 2010). To check your
version, execute:

    $ festival --version


5. INSTALLATION

We have developed several Catalan voices. All of them share a common,
language-related library. Therefore, you need the 'base' package plus
the specific voices you are interested in.

You just need to copy several folders into the datadir of Festival.
To find this directory, you can execute:

    $ festival -b '(print datadir)'

If this directory is not defined, use the 'libdir' directory instead:

    $ festival -b '(print libdir)'

* COMMON PACKAGE *

Download the file upc_ca_base.tgz and extract the files:

    $ tar -zxf upc_ca_base.tgz

Move the extracted files to the Festival 'datadir':

a) Dictionaries:
   Copy the folder dicts/upc to 'datadir'/dicts/upc

b) Catalan tokenizer, tagger, etc.:
   Copy the folder upc_catalan to 'datadir'/upc_catalan

c) If you want Festival to understand the --language option and to
   export the Catalan speakers to other applications, you need to
   update the languages.scm file to add Catalan.
   We provide this file:

       languages.scm => 'datadir'/languages.scm

* VOICE SPECIFIC PACKAGES *

Download the file of each voice (check the web for updates,
http://www.talp.upc.edu/festcat ) and extract its content. Example:

    $ tar -zxf upc_ca_ona_hts.tgz

d) Copy each Catalan voice, e.g. upc_ca_ona_hts, into the voices
   directory. Example:

       upc_ca_ona_hts => 'datadir'/voices/catalan/upc_ca_ona_hts


6. EXECUTION

There are several front-ends that can be used with Festival, such as
gnopernicus or emacs-speak. Here we only describe the direct use of
Festival.

WARNING WARNING WARNING !!!

Festival expects ISO-8859-15 encoding.
Be sure that you use this encoding in your terminal or files.
If your system uses UTF-8 (as many distributions do today), you need
to convert the file before reading it.
Some front-ends, such as gnopernicus, do the conversion for you.
You can use the "save as" options in gedit, or use a conversion
program such as iconv:

    $ iconv -f utf8 -t ISO-8859-15//TRANSLIT myfile_utf8.text > myfile_latin1.text

!!!

* A quick test:

    $ echo "Bon dia, Catalunya" | festival --tts --language catalan

* You can also run Festival interactively:

    $ festival
    (language_catalan)
    (intro-catalan)
    (SayText "Bon dia, Catalunya.")
    (SayText "Bona nit.")
    (exit)

  If you want to specify the speaker, use the speaker selection
  command instead of the language selection command. The same command
  can also be used to change the speaker during a session:

    (voice_upc_ca_ona_hts)
    (SayText "I tu, qui ets?")
    (voice_upc_ca_pau_hts)
    (SayText "Jo sóc, el que tu ets, i si et faig mal, em faig mal a mi mateix.")
    (voice_upc_ca_ona_hts)
    (SayText "Que maco. Això és de l'assemblea dels infants, oi?")
    (exit)

  Or, to read a text file, for example "bon_dia.txt":

    $ echo "Bon dia, Catalunya." > bon_dia.txt
    $ festival
    (language_catalan)
    (tts_file "bon_dia.txt")
    (exit)

* Or use the text2wave script to create a .wav file:

    $ text2wave -o bondia.wav -eval '(language_catalan)' bon_dia.txt

  If you want to specify the speaker:

    $ text2wave -o bondia.wav -eval '(voice_upc_ca_ona_hts)' bon_dia.txt


7. THANKS

This work has been supported by the Catalan Government:

    www.gencat.net

The project was promoted by several departments of the Catalan
Government:

- Departament d'Educació
- Secretaria de Telecomunicacions i Societat de la Informació
  del Departament de Presidència

and of the Universitat Politècnica de Catalunya (UPC):

- TALP Research Centre
- Càtedra d'Accessibilitat
- Càtedra de Programari Lliure

Read the THANKS file to see the list of people who have contributed
to this project.
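As a supplement to the INSTALLATION and EXECUTION sections above, the
manual steps could be wrapped in two small shell helpers. This is only
a sketch under stated assumptions: the function names install_base and
to_latin9 are hypothetical (they are not part of FestCat or Festival),
and the base package is assumed to extract to a directory that directly
contains dicts/upc and upc_catalan, as the copy instructions suggest.

```shell
#!/bin/sh
# Sketch only: install_base and to_latin9 are hypothetical helper
# names, not part of FestCat or Festival.

# Copy the common package into the Festival data directory.
#   $1: directory where upc_ca_base.tgz was extracted (assumed to
#       contain dicts/upc and upc_catalan)
#   $2: Festival datadir (or libdir), as printed by festival
install_base() {
    src=$1
    dest=$2
    mkdir -p "$dest/dicts" || return 1
    # Step a) dictionaries -> 'datadir'/dicts/upc
    cp -R "$src/dicts/upc" "$dest/dicts/" || return 1
    # Step b) tokenizer, tagger, etc. -> 'datadir'/upc_catalan
    cp -R "$src/upc_catalan" "$dest/" || return 1
}

# Convert a UTF-8 text file to the ISO-8859-15 encoding Festival expects.
#   $1: UTF-8 input file
#   $2: ISO-8859-15 output file
to_latin9() {
    iconv -f UTF-8 -t ISO-8859-15//TRANSLIT "$1" > "$2"
}
```

With the helpers loaded, a converted file can then be passed to the
commands already shown in section 6, for example:

    $ to_latin9 bon_dia_utf8.txt bon_dia.txt
    $ text2wave -o bondia.wav -eval '(voice_upc_ca_ona_hts)' bon_dia.txt

Copying into the Festival data directory may require administrator
permissions, depending on where Festival is installed.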