<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
<article>
<!-- Header -->
<articleinfo>
<title>Speech Recognition HOWTO</title>
<author>
<firstname>Stephen</firstname>
<surname>Cook</surname>
<affiliation>
<address>
<email>scook@gear21.com</email>
</address>
</affiliation>
</author>
<revhistory>
<revision>
<revnumber>v2.0</revnumber>
<date>April 19, 2002</date>
<authorinitials>scc</authorinitials>
<revremark>
Changed license information (now GFDL) and added a new publication.
</revremark>
</revision>
<revision>
<revnumber>v1.2</revnumber>
<date>February 5, 2002</date>
<authorinitials>scc</authorinitials>
<revremark>
Added more commercial software listings (sent by Mayur Patel).
</revremark>
</revision>
<revision>
<revnumber>v1.1</revnumber>
<date>October 5, 2001</date>
<authorinitials>scc</authorinitials>
<revremark>
Added info for Vocalis Speechware. Fixed/Updated various other items.
</revremark>
</revision>
<revision>
<revnumber>v1.0</revnumber>
<date>November 20, 2000</date>
<authorinitials>scc</authorinitials>
<revremark>
Added info on L and H and HTK
</revremark>
</revision>
<revision>
<revnumber>v0.5</revnumber>
<date>September 13, 2000</date>
<authorinitials>scc</authorinitials>
<revremark>
Initial HOWTO Submission
</revremark>
</revision>
</revhistory>
<abstract>
<indexterm>
<primary>Speech Recognition</primary>
</indexterm>
<para>
Automatic Speech Recognition (ASR) on Linux is becoming easier.
Several packages are available for users as well as developers.
This document describes the basics of speech recognition and
describes some of the available software.
</para>
</abstract>
</articleinfo>
<!-- Section1: legal -->
<sect1 id="legal">
<title>Legal Notices</title>
<!-- Section2: copyright -->
<sect2 id="copyright">
<title>Copyright/License</title>
<para>
Copyright (c) 2000-2002 Stephen C. Cook.
Permission is granted to copy, distribute, and/or modify this document under the
terms of the GNU Free Documentation License, Version 1.1 or any later version published
by the Free Software Foundation.
</para>
<para>
This document is made available under the terms of the <ulink url="http://www.gnu.org/copyleft/fdl.html">
GNU Free Documentation License (GFDL)</ulink>, which is hereby
incorporated by reference.
</para>
</sect2>
<!-- Section2: disclaimer -->
<sect2 id="disclaimer">
<title>Disclaimer</title>
<para>
The author disclaims all warranties with regard to this document,
including all implied warranties of merchantability and fitness for a
certain purpose; in no event shall the author be liable for any
special, indirect or consequential damages or any damages whatsoever
resulting from loss of use, data or profits, whether in an action of
contract, negligence or other tortious action, arising out of or in
connection with the use of this document.
</para>
</sect2>
<!-- Section2: trademarks -->
<sect2 id="trademarks">
<title>Trademarks</title>
<para>
All trademarks contained in this document are the property
of their respective owners.
</para>
</sect2>
</sect1>
<!-- Section1: Forward -->
<sect1 id="forward">
<title>Foreword</title>
<!-- Section2: copyright -->
<sect2 id="about">
<title>About This Document</title>
<para>
This document is targeted at the beginner to intermediate level Linux
user interested in learning about Speech Recognition and trying it out.
It may also help the interested developer in explaining the basics of
speech recognition programming.
</para>
<para>
I started this document when I began researching what speech
recognition software and development libraries were available for Linux.
Automated Speech Recognition (ASR or just SR) on Linux is just starting
to come into its own, and I hope this document gives it a push in the
right direction - by supporting both users and developers of ASR
technology.
</para>
<para>
I have left a variety of SR techniques out of this document, and
instead I have focused on the "HOWTO" aspect (since this is a howto...).
I have included a Publications section so the interested reader can
find books and articles on anything not covered here. This is not
meant to be a definitive statement of ASR on Linux.
</para>
<para>
For the most recent version of this document, check the LDP archive,
or go to:
<ulink url="http://www.gear21.com/speech/index.html">
http://www.gear21.com/speech/index.html</ulink>.
</para>
</sect2>
<!-- Section2: copyright -->
<sect2 id="acknowledgements">
<title>Acknowledgements</title>
<para>
I would like to thank the following people for the help, reviewing,
and support of this document:
</para>
<para>
<itemizedlist>
<listitem><para>
Jessica Perry Hekman
</para></listitem>
<listitem><para>
Geoff Wexler
</para></listitem>
</itemizedlist>
</para>
</sect2>
<!-- Section2: comments -->
<sect2 id="comments">
<title>Comments/Updates/Feedback</title>
<para>
If you have any comments, suggestions, revisions, updates, or just
want to chat about ASR, please send an email to me at
<ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>.
</para>
</sect2>
<!-- Section2: todo -->
<sect2 id="todo">
<title>ToDo</title>
<para>
The following things are left "to do":
</para>
<para>
<itemizedlist>
<listitem><para>
Add descriptions in the Publications section.
</para></listitem>
<listitem><para>
Add more books to the Publications section.
</para></listitem>
<listitem><para>
Add more links with descriptions.
</para></listitem>
<listitem><para>
Enhance the description of the ASR system steps.
</para></listitem>
<listitem><para>
Include descriptions of FFTs and Filters.
</para></listitem>
<listitem><para>
Include descriptions of DSP principles.
</para></listitem>
</itemizedlist>
</para>
</sect2>
<!-- Section2: todo -->
<sect2 id="revision">
<title>Revision History</title>
<para>
v0.1 first rough draft - August 2000
</para>
<para>
v0.5 final draft - September 2000
</para>
</sect2>
</sect1>
<!-- Section1: Introduction -->
<sect1 id="introduction">
<title>Introduction</title>
<!-- Section2: copyright -->
<sect2 id="basics">
<title>Speech Recognition Basics</title>
<para>
Speech recognition is the process by which a computer (or
other type of machine) identifies spoken words. Basically, it means
talking to your computer, AND having it correctly recognize what you
are saying.
</para>
<para>
The following definitions are the basics needed for understanding
speech recognition technology.
</para>
<para>
<variablelist>
<varlistentry>
<term>Utterance</term>
<listitem>
<para>
An utterance is the vocalization (speaking) of a word or words that
represent a single meaning to the computer. Utterances can be a
single word, a few words, a sentence, or even multiple sentences.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Speaker Dependence</term>
<listitem>
<para>
Speaker dependent systems are designed around a specific speaker.
They generally are more accurate for the correct speaker, but much
less accurate for other speakers. They assume the speaker will
speak in a consistent voice and tempo. Speaker independent systems
are designed for a variety of speakers. Adaptive systems usually start
as speaker independent systems and utilize training techniques to
adapt to the speaker to increase their recognition accuracy.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Vocabularies</term>
<listitem>
<para>
Vocabularies (or dictionaries) are lists of words or utterances that
can be recognized by the SR system. Generally, smaller vocabularies
are easier for a computer to recognize, while larger vocabularies
are more difficult. Unlike normal dictionaries, each entry doesn't
have to be a single word. They can be as long as a sentence or two.
Smaller vocabularies can have as few as 1 or 2 recognized utterances
(e.g. "Wake Up"), while very large vocabularies can have a hundred
thousand or more!
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Accuracy</term>
<listitem>
<para>
The ability of a recognizer can be examined by measuring its
accuracy - or how well it recognizes utterances. This includes not
only correctly identifying an utterance but also identifying if the
spoken utterance is not in its vocabulary. Good ASR systems have an
accuracy of 98% or more! The acceptable accuracy of a system
really depends on the application.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Training</term>
<listitem>
<para>
Some speech recognizers have the ability to adapt to a speaker.
When the system has this ability, it may allow training to take
place. An ASR system is trained by having the speaker repeat
standard or common phrases and adjusting its comparison algorithms
to match that particular speaker. Training a recognizer usually
improves its accuracy.
</para>
<para>
Training can also be used by speakers that have difficulty
speaking, or pronouncing certain words. As long as the speaker
can consistently repeat an utterance, ASR systems with training
should be able to adapt.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
</sect2>
<!-- Section2: types -->
<sect2 id="types">
<title>Types of Speech Recognition</title>
<para>
Speech recognition systems can be separated in several different
classes by describing what types of utterances they have the ability
to recognize. These classes are based on the fact that one of the
difficulties of ASR is the ability to determine when a speaker starts
and finishes an utterance. Most packages can fit into more than one
class, depending on which mode they're using.
</para>
<para>
<variablelist>
<varlistentry>
<term>Isolated Words</term>
<listitem>
<para>
Isolated word recognizers usually require each utterance to have
quiet (lack of an audio signal) on BOTH sides of the sample window.
It doesn't mean that it accepts single words, but does require
a single utterance at a time. Often, these systems have
"Listen/Not-Listen" states, where they require the speaker to wait
between utterances (usually doing processing during the pauses).
Isolated Utterance might be a better name for this class.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Connected Words</term>
<listitem>
<para>
Connected word systems (or more correctly 'connected utterances')
are similar to Isolated words, but allow separate utterances to be
'run-together' with a minimal pause between them.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Continuous Speech</term>
<listitem>
<para>
Continuous recognition is the next step. Recognizers with continuous
speech capabilities are some of the most difficult to create because
they must utilize special methods to determine utterance boundaries.
Continuous speech recognizers allow users to speak almost naturally,
while the computer determines the content. Basically, it's computer
dictation.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Spontaneous Speech</term>
<listitem>
<para>
There appears to be a variety of definitions for what spontaneous
speech actually is. At a basic level, it can be thought of as
speech that is natural sounding and not rehearsed. An ASR system
with spontaneous speech ability should be able to handle a variety
of natural speech features such as words being run together, "ums"
and "ahs", and even slight stutters.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Voice Verification/Identification</term>
<listitem>
<para>
Some ASR systems have the ability to identify specific users. This
document doesn't cover verification or security systems.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
</sect2>
<!-- Section2: uses -->
<sect2 id="uses">
<title>Uses and Applications</title>
<para>
Although any task that involves interfacing with a computer can
potentially use ASR, the following applications are the most
common right now.
</para>
<para>
<variablelist>
<varlistentry>
<term>Dictation</term>
<listitem>
<para>
Dictation is the most common use for ASR systems today. This
includes medical transcriptions, legal and business dictation, as
well as general word processing. In some cases special vocabularies
are used to increase the accuracy of the system.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Command and Control</term>
<listitem>
<para>
ASR systems that are designed to perform functions and actions on the
system are defined as Command and Control systems. Utterances like
"Open Netscape" and "Start a new xterm" will do just that.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Telephony</term>
<listitem>
<para>
Some PBX/Voice Mail systems allow callers to speak commands instead of
pressing buttons to send specific tones.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Wearables</term>
<listitem>
<para>
Because inputs are limited for wearable devices, speaking is a
natural possibility.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Medical/Disabilities</term>
<listitem>
<para>
Many people have difficulty typing due to physical limitations such
as repetitive strain injuries (RSI), muscular dystrophy, and
many others. For example, people with difficulty hearing could use
a system connected to their telephone to convert the caller's speech
to text.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Embedded Applications</term>
<listitem>
<para>
Some newer cellular phones include C&amp;C speech recognition that allows
utterances such as "Call Home". This could be a major factor in the
future of ASR and Linux. Why can't I talk to my television yet?
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
</sect2>
</sect1>
<!-- Section1: Hardware -->
<sect1 id="hardware">
<title>Hardware</title>
<!-- Section2: soundcards -->
<sect2 id="soundcards">
<title>Sound Cards</title>
<para>
Because speech requires a relatively low bandwidth, just about any
medium-high quality 16 bit sound card will get the job done. You must
have sound enabled in your kernel, and you must have correct drivers
installed. For more information on sound cards, please see "The Linux
Sound HOWTO" available at: http://www.LinuxDoc.org/. Sound card
quality often sparks heated discussion about its impact on accuracy
and noise.
</para>
<para>
Sound cards with the 'cleanest' A/D (analog to digital) conversions
are recommended, but most often the clarity of the digital sample is
more dependent on the microphone quality and even more dependent on the
environmental noise. Electrical "noise" from monitors, pci slots,
hard-drives, etc. are usually nothing compared to audible noise
from the computer fans, squeaking chairs, or heavy breathing.
</para>
<para>
Some ASR software packages may require a specific sound card. It's
usually a good idea to stay away from specific hardware requirements,
because it limits many of your possible future options and decisions.
You'll have to weigh the benefits and costs if you are considering
packages that require specific hardware to function properly.
</para>
</sect2>
<!-- Section2: Microphones -->
<sect2 id="microphones">
<title>Microphones</title>
<para>
A quality microphone is key when utilizing ASR. In most cases, a
desktop microphone just won't do the job. They tend to pick up more
ambient noise that gives ASR programs a hard time.
</para>
<para>
Hand held microphones are also not the best choice as they can be
cumbersome to pick up all the time. While they do limit the amount
of ambient noise, they are most useful in applications that require
changing speakers often, or when speaking to the recognizer isn't
done frequently (when wearing a headset isn't an option).
</para>
<para>
The best choice, and by far the most common is the headset style.
It allows the ambient noise to be minimized, while allowing you to
have the microphone at the tip of your tongue all the time. Headsets
are available without earphones and with earphones (mono or stereo).
I recommend the stereo headphones, but it's just a matter of personal
taste.
</para>
<para>
You can get excellent quality microphone headsets for between $25
and $100. A good place to start looking is http://www.headphones.com or
http://www.speechcontrol.com.
</para>
<para>
A quick note about levels: Don't forget to turn up your microphone
volume. This can be done with a program such as XMixer or OSS Mixer
and care should be used to avoid feedback noise. If the ASR software
includes auto-adjustment programs, use them instead, as they are
optimized for their particular recognition system.
</para>
</sect2>
<!-- Section2: computers -->
<sect2 id="computers">
<title>Computers/Processors</title>
<para>
ASR applications can be heavily dependent on processing speed. This
is because a large amount of digital filtering and signal processing
can take place in ASR.
</para>
<para>
As with just about any CPU-intensive software, the faster the better;
the same goes for memory. It's possible to do some SR with a 100MHz
processor and 16MB of RAM, but for fast processing (large dictionaries,
complex recognition schemes, or high sample rates), you should shoot for
a minimum of 400MHz and 128MB of RAM. Because of the processing required,
most software packages list their minimum requirements.
</para>
<para>
Using a cluster (Beowulf or otherwise) to perform massive recognition
efforts hasn't yet been undertaken. If you know of any project underway,
or in development please send me a note! <ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>
</para>
</sect2>
</sect1>
<!-- Section1: Software -->
<sect1 id="software">
<title>Speech Recognition Software</title>
<!-- Section2: Free Software -->
<sect2 id="freesoftware">
<title>Free Software</title>
<para>
Much of the free software listed here is available for download at:
http://sunsite.uio.no/pub/Linux/sound/apps/speech/
</para>
<!-- Section3: XVoice -->
<sect3 id="xvoice">
<title>XVoice</title>
<para>
XVoice is a dictation/continuous speech recognizer that can be used
with a variety of XWindow applications. It allows user-defined macros.
This is a fine program with a definite future. Once set up, it
performs with adequate accuracy.
</para>
<para>
XVoice requires that you download and install IBM's (free) ViaVoice
for Linux (See Commercial Section). It also requires the configuration
of ViaVoice to work correctly. Additionally, Lesstif/Motif (libXm) is
required. It is also important to note that because this program
interacts with X windows, you must leave X resources open on your
machine, so caution should be used if you use this on a networked or
multi-user machine.
</para>
<para>
This software is primarily for users. An RPM is available.
</para>
<para>
HomePage: http://www.compapp.dcu.ie/~tdoris/Xvoice/
http://www.zachary.com/creemer/xvoice.html
</para>
<para>
Project: http://xvoice.sourceforge.net
</para>
<para>
Community: http://www.onelist.com/community/xvoice
</para>
</sect3>
<!-- Section3: XVoice -->
<sect3 id="cvoicecontrol">
<title>CVoiceControl/kVoiceControl</title>
<para>
CVoiceControl (which stands for Console Voice Control) started its
life as KVoiceControl (KDE Voice Control). It is a basic speech
recognition system that allows a user to execute Linux commands by
using spoken commands. CVoiceControl replaces KVoiceControl.
</para>
<para>
The software includes a microphone level configuration utility,
a vocabulary "model editor" for adding new commands and utterances,
and the speech recognition system.
</para>
<para>
CVoiceControl is an excellent starting point for experienced users
looking to get started in ASR. It is not the most user friendly,
but once it has been trained correctly, it can be very helpful. Be
sure to read the documentation while setting up.
</para>
<para>
This software is primarily for users.
</para>
<para>
Homepage: http://www.kiecza.de/daniel/linux/index.html
</para>
<para>
Documents: http://www.kiecza.de/daniel/linux/cvoicecontrol/index.html
</para>
</sect3>
<!-- Section3: Open Mind Speech -->
<sect3 id="openmind">
<title>Open Mind Speech</title>
<para>
Started in late 1999, Open Mind Speech has changed names several times
(was VoiceControl, then SpeechInput, and then FreeSpeech), and is now
part of the "Open Mind Initiative". This is an open source project.
Currently it isn't completely operational and is primarily for developers.
</para>
<para>
This software is primarily for developers.
</para>
<para>
Homepage: http://freespeech.sourceforge.net
</para>
</sect3>
<!-- Section3: GVoice -->
<sect3 id="gvoice">
<title>GVoice</title>
<para>
GVoice is a speech ASR library that uses IBM's ViaVoice (free) SDK
to control Gtk/GNOME applications. It includes libraries for
initialization, recognition engine, vocabulary manipulation, and panel
control. Development on this has been idle for over a year.
</para>
<para>
This software is primarily for developers.
</para>
<para>
Homepage: http://www.cse.ogi.edu/~omega/gnome/gvoice/
</para>
</sect3>
<!-- Section3: ISIP -->
<sect3 id="isip">
<title>ISIP</title>
<para>
The Institute for Signal and Information Processing at Mississippi
State University has made its speech recognition engine available. The
toolkit includes a front-end, a decoder, and a training module. It's a
functional toolkit.
</para>
<para>
This software is primarily for developers.
</para>
<para>
The toolkit (and more information about ISIP) is available at:
http://www.isip.msstate.edu/project/speech/
</para>
</sect3>
<!-- Section3: ISIP -->
<sect3 id="sphinx">
<title>CMU Sphinx</title>
<para>
Sphinx originally started at CMU and has recently been released as
open source. This is a fairly large program that includes a lot of
tools and information. It is still "in development", but includes
trainers, recognizers, acoustic models, language models, and some
limited documentation.
</para>
<para>
This software is primarily for developers.
</para>
<para>
Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html
</para>
<para>
Source: http://download.sourceforge.net/cmusphinx/sphinx2-0.1a.tar.gz
</para>
</sect3>
<!-- Section3: ears -->
<sect3 id="ears">
<title>Ears</title>
<para>
Although Ears isn't fully developed, it is a good starting
point for programmers wishing to start in ASR.
</para>
<para>
This software is primarily for developers.
</para>
<para>
FTP site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/
</para>
</sect3>
<!-- Section3: NICO -->
<sect3 id="nico">
<title>NICO ANN Toolkit</title>
<para>
The NICO Artificial Neural Network toolkit is a flexible back
propagation neural network toolkit optimized for speech recognition
applications.
</para>
<para>
This software is primarily for developers.
</para>
<para>
Its homepage: http://www.speech.kth.se/NICO/index.html
</para>
</sect3>
<!-- Section3: Myers -->
<sect3 id="Myers">
<title>Myers' Hidden Markov Model Software</title>
<para>
This software by Richard Myers is HMM algorithms written in C++ code.
It provides an example and learning tool for HMM models described in
the L. Rabiner book "Fundamentals of Speech Recognition".
</para>
<para>
This software is primarily for developers.
</para>
<para>
Information is available at:
http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html
</para>
</sect3>
<!-- Section3: Jialong -->
<sect3 id="Jialong">
<title>Jialong He's Speech Recognition Research Tool</title>
<para>
Although not originally written for Linux, this research tool can be
compiled on Linux. It contains three different types of recognizers:
DTW, Dynamic Hidden Markov Model, and a Continuous Density Hidden
Markov Model. This is for research and development uses, as it is
not a fully functional ASR system. The toolkit contains some very
useful tools.
</para>
<para>
This software is primarily for developers.
</para>
<para>
More information is available at:
http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html
</para>
</sect3>
<!-- Section3: morefree -->
<sect3 id="morefree">
<title>More Free Software?</title>
<para>
If you know of free software that isn't included in the above list,
please send me a note at: <ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>. If you're in the mood,
you can also send me where to get a copy of the software, and any
impressions you may have about it. Thanks!
</para>
</sect3>
</sect2>
<!-- Section2: Commercial Software -->
<sect2 id="comsoftware">
<title>Commercial Software</title>
<!-- Section3: IBM ViaVoice -->
<sect3 id="viavoice">
<title>IBM ViaVoice</title>
<para>
IBM has made good on their promise to support Linux with their series
of ViaVoice products for Linux, though the future of their SDKs isn't
set in stone (their licensing agreement for developers isn't officially
released as of this date - more to come).
<para>
Their commercial (not-free) product, IBM ViaVoice Dictation for Linux
(available at http://www-4.ibm.com/software/speech/linux/dictation.html)
performs very well, but has some sizeable system requirements compared
to the more basic ASR systems (64M RAM and 233MHz Pentium). For the
$59.95US price tag you also get an Andrea NC-8 microphone. It also
allows multiple users (but I haven't tried it with multiple users, so
if anyone has any experience please give me a shout). The package
includes: documentation (PDF), Trainer, dictation system, and
installation scripts. Support for additional Linux Distributions based
on 2.2 kernels is also available in the latest release.
</para>
<para>
The ASR SDK is available for free, and includes IBM's SMAPI, grammar
API, documentation, and a variety of sample programs. The ViaVoice
Run Time Kit provides an ASR engine and data files for dictation
functions, and user utilities. The ViaVoice Command &amp; Control Run Time
Kit includes the ASR engine and data files for command and control
functions, and user utilities. The SDK and Kits require 128MB RAM and
a Linux 2.2 or later kernel.
</para>
<para>
The SDKs and Kits are available for free at:
http://www-4.ibm.com/software/speech/dev/sdk_linux.html
</para>
</sect3>
<!-- Section3: Vocalis Speechware -->
<sect3 id="vocalis">
<title>Vocalis Speechware</title>
<para>
More information on Vocalis and Vocalis Speechware is available at:
<ulink url="http://www.vocalisspeechware.com">
http://www.vocalisspeechware.com</ulink> and
<ulink url="http://www.vocalis.com">
http://www.vocalis.com</ulink>.
</para>
</sect3>
<!-- Section3: -->
<sect3 id="babeltech">
<title>Babel Technologies</title>
<para>
Babel Technologies has a Linux SDK available called Babear. It is a speaker-independent
system based on Hybrid Markov Models and Artificial Neural Networks technology. They also
have a variety of products for Text-to-speech, speaker verification, and phoneme analysis.
More information is available at: http://www.babeltech.com.
</para>
</sect3>
<!-- Section3: -->
<sect3 id="speechworks">
<title>SpeechWorks</title>
<para>
I didn't see anything on their website that specifically mentioned Linux, but their
"OpenSpeech Recognizer" uses VoiceXML, which is an open standard.
More information is available at: http://www.speechworks.com.
</para>
</sect3>
<!-- Section3: -->
<sect3 id="nuance">
<title>Nuance</title>
<para>
Nuance offers a speech recognition/natural language product (currently Nuance 8.0) for
a variety of *nix platforms. It can handle very large vocabularies and uses a unique
distributed architecture for scalability and fault tolerance.
More information is available at: http://www.nuance.com.
</para>
</sect3>
<!-- Section3: Abbot -->
<sect3 id="abbot">
<title>Abbot/AbbotDemo</title>
<para>
Abbot is a very large vocabulary, speaker independent ASR system.
It was originally developed by the Connectionist Speech Group at
Cambridge University. It was transferred (commercialized) to
SoftSound. More information is available at:
http://www.softsound.com.
</para>
<para>
AbbotDemo is a demonstration package of Abbot. This demo system
has a vocabulary of about 5000 words and uses the connectionist/HMM
continuous speech algorithm. This is a demonstration program with no
source code.
</para>
</sect3>
<!-- Section3: entropic -->
<sect3 id="entropic">
<title>Entropic</title>
<para>
The fine people over at Entropic have been bought out by Micro$oft...
Their products and support services have all but disappeared. Their
support for HTK and ESPS/waves+ is gone, and their future is in the
hands of M$. Their old website at http://www.entropic.com has more
information.
</para>
<para>
K.K. Chin advised me that the original developers of the HTK (the
Speech Vision and Robotic Group at Cambridge) are still
providing support for it. There is also a "free" version
available at: <ulink url="http://htk.eng.cam.ac.uk">http://htk.eng.cam.ac.uk</ulink>.
Also note that Microsoft still owns the copyright to the current
HTK code...
</para>
</sect3>
<!-- Section3: morecommercial -->
<sect3 id="morecom">
<title>More Commercial Products</title>
<para>
There are rumors of more commercial ASR products becoming available
in the near future (including L&amp;H). I talked with a couple of
L&amp;H representatives at Comdex 2000 (Vegas) and none of them could give
me any information on a Linux release, or even if they planned on releasing
any products for Linux. If you have any further information, please send
any details to me at <ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>.
</para>
</sect3>
</sect2>
</sect1>
<!-- Section1: Inside -->
<sect1 id="inside">
<title>Inside Speech Recognition</title>
<!-- Section2: recognizers -->
<sect2 id="recognizers">
<title>How Recognizers Work</title>
<para>
Recognition systems can be broken down into two main types. Pattern
Recognition systems compare patterns to known/trained patterns to
determine a match. Acoustic Phonetic systems use knowledge of the
human body (speech production, and hearing) to compare speech features
(phonetics such as vowel sounds). Most modern systems focus on the
pattern recognition approach because it combines nicely with current
computing techniques and tends to have higher accuracy.
</para>
<para>
Most recognizers can be broken down into the following steps:
</para>
<para>
<orderedlist>
<listitem><para>
Audio recording and Utterance detection
</para></listitem>
<listitem><para>
Pre-Filtering (pre-emphasis, normalization, banding, etc.)
</para></listitem>
<listitem><para>
Framing and Windowing (chopping the data into a usable format)
</para></listitem>
<listitem><para>
Filtering (further filtering of each window/frame/freq. band)
</para></listitem>
<listitem><para>
Comparison and Matching (recognizing the utterance)
</para></listitem>
<listitem><para>
Action (Perform function associated with the recognized pattern)
</para></listitem>
</orderedlist>
</para>
<para>
Although each step seems simple, each one can involve a multitude of
different (and sometimes completely opposite) techniques.
</para>
<para>
(1) Audio/Utterance Recording: can be accomplished in a number of ways.
Starting points can be found by comparing ambient audio levels (acoustic
energy in some cases) with the sample just recorded. Endpoint detection
is harder because speakers tend to leave "artifacts" including
breathing/sighing, teeth chatter, and echoes.
</para>
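<para>
The starting-point comparison described above can be sketched in a few
lines. This is an illustrative Python sketch, not code from any package
listed in this document; the frame size and energy threshold are
arbitrary example values.
</para>

```python
# Energy-based utterance detection (illustrative sketch).  Frames whose
# RMS energy rises above a threshold over the ambient level are marked
# as speech; the first and last such frames bound the utterance.

def rms(frame):
    """Root-mean-square energy of a list of samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def detect_speech(samples, frame_size=160, threshold=500.0):
    """Return (start, end) frame indices of the detected utterance,
    or None if no frame rises above the energy threshold."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    active = [i for i, f in enumerate(frames) if rms(f) > threshold]
    if not active:
        return None
    return active[0], active[-1]
```

<para>
Real recognizers refine this with adaptive thresholds and hangover
timers, precisely because of the breathing and echo artifacts noted
above.
</para>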
<para>
(2) Pre-Filtering: is accomplished in a variety of ways, depending on
other features of the recognition system. The most common methods are
the "Bank-of-Filters" method which utilizes a series of audio filters to
prepare the sample, and the Linear Predictive Coding method which uses
a prediction function to calculate differences (errors). Different
forms of spectral analysis are also used.
</para>
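<para>
One of the simplest pre-filtering operations is pre-emphasis, a
first-order filter that boosts high frequencies before analysis. The
sketch below is illustrative; the coefficient 0.95 is a typical
textbook value, not one mandated by any package here.
</para>

```python
# Pre-emphasis filter: y[n] = x[n] - a * x[n-1].
# Speech energy falls off at high frequencies, so this flattens the
# spectrum before later analysis stages.

def pre_emphasis(samples, a=0.95):
    out = [samples[0]]  # first sample passes through unchanged
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out
```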
<para>
(3) Framing/Windowing involves separating the sample data into
frames of a specific size. This is often rolled into step 2 or step 4.
This step also involves preparing the sample boundaries for analysis
(removing edge clicks, etc.)
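<para>
A common way to do this is to cut the signal into overlapping frames
and multiply each by a Hamming window, which tapers the frame edges
toward zero so the analysis doesn't see artificial clicks at the
boundaries. The frame and step sizes below are illustrative defaults,
not values from any specific recognizer.
</para>

```python
import math

def hamming(length):
    """Hamming window coefficients for a frame of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_and_window(samples, frame_size=256, step=128):
    """Cut samples into overlapping frames and apply a Hamming window."""
    window = hamming(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, step):
        frame = samples[start:start + frame_size]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```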
<para>
(4) Additional Filtering is not always present. It is the final
preparation for each window before comparison and matching. Often this
consists of time alignment and normalization.
</para>
<para>
There are a huge number of techniques available for (5), Comparison
and Matching. Most involve comparing the current window with known
samples. There are methods that use Hidden Markov Models (HMM),
frequency analysis, differential analysis, linear algebra
techniques/shortcuts, spectral distortion, and time distortion methods.
All of these methods are used to generate a probability and an
accuracy for the match.
</para>
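<para>
One of the time-distortion methods mentioned above, dynamic time
warping (DTW), can be sketched briefly. It scores how well two feature
sequences match while allowing one to stretch or compress in time
relative to the other. This simplified sketch compares plain numbers
rather than the real feature vectors a recognizer would use:
</para>
<programlisting><![CDATA[
```python
def dtw_distance(a, b):
    """Dynamic time warping: total cost of the best time-aligned
    match between two feature sequences (lists of numbers here).
    Lower cost means a closer match."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```
]]></programlisting>
<para>
A word spoken slightly slower than its stored template still scores a
perfect match, because the repeated frames align at zero cost.
</para>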
<para>
(6) Actions can be just about anything the developer wants. *GRIN*
</para>
</sect2>
<!-- Section2: digitalaudio -->
<sect2 id="digitalaudio">
<title>Digital Audio Basics</title>
<para>
Audio is inherently an analog phenomenon. Recording a digital sample
is done by converting the analog signal from the microphone to a
digital signal through the A/D converter in the sound card. When a
microphone is operating, sound waves vibrate the magnetic element in
the microphone, causing an electrical current to flow to the sound
card (think of a speaker working in reverse). Basically, the A/D
converter records the value of the electrical voltage at specific
intervals.
</para>
<para>
There are two important factors during this process. First is the
"sample rate", or how often to record the voltage values. Second is
the "bits per sample", or how accurately the value is recorded. A
third item is the number of channels (mono or stereo), but for most
ASR applications mono is sufficient. Most applications use pre-set
values for these parameters, and users shouldn't change them unless the
documentation suggests it. Developers should experiment with different
values to determine what works best with their algorithms.
</para>
<para>
So what is a good sample rate for ASR? Because speech is relatively
low bandwidth (mostly between 100Hz-8kHz), 8000 samples/sec (8kHz) is
sufficient for most basic ASR. However, some people prefer 16000
samples/sec (16kHz) because it provides more accurate high-frequency
information. If you have the processing power, use 16kHz. For most
ASR applications, sampling rates higher than about 22kHz are a waste.
</para>
<para>
And what is a good value for "bits per sample"? 8 bits per sample
will record values between 0 and 255, which means the microphone
element's position is resolved to one of 256 levels. 16 bits per
sample divides the element position into 65536 possible values.
Similar to sample rate, if you have enough processing power and
memory, go with 16 bits per sample. For comparison, an audio
Compact Disc is encoded with 16 bits per sample at about 44kHz.
</para>
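<para>
The storage and processing cost of these choices is easy to work out:
the raw PCM data rate is simply the sample rate times the bytes per
sample times the number of channels. A quick illustration:
</para>
<programlisting><![CDATA[
```python
def bytes_per_second(sample_rate, bits_per_sample, channels=1):
    """Raw (uncompressed) data rate of a PCM recording."""
    return sample_rate * bits_per_sample // 8 * channels

# 16kHz 16-bit mono speech vs. CD-quality 44.1kHz 16-bit stereo:
speech = bytes_per_second(16000, 16)          # 32000 bytes/sec
cd = bytes_per_second(44100, 16, channels=2)  # 176400 bytes/sec
```
]]></programlisting>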
<para>
The encoding format used should be simple - linear signed or
unsigned. Using a U-Law/A-Law algorithm or some other compression
scheme is usually not worth it, as it will cost you computing power
and not gain you much.
</para>
</sect2>
</sect1>
<!-- Section1: Publications -->
<sect1 id="publications">
<title>Publications</title>
<para>
If there is a publication that is not on this list, that you think
should be, please send the information to me at: <ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>.
</para>
<!-- Section2: Books -->
<sect2 id="books">
<title>Books</title>
<para>
<itemizedlist>
<listitem><para>
"Fundamentals of Speech Recognition". L. Rabiner and B. Juang. 1993.
ISBN: 0130151572.
</para></listitem>
<listitem><para>
"How to Build a Speech Recognition Application". B. Balentine,
D. Morgan, and W. Meisel. 1999. ISBN: 0967127815.
</para></listitem>
<listitem><para>
"Speech Recognition : Theory and C++ Implementation". C. Becchetti
and L.P. Ricotti. 1999. ISBN: 0471977306.
</para></listitem>
<listitem><para>
"Applied Speech Technology". A. Syrdal, R. Bennett, S. Greenspan.
1994. ISBN: 0849394562.
</para></listitem>
<listitem><para>
"Speech Recognition : The Complete Practical Reference Guide".
P. Foster, T. Schalk. 1993. ISBN: 0936648392.
</para></listitem>
<listitem><para>
"Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics and Speech Recognition".
D. Jurafsky, J. Martin. 2000. ISBN: 0130950696.
</para></listitem>
<listitem><para>
"Discrete-Time Processing of Speech Signals (IEEE Press Classic
Reissue)". J. Deller, J. Hansen, J. Proakis. 1999.
ISBN: 0780353862.
</para></listitem>
<listitem><para>
"Statistical Methods for Speech Recognition (Language, Speech, and
Communication)". F. Jelinek. 1999. ISBN: 0262100665.
</para></listitem>
<listitem><para>
"Digital Processing of Speech Signals" L. Rabiner, R. Schafer. 1978.
ISBN: 0132136031
</para></listitem>
<listitem><para>
"Foundations of Statistical Natural Language Processing".
C. Manning, H. Schutze. 1999. ISBN: 0262133601.
</para></listitem>
<listitem><para>
"Designing Effective Speech Interfaces".
S. Weinschenk, D. T. Barker. 2000. ISBN: 0471375454.
</para></listitem>
</itemizedlist>
</para>
<para>
For a very large online bibliography, check the Institut Fur Phonetik:
http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
</para>
</sect2>
<!-- Section2: internet -->
<sect2 id="internet">
<title>Internet</title>
<para>
<variablelist>
<varlistentry>
<term>news:comp.speech</term>
<listitem><para>
Newsgroup dedicated to computers and speech.
<itemizedlist>
<listitem><para>
US: http://www.speech.cs.cmu.edu/comp.speech/
</para></listitem>
<listitem><para>
UK: http://svr-www.eng.cam.ac.uk/comp.speech/
</para></listitem>
<listitem><para>
Aus: http://www.speech.su.oz.au/comp.speech/
</para></listitem>
</itemizedlist>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>news:comp.speech.users</term>
<listitem><para>
Newsgroup dedicated to users of speech software.
</para><para>
<itemizedlist>
<listitem><para>
http://www.speechtechnology.com/users/comp.speech.users.html
</para></listitem>
</itemizedlist>
</para></listitem>
</varlistentry>
<varlistentry>
<term>news:comp.speech.research</term>
<listitem><para>
Newsgroup dedicated to speech software and hardware research.
</para></listitem>
</varlistentry>
<varlistentry>
<term>news:comp.dsp</term>
<listitem><para>
Newsgroup dedicated to digital signal processing.
</para></listitem>
</varlistentry>
<varlistentry>
<term>news:alt.sci.physics.acoustics</term>
<listitem><para>
Newsgroup dedicated to the physics of sound.
</para></listitem>
</varlistentry>
<varlistentry>
<term>DDLinux Email List</term>
<listitem><para>
Speech Recognition on Linux Mailing List.
<itemizedlist>
<listitem><para>
Homepage: http://leb.net/ddlinux/
</para></listitem>
<listitem><para>
Archives: http://leb.net/pipermail/ddlinux/
</para></listitem>
</itemizedlist>
</para></listitem>
</varlistentry>
<varlistentry>
<term>Linux Software Repository for speech applications</term>
<listitem><para>
http://sunsite.uio.no/pub/linux/sound/apps/speech/
</para></listitem>
</varlistentry>
<varlistentry>
<term>Russ Wilcox's List of Speech Recognition Links</term>
<listitem><para>
(excellent) http://www.tiac.net/users/rwilcox/speech.html
</para></listitem>
</varlistentry>
<varlistentry>
<term>Online Bibliography</term>
<listitem><para>
Online Bibliography of Phonetics and Speech Technology Publications.
http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
</para></listitem>
</varlistentry>
<varlistentry>
<term>MIT's Spoken Language Systems Homepage</term>
<listitem><para>
http://www.sls.lcs.mit.edu/sls/
</para></listitem>
</varlistentry>
<varlistentry>
<term>Oregon Graduate Institute</term>
<listitem><para>
Center for Spoken Language Understanding at Oregon Graduate
Institute. An excellent location for developers and researchers.
http://cslu.cse.ogi.edu/
</para></listitem>
</varlistentry>
<varlistentry>
<term>IBM's ViaVoice Linux SDK</term>
<listitem><para>
http://www-4.ibm.com/software/speech/dev/sdk_linux.html
</para></listitem>
</varlistentry>
<varlistentry>
<term>Mississippi State</term>
<listitem><para>
Mississippi State Institute for Signal and Information Processing
homepage with a large amount of useful information for developers.
http://www.isip.msstate.edu/projects/speech/
</para></listitem>
</varlistentry>
<varlistentry>
<term>Speech Technology</term>
<listitem><para>
ASR software and accessories.
http://www.speechtechnology.com
</para></listitem>
</varlistentry>
<varlistentry>
<term>Speech Control</term>
<listitem><para>
Speech Controlled Computer Systems. Microphones, headsets, and
wireless products for ASR.
http://www.speechcontrol.com
</para></listitem>
</varlistentry>
<varlistentry>
<term>Microphones.com</term>
<listitem><para>
Microphones and accessories for ASR.
http://www.microphones.com
</para></listitem>
</varlistentry>
<varlistentry>
<term>21st Century Eloquence</term>
<listitem><para>
"Speech Recognition Specialists."
http://voicerecognition.com
</para></listitem>
</varlistentry>
<varlistentry>
<term>Computing Out Loud</term>
<listitem><para>
Primarily for Windows users, but good info.
http://www.out-loud.com
</para></listitem>
</varlistentry>
<varlistentry>
<term>Say I Can.com</term>
<listitem><para>
"The Speech Recognition Information Source."
http://www.sayican.com
</para></listitem>
</varlistentry>
</variablelist>
</para>
</sect2>
</sect1>
</article>