<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
|
|
<article>
|
|
|
|
<!-- Header -->
|
|
|
|
<articleinfo>
|
|
<title>Speech Recognition HOWTO</title>
|
|
|
|
<author>
|
|
<firstname>Stephen</firstname>
|
|
<surname>Cook</surname>
|
|
<affiliation>
|
|
<address>
|
|
<email>scook@gear21.com</email>
|
|
</address>
|
|
</affiliation>
|
|
</author>
|
|
|
|
<revhistory>
|
|
<revision>
|
|
<revnumber>v2.0</revnumber>
|
|
<date>April 19, 2002</date>
|
|
<authorinitials>scc</authorinitials>
|
|
<revremark>
|
|
Changed license information (now GFDL) and added a new publication.
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>v1.2</revnumber>
|
|
<date>February 5, 2002</date>
|
|
<authorinitials>scc</authorinitials>
|
|
<revremark>
|
|
Added more commercial software listings (sent by Mayur Patel).
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>v1.1</revnumber>
|
|
<date>October 5, 2001</date>
|
|
<authorinitials>scc</authorinitials>
|
|
<revremark>
|
|
Added info for Vocalis Speechware. Fixed/Updated various other items.
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>v1.0</revnumber>
|
|
<date>November 20, 2000</date>
|
|
<authorinitials>scc</authorinitials>
|
|
<revremark>
|
|
Added info on L and H and HTK
|
|
</revremark>
|
|
</revision>
|
|
<revision>
|
|
<revnumber>v0.5</revnumber>
|
|
<date>September 13, 2000</date>
|
|
<authorinitials>scc</authorinitials>
|
|
<revremark>
|
|
Initial HOWTO Submission
|
|
</revremark>
|
|
</revision>
|
|
</revhistory>
|
|
|
|
<abstract>
|
|
<indexterm>
|
|
<primary>Speech Recognition</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Automatic Speech Recognition (ASR) on Linux is becoming easier.
|
|
Several packages are available for users as well as developers.
|
|
This document describes the basics of speech recognition and
|
|
describes some of the available software.
|
|
</para>
|
|
</abstract>
|
|
</articleinfo>
|
|
|
|
<!-- Section1: legal -->
|
|
|
|
<sect1 id="legal">
|
|
<title>Legal Notices</title>
|
|
|
|
<!-- Section2: copyright -->
|
|
|
|
<sect2 id="copyright">
|
|
<title>Copyright/License</title>
|
|
|
|
<para>
|
|
Copyright (c) 2000-2002 Stephen C. Cook.
|
|
Permission is granted to copy, distribute, and/or modify this document under the
|
|
terms of the GNU Free Documentation License, Version 1.1 or any later version published
|
|
by the Free Software Foundation.
|
|
</para>
|
|
|
|
<para>
|
|
This document is made available under the terms of the <ulink url="http://www.gnu.org/copyleft/fdl.html">
|
|
GNU Free Documentation License (GFDL)</ulink>, which is hereby
|
|
incorporated by reference.
|
|
</para>
|
|
</sect2>
|
|
|
|
<!-- Section2: disclaimer -->
|
|
|
|
<sect2 id="disclaimer">
|
|
<title>Disclaimer</title>
|
|
|
|
<para>
|
|
The author disclaims all warranties with regard to this document,
|
|
including all implied warranties of merchantability and fitness for a
|
|
certain purpose; in no event shall the author be liable for any
|
|
special, indirect or consequential damages or any damages whatsoever
|
|
resulting from loss of use, data or profits, whether in an action of
|
|
contract, negligence or other tortious action, arising out of or in
|
|
connection with the use of this document.
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
<!-- Section2: trademarks -->
|
|
|
|
<sect2 id="trademarks">
|
|
<title>Trademarks</title>
|
|
|
|
<para>
|
|
All trademarks contained in this document are the property
|
|
of their respective owners.
|
|
</para>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<!-- Section1: Forward -->
|
|
|
|
<sect1 id="forward">
|
|
<title>Foreword</title>
|
|
|
|
<!-- Section2: about -->
|
|
|
|
<sect2 id="about">
|
|
<title>About This Document</title>
|
|
|
|
<para>
|
|
This document is targeted at the beginner to intermediate level Linux
|
|
user interested in learning about Speech Recognition and trying it out.
|
|
It may also help the interested developer in explaining the basics of
|
|
speech recognition programming.
|
|
</para>
|
|
|
|
<para>
|
|
I started this document when I began researching what speech
|
|
recognition software and development libraries were available for Linux.
|
|
Automatic Speech Recognition (ASR or just SR) on Linux is just starting
|
|
to come into its own, and I hope this document gives it a push in the
|
|
right direction - by supporting both users and developers of ASR
|
|
technology.
|
|
</para>
|
|
|
|
<para>
|
|
I have left a variety of SR techniques out of this document, and
|
|
instead I have focused on the "HOWTO" aspect (since this is a howto...).
|
|
I have included a Publications section so the interested reader can
|
|
find books and articles on anything not covered here. This is not
|
|
meant to be a definitive statement of ASR on Linux.
|
|
</para>
|
|
|
|
<para>
|
|
For the most recent version of this document, check the LDP archive,
|
|
or go to:
|
|
|
|
<ulink url="http://www.gear21.com/speech/index.html">
|
|
http://www.gear21.com/speech/index.html</ulink>.
|
|
</para>
|
|
</sect2>
|
|
|
|
<!-- Section2: acknowledgements -->
|
|
|
|
<sect2 id="acknowledgements">
|
|
<title>Acknowledgements</title>
|
|
|
|
<para>
|
|
I would like to thank the following people for the help, reviewing,
|
|
and support of this document:
|
|
</para>
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
Jessica Perry Hekman
|
|
</para></listitem>
|
|
<listitem><para>
|
|
Geoff Wexler
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
<!-- Section2: comments -->
|
|
|
|
<sect2 id="comments">
|
|
<title>Comments/Updates/Feedback</title>
|
|
|
|
<para>
|
|
If you have any comments, suggestions, revisions, updates, or just
|
|
want to chat about ASR, please send an email to me at
|
|
<ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
|
|
<!-- Section2: todo -->
|
|
|
|
<sect2 id="todo">
|
|
<title>ToDo</title>
|
|
|
|
<para>
|
|
The following things are left "to do":
|
|
</para>
|
|
|
|
<para>
|
|
<itemizedlist>
|
|
<listitem><para>
|
|
Add descriptions in the Publications section.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
Add more books to the Publications section.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
Add more links with descriptions.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
Enhance the description of the ASR system steps
|
|
</para></listitem>
|
|
<listitem><para>
|
|
Include descriptions of FFTs and Filters.
|
|
</para></listitem>
|
|
<listitem><para>
|
|
Include descriptions of DSP principles.
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
<!-- Section2: revision -->
|
|
|
|
<sect2 id="revision">
|
|
<title>Revision History</title>
|
|
|
|
<para>
|
|
v0.1 first rough draft - August 2000
|
|
</para>
|
|
<para>
|
|
v0.5 final draft - September 2000
|
|
</para>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<!-- Section1: Introduction -->
|
|
|
|
<sect1 id="introduction">
|
|
<title>Introduction</title>
|
|
|
|
<!-- Section2: basics -->
|
|
|
|
<sect2 id="basics">
|
|
<title>Speech Recognition Basics</title>
|
|
|
|
<para>
|
|
Speech recognition is the process by which a computer (or
|
|
other type of machine) identifies spoken words. Basically, it means
|
|
talking to your computer, AND having it correctly recognize what you
|
|
are saying.
|
|
</para>
|
|
|
|
<para>
|
|
The following definitions are the basics needed for understanding
|
|
speech recognition technology.
|
|
</para>
|
|
|
|
<para>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Utterance</term>
|
|
<listitem>
|
|
<para>
|
|
An utterance is the vocalization (speaking) of a word or words that
|
|
represent a single meaning to the computer. Utterances can be a
|
|
single word, a few words, a sentence, or even multiple sentences.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Speaker Dependence</term>
|
|
<listitem>
|
|
<para>
|
|
Speaker dependent systems are designed around a specific speaker.
|
|
They generally are more accurate for the correct speaker, but much
|
|
less accurate for other speakers. They assume the speaker will
|
|
speak in a consistent voice and tempo. Speaker independent systems
|
|
are designed for a variety of speakers. Adaptive systems usually start
|
|
as speaker independent systems and utilize training techniques to
|
|
adapt to the speaker to increase their recognition accuracy.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Vocabularies</term>
|
|
<listitem>
|
|
<para>
|
|
Vocabularies (or dictionaries) are lists of words or utterances that
|
|
can be recognized by the SR system. Generally, smaller vocabularies
|
|
are easier for a computer to recognize, while larger vocabularies
|
|
are more difficult. Unlike normal dictionaries, each entry doesn't
|
|
have to be a single word. They can be as long as a sentence or two.
|
|
Smaller vocabularies can have as few as 1 or 2 recognized utterances
|
|
(e.g., "Wake Up"), while very large vocabularies can have a hundred
|
|
thousand or more!
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Accuracy</term>
|
|
<listitem>
|
|
<para>
|
|
The ability of a recognizer can be examined by measuring its
|
|
accuracy - or how well it recognizes utterances. This includes not
|
|
only correctly identifying an utterance but also identifying if the
|
|
spoken utterance is not in its vocabulary. Good ASR systems have an
|
|
accuracy of 98% or more! The acceptable accuracy of a system
|
|
really depends on the application.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Training</term>
|
|
<listitem>
|
|
<para>
|
|
Some speech recognizers have the ability to adapt to a speaker.
|
|
When the system has this ability, it may allow training to take
|
|
place. An ASR system is trained by having the speaker repeat
|
|
standard or common phrases and adjusting its comparison algorithms
|
|
to match that particular speaker. Training a recognizer usually
|
|
improves its accuracy.
|
|
</para>
|
|
<para>
|
|
Training can also be used by speakers that have difficulty
|
|
speaking, or pronouncing certain words. As long as the speaker
|
|
can consistently repeat an utterance, ASR systems with training
|
|
should be able to adapt.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</para>
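The accuracy figure above can be made concrete. A common way to score a recognizer is to compare its output against a reference transcript using word-level edit distance. This is my own illustration, not a method prescribed by any package listed here:

```python
# My own illustration (not from any package listed here): score a
# recognizer by word-level edit distance against a reference transcript.
def word_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words: substitutions, insertions, deletions
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("open a new xterm", "open a new term"))  # prints 0.75
```

A 98% system would score 0.98 or better over a long test transcript; note that a rejected out-of-vocabulary utterance also counts toward accuracy.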
|
|
</sect2>
|
|
|
|
<!-- Section2: types -->
|
|
|
|
<sect2 id="types">
|
|
<title>Types of Speech Recognition</title>
|
|
|
|
<para>
|
|
Speech recognition systems can be separated into several different
|
|
classes by describing what types of utterances they have the ability
|
|
to recognize. These classes are based on the fact that one of the
|
|
difficulties of ASR is the ability to determine when a speaker starts
|
|
and finishes an utterance. Most packages can fit into more than one
|
|
class, depending on which mode they're using.
|
|
</para>
|
|
|
|
<para>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Isolated Words</term>
|
|
<listitem>
|
|
<para>
|
|
Isolated word recognizers usually require each utterance to have
|
|
quiet (lack of an audio signal) on BOTH sides of the sample window.
|
|
This doesn't mean it accepts only single words, but it does require
|
|
a single utterance at a time. Often, these systems have
|
|
"Listen/Not-Listen" states, where they require the speaker to wait
|
|
between utterances (usually doing processing during the pauses).
|
|
Isolated Utterance might be a better name for this class.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Connected Words</term>
|
|
<listitem>
|
|
<para>
|
|
Connected word systems (or more correctly 'connected utterances')
|
|
are similar to Isolated words, but allow separate utterances to be
|
|
'run-together' with a minimal pause between them.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Continuous Speech</term>
|
|
<listitem>
|
|
<para>
|
|
Continuous recognition is the next step. Recognizers with continuous
|
|
speech capabilities are some of the most difficult to create because
|
|
they must utilize special methods to determine utterance boundaries.
|
|
Continuous speech recognizers allow users to speak almost naturally,
|
|
while the computer determines the content. Basically, it's computer
|
|
dictation.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Spontaneous Speech</term>
|
|
<listitem>
|
|
<para>
|
|
There appears to be a variety of definitions for what spontaneous
|
|
speech actually is. At a basic level, it can be thought of as
|
|
speech that is natural sounding and not rehearsed. An ASR system
|
|
with spontaneous speech ability should be able to handle a variety
|
|
of natural speech features such as words being run together, "ums"
|
|
and "ahs", and even slight stutters.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Voice Verification/Identification</term>
|
|
<listitem>
|
|
<para>
|
|
Some ASR systems have the ability to identify specific users. This
|
|
document doesn't cover verification or security systems.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</para>
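The "quiet on both sides of the sample window" requirement for isolated-word recognizers can be sketched with a simple energy threshold. This is my own toy example (the threshold and silence-length values are illustrative, not standard):

```python
# Hypothetical sketch (mine, not from any listed package) of the quiet
# detection an isolated-word recognizer uses to bound the sample window.
def find_utterance(samples, threshold=500, min_silence=8000):
    """Return (start, end) indices of the first stretch of 16-bit
    samples louder than `threshold`, closed off once `min_silence`
    quiet samples follow it. The numbers are illustrative only."""
    start = end = None
    quiet = 0
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if start is None:
                start = i             # utterance begins
            end = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:  # enough trailing quiet: stop listening
                break
    return (start, end)
```

Continuous-speech systems are harder precisely because they cannot rely on a clean quiet gap like this to mark utterance boundaries.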
|
|
</sect2>
|
|
|
|
<!-- Section2: uses -->
|
|
|
|
<sect2 id="uses">
|
|
<title>Uses and Applications</title>
|
|
|
|
<para>
|
|
Although any task that involves interfacing with a computer can
|
|
potentially use ASR, the following applications are the most
|
|
common right now.
|
|
</para>
|
|
|
|
<para>
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term>Dictation</term>
|
|
<listitem>
|
|
<para>
|
|
Dictation is the most common use for ASR systems today. This
|
|
includes medical transcriptions, legal and business dictation, as
|
|
well as general word processing. In some cases special vocabularies
|
|
are used to increase the accuracy of the system.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Command and Control</term>
|
|
<listitem>
|
|
<para>
|
|
ASR systems that are designed to perform functions and actions on the
|
|
system are defined as Command and Control systems. Utterances like
|
|
"Open Netscape" and "Start a new xterm" will do just that.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Telephony</term>
|
|
<listitem>
|
|
<para>
|
|
Some PBX/Voice Mail systems allow callers to speak commands instead of
|
|
pressing buttons to send specific tones.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Wearables</term>
|
|
<listitem>
|
|
<para>
|
|
Because inputs are limited for wearable devices, speaking is a
|
|
natural possibility.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Medical/Disabilities</term>
|
|
<listitem>
|
|
<para>
|
|
Many people have difficulty typing due to physical limitations such
|
|
as repetitive strain injuries (RSI), muscular dystrophy, and
|
|
many others. For example, people with difficulty hearing could use
|
|
a system connected to their telephone to convert the caller's speech
|
|
to text.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term>Embedded Applications</term>
|
|
<listitem>
|
|
<para>
|
|
Some newer cellular phones include C&amp;C speech recognition that allows
|
|
utterances such as "Call Home". This could be a major factor in the
|
|
future of ASR and Linux. Why can't I talk to my television yet?
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</para>
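A command-and-control system like the ones above is, at its core, a small vocabulary mapped to actions. Here is a minimal sketch of that mapping (my own example; it is not how any listed package stores its vocabulary):

```python
# Hypothetical command-and-control table: recognized utterances map to
# the commands they should trigger.
COMMANDS = {
    "open netscape": ["netscape"],
    "start a new xterm": ["xterm"],
}

def lookup(utterance):
    """Return the argv list to execute for an utterance, or None if the
    utterance isn't in the vocabulary (rejection matters for accuracy)."""
    return COMMANDS.get(utterance.lower())

# A real system would hand the result to subprocess.Popen(...) to run it.
```

The small, fixed vocabulary is what makes command and control much easier to get accurate than general dictation.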
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<!-- Section1: Hardware -->
|
|
|
|
<sect1 id="hardware">
|
|
<title>Hardware</title>
|
|
|
|
<!-- Section2: soundcards -->
|
|
|
|
<sect2 id="soundcards">
|
|
<title>Sound Cards</title>
|
|
|
|
<para>
|
|
Because speech requires a relatively low bandwidth, just about any
|
|
medium-to-high quality 16-bit sound card will get the job done. You must
|
|
have sound enabled in your kernel, and you must have correct drivers
|
|
installed. For more information on sound cards, please see "The Linux
|
|
Sound HOWTO" available at: http://www.LinuxDoc.org/. Sound card
|
|
quality often starts a heated discussion about its impact on accuracy
|
|
and noise.
|
|
</para>
|
|
|
|
<para>
|
|
Sound cards with the 'cleanest' A/D (analog to digital) conversions
|
|
are recommended, but most often the clarity of the digital sample is
|
|
more dependent on the microphone quality and even more dependent on the
|
|
environmental noise. Electrical "noise" from monitors, PCI slots,
hard-drives, etc. is usually nothing compared to audible noise
|
|
from the computer fans, squeaking chairs, or heavy breathing.
|
|
</para>
|
|
|
|
<para>
|
|
Some ASR software packages may require a specific sound card. It's
|
|
usually a good idea to stay away from specific hardware requirements,
|
|
because it limits many of your possible future options and decisions.
|
|
You'll have to weigh the benefits and costs if you are considering
|
|
packages that require specific hardware to function properly.
|
|
</para>
|
|
</sect2>
|
|
|
|
<!-- Section2: Microphones -->
|
|
|
|
<sect2 id="microphones">
|
|
<title>Microphones</title>
|
|
<para>
|
|
A quality microphone is key when utilizing ASR. In most cases, a
|
|
desktop microphone just won't do the job. They tend to pick up more
|
|
ambient noise, which gives ASR programs a hard time.
|
|
</para>
|
|
|
|
<para>
|
|
Hand held microphones are also not the best choice as they can be
|
|
cumbersome to pick up all the time. While they do limit the amount
|
|
of ambient noise, they are most useful in applications that require
|
|
changing speakers often, or when speaking to the recognizer isn't
|
|
done frequently (when wearing a headset isn't an option).
|
|
</para>
|
|
|
|
<para>
|
|
The best choice, and by far the most common, is the headset style.
|
|
It allows the ambient noise to be minimized, while allowing you to
|
|
have the microphone at the tip of your tongue all the time. Headsets
|
|
are available without earphones and with earphones (mono or stereo).
|
|
I recommend the stereo headphones, but it's just a matter of personal
|
|
taste.
|
|
</para>
|
|
|
|
<para>
|
|
You can get excellent quality microphone headsets for between $25 and
|
|
$100. A good place to start looking is http://www.headphones.com or
|
|
http://www.speechcontrol.com.
|
|
</para>
|
|
|
|
<para>
|
|
A quick note about levels: Don't forget to turn up your microphone
|
|
volume. This can be done with a program such as XMixer or OSS Mixer
|
|
and care should be used to avoid feedback noise. If the ASR software
|
|
includes auto-adjustment programs, use them instead, as they are
|
|
optimized for their particular recognition system.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<!-- Section2: computers -->
|
|
|
|
<sect2 id="computers">
|
|
<title>Computers/Processors</title>
|
|
<para>
|
|
ASR applications can be heavily dependent on processing speed. This
|
|
is because a large amount of digital filtering and signal processing
|
|
can take place in ASR.
|
|
</para>
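To give a taste of where that CPU time goes, here is a naive discrete Fourier transform of one audio frame (my own illustration; real recognizers use a fast FFT, but still run it on the order of a hundred frames every second):

```python
import cmath

# Naive discrete Fourier transform of one audio frame - O(n^2) multiplies.
# Real recognizers use a fast FFT, but the workload idea is the same.
def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A pure complex tone: all of its energy lands in frequency bin 4.
frame = [cmath.exp(2j * cmath.pi * 4 * t / 64) for t in range(64)]
spectrum = [abs(x) for x in dft(frame)]
print(spectrum.index(max(spectrum)))  # prints 4
```

Multiply that work by every frame of every utterance, plus the search over the vocabulary, and the hardware recommendations above start to make sense.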
|
|
|
|
<para>
|
|
As with just about any cpu intensive software, the faster the better.
|
|
Also, the more memory the better. It's possible to do some SR with a
100MHz processor and 16MB of RAM, but for fast processing (large
dictionaries, complex recognition schemes, or high sample rates), you
should shoot for a minimum of a 400MHz processor and 128MB of RAM.
Because of the processing required,
|
|
most software packages list their minimum requirements.
|
|
</para>
|
|
|
|
<para>
|
|
Using a cluster (Beowulf or otherwise) to perform massive recognition
|
|
efforts hasn't yet been undertaken. If you know of any project underway,
|
|
or in development please send me a note! <ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
|
|
|
|
<!-- Section1: Software -->
|
|
|
|
<sect1 id="software">
|
|
<title>Speech Recognition Software</title>
|
|
|
|
|
|
<!-- Section2: Free Software -->
|
|
|
|
<sect2 id="freesoftware">
|
|
<title>Free Software</title>
|
|
<para>
|
|
Much of the free software listed here is available for download at:
|
|
http://sunsite.uio.no/pub/Linux/sound/apps/speech/
|
|
</para>
|
|
|
|
|
|
<!-- Section3: XVoice -->
|
|
|
|
<sect3 id="xvoice">
|
|
<title>XVoice</title>
|
|
<para>
|
|
XVoice is a dictation/continuous speech recognizer that can be used
|
|
with a variety of XWindow applications. It allows user-defined macros.
|
|
This is a fine program with a definite future. Once set up, it
|
|
performs with adequate accuracy.
|
|
</para>
|
|
|
|
<para>
|
|
XVoice requires that you download and install IBM's (free) ViaVoice
|
|
for Linux (See Commercial Section). It also requires the configuration
|
|
of ViaVoice to work correctly. Additionally, Lesstif/Motif (libXm) is
|
|
required. It is also important to note that because this program
|
|
interacts with the X Window System, you must leave X resources open on your
|
|
machine, so caution should be used if you use this on a networked or
|
|
multi-user machine.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for users. An RPM is available.
|
|
</para>
|
|
|
|
<para>
|
|
HomePage: http://www.compapp.dcu.ie/~tdoris/Xvoice/
|
|
http://www.zachary.com/creemer/xvoice.html
|
|
</para>
|
|
|
|
<para>
|
|
Project: http://xvoice.sourceforge.net
|
|
</para>
|
|
|
|
<para>
|
|
Community: http://www.onelist.com/community/xvoice
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
<!-- Section3: CVoiceControl -->
|
|
|
|
<sect3 id="cvoicecontrol">
|
|
<title>CVoiceControl/kVoiceControl</title>
|
|
<para>
|
|
CVoiceControl (which stands for Console Voice Control) started its
|
|
life as KVoiceControl (KDE Voice Control). It is a basic speech
|
|
recognition system that allows a user to execute Linux commands by
|
|
using spoken commands. CVoiceControl replaces KVoiceControl.
|
|
</para>
|
|
|
|
<para>
|
|
The software includes a microphone level configuration utility,
|
|
a vocabulary "model editor" for adding new commands and utterances,
|
|
and the speech recognition system.
|
|
</para>
|
|
|
|
<para>
|
|
CVoiceControl is an excellent starting point for experienced users
|
|
looking to get started in ASR. It is not the most user friendly,
|
|
but once it has been trained correctly, it can be very helpful. Be
|
|
sure to read the documentation while setting up.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for users.
|
|
</para>
|
|
|
|
<para>
|
|
Homepage: http://www.kiecza.de/daniel/linux/index.html
|
|
</para>
|
|
|
|
<para>
|
|
Documents: http://www.kiecza.de/daniel/linux/cvoicecontrol/index.html
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: Open Mind Speech -->
|
|
|
|
<sect3 id="openmind">
|
|
<title>Open Mind Speech</title>
|
|
<para>
|
|
Started in late 1999, Open Mind Speech has changed names several times
|
|
(was VoiceControl, then SpeechInput, and then FreeSpeech), and is now
|
|
part of the "Open Mind Initiative". This is an open source project.
|
|
Currently it isn't completely operational.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
Homepage: http://freespeech.sourceforge.net
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: GVoice -->
|
|
|
|
<sect3 id="gvoice">
|
|
<title>GVoice</title>
|
|
<para>
|
|
GVoice is a speech ASR library that uses IBM's ViaVoice (free) SDK
|
|
to control Gtk/GNOME applications. It includes libraries for
|
|
initialization, recognition engine, vocabulary manipulation, and panel
|
|
control. Development on this has been idle for over a year.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
Homepage: http://www.cse.ogi.edu/~omega/gnome/gvoice/
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: ISIP -->
|
|
|
|
<sect3 id="isip">
|
|
<title>ISIP</title>
|
|
<para>
|
|
The Institute for Signal and Information Processing at Mississippi
|
|
State University has made its speech recognition engine available. The
|
|
toolkit includes a front-end, a decoder, and a training module. It's a
|
|
functional toolkit.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
The toolkit (and more information about ISIP) is available at:
|
|
http://www.isip.msstate.edu/project/speech/
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: Sphinx -->
|
|
|
|
<sect3 id="sphinx">
|
|
<title>CMU Sphinx</title>
|
|
<para>
|
|
Sphinx started at CMU and has recently been released as
|
|
open source. This is a fairly large program that includes a lot of
|
|
tools and information. It is still "in development", but includes
|
|
trainers, recognizers, acoustic models, language models, and some
|
|
limited documentation.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html
|
|
</para>
|
|
|
|
<para>
|
|
Source: http://download.sourceforge.net/cmusphinx/sphinx2-0.1a.tar.gz
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: ears -->
|
|
|
|
<sect3 id="ears">
|
|
<title>Ears</title>
|
|
<para>
|
|
Although Ears isn't fully developed, it is a good starting
|
|
point for programmers wishing to start in ASR.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
FTP site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: NICO -->
|
|
|
|
<sect3 id="nico">
|
|
<title>NICO ANN Toolkit</title>
|
|
<para>
|
|
The NICO Artificial Neural Network toolkit is a flexible back
|
|
propagation neural network toolkit optimized for speech recognition
|
|
applications.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
Its homepage: http://www.speech.kth.se/NICO/index.html
|
|
</para>
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: Myers -->
|
|
|
|
<sect3 id="Myers">
|
|
<title>Myers' Hidden Markov Model Software</title>
|
|
<para>
|
|
This software by Richard Myers implements HMM algorithms in C++.
It provides an example and learning tool for the HMM models described in
|
|
the L. Rabiner book "Fundamentals of Speech Recognition".
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
Information is available at:
|
|
http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html
|
|
</para>
|
|
</sect3>
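As a taste of the algorithms in Rabiner's book, the forward algorithm computes the probability of an observation sequence under an HMM. This toy version is my own example, not part of Myers' C++ package, and the model numbers are made up:

```python
# Toy forward algorithm for a discrete HMM, as described in Rabiner's
# book. This is my example, not Myers' code; the numbers are made up.
def forward(obs, init, trans, emit):
    """Probability of an observation sequence `obs` given initial state
    probs `init`, transition matrix `trans`, and emissions `emit`."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * trans[s][t] for s in range(n)) * emit[t][o]
                 for t in range(n)]
    return sum(alpha)

init = [0.6, 0.4]                 # start-state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]  # state transition matrix
emit = [[0.9, 0.1], [0.2, 0.8]]   # emission probs per state/symbol
print(forward([0, 1], init, trans, emit))
```

An HMM recognizer scores each candidate word model this way and picks the one that makes the observed audio features most probable.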
|
|
|
|
|
|
<!-- Section3: Jialong -->
|
|
|
|
<sect3 id="Jialong">
|
|
<title>Jialong He's Speech Recognition Research Tool</title>
|
|
<para>
|
|
Although not originally written for Linux, this research tool can be
|
|
compiled on Linux. It contains three different types of recognizers:
|
|
DTW, Discrete Hidden Markov Model, and Continuous Density Hidden
|
|
Markov Model. This is for research and development uses, as it is
|
|
not a fully functional ASR system. The toolkit contains some very
|
|
useful tools.
|
|
</para>
|
|
|
|
<para>
|
|
This software is primarily for developers.
|
|
</para>
|
|
|
|
<para>
|
|
More information is available at:
|
|
http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html
|
|
</para>
|
|
</sect3>
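DTW (dynamic time warping) recognizers like the one in this toolkit align two feature sequences that differ in tempo. A minimal sketch of the idea (my own example, not the toolkit's code; real recognizers compare frames of spectral features, not raw numbers):

```python
# Minimal dynamic time warping distance (my sketch): align two feature
# sequences that differ in tempo.
def dtw(a, b):
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping moves
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

# A stretched utterance still matches its template with zero cost.
print(dtw([1, 2, 3], [1, 2, 2, 3]))  # prints 0.0
```

This is why DTW works well for small, speaker-dependent vocabularies: each stored template can match the same word spoken faster or slower.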
|
|
|
|
|
|
<!-- Section3: morefree -->
|
|
|
|
<sect3 id="morefree">
|
|
<title>More Free Software?</title>
|
|
<para>
|
|
If you know of free software that isn't included in the above list,
|
|
please send me a note at: <ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>. If you're in the mood,
|
|
you can also send me where to get a copy of the software, and any
|
|
impressions you may have about it. Thanks!
|
|
</para>
|
|
</sect3>
|
|
</sect2>
|
|
|
|
<!-- Section2: Commercial Software -->
|
|
|
|
<sect2 id="comsoftware">
|
|
<title>Commercial Software</title>
|
|
|
|
|
|
<!-- Section3: IBM ViaVoice -->
|
|
|
|
<sect3 id="viavoice">
|
|
<title>IBM ViaVoice</title>
|
|
<para>
|
|
IBM has made good on its promise to support Linux with its series
of ViaVoice products, though the future of its SDKs isn't
set in stone (the licensing agreement for developers hasn't officially
been released as of this writing - more to come).
|
|
</para>
|
|
|
|
<para>
|
|
Their commercial (not-free) product, IBM ViaVoice Dictation for Linux
|
|
(available at http://www-4.ibm.com/software/speech/linux/dictation.html)
|
|
performs very well, but has some sizeable system requirements compared
|
|
to the more basic ASR systems (64M RAM and 233MHz Pentium). For the
|
|
$59.95US price tag you also get an Andrea NC-8 microphone. It also
|
|
allows multiple users (but I haven't tried it with multiple users, so
|
|
if anyone has any experience please give me a shout). The package
|
|
includes: documentation (PDF), Trainer, dictation system, and
|
|
installation scripts. Support for additional Linux Distributions based
|
|
on 2.2 kernels is also available in the latest release.
|
|
</para>
|
|
|
|
<para>
|
|
The ASR SDK is available for free, and includes IBM's SMAPI, grammar
|
|
API, documentation, and a variety of sample programs. The ViaVoice
|
|
Run Time Kit provides an ASR engine and data files for dictation
|
|
functions, and user utilities. The ViaVoice Command &amp; Control Run Time
|
|
Kit includes the ASR engine and data files for command and control
|
|
functions, and user utilities. The SDK and Kits require 128M RAM and
|
|
a Linux 2.2 or better kernel.
|
|
</para>
|
|
|
|
<para>
|
|
The SDKs and Kits are available for free at:
|
|
http://www-4.ibm.com/software/speech/dev/sdk_linux.html
|
|
</para>
|
|
</sect3>
|
|
|
|
<!-- Section3: Vocalis Speechware -->
|
|
|
|
<sect3 id="vocalis">
|
|
<title>Vocalis Speechware</title>
|
|
<para>
|
|
More information on Vocalis and Vocalis Speechware is available at:
|
|
<ulink url="http://www.vocalisspeechware.com">
|
|
http://www.vocalisspeechware.com</ulink> and
|
|
<ulink url="http://www.vocalis.com">
|
|
http://www.vocalis.com</ulink>.
|
|
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
<!-- Section3: -->
|
|
|
|
<sect3 id="babeltech">
|
|
<title>Babel Technologies</title>
|
|
<para>
|
|
Babel Technologies has a Linux SDK available called Babear. It is a speaker-independent
system based on hybrid Hidden Markov Model and Artificial Neural Network technology. They also
|
|
have a variety of products for Text-to-speech, speaker verification, and phoneme analysis.
|
|
More information is available at: http://www.babeltech.com.
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
<!-- Section3: -->
|
|
|
|
<sect3 id="speechworks">
|
|
<title>SpeechWorks</title>
|
|
<para>
|
|
I didn't see anything on their website that specifically mentioned Linux, but their
|
|
"OpenSpeech Recognizer" uses VoiceXML, which is an open standard.
|
|
More information is available at: http://www.speechworks.com.
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
<!-- Section3: -->
|
|
|
|
<sect3 id="nuance">
|
|
<title>Nuance</title>
|
|
<para>
|
|
Nuance offers a speech recognition/natural language product (currently Nuance 8.0) for
|
|
a variety of *nix platforms. It can handle very large vocabularies and uses a unique
|
|
distributed architecture for scalability and fault tolerance.
|
|
More information is available at: http://www.nuance.com.
|
|
</para>
|
|
|
|
</sect3>
|
|
|
|
|
|
<!-- Section3: Abbot -->

<sect3 id="abbot">
<title>Abbot/AbbotDemo</title>
<para>
Abbot is a very large vocabulary, speaker-independent ASR system.
It was originally developed by the Connectionist Speech Group at
Cambridge University and was later transferred (commercialized) to
SoftSound. More information is available at:
<ulink url="http://www.softsound.com">http://www.softsound.com</ulink>.
</para>

<para>
AbbotDemo is a demonstration package of Abbot. This demo system
has a vocabulary of about 5000 words and uses the connectionist/HMM
continuous speech algorithm. It is a demonstration program with no
source code.
</para>

</sect3>
<!-- Section3: entropic -->

<sect3 id="entropic">
<title>Entropic</title>
<para>
The fine people over at Entropic have been bought out by Microsoft,
and their products and support services have all but disappeared.
Their support for HTK and ESPS/waves+ is gone, and their future is in
Microsoft's hands. Their old website at
<ulink url="http://www.entropic.com">http://www.entropic.com</ulink>
has more information.
</para>

<para>
K.K. Chin advised me that the original developers of HTK (the
Speech Vision and Robotics Group at Cambridge) are still providing
support for it. There is also a "free" version available at:
<ulink url="http://htk.eng.cam.ac.uk">http://htk.eng.cam.ac.uk</ulink>.
Also note that Microsoft still owns the copyright to the current
HTK code.
</para>

</sect3>
<!-- Section3: morecommercial -->

<sect3 id="morecom">
<title>More Commercial Products</title>
<para>
There are rumors of more commercial ASR products becoming available
in the near future (including L&amp;H). I talked with a couple of
L&amp;H representatives at Comdex 2000 (Vegas) and none of them could
give me any information on a Linux release, or even on whether they
planned to release any products for Linux. If you have any further
information, please send the details to me at
<ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>.
</para>

</sect3>
</sect2>

</sect1>

<!-- Section1: Inside -->

<sect1 id="inside">
<title>Inside Speech Recognition</title>

<!-- Section2: recognizers -->

<sect2 id="recognizers">
<title>How Recognizers Work</title>
<para>
Recognition systems can be broken down into two main types. Pattern
Recognition systems compare patterns to known/trained patterns to
determine a match. Acoustic Phonetic systems use knowledge of the
human body (speech production, and hearing) to compare speech features
(phonetics such as vowel sounds). Most modern systems focus on the
pattern recognition approach because it combines nicely with current
computing techniques and tends to have higher accuracy.
</para>
<para>
Most recognizers can be broken down into the following steps:
</para>

<para>
<orderedlist>
<listitem><para>
Audio recording and Utterance detection
</para></listitem>
<listitem><para>
Pre-Filtering (pre-emphasis, normalization, banding, etc.)
</para></listitem>
<listitem><para>
Framing and Windowing (chopping the data into a usable format)
</para></listitem>
<listitem><para>
Filtering (further filtering of each window/frame/freq. band)
</para></listitem>
<listitem><para>
Comparison and Matching (recognizing the utterance)
</para></listitem>
<listitem><para>
Action (Perform function associated with the recognized pattern)
</para></listitem>
</orderedlist>
</para>
<para>
Although each step seems simple, each one can involve a multitude of
different (and sometimes completely opposite) techniques.
</para>
<para>
(1) Audio recording and utterance detection can be accomplished in a
number of ways. Starting points can be found by comparing ambient audio
levels (acoustic energy in some cases) with the sample just recorded.
Endpoint detection is harder because speakers tend to leave "artifacts"
including breathing/sighing, teeth chatter, and echoes.
</para>
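<para>
As a rough illustration of the energy-comparison idea in step 1, the
following Python sketch estimates the ambient level from the leading
frames and flags frames whose short-term energy rises well above it.
The frame size, ambient-frame count, and threshold factor are all
illustrative choices, not values from any particular recognizer.
</para>

```python
# Hypothetical sketch: energy-based utterance detection.
# All parameters here are illustrative, not from a specific system.

def short_term_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def detect_utterance(samples, frame_size=160, ambient_frames=5, factor=4.0):
    """Return (first, last) indices of frames louder than the ambient
    threshold, or None if the recording never rises above it."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    # Estimate the ambient level from the first few (presumed quiet) frames.
    ambient = sum(short_term_energy(f) for f in frames[:ambient_frames]) / ambient_frames
    threshold = ambient * factor
    active = [i for i, f in enumerate(frames) if short_term_energy(f) > threshold]
    if not active:
        return None
    return (active[0], active[-1])
```

<para>
Real endpointers also pad the detected region and apply hang-over rules
so that trailing breath noise, or pauses inside a word, do not split
one utterance into several.
</para>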
<para>
(2) Pre-filtering is accomplished in a variety of ways, depending on
other features of the recognition system. The most common methods are
the "bank-of-filters" method, which utilizes a series of audio filters
to prepare the sample, and the Linear Predictive Coding method, which
uses a prediction function to calculate differences (errors). Different
forms of spectral analysis are also used.
</para>
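<para>
One element common to many pre-filtering stages is a pre-emphasis
filter that boosts high frequencies before further analysis. A minimal
sketch follows; the coefficient 0.95 is a typical textbook value, not
one mandated by any particular system.
</para>

```python
def pre_emphasis(samples, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    Boosts the high-frequency content of the speech signal, which
    tends to be weaker than the low-frequency content."""
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]
```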
<para>
(3) Framing/Windowing involves separating the sample data into
specific sizes. This is often rolled into step 2 or step 4. This step
also involves preparing the sample boundaries for analysis (removing
edge clicks, etc.)
</para>
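<para>
A common way to frame and window is to cut overlapping frames and
taper each one with a Hamming window, which smooths the frame edges
mentioned above. The sketch below assumes illustrative sizes: 400
samples per frame with a 160-sample hop, which at 16kHz corresponds
to 25 ms frames every 10 ms.
</para>

```python
import math

def frame_and_window(samples, frame_size=400, hop=160):
    """Chop the signal into overlapping frames and taper each one
    with a Hamming window so the frame edges do not introduce
    artificial clicks into later spectral analysis."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```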
<para>
(4) Additional Filtering is not always present. It is the final
preparation for each window before comparison and matching. Often this
consists of time alignment and normalization.
</para>
<para>
There are a huge number of techniques available for (5), Comparison
and Matching. Most involve comparing the current window with known
samples. There are methods that use Hidden Markov Models (HMM),
frequency analysis, differential analysis, linear algebra
techniques/shortcuts, spectral distortion, and time distortion methods.
All these methods are used to generate a probability and accuracy match.
</para>
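<para>
Of the time-distortion methods just mentioned, dynamic time warping
(DTW) is the classic example: it compares an utterance against a stored
template while tolerating differences in speaking rate. A minimal
sketch over one-dimensional feature sequences follows; real systems
compare multi-dimensional feature vectors per frame, and HMM decoders
are considerably more involved.
</para>

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.
    Builds the usual cumulative-cost table where each cell extends
    the cheapest of the three neighboring alignments."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    d = [[inf] * (cols + 1) for _ in range(rows + 1)]
    d[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch template
                                 d[i][j - 1],      # stretch input
                                 d[i - 1][j - 1])  # advance both
    return d[rows][cols]
```

<para>
Note that a time-stretched repetition of a template still matches it
with zero cost, which is exactly the tolerance to speaking-rate
variation that makes DTW useful for template matching.
</para>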
<para>
(6) Actions can be just about anything the developer wants. *GRIN*
</para>

</sect2>

<!-- Section2: digitalaudio -->
<sect2 id="digitalaudio">
<title>Digital Audio Basics</title>

<para>
Audio is inherently an analog phenomenon. Recording a digital sample
is done by converting the analog signal from the microphone to a
digital signal through the A/D converter in the sound card. When a
microphone is operating, sound waves vibrate the magnetic element in
the microphone, causing an electrical current to flow to the sound card
(think of a speaker working in reverse). Basically, the A/D converter
records the value of the electrical voltage at specific intervals.
</para>
<para>
There are two important factors during this process. First is the
"sample rate", or how often to record the voltage values. Second is
the "bits per sample", or how accurately the value is recorded. A third
item is the number of channels (mono or stereo), but for most ASR
applications mono is sufficient. Most applications use pre-set values
for these parameters, and users shouldn't change them unless the
documentation suggests it. Developers should experiment with different
values to determine what works best with their algorithms.
</para>
<para>
So what is a good sample rate for ASR? Because speech is relatively
low bandwidth (mostly between 100Hz-8kHz), 8000 samples/sec (8kHz) is
sufficient for most basic ASR. But some people prefer 16000
samples/sec (16kHz) because it provides more accurate high-frequency
information. If you have the processing power, use 16kHz. For most
ASR applications, sampling rates higher than about 22kHz are a waste.
</para>
<para>
And what is a good value for "bits per sample"? 8 bits per sample
records values between 0 and 255, which means that the position
of the microphone element is quantized into one of 256 positions.
16 bits per sample divides the element position into 65536 possible
values. As with the sample rate, if you have enough processing power
and memory, go with 16 bits per sample. For comparison, an audio
Compact Disc is encoded with 16 bits per sample at 44.1kHz.
</para>
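<para>
The cost of these choices is easy to work out: raw linear audio
consumes sample rate, times bytes per sample, times channels, every
second. A quick sketch of that arithmetic:
</para>

```python
def audio_data_rate(sample_rate, bits_per_sample, channels=1):
    """Bytes per second of raw (uncompressed, linear) audio."""
    return sample_rate * (bits_per_sample // 8) * channels

# 16kHz, 16-bit mono -- a common ASR capture format:
rate_asr = audio_data_rate(16000, 16)              # 32000 bytes/sec
# CD audio for comparison: 44.1kHz, 16-bit stereo:
rate_cd = audio_data_rate(44100, 16, channels=2)   # 176400 bytes/sec
```

<para>
So stepping up from 8kHz/8-bit to 16kHz/16-bit quadruples the data
rate, which is why the text above ties the choice to available
processing power and memory.
</para>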
<para>
The encoding format used should be simple: linear signed or
unsigned. Using a U-Law/A-Law algorithm or some other compression
scheme is usually not worth it, as it will cost you computing power
and not gain you much.
</para>

</sect2>

</sect1>

<!-- Section1: Publications -->
<sect1 id="publications">
<title>Publications</title>

<para>
If there is a publication not on this list that you think should be,
please send the information to me at:
<ulink url="mailto:scook@gear21.com">scook@gear21.com</ulink>.
</para>
<!-- Section2: Books -->

<sect2 id="books">
<title>Books</title>

<para>
<itemizedlist>
<listitem><para>
"Fundamentals of Speech Recognition". L. Rabiner &amp; B. Juang. 1993.
ISBN: 0130151572.
</para></listitem>

<listitem><para>
"How to Build a Speech Recognition Application". B. Balentine,
D. Morgan, and W. Meisel. 1999. ISBN: 0967127815.
</para></listitem>

<listitem><para>
"Speech Recognition: Theory and C++ Implementation". C. Becchetti
and L.P. Ricotti. 1999. ISBN: 0471977306.
</para></listitem>

<listitem><para>
"Applied Speech Technology". A. Syrdal, R. Bennett, S. Greenspan.
1994. ISBN: 0849394562.
</para></listitem>

<listitem><para>
"Speech Recognition: The Complete Practical Reference Guide".
P. Foster, T. Schalk. 1993. ISBN: 0936648392.
</para></listitem>

<listitem><para>
"Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics and Speech Recognition".
D. Jurafsky, J. Martin. 2000. ISBN: 0130950696.
</para></listitem>

<listitem><para>
"Discrete-Time Processing of Speech Signals (IEEE Press Classic
Reissue)". J. Deller, J. Hansen, J. Proakis. 1999.
ISBN: 0780353862.
</para></listitem>

<listitem><para>
"Statistical Methods for Speech Recognition (Language, Speech, and
Communication)". F. Jelinek. 1999. ISBN: 0262100665.
</para></listitem>

<listitem><para>
"Digital Processing of Speech Signals". L. Rabiner, R. Schafer. 1978.
ISBN: 0132136031.
</para></listitem>

<listitem><para>
"Foundations of Statistical Natural Language Processing".
C. Manning, H. Schutze. 1999. ISBN: 0262133601.
</para></listitem>

<listitem><para>
"Designing Effective Speech Interfaces".
S. Weinschenk, D. T. Barker. 2000. ISBN: 0471375454.
</para></listitem>
</itemizedlist>
</para>
<para>
For a very LARGE online bibliography, check the Institut Fur Phonetik:
<ulink url="http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html">
http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html</ulink>
</para>

</sect2>
<!-- Section2: internet -->

<sect2 id="internet">
<title>Internet</title>

<para>
<variablelist>

<varlistentry>
<term>news:comp.speech</term>
<listitem><para>
Newsgroup dedicated to computers and speech.
<itemizedlist>
<listitem><para>
US: http://www.speech.cs.cmu.edu/comp.speech/
</para></listitem>
<listitem><para>
UK: http://svr-www.eng.cam.ac.uk/comp.speech/
</para></listitem>
<listitem><para>
Aus: http://www.speech.su.oz.au/comp.speech/
</para></listitem>
</itemizedlist>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>news:comp.speech.users</term>
<listitem><para>
Newsgroup dedicated to users of speech software.
</para><para>
<itemizedlist>
<listitem><para>
http://www.speechtechnology.com/users/comp.speech.users.html
</para></listitem>
</itemizedlist>
</para></listitem>
</varlistentry>

<varlistentry>
<term>news:comp.speech.research</term>
<listitem><para>
Newsgroup dedicated to speech software and hardware research.
</para></listitem>
</varlistentry>
<varlistentry>
<term>news:comp.dsp</term>
<listitem><para>
Newsgroup dedicated to digital signal processing.
</para></listitem>
</varlistentry>

<varlistentry>
<term>news:alt.sci.physics.acoustics</term>
<listitem><para>
Newsgroup dedicated to the physics of sound.
</para></listitem>
</varlistentry>

<varlistentry>
<term>DDLinux Email List</term>
<listitem><para>
Speech Recognition on Linux Mailing List.
<itemizedlist>
<listitem><para>
Homepage: http://leb.net/ddlinux/
</para></listitem>
<listitem><para>
Archives: http://leb.net/pipermail/ddlinux/
</para></listitem>
</itemizedlist>
</para></listitem>
</varlistentry>
<varlistentry>
<term>Linux Software Repository for speech applications</term>
<listitem><para>
http://sunsite.uio.no/pub/linux/sound/apps/speech/
</para></listitem>
</varlistentry>

<varlistentry>
<term>Russ Wilcox's List of Speech Recognition Links</term>
<listitem><para>
(excellent) http://www.tiac.net/users/rwilcox/speech.html
</para></listitem>
</varlistentry>

<varlistentry>
<term>Online Bibliography</term>
<listitem><para>
Online Bibliography of Phonetics and Speech Technology Publications.
http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
</para></listitem>
</varlistentry>

<varlistentry>
<term>MIT's Spoken Language Systems Homepage</term>
<listitem><para>
http://www.sls.lcs.mit.edu/sls/
</para></listitem>
</varlistentry>
<varlistentry>
<term>Oregon Graduate Institute</term>
<listitem><para>
Center for Spoken Language Understanding at Oregon Graduate
Institute. An excellent location for developers and researchers.
http://cslu.cse.ogi.edu/
</para></listitem>
</varlistentry>

<varlistentry>
<term>IBM's ViaVoice Linux SDK</term>
<listitem><para>
http://www-4.ibm.com/software/speech/dev/sdk_linux.html
</para></listitem>
</varlistentry>

<varlistentry>
<term>Mississippi State</term>
<listitem><para>
Mississippi State Institute for Signal and Information Processing
homepage with a large amount of useful information for developers.
http://www.isip.msstate.edu/projects/speech/
</para></listitem>
</varlistentry>
<varlistentry>
<term>Speech Technology</term>
<listitem><para>
ASR software and accessories.
http://www.speechtechnology.com
</para></listitem>
</varlistentry>

<varlistentry>
<term>Speech Control</term>
<listitem><para>
Speech Controlled Computer Systems. Microphones, headsets, and
wireless products for ASR.
http://www.speechcontrol.com
</para></listitem>
</varlistentry>

<varlistentry>
<term>Microphones.com</term>
<listitem><para>
Microphones and accessories for ASR.
http://www.microphones.com
</para></listitem>
</varlistentry>
<varlistentry>
<term>21st Century Eloquence</term>
<listitem><para>
"Speech Recognition Specialists."
http://voicerecognition.com
</para></listitem>
</varlistentry>

<varlistentry>
<term>Computing Out Loud</term>
<listitem><para>
Primarily for Windows users, but good info.
http://www.out-loud.com
</para></listitem>
</varlistentry>

<varlistentry>
<term>Say I Can.com</term>
<listitem><para>
"The Speech Recognition Information Source."
http://www.sayican.com
</para></listitem>
</varlistentry>

</variablelist>
</para>

</sect2>

</sect1>

</article>