<HTML
><HEAD
><TITLE
>Introduction</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
"><LINK
REL="HOME"
TITLE="Speech Recognition HOWTO"
HREF="index.html"><LINK
REL="PREVIOUS"
TITLE="Forward"
HREF="forward.html"><LINK
REL="NEXT"
TITLE="Hardware"
HREF="hardware.html"></HEAD
><BODY
CLASS="SECT1"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>Speech Recognition HOWTO</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="forward.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="hardware.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="SECT1"
><H1
CLASS="SECT1"
><A
NAME="INTRODUCTION"
></A
>3. Introduction</H1
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="BASICS"
></A
>3.1. Speech Recognition Basics</H2
><P
>
Speech recognition, also called automatic speech recognition (ASR), is
the process by which a computer (or other type of machine) identifies
spoken words. Basically, it means talking to your computer AND having it
correctly recognize what you are saying.</P
><P
>The following definitions are the basics needed for understanding
speech recognition technology.</P
><P
> <P
></P
><DIV
CLASS="VARIABLELIST"
><DL
><DT
>Utterance</DT
><DD
><P
> An utterance is the vocalization (speaking) of a word or words that
represent a single meaning to the computer. Utterances can be a
single word, a few words, a sentence, or even multiple sentences.
</P
></DD
><DT
>Speaker Dependence</DT
><DD
><P
> Speaker-dependent systems are designed around a specific speaker.
They are generally more accurate for the correct speaker, but much
less accurate for other speakers. They assume the speaker will
speak in a consistent voice and tempo. Speaker-independent systems
are designed for a variety of speakers. Adaptive systems usually start
as speaker-independent systems and use training techniques to
adapt to the speaker and increase their recognition accuracy.
</P
></DD
><DT
>Vocabularies</DT
><DD
><P
> Vocabularies (or dictionaries) are lists of words or utterances that
can be recognized by the speech recognition system. Generally, smaller
vocabularies are easier for a computer to recognize, while larger
vocabularies are more difficult. Unlike normal dictionaries, each entry
doesn't have to be a single word. Entries can be as long as a sentence
or two. Smaller vocabularies can have as few as 1 or 2 recognized
utterances (e.g. "Wake Up"), while very large vocabularies can have a
hundred thousand or more!
</P
></DD
><DT
>Accuracy</DT
><DD
><P
> The ability of a recognizer can be examined by measuring its
accuracy, or how well it recognizes utterances. This includes not
only correctly identifying an utterance but also identifying if the
spoken utterance is not in its vocabulary. Good ASR systems have an
accuracy of 98% or more! The acceptable accuracy of a system
really depends on the application.
</P
></DD
><DT
>Training</DT
><DD
><P
> Some speech recognizers have the ability to adapt to a speaker.
When the system has this ability, it may allow training to take
place. An ASR system is trained by having the speaker repeat
standard or common phrases and adjusting its comparison algorithms
to match that particular speaker. Training a recognizer usually
improves its accuracy.
</P
><P
> Training can also be used by speakers who have difficulty
speaking or pronouncing certain words. As long as the speaker
can consistently repeat an utterance, ASR systems with training
should be able to adapt.
</P
></DD
></DL
></DIV
></P
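><P
>The accuracy measure described above can be sketched in a few lines of
Python. This is only an illustration, not the scoring method of any
particular recognizer: the function name recognition_accuracy, the
OUT_OF_VOCABULARY marker, and the sample utterances are all assumptions
made for this example. An utterance counts as correct when the recognizer
returns the expected vocabulary entry, or when it correctly reports that
the utterance is not in its vocabulary.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
# Hypothetical marker meaning "this utterance is not in the vocabulary".
OUT_OF_VOCABULARY = None


def recognition_accuracy(expected, recognized):
    """Return the fraction of utterances the recognizer handled correctly.

    A result is correct when it equals the expected vocabulary entry, or
    when both are OUT_OF_VOCABULARY (a correct rejection).
    """
    if len(expected) != len(recognized):
        raise ValueError("expected and recognized must have the same length")
    if not expected:
        raise ValueError("at least one utterance is required")
    correct = sum(1 for want, got in zip(expected, recognized) if want == got)
    return correct / len(expected)


# Made-up test run: the third utterance was outside the vocabulary, but the
# recognizer wrongly matched it to "wake up".
expected = ["wake up", "open netscape", OUT_OF_VOCABULARY, "call home"]
recognized = ["wake up", "open netscape", "wake up", "call home"]
accuracy = recognition_accuracy(expected, recognized)
print(f"accuracy: {accuracy:.0%}")
```
</PRE
><P
>In this made-up run one of the four results is wrong, so the script
reports an accuracy of 75%. Whether such a figure is acceptable depends on
the application, as noted in the Accuracy definition.</P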
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="TYPES"
></A
>3.2. Types of Speech Recognition</H2
><P
>
Speech recognition systems can be separated into several different
classes by describing what types of utterances they have the ability
to recognize. These classes are based on the fact that one of the
difficulties of ASR is determining when a speaker starts
and finishes an utterance. Most packages can fit into more than one
class, depending on which mode they're using.</P
><P
> <P
></P
><DIV
CLASS="VARIABLELIST"
><DL
><DT
>Isolated Words</DT
><DD
><P
> Isolated word recognizers usually require each utterance to have
quiet (lack of an audio signal) on BOTH sides of the sample window.
This doesn't mean that they accept only single words, but they do
require a single utterance at a time. Often, these systems have
"Listen/Not-Listen" states, where they require the speaker to wait
between utterances (usually doing processing during the pauses).
Isolated Utterance might be a better name for this class.
</P
></DD
><DT
>Connected Words</DT
><DD
><P
> Connected word systems (or more correctly 'connected utterances')
are similar to isolated words, but allow separate utterances to be
'run-together' with a minimal pause between them.
</P
></DD
><DT
>Continuous Speech</DT
><DD
><P
> Continuous recognition is the next step. Recognizers with continuous
speech capabilities are some of the most difficult to create because
they must utilize special methods to determine utterance boundaries.
Continuous speech recognizers allow users to speak almost naturally,
while the computer determines the content. Basically, it's computer
dictation.
</P
></DD
><DT
>Spontaneous Speech</DT
><DD
><P
> There appears to be a variety of definitions for what spontaneous
speech actually is. At a basic level, it can be thought of as
speech that is natural sounding and not rehearsed. An ASR system
with spontaneous speech ability should be able to handle a variety
of natural speech features such as words being run together, "ums"
and "ahs", and even slight stutters.
</P
></DD
><DT
>Voice Verification/Identification</DT
><DD
><P
> Some ASR systems have the ability to identify specific users. This
document doesn't cover verification or security systems.
</P
></DD
></DL
></DIV
></P
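><P
>The "quiet on both sides" requirement for isolated utterances can be
illustrated with a simple energy threshold. The following is a minimal
sketch, assuming the audio has already been reduced to one energy value
per frame. The function name split_utterances, the threshold of 0.1, and
the min_pause_frames parameter are hypothetical choices for this example;
real recognizers use more robust methods to find utterance boundaries.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
def split_utterances(frame_energies, silence_threshold=0.1, min_pause_frames=3):
    """Group frames into utterances separated by runs of quiet frames.

    Returns a list of (start, end) frame index pairs, with end exclusive.
    A pause shorter than min_pause_frames does not end an utterance.
    """
    utterances = []
    start = None      # index of the first loud frame of the current utterance
    last_loud = None  # index of the most recent loud frame
    for index, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            if start is None:
                start = index
            last_loud = index
        elif start is not None and index - last_loud >= min_pause_frames:
            # Enough quiet frames have passed: the utterance has ended.
            utterances.append((start, last_loud + 1))
            start = None
    if start is not None:
        # The input ended while an utterance was still open.
        utterances.append((start, last_loud + 1))
    return utterances


# Made-up frame energies: two utterances separated by three quiet frames.
energies = [0.0, 0.0, 0.8, 0.9, 0.05, 0.7, 0.0, 0.0, 0.0, 0.6, 0.5, 0.0]
print(split_utterances(energies))
```
</PRE
><P
>With the sample data this prints [(2, 6), (9, 11)]: the single quiet
frame at index 4 is too short to end the first utterance. Lowering
min_pause_frames makes the splitter accept shorter pauses, which is closer
in spirit to the connected-word class.</P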
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="USES"
></A
>3.3. Uses and Applications</H2
><P
>
Although any task that involves interfacing with a computer can
potentially use ASR, the following applications are the most
common right now.</P
><P
> <P
></P
><DIV
CLASS="VARIABLELIST"
><DL
><DT
>Dictation</DT
><DD
><P
> Dictation is the most common use for ASR systems today. This
includes medical transcriptions, legal and business dictation, as
well as general word processing. In some cases special vocabularies
are used to increase the accuracy of the system.
</P
></DD
><DT
>Command and Control</DT
><DD
><P
> ASR systems that are designed to perform functions and actions on the
system are defined as Command and Control systems. Utterances like
"Open Netscape" and "Start a new xterm" will do just that.
</P
></DD
><DT
>Telephony</DT
><DD
><P
> Some PBX/Voice Mail systems allow callers to speak commands instead of
pressing buttons to send specific tones.
</P
></DD
><DT
>Wearables</DT
><DD
><P
> Because inputs are limited for wearable devices, speaking is a
natural possibility.
</P
></DD
><DT
>Medical/Disabilities</DT
><DD
><P
> Many people have difficulty typing due to physical limitations such
as repetitive strain injuries (RSI), muscular dystrophy, and
many others. ASR can help with other disabilities as well: for
example, people with difficulty hearing could use a system connected
to their telephone to convert the caller's speech to text.
</P
></DD
><DT
>Embedded Applications</DT
><DD
><P
> Some newer cellular phones include Command and Control (C&amp;C)
speech recognition that allows utterances such as "Call Home". This
could be a major factor in the future of ASR and Linux. Why can't I
talk to my television yet?
</P
></DD
></DL
></DIV
></P
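><P
>A Command and Control front end can be thought of as a small vocabulary
in which each recognized utterance maps to an action. The sketch below
assumes the recognizer has already turned speech into text. The COMMANDS
table and the command_for function are hypothetical names invented for
this example, and the program names are taken from the utterances quoted
in the Command and Control entry.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
# Hypothetical vocabulary: each recognized utterance maps to a command line.
COMMANDS = {
    "open netscape": ["netscape"],
    "start a new xterm": ["xterm"],
}


def command_for(utterance):
    """Return the command for a recognized utterance, or None if unknown."""
    # Normalize case and spacing so "Open  Netscape" matches "open netscape".
    normalized = " ".join(utterance.lower().split())
    return COMMANDS.get(normalized)


print(command_for("Open Netscape"))
print(command_for("Make me a sandwich"))
```
</PRE
><P
>The first call prints ['netscape'] and the second prints None, because
that utterance is not in the vocabulary. A real system might hand the
returned list to subprocess.Popen to launch the program; the sketch stops
short of that so it has no side effects.</P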
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="forward.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="hardware.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Forward</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
>&nbsp;</TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Hardware</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>