
<HTML
><HEAD
><TITLE
>Introduction</TITLE
><META
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
"><LINK
REL="HOME"
TITLE="Speech Recognition HOWTO"
HREF="index.html"><LINK
REL="PREVIOUS"
TITLE="Forward"
HREF="forward.html"><LINK
REL="NEXT"
TITLE="Hardware"
HREF="hardware.html"></HEAD
><BODY
CLASS="SECT1"
BGCOLOR="#FFFFFF"
TEXT="#000000"
LINK="#0000FF"
VLINK="#840084"
ALINK="#0000FF"
><DIV
CLASS="NAVHEADER"
><TABLE
SUMMARY="Header navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TH
COLSPAN="3"
ALIGN="center"
>Speech Recognition HOWTO</TH
></TR
><TR
><TD
WIDTH="10%"
ALIGN="left"
VALIGN="bottom"
><A
HREF="forward.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="80%"
ALIGN="center"
VALIGN="bottom"
></TD
><TD
WIDTH="10%"
ALIGN="right"
VALIGN="bottom"
><A
HREF="hardware.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
></TABLE
><HR
ALIGN="LEFT"
WIDTH="100%"></DIV
><DIV
CLASS="SECT1"
><H1
CLASS="SECT1"
><A
NAME="INTRODUCTION">3. Introduction</H1
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="BASICS">3.1. Speech Recognition Basics</H2
><P
>
Speech recognition, often called automatic speech recognition (ASR),
is the process by which a computer (or other type of machine)
identifies spoken words.  Basically, it means talking to your
computer, AND having it correctly recognize what you are saying.</P
><P
>The following definitions are the basics needed for understanding
speech recognition technology.</P
><P
>    <P
></P
><DIV
CLASS="VARIABLELIST"
><DL
><DT
>Utterance</DT
><DD
><P
>  An utterance is the vocalization (speaking) of a word or words that
  represent a single meaning to the computer.  Utterances can be a
  single word, a few words, a sentence, or even multiple sentences.
            </P
></DD
><DT
>Speaker Dependence</DT
><DD
><P
>  Speaker dependent systems are designed around a specific speaker.
  They generally are more accurate for the correct speaker, but much
  less accurate for other speakers.  They assume the speaker will
  speak in a consistent voice and tempo.  Speaker independent systems
  are designed for a variety of speakers.  Adaptive systems usually start
  as speaker independent systems and utilize training techniques to
  adapt to the speaker to increase their recognition accuracy.
            </P
></DD
><DT
>Vocabularies</DT
><DD
><P
>  Vocabularies (or dictionaries) are lists of words or utterances that
  can be recognized by the ASR system.  Generally, smaller vocabularies
  are easier for a computer to recognize, while larger vocabularies
  are more difficult.  Unlike normal dictionaries, each entry doesn't
  have to be a single word.  An entry can be as long as a sentence or
  two.  Smaller vocabularies can have as few as 1 or 2 recognized
  utterances (e.g. "Wake Up"), while very large vocabularies can have a
  hundred thousand or more!
            </P
></DD
><DT
>Accuracy</DT
><DD
><P
>  The ability of a recognizer can be examined by measuring its
  accuracy, or how well it recognizes utterances.  This includes not
  only correctly identifying an utterance but also identifying when the
  spoken utterance is not in its vocabulary.  Good ASR systems have an
  accuracy of 98% or more!  The acceptable accuracy of a system
  really depends on the application.
            </P
></DD
><DT
>Training</DT
><DD
><P
>  Some speech recognizers have the ability to adapt to a speaker.
  When the system has this ability, it may allow training to take
  place.  An ASR system is trained by having the speaker repeat
  standard or common phrases and adjusting its comparison algorithms
  to match that particular speaker.  Training a recognizer usually
  improves its accuracy.
            </P
><P
>  Training can also be used by speakers that have difficulty
  speaking, or pronouncing certain words.  As long as the speaker
  can consistently repeat an utterance, ASR systems with training
  should be able to adapt.
            </P
></DD
></DL
></DIV
></P
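><P
>The accuracy measure described above can be sketched in a few lines of
Python.  This is only an illustration, not the method of any particular
package: the vocabulary, the test log, and the function names
expected_result and accuracy are all hypothetical.  A trial counts as
correct when the recognizer returns the spoken utterance, or when it
rejects an utterance that is not in its vocabulary.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
# Hypothetical vocabulary of utterances the recognizer is supposed to know.
VOCABULARY = {"wake up", "open netscape", "start a new xterm"}


def expected_result(spoken):
    """Return the ideal recognizer output for a spoken utterance.

    None stands for "rejected as not in the vocabulary".
    """
    return spoken if spoken in VOCABULARY else None


def accuracy(trials):
    """Fraction of trials where the recognizer output matched the ideal output."""
    if not trials:
        raise ValueError("need at least one trial")
    correct = sum(1 for spoken, heard in trials
                  if heard == expected_result(spoken))
    return correct / len(trials)


# Hypothetical test log: (what was spoken, what the recognizer returned).
trials = [
    ("wake up", "wake up"),              # correct recognition
    ("open netscape", "open netscape"),  # correct recognition
    ("call home", None),                 # correct rejection, not in vocabulary
    ("start a new xterm", "wake up"),    # misrecognition
]

print(f"accuracy: {accuracy(trials):.0%}")  # accuracy: 75%
```
</PRE
><P
>Counting rejections as well as recognitions matters for small
vocabularies, where most of what a microphone picks up is not meant
for the recognizer at all.</P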
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="TYPES">3.2. Types of Speech Recognition</H2
><P
>
Speech recognition systems can be separated into several different
classes by describing what types of utterances they have the ability
to recognize.  These classes are based on the fact that one of the
difficulties of ASR is determining when a speaker starts and finishes
an utterance. Most packages can fit into more than one class,
depending on which mode they're using.</P
><P
>    <P
></P
><DIV
CLASS="VARIABLELIST"
><DL
><DT
>Isolated Words</DT
><DD
><P
>  Isolated word recognizers usually require each utterance to have
  quiet (lack of an audio signal) on BOTH sides of the sample window.
  This doesn't mean that they accept only single words, but they do
  require a single utterance at a time.  Often, these systems have
  "Listen/Not-Listen" states, where they require the speaker to wait
  between utterances (usually doing processing during the pauses).
  Isolated Utterance might be a better name for this class.
            </P
></DD
><DT
>Connected Words</DT
><DD
><P
>  Connected word systems (or more correctly 'connected utterances')
  are similar to isolated word systems, but allow separate utterances
  to be 'run-together' with a minimal pause between them.
              </P
></DD
><DT
>Continuous Speech</DT
><DD
><P
>  Continuous recognition is the next step.  Recognizers with continuous
  speech capabilities are some of the most difficult to create because
  they must utilize special methods to determine utterance boundaries.
  Continuous speech recognizers allow users to speak almost naturally,
  while the computer determines the content.  Basically, it's computer
  dictation.
            </P
></DD
><DT
>Spontaneous Speech</DT
><DD
><P
>  There appears to be a variety of definitions for what spontaneous
  speech actually is.   At a basic level, it can be thought of as
  speech that is natural sounding and not rehearsed.  An ASR system
  with spontaneous speech ability should be able to handle a variety
  of natural speech features such as words being run together, "ums"
  and "ahs", and even slight stutters.
            </P
></DD
><DT
>Voice Verification/Identification</DT
><DD
><P
>  Some ASR systems have the ability to identify specific users.  This
  document doesn't cover verification or security systems.
            </P
></DD
></DL
></DIV
></P
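><P
>To make the idea of utterance boundaries concrete, here is a minimal
Python sketch of the "quiet on both sides" rule used by isolated word
recognizers.  It is an illustration under stated assumptions, not how
any particular package works: the input is a hypothetical list of
per-frame energy values, and the threshold and min_quiet parameters are
invented for the example.  Real recognizers use more robust methods.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
def find_utterances(energies, threshold=0.1, min_quiet=3):
    """Return (start, end) frame index pairs, end exclusive, per utterance.

    energies  -- per-frame energy values (hypothetical, already normalized)
    threshold -- energy above this counts as speech (assumed value)
    min_quiet -- quiet frames required to end an utterance (assumed value)
    """
    utterances = []
    start = None    # index of the first speech frame of the current utterance
    quiet_run = 0   # consecutive quiet frames since the last speech frame
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_quiet:
                # Enough quiet: close the utterance at its last speech frame.
                utterances.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # The signal ended before enough trailing quiet was seen.
        utterances.append((start, len(energies) - quiet_run))
    return utterances


frames = [0, 0, 0.5, 0.6, 0, 0, 0, 0.7, 0.8, 0.9, 0, 0, 0]
print(find_utterances(frames))  # [(2, 4), (7, 10)]
```
</PRE
><P
>In this sketch, lowering min_quiet splits utterances at shorter
pauses, which is roughly the difference between the isolated and
connected classes.  Continuous speech has no reliable pauses to key
on, which is why those recognizers are harder to build.</P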
></DIV
><DIV
CLASS="SECT2"
><H2
CLASS="SECT2"
><A
NAME="USES">3.3. Uses and Applications</H2
><P
>
Although any task that involves interfacing with a computer can
potentially use ASR, the following applications are the most
common right now.</P
><P
>    <P
></P
><DIV
CLASS="VARIABLELIST"
><DL
><DT
>Dictation</DT
><DD
><P
>  Dictation is the most common use for ASR systems today.  This
  includes medical transcriptions, legal and business dictation, as
  well as general word processing.  In some cases special vocabularies
  are used to increase the accuracy of the system.
            </P
></DD
><DT
>Command and Control</DT
><DD
><P
>  ASR systems that are designed to perform functions and actions on the
  system are defined as Command and Control systems.  Utterances like
  "Open Netscape" and "Start a new xterm" will do just that.
            </P
></DD
><DT
>Telephony</DT
><DD
><P
>  Some PBX/Voice Mail systems allow callers to speak commands instead of
  pressing buttons to send specific tones.
            </P
></DD
><DT
>Wearables</DT
><DD
><P
>  Because inputs are limited for wearable devices, speaking is a
  natural possibility.
            </P
></DD
><DT
>Medical/Disabilities</DT
><DD
><P
>  Many people have difficulty typing due to physical limitations such
  as repetitive strain injuries (RSI), muscular dystrophy, and
  many others, and can use speech as an alternative way to control a
  computer.  ASR can help in other ways too.  For example, people with
  difficulty hearing could use a system connected to their telephone
  to convert the caller's speech to text.
            </P
></DD
><DT
>Embedded Applications</DT
><DD
><P
>  Some newer cellular phones include C&#38;C speech recognition that
  allows utterances such as "Call Home".  This could be a major factor
  in the future of ASR and Linux.   Why can't I talk to my television yet?
            </P
></DD
></DL
></DIV
></P
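><P
>As a rough illustration of Command and Control, the Python sketch
below maps recognized utterances to command lines.  Everything in it
is an assumption made for the example: the COMMANDS table, the
normalize and command_for helpers, and the idea that the recognizer
hands over its result as a plain text string.</P
><PRE
CLASS="PROGRAMLISTING"
>
```python
# Hypothetical mapping from recognized utterances to command lines.
COMMANDS = {
    "open netscape": ["netscape"],
    "start a new xterm": ["xterm"],
}


def normalize(utterance):
    """Lower-case the text and collapse whitespace so lookups are forgiving."""
    return " ".join(utterance.lower().split())


def command_for(utterance):
    """Return the argument list for a recognized utterance, or None if unknown."""
    return COMMANDS.get(normalize(utterance))


print(command_for("Open Netscape"))       # ['netscape']
print(command_for("Make me a sandwich"))  # None
```
</PRE
><P
>The sketch only looks commands up and does not run them.  A real
system would hand the returned argument list to something like
Python's subprocess.Popen, and would decide what to do when the
lookup returns None.</P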
></DIV
></DIV
><DIV
CLASS="NAVFOOTER"
><HR
ALIGN="LEFT"
WIDTH="100%"><TABLE
SUMMARY="Footer navigation table"
WIDTH="100%"
BORDER="0"
CELLPADDING="0"
CELLSPACING="0"
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
><A
HREF="forward.html"
ACCESSKEY="P"
>Prev</A
></TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
><A
HREF="index.html"
ACCESSKEY="H"
>Home</A
></TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
><A
HREF="hardware.html"
ACCESSKEY="N"
>Next</A
></TD
></TR
><TR
><TD
WIDTH="33%"
ALIGN="left"
VALIGN="top"
>Forward</TD
><TD
WIDTH="34%"
ALIGN="center"
VALIGN="top"
>&nbsp;</TD
><TD
WIDTH="33%"
ALIGN="right"
VALIGN="top"
>Hardware</TD
></TR
></TABLE
></DIV
></BODY
></HTML
>