2 BASIC THEORY
2.1 Natural Language
Natural language is an interactive communication
mechanism, but a computer cannot understand the
instructions that humans express in everyday language.
A natural language interface can accept both written
and spoken input (Dix, 2009). However, such interfaces
still suffer from deficiencies in terms of syntactic and
semantic ambiguity. A spoken dialogue system consists
of several components that it needs in order to function
successfully, including a speech recognition system and
a text-to-speech system (McTear, 2004).
2.2 Speech Recognition
Every speech recognition system essentially runs a
process that recognizes human speech and converts it
into something a computer can understand
(Amundsen, 1996). Research into effective speech
recognition algorithms and processing models has been
going on almost since computers were first created.
Listening to and understanding human speech involves
four key operations.
1. Word separation is the process of breaking human
speech into discrete segments. Each segment can be
as large as a phrase or as small as a single syllable
or part of a word.
2. Vocabulary is the list of sound items that the
speech recognition engine can identify.
3. Word matching is the method the speech recognition
system uses to look up detected speech segments in
its vocabulary.
4. Speaker dependence is the extent to which the
speech recognition engine depends on an individual's
vocal tone and speech patterns.
The last element of speech recognition is the grammar
rules. Grammar rules are used by speech recognition
software to analyze human speech input when trying
to understand what someone is saying. There are many
kinds of grammars, each consisting of a set of
pronunciation rules. One of them is the context-free
grammar (CFG). The main elements of a CFG are the
following (a minimal example is sketched after this list):
1. Words, the list of valid things to say;
2. Rules, the utterance structures in which the words
are used; and
3. Lists, word lists for use in the rules.
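The sketch below illustrates how these three elements might appear in a SAPI 5 text-format grammar; the rule name, word choices, and file name are illustrative assumptions rather than material from this research. In practice such a file would be loaded by the recognizer as a command-and-control grammar (see Section 2.3).

// A minimal sketch of a SAPI 5 text-format grammar showing the three CFG
// elements: words (<P> phrases), rules (<RULE>), and lists (<L>).
// The rule name, word choices, and file name are hypothetical examples.
#include <fstream>

static const char kColorGrammar[] =
R"xml(<GRAMMAR LANGID="409">
  <!-- Rule: the utterance structure in which the words may be used. -->
  <RULE NAME="PickColor" TOPLEVEL="ACTIVE">
    <P>please pick</P>       <!-- Words: valid things to say -->
    <L>                      <!-- List: interchangeable word choices -->
      <P>red</P>
      <P>green</P>
      <P>blue</P>
    </L>
  </RULE>
</GRAMMAR>
)xml";

int main() {
    // Write the grammar to a file so a recognizer could later load it,
    // for example with ISpRecoGrammar::LoadCmdFromFile.
    std::ofstream out("colors.xml");
    out << kColorGrammar;
    return 0;
}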
2.3 Speech Application Programming
Interface (SAPI)
The two basic types of SAPI engines are text-to-speech
and speech recognition. A TTS system synthesizes text
strings and files into audible speech using synthetic
voices. Speech recognition, on the other hand, converts
human speech audio into readable text strings. An
application controls text-to-speech through the
ISpVoice Component Object Model (COM) interface.
Once the application has created an ISpVoice object, it
calls ISpVoice::Speak to generate speech output from
some text data, as in the sketch below.
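The following minimal sketch follows these two steps, assuming a Windows build linked against sapi.lib; the spoken sentence is an illustrative placeholder.

// Minimal SAPI text-to-speech sketch: create the ISpVoice COM object,
// then call ISpVoice::Speak on a text string.
#include <windows.h>
#include <sapi.h>

int main() {
    if (FAILED(::CoInitialize(nullptr)))   // COM must be initialized first
        return 1;

    ISpVoice* voice = nullptr;
    HRESULT hr = ::CoCreateInstance(CLSID_SpVoice, nullptr, CLSCTX_ALL,
                                    IID_ISpVoice, (void**)&voice);
    if (SUCCEEDED(hr)) {
        // Synthesize the string synchronously with the default voice.
        voice->Speak(L"Hello, this is a pronunciation example.",
                     SPF_DEFAULT, nullptr);
        voice->Release();
    }
    ::CoUninitialize();
    return SUCCEEDED(hr) ? 0 : 1;
}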
An application can choose between two types of
speech recognition engines (ISpRecognizer). A shared
recognizer, which may be shared with other speech
recognition applications, is recommended for most
speech recognition applications. To create an
ISpRecoContext for a shared ISpRecognizer, the
application only needs to make a CoCreateInstance
COM call on the CLSID_SpSharedRecoContext
component. For large server applications that run
alone on a system, where performance is the key, the
InProc speech recognition engine is more appropriate.
To create an ISpRecoContext for an InProc
ISpRecognizer, the application must first call
CoCreateInstance on the
CLSID_SpInprocRecoInstance component to create its
own InProc ISpRecognizer. It can then call
ISpRecognizer::CreateRecoContext to get an
ISpRecoContext. Both paths are sketched below.
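The sketch below shows both options, again assuming a Windows build linked against sapi.lib and with error handling trimmed. The in-process engine is created here with the sapi.h identifier CLSID_SpInprocRecognizer, which is assumed to correspond to the CLSID_SpInprocRecoInstance component named in the text.

// Creating a recognition context with either a shared or an InProc engine.
#include <windows.h>
#include <sapi.h>

// Shared recognizer: create the shared recognition context directly.
HRESULT CreateSharedContext(ISpRecoContext** ctx) {
    return ::CoCreateInstance(CLSID_SpSharedRecoContext, nullptr, CLSCTX_ALL,
                              IID_ISpRecoContext, (void**)ctx);
}

// InProc recognizer: create a private engine instance first, then ask it
// for its own recognition context.
HRESULT CreateInProcContext(ISpRecognizer** reco, ISpRecoContext** ctx) {
    HRESULT hr = ::CoCreateInstance(CLSID_SpInprocRecognizer, nullptr,
                                    CLSCTX_ALL, IID_ISpRecognizer,
                                    (void**)reco);
    if (SUCCEEDED(hr))
        hr = (*reco)->CreateRecoContext(ctx);
    return hr;
}

int main() {
    ::CoInitialize(nullptr);
    ISpRecoContext* sharedCtx = nullptr;
    if (SUCCEEDED(CreateSharedContext(&sharedCtx)))
        sharedCtx->Release();
    ::CoUninitialize();
    return 0;
}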
Finally, a speech recognition application must
create, load, and activate an ISpRecoGrammar, which
essentially indicates the type of utterance to be
recognized, namely dictation or command and
control. First, the application creates the
ISpRecoGrammar using ISpRecoContext::
CreateGrammar. Then, it loads the appropriate
grammar, either by calling
ISpRecoGrammar::LoadDictation for dictation or one
of the ISpRecoGrammar::LoadCmdxxx methods for
command and control. Finally, the grammar rules are
activated so that the speech recognition engine starts
delivering recognition data to the application for
dictation or command and control (Syarif et al.,
2011). A sketch of these steps follows.
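The sketch below walks through these three steps for a recognition context obtained as in the previous sketch; the grammar ID and the command-and-control file name are illustrative assumptions.

// Create, load, and activate an ISpRecoGrammar on an existing context.
#include <windows.h>
#include <sapi.h>

HRESULT ActivateGrammars(ISpRecoContext* ctx) {
    ISpRecoGrammar* grammar = nullptr;

    // 1. Create the grammar object from the recognition context.
    HRESULT hr = ctx->CreateGrammar(1 /* arbitrary grammar id */, &grammar);
    if (FAILED(hr)) return hr;

    // 2a. Dictation: load the general dictation topic, then activate it.
    hr = grammar->LoadDictation(nullptr, SPLO_STATIC);
    if (SUCCEEDED(hr))
        hr = grammar->SetDictationState(SPRS_ACTIVE);

    // 2b. Command and control (alternative): load a CFG file instead,
    //     then activate its top-level rules.
    // hr = grammar->LoadCmdFromFile(L"colors.xml", SPLO_STATIC);
    // hr = grammar->SetRuleState(nullptr, nullptr, SPRS_ACTIVE);

    grammar->Release();
    return hr;
}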
3 RESEARCH CONCEPT
3.1 Thinking Framework
The thinking framework used in this research is shown
in Figure 1.
3.2 Flow Chart and Problem Solution
The flow chart and problem solution for the Speech
Application Programming Interface for English
Pronunciation Learning appear in Figure 2, starting