over lower levels and improving the quality of project development. Patterns are fundamental reuse components that capture common characteristics shared by elements of a domain; they can be incorporated into models or defined structures that represent that knowledge more effectively.
B. Natural Language Processing
The need for Natural Language Processing techniques arises in human-machine interaction in applications such as text mining, information extraction, language recognition, language translation, and text generation, all of which require lexical, syntactic, and semantic analysis before a computer can process the text (Cowie et al., 2000). Natural language processing consists of several stages that build on the analysis and classification techniques supported by current computer systems (Dale, 2000).
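As a minimal illustration of how these stages can compose, consider the following Python sketch; each stage is a trivial stand-in for the corresponding analysis, not the implementation used in this project:

# Hypothetical sketch of the staged pipeline described below; each
# lambda is a toy placeholder for the corresponding analysis stage.
stages = [
    lambda text: text.split(),                  # tokenization
    lambda toks: [t.lower() for t in toks],     # lexical normalization
    lambda toks: toks,                          # syntactic analysis (pass-through)
    lambda toks: [(t, "UNK") for t in toks],    # grammatical tagging
]

def run_pipeline(text):
    data = text
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline("Patterns identify common characteristics"))
# [('patterns', 'UNK'), ('identify', 'UNK'), ...]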
1) Tokenization: Tokenization is a preliminary step in natural language processing whose objective is to demarcate words as sequences of characters, grouped by their dependencies, using separators such as spaces and punctuation (Moreno, 2009). The resulting tokens are standardized to simplify their analysis and to reduce ambiguities arising from vocabulary and verb tenses.
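For illustration, a simple tokenizer along these lines can be sketched in Python; the regular expression and the lowercasing step are assumptions for the example, not the tokenizer used in this project:

import re

# Demarcate tokens using spaces and punctuation as separators, then
# lowercase them as a basic standardization step (toy example only).
def tokenize(text):
    return [t.lower() for t in re.findall(r"\w+", text)]

print(tokenize("The patterns identify common characteristics."))
# ['the', 'patterns', 'identify', 'common', 'characteristics']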
2) Lexical Analysis: Lexical analysis aims to obtain standard tags for each word or token through a study that identifies vocabulary inflections, such as gender, number, and verbal irregularities of the candidate words. An efficient way to perform this analysis is with a finite automaton that draws on a repository of terms, relationships, and equivalences between terms to convert each token into a standard form (Hopcroft et al., 1979). Additional approaches use decision trees and database unification for lexical analysis, but these are not covered in this project's implementation (Trivino et al., 2000).
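A minimal sketch of this dictionary-backed conversion is shown below; the repository of equivalences is a toy example, and a real system would encode the full lexicon as a finite automaton:

# Toy repository of equivalences between inflected forms and standard
# terms; illustrative only, not the project's term repository.
EQUIVALENCES = {
    "identifies": "identify",
    "identified": "identify",
    "patterns": "pattern",
}

def to_standard_form(token):
    # Convert a token to its standard form when an equivalence exists.
    return EQUIVALENCES.get(token, token)

print([to_standard_form(t) for t in ["patterns", "identified", "domain"]])
# ['pattern', 'identify', 'domain']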
3) Syntactic Analysis: The goal of syntactic analysis is to make explicit the syntactic relations in a text in order to support a subsequent semantic interpretation (Martí et al., 2002), using the relationships between terms in their proper context for adequate normalization and standardization of terms. To combine lexical and syntactic analysis, this project used deductive term-standardization techniques that convert texts, within a context defined by sentences, through a special function or a finite automaton, as sketched below.
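The sentence-scoped standardization can be sketched as follows; the conversion function here is a placeholder for the project's special function or automaton:

# Standardize terms sentence by sentence, so the conversion can use
# the context defined by each sentence (illustrative sketch only).
def standardize(text, convert):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [[convert(tok) for tok in s.split()] for s in sentences]

print(standardize("Patterns identify elements. Models represent knowledge.", str.lower))
# [['patterns', 'identify', 'elements'], ['models', 'represent', 'knowledge']]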
4) Grammatical Tagging: Tagging is the process of assigning grammatical categories to the terms of a text or corpus. Tags are defined in a dictionary of standard terms linked to grammatical categories (nouns, verbs, adverbs, etc.), so it is important to normalize the terms before tagging to avoid non-standard terms. The most common issues in this process are poor system performance (driven by large corpus sizes), terms unknown to the dictionary, and word ambiguity (same syntax but different meanings) (Weischedel et al., 2006). Grammatical tagging is a key factor in the identification and generation of semantic index patterns, where the patterns consist of grammatical categories rather than the terms themselves. The accuracy of this technique across texts depends on the completeness and richness of the dictionary of grammatical tags.
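A dictionary-based tagger of this kind can be sketched as follows; the dictionary fragment is hypothetical, and the "UNK" tag illustrates the unknown-term issue mentioned above:

# Toy dictionary of standard terms linked to grammatical categories.
TAG_DICTIONARY = {
    "pattern": "NOUN",
    "identify": "VERB",
    "common": "ADJ",
}

def tag(tokens):
    # Terms missing from the dictionary are tagged "UNK", illustrating
    # the dictionary-coverage problem; ambiguity handling is omitted.
    return [(t, TAG_DICTIONARY.get(t, "UNK")) for t in tokens]

print(tag(["pattern", "identify", "domain"]))
# [('pattern', 'NOUN'), ('identify', 'VERB'), ('domain', 'UNK')]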
5) Semantic and Pragmatic Analysis: Semantic analysis aims to interpret the meaning of expressions, building on the results of the lexical and syntactic analyses. This analysis considers not only the semantics of the analyzed term but also the semantics of the contiguous terms within the same context. For this project, the automatic generation of index patterns at this stage does not include pragmatic analysis.
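The use of contiguous terms can be illustrated with a toy word-sense lookup; the sense inventory and the window size are assumptions made only for this example, not the project's method:

# Choose a sense for tokens[i] by inspecting contiguous terms within
# a small context window (illustrative sketch only).
SENSES = {
    ("bank", "river"): "riverbank",
    ("bank", "money"): "financial_institution",
}

def disambiguate(tokens, i, window=2):
    neighbors = tokens[max(0, i - window): i + window + 1]
    for n in neighbors:
        if (tokens[i], n) in SENSES:
            return SENSES[(tokens[i], n)]
    return tokens[i]

print(disambiguate(["deposit", "money", "bank"], 2))
# 'financial_institution'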
C. RSHP Model
RSHP is an information representation model based on relationships that handles all types of artifacts (models, texts, code, databases, etc.) using the same scheme. The model is used to store and link generated pattern lists so that they can subsequently be analyzed with specialized knowledge representation tools (Llorens et al., 2004). Within the Knowledge Reuse Group at the University Carlos III of Madrid, the RSHP model is used in projects related to natural language processing (Gomez-Perez et al., 2004; Thomason, 2012; Amsler, 1981; Suarez et al., 2013). The information model is presented in Figure 1. An analysis of sentences and basic patterns is shown in Figure 2.
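A minimal data-structure sketch of such a relationship-centered scheme is given below; the class and field names are illustrative assumptions, not the published RSHP schema:

from dataclasses import dataclass

# Artifacts of any type (model, text, code, ...) linked by typed
# relationships under one common scheme (illustrative sketch only).
@dataclass
class Artifact:
    identifier: str
    kind: str      # e.g. "text", "model", "code"
    content: str

@dataclass
class Relationship:
    source: Artifact
    target: Artifact
    rel_type: str  # e.g. "describes", "refines"

doc = Artifact("a1", "text", "requirements document")
model = Artifact("a2", "model", "class diagram")
link = Relationship(doc, model, "describes")
print(link.rel_type)
# describes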