the project development, but it faces difficulties in the standardization of components and the combination of features. Moreover, the software engineering discipline is constantly changing and evolving, which quickly renders reusable components obsolete (Llorens, 1996).
At the system requirements stage, reuse is implemented through templates that manage knowledge at a higher level of abstraction, which provides advantages over lower levels and improves the quality of project development. Patterns are fundamental reuse components that identify common characteristics among the elements of a domain and can be incorporated into models or defined structures that represent knowledge more effectively.
2.2 Natural Language Processing
The need to apply Natural Language Processing techniques arises in the field of human-machine interaction, in cases such as text mining, information extraction, language recognition, language translation, and text generation; these fields require lexical, syntactic, and semantic analysis so that text can be recognized by a computer (Cowie et al., 2000). Natural language processing consists of several stages that draw on the different analysis and classification techniques supported by current computer systems (Dale, 2000).
1) Tokenization: Tokenization is a preliminary step of natural language processing whose objective is to demarcate words as sequences of characters, grouped by their dependencies, using separators such as spaces and punctuation marks (Moreno, 2009). Tokens are then standardized to simplify their analysis and to reduce ambiguities arising from vocabulary and verb tenses (a minimal sketch appears after this list).
2) Lexical Analysis: Lexical analysis aims to obtain standard tags for each word or token through a study that identifies vocabulary inflections, such as gender, number, and verbal irregularities of the candidate words. An efficient way to perform this analysis is to use a finite automaton that draws on a repository of terms, relationships, and equivalences between terms to convert a token into a standard form (Hopcroft et al., 1979); a simple dictionary-based sketch of this normalization is also given after the list. There are additional approaches that use decision trees and database unification for lexical analysis, but they are not covered in this project's implementation (Trivino et al., 2000).
3) Syntactic Analysis: The goal of syntactic analysis is to make explicit the syntactic relations within texts in order to support a subsequent semantic interpretation (Martí et al., 2002), thus using the relationships between terms in a proper context for an adequate normalization and standardization of terms. To combine lexical and syntactic analysis, this project used deductive term-standardization techniques that convert texts from a context defined by sentences through a special function or finite automaton.
4) Grammatical Tagging: Tagging is the process of assigning grammatical categories to the terms of a text or corpus. Tags are defined in a dictionary of standard terms linked to grammatical categories (nouns, verbs, adverbs, etc.), so it is important to normalize the terms before tagging in order to avoid non-standard terms. The most common issues of this process are poor system performance (due to large corpus sizes), the identification of terms unknown to the dictionary, and word ambiguities (same form but different meanings) (Weischedel et al., 2006). Grammatical tagging is a key factor in the identification and generation of semantic index patterns, in which the patterns consist of categories rather than the terms themselves (see the tagging sketch after this list). The accuracy of this technique across texts depends on the completeness and richness of the dictionary of grammatical tags.
5) Semantic and Pragmatic Analysis: Semantic analysis aims to interpret the meaning of expressions, building on the results of the lexical and syntactic analysis. This analysis considers not only the semantics of the analyzed term but also the semantics of the contiguous terms within the same context. For this project, the automatic generation of index patterns at this stage does not consider pragmatic analysis.
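The following minimal Python sketch illustrates the tokenization and standardization step described in item 1. The regular expression, the lowercasing rule, and the example sentence are illustrative assumptions and not the tooling actually used in this project.

    import re

    def tokenize(text):
        """Demarcate tokens using spaces and punctuation as separators,
        then lowercase them as a simple standardization step."""
        # \w+ keeps alphanumeric sequences; spaces and punctuation act as delimiters
        return [token.lower() for token in re.findall(r"\w+", text)]

    print(tokenize("The system shall store, index and retrieve documents."))
    # ['the', 'system', 'shall', 'store', 'index', 'and', 'retrieve', 'documents']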
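For the lexical analysis of item 2, the sketch below uses a plain dictionary lookup as a stand-in for the finite automaton over a repository of terms and equivalences; the repository entries and the name lexical_normalize are hypothetical.

    # Hypothetical repository of lexical equivalences: inflected or irregular
    # forms are mapped to a single standard (canonical) form.
    TERM_REPOSITORY = {
        "stores": "store",
        "stored": "store",
        "storing": "store",
        "documents": "document",
        "retrieved": "retrieve",
    }

    def lexical_normalize(tokens):
        """Convert each token to its canonical form when an equivalence
        exists in the repository; unknown tokens pass through unchanged."""
        return [TERM_REPOSITORY.get(token, token) for token in tokens]

    print(lexical_normalize(["the", "system", "stores", "documents"]))
    # ['the', 'system', 'store', 'document']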
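Finally, the sketch below illustrates the dictionary-based grammatical tagging of item 4 and how an index pattern keeps only the sequence of categories rather than the terms; the tag set, the dictionary entries, and the UNKNOWN label for out-of-dictionary terms are illustrative assumptions.

    # Hypothetical dictionary linking normalized terms to grammatical categories.
    TAG_DICTIONARY = {
        "the": "DET",
        "system": "NOUN",
        "shall": "MODAL",
        "store": "VERB",
        "document": "NOUN",
    }

    def tag(tokens):
        """Assign a grammatical category to each normalized token; terms
        missing from the dictionary are flagged as UNKNOWN."""
        return [(token, TAG_DICTIONARY.get(token, "UNKNOWN")) for token in tokens]

    tagged = tag(["the", "system", "shall", "store", "the", "document"])
    print(tagged)
    # The index pattern keeps only the category sequence, not the terms:
    print([category for _, category in tagged])
    # ['DET', 'NOUN', 'MODAL', 'VERB', 'DET', 'NOUN']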
2.3 RSHP Model
RSHP is an information representation model based on relationships that handles all types of artifacts (models, texts, code, databases, etc.) using the same scheme. This model is used to store and link the generated pattern lists so that they can subsequently be analyzed with specialized knowledge representation tools (Llorens et al., 2004).
(Llorens et al., 2004). Within the Knowledge Reuse