about climate change in social networks (Xiaoran et
al., 2014), as well as for conservation planning (Pino-
Díaz et al., 2014). In this work we present a
preliminary approach for the assessment and
analysis of the climate change programs developed
in Mexico. We considered the documents that set the
general guidelines as well as the documents derived
from them. We wanted to infer, for each document,
its degree of alignment with the central policy by
comparing the documents by means of text mining
techniques. In other words, the idea is to analyze the
different levels of implementation (national and
state) in order to assess how closely the developed
programs follow the national documents, and
whether clusters corresponding to regions of Mexico
can be found. For this purpose, the numerical
indicator “term frequency-inverse document
frequency” (tf-idf) is calculated and used for the
analysis of a database containing the climate change
policy plans of Mexico. The rest of the document is
structured as follows: Section 2 introduces the
methodology and the construction of the model, and
Section 3 presents the results and some further lines
of research.
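For reference, a standard formulation of this indicator (we assume here the common logarithmic variant; the exact variant used in the computations is the one implemented by the tools described in Section 2) for a term $t$, a document $d$, and a corpus $D$ of $N$ documents is
\[
\text{tf-idf}(t,d,D) = \text{tf}(t,d)\cdot \log\frac{N}{|\{d' \in D : t \in d'\}|},
\]
where $\text{tf}(t,d)$ is the number of occurrences of $t$ in $d$ and the denominator counts the documents in which $t$ appears.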
2 METHODOLOGY AND MODEL
DESIGN
The hierarchical structure of the climate change
programs is the following:
1. The national climate change strategy (NCCS),
which defines the climate change policies at the
national level, is at the top of the hierarchy. This
document was developed by a group of specialists
working in the central government.
2. The special program of climate change (SPCC)
is the legal implementation of the national climate
change strategy. It captures the general lines of action
contained in the NCCS. It is also developed by the
central government and specifies the regulations,
rules, and the specific programs to be followed.
3. The state programs of climate change (StPCC)
are the documents developed for each state (Mexico
is currently divided into 32 federal entities). They
implement, at the local level, the policies and
programs established in the SPCC. The StPCC are
developed by committees of local specialists that
report to the local governments.
There are also municipal programs of climate change,
which implement the state policies at the lowest
organizational level. However, they are not included
in this study because very few of them had been
developed at the time the database was constructed.
A natural enrichment of the database created for this
project would be to include these documents.
At the time the project's database was compiled,
only 14 federal entities had a finished document.
These are: Mexico City, State of Mexico, Nuevo
León, Tabasco, Baja California Norte, Baja
California Sur, Chiapas, Veracruz, Hidalgo, Quintana
Roo, Yucatán, Coahuila, and Guanajuato. A draft
version for the State of Oaxaca was added to these
documents.
The documents were obtained in PDF format and
converted to plain text in UTF-8 encoding. The texts
were processed as follows:
1. A pre-processing stage for removing the
documents' stop words (Rajaraman and Ullman,
2011). In the case of the Spanish language these are,
for example, “el”, “la”, “los”, “a”, etc. (for a quick
reference in Spanish see
http://www.ranks.nl/stopwords/spanish). These
words are filtered out because, given the number of
times they appear, they are considered less
informative terms. Then, the documents are
tokenized, i.e., they are split down to the word level.
Each term is considered an "atom" or minimum
component, since it is given by the format of the
document itself and does not depend on other terms.
These words, or tokens, are considered low-level
data, because they do not explicitly depend on any
other term, while the structures or information
obtained from them are considered high-level
information. This step also includes stemming,
which, roughly speaking, is a technique for reducing
the size of the word space by conflating words that
are derived from others, i.e., reducing inflected (or
derived) words to their stem, base, or root form. In
this way the dimension of the space of words remains
“manageable” (a code sketch of this stage is given
after this list).
2. With the extracted tokens, a vector space
containing the documents is created. There, each
document is represented as a vector with one entry
for each word or token: the entry is zero if the
document does not contain the word, 4 if the word
appears 4 times, and so on (a sketch of this
construction is also given below). The processing and
pre-processing steps were performed using the
Natural Language Toolkit NLTK 3.0 for the Python
programming language (for more information and a
quick introduction to the library the reader is referred
to http://www.nltk.org/; the source code of the NLTK
is available at
http://www.nltk.org/_modules/nltk/text.html).
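The following minimal sketch illustrates the pre-processing stage of step 1 with NLTK. It assumes the Spanish stop-word list and the Spanish Snowball stemmer (the paper does not state which stemmer was used, so this choice is an assumption), and that the 'stopwords' and 'punkt' corpora have been fetched beforehand with nltk.download():

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

SPANISH_STOPWORDS = set(stopwords.words('spanish'))  # "el", "la", "los", "a", ...
stemmer = SnowballStemmer('spanish')

def preprocess(text):
    # Tokenize the plain-text document, drop stop words and
    # non-alphabetic tokens, and reduce each word to its stem.
    tokens = word_tokenize(text.lower(), language='spanish')
    return [stemmer.stem(t) for t in tokens
            if t.isalpha() and t not in SPANISH_STOPWORDS]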
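Given the token lists, the vector space of step 2 is a document-term matrix of raw counts. The sketch below is again an illustration under the same assumptions, not the authors' code; preprocess is the hypothetical function defined above:

from collections import Counter

def build_vector_space(documents):
    # Pre-process every document into its list of stemmed tokens.
    tokenized = [preprocess(doc) for doc in documents]
    # The vocabulary gives one coordinate per distinct token.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Entry is 0 if the term is absent, 4 if it occurs 4 times, etc.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

These raw counts are the term frequencies from which the tf-idf weights are subsequently computed.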
To compare the documents