ographic aspects of a document. These proposals are
the origin of a new research field called Geographic
Information Retrieval (GIR).
In (Luaces et al., 2008), we presented an architecture
of a GIR system and an index structure that improve
on the query capabilities of other proposals. However,
that architecture is not flexible enough to be used
in organizations where the number of documents is
constantly increasing. In public organizations (e.g.
city councils), where new planning permissions or
administrative files are generated every day, a workflow
process must be implemented to define all the tasks
needed to index a document in the repository. Moreover,
some tasks must be performed before the document
is indexed (e.g. metadata storage, scanning, OCR).
These tasks were not taken into account in the
earlier architecture, which assumes a static document
collection.
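As an illustration of why these preliminary tasks need an explicit workflow, the following minimal Python sketch orders the tasks named above so that indexing runs only after its prerequisites. The task names come from the text; the dependency graph (OCR after scanning, indexing after OCR and metadata storage) and the helper names `TASKS`, `ready_tasks` and `run_workflow` are assumptions made for this example, not the architecture proposed in the paper.

```python
# Hypothetical prerequisite table: task -> tasks that must finish first.
# The dependencies are an assumption for illustration only.
TASKS = {
    "scanning": [],
    "ocr": ["scanning"],
    "metadata_storage": [],
    "indexing": ["ocr", "metadata_storage"],
}


def ready_tasks(done):
    """Return the pending tasks whose prerequisites are all completed."""
    return sorted(
        task
        for task, prerequisites in TASKS.items()
        if task not in done and all(p in done for p in prerequisites)
    )


def run_workflow():
    """Execute every task in an order that respects the prerequisites."""
    done, order = set(), []
    while len(done) < len(TASKS):
        ready = ready_tasks(done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable prerequisites")
        for task in ready:
            order.append(task)  # a real system would perform the task here
            done.add(task)
    return order


order = run_workflow()
print(order)
```

With a static collection this ordering can be fixed once; with documents arriving every day, it has to be enforced for each new document, which is the role of the workflow process.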
Therefore, this paper proposes a set of strategies
for the workflow management of the repository cre-
ation process and a general system architecture sup-
porting them. The proposed strategies improve the
performance of the system, ensuring that all the nec-
essary tasks are correctly performed, and facilitating
the work of the people devoted to this activity. In
addition, textual, spatial, and hybrid queries (e.g.
planning permissions of civil buildings in A Coruña)
can be solved by means of the index structure
integrated in the system.
The rest of the paper is organized as follows.
Some related work is presented in the next section.
Section 3 presents the general architecture for the
workflow management in the digitalization process.
Then, in Section 4, we briefly describe the index
structure and the supported query types. Finally,
Section 5 presents our conclusions and future lines of
work.
2 RELATED WORK
Inverted indexes are considered the classical text
indexing technique (Baeza-Yates and Ribeiro-Neto,
1999). An inverted index associates each word in
the text with a list of pointers to the positions where
the word appears in the documents. The main drawback
of these indexes is that geographic references are
mostly ignored because place names are treated as
words just like any others. If the user poses a
query such as "hotels in Spain", the place name "Spain"
is treated as an ordinary word, and only those
documents that contain exactly that word are retrieved.
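The limitation can be seen in a minimal Python sketch of an inverted index. The function names `build_inverted_index` and `search` and the two sample documents are hypothetical, introduced only for this illustration; real systems add stemming, stop-word handling and compressed posting lists.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to a list of (doc_id, position) pointers."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index


def search(index, query):
    """Return the ids of documents that contain every query word exactly."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    doc_sets = [{doc_id for doc_id, _ in index.get(w, [])} for w in words]
    return set.intersection(*doc_sets)


# Illustrative documents (not taken from the paper).
docs = {
    1: "Cheap hotels in Spain near the beach",
    2: "Hotels in Madrid and Barcelona",
}
index = build_inverted_index(docs)
result = search(index, "hotels in Spain")
print(result)
```

Document 2 is about hotels in Spanish cities, yet it is not retrieved because the literal word "Spain" never appears in it; the index has no notion that Madrid and Barcelona are located in Spain.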
Regarding indexing geographic information,
many different spatial index structures have been
proposed throughout the years. A good survey of these
structures can be found in (Gaede and Günther, 1998).
A drawback of spatial index structures is that they do
not take into consideration the geographic ontology
of the real world. Internal nodes in the structure are
meaningless in the real world and it is not possible to
associate location-specific information to these nodes
because there is no relation at all between the nodes
in the spatial index structure and real world locations.
Some work has been done to combine both
types of indexes. The papers about the SPIRIT
(Spatially-Aware Information Retrieval on the Inter-
net) project (Jones et al., 2004; Vaid et al., 2005) are
a very good starting point. Regarding our own work in
this research area, in (Luaces et al., 2008) we presented
an architecture of a GIR system and an index structure
that combines an inverted index, a spatial index, and
an ontology-based structure. Pure textual queries,
pure spatial queries, and hybrid queries can be solved
by this index structure, which is described in Section 4.
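The three query types can be illustrated with a small Python sketch. It is a simplification under stated assumptions and not the index structure of (Luaces et al., 2008): the textual part is a plain word-to-documents map, the spatial part is a linear scan over one bounding box per document standing in for a real spatial index, and the ontology-based component is omitted. The names `BBox`, `textual_query`, `spatial_query` and `hybrid_query`, as well as the coordinates, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box (illustrative lon/lat values)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other):
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def textual_query(text_index, words):
    """Documents containing all the given words."""
    sets = [text_index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def spatial_query(footprints, window):
    """Documents whose footprint intersects the query window."""
    return {d for d, box in footprints.items() if box.intersects(window)}


def hybrid_query(text_index, footprints, words, window):
    """Documents that satisfy both the textual and the spatial part."""
    return textual_query(text_index, words) & spatial_query(footprints, window)


# Illustrative data: word -> document ids, document id -> footprint.
text_index = {"planning": {1, 2}, "permissions": {1, 2}, "hotels": {3}}
footprints = {
    1: BBox(-8.45, 43.30, -8.38, 43.39),
    2: BBox(-3.80, 40.30, -3.60, 40.50),
    3: BBox(-8.42, 43.35, -8.40, 43.37),
}
window = BBox(-8.50, 43.20, -8.30, 43.40)

textual = textual_query(text_index, ["planning", "permissions"])
spatial = spatial_query(footprints, window)
hybrid = hybrid_query(text_index, footprints, ["planning", "permissions"], window)
print(textual, spatial, hybrid)
```

Intersecting two independent result sets is the simplest way to answer a hybrid query; a combined structure avoids computing each side in full.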
Finally, regarding our work in document management
systems and workflow processes, in (Places
et al., 2007) we presented a set of strategies for
managing the workflow of the digital library
building process and a general system architecture
supporting them. That paper also presents a tool
developed following that architecture. This tool
provides an integrated environment where all tasks
involved in building the repository can be performed.
As noted before, in this work we extend that
architecture with new tasks that enable the index
to answer queries that take into account the spatial
nature of the geographic references included in the
text of the documents.
3 SYSTEM ARCHITECTURE
According to (Hollingsworth, 1995), a workflow is
concerned with the automation of procedures where
documents, information, or tasks are passed between
participants following a defined set of rules to achieve
or contribute to an overall business goal; the
computerized facilitation or automation of a business
process, in whole or part. Workflow management
systems can be classified into several types depending on
the nature and characteristics of the process (van der
Aalst and van Hee, 2002; Fischer, 2003). Collaborative
workflow systems automate business processes
where a group of people participate to achieve a
common goal. This type of business process involves a
chain of activities where the documents, which hold
the information, are processed and transformed until
that goal is achieved. We based the architecture of