topic based on the occurrence probability in a “word
cloud” view. Spatial information of topics is
presented in “scatterplot” view in which similar
topics are placed close to each other. The evolution of
topics over 10 years is represented by a Sankey
diagram. They use a treemap to represent their three
tree-structured topic results as a hierarchy
of topics, and a stream diagram to represent the
trends of a topic.
Mane et al. (2004) present a way to generate co-
word association maps of major topics based on
highly frequent words and words with a sudden
increase in usage. They use a Fruchterman-Reingold
layout to draw co-occurrence relations in 2D, but the
data for each citation is collected only from its
title and keywords.
Chen (2004) visualizes salient nodes in a co-
citation study, with a focus on three types of node:
landmark, hub and pivot nodes. They apply time
slicing, thresholding, modelling, pruning, merging
and mapping methods to prune a dense network.
We have not found an existing visualization
method that uses citing paths.
3 DATA MANAGEMENT
We define four logical data entities: Citation, Corpus,
Reference and Keyword. A Citation is a published
paper that is managed in our system as full text and
PDF. A Corpus is the set of Citations published in the
same year. A Reference is a cited paper in the reference
list of a Citation. A Keyword of a Citation is a computer
graphics (CG) keyword that appears at least once in
that Citation.
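A minimal sketch of these entities as Python data classes is given here for illustration. The field names (title, year, authors, references, keywords) and the helper build_corpora are assumptions made for this sketch, not the schema of the actual system.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    """A cited paper, identified by title, year and authors."""
    title: str
    year: int
    authors: tuple


@dataclass
class Citation:
    """A published paper held in the system (hypothetical fields)."""
    title: str
    year: int
    authors: tuple
    references: list = field(default_factory=list)  # Reference objects
    keywords: set = field(default_factory=set)      # CG keywords found in the text


def build_corpora(citations):
    """Group citations by publication year; each group is one Corpus."""
    corpora = defaultdict(list)
    for citation in citations:
        corpora[citation.year].append(citation)
    return dict(corpora)
```

Keying the corpora by year keeps the time factor explicit, so a corpus can be looked up directly from a publication year.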
As a benchmark, we used 1228 publications from
13 years of ACM SIGGRAPH conferences (2002-2014).
Corpora are organised by year, which introduces a time
factor that is strongly related to topics, and we use
this natural grouping as our logical corpus.
The raw resource of a Citation is a PDF file. These
citations are semi-structured and follow a
certain template, in this case the ACM format. We
use text mining to extract metadata for each
citation by identifying basic information.
For a Reference in the reference list of a Citation,
we extract the title, year and authors as its identity.
There are two possibilities: this reference is either a
citation that already exists in the system, or it is not a
SIGGRAPH publication. At this stage, we assume
SIGGRAPH represents a history of topics in CG.
Based on this assumption, and in order to simplify the
problem, only references that can be matched to
citations in our system are considered. The other
references are stored, but not processed.
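The matching step described above can be sketched as follows. This is an illustrative outline only: the normalization rules (lowercasing, dropping punctuation, ignoring author order) and the function names identity_key and split_references are assumptions, and the records are represented as simple (title, year, authors) tuples.

```python
import re


def identity_key(title, year, authors):
    """Build a comparable key from title, year and author names.

    The normalization used here is an assumption for illustration.
    """
    norm_title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    norm_authors = tuple(
        sorted(re.sub(r"[^a-z]+", " ", name.lower()).strip() for name in authors)
    )
    return (norm_title, int(year), norm_authors)


def split_references(references, citations):
    """Separate references that match a stored citation from the rest.

    Both arguments are lists of (title, year, authors) tuples.
    Returns (matched, unmatched); matched holds (reference, citation) pairs.
    """
    index = {identity_key(*citation): citation for citation in citations}
    matched, unmatched = [], []
    for ref in references:
        hit = index.get(identity_key(*ref))
        if hit is not None:
            matched.append((ref, hit))
        else:
            unmatched.append(ref)  # stored, but not processed further
    return matched, unmatched
```

Exact matching on a normalized key is the simplest option; references whose titles are abbreviated or misspelled would fall into the unmatched list under this sketch.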
Although the keyword list section in a paper
represents the author’s point of view, in most cases it
fails to reflect important information. Authors may also
use different phrases to represent the same concept,
such as “3D”/“three dimensional”, “level of detail”/
“LOD”, and so on. To resolve this problem, we
introduce an ontology. An ontology is a formal,
explicit specification of a shared conceptualization
(Gruber 1993; Borst 1997).
Due to the complexity of the data, we employ four
types of data store (a semantic repository, an index and
search repository, a document repository, and a graph
repository) for efficient data management and
information retrieval. Each is used for the tasks that
suit its strengths, and in combination they store and
index the data reliably and efficiently in support of
scientific research.
3.1 Semantic Repository
The standard keyword list we use as the shared
conceptualization is fetched from the MAS API, which
supplies a keyword function representing keyword
objects in many fields. For the “computer” area, it
covers “computer graphics”, “computer vision”,
“machine learning”, “artificial intelligence” and
others, 24 fields in total. We target the “computer
graphics” field, from which we collected 13670 keywords.
Each keyword in the CG field is described as
an ontology graph model with nodes and edges. A
keyword is an RDF (Resource Description
Framework) resource with an “rdf:type” of CG. Its
synonyms are described by the “owl:sameAs” predicate.
As a result, each keyword in a
citation can be mapped to a node of type CG in
the semantic repository. We chose Sesame (Fensel
et al., 2005) as our RDF repository as it supplies APIs
for creating, parsing, storing, inferencing and
querying. It can also be connected to the semantic
annotation tool GATE, which we used for extracting
the metadata. From the “GATE ontology,
Gazetteer producer” output, we can calculate the
frequency of each keyword.
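The mapping from synonyms to CG nodes, and the resulting frequency count, can be illustrated with a small sketch. It uses plain Python tuples in place of an RDF store and does not call Sesame or GATE; the “cg:” identifiers, the toy triples and the function names are all hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"
OWL_SAME_AS = "owl:sameAs"

# Toy (subject, predicate, object) triples; the "cg:" names are hypothetical.
TRIPLES = [
    ("cg:level_of_detail", RDF_TYPE, "cg:CG"),
    ("cg:LOD", OWL_SAME_AS, "cg:level_of_detail"),
    ("cg:three_dimensional", RDF_TYPE, "cg:CG"),
    ("cg:3D", OWL_SAME_AS, "cg:three_dimensional"),
]


def canonical_nodes(triples):
    """Map each CG-typed node, and each synonym pointing at it, to that node."""
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == "cg:CG"}
    mapping = {node: node for node in typed}
    for s, p, o in triples:
        if p == OWL_SAME_AS and o in typed:
            mapping[s] = o
    return mapping


def keyword_frequency(mentions, triples):
    """Count mentions per canonical CG node, ignoring unknown terms."""
    mapping = canonical_nodes(triples)
    return Counter(mapping[m] for m in mentions if m in mapping)
```

For simplicity the sketch follows “owl:sameAs” only from synonym to typed node, whereas the predicate is symmetric in OWL; an RDF repository with inferencing would handle both directions.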
3.2 Document Repository
The document repository (CouchDB) is designed for
web applications, and files can be stored as
attachments of a document. By passing a document
id, the attachments of a document can be accessed easily.
Since CouchDB treats each record as a document
without considering its properties, a database can