patents with the IPC code G10L 17 have something
to do with “speech recognition”, which is the description of
the IPC code. If a query patent belongs to the cluster
of patents sharing the same IPC code, a conflicting
patent is highly likely to be found in that cluster.
The IPC codes make cluster-based retrieval feasible since
they can be the basis for semantic clustering (Kang
et al., 2007).
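As an illustration of the idea above, the following minimal sketch groups patents by IPC code and restricts the candidate set for a query patent to the clusters sharing one of its codes. The function names and the toy patent identifiers are hypothetical and are not taken from any system described in this paper.

```python
from collections import defaultdict

def build_ipc_clusters(patents):
    """Group patent ids by IPC code; a patent with several codes joins several clusters."""
    clusters = defaultdict(set)
    for patent_id, ipc_codes in patents.items():
        for code in ipc_codes:
            clusters[code].add(patent_id)
    return clusters

def candidate_patents(query_ipc_codes, clusters):
    """Union of the clusters that share at least one IPC code with the query patent."""
    candidates = set()
    for code in query_ipc_codes:
        candidates |= clusters.get(code, set())
    return candidates

# Hypothetical toy data: patent id -> list of assigned IPC codes.
patents = {
    "P1": ["G10L 17"],
    "P2": ["G10L 17", "G06F 17"],
    "P3": ["H04L 9"],
}
clusters = build_ipc_clusters(patents)
candidates = candidate_patents(["G10L 17"], clusters)
```

Only the patents in `candidates` would then be ranked against the query, which is what makes the discriminating power of the index terms inside a cluster important.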
However, IPC-based clusters may present a
problem when searches are performed within each of
them. Since the documents in a cluster are similar to
each other, they share many terms, making it
difficult to discriminate among them. Since the
goal of invalidity search is to pinpoint the patent
documents claiming the same technology, retrieving
many grossly similar documents with ordinary index
terms would not be very helpful, especially when the
size of a cluster is large. Identifying
discriminating features is difficult but critical
for patent invalidity search, so we need semantically
annotated terms that help distinguish
the patents claiming the same
technology or method from those that are grossly
similar to each other based on all the index terms.
The main thrust of this paper, therefore, is to link
problem/solution-based semantic annotations,
clustering, and patent retrieval. We describe a patent
retrieval model based on semantic clusters. The
system proposed in this paper consists of two parts:
semantic annotation for the PROBLEM and
SOLUTION categories and cluster-based retrieval
based on extracted semantic key phrases. For the
retrieval part, we attempt to distinguish patent
documents in a cluster for the same PROBLEM or
SOLUTION from those in other clusters, assuming
that documents belonging to the same semantic
cluster are more likely to be similar and hence
conflicting with each other.
The rest of this paper is organized as follows. In
Section 2, we present the related work in patent
retrieval and cluster-based retrieval. In Section 3, we
describe the semantic clustering method based on
the problem and solution annotations and a semantic
patent retrieval model. We illustrate and interpret the
experimental results in Section 4 and finally present
our conclusion in Section 5.
2 RELATED WORK
A cluster-based model for Information Retrieval (IR)
takes advantage of document clusters by assuming
that relevant documents would be grouped within
the same cluster. In general, documents are
automatically grouped by their topical relatedness
and relevant clusters are chosen with respect to a
given query (Croft, 1980; Voorhees, 1985), so that
the query terms in the relevant cluster are heavily
weighted in the retrieval model. In order to verify
the superiority of the cluster-based retrieval model, Liu
and Croft (2004) compared it with a cluster-less
model on a large test collection, using the language
modeling approach.
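One common way to bring clusters into the language modeling approach is to smooth each document model with the model of its cluster before backing off to the whole collection. The sketch below assumes a simple two-level linear interpolation; the weights `lam` and `beta`, the function names, and the toy documents are assumptions made for illustration and do not reproduce the exact formulation or settings of the cited work.

```python
import math
from collections import Counter

def ml_prob(term, counts, total):
    """Maximum-likelihood estimate of P(term) from raw counts."""
    return counts[term] / total if total else 0.0

def cluster_smoothed_log_likelihood(query, doc, cluster_docs, collection_docs,
                                    lam=0.5, beta=0.5):
    """Query log-likelihood under a document model smoothed by cluster and collection."""
    doc_counts = Counter(doc)
    cluster_counts = Counter(t for d in cluster_docs for t in d)
    collection_counts = Counter(t for d in collection_docs for t in d)
    doc_total = sum(doc_counts.values())
    cluster_total = sum(cluster_counts.values())
    collection_total = sum(collection_counts.values())

    score = 0.0
    for term in query:
        # Background model: mixture of the cluster model and the collection model.
        p_background = (beta * ml_prob(term, cluster_counts, cluster_total)
                        + (1 - beta) * ml_prob(term, collection_counts, collection_total))
        p = lam * ml_prob(term, doc_counts, doc_total) + (1 - lam) * p_background
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        score += math.log(p)
    return score

# Hypothetical toy collection: two clusters of tokenized documents.
cluster_a = [["voice", "signal"], ["speech", "voice"]]
cluster_b = [["voice", "signal"], ["network", "key"]]
collection = cluster_a + cluster_b

# The two scored documents are identical; only their clusters differ.
score_a = cluster_smoothed_log_likelihood(["speech"], cluster_a[0], cluster_a, collection)
score_b = cluster_smoothed_log_likelihood(["speech"], cluster_b[0], cluster_b, collection)
```

In this toy case neither scored document contains the query term, yet the one whose cluster contains it receives the higher score, which is the effect a cluster-based model is meant to capture.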
Prior to the series of workshops related to patent
retrieval, Larkey (1999) utilized IPC codes to divide
an entire corpus of patents into sub-corpora. The
patents in each sub-corpus compose a large virtual
document, and a query patent was mapped to each
virtual document to select n-best sub-collections. In
this approach, the search techniques in distributed IR
(Callan et al., 1995) were applied in order to reduce
long search time in several sub-collections. The
work is considered an important attempt to use a
unique aspect of patent documents.
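The sub-collection selection step described above can be sketched as follows: the patents of each sub-corpus are concatenated into one virtual document, and the n virtual documents that best match the query are kept. The scoring function used here (relative term frequency weighted by an inverse virtual-document frequency) is an assumption chosen for brevity, not the scoring used in the cited work, and all names and data are hypothetical.

```python
import math
from collections import Counter

def build_virtual_documents(sub_corpora):
    """Concatenate the token lists of each sub-corpus into one bag of words."""
    return {code: Counter(t for doc in docs for t in doc)
            for code, docs in sub_corpora.items()}

def select_n_best(query_tokens, virtual_docs, n=2):
    """Rank virtual documents against the query and return the n best sub-corpus keys."""
    num = len(virtual_docs)
    # Number of virtual documents in which each term occurs.
    vdf = Counter(t for counts in virtual_docs.values() for t in counts)

    def score(counts):
        total = sum(counts.values())
        s = 0.0
        for t in query_tokens:
            if counts[t]:
                s += (counts[t] / total) * math.log(1 + num / vdf[t])
        return s

    ranked = sorted(virtual_docs, key=lambda c: score(virtual_docs[c]), reverse=True)
    return ranked[:n]

# Hypothetical sub-corpora keyed by IPC subclass, each a list of tokenized patents.
sub_corpora = {
    "G10L": [["speech", "speaker", "voice"], ["speech", "signal"]],
    "H04L": [["network", "key", "signal"]],
    "G06F": [["database", "query", "index"]],
}
virtual_docs = build_virtual_documents(sub_corpora)
best = select_n_best(["speech", "signal"], virtual_docs, n=2)
```

The full search would then be run only inside the selected sub-collections, which is where the reduction in search time comes from.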
Chen et al. (2003a) proposed a patent document
retrieval system that considers semantic and syntactic
properties. They utilized Latent Semantic Indexing to
recognize synonymous expressions. The system first
finds the patent documents whose vectors lie in the
neighbourhood of the query vector. It then uses the
template matching algorithm developed by Chen &
Tokuda (2003b) to calculate the similarity of the
document and the query. Takaki (2004) proposed an
associative document retrieval method. They
extracted sub-topics from each query and weighted
them by a term frequency-based entropy model.
They applied this method to patent invalidity search by
using a query patent claim.
Many previous studies were presented in the
series of the NTCIR workshops (Kando, 2004, 2005,
2007). Among the work related to this paper is the
one by Konishi et al. (2004) that used an IPC code
as a category for each patent and combined TF/ICF
(term frequency and inverse category frequency)
with a general TF/IDF scoring formula. Fujii (2007)
integrated content and citation information, using
citations to identify authoritative patents
(i.e., foundation patents that are cited by a large number
of other patents) in the manner of the PageRank
method, and combined this with the Okapi BM25
model. His system performed the best among all the
participants in the task of patent retrieval in NTCIR-
6 (Fujii et al., 2007b).
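To make the TF/ICF idea mentioned above concrete, the sketch below treats each IPC category as a group of terms and computes an inverse category frequency in the same way as an inverse document frequency. The logarithmic form, the linear interpolation weight `alpha`, and all names and data are assumptions for illustration; the exact formula of the cited work is not given in this paper.

```python
import math
from collections import Counter

def inverse_frequency(term, groups):
    """log(N / n_t), where n_t is the number of groups (documents or categories) containing the term."""
    n_t = sum(1 for g in groups if term in g)
    return math.log(len(groups) / n_t) if n_t else 0.0

def combined_score(query, doc, documents, categories, alpha=0.5):
    """Term-frequency score weighted by an interpolation of IDF and ICF."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        idf = inverse_frequency(t, documents)
        icf = inverse_frequency(t, categories)
        score += tf[t] * (alpha * idf + (1 - alpha) * icf)
    return score

# Hypothetical toy data: term sets of four patents and of the two categories they fall into.
documents = [{"speech", "signal"}, {"speech", "voice"},
             {"signal", "network"}, {"key", "network"}]
categories = [{"speech", "signal", "voice"}, {"signal", "network", "key"}]

doc = ["speech", "signal"]
score_speech = combined_score(["speech"], doc, documents, categories)
score_signal = combined_score(["signal"], doc, documents, categories)
```

Here "speech" and "signal" occur in the same number of documents, so IDF alone cannot separate them, but "signal" appears in both categories and is therefore down-weighted by the ICF component.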
The work by Kang et al. (2007) seems to be most
relevant to our research. They proposed a cluster-
based retrieval model utilizing IPC classes. Since the
same IPC class would be assigned to somewhat
relevant patents, this approach is quite effective to
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval