Figure 1: Adopted architecture of the fuzzy keyword search.
manual keyword extraction are obvious, including
inconsistent keyword combinations and low
efficacy, especially in cloud computing scenarios.
Much research exists on automatic keyword
extraction: (Witten, 2009) adopts statistical
methods based on an English dictionary to build the
KEA system for automatic keyword extraction,
(Yang, 2002) uses a PAT-tree structure to collect
keywords automatically, and (Du, 2011) reports an
improved scheme based on the co-occurrence
frequency of Chinese phrases. However, few studies
specifically address automatic keyword extraction
over Chinese patent documents.
In this paper, to increase the accuracy of keyword
extraction, we jointly consider the influence of
word frequency in special regions, a penalty
function for parallel structures, and weighted
lexical morphemes on the subjects of Chinese patent
literature. After common words are removed, Chinese
keywords are automatically extracted with an
improved version of the term frequency-inverse
document frequency (TF-IDF) algorithm. To search
Chinese keywords efficiently, a Pinyin-Gram-based
algorithm is proposed to build the fuzzy keyword
set, since Chinese Pinyin offers a unique way to
measure Chinese word similarity that differs
substantially from the approaches used for English.
Encrypted files and keyword sets are transferred to
the private cloud server. On the side of the
authorized users, a keyword trapdoor search index
structure based on an n-ary tree is designed, and
the matching encrypted files are delivered by the
public server, which usually has much more memory
than the private server. Computer experiments
verify that the efficiency of the proposed scheme
is significantly higher than that of traditional
methods.
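For reference, the baseline on which the improved method builds can be
sketched as follows. This is a minimal illustration of plain TF-IDF only,
assuming that each document has already been word-segmented into a list of
tokens. The region weights, the parallel-structure penalty and the
morpheme weights mentioned above are not modeled, and the function name
tf_idf_keywords, the stop-word set and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_keywords(docs, stop_words, top_k=3):
    """Return the top_k highest-scoring terms per document (plain TF-IDF)."""
    # Remove common words before any counting takes place.
    filtered = [[t for t in doc if t not in stop_words] for doc in docs]
    n_docs = len(filtered)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in filtered for term in set(doc))
    results = []
    for doc in filtered:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Hypothetical, pre-segmented sample documents (English glosses of tokens).
sample_docs = [
    ["a", "encrypt", "search", "method", "encrypt"],
    ["a", "image", "compress", "method"],
    ["a", "encrypt", "store", "device"],
]
sample_keywords = tf_idf_keywords(sample_docs, stop_words={"a"}, top_k=2)
print(sample_keywords[0])  # ['search', 'encrypt']
```

In this sample, "search" outranks "encrypt" in the first document even
though "encrypt" occurs twice, because "encrypt" also appears in another
document and is therefore down-weighted by the inverse document frequency.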
2 SYSTEM DESIGN AND TASK
2.1 System Description
In this paper, the adopted system architecture
consists of four components, i.e., the owner, the
private cloud server, the public cloud server and the
authorized users, as indicated in Fig. 1. The
difference from the general system architecture,
e.g., in (Li, 2010), is that a private cloud server
is introduced. The advantage of this arrangement is
that it doubly enhances the security of sensitive
files, since information leakage may happen through
index analysis if all the data are stored together
in the public server.
The flow of the fuzzy keyword search is as follows.
The keywords are extracted automatically from the
patent files, and then the fuzzy keyword sets and
the search index are constructed. The owner
encrypts the patent files and transfers them to the
private server. The encrypted files are then
uploaded to the public cloud server with necessary
remarks or extra encryption. The authorized users
submit search requests, and the corresponding
trapdoor functions are processed at the private
server. The file indexes and the matching encrypted
files are returned to the user. In addition,
encryption of certain patent literature, e.g.,
national defense patents, is desired in cloud
computing. As implied by (Li, 2010), the cloud
server cannot be fully trusted. On
one hand, it does not delete the encrypted files or
the index, and it only responds to query requests
from authorized users with unaltered search
results. On the other hand, it may analyze the data
stored in the server for certain purposes and sell the
analyzed results as additional information to