a peer intends to publish or query is associated with
a theme. A user can find the relevant themes in
a catalog of all predefined themes shared
in the network. The theme is combined with the
segment number to determine the key used for
put(key,value). Therefore, the $i$-th segments of
DBFs that belong to different themes are not indexed
on the same peer.
To avoid checking several segments, we constrain
the functions $H_2$ to $H_k$ to cover a single segment.
Consequently, the $H_1(\text{key})$ function plays two roles:
filtering and routing. It filters values as a traditional
hash function does, and it also determines
the segment number to which the other hash functions are
constrained. The segment number ($m \div H_1(\text{key})$,
$m$ being the DBF size), combined with the theme, is
used as the key of the primitive put(key, value)
to determine the peer storing the segment to check.
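The two roles of $H_1$ can be illustrated with a minimal sketch. Several details are assumptions made for illustration and are not taken from the text: the hash functions are modelled as salted SHA-256 digests, the segment number is read as the index of the fixed-size segment that contains position $H_1(\text{key})$, `seg_size` is a hypothetical parameter, and the `theme#segment` key format is invented.

```python
import hashlib


def salted_hash(key: str, i: int, modulus: int) -> int:
    """Hypothetical H_i: salted SHA-256 reduced modulo `modulus`."""
    digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def segment_and_positions(key: str, m: int, seg_size: int, k: int):
    """Return (segment number, bit positions) for `key` in an m-bit DBF."""
    p1 = salted_hash(key, 1, m)          # H_1 ranges over the whole filter
    seg = p1 // seg_size                 # assumed reading of the segment number
    base = seg * seg_size
    width = min(seg_size, m - base)      # the last segment may be shorter
    # H_2..H_k are constrained to the segment selected by H_1.
    rest = [base + salted_hash(key, i, width) for i in range(2, k + 1)]
    return seg, [p1] + rest


def dht_key(theme: str, seg: int) -> str:
    """Hypothetical key format combining the theme and the segment number."""
    return f"{theme}#{seg}"


seg, bits = segment_and_positions("/article/author=Smith", m=1024, seg_size=128, k=4)
print(dht_key("bibliography", seg), bits)
```

Because every position falls in the segment chosen by $H_1$, a lookup needs to contact only the single peer responsible for that theme and segment.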
2.3 Controlling Segment Selectivity
The selectivity of a Bloom Filter depends on the size
m of the filter, the number k of functions, and the
number n of keys inserted. The probability of having
false positive answers (i.e., the probability of having
the k positions set to 1 for an element not in the set) is
given by $(1 - e^{-kn/m})^k$. As the number k of functions
is fixed, the probability depends on the ratio n/m.
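As a quick numerical illustration of this formula, the sketch below evaluates the false positive probability for a fixed k and m while n grows. The values of m, k, and n are arbitrary examples, not parameters from the paper.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """False positive probability (1 - e^(-kn/m))^k of a Bloom filter."""
    return (1.0 - math.exp(-k * n / m)) ** k


# With k and m fixed, the probability grows with the ratio n/m.
for n in (50, 100, 200, 400):
    print(n, round(false_positive_rate(n, m=1024, k=4), 4))
```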
A high false positive probability may cause many
useless network communications: each time
a value passes a filter, the peer that created
this filter is contacted. Therefore, keeping the false
positive probability below a threshold is
important, as it reduces network traffic.
When a source peer adds, removes, or modifies
documents, its DBF must reflect the changes. Some
techniques (Fan et al., 2000) are available to support
removal in Bloom Filters. Our goal is to keep the
selectivity of a Bloom Filter under a threshold after
insertions. To control the selectivity, we introduce
shadow segments. When the ratio n/m makes
the probability exceed the threshold, a shadow seg-
ment with an augmented size is used. Each segment
keeps the number of keys inserted so far. The shadow
segment size is computed so that the ratio n/m keeps
the probability under the threshold. When a shadow
segment is created, keys have to be rehashed in the
shadow segment using the new hash functions. The
$H_1$ function remains the same, determining the segment
number, and the range of the other $H_i$ functions is
modified to cover the shadow segment interval. With this
approach, we can adjust the size of a Bloom filter
dynamically as needed.
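The sizing rule can be sketched by inverting the false positive formula: requiring $(1 - e^{-kn/m})^k \le t$ for a threshold $t$ gives $m \ge -kn / \ln(1 - t^{1/k})$. The code below is only one possible reading of the mechanism. The class name `SegmentStats` and the `headroom` factor, which sizes the shadow segment for more keys than are currently stored so that it is not replaced at every insertion, are assumptions and do not come from the text.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """False positive probability (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def min_segment_size(n: int, k: int, threshold: float) -> int:
    """Smallest size m keeping the false positive rate at or below threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be in (0, 1)")
    if n <= 0:
        return 1
    return math.ceil(-k * n / math.log(1.0 - threshold ** (1.0 / k)))


class SegmentStats:
    """Tracks the key count of one segment and decides when to grow it."""

    def __init__(self, size: int, k: int, threshold: float, headroom: float = 2.0):
        self.size = size
        self.k = k
        self.threshold = threshold
        self.headroom = headroom  # hypothetical growth factor
        self.n = 0                # number of keys inserted so far

    def insert(self) -> bool:
        """Count one insertion; return True if a shadow segment is needed."""
        self.n += 1
        if false_positive_rate(self.n, self.size, self.k) <= self.threshold:
            return False
        target = math.ceil(self.headroom * self.n)
        self.size = min_segment_size(target, self.k, self.threshold)
        return True  # the caller must rehash its keys into the new segment


stats = SegmentStats(size=1024, k=4, threshold=0.01)
shadows = sum(stats.insert() for _ in range(200))
print(shadows, stats.size)
```

When `insert` returns True, the source peer would rehash its keys with the $H_i$ functions re-ranged over the new size, as described above.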
3 LOCATING DATA SOURCES
3.1 Network Architecture
As in traditional P2P networks, a peer can be a client,
a server, or a router. We add a fourth role: a peer is
also a controller for managing segments of DBF. The
client role is used for querying the network. A server
peer shares data on the network. For a server peer,
the DBF created from its data is split into segments;
segments are distributed through the network using
the DHT put(key, value) function, which sends
each segment to a controller peer. The message
sent through the network contains: (i) the segment
of the distributed Bloom Filter; (ii) a set of Bloom
Filter hashing functions ($H_2(\text{key}) \ldots H_n(\text{key})$); (iii)
the IP address of the server peer. Each peer is a
router, routing messages according to the DHT principles.
A controller peer manages distributed segments
of other peers. The role of a controller peer is to
check managed segments according to the Bloom Filter
principles.
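The three parts of the message and the controller's check can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the hashing functions are represented by integer salts for a SHA-256 based hash, the segment stores one byte per bit for readability, and the names `SegmentMessage`, `Controller`, and `build_message` are hypothetical. The addresses come from the documentation range 192.0.2.0/24.

```python
import hashlib
from dataclasses import dataclass


def offset(value: str, salt: int, width: int) -> int:
    """Hypothetical segment-local hash function identified by `salt`."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


@dataclass(frozen=True)
class SegmentMessage:
    bits: bytes      # (i) the segment, one byte per bit
    salts: tuple     # (ii) parameters standing for the hashing functions
    server_ip: str   # (iii) address of the server peer


def build_message(values, salts, width, server_ip):
    """Server side: set the bits of every value in a fresh segment."""
    bits = bytearray(width)
    for value in values:
        for salt in salts:
            bits[offset(value, salt, width)] = 1
    return SegmentMessage(bytes(bits), tuple(salts), server_ip)


class Controller:
    """Stores segments received via put and checks values against them."""

    def __init__(self):
        self.segments = {}  # DHT key -> list of SegmentMessage

    def put(self, key, message):
        self.segments.setdefault(key, []).append(message)

    def check(self, key, value):
        """Return the server peers whose segment filters `value`."""
        return [
            msg.server_ip
            for msg in self.segments.get(key, [])
            if all(msg.bits[offset(value, s, len(msg.bits))] for s in msg.salts)
        ]


controller = Controller()
controller.put("bibliography#3", build_message(["Smith"], (2, 3, 4), 128, "192.0.2.10"))
print(controller.check("bibliography#3", "Smith"))
```

A value that was inserted always passes the check. A value that was not inserted may still pass, which is the false positive case discussed in Section 2.3.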
3.2 Locating Relevant Sources
Queries processed in our system are simple content
and structure queries with absolute path expressions
(i.e., only the child axis). A query can be expressed as a
theme and a tree of path expressions whose leaf nodes
carry keywords. Like an XML document, a query tree is
decomposed into value-localization-paths. Each
value-localization-path is inserted in a demand message,
used to resolve the query in a distributed manner.
A demand, illustrated at the bottom of Figure 2,
is organized as follows:
• a step attribute indicating the current process:
checking the DBF (checkingDBF) or contacting
server peers (checkingSRC),
• a from attribute for the client peer address,
• vlp elements representing the value-localization-
paths of the query. Each stores the theme of the
query, the path, and the value to search. The
state attribute indicates whether it has been
resolved on a controller peer (found) or is still
pending (looking),
• a results element storing source peers filtered
by DBFs.
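The structure listed above can be sketched with Python's standard `xml.etree.ElementTree` module. The exact element layout is shown in the paper's figure and is not reproduced here, so the nesting below (theme and state as attributes of vlp, path and value as child elements) and the sample address and paths are assumptions.

```python
import xml.etree.ElementTree as ET


def new_demand(client_addr, theme, vlps):
    """Build a demand message in its initial state (assumed layout)."""
    demand = ET.Element("demand", {"step": "checkingDBF", "from": client_addr})
    for path, value in vlps:
        # Every value-localization-path starts unresolved.
        vlp = ET.SubElement(demand, "vlp", {"theme": theme, "state": "looking"})
        ET.SubElement(vlp, "path").text = path
        ET.SubElement(vlp, "value").text = value
    ET.SubElement(demand, "results")  # filled with source peers filtered by DBFs
    return demand


demand = new_demand(
    "192.0.2.20",
    "bibliography",
    [("/article/author", "Smith"), ("/article/title", "Bloom")],
)
print(ET.tostring(demand, encoding="unicode"))
```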
Client Peer. At creation time, a new demand
message is created with step set to checkingDBF.
Algorithm 1 describes the behaviour of the client peer.
First, from the query tree, a set of value-localization-
paths is extracted. The value-localization-paths (vlps)