If we now initialize the system by calling all three strings, i.e.:
(4) ?- s1, s2, s3.
we are in a position to extract substrings from these sentences in a high-level fashion, through the
following two propagation rules:
(5) w(Row,C,N), w(Row,C1,N1) ==> C1 is C+1 | sub([N,N1],Row,C).
(6) w(Row,C,N), sub(S,Row,C1) ==> C1 is C+1 | sub([N|S],Row,C).
Rule (5) detects two consecutive words in the same sentence, or row, and records them through
a new constraint sub/3: its first argument holds the words in list form, its second argument
records the row (or sentence) the substring was found in, and its third argument records
the column at which the substring starts within that row. Rule (6) similarly identifies all longer
substrings of the input strings, by prepending one more word at a time to an already found substring.
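To make the behaviour of rules (5) and (6) concrete, the following is a minimal sketch of how they might be loaded and run, assuming SWI-Prolog with its bundled CHR library. The three-word row a b c is a hypothetical input chosen only for illustration, and the guard is written with =:= rather than is, which is an equivalent arithmetic test when both columns are already bound.

```prolog
:- use_module(library(chr)).
:- chr_constraint w/3, sub/3.

% Rule (5): two consecutive words in the same row form a substring of length 2.
w(Row,C,N), w(Row,C1,N1) ==> C1 =:= C+1 | sub([N,N1],Row,C).

% Rule (6): a word immediately preceding a known substring extends it to the left.
w(Row,C,N), sub(S,Row,C1) ==> C1 =:= C+1 | sub([N|S],Row,C).

% Example query (hypothetical input row "a b c"):
% ?- w(1,1,a), w(1,2,b), w(1,3,c),
%    findall(S-C, find_chr_constraint(sub(S,1,C)), L).
% L then holds, in some order, [a,b]-1, [b,c]-2 and [a,b,c]-1.
```

Because both rules are propagation rules, the w/3 constraints stay in the store, so every contiguous substring of two or more words is eventually recorded exactly once.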
Of course, for different problems we may specialize these rules further, so that they zoom
in on some sufficient subset of the set of all substrings, e.g. all substrings of a given size.
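One possible specialization, sketched here under our own assumptions rather than taken from the system described, bounds the length of the substrings that are generated. The extra length argument of sub/4 and the max_len/1 constraint are hypothetical names introduced for this illustration only.

```prolog
:- use_module(library(chr)).
:- chr_constraint w/3, sub/4, max_len/1.

% sub(Words, Row, Col, Len): Words has length Len and starts at Col in Row.

% As rule (5), additionally recording the length 2.
w(Row,C,N), w(Row,C1,N1) ==> C1 =:= C+1 | sub([N,N1],Row,C,2).

% As rule (6), but a substring is only extended while it is shorter
% than the bound M supplied through the (hypothetical) max_len/1 constraint.
max_len(M), w(Row,C,N), sub(S,Row,C1,L) ==> C1 =:= C+1, L < M |
    L1 is L+1,
    sub([N|S],Row,C,L1).

% Example query (hypothetical input row "a b c d", bound 3):
% ?- max_len(3), w(1,1,a), w(1,2,b), w(1,3,c), w(1,4,d),
%    findall(S, find_chr_constraint(sub(S,1,_,_)), L).
% L then holds the three substrings of length 2 and the two of length 3,
% but not [a,b,c,d].
```

Keeping the bound in a constraint rather than hard-coding it in the guard lets the same rules be reused with a different size per query.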
We now have enough utilities for the first incarnation of our Power matching rule, which
extracts a substring S that is common to all three strings and records the position in each sentence
where the substring appears:
(7) sub(S,1,C1), sub(S,2,C2), sub(S,3,C3) ==> common(S,[C1,C2,C3]).
This completes our formulation for this toy example. Among the results the system outputs, we
have:
common([of,march],[3,5,2])
Notice that in their declarative reading, our system’s rules form a specialized concept (e.g.
substring, common string, etc.) and in their operational reading, they produce all instances of that
concept with respect to given input.
4.2 Mining Molecular Biology Text
The same methodology can be directly used for mining sequences of nucleotides given as input,
without touching the system itself. All we need to do is change the input so that the compiler will
treat strings of nucleotides, e.g. from:
c a t g g c a a
t g g c a c t g
a c g t g g c a
the compiler will obtain (we now use “n” instead of “w” as a mnemonic, so rules (5) and (6) are read with n/3 in place of w/3):
(1’) s1:- n(1,1,c),n(1,2,a),n(1,3,t),n(1,4,g),n(1,5,g),n(1,6,c),n(1,7,a),n(1,8,a).
(2’) s2:- n(2,1,t),n(2,2,g),n(2,3,g),n(2,4,c),n(2,5,a),n(2,6,c),n(2,7,t),n(2,8,g).
(3’) s3:- n(3,1,a),n(3,2,c),n(3,3,g),n(3,4,t),n(3,5,g),n(3,6,g),n(3,7,c),n(3,8,a).
Calling all input strings as in (4) results in the output:
common([t,g,g,c,a],[3,1,4])
being generated among others, indicating that t g g c a is a common substring, and that its start
position in strings s1, s2 and s3 is respectively 3, 1 and 4.
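The nucleotide example can be reproduced end to end with a short program. The following is a sketch assuming SWI-Prolog with its CHR library: clauses (1’) to (3’) are copied from the text, rules (5) to (7) are restated with n/3 in place of w/3, and the guards use =:= instead of is (an equivalent test here, since the columns are always bound).

```prolog
:- use_module(library(chr)).
:- chr_constraint n/3, sub/3, common/2.

% Input strings (1')-(3'), as produced by the compiler in the text.
s1 :- n(1,1,c),n(1,2,a),n(1,3,t),n(1,4,g),n(1,5,g),n(1,6,c),n(1,7,a),n(1,8,a).
s2 :- n(2,1,t),n(2,2,g),n(2,3,g),n(2,4,c),n(2,5,a),n(2,6,c),n(2,7,t),n(2,8,g).
s3 :- n(3,1,a),n(3,2,c),n(3,3,g),n(3,4,t),n(3,5,g),n(3,6,g),n(3,7,c),n(3,8,a).

% Rules (5) and (6): collect every contiguous substring of length >= 2.
n(Row,C,N), n(Row,C1,N1) ==> C1 =:= C+1 | sub([N,N1],Row,C).
n(Row,C,N), sub(S,Row,C1) ==> C1 =:= C+1 | sub([N|S],Row,C).

% Rule (7): a substring present in rows 1, 2 and 3 is common to all three.
sub(S,1,C1), sub(S,2,C2), sub(S,3,C3) ==> common(S,[C1,C2,C3]).

% Example query:
% ?- s1, s2, s3,
%    findall(S-Ps, find_chr_constraint(common(S,Ps)), L),
%    memberchk([t,g,g,c,a]-Ps5, L).
% Ps5 = [3,1,4]
```

Shorter common substrings such as [t,g] or [c,a] are reported as well, once for each combination of positions at which they occur in the three rows.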
So far we have only considered identical subsequences, i.e. there are no ambiguous elements
in the vocabulary. Our formulation however has been designed to accommodate ambiguous input
with minimum extra apparatus and computational overhead, as we discuss in section 5.1.