3.2 Mining Molecular Biology Text
The same methodology can be used directly to mine
sequences of nucleotides given as input, without
modifying the system itself. All we need to do is
change the input so that the compiler treats strings
of nucleotides rather than strings of words. For example,
from the three sequences of nucleotides:
c a t g g c a a
t g g c a c t g
a c g t g g c a
the compiler will obtain:
(1’) s1 :- w(1,1,c), w(1,2,a), w(1,3,t), w(1,4,g), w(1,5,g),
w(1,6,c), w(1,7,a), w(1,8,a).
(2’) s2 :- w(2,1,t), w(2,2,g), w(2,3,g), w(2,4,c), w(2,5,a),
w(2,6,c), w(2,7,t), w(2,8,g).
(3’) s3 :- w(3,1,a), w(3,2,c), w(3,3,g), w(3,4,t), w(3,5,g),
w(3,6,g), w(3,7,c), w(3,8,a).
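The compilation step above can be sketched in a few lines. This is only an illustration in Python, not the compiler used in our system; the function name compile_sequences and the exact text layout of the clauses are assumptions made for the example.

```python
def compile_sequences(sequences):
    """Turn whitespace-separated symbol strings into w(i,j,x) clauses.

    i is the 1-based index of the sequence, j the 1-based position of the
    symbol x inside it. Hypothetical helper, for illustration only.
    """
    clauses = []
    for i, seq in enumerate(sequences, start=1):
        facts = [f"w({i},{j},{sym})"
                 for j, sym in enumerate(seq.split(), start=1)]
        clauses.append(f"s{i} :- " + ", ".join(facts) + ".")
    return clauses


clauses = compile_sequences([
    "c a t g g c a a",
    "t g g c a c t g",
    "a c g t g g c a",
])
for clause in clauses:
    print(clause)
```

Because the symbols are simply split on whitespace, the same sketch applies unchanged to strings of words.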
The system is then run as before, by calling all input
strings through rule(1). Among other results, this
generates the output:
common([t,g,g,c,a],[3,1,4])
indicating that t g g c a is a common substring, and
that its start positions in strings s1, s2 and s3 are
3, 1 and 4, respectively. The complete output is shown
in Appendix I at
http://www.geocities.com/CHRPrograms/SCF.html.
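To make the expected result easy to check by hand, the following sketch recomputes it by brute force. It is not the CHR rule, which works through unification. It assumes single-character symbols, reports only the first occurrence of a substring in each sequence, and uses a hypothetical min_len parameter to skip trivial matches.

```python
def common_substrings(sequences, min_len=2):
    """Map each substring common to all sequences to its start positions.

    Positions are 1-based and refer to the first occurrence in each
    sequence. Brute-force illustration, not the unification-based rule.
    """
    first = sequences[0]
    results = {}
    for start in range(len(first)):
        for end in range(start + min_len, len(first) + 1):
            candidate = first[start:end]
            if candidate in results:
                continue
            positions = [seq.find(candidate) for seq in sequences]
            if all(p >= 0 for p in positions):
                results[candidate] = [p + 1 for p in positions]
    return results


seqs = ["catggcaa", "tggcactg", "acgtggca"]
found = common_substrings(seqs)
longest = max(found, key=len)
print(longest, found[longest])
```

For the three sequences above, the longest common substring found this way is tggca, with start positions 3, 1 and 4.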
So far we have only considered identical subsequences,
i.e. cases with no ambiguous elements in the
vocabulary. Our formulation, however, has been designed
to accommodate ambiguous input with minimal
extra apparatus and computational overhead, as
we discuss in section 4.1.
3.3 Efficiency Considerations
Our core rule for finding common substrings in a
sequence of strings is computationally intensive in
molecular biology applications, because we must
examine each sequence in its entirety, drawing
subsequences of different lengths from each, before
the core rule discovers through unification which
substrings are common to all the given strings. Even in
these applications, however, there are subproblems
where the search space can be reduced. For instance,
it is not uncommon to look for common substrings of
a given length, or of a given maximum length. In these
cases our approach could be modified to take advantage
of the smaller search space, by only looking for
common substrings of length L, where L is known.
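The reduction obtained when L is known can be illustrated as follows. This is a sketch under the assumption of single-character symbols; the function name common_of_length is hypothetical, and the set intersection used here stands in for the unification performed by our rule.

```python
def common_of_length(sequences, length):
    """Return the substrings of exactly `length` symbols found in every sequence."""
    def windows(seq):
        # All contiguous windows of the requested length (empty if too short).
        return {seq[i:i + length] for i in range(len(seq) - length + 1)}

    shared = windows(sequences[0])
    for seq in sequences[1:]:
        shared &= windows(seq)
    return shared


seqs = ["catggcaa", "tggcactg", "acgtggca"]
print(common_of_length(seqs, 5))
print(sorted(common_of_length(seqs, 4)))
```

Each sequence of n symbols contributes at most n - L + 1 candidates, instead of one candidate for every start position and every length.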
With human language texts, however, the search
space can be greatly reduced. For instance, imag-
ine that instead of having to find arbitrary substrings
of arbitrary lengths as we did above, we are given a
known sequence of words and all we have to do is
check whether they show up in every string. This
would be useful for instance in automatic author-
ship attribution and genre classification (Stamatatos
et al., 2000) where the use of certain subphrases, word
frequencies, word length and sentence length can be
calculated for specific authors or genres and used to
prove or disprove authorship of texts. It could also be
useful for determining the age of a manuscript, e.g.
by checking how frequently a series of words that
may have fallen into disuse appears in a text
presumed to be of a certain age.
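Checking a known sequence of words against every string can be sketched as below. The sample texts and the function names phrase_positions and occurs_in_all are invented for the illustration; the length of the list returned by phrase_positions gives the frequency of the phrase in one text.

```python
def phrase_positions(words, phrase):
    """Return the 1-based start positions at which `phrase` occurs in `words`."""
    n = len(phrase)
    return [i + 1 for i in range(len(words) - n + 1)
            if words[i:i + n] == phrase]


def occurs_in_all(texts, phrase):
    """True if the word sequence `phrase` shows up in every text."""
    phrase_words = phrase.split()
    return all(phrase_positions(text.split(), phrase_words) for text in texts)


# Hypothetical sample texts, for illustration only.
texts = ["the quick brown fox", "a quick brown dog", "quick brown"]
print(occurs_in_all(texts, "quick brown"))
print(occurs_in_all(texts, "brown fox"))
```

Since the phrase is fixed in advance, each text is scanned once and no candidate substrings need to be generated.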
4 THREE SPECIAL CASES OF
STRING ANALYSIS
4.1 Ambiguous Matching
Whereas the basic nucleotide set consists of the
nucleotides A, C, T and G, ambiguity (where a given
position in a string can take one value or another) is
typically expressed by using extra names for the
ambiguous nucleotides, so for instance a nucleotide
denoted as R can materialize as either A or G.
Ambiguous matching usually introduces considerable
extra work, both in representing ambiguous strings
and in processing them. Representation-wise,
it is combinatorially explosive to explicitly
construct all alternative strings, one for each possible
combination of values of the ambiguous nucleotides.
The alternative of compacting the representations
usually complicates their processing, since they have
to be unfolded at runtime. Specific procedures might
be needed as well, for instance to explicitly block any
proposed solutions in which the ambiguous nucleotides
are not compatible with their counterparts in other
input sequences in the comparison set.
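The combinatorial growth of the explicit representation can be seen in a small sketch. Only the code R (A or G) is taken from the text; the table name AMBIGUITY and the function expand are hypothetical, and further ambiguity codes would be added to the table in the same way.

```python
from itertools import product

# Only R is given in the text; other codes would be listed similarly.
AMBIGUITY = {"r": "ag"}


def expand(sequence):
    """Explicitly construct every alternative string for an ambiguous sequence."""
    choices = [AMBIGUITY.get(sym, sym) for sym in sequence]
    return ["".join(combo) for combo in product(*choices)]


print(expand("trgr"))
print(len(expand("r" * 10)))
```

Two ambiguous positions already yield four alternative strings, and ten yield 1024: the number of explicit strings doubles with each additional R.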
In contrast, all our formulation needs in order to
represent and process any ambiguous nucleotide is for
the compiler to materialize all its incarnations locally
when the ambiguous string is read in. For instance, a
nucleotide of type R appearing in the third sequence,
column 7, which following our notation will be input
as n(3,7,r), compiles into the two nucleotides
n(3,7,a) and n(3,7,g). Non-ambiguous nucleotides in
the same sequence remain represented as before, so
that complexity-wise, the representation grows only
linearly with respect to the number of ambiguous nu-
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval