Additionally, most of the crawled web pages contain
a large number of impurity marker words, such as
"answer", "advertisement", and "consultation".
Sentences that contain these impurity marker words
should be excluded from the information extraction
process.
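The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `drop_impure_sentences` is hypothetical, only the three marker words quoted in the text are used, and plain substring matching is an assumption.

```python
# Impurity marker words quoted in the text; a real list would be longer.
IMPURITY_MARKERS = ("answer", "advertisement", "consultation")


def drop_impure_sentences(sentences, markers=IMPURITY_MARKERS):
    """Return only the sentences that contain none of the marker words.

    Substring matching is an assumption made for this sketch.
    """
    kept = []
    for sentence in sentences:
        lowered = sentence.lower()
        if not any(marker in lowered for marker in markers):
            kept.append(sentence)
    return kept


sentences = [
    "Polyuria is caused by diabetes.",
    "Advertisement: free online consultation.",
]
clean = drop_impure_sentences(sentences)
```

Here `clean` keeps only the first sentence, because the second one contains two marker words.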
3 SENTENCES OF SYMPTOMS
3.1 Semantic Element Sets
Sentences that describe the causes of symptoms
follow fixed patterns, and each pattern is built
from fixed semantic elements.
To summarize such sentences, we identify the
sentence patterns and the semantic elements that
occur in them. The entity words can roughly be
divided into two types of semantic elements, namely
symptoms and causes, and the relationship between
them is expressed by relation words. As shown in
Figure 3, we constructed the (symptom, diagnosis,
cause) triplet.
Figure 3: (symptom, diagnosis, cause) Triplet.
For example, in the sentence "Weakness in both
legs is caused by osteoporosis", "weakness in both
legs" and "osteoporosis" are entity words, and "is
caused by" is a relation word. After studying a large
number of sentences that describe symptoms and
their causes, we induced a number of semantic
element sets. Some examples are given below.
Concrete causes of symptoms={osteoporosis, amoebic dysentery,...}.
Upper concepts of concrete causes={causes, factors, reasons,...}.
Preposition words={because, by, since, due to, with,...}.
Relation words={cause, induce, bring out, form,...}.
Patients={patients, invalid, sick,...}.
List items={one, two, three, 1, 2, (1), (2), 1), ①, ②, follows,...}.
Punctuation marks or words that embody peer or parallel meaning={comma, or, and, in addition, also,...}.
Adverbs={will, often, generally, more, can, very, also can, possibly,...}.
Impurity words={question, choice, multiple choice, single choice, answer, advertisement, consultation,...}.
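One way to hold these sets in code is a dictionary keyed by the element labels that Section 3.2 uses (C, S, B1, B2, P). The sketch below is only an illustration: the names `SEMANTIC_ELEMENTS` and `label_of` are hypothetical, the sets contain only the English glosses listed above, and exact lowercase lookup is an assumption.

```python
# Semantic element sets, keyed by the labels used in the sentence patterns.
# Only the entries listed in the text are included; the real sets are larger.
SEMANTIC_ELEMENTS = {
    "C": {"osteoporosis", "amoebic dysentery"},          # concrete causes
    "S": {"causes", "factors", "reasons"},               # upper concepts of causes
    "B1": {"because", "by", "since", "due to", "with"},  # preposition words
    "B2": {"cause", "induce", "bring out", "form"},      # relation words
    "P": {"patients", "invalid", "sick"},                # patients
}


def label_of(phrase):
    """Return the label of the first set containing the phrase, or None."""
    normalized = phrase.lower()
    for label, words in SEMANTIC_ELEMENTS.items():
        if normalized in words:
            return label
    return None
```

For instance, `label_of("due to")` yields `"B1"`, while a phrase that is in none of the sets, such as a symptom, yields `None`.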
3.2 Sentence Structure
Some sentence patterns are summarized from a large
number of web texts. Below are some examples.
Each pattern is followed, after the colon, by an
example.
A+B1+C+B2: A(polyuria) B1(by) C(diabetes) B2(caused).
C+B2+A: C(diabetes) B2(bring out) A(polyuria).
C+X+B2+A+S: C(diabetes) X(is) B2(bring out) A(polyuria) S(factor).
(B2+)A+S+X+C: B2(bring out) A(polyuria) S(reason) X(is) C(diabetes).
C+P+B2+A: C(diabetes) P(patient) B2(feel) A(thirsty).
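As an illustration of how one such pattern could be turned into a rule, the sketch below encodes C+B2+A as a regular expression over English glosses. It rests on several assumptions: the source sentences are Chinese and would need segmentation instead of whitespace, the function name `extract_c_b2_a` is hypothetical, and the optional trailing "s" on the relation word is added only to cope with English inflection.

```python
import re

# Relation words (B2) listed in Section 3.1.
RELATION_WORDS = ["cause", "induce", "bring out", "form"]

_relation = "|".join(re.escape(word) for word in RELATION_WORDS)

# C+B2+A: a cause, then a relation word, then a symptom.
C_B2_A = re.compile(
    rf"^(?P<cause>.+?)\s+(?P<relation>{_relation})s?\s+(?P<symptom>.+?)\.?$",
    re.IGNORECASE,
)


def extract_c_b2_a(sentence):
    """Return a (symptom, relation word, cause) tuple, or None if no match."""
    match = C_B2_A.match(sentence.strip())
    if match is None:
        return None
    return (match.group("symptom"), match.group("relation"), match.group("cause"))


triplet = extract_c_b2_a("Diabetes causes polyuria.")
```

A sentence that does not fit the pattern, such as "Polyuria is caused by diabetes", returns `None` here and would be left to the A+B1+C+B2 rule.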
More than one cause may follow S (factor), and a
single sentence pattern can capture only part of
them. Therefore, when constructing sentence pattern
rules after the sentence has been split into clauses,
the semantic elements after S (factor) cannot be
treated as a single entity word; the different causes
need to be distinguished according to the punctuation
marks or words that embody peer or parallel meaning
listed above.
A+S:c1+c2+c3: A(polyuria) S(factor): c1(diabetes), c2(prostatitis), c3(bladder tumor).
c1+c2+c3+B2+A: c1(innutrition), c2(habits and customs), c3(poor working environment) B2(cause) A(swallowing pain).
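Separating the parallel causes can be sketched as a split on the parallel markers from Section 3.1. This is a hedged illustration, not the authors' rule set: the names `split_causes` and `expand_triplets` are hypothetical. Because a word marker such as "and" can also occur inside one entity ("habits and customs" above), the sketch splits on commas by default and uses the word markers only on request.

```python
import re

# Comma-only splitting (default) and splitting that also uses word markers.
COMMA = re.compile(r"\s*,\s*")
WORD_MARKERS = re.compile(
    r"\s*(?:,|\b(?:in addition|also|and|or)\b)\s*", re.IGNORECASE
)


def split_causes(span, use_word_markers=False):
    """Split the text that follows S (factor) into individual causes."""
    splitter = WORD_MARKERS if use_word_markers else COMMA
    parts = (part.strip(" .") for part in splitter.split(span))
    return [part for part in parts if part]


def expand_triplets(symptom, relation, span):
    """Build one (symptom, relation word, cause) triplet per cause."""
    return [(symptom, relation, cause) for cause in split_causes(span)]


causes = split_causes("diabetes, prostatitis, bladder tumor")
```

With the default setting, "innutrition, habits and customs, poor working environment" stays as three causes; turning on `use_word_markers` would wrongly cut "habits and customs" in two, which is why it is left off unless needed.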
These sentence patterns cover most of the syntax
used to describe symptom-cause relations. The more
complete and comprehensive the sentence patterns
are, the higher the recall of information extraction
is.
Statistics show that the above 9 sentence patterns
have the highest occurrence frequency; the specific
frequencies are shown in Table 2 below.
Table 2: Sentence rule frequency table.
Sentence structure Frequency
A+B1+C+B2 385
C+B2+A 40