MINING FARMERS PROBLEMS IN WEB-BASED TEXTUAL

DATABASE APPLICATION

Said Mabrouk, Mahmoud Rafea

The Central Lab for Agricultural Expert Systems (CLAES), P.O. Box 438, Dokki, Giza, 12311, Egypt

Ahmed Rafea, Samhaa El-Beltagy

Dept. of Computer Science, AUC, Cairo, Egypt

Dept. of Computer Science, Cairo University, Egypt

Keywords: Data Mining, Text Mining and Clustering Techniques.

Abstract: VERCON (Virtual Extension and Research Communication Network) is a web-based agricultural
application developed to improve communication between agricultural research institutions and extension
personnel for the benefit of farmers and agrarian businesses. The farmers' problems component is one of
VERCON's main components. It is used to receive farmers' problems and provide them with solutions. Over
the last five years, problems and their solutions have accumulated in a textual database. This paper
presents an integrated approach for mining these problems and their solutions. The potential of mining
and extracting information from this resource was identified with several objectives in mind, such as:
a) discovering patterns and relations that can be used to enhance the utilization of this valuable
resource, b) analyzing the solutions given for similar problems, by different experts or by the same
expert at different times, in terms of their similarities and differences, and c) creating patterns of
problems and their solutions that can be used to classify new problems and provide solutions without the
need for a domain expert.

1 INTRODUCTION

VERCON, the Virtual Extension and Research Communication Network
(http://www.vercon.sci.eg), is a help and support service. It is a
web-based application developed in Egypt by the Central Lab for
Agricultural Expert Systems (CLAES) (http://www.claes.sci.eg) through a
project between FAO and the Egyptian Ministry of Agriculture and Land
Reclamation.

This project aims to establish and improve communication between
extension and research institutions for the benefit of farmers and
agrarian businesses at the rural and village level. Improved
communication that incorporates the newest research results and the
latest technologies should ultimately improve the performance of farmers
and businesses.

The farmers' problems component is one of VERCON's main components.
Farmers describe their problems to the extension officers in the
villages, who in turn classify each problem by topic into one of four
categories (Production, Administration, Marketing and Environment) and
write a free-text description of it. Problems are then classified into
subcategories and directed to several levels of domain experts in the
directorates and the specialized institutes. Domain experts study the
problems and respond with recommended solutions.

Over the last five years, more than 10,000 problems and their solutions
have accumulated in a textual database. A problem text has three parts:
the topic of the problem (crop, weed, diagnosis, treatment, irrigation,
etc.), the description of the problem (facts, symptoms, findings) and the
questions. The following example of a problem text, translated into
English, illustrates these three parts:

"The field area is one feddan. It has been cultivated with rice variety
sakha102, by baddar, ten days ago. Mild ogizza weeds are shown. What are
the appropriate chemicals, concentration, and the rate? How and when to
use?"

Topic of the problem: rice (a kind of crop) and ogizza (a kind of weed).


Mabrouk S., Rafea M., Rafea A. and El-Beltagy S. (2010).

MINING FARMERS PROBLEMS IN WEB-BASED TEXUAL DATABASE APPLICATION.

In Proceedings of the 12th International Conference on Enterprise Information Systems - Artiﬁcial Intelligence and Decision Support Systems, pages

414-419

DOI: 10.5220/0002966904140419

 SciTePress

Problem description:
Facts: area (one feddan), rice variety (sakha102), weed type (ogizza),
age of plant (ten days), and cultivation method (baddar).
Symptoms: mild ogizza weeds.
Questions: What are the appropriate chemicals, concentration and the
rate? How and when to use?

The following is the answer to the above problem, given by a domain
expert:

"Satron with concentration 50% and 2 Litre per Feddan should be used. It
should be mixed with fine sand, after 15 days of cultivation."
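As an illustration only, the three parts of a problem text and its associated solution could be held in a simple record such as the Python sketch below. The class name `ProblemRecord` and its field names are hypothetical and are not taken from the VERCON database; the values are those of the example above, entered by hand rather than extracted automatically.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProblemRecord:
    """One farmer problem split into topic, description and questions."""
    topic: Dict[str, str]      # term -> kind, e.g. "rice" -> "crop"
    facts: Dict[str, str]      # description part: stated facts
    symptoms: List[str]        # description part: observed symptoms
    questions: List[str]       # what the farmer asks
    solution: str = ""         # expert's free-text answer, if available


# The rice/ogizza example from the text (hand-entered, hypothetical field names).
example = ProblemRecord(
    topic={"rice": "crop", "ogizza": "weed"},
    facts={
        "area": "one feddan",
        "rice variety": "sakha102",
        "weed type": "ogizza",
        "age of plant": "ten days",
        "cultivation method": "baddar",
    },
    symptoms=["mild ogizza weeds"],
    questions=[
        "What are the appropriate chemicals, concentration and the rate?",
        "How and when to use?",
    ],
    solution=(
        "Satron with concentration 50% and 2 Litre per Feddan should be used. "
        "It should be mixed with fine sand, after 15 days of cultivation."
    ),
)

print(example.topic["ogizza"])
```

Keeping the facts as key-value pairs, rather than as free text, is what later makes it possible to compare problems on named features instead of on their wording.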

Mining these problems has several objectives. First, patterns and
relations can be discovered and used to enhance the utilization of this
valuable resource. The discovered patterns and relations may point to
certain types of widespread problems and pressing needs of people living
in rural areas. Decision makers can then take the necessary actions to
tackle these pressing problems and needs of poor communities. Second,
solutions given for similar problems, by different experts or by the same
expert at different times, can be analysed in terms of their similarities
and differences, and inconsistencies can then be resolved. Third,
patterns of problems and their solutions can be created and used to
classify new problems and provide solutions. Fourth, outdated
recommendations can be identified and removed from the database. Fifth,
users of the problems database can locate problems that are similar to
theirs.

Section 2 reviews related work. Section 3 gives a methodology for mining
the problem parts: the topic, description and questions parts are
extracted from the problem text, similar problems are clustered, and the
solutions associated with each cluster are retrieved and analysed.

Section 4 illustrates the difficulties encountered when clustering
techniques were used as a means of identifying similar problems. It then
presents an alternative, more structured approach, which transforms the
problems database into a structured database, using a set of features
extracted for each set of problems, before data mining is applied. The
results of experimentation with weed control problems are discussed.
Section 5 presents conclusions and future work.

2 RELATED WORK

Mining problems and their solutions, accumulated in the textual databases
of help and support services, is a novel application of web mining.
Previous mining work focused on a single type of document. For example,
opinion mining systems consider customer documents or reviews, and all
opinion holders are of one type, the customer (Nauskawa et al., 2003;
Popescu and Etzioni, 2005; Pang and Lee, 2008). In our work, mining is
applied to two different types of documents: farmers' problem documents
and domain experts' solution documents. Furthermore, there is an
association between these two types of documents.

Data mining and text mining techniques can be used in this application in
an integrated manner. In the problem part, feature extraction, text
clustering and text analysis techniques (Salton, 1989; Ayed and K. M,
2002) are used to cluster similar problems and to analyse the problems in
terms of their dominant features and the questions asked. Data mining
techniques (Margaret, 2003) are used to discover patterns and relations
among these problems. In the solution part, feature extraction and text
analysis are used to analyse the solutions, and data mining techniques
are used to discover patterns and relations among solutions. In clusters
of problem-solution pairs, data mining techniques are used to discover
association rules (Adamo, 2000), and text analysis techniques are used to
find the similarities and differences among the solutions of similar
problems.

3 METHODOLOGY

Two modes of operation are considered: training mode and test mode.
Training mode covers grouping similar problems, extracting patterns and
relations, forming exemplars of similar problems, retrieving the
solutions associated with each cluster of problems, summarizing solutions
and forming problem-solution pairs. In test mode, the discovered
problem-solution exemplars are used to classify new problems.

3.1 Problem Analysis

Figure 1 summarises the main steps of the methodology as follows:

1. Pre-processing: an Arabic language stemmer is used to remove affixes
and stop words from the problem text.

2. Feature Extraction: two approaches are considered, a simple approach
that uses the terms of the text as features and a more sophisticated one
that identifies specific features to be extracted using compiled lists of
words from an agricultural ontology [http://www.fao.org/agrovoc].

3. Indexing: using the term frequency and inverse document frequency
scheme (TF-IDF).

4. Clustering: grouping similar problems using different clustering
algorithms, such as partitioning and agglomerative ones.

5. Summarization: problems in the same cluster are summarized in terms of
their extracted dominant features, focusing on the three parts of the
problem text, i.e., topic, symptoms, and questions.

6. Generalization: features of the texts in one cluster are generalized
using different generalization rules (Anderson and Stanislaw, 1983) to
obtain an exemplar text that represents all texts in that cluster.

7. Extracting Patterns and Relations: the association rules technique is
used to extract useful patterns and relations.

Figure 1: Framework of Problem Parts Analysis.
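The indexing step, and the similarity comparison that clustering relies on, can be sketched as follows. This is a minimal illustration, assuming one common TF-IDF variant (raw term count multiplied by log(N/df)) and cosine similarity between the resulting vectors; the function names and the toy documents are hypothetical, and real input would be stemmed Arabic problem text.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenised document.

    weight = tf * log(N / df), where tf is the raw count of the term in the
    document, N the number of documents and df the number of documents that
    contain the term. Other TF-IDF variants exist; this is one common form.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical problem texts, already stemmed and stop-word filtered.
docs = [
    ["rice", "weed", "ogizza", "herbicide"],
    ["rice", "weed", "doniba", "herbicide"],
    ["rice", "seed", "rate", "feddan"],
]
vecs = tf_idf_vectors(docs)
sim_weed = cosine(vecs[0], vecs[1])    # two weed problems
sim_other = cosine(vecs[0], vecs[2])   # weed problem vs seeding-rate problem
print(sim_weed > sim_other)
```

Note that a term occurring in every document, such as "rice" here, receives weight zero under this variant, so it does not contribute to the similarity.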

3.2 Solution Analysis

Figure 2 illustrates the methodology used to analyse the solution parts.
Clusters of problems are used to retrieve their associated solutions from
the textual database. Solution texts are pre-processed by removing stop
words and affixes using an Arabic language stemmer. Features are
extracted using the same approaches used for the problem parts. Texts are
summarized in terms of their similarities and differences. Pairs of
problem and solution summaries are stored.
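One simple way to summarize the solutions of a cluster in terms of similarities and differences is to separate the features on which all solutions agree from those on which they differ. The sketch below assumes the solution features have already been extracted into dictionaries; the function name `compare_solutions`, the feature names and the sample values are hypothetical and serve only to illustrate the idea.

```python
def compare_solutions(solutions):
    """Split extracted solution features into shared and differing ones.

    solutions: list of {feature: value} dicts for one cluster of problems.
    Returns (common, differing). common maps each feature on which all
    solutions agree to its value; differing maps every other feature to the
    set of values seen (None marks a solution that omits the feature).
    """
    features = set().union(*(s.keys() for s in solutions))
    common, differing = {}, {}
    for f in sorted(features):
        values = {s.get(f) for s in solutions}
        if len(values) == 1:
            common[f] = values.pop()
        else:
            differing[f] = values
    return common, differing


# Hypothetical extracted features for two expert answers to similar problems.
solutions = [
    {"herbicide": "satron", "concentration": "50%", "rate": "2 litre/feddan", "days": "15"},
    {"herbicide": "satron", "concentration": "50%", "rate": "2 litre/feddan", "days": "8-9"},
]
common, differing = compare_solutions(solutions)
print(common)
print(differing)
```

In this toy case the two answers agree on the herbicide, its concentration and its rate, and differ only in the application time, which is the kind of inconsistency the analysis is meant to surface.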

4 EXPERIMENTS

Several experiments were conducted to investigate the use of clustering
techniques as a means of identifying similar problems. GCLUTO
[http://glaros.dtc.umn.edu], a clustering toolkit, was used. Different
clustering methods, such as bisection, K-means and agglomerative
clustering, were tried with various selected numbers of clusters. Terms
in the problem description were used as features, and their weights were
calculated using the TF * IDF model.

Figure 2: Framework of Solutions Parts Analysis.

Clustering was applied to three classes of rice crop problems: weed
control, seeding rate and land preparation. The aim of the experiments
was to investigate whether the simple approach of clustering similar
complaints based on their wording would work. Our assumption was that the
wording of the problems may vary too much for such an approach to work,
but we decided to pursue it in order to validate this assumption.
Clustering based on bag-of-words features can also serve as a tool for
analysing a sample of input complaints. Identifying and extracting
features then constitutes the next step, followed by formalizing a
similarity function.

The GCLUTO clustering tool was used for experimentation. GCLUTO takes
vectors and clusters them based on their similarity (different similarity
measures are supported, but the one used is cosine similarity) and on the
number of desired clusters (k), which the user specifies beforehand.
Experimentation was carried out with various values of k. The goal of any
clustering task is to maximize the similarity of the contents of each
cluster and, at the same time, to maximize the distance between clusters.
Two metrics, intra-cluster and inter-cluster distance, are used to
evaluate these two criteria respectively.

[Figure 1 diagram labels: Problems, Pre-processing, Feature Extraction, Indexing, Clustering, Clusters of Problems, Summarization, Generalization, Exemplars, Extract patterns & relations, Patterns & Relations]

[Figure 2 diagram labels: Textual Database, Clusters of problems, Retrieve Solutions, Solutions, Pre-processing, Feature Extraction, Summarization, Pairs of Problem and Solution Exemplars]

ICEIS 2010 - 12th International Conference on Enterprise Information Systems

Table 1: Experiment 1 with target number of clusters = 10.

4.1 Clustering Results

In the seeding rate and land preparation classes, only the problem parts
were used in clustering, while for weed control, clustering was carried
out using both the problem and the solution parts.

Table 1 shows the result of the experiment with the seeding rate data
set, using number of clusters = 10 (where Size = number of problems in
the cluster, ISim = average intra-cluster similarity, ISdev = standard
deviation of ISim, ESim = average inter-cluster similarity, and ESdev =
standard deviation of ESim).
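To make the two averages concrete, the sketch below computes them for one cluster, assuming that ISim is the mean pairwise cosine similarity among the members of the cluster and ESim the mean cosine similarity between its members and all other items. This is an interpretation for illustration; GCLUTO's exact computation may differ, and the function name and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_similarities(vectors, labels, cluster):
    """Return (ISim, ESim) for one cluster under the assumptions above.

    ISim: mean cosine over all pairs of distinct members of the cluster
          (set to 1.0 by convention for a single-member cluster).
    ESim: mean cosine between each member and each non-member.
    """
    inside = [v for v, l in zip(vectors, labels) if l == cluster]
    outside = [v for v, l in zip(vectors, labels) if l != cluster]
    pairs = list(combinations(inside, 2))
    isim = sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    cross = [cosine(a, b) for a in inside for b in outside]
    esim = sum(cross) / len(cross) if cross else 0.0
    return isim, esim


# Toy vectors (hypothetical) forming two well-separated clusters, 0 and 1.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = [0, 0, 1, 1]
isim0, esim0 = cluster_similarities(vectors, labels, 0)
print(isim0 > esim0)
```

A well-formed cluster has a high ISim and a low ESim; when the two values approach each other, the clusters overlap.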

This experiment was repeated with different numbers of clusters (15, 20,
25). Figure 3 summarizes the intra and inter cluster distances. The
graphs indicate that both intra and inter cluster distances grow when the
desired number of clusters is increased. This might indicate a high
degree of overlap in the created clusters.

Similar experiments were done with the land preparation problem parts,
while for weed control, clustering was carried out using both the problem
and the solution parts.

Analysing the contents of the various clusters obtained with the weed
control data set revealed that the actual distribution of similar
problems amongst the clusters is more scattered than in the seeding rate
and land preparation data sets. This can be attributed to the fact that
solutions were included in the clustering process and that there is a lot
of overlap in the solutions' text, which means that problems pertaining
to the same weed are not clustered together.

Figure 3: Intra and Inter cluster Distances.

These results revealed that clustering using the vector space model,
where the terms in a problem represent its features, is inappropriate for
this kind of task in this domain. The main reason is that similarity is
primarily determined by matching the wording of the problems. As a
result, two similar problems with different formulations may not be
considered similar, while two different problems with a high degree of
overlap in terminology that differ in one major term may match. These
results have driven us to design a more structured approach: extracting
features, storing them in a database, and then carrying out mining on
that database.
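The wording effect can be reproduced on a toy scale. The sketch below uses plain word overlap (Jaccard similarity) as a simplified stand-in for the TF-IDF and cosine model used in the experiments; the three English sentences are hypothetical stand-ins for problem texts and are not taken from the database.

```python
def jaccard(a, b):
    """Share of distinct terms that two token lists have in common."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


# Two problems about the same weed, worded differently (hypothetical).
same_weed_a = "ogizza weeds appeared in rice field ten days after sowing".split()
same_weed_b = "how to control ogizza that spread among young rice plants".split()

# A problem about a different weed, worded almost identically to the first.
other_weed = "doniba weeds appeared in rice field ten days after sowing".split()

sim_same_weed = jaccard(same_weed_a, same_weed_b)
sim_other_weed = jaccard(same_weed_a, other_weed)
print(sim_same_weed < sim_other_weed)
```

The two ogizza problems share only two terms, whereas the ogizza and doniba problems differ in a single, decisive term, so a wording-based measure ranks the wrong pair as more similar.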

4.2 Structured Database

Weed control problems were analysed during the clustering experiments in
order to populate a structured database of weed problems, and the
features to be extracted were determined. Table 2 shows the dominant
features of these problems and their sources.

For simplicity, weed names and herbicide names are extracted by means of
lists of known weeds and herbicides. This is practical because these
lists are not long and can easily be obtained from agricultural
resources. However, exact matching between the weed or herbicide names in
the problems and the entries in the lists is not always possible, because
the experts offering the solutions very often misspell both. A rapid
intelligent string matching utility has therefore been built to determine
whether an entry in the text matches an entry in the lists.
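The paper does not describe how the string matching utility works. As a rough illustration of approximate matching against a known list, the sketch below uses `difflib.get_close_matches` from the Python standard library; the list contents (the herbicide names that appear in Table 3), the function name `match_name` and the cutoff value are assumptions, and the real utility operates on Arabic text.

```python
import difflib

# Assumed reference list, using the herbicide names that appear in Table 3.
KNOWN_HERBICIDES = ["satron", "cafrosatron", "aniloguard", "nomini", "machit", "bazgran"]


def match_name(token, known=KNOWN_HERBICIDES, cutoff=0.8):
    """Return the closest known name for a possibly misspelled token, or None.

    cutoff is the minimum similarity ratio (0 to 1) accepted as a match; the
    value 0.8 is an arbitrary choice for this sketch.
    """
    hits = difflib.get_close_matches(token.lower(), known, n=1, cutoff=cutoff)
    return hits[0] if hits else None


print(match_name("Satrun"))   # misspelling of a listed herbicide
print(match_name("rice"))     # not a herbicide name
```

The cutoff trades recall for precision: a lower value tolerates heavier misspellings but risks mapping unrelated words onto list entries.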

4.3 Discovered Patterns and Relations

Association rules were applied to a subset of the structured database of
weed problems and their solutions. A multiple frequency item set method
was used to find useful patterns and relations among selected features.
The strength and confidence of each feature association were computed,
and the minimum strength and confidence thresholds were set to different
levels. Several interesting patterns and relations were found. The
following are some of the discovered patterns and relations:

Pattern 1: The most frequently occurring weeds and their occurrence
frequency. This is obtained by applying selected thresholds to the “weed
name” one-item set. The geographical distribution of the problem can also
be detailed alongside weed names.

Pattern 2: The distribution pattern of weed problems among planting
methods. This is obtained using the “Planting type” one-item set.

Pattern 3: The most commonly used herbicides and their occurrence
frequency. This is obtained by applying the selected thresholds to the
“herbicide name” one-item set.

Relation 1: Relationship between a certain weed and a specific herbicide.
This relationship is obtained using a two-item set that includes “weed
name” and “herbicide”. Herbicide-related attributes, such as
concentration, rate and reference, can also be presented to the user.
Table 3 gives an example of this relation for the weed "doniba", which
has a strength of 22%.

Table 2: Dominant features of weed control problems and their sources.

Feature | Source
Weed name | Problem or solution text
Weed age in days | Problem or solution text
Field type (nursery/production field) | Problem text or deduction rule
Planting method (seedlings/seeds) | Problem text or deduction rule
Control method (chemical, manual) | Solution text
In case of chemical control, the following are possible additional features:
Herbicide name | Solution text
Herbicide concentration (percentage) | Solution text
Rate of application | Solution text
Unit for rate of application (kg/feddan or litres/feddan) | Solution text
Application method (free text representing the solution) | Solution text
Application time | Solution text
Application reference (after transplantation / after planting seeds) | Solution text
Problem metadata:
Problem ID | From VERCON's database
Crop name | From VERCON's database
Problem's solution date | From VERCON's database
Originating governorate | From VERCON's database

Table 3: Example of the weed-herbicide relation for the weed Doniba (22%).

Herbicide name / other | Concentration | Rate per feddan | Application reference | Application range (days)
Satron (54%) | 50% | 2 Litre | Since "Shatel" (50%) | 3-4 (64%), 1-7 (36%)
 | | | Since cultivation (32%) | 7-10 (57%), 8-9 (29%), 8-8 (14%)
 | | | Since seeding (14%) | 8-9 (100%)
 | | | Unspecified (4%) |
Cafrosatron (22%) | 50% | 2 Litre | Since cultivation (44.4%) | 7-10 (75%), 8-9 (25%)
 | | | Since "Shatel" (33.3%) | 1-7 (100%)
 | | | Since seeding (22.3%) | 8-9 (100%)
Aniloguard (7.3%) | 30% | 7503 | Since "Shatel" | 5-10 (100%)
Nomini (4.9%) | 20% | 800 | Since seeding | 14-18 (100%)
Machit (2.4%) | 60% | 1.5 Litre | Unspecified (100%) |
Bazgran (2.4%) | 50% | 1.5 Litre | Since "Shatel" | 12-15 (100%)

Relation 2: Relationship between the control method and the control time.
This relationship is obtained using a two-item set that includes the
features “Control method” and “Application time”.

Relation 3: Relationship between herbicides and control times. This
relationship is obtained using a two-item set that includes the features
“herbicide” and “application time”.

Relation 4: Breakdown of weeds into wide and narrow weeds and their
occurrence frequency, as well as the relationship between generalized
weeds and herbicides.
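The patterns and relations above reduce to counting one-item and two-item sets over the structured records. The sketch below illustrates this under the assumption that "strength" corresponds to the usual support measure of association rule mining (the share of records containing the item set) and that confidence is the share of records with the antecedent that also contain the consequent. The records and function names are hypothetical.

```python
from collections import Counter

# Hypothetical structured records, one per solved weed problem.
records = [
    {"weed": "doniba", "herbicide": "satron"},
    {"weed": "doniba", "herbicide": "satron"},
    {"weed": "doniba", "herbicide": "cafrosatron"},
    {"weed": "ogizza", "herbicide": "satron"},
    {"weed": "ogizza", "herbicide": "nomini"},
]


def item_set_counts(records, features):
    """Count each combination of values for the given features."""
    return Counter(tuple(r[f] for f in features) for r in records)


def rule_stats(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    antecedent and consequent are (feature, value) pairs.
    """
    (fa, va), (fc, vc) = antecedent, consequent
    n = len(records)
    n_a = sum(1 for r in records if r[fa] == va)
    n_both = sum(1 for r in records if r[fa] == va and r[fc] == vc)
    support = n_both / n if n else 0.0
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence


# One-item set, as in Pattern 1: how often each weed occurs.
weed_freq = item_set_counts(records, ["weed"])

# Two-item set, as in Relation 1: doniba -> satron.
support, confidence = rule_stats(records, ("weed", "doniba"), ("herbicide", "satron"))
print(weed_freq.most_common(1), support, confidence)
```

Applying minimum thresholds then amounts to keeping only the item sets whose support and confidence exceed the chosen levels.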

5 CONCLUSIONS

Mining textual databases accumulated through the use of help and support
services is a new web mining application. The farmers' problems module of
the Virtual Extension and Research Communication Network (VERCON) is one
of these services.

A methodology for mining both the farmers' problems and their solutions
was presented. Clustering experiments were carried out using subsets of
the complaints database. The results of these experiments revealed that
clustering using the vector space model, where the terms in a problem
represent its features, is inappropriate.

A more structured approach, which extracts features and transforms the
concerned subsets of the database into a structured database before
mining is applied, was developed and applied to the weed control problem.
The results of the experiments and examples of the discovered patterns
and relations were discussed.

The following activities are under investigation:

1. Developing an automatic approach to extract dominant problem
features.

2. Devising a method to generalize similar problems and their solutions
into problem-solution exemplar pairs, and using the created exemplars to
classify new problems and automatically find their solutions without the
need for human experts.

3. Investigating the use of opinion mining techniques to analyse the
expert's solution to a problem as an opinion on how to solve it, where
the expert is the opinion holder, the problem is the object of the
opinion and the solution is the opinion or view of the expert.

ACKNOWLEDGEMENTS

This work has been supported by the Egyptian Science and Technology fund
# 79/2009.

REFERENCES

Nauskawa, Yi, Bunescu, T., and Niblack, R., 2003. Sentiment analyser: Extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the Third IEEE Conf. on Data Mining. Melbourne, Florida, USA.

Popescu, A., and Etzioni, O., 2005. Extracting product features and opinion from reviews. In Proceedings of HLT-EMNLP, Vancouver, Canada.

Salton, G., 1989. Automatic Text Processing. Addison-Wesley.

Ayed, H., and K. M, 2002. Topic discovery from text using aggregation of different clustering methods. In 15th Conference of the Canadian Society for Computational Studies of Intelligence.

Dy, J., and Brodley, C., 2004. Feature Selection for Unsupervised Learning. The Journal of Machine Learning Research, Vol. 5, pp. 845-889.

Margaret, H., 2003. Data Mining: Introductory and Advanced Topics. Pearson Education, Inc.

Jean Marc Adamo, 2000. Data Mining for Association Rules and Sequential Patterns. New York, Springer-Verlag.

Bo Pang and Lillian Lee, 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, Vol. 2, No. 1-2.

Bing Liu, 2008. Opinion Mining and Summarization - Sentiment Analysis. Tutorial given at WWW-2008, Beijing, April 21, 2008.

Dave, K., Lawrence, S. and Pennock, D., 2003. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews.

AGROVOC. http://www.fao.org/agrovoc.

John Robert Anderson, and Ryszard Stanislaw, 1983. Machine Learning: An Artificial Intelligence Approach.

GCLUTO. http://glaros.dtc.umn.edu/gkhome/views/cluto/

VERCON, 2006. http://www.vercon.sci.eg.