• Alternative Hypothesis $H1_{RP}$: the percentage averages of reduction of unique terms per document are not all equal across the four collections ($\mu_i^{RP} \neq \mu_j^{RP}$ for at least one pair $(i, j)$).
Selection of Participants and Objects. The documents of each collection were chosen randomly, taking into consideration their number of characters. The quantity of documents was determined by the sample size calculation for a finite population:
n = \frac{z^2 \cdot \sigma^2 \cdot N}{e^2 \cdot (N - 1) + z^2 \cdot \sigma^2}    (1)
where n is the sample size, z is the standardized value (we adopted 1.96, i.e., a 95% confidence level), σ is the standard deviation of the population, e is the margin of error (we adopted 5% of σ), and N is the population size. Table 2 shows the number of documents selected for each collection after the sample size calculation, along with the population size, mean, and standard deviation.
Table 2: Sample size per collection (µ and σ refer to the number of characters per document).
Coll.  N        µ          σ         n
JAC    181,994  11,626.65  8,270.08  1,524
MAC    37,142   8,396.27   6,940.01  1,476
JSC    37,161   9,509.41   5,718.97  1,476
MSC    23,151   6,569.90   4,009.80  1,442
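As an illustration of Equation (1), the minimal sketch below reproduces the sample size of the JAC collection from the values in Table 2 (the class and method names are ours, used only for illustration and not part of the original instrumentation):

// Minimal sketch: sample size for a finite population, Equation (1).
// The constants correspond to the JAC collection in Table 2.
public class SampleSize {

    // n = (z^2 * sigma^2 * N) / (e^2 * (N - 1) + z^2 * sigma^2)
    static long sampleSize(double z, double sigma, double e, long N) {
        double numerator = z * z * sigma * sigma * N;
        double denominator = e * e * (N - 1) + z * z * sigma * sigma;
        return (long) Math.ceil(numerator / denominator);
    }

    public static void main(String[] args) {
        double z = 1.96;             // 95% confidence level
        double sigma = 8270.08;      // population standard deviation (JAC)
        double e = 0.05 * sigma;     // margin of error: 5% of sigma
        long N = 181_994;            // population size (JAC)
        System.out.println(sampleSize(z, sigma, e, N)); // -> 1524, matching Table 2
    }
}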
Experiment Project. The jurisprudential documents have great variability in their number of characters; thus, in order to ensure confidence in the hypothesis tests, we will utilize a randomized complete block design (RCBD) (Wohlin et al., 2012). In this way, each algorithm will be applied to the same document, and the documents will be randomly drawn from each collection, increasing the precision of the experiment. Furthermore, before applying stemming, a preprocessing step for textual standardization will be performed in which the content of the documents will be converted to lowercase and punctuation characters will be removed, as sketched below. NoStem represents the unique terms of the document with no stemming applied and therefore acts as the control group.
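A minimal sketch of this standardization step, assuming plain-string documents (the class and method names are ours and merely illustrative):

// Minimal sketch of the preprocessing step: lowercase conversion and
// punctuation removal, applied before any stemming algorithm.
import java.util.Locale;

public class Preprocessor {

    static String standardize(String document) {
        return document
                .toLowerCase(new Locale("pt", "BR"))  // shift the content to lowercase
                .replaceAll("\\p{Punct}+", " ")       // remove punctuation characters
                .replaceAll("\\s+", " ")              // collapse the resulting whitespace
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(standardize("Recurso PROVIDO, por unanimidade."));
        // -> recurso provido por unanimidade
    }
}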
Instrumentation. We developed a Java application to iterate over each document of the sample, apply the stemming algorithms, and count the frequency of unique terms after each execution. At the end, the application stores the observations in a CSV (Comma-Separated Values) file for each collection.
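The sketch below illustrates the kind of observation the application records; the stemmer interface and helper names are our assumptions, reconstructed from the CSV layout shown later in Table 3 rather than taken from the original source code:

// Hedged sketch: count the unique terms of a preprocessed document under a
// given stemmer and append one observation row to a per-collection CSV file.
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

public class Observation {

    // Number of unique terms in a document after stemming each token.
    // Passing the identity operator (t -> t) corresponds to the NoStem control group.
    static int uniqueTerms(String document, UnaryOperator<String> stemmer) {
        Set<String> terms = new HashSet<>();
        for (String token : document.split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(stemmer.apply(token));
            }
        }
        return terms.size();
    }

    // Appends an "ID,UTD,Stemmer" row in the format shown in Table 3.
    static void record(String csvPath, String docId, int utd, String stemmerName)
            throws IOException {
        try (FileWriter out = new FileWriter(csvPath, true)) {
            out.write(docId + "," + utd + "," + stemmerName + "\n");
        }
    }
}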
5 EXPERIMENT EXECUTION
5.1 Preparation
The preparation phase consisted of obtaining the collections of judicial jurisprudence. Documents were extracted from an OLTP (Online Transaction Processing) database and converted to XML (eXtensible Markup Language) format, facilitating the experiment packaging.
5.2 Execution
By the end of previous phases, the experiment started
executing the Java application, in accordance with
what was defined in the planning phase.
5.3 Data Collection
The application recorded, for each collection, the doc-
ument identifier, the number of unique terms and the
stemming algorithm adopted CSV format (Table 3).
Table 3: Input example in CSV file.
ID,UTD,Stemmer
201100205001443632662,679,NoStem
201100205001443632662,580,Porter
201100205001443632662,547,RSLP
201100205001443632662,651,RSLPS
201100205001443632662,636,UniNE
5.4 Data Validation
The Java application was built using a Test-Driven Development (TDD) approach (Agarwal and Deep, 2014); therefore, we wrote unit test cases to validate that the frequency count of unique terms per document worked as expected.
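As a hedged illustration, the JUnit-style test cases below exercise the uniqueTerms sketch from the Instrumentation section; they are ours and do not reproduce the original test suite:

// Hedged sketch of TDD-style unit tests: the unique-term count must ignore
// duplicates, and stemming must merge inflected forms into one term.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class UniqueTermCountTest {

    @Test
    public void countsEachTermOnlyOnceWithoutStemming() {
        // "recurso" appears twice, so only three unique terms remain.
        assertEquals(3, Observation.uniqueTerms("recurso provido recurso unanimidade", t -> t));
    }

    @Test
    public void stemmingMergesInflectedForms() {
        // A toy suffix-stripping stemmer maps "recursos" and "recurso" to the same stem.
        assertEquals(1, Observation.uniqueTerms("recurso recursos",
                t -> t.endsWith("s") ? t.substring(0, t.length() - 1) : t));
    }
}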
Averages of unique terms per document were computed, and the percentage averages of dimensionality reduction achieved by each stemming algorithm were obtained relative to the control group, as formalized below.
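In other words, for each stemming algorithm the reduction percentage can be read as (a hedged formalization of the measure described above, using the UTD notation of Table 3):

RP_{stemmer} = \frac{\overline{UTD}_{NoStem} - \overline{UTD}_{stemmer}}{\overline{UTD}_{NoStem}} \times 100

For the single document shown in Table 3, for instance, the Porter stemmer would give (679 − 580)/679 × 100 ≈ 14.6%.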
To support the analysis, interpretation, and validation of the results, we used five statistical tests: the Shapiro-Wilk test, the Friedman test, the Kruskal-Wallis test, the Wilcoxon test, and the Mann-Whitney test. The Shapiro-Wilk test was used to verify sampling normality, as the literature shows it has higher test power than other approaches (Ahad et al., 2011; Razali and Wah, 2011). Considering the RCBD design of the experiment, with one factor and multiple treatments, the Friedman test (Theodorsson-Norheim, 1987) and the Kruskal-Wallis test (Wohlin et al., 2012) were