convenient instructions. The participants were also
informed that the evaluator could not help them,
since the website itself provided enough guidance to
complete the tasks properly.
Any incident that occurred during the experiment
was resolved by the evaluator in the most appropriate way.
Finally, the evaluator asked the participants about
the difficulties they had encountered, in order to gain
better insight into the results of the experiment.
3.2.4 Metrics
The metrics we used in the experiment were the
following:
- Efficiency: As the efficiency metric we used
"Task Time", the time that each user spends
completing each task. We computed the mean time
across all users for each task.
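As a minimal sketch of this computation (the task-time values below are made up for illustration; the real measurements come from the experiment logs), the mean Task Time per task can be obtained as:

```python
# Hypothetical Task Times (in seconds) recorded per user, for each task.
task_times = {
    "task1": [120, 95, 140, 110],
    "task2": [60, 75, 55, 80],
}

# Mean Task Time for each task, averaged across all users.
mean_task_time = {
    task: sum(times) / len(times) for task, times in task_times.items()
}
print(mean_task_time)  # -> {'task1': 116.25, 'task2': 67.5}
```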
- Effectiveness: Effectiveness was measured
using the metrics "Task Completion" and "Error
Frequency". With these metrics we can obtain the
number of tasks that users did not complete and the
number of errors the users made on each task.
In this case we considered a task completed when
the user had done what we asked, regardless of
whether it was done properly or not.
In addition to the raw numbers of completed and
uncompleted tasks, we presented the results as
percentages to allow a simpler analysis.
A task is not completed if, for example, we
asked the user to buy something and they added the
products to the shopping cart but did not pay
for them.
The results of the "Error Frequency" metric
allow us to evaluate the errors that each user
made. If one of the tasks was to buy a brown skirt
but the user bought a different item, the task was
considered completed but with errors.
Likewise, if the task consisted of noting down the
price of a certain shirt and the user wrote down the
price of another shirt, the task was considered
completed but with errors.
Another possible way to define these metrics
would have been to consider a task with errors
as not completed.
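The two effectiveness measures described above can be sketched as follows (the per-user outcomes are invented for illustration; each user contributes a completed-or-not flag plus an error count per task):

```python
# Hypothetical per-task outcomes: one (completed, number_of_errors)
# pair per user. In the experiment these come from the observation logs.
outcomes = {
    "task1": [(True, 0), (True, 2), (False, 1), (True, 0)],
}

# "Task Completion" expressed as a percentage, per task.
completion_pct = {
    task: 100 * sum(1 for done, _ in results if done) / len(results)
    for task, results in outcomes.items()
}

# "Error Frequency": total number of errors made on each task.
error_freq = {
    task: sum(errs for _, errs in results)
    for task, results in outcomes.items()
}
print(completion_pct, error_freq)  # -> {'task1': 75.0} {'task1': 3}
```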
- Satisfaction: Satisfaction was measured using
a satisfaction test created from some questions of
the SUSS evaluation test, developed by Constantine
(Constantine et al., 1999) to measure key elements
of interface design such as personal taste, aesthetics,
organization, understandability, and learnability.
Other questions of the test are based on the SUMI
test, accepted by expert evaluators and international
organizations such as ISO.
Our test consists of about 20 questions that allow
us to evaluate in a simple way the satisfaction of the
users who carried out the four tasks on the website.
Each user answered every question by choosing an
option between 1 and 5, as on a Likert scale, except
for the first two and the last three questions, which
used a three-point scale.
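A minimal sketch of how such Likert-scale answers can be summarized per question (the answer values are made up; the real ones come from the filled-in questionnaires):

```python
# Hypothetical answers (1-5 Likert scale) of five users to one
# question of the satisfaction test.
answers = [4, 5, 3, 4, 5]

# Mean satisfaction score for this question across all users.
mean_score = sum(answers) / len(answers)
print(mean_score)  # -> 4.2
```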
- Comprehensibility and learning: Finally, the
learning metric was measured. This metric
measures how well the user has learned to use the
website. To measure it, we used a specific question
of the satisfaction test and also compared the
"Task Time" of Task 1 and Task 4, since both
tasks consisted of buying something. Thus, if the
users had learned, the Task Time for Task 4 should
be lower than the Task Time for Task 1, as indeed
occurred.
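The learning check described above amounts to a simple comparison of the two mean Task Times (the numbers below are hypothetical; the real ones are the measured means for the two purchase tasks):

```python
# Hypothetical mean Task Times (seconds) for the two purchase tasks.
mean_time_task1 = 116.25
mean_time_task4 = 82.5

# Learning is indicated when the later purchase task (Task 4)
# takes less time than the earlier one (Task 1).
learned = mean_time_task4 < mean_time_task1
print(learned)  # -> True
```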
3.2.5 Results
Before comparing the data obtained from the metrics
in the different versions of e-fashion, we can say that
the association established between the interaction
patterns and the Guide feature is good in general and
notably improves the quality of the websites.
e-fashion 5 was a special case: the server went
down for some minutes, so some tasks could not be
completed and some Task Times were higher than
expected. The results for e-fashion 5 were therefore
interpreted with caution, taking this unexpected
situation into account.
3.2.5.1 Comparison of Mean Task Times
Figure 2 shows a graph in which the mean task
times are represented for each version of e-fashion.
We highlight a curious effect in the versions of
e-fashion that use all, or some, of the patterns of
the "Guide" criterion: these sites had more
information to load, for example a larger number of
images, than e-fashion 4, the "worst" version of all.
For this reason, the loading time of these pages was
longer than the loading time of e-fashion 4.
A CONTROLLED EXPERIMENT FOR MEASURING THE USABILITY OF WEBAPPS USING PATTERNS