Before the application of heuristics, the reverse
engineering software (RES) creates the object model
of each page, i.e. a tree-structured representation of
the page components (tables, divisions, forms, fields,
etc.). The HTMLParser (htmlparser.sourceforge.net)
package was chosen for this purpose. The heuristics
for each type of component (TSE, TSE group, form,
and TS) are presented in the following paragraphs.
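As an illustration of the object model step, the following is a minimal sketch of a tree-structured page representation. It uses Python's standard html.parser module as a stand-in for the HTMLParser package named above; the Node, TreeBuilder, build_tree and find_all names are hypothetical and are not part of the RES.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML (partial list, for the sketch).
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class Node:
    """One page component: tag name, attributes, text and child components."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from the parser's tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend only into elements that can have content

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Returns all descendants (and the node itself) with the given tag."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found
```

With such a tree, a heuristic can start from an input node and inspect its parent cell or sibling cells, which is what the positional rules in the following paragraphs rely on.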
4.2.1 Identifying Transaction Service Elements
A transaction service element (TSE) in the
SmartGov platform is a compound object
encompassing the input area and its properties
(HTML input type, size, maximum length, initial
value), the input area label, help texts (commonly
provided as hyperlinks or as extended in-place text),
the validation rules that apply to the values entered
(data type, mandatory input, allowable ranges, etc.)
and, finally, its relationships with other elements.
The first task towards TSE identification is to
locate the widgets allowing for data input. HTML
provides four basic input widgets, namely input,
select, textarea and button. For each such construct a
respective TSE is created, except for the case of
inputs of type radio, for which a single TSE is
created for all input instances with the same value
for the name attribute. The reverse engineering
process subsequently locates information for the
additional aspects of the TSE as follows:
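Before turning to the individual aspects, the widget-to-TSE rule stated above (one TSE per widget, but a single TSE for all radio inputs sharing a name) can be sketched as follows. This is an illustrative sketch only; the WidgetCollector class, the collect_tses function and the dictionary layout are assumptions, not the RES data model.

```python
from html.parser import HTMLParser

WIDGET_TAGS = {"input", "select", "textarea", "button"}


class WidgetCollector(HTMLParser):
    """Creates one TSE record per input widget, merging radio inputs by name."""

    def __init__(self):
        super().__init__()
        self.tses = []            # TSE records in document order
        self._radio_groups = {}   # radio name -> TSE record already created

    def handle_starttag(self, tag, attrs):
        if tag not in WIDGET_TAGS:
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        is_radio = tag == "input" and (attrs.get("type") or "").lower() == "radio"
        if is_radio and name in self._radio_groups:
            # Same name as an earlier radio input: extend the existing TSE.
            self._radio_groups[name]["widgets"].append(attrs)
            return
        tse = {"name": name, "kind": "radio" if is_radio else tag, "widgets": [attrs]}
        self.tses.append(tse)
        if is_radio:
            self._radio_groups[name] = tse


def collect_tses(html):
    collector = WidgetCollector()
    collector.feed(html)
    collector.close()
    return collector.tses
```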
Firstly, the TSE label is determined. The form is
initially scanned for a label element whose for
attribute matches the input element name (e.g.
<label for="fname">First Name</label>), or for a
label element enclosing the input area definition
(e.g. <label>First Name <input type="text"
name="fname"></label>). If
such an element is found, the text specified in the
label element is used as the TSE label. If no such
label is found, the RES attempts to determine the
label by its positioning relative to the input area: the
label may be placed on the left of the input area
(figs. 2, 3 and bottom half of fig. 1), or above the
input area (upper half of fig. 1). Note that the text
may be formatted using tables, thus “left” does not
necessarily refer to HTML code immediately
preceding the input tag, but may be the text included
in the table cell appearing on the left of the field
under examination. The RES takes into account the
case that an extra column, indicating whether the
field is mandatory or not, intervenes between the
input area and the label field (fig. 2).
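The label lookup order described above can be sketched as follows: an explicit label element first, then an enclosing label element, then the nearest table cell to the left, skipping a cell that holds only a mandatory marker. The sketch matches the for attribute against the input name, as the text describes (in standard HTML the for attribute refers to the element id). The "label above" case is omitted, and the LabelFinder and find_label names are hypothetical.

```python
from html.parser import HTMLParser

INPUT_TAGS = ("input", "select", "textarea", "button")


class LabelFinder(HTMLParser):
    """Collects label candidates for the input widget named `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.explicit = None    # text of <label for=target>
        self.enclosing = None   # text of a <label> wrapping the input
        self.left_cell = None   # text of the nearest non-marker cell on the left
        self._label = None      # [for value, text, wraps target] while inside a label
        self._row = []          # texts of the cells opened so far in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._label = [attrs.get("for"), "", False]
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._row.append("")
            self._in_cell = True
        elif tag in INPUT_TAGS and attrs.get("name") == self.target:
            if self._label is not None:
                self._label[2] = True
            if self._in_cell and self.left_cell is None:
                # Walk leftwards over earlier cells, skipping empty cells and
                # a column that only carries the mandatory marker.
                for text in reversed(self._row[:-1]):
                    text = " ".join(text.split())
                    if text and text != "*":
                        self.left_cell = text
                        break

    def handle_endtag(self, tag):
        if tag == "label" and self._label is not None:
            for_value, text, wraps_target = self._label
            text = " ".join(text.split())
            if for_value == self.target and self.explicit is None:
                self.explicit = text
            elif wraps_target and self.enclosing is None:
                self.enclosing = text
            self._label = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._label is not None:
            self._label[1] += data
        if self._in_cell and self._row:
            self._row[-1] += data


def find_label(html, name):
    finder = LabelFinder(name)
    finder.feed(html)
    finder.close()
    return finder.explicit or finder.enclosing or finder.left_cell
```

A call such as find_label(page_html, "fname") returns the first candidate found in that priority order, or None when no rule applies.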
Afterwards, the help items for the field are
located. The help items may be located at the right
of the input area, either as directly following HTML
code (fig. 1) or within an adjacent table cell (fig. 2).
In some cases, only a hyperlink may be present
which has to be clicked to display the help content.
In such cases, the RES retrieves the content pointed
to by the help anchor, and packs this content within
the TSE; the label text (determined in the previous
step) is also scanned for the presence of hyperlinks. If
such hyperlinks are found, the content pointed to by
each hyperlink is extracted and packed with the TSE
as a help item. This step may produce multiple help
items for a single TSE. Additional help items may
be determined from code analysis (described below).
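The hyperlink scan over the label or help markup can be sketched as below. The sketch only collects and resolves the link targets; retrieving the pointed-to content is left out. The help_links function, the AnchorCollector class and the example.org URLs used with it are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the absolute targets of all anchors in an HTML fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page being analyzed.
                self.links.append(urljoin(self.base_url, href))


def help_links(fragment, base_url):
    collector = AnchorCollector(base_url)
    collector.feed(fragment)
    collector.close()
    return collector.links
```

Each returned URL would correspond to one candidate help item, which is consistent with a single TSE ending up with several help items.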
The next step is to extract an initial indication of
whether a TSE is considered mandatory or not. The
presence of an asterisk either packed within the label
(at its beginning or end – fig. 3) or as a separate
table column (fig. 2) is used as such an initial
indication. An additional check to determine
whether some input element is mandatory or not is
performed in the code analysis phase (see below).
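The asterisk heuristic can be sketched as a small function that checks both placements (inside the label, or in a separate column) and returns the cleaned label together with the initial mandatory flag. The function name and its marker_cell parameter are hypothetical.

```python
def split_mandatory_marker(label, marker_cell=None):
    """Returns (label without asterisk, initial mandatory indication).

    `marker_cell` is the text of a separate table column, if the form has one.
    """
    text = label.strip()
    mandatory = False
    if text.startswith("*"):
        mandatory, text = True, text[1:].strip()
    elif text.endswith("*"):
        mandatory, text = True, text[:-1].strip()
    if marker_cell is not None and marker_cell.strip() == "*":
        mandatory = True
    return text, mandatory
```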
Subsequently, the default value for the input area
is determined by examining the settings of the
HTML attributes associated with the input area (e.g.
the “value” attribute for text boxes and buttons, the
“checked” attribute for check boxes, etc.). The values
of the “maxlength”, “size”, “rows” and “cols”
attributes, whenever present, are also extracted and
bundled as properties of the TSE under construction.
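A minimal sketch of this attribute extraction is given below. It assumes the widget attributes are available as a dictionary in which valueless attributes such as checked map to None (the convention of Python's html.parser); the tse_properties name and the "default" key are illustrative. Only the cases named in the text are covered.

```python
SIZE_ATTRS = ("maxlength", "size", "rows", "cols")


def tse_properties(tag, attrs):
    """Extracts the default value and size-related properties of one widget."""
    props = {key: attrs[key] for key in SIZE_ATTRS if key in attrs}
    input_type = (attrs.get("type") or "text").lower() if tag == "input" else None
    if input_type in ("checkbox", "radio"):
        # The default state is given by the presence of the attribute, not its value.
        props["default"] = "checked" in attrs
    elif tag in ("input", "button"):
        props["default"] = attrs.get("value", "")
    return props
```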
For input elements with a closed set of values
(such as select widgets and radio buttons), the set of
values is examined to determine the data type of the
input element. If all the values within the set are of
the same type (integers, floats, dates, etc.), the data
type of the TSE under construction is set
accordingly; otherwise, the data type is set to
“string”. Data type inference for input elements with
an open set of values (free user type-in) is handled
through code analysis (described below).
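The closed-set rule can be sketched as follows: the value set receives the first type that every value satisfies, and falls back to "string" otherwise. The accepted date formats and the decision to ignore blank values (such as an empty placeholder option) are assumptions made for the sketch, not choices stated in the text.

```python
from datetime import datetime

# Date formats accepted by the sketch (an assumption).
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


# Checked in order, so a set of integers is not reported as floats.
CHECKS = (("integer", _is_int), ("float", _is_float), ("date", _is_date))


def infer_type(values):
    """Returns the common data type of a closed value set, or "string"."""
    values = [v.strip() for v in values if v and v.strip()]  # ignore blank values
    if not values:
        return "string"
    for type_name, check in CHECKS:
        if all(check(v) for v in values):
            return type_name
    return "string"
```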
The TSE properties listed above can be directly
determined from the attribute values of the input
elements or from text placement in relation to the
input element. However, some important aspects of
TSEs, namely the data type, whether a TSE is
read-only or not, as well as validation checks, may
not be directly modeled as attribute values; instead,
e-service developers use JavaScript to provide these
features. In order to determine these features, the
RES analyzes the JavaScript code associated with
input element events. This analysis may also reveal
additional help items and supplementary indications
on whether the TSE is mandatory or not. JavaScript
code analysis is based on heuristics, since rigorous
semantic analysis was considered excessive for the
issues at hand, taking also into account that the
ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES