General Information: Identification metadata such as name, description and authorship.
Versioning Control: Text file formats may change without any prior notification. To address this issue, each FFD contains a pair of date values defining the time frame in which the FFD is valid.
Processing: Thread priority to be applied during the
ETD process.
Sectioning: The first step in the extraction of data consists of partitioning the text file into non-overlapping sections that identify different areas of data within the text file. These sections are generally easy to identify since they usually share some common property (e.g. all lines starting with a given prefix, or all lines falling between a start/end delimiter condition). Each section is identified by a name and can be either delimited (where two boundary conditions determine the beginning and end of the section) or contiguous (defined by a common property shared by a contiguous set of text lines). Delimited sections can be defined through absolute conditions such as "file start", "file end", "line number", or the first line that "starts with", "contains" or "ends with" a given string pattern. Besides absolute conditions, delimited sections can also be defined using relative conditions such as "start section after previous section end" or "end section before next section start". Contiguous sections can be defined through one of three conditions: the group of lines that "start with", "contain" or "end with" a given string pattern. A simple BNF grammar for the proposed sectioning scheme is presented in Figure 1.
Sectioning-Spec -> (Section-Spec)*
Section-Spec -> (Contiguous | Delimited)
Contiguous -> LinesStartingWith(pattern) |
LinesContaining(pattern)|
LinesEndingWith(pattern)
Delimited -> (Start, End)
Start, End -> Relative |
Line(number) |
LineStartingWith(pattern) |
LineContaining(pattern) |
LineEndingWith(pattern)
Relative -> AfterPreviousSectionEnd |
BeforeNextSectionStart
Figure 1: A BNF grammar for sectioning.
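The contiguous case from Figure 1 can be illustrated with a short sketch. This is a minimal reading of the "LinesStartingWith(pattern)" condition, not the authors' implementation; the function name and list-based interface are assumptions:

```python
def contiguous_section(lines, prefix):
    """Collect the maximal run of consecutive lines sharing a common
    prefix -- the LinesStartingWith(pattern) condition of Figure 1.
    (Illustrative sketch; interface is assumed, not from the FFD spec.)"""
    section, inside = [], False
    for line in lines:
        if line.startswith(prefix):
            section.append(line)
            inside = True
        elif inside:
            break  # the contiguous run has ended
    return section

text = ["# header", "DATA 1", "DATA 2", "trailer"]
contiguous_section(text, "DATA")  # -> ["DATA 1", "DATA 2"]
```

The "LinesContaining" and "LinesEndingWith" variants differ only in replacing `startswith` with `in` or `endswith`.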
A set of validation rules can also be imposed on each section in order to detect (as early as possible) a change in the file's format: whether the section is optional, the minimum and/or maximum number of lines present in the section, and the existence of a given pattern at the section start, middle or end.
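The section validation rules just listed can be sketched as a single check. This is an assumed interface for illustration only, not the FFD's actual rule engine:

```python
def validate_section(lines, optional=False, min_lines=None,
                     max_lines=None, start_pattern=None):
    """Check a section against the validation rules described above.
    Returns True when the section conforms, False on a suspected
    format change. (Hypothetical sketch of the rule set.)"""
    if not lines:
        return optional  # an empty section is valid only if optional
    if min_lines is not None and len(lines) < min_lines:
        return False
    if max_lines is not None and len(lines) > max_lines:
        return False
    if start_pattern is not None and start_pattern not in lines[0]:
        return False
    return True
```

Failing any rule signals that the producer of the file has likely changed its format, so extraction can abort early instead of delivering corrupted data.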
Fields: Contains the definition of all fields that can be extracted from the contents of a section. Two types of fields are available: "single fields" and "table fields". Single fields refer to individual values present in a text file. These can be captured in one of two ways: by specifying prefix and/or suffix values, or through a regular expression. Table fields contain one or more table columns, which can be defined through fixed-width lengths, a regular expression, or by specifying a column delimiter character that separates the columns. Both single and table fields can be defined from the start of the section or at a specific line offset within the section.
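Two of the capture modes above, prefix/suffix capture for single fields and delimiter-based splitting for table fields, can be sketched as follows (function names and signatures are assumptions for illustration):

```python
import re

def capture_single(line, prefix, suffix):
    """Capture a single field by its surrounding prefix and suffix,
    as in the prefix/suffix capture mode described above."""
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    m = re.search(pattern, line)
    return m.group(1) if m else None

def capture_table(lines, delimiter=";"):
    """Split each section line into table columns using a column
    delimiter character."""
    return [line.split(delimiter) for line in lines]

capture_single("Temp: 21.5 C", "Temp: ", " C")  # -> "21.5"
capture_table(["a;b", "c;d"])                   # -> [["a", "b"], ["c", "d"]]
```

The fixed-width and regular-expression column modes would follow the same shape, slicing each line by offsets or by capture groups instead of a delimiter.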
Data quality mechanisms are available for both types of fields. Each single value or table column can be associated with a data type, a set of validation rules (e.g. minimum/maximum numeric value or text length) and a missing value representation (common in scientific data files).
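Applied to one raw value, these three mechanisms (missing-value representation, data type, validation rules) compose naturally. The sketch below assumes a hypothetical `clean_value` helper; it is not part of the described system:

```python
def clean_value(raw, cast=float, min_value=None, max_value=None,
                missing="-999"):
    """Apply per-field data quality checks: missing-value detection,
    data type conversion, and range validation. Returns None when the
    value is missing or fails a rule. (Assumed helper, for illustration.)"""
    if raw == missing:
        return None  # sentinel used by many scientific data files
    try:
        value = cast(raw)
    except ValueError:
        return None  # value does not conform to the declared data type
    if min_value is not None and value < min_value:
        return None
    if max_value is not None and value > max_value:
        return None
    return value
```

For a whole table column the same function would simply be mapped over every extracted cell.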
Independently of the type of section and field definition, both object specifications are represented internally as regular expressions. This representation is transparent to the end user, who only needs to know a set of gestures for interacting with the graphical application. Using regular expressions substantially increases text processing performance due to their pattern matching capabilities and efficient supporting libraries.
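As an example of this internal representation, a prefix/suffix field specification can be translated into one compiled regular expression (the translation shown is an assumption about how such a mapping could look, not the system's actual code):

```python
import re

def compile_field_spec(prefix="", suffix=""):
    """Translate a prefix/suffix field specification into a single
    compiled regular expression, illustrating the internal
    representation described above."""
    return re.compile(re.escape(prefix) + r"(?P<value>.*?)" + re.escape(suffix))

rx = compile_field_spec("Lat=", ";")
rx.search("Lat=41.15;Lon=-8.61;").group("value")  # -> "41.15"
```

Compiling the specification once and reusing it across every line of the section is where the performance benefit of the regex representation comes from.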
Transformations: Contains a set of transformation pipelines (i.e. directed sets of computational elements handled in sequence) to be executed, transforming raw data into a format suitable for data delivery. Two types of transformations are available: column/single field oriented and table oriented. In the first case the transformation affects only one table column or single value (e.g. appending a string to the end of each value of a selected column). In the second case the transformation affects multiple table columns (e.g. deleting a set of rows from a set of table columns given a matching criterion).
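The two transformation types can be sketched with one example of each, echoing the examples in the text (the function names are assumed, and only `AppendConstant` appears by name in the transformation list below):

```python
def append_constant(column, text):
    """Column-oriented transformation: append a string to the end of
    every value of a single column (cf. AppendConstant)."""
    return [value + text for value in column]

def delete_rows(table, column_index, criterion):
    """Table-oriented transformation: delete every row whose value in
    the given column satisfies the matching criterion."""
    return [row for row in table if not criterion(row[column_index])]

append_constant(["10", "20"], " kg")               # -> ["10 kg", "20 kg"]
delete_rows([["a", "1"], ["b", "-999"]], 1,
            lambda v: v == "-999")                 # -> [["a", "1"]]
```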
Each transformation pipeline can be mapped to a directed acyclic graph. Each start node of the pipeline refers to an extracted field, while the remaining nodes represent transformation operations. A connection between transformation nodes indicates that an output of the source transformation node is used as input by the target transformation node. Since the required transformations are closely related to the domain and to the structure in which data is presented, new transformations can be inserted as needed through a plugin architecture. Some examples of transformations are: AppendConstant, CreateDate, DateConvert, DeleteStringRows, Distribute, Duplicate, GetElement, Join, Map, and Split.
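A pipeline modelled this way can be executed by visiting its graph in topological order. The sketch below uses Python's standard `graphlib` (available from 3.9) and an assumed dictionary-based representation of fields, operations and edges; it is a minimal illustration, not the system's execution engine:

```python
from graphlib import TopologicalSorter

def run_pipeline(fields, operations, edges):
    """Execute a transformation pipeline as a directed acyclic graph:
    start nodes carry extracted field data, the remaining nodes apply a
    transformation to the outputs of their predecessors.
    (Illustrative sketch; representation is assumed.)"""
    preds = {node: set() for node in list(fields) + list(operations)}
    for src, dst in edges:
        preds[dst].add(src)
    results = dict(fields)  # start nodes hold the extracted data
    for node in TopologicalSorter(preds).static_order():
        if node in operations:
            inputs = [results[p] for p in sorted(preds[node])]
            results[node] = operations[node](*inputs)
    return results

fields = {"qty": [1, 2]}
operations = {"double": lambda col: [v * 2 for v in col]}
run_pipeline(fields, operations, [("qty", "double")])
# -> {"qty": [1, 2], "double": [2, 4]}
```

Because each operation is just a callable looked up by name, plugin transformations slot in by registering a new entry in the `operations` mapping.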
Data Delivery: A data delivery consists of an XML file containing a descriptive metadata header and a data area organized in tabular format. In order to specify a data delivery, two types