Wrappers are specialised program routines that automatically extract
data from Internet websites and convert the information into a
structured format. More specifically, wrappers have three main
functions. Firstly, they must be able to download HTML pages from
a website. Secondly, they must search for, recognise and extract
the specified data. Thirdly, they must save this data in a suitably
structured format to enable further manipulation [6]. The data
can then be imported into other applications for
additional processing. According to [20], over 80% of the published
information on the WWW is based on databases running in the background.
When this data is compiled into HTML documents, the structure of
the underlying databases is completely lost. Wrappers try to reverse
this process by restoring the information to a structured format
[21]. With the right programs, it is even possible to use the
WWW as a large database. By using several wrappers to extract
data from the various information sources of the WWW, the retrieved
data can be made available in an appropriately structured format
[4].
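
The three wrapper functions described above (download, extract, save)
can be illustrated with a minimal sketch. It is not taken from any of
the cited systems; the function names, the sample page and the choice
of CSV as the structured output format are assumptions made purely
for illustration, and only the Python standard library is used.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def download(url: str) -> str:
    """Function 1: fetch the raw HTML of a page."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TableRowExtractor(HTMLParser):
    """Function 2: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows without <td> cells (e.g. headers) are skipped
                self.rows.append(self._row)
            self._row = None


def extract(html: str):
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def save(rows, fieldnames) -> str:
    """Function 3: serialise the extracted rows in a structured format (CSV here)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fieldnames)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page content, used instead of a live download.
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Title</th><th>Price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr>
<tr><td>Gadget</td><td>24.50</td></tr>
</table></body></html>
"""

rows = extract(SAMPLE_PAGE)
csv_text = save(rows, ["title", "price"])
```

In practice the input to extract() would come from download(), which
is defined here but not called so that the sketch runs without
network access.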
As a rule, a specially developed wrapper is required for each
individual data source, because of the different and unique structures
of websites. The WWW is also extremely dynamic and continually
evolving, which results in frequent changes in the structures
of websites. Consequently, existing wrappers often have to be
updated or even completely rewritten in order to maintain the
desired data extraction capabilities [1]. The
Extensible Markup Language (XML) has the potential to alleviate
such problems. Whereas HTML is presentation oriented, XML keeps
the data structure separate from the presentation. However, it
may take some time before all data is provided in the XML format,
and it remains to be seen whether XML can establish itself in
all areas of electronic information processing [11]. Moreover,
because XML documents are based on varying Document Type
Definitions (DTDs) or XML Schemas, XML can reduce the current
problems of data extraction from HTML documents but cannot resolve
them completely. Wrappers will, therefore, retain an important role
in the integration of data from WWW sources for some time to come.
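
The remark about varying DTDs and schemas can be made concrete with
a small sketch. The two XML fragments below are invented for
illustration and stand for two hypothetical sources that publish the
same record under different schemas. Both are well-formed XML, yet
each still needs its own source-specific mapping before the data can
be integrated.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources describing the same record with different schemas.
SOURCE_A = "<catalog><book><title>Dune</title><price>9.99</price></book></catalog>"
SOURCE_B = '<items><item name="Dune" cost="9.99"/></items>'


def read_source_a(xml_text):
    """Schema A stores the fields as child elements."""
    root = ET.fromstring(xml_text)
    return [
        {"title": book.findtext("title"), "price": float(book.findtext("price"))}
        for book in root.iter("book")
    ]


def read_source_b(xml_text):
    """Schema B stores the same fields as attributes with different names."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.get("name"), "price": float(item.get("cost"))}
        for item in root.iter("item")
    ]


records_a = read_source_a(SOURCE_A)
records_b = read_source_b(SOURCE_B)
```

Both readers return the same normalised records, but only because a
separate mapping was written for each schema. The extraction step is
simpler than with HTML, while the per-source effort does not vanish.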
Wrapper-Generating Toolkits
Every wrapper can be manually developed from scratch, for example,
in an established programming language using regular expressions.
For smaller applications, this can prove to be a sensible approach.
However, if a larger number of wrappers is required, it becomes
more practical to use so-called toolkits, which can generate a
complete wrapper for a given data source based on user-defined
parameters. One of the most important features of
generated wrappers is the format in which the extracted data can
be exported. If, for example, the extracted data is converted
into an XML format, then it can be imported and processed by a
large number of software applications. Toolkits for generating
wrappers can
be differentiated in a number of ways. They can be categorised
by their output methods, interface type, Web crawling capability,
use of a graphical user interface (GUI) and several other characteristics.
Laender et al. [12] categorise a number of toolkits based on the
methods used
for generating wrappers. These methods include specially designed
wrapper development languages and algorithms based on HTML-awareness,
induction, modelling, ontology and natural language processing.
However, a detailed presentation of such technical details is
beyond the scope of this survey paper. Therefore, the toolkits
are simply divided into two basic categories based on commercial
and non-commercial availability.
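
As an illustration of the manual approach mentioned at the start of
this section, the following sketch shows a hand-written wrapper that
uses a regular expression and exports its result as XML. The page
fragment, the pattern and the element names are hypothetical; a real
wrapper would be tailored to the markup of one specific website.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment for a single, known site layout.
PAGE = """
<li class="offer"><b>Widget</b> <span class="price">9.99 EUR</span></li>
<li class="offer"><b>Gadget</b> <span class="price">24.50 EUR</span></li>
"""

# The pattern is tied to this exact markup; any layout change breaks it.
OFFER = re.compile(
    r'<li class="offer"><b>(?P<name>[^<]+)</b>\s*'
    r'<span class="price">(?P<amount>[\d.]+)\s*(?P<currency>[A-Z]{3})</span></li>'
)


def wrap(page: str) -> str:
    """Extract all offers from the page and return them as an XML string."""
    root = ET.Element("offers")
    for match in OFFER.finditer(page):
        offer = ET.SubElement(root, "offer")
        ET.SubElement(offer, "name").text = match.group("name").strip()
        price = ET.SubElement(offer, "price", currency=match.group("currency"))
        price.text = match.group("amount")
    return ET.tostring(root, encoding="unicode")


xml_output = wrap(PAGE)

# The same wrapper applied to a page whose markup has changed finds nothing.
changed_output = wrap('<div class="offer"><b>Widget</b> 9.99 EUR</div>')
```

The second call hints at why hand-written wrappers are costly to
maintain: a small change in the page markup silently yields an empty
result, and the pattern has to be revised by hand.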
The wrapper-generating programs in both of these categories
offer several different means of user interaction. Some toolkits
are purely command-line based and require the user to write
routines in a toolkit-specific scripting language in order to
generate an appropriate wrapper for a specified data source. These
wrapper development scripting languages are written in standard
text editors
and can be seen as application specific alternatives to general-purpose
languages such as Perl and Java. A large number of toolkits offer
a GUI, whereby the relevant data within an HTML document is highlighted
with a mouse, and the program then generates a wrapper based on
the specified information. Several toolkits combine both of the
features described above. Initially, the relevant data is highlighted
with a mouse and the program generates a wrapper from this input.
If the automatically generated result does not meet the specified
requirements, the user can additionally make changes via an editor
integrated into the toolkit. Whether frequent corrections are
necessary depends largely on the underlying algorithms and the
functional maturity of the toolkit.
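
The idea of generating a wrapper from highlighted data can be
sketched with a deliberately simple toy example. It does not
reproduce the algorithm of any particular toolkit: the function
induce_wrapper, its context parameter and the sample page are
assumptions made for illustration. The sketch takes one highlighted
value, uses the characters immediately before and after it as
delimiters, and applies the resulting pattern to the whole page.

```python
import re


def induce_wrapper(page: str, highlighted: str, context: int = 3):
    """Build a pattern from the text surrounding one highlighted value."""
    start = page.find(highlighted)
    if start < 0:
        raise ValueError("highlighted text not found in page")
    end = start + len(highlighted)
    left = page[max(0, start - context):start]   # delimiter before the value
    right = page[end:end + context]              # delimiter after the value
    return re.compile(re.escape(left) + r"(.*?)" + re.escape(right), re.DOTALL)


# Hypothetical page; the user "highlights" the first product name.
PAGE = "<ul><li><b>Widget</b> 9.99</li><li><b>Gadget</b> 24.50</li></ul>"

wrapper = induce_wrapper(PAGE, "Widget", context=3)
names = wrapper.findall(PAGE)

# With too little context the delimiters are ambiguous and the wrapper
# also matches unrelated text, which a user would then have to correct.
loose = induce_wrapper(PAGE, "Widget", context=1).findall(PAGE)
```

With three characters of context the generated pattern picks up both
product names, whereas with a single character it over-matches. This
mirrors the point made above: how often manual correction is needed
depends on how well the underlying algorithm generalises from the
highlighted example.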