Many web sites contain large sets of pages generated using a
common template or layout. For example,
Amazon lays out the author, title, comments,
etc. in the same way in all its book pages.
The values used to generate the pages (e.g.,
the author, title, ...) typically come from
a database. In this paper, we study the
problem of automatically extracting the
database values from the web pages without
any learning examples or other similar human
input. We formally define the notion of
a template, and propose a model that describes
how values are encoded into pages using
a template. We present an extraction algorithm
that uses sets of words with similar
occurrence patterns in the input pages to
construct the template. The constructed
template is then used to extract values
from the pages. We show experimentally that
the extracted values make semantic sense
in most cases.
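
As a rough illustration of the idea of grouping words by their occurrence
pattern, the following minimal sketch groups whitespace-separated tokens
whose per-page occurrence counts are identical across all input pages.
It is a simplification for exposition, not the extraction algorithm
itself: the function name group_by_occurrence, the whitespace
tokenization, and the sample pages are all assumptions made here.

```python
from collections import Counter, defaultdict


def group_by_occurrence(pages):
    """Group tokens that share the same occurrence vector.

    The occurrence vector of a token is the tuple of its counts in each
    page. Hypothetical helper: tokens are split on whitespace, which is
    an assumption of this sketch.
    """
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts)
    groups = defaultdict(set)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        groups[vector].add(token)
    return dict(groups)


# Two made-up pages generated from the same layout.
pages = [
    "<b> Title : </b> Databases <i> Author : </i> Ullman",
    "<b> Title : </b> Compilers <i> Author : </i> Aho",
]

groups = group_by_occurrence(pages)
for vector, tokens in sorted(groups.items()):
    print(vector, sorted(tokens))
```

In this toy input, tokens that occur equally often in every page (such as
the tags and field labels) fall into shared groups and are plausible
template candidates, while tokens that appear in only one page (such as
the book titles and author names) are plausible data values.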