The web is an ocean of information containing more than 10 billion
web pages, 90% of which are in unstructured or semi-structured
formats. At present, it is expanding at a rate of 1 million pages
per day. Information is growing at an explosive speed, while
people's time and energy are limited. The information most valuable
to enterprises and individuals lies somewhere in this worldwide
ocean, and how to extract it has become one of the most pressing
tasks facing research institutions working on Information Retrieval,
Data Mining, Knowledge Management and Competitive Intelligence.
The Knowlesys Web Data Mining System (KWDMS) is like a huge blue whale
that cruises this information ocean every day. It automatically and
accurately extracts valuable data for you from web pages while
excluding the multitude of useless text (such as page headers and
footers, column listings and advertisements).
Over more than five years, Knowlesys Software, Inc. has developed the
KWDMS, a powerful web information extraction system. It has a layered
structure and a loosely coupled modular design comprising many
sub-systems.
The KWDMS can extract designated information in large volumes from the
web and integrate it into specified relational databases, helping
customers dig precious stones out of the Internet mine. The process
converts information from semi-structured to structured form, from a
dispersed state to a concentrated one, from remote content to a
locally stored asset, and from a visual page into a digital record,
so you can make extensive use of it later.
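
To illustrate the idea of turning semi-structured pages into relational
records, here is a minimal Python sketch. It is not how the KWDMS itself
is implemented; the sample HTML, the `products` table and the helper names
`RowParser`, `extract_rows` and `store_rows` are all assumptions made for
illustration, and SQLite stands in for whichever relational database is used.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical semi-structured input: a product listing as an HTML table.
SAMPLE_HTML = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def extract_rows(html):
    """Return the table rows found in the HTML as a list of tuples."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def store_rows(conn, rows):
    """Insert (name, price) rows into an assumed 'products' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products (name, price) VALUES (?, ?)",
        [(name, float(price)) for name, price in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_rows(conn, extract_rows(SAMPLE_HTML))
stored = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
```

Once the rows are in a table, they can be queried, joined and aggregated
like any other relational data, which is the practical benefit of the
conversion described above.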
The KWDMS is capable of extracting data from various types of websites.
In addition to extracting field data from semi-structured pages, it can
also extract free-text information such as e-mail addresses, as well as
many types of multimedia files.
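
As a rough illustration of free-text extraction, the following Python
sketch pulls e-mail addresses out of plain text with a regular expression.
The pattern is a deliberately simple assumption that covers common
addresses only, and the function name `extract_emails` is hypothetical;
it does not represent the matching rules used by the KWDMS.

```python
import re

# Simplified pattern: local part, "@", then a dotted domain name.
# It is an approximation and does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return unique e-mail addresses in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # compare case-insensitively to drop duplicates
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


sample = "Contact sales@example.com or Support@Example.org. Again: sales@example.com"
emails = extract_emails(sample)
```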
The KWDMS is characterized by stable operation, intelligent crawling
and accurate extraction. It is an information extraction platform:
when a new extraction task is required, you use the platform to
configure a new web crawling and extraction script and its parameters.
A general database access layer in the KWDMS enables its back end to
connect to any relational database, such as MS SQL Server, Oracle, DB2,
Sybase, MySQL and InterBase, and even file-based databases such as
Access. Regardless of the database type, the extracted data can be
inspected with a general database browser and exported to various
formats such as XML, CSV, HTML and Excel.
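
Exporting extracted records to CSV or XML can be sketched in Python with
the standard library alone. The column names, the sample rows and the
helper names `to_csv` and `to_xml` are assumptions for illustration and
do not reflect the export format the KWDMS actually produces.

```python
import csv
import io
import xml.etree.ElementTree as ET


def to_csv(columns, rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(columns, rows, root_tag="records", row_tag="record"):
    """Serialize rows to XML, one child element per column."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical extracted data.
columns = ["name", "price"]
rows = [("Widget", 9.99), ("Gadget", 24.5)]

csv_text = to_csv(columns, rows)
xml_text = to_xml(columns, rows)
```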