Without extraction tools
Tools are needed to manage all available information, including
the Web, subscription services, and internal data stores. Without
an extraction tool (a product specifically designed to find, organize,
and output the data you want), you have only poor choices for getting
information. Your choices are:
Use search engines. Search engines help find some
Web information, but they do not pinpoint information, cannot fill
out the Web forms they encounter to get you the information you need,
are perpetually behind in indexing content, and at best can only
go two or three levels deep into a Web site. They also cannot search
file directories on your network.
Manually surf the Web and file directories. This option is
labor-intensive, and the work is tedious, costly, and error prone.
Humans have to read the content of each page to see if it matches
their criteria, whereas a computer simply matches patterns, which
is much faster.
Create custom programming. Custom programming
is costly, can be buggy, requires maintenance, and takes time to
develop. The programs must also be updated constantly because the
location of information changes frequently.
Inefficient methods mean the information analyst spends time finding,
collecting, and aggregating data instead of analyzing data and gaining
a competitive edge. They also affect the application programmer,
who has to spend time developing extraction tools instead of developing
tools for the core business.
New solutions improve productivity
Extraction tools that use a concise notation
to define precise navigation and extraction
rules greatly reduce the time spent on systematic
collection efforts. Tools that support a variety
of format options provide a single development
platform for all collection needs, regardless
of the electronic information source.
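
As a rough illustration of what a concise rule notation might look
like, the following Python sketch declares extraction rules as named
patterns and applies them to a page. The rule names, the sample page
text, and the extract function are hypothetical and are not taken
from any particular product.

```python
import re

# Hypothetical rules: each field name maps to a pattern with one capture group.
RULES = {
    "product": r"Product:\s*(.+)",
    "price": r"Price:\s*\$([0-9]+(?:\.[0-9]{2})?)",
}


def extract(text, rules):
    """Apply each rule to the text and return the first match per field."""
    record = {}
    for field, pattern in rules.items():
        match = re.search(pattern, text)
        # A field with no match is recorded as None rather than dropped.
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up page content used only for illustration.
SAMPLE_PAGE = """Product: Widget A
Price: $19.99
Availability: in stock"""

record = extract(SAMPLE_PAGE, RULES)
print(record)
```

Keeping the rules in a table separate from the code that applies them
means that a change in a source's layout can be handled by editing a
pattern rather than rewriting a program.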
Early software tools for “Web harvesting” and unstructured
data mining emerged and started to get the attention of information
professionals. These products did a reasonable job of finding and
extracting Web information for intelligence gathering purposes.
But this was not enough. Organizations needed to reach the “deep
Web” and other electronic information sources, capabilities beyond
simplistic Web content clipping.
A new generation of information extraction tools is markedly improving
productivity for information analysts and application developers.
Uses for extraction tools
The most popular applications for information
extraction tools remain competitive intelligence
gathering and market research, but there are
some new applications emerging as organizations
learn how to better use the functionality
in the new generation of tools.
Deep Web price gathering. The explosion of e-tailing,
e-business, and e-government makes a plethora of competitive pricing
information available on Web sites and government information portals.
Unfortunately, price lists are often difficult to extract because
reaching them requires selecting product categories or filling out
Web forms. Also, some prices are buried deep in PDF documents.
Automated forms completion and automated downloading are necessary
features to retrieve prices from the deep Web.
Primary research. Message boards, e-pinion sites,
and other Web forums provide a wealth of public opinion and user
experience information on consumer products, air travel, test drives,
experimental drugs, etc. While much of this information can be found
with a search engine, features like simultaneous board crawling,
selective content extraction, task scheduling, and custom output
reformatting are only available with extraction tools.
Content aggregation for information portals. The volume of content
is exploding, and content is available from Web and non-Web sources.
Extraction tools can crawl the Web, internal information sources, and
subscription services to automatically populate portals with pertinent
content such as competitive information, news, and financial data.
Supporting CRM systems. The Web is a valuable source
of external data to selectively populate a data warehouse or a CRM
database. To date, most organizations have focused on aggregating
internal data for their data warehouses and CRM systems. Now, however,
some organizations are realizing the value of adding external data as
well. In the book Web Farming for the Data Warehouse from Morgan
Kaufmann Publishers, Dr. Richard Hackathorn writes, “It is the synergism
of external market information with internal customer data that
creates the greatest business benefit.”
Scientific research. Scientific information on
a given topic (such as a gene sequence) is available on multiple
Web sites and subscription services. An effective extraction tool
can automate the location and extraction of this information and
aggregate it into a single presentation format or portal. This saves
scientific researchers countless hours of searching, reading, copying,
and pasting.
Business activity monitoring. Extraction tools
can continuously monitor dynamically changing information sources
to provide real-time alerts and to populate information portals
and dashboards.
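
One simple way to picture the monitoring case is to compare a
fingerprint of each source's content between checks. The Python sketch
below is only an illustration under stated assumptions: the
ChangeMonitor class, the source name, and the sample content are
hypothetical, and a real tool would fetch the content on a schedule
rather than receive it as a string.

```python
import hashlib


def fingerprint(content):
    """Return a stable digest of the content so changes are cheap to detect."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class ChangeMonitor:
    """Remembers the last fingerprint per source and reports changes."""

    def __init__(self):
        self._seen = {}

    def check(self, source, content):
        """Return True if the content differs from the previous check of this source."""
        digest = fingerprint(content)
        previous = self._seen.get(source)
        self._seen[source] = digest
        # The first observation only sets a baseline; it is not reported as a change.
        return previous is not None and previous != digest


monitor = ChangeMonitor()
alerts = [
    monitor.check("price-page", "Widget A: $19.99"),  # baseline
    monitor.check("price-page", "Widget A: $19.99"),  # unchanged
    monitor.check("price-page", "Widget A: $17.49"),  # changed
]
print(alerts)
```

Storing only a digest per source keeps the monitor's memory use small,
at the cost of knowing that something changed without knowing what.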