|
Problem
The unabated growth of the Web has resulted in a situation in which
more information is available to more people than ever in human
history. Along with this unprecedented growth has come the inevitable
problem of information overload. To counteract this information
overload, users typically rely on search engines (like Google and
AllTheWeb) or on manually created categorization hierarchies (like
Yahoo! and the Open Directory Project). Though excellent for accessing
Web pages on the so-called "crawlable" web, these approaches
overlook a much more massive and high-quality resource: the Deep
Web.
The Deep Web (or Hidden Web) comprises all information that resides
in autonomous databases behind portals and information providers'
web front-ends. Web pages in the Deep Web are dynamically generated
in response to a query through a web site's search form and often
contain rich content. A recent study has estimated the size of the
Deep Web to be more than 500 billion pages, whereas the size of
the "crawlable" web is only 1% of the Deep Web (i.e.,
less than 5 billion pages). Even those web sites with some static
links that are "crawlable" by a search engine often have
much more information available only through a query interface.
Unlocking this vast Deep Web content presents a major research challenge.
In analogy to search engines over the "crawlable" web,
we argue that one way to unlock the Deep Web is to employ a fully
automated approach to extracting, indexing, and searching the query-related
information-rich regions from dynamic web pages. For this miniproject,
we focus on the first of these: extracting data from the Deep Web.
Extracting the interesting information from a Deep Web site requires
several things, including scalable and robust methods for analyzing
dynamic web pages of a given web site, discovering and locating
the query-related information-rich content regions, and extracting
itemized objects within each region. By full automation, we mean
that the extraction algorithms should be designed independently
of the presentation features or specific content of the web pages,
such as the specific ways in which the query-related information
is laid out or the specific locations where the navigational links
and advertisement information are placed in the web pages.
There are many possible 7001-miniprojects. Feel free to talk to
either of us for more details. Here are a few possibilities to consider:
1. Develop a Web-based demo for clustering pages of a similar
type from a single Deep Web source. For example, AllMusic produces
three types of pages in response to a user query: a direct match
page (e.g. for Elvis Presley), a list of links to match pages (e.g.
a list of all artists named Jackson), and a page with no matches.
As a first step to extracting the relevant data from each page,
you may develop techniques to separate out the pages that contain
query matches from pages that contain no matches, and perhaps, rank
each group based on some metric of quality.
2. Design a system for extracting interesting data from a collection
of pages from a Deep Web source. You might define a set of regular
expressions that can identify dates, prices, or names. Develop a
small program that converts a page into a type structure. For example,
given a DOM model of a web page, identify all of the types that
you have defined, and replace the string tokens with XML tags identifying
the types. Replace all non-type tokens with a generic type, and
return the tree as a full type structure. Alternatively, you may
suggest your own approach for extracting data.
3. Develop a system to recognize names in a page. Given a list of
names and a web page, identify possible matches in the page. Based
on the structure of the page and the distribution of recognized
names, identify strings that may also be names based on their location
in the DOM tree hierarchy representing the page.
4. Write a survey paper about current approaches for understanding
and analyzing the Deep Web. Be sure to include many of your own
comments on the viability of the approaches you review.
5. Or, feel free to suggest a miniproject of your own.
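As a rough starting point for possibility 2, the Python sketch below shows
one way a page might be converted into a type structure. Everything in it
is an illustrative assumption rather than a required design: the three
regular expressions are deliberately crude, the generic type is called
`text`, and the names `TYPE_PATTERNS`, `classify`, `TypeStructureBuilder`,
and `to_type_structure` are made up for this example. For brevity it walks
the page with the standard library's event-based `html.parser` instead of
building a full DOM tree.

```python
import re
from html.parser import HTMLParser

# Hypothetical type patterns (assumptions for illustration only).
# Real Deep Web pages would need far more robust expressions.
TYPE_PATTERNS = [
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),   # e.g. 1/8/1935
    ("price", re.compile(r"^\$\d+(?:\.\d{2})?$")),      # e.g. $9.99
    ("name", re.compile(r"^[A-Z][a-z]+$")),             # crude: one capitalized word
]


def classify(token):
    """Return the first matching type name, or the generic type 'text'."""
    for type_name, pattern in TYPE_PATTERNS:
        if pattern.match(token):
            return type_name
    return "text"


class TypeStructureBuilder(HTMLParser):
    """Keeps the tag skeleton and replaces each text token with a type tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; only the element structure is kept.
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Split on whitespace and emit one empty XML tag per token.
        for token in data.split():
            self.parts.append(f"<{classify(token)}/>")


def to_type_structure(html):
    """Convert an HTML string into a string of structural and type tags."""
    builder = TypeStructureBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered text
    return "".join(builder.parts)
```

Under these assumptions, the row `<tr><td>Elvis</td><td>$9.99</td></tr>`
would become `<tr><td><name/></td><td><price/></td></tr>`. Two pages from
the same source could then be compared by their type structures rather
than by their literal content.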
Background: Knowledge of Java or Python would
be helpful. Some knowledge of information retrieval and machine
learning may be useful but is not required.
Deliverables: You should submit a report that
clearly describes what you have learned and what you have accomplished.
The report should include useful references. You should also provide
any source code you may have written to validate your ideas.
Evaluation: You will be graded on the novelty
and quality of your report and implementation.
......
|