Position paper for AI and Information Integration workshop at AAAI-98

Squeal: SQL Access to Information on the Web

The World-Wide Web contains an abundance of semi-structured information, including hyperlinks between pages, structure within hypertext pages, and structure within the addresses of pages (URLs). Because of the difficulty in harnessing this structural information, many Web tools fail to make use of it, instead treating the Web as though it were a flat text collection. We introduce Squeal, a SQL-like language for querying the Web as though it were in a relational database. We describe simple but powerful applications built on top of Squeal, which we call "ParaSites" because they make effective use of this underutilized information on the Web, often in ways unintended by the information’s authors.

Squeal

Relational databases provide a powerful abstraction for manipulating data. The division of data into relations helps users comprehend the data and allows them to specify queries in Structured Query Language (SQL). While the Web is too large and quickly-changing to be stored in a relational database, it is useful to provide users with the illusion that it is. We provide this through the Squeal language and interpreter, which allows users to make SQL queries involving Web information. The Squeal interpreter determines what information needs to be fetched and parsed from the Web in order to answer the question. The retrieved information is cached in a local database in case it is needed again.

The Squeal schema includes relations for representing the following types of Web information:

the contents of a page
hypertext links between pages
different URL strings corresponding to the same page
the parsed version of a URL
tags and attributes appearing on a page
the header and list hierarchy on a page
if a search engine reports that a page contains a specified string
if a search engine reports that a page contains a specified link

For example, hyperlinks are expressed through the link table, which expresses a relation between a source URL, anchor text, and a destination URL. In order to ask what pages are pointed to by the MIT AI Lab home page, the user would enter:

SELECT destination FROM link
WHERE source = "www.ai.mit.edu"

The Squeal interpreter would retrieve the page, parse it, then return a list of the destinations of links on the page. Instead of (or in addition to) specifying the source page, the user could provide another field. For example, to ask for pages containing hyperlinks with "artificial intelligence" as anchor text, the user would enter:

SELECT source FROM link
WHERE anchor = "artificial intelligence"

To answer this, the Squeal interpreter would ask a search engine, such as AltaVista, what pages contain "artificial intelligence" as anchor text. Because no such engine can ensure returning all such pages, the list is likely to be incomplete. The Squeal interpreter then retrieves and examines the pages, returning to the user a list of pages that it has verified as having the desired anchor text. While recall will be less than 1, the precision is guaranteed to be 1. A similar process is used to request pages that link to both "www.ai.mit.edu" and "www.lcs.mit.edu":

SELECT L1.source FROM link L1, L2
WHERE L1.destination = "www.ai.mit.edu"
AND L2.destination = "www.lcs.mit.edu"
AND L1.source = L2.source

Queries involving multiple Squeal- and user-defined tables are also possible and are discussed elsewhere.

ParaSites

We have built a number of ParaSites on top of Squeal that exploit the Web’s structural information, including a personal home page finder, which uses the following algorithm to try finding the home page of the person with name N.

Generate a list of candidate pages from the destinations of hyperlinks whose anchor text is N.
Give bonus points to candidate pages where one or more of the following conditions hold:

The full name appears within title or header tags (e.g., "<title>Lynn Stein’s home page</title>")
The URL is of the form "…./foo/foo.html" (e.g., "/users/las/las.html")
The final directory name in the URL starts with a tilde (e.g., "~las/home.html")

This algorithm, with additional heuristics described elsewhere, is quite effective.

The contrast between text-based and structure-based approaches can be seen most directly in different recommender systems for Web pages. With either approach, a user specifies seed pages. A text-based system, such as Excite, then searches for pages that contain the same words, and return these pages to the user. The ParaSite approach is to find parent pages that point to the seed URLs, then to return the pages pointed to most frequently by the parents, i.e., the seed pages’ siblings. The underlying philosophy is that human beings are best at deciding what pages are alike, so we should associate pages with each other if they co-occur as destinations of hyperlinks on multiple pages. This is entirely analogous to mining citation indexes. A small user study showed the ParaSite approach to be superior in some ways to the purely text-based approach. We predict the best system would combine both approaches.

Conclusions

We have provided a taste of the Squeal programming system for accessing structural information on the Web and have outlined some effective applications that can be easily written in Squeal, demonstrating the utility of the Web’s structural information and how Web tools, such as AltaVista, can be harnessed.

Bibliography

Spertus, Ellen. ParaSite: Mining the Structural Information on the World-Wide Web. PhD Thesis, Department of EECS, MIT, Cambridge, MA, February 1998.