This paper addresses a problem that has received significant attention in the second half of the 1990s, namely the management of Web data using database techniques. The paper proposes to create a database from Web data using a three-step procedure: (1) analysis of Web pages to determine the boundaries of "records" in these pages and to extract them, (2) determination of objects and relationships encapsulated in these pages and records in order to build tables that contain descriptive information about the structure, and (3) generation of the populated database. This paper addresses only the first step of the process: companion papers [2,3] address steps 2 and 3, which depend on an application ontology from which a database schema is derived.
The definition of a "record" is not precisely stated, but it corresponds roughly to a segment of a Web page that may appear repeatedly in that page (e.g., auto sales ads) and that can be treated as one unit. The process of determining record boundaries and groups of records starts with the building of a "tag tree," which represents the nested tag structure of a document. The nodes of the tag tree are the individual tags in the document; a tag T1 that is nested inside a tag T2 appears as a child node of T2 in the tree. The tag tree is analyzed to find the subtree with the highest fanout. The authors' contention is that, among the many possible record breaks, those in this subtree are the most meaningful. The tags in this subtree (called "candidate tags") are subjected to five heuristics to determine the one that will serve as the record separator. These heuristics, which are described in detail in the paper, are: (1) rank the candidate tags by number of occurrences and pick the highest-ranked one; (2) compare the candidate tags against a predetermined ranked list of tags and pick the one ranked highest on this list; (3) compute the standard deviation of the distance (in number of characters) between consecutive occurrences of the same candidate tag in the subtree and pick the tag with the smallest standard deviation; (4) compare the number of occurrences of each pair of tags with no intervening text against the number of individual occurrences of each of these tags, and pick the tag for which the difference in these counts is smallest (this heuristic exploits the fact that certain pairs of tags usually occur together near record boundaries); (5) using an ontology, find certain keywords in the content that should appear in a record once and only once, and pick the candidate tag that separates these keywords. The rankings of each candidate tag under these five heuristics are combined into a compound value using the "Stanford certainty theory".
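The tag-tree construction, highest-fanout search, and occurrence-count heuristic described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag tokenization, the helper names, and the simplified certainty combination (the positive-evidence form CF(a, b) = a + b - a*b commonly attributed to the Stanford certainty-factor model) are all this reviewer's choices.

```python
# Sketch of the tag-tree / highest-fanout idea (hypothetical helper names).
# Assumes the page has already been tokenized into (tag, kind) pairs,
# where kind is 'open' or 'close'.

from collections import Counter

class TagNode:
    def __init__(self, tag):
        self.tag = tag
        self.children = []

def build_tag_tree(tokens):
    """Build the nested tag tree: a tag opened inside another
    becomes a child node of the enclosing tag."""
    root = TagNode("root")
    stack = [root]
    for tag, kind in tokens:
        if kind == "open":
            node = TagNode(tag)
            stack[-1].children.append(node)
            stack.append(node)
        elif len(stack) > 1:   # 'close': pop back to the enclosing tag
            stack.pop()
    return root

def highest_fanout_subtree(node):
    """Return the node with the largest number of direct children (fanout)."""
    best = node
    for child in node.children:
        cand = highest_fanout_subtree(child)
        if len(cand.children) > len(best.children):
            best = cand
    return best

def rank_by_occurrence(subtree):
    """Heuristic (1): rank candidate tags by occurrence count in the subtree."""
    counts = Counter(child.tag for child in subtree.children)
    return counts.most_common()

def combine_certainties(values):
    """Combine per-heuristic confidences in [0, 1] with the
    positive-evidence certainty-factor rule CF(a, b) = a + b - a*b."""
    cf = 0.0
    for v in values:
        cf = cf + v - cf * v
    return cf

# Toy page: three ad "records" separated by <hr> tags inside <body>.
tokens = [("html", "open"), ("body", "open"),
          ("h1", "open"), ("h1", "close"),
          ("hr", "open"), ("hr", "close"),
          ("p", "open"), ("p", "close"),
          ("hr", "open"), ("hr", "close"),
          ("p", "open"), ("p", "close"),
          ("hr", "open"), ("hr", "close"),
          ("body", "close"), ("html", "close")]
tree = build_tag_tree(tokens)
subtree = highest_fanout_subtree(tree)   # the <body> node, fanout 6
ranking = rank_by_occurrence(subtree)    # <hr> ranks first with 3 occurrences
```

On this toy page, the highest-fanout subtree is the one rooted at `<body>`, and heuristic (1) picks `<hr>` as the record separator; in the paper, analogous per-heuristic rankings are then fused by the certainty combination rather than used in isolation.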
The paper describes experimental studies involving on-line newspaper and university course pages, focusing on obituaries, car sale advertisements, computer job advertisements, and university course descriptions. The reported results indicate 100% success using the compound ranking of all five heuristics, even when the success rate of some individual heuristics is as low as 45%. The types of Web pages considered in these experiments show some regularity, in that each contains multiple records of the same kind of information, delimited by consistent separators. The authors clearly indicate that this is the domain in which they work.
Copyright © 2000 by the author(s). Review published with permission.