Enhanced Hypertext Categorization Using Hyperlinks.
Soumen Chakrabarti, Byron Dom, Piotr Indyk:
Enhanced Hypertext Categorization Using Hyperlinks.
SIGMOD Conference 1998: 307-318
@inproceedings{DBLP:conf/sigmod/ChakrabartiDI98,
author = {Soumen Chakrabarti and
Byron Dom and
Piotr Indyk},
editor = {Laura M. Haas and
Ashutosh Tiwary},
title = {Enhanced Hypertext Categorization Using Hyperlinks},
booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
publisher = {ACM Press},
year = {1998},
isbn = {0-89791-995-5},
pages = {307-318},
ee = {http://doi.acm.org/10.1145/276304.276332, db/conf/sigmod/ChakrabartiDI98.html},
crossref = {DBLP:conf/sigmod/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
Abstract
A major challenge in indexing unstructured hypertext databases is to
automatically extract meta-data that enables structured search using
topic taxonomies, circumvents keyword ambiguity, and improves the
quality of search and profile-based routing and filtering.
Therefore, an accurate classifier is an essential component of a
hypertext database. Hyperlinks pose new problems
not addressed in the extensive text classification literature.
Links clearly contain high-quality semantic clues that are lost
upon a purely term-based classifier, but exploiting link
information is non-trivial because it is noisy.
Naive use of terms in the link neighborhood of a document
can even degrade accuracy.
Our contribution is to propose robust
statistical models and a relaxation labeling
technique for better classification
by exploiting link information in a small neighborhood around
documents. Our technique also
adapts gracefully to the fraction of neighboring documents having
known topics.
We experimented with pre-classified samples from
Yahoo! and the
US Patent Database.
In previous work, we developed a text
classifier that misclassified only 13% of the documents in the well-known
Reuters
benchmark; this was comparable to the best results ever obtained.
This classifier misclassified 36% of the patents,
indicating that classifying hypertext can be more difficult
than classifying text. Naively using terms in neighboring documents
increased error to 38%; our hypertext classifier reduced it
to 21%. Results with the Yahoo! sample were
more dramatic: the text classifier showed 68% error,
whereas our hypertext classifier reduced this to only 21%.
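The abstract names relaxation labeling but does not spell out the update rule. The following is a minimal sketch of the general idea, not the authors' exact statistical model: each document starts from a text-only class estimate, and that estimate is repeatedly re-weighted by how compatible each candidate class is with the current class estimates of its linked neighbors. Neighbors with known topics stay fixed, so the procedure works with any fraction of labeled neighbors. The function name `relaxation_labeling`, the `compat` table, the class names, and all numbers are hypothetical and chosen only for illustration.

```python
from math import prod  # requires Python 3.8+


def relaxation_labeling(text_probs, links, known, compat, iterations=10):
    """Iteratively re-estimate class distributions from neighbor estimates.

    text_probs: {doc: {cls: P(cls | text)}} from a text-only classifier
    links:      {doc: [neighboring docs]} (assumed symmetric by the caller)
    known:      {doc: cls} for documents whose topic is already known
    compat:     {cls: {neighbor_cls: P(neighbor_cls | cls)}} (assumed given)
    """
    classes = list(compat)

    # Start from the text-only estimate; pin known documents to their label.
    beliefs = {d: dict(p) for d, p in text_probs.items()}
    for d, c in known.items():
        beliefs[d] = {k: 1.0 if k == c else 0.0 for k in classes}

    for _ in range(iterations):
        updated = {}
        for d in beliefs:
            if d in known:
                updated[d] = beliefs[d]  # known labels never change
                continue
            scores = {}
            for c in classes:
                # One factor per neighbor: expected compatibility of class c
                # with that neighbor's current class distribution.
                neighbor_terms = [
                    sum(beliefs[n][k] * compat[c][k] for k in classes)
                    for n in links.get(d, [])
                ]
                scores[c] = text_probs[d][c] * prod(neighbor_terms)
            total = sum(scores.values())
            updated[d] = {c: s / total for c, s in scores.items()}
        beliefs = updated  # synchronous update: all documents move together
    return beliefs


# Hypothetical toy graph: "a" leans slightly toward "ir" on text alone,
# "d" is undecided, and "b" and "c" are known "db" documents.
text_probs = {
    "a": {"db": 0.45, "ir": 0.55},
    "d": {"db": 0.50, "ir": 0.50},
}
links = {"a": ["b", "c", "d"], "d": ["a"]}
known = {"b": "db", "c": "db"}
compat = {
    "db": {"db": 0.8, "ir": 0.2},
    "ir": {"db": 0.2, "ir": 0.8},
}

result = relaxation_labeling(text_probs, links, known, compat)
for doc in ("a", "d"):
    best = max(result[doc], key=result[doc].get)
    print(doc, best, round(result[doc][best], 3))
```

In this toy setup the two known neighbors pull "a" toward "db", and that estimate then propagates to "d", which has no labeled neighbors of its own. The sketch uses neighbor class estimates rather than neighbor terms, which is in line with the abstract's remark that naive use of neighboring terms can degrade accuracy.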
Copyright © 1998 by the ACM,
Inc., used by permission. Permission to make
digital or hard copies is granted provided that
copies are not made or distributed for profit or
direct commercial advantage, and that copies show
this notice on the first page or initial screen of
a display along with the full citation.
Printed Edition
Laura M. Haas, Ashutosh Tiwary (Eds.):
SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA.
ACM Press 1998, ISBN 0-89791-995-5, SIGMOD Record 27(2), June 1998
Referenced by
- Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, D. Sivakumar, Andrew Tomkins, Eli Upfal:
The Web as a Graph.
PODS 2000: 1-10
- Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim:
Data Mining and the Web: Past, Present and Future.
Workshop on Web Information and Data Management 1999: 43-47
- Ke Wang, Senqiang Zhou, Shiang Chen Liew:
Building Hierarchical Classifiers Using Class Proximity.
VLDB 1999: 363-374
- Soumen Chakrabarti, Martin van den Berg, Byron Dom:
Distributed Hypertext Resource Discovery Through Examples.
VLDB 1999: 375-386
- Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan:
Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies.
VLDB J. 7(3): 163-178(1998)