<< Heterogeneous Web Documents | Home | Reseller ratings and news >>

Hierarchical Classification of the Web

With the exponential growth of information on the internet and intranets, it is becoming increasingly difficult to find and organize relevant materials. More and more, simple text retrieval systems are being supplemented with structured organizations. Since the 19th century, librarians have used classification systems like Dewey and Library of Congress subject headings to organize vast amounts of information. More recently, web directories such as Yahoo! and LookSmart have been used to classify web pages. Structured directories support browsing and search, but the manual nature of the directory compiling process makes it difficult to keep pace with the ever increasing amount of information. Our work looks at the use of automatic classification methods to supplement human effort in creating structured knowledge hierarchies.

Although many real world classification systems have complex hierarchical structure (e.g., MeSH, U.S. Patents, Yahoo!, LookSmart), few learning methods capitalize on this structure. Most of the approaches mentioned above ignore hierarchical structure and treat each category or class separately, thus in effect “flattening” the class structure. A separate binary classifier is learned to distinguish each class from all other classes. The binary classifiers can be considered independently, so an item may fall into none, one, or more than one category. Or they can be considered as an m-ary problem, where the best matching category is chosen. Such simple approaches work rather well on small problems, but they are likely to be difficult to train when there are a large number of classes and a very large number of features. By utilizing known hierarchical structure, the classification problem can be decomposed into a set of smaller problems corresponding to hierarchical splits in the tree. Roughly speaking, one first learns to distinguish among classes at the top level, then lower level distinctions are learned only within the appropriate top level of the tree. Each of these sub-problems can be solved much more efficiently, and hopefully more accurately as well.




Add a comment Send a TrackBack