 PhD Work


 

1. Introduction

Our objective is to improve the process by which users search for information on the Internet.
After analysing the approaches surveyed here, we concluded that three key factors mainly determine improvements to this search process:

1. An efficient method for capturing the user's information needs when he or she searches the Internet.
2. Once the search concept has been clarified, the user must be assisted in query formulation and refinement, including query term extension and word-sense disambiguation [39-R1].
3. A reliable method for determining the relevant keywords of documents.

The reviewed approaches propose methods to address the problem of searching for information on the Internet. Most of them fall under points (1) and (2) above.
With regard to (3), we suggest a novel method to support the standard IR methods.

 

2. Related Work - Improving the Internet Search Process

Providing context for user queries:
Many techniques are used in current search engines to provide context for user queries. Google [125], Yahoo [126] and ODP [127] return both categories and documents. Northern Light [128] and WiseNut [129] cluster their results into categories, and Vivisimo [130] groups results dynamically into clusters. Teoma [131] clusters its results and provides query refinements.
Research in metasearch [132][133][134][135][136] also develops procedures for mapping user queries to a set of categories or collections.

An appropriate User Interface:
Some approaches assert that one way to improve the search process is to define a suitable user interface. [11-R1] proposes a four-phase search framework in which the fundamental components of the search process are integrated into a defined user interface: search formulation, action (starting the search), review of results, and refinement.
This implies that the user must spend time clicking options and filling in fields, a task that is often undesirable. Moreover, ordinary Internet users do not really know where the most relevant documents for a given search are, or which document types are more relevant. The current diversity of documents on the Web makes this pre-refinement or restriction of sources inappropriate. Some studies reveal that most users do not make use of the extended search options of regular search engines [MIO].

Selecting the Sources:
Prism [26-R1] examines the hypothesis that source selection is a more tractable problem than document filtering, and proposes that distributed searching can be more effective if it is guided by contextual information, such as the information gathered by just-in-time retrieval systems.
The just-in-time information retrieval systems presented in [101], [102] and [103] attempt to anticipate user information needs based on a task context inferred from user behaviour.
In other words, it is possible to monitor the user's task context, automatically select a small set of on-point sources, and dispatch queries to those sources to provide more useful search results. “For example, if the interface can determine that the user is working on a paper in economics, it could use this information to generate a context description, and select a context-relevant specialized search engine, such as CNN Financial”.
Because each source is different, a wrapper must be used to format the query and interpret the results. Some human assistance is needed to create and verify the wrappers, because methods for automatic wrapper generation are not yet sufficient to fully automate this process. Another problem is that the effectiveness of the specialized search engines varied depending on the subject, as shown in one experiment of this approach. General search engines like Google tend to cover a larger and more varied collection of documents than specialized search engines, and for some subjects they return more relevant results.
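The selection step described above can be illustrated with a minimal sketch. It assumes that each source is described by a small set of topic keywords and that the task context is available as a list of terms; the source names, keywords and scoring rule are hypothetical and are not taken from Prism.

```python
from collections import Counter

# Hypothetical source descriptions: each source is characterised by topic keywords.
SOURCES = {
    "finance_engine": {"economics", "market", "inflation", "stock"},
    "medical_engine": {"disease", "treatment", "clinical", "patient"},
    "general_engine": set(),  # fallback source with no topical restriction
}


def select_sources(context_terms, sources=SOURCES, max_sources=2):
    """Rank sources by the overlap between their keywords and the task context."""
    counts = Counter(term.lower() for term in context_terms)
    # A source scores one point for every context term that matches a keyword.
    scores = {
        name: sum(counts[keyword] for keyword in keywords)
        for name, keywords in sources.items()
    }
    ranked = sorted(
        (name for name, score in scores.items() if score > 0),
        key=lambda name: (-scores[name], name),
    )
    # Fall back to the general engine when no specialised source matches.
    return ranked[:max_sources] or ["general_engine"]
```

For a context such as ["Inflation", "market", "paper"] only the finance source is selected, while a context with no matching keyword falls back to the general engine. Each selected source would still need its own wrapper to format the query and parse the results.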

Giving semantics to HTML:
SHOE (Simple HTML Ontology Extensions) [27-R1] proposes a set of HTML ontology extensions that allow WWW authors to annotate their pages with semantic knowledge, making it simple for user agents and robots to retrieve and store this knowledge. A superset of HTML is defined to enable users to describe the relationships and attributes of their web pages in machine-readable form.
To demonstrate the use of SHOE, a crawling agent named Exposé was developed, which parses SHOE-enabled HTML documents and adds the SHOE knowledge to its internal knowledge base.
Exposé runs on Macintosh Common Lisp or C and uses PARKA (Evett, Anderson, and Hendler 1993), the University of Maryland’s massively parallel semantic network system, for its knowledge representation.
Once Exposé has gathered knowledge from the Web, one can then use this knowledge to answer sophisticated queries about entities and their relationships. For example: Find web pages for all x, y, and z such that x is a person, y is a person, z is an organization where lastName(x,"Cook"), lastName(y,"Cook"), employee(z,x), employee(z,y), marriedTo(x,y), and involvedIn(z,"DoD123-4567").
As we can see, some special conditions are required to make use of this concept: each user must manually embed content (semantic annotations) in his or her HTML pages, which is normally an undesirable task, and the crawling agents must examine a significant part of the Web to find the SHOE documents, which requires a very large amount of processing capacity.
In other words, the system must be widely used to achieve its goals.
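To make the example query about the two persons named "Cook" more concrete, the following sketch evaluates it over a tiny hand-written fact base. It is only an illustration: the facts, the entity names and the brute-force evaluation are assumptions, and it does not reflect how Exposé or PARKA actually store or query knowledge.

```python
# Hypothetical facts of the form (relation, subject, object), as a crawler
# might collect them from annotated pages.
FACTS = {
    ("isa", "alice", "Person"),
    ("isa", "bob", "Person"),
    ("isa", "acme", "Organization"),
    ("lastName", "alice", "Cook"),
    ("lastName", "bob", "Cook"),
    ("employee", "acme", "alice"),
    ("employee", "acme", "bob"),
    ("marriedTo", "alice", "bob"),  # stored in one direction only
    ("involvedIn", "acme", "DoD123-4567"),
}


def holds(relation, subject, obj, facts=FACTS):
    """True if the fact (relation, subject, obj) is in the knowledge base."""
    return (relation, subject, obj) in facts


def married_cooks_on_project(project, facts=FACTS):
    """Return all (x, y, z) bindings that satisfy the example query."""
    entities = sorted({fact[1] for fact in facts})
    results = []
    for x in entities:
        for y in entities:
            for z in entities:
                if (
                    holds("isa", x, "Person", facts)
                    and holds("isa", y, "Person", facts)
                    and holds("isa", z, "Organization", facts)
                    and holds("lastName", x, "Cook", facts)
                    and holds("lastName", y, "Cook", facts)
                    and holds("employee", z, x, facts)
                    and holds("employee", z, y, facts)
                    and holds("marriedTo", x, y, facts)
                    and holds("involvedIn", z, project, facts)
                ):
                    results.append((x, y, z))
    return results
```

The nested loops try every combination of entities, which is acceptable for a handful of facts but would not scale to a knowledge base gathered from a large part of the Web.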

Using a personalized view of the web:
[34-R1] proposes a framework to personalize search results for each user, using a personal ontology and a characterization of a particular site created by the OBIWAN [105] system. OBIWAN classifies web pages using the Lycos [106] ontology as a reference. The system tries to determine the mapping from the reference ontology to the personal ontology. Using this mapping, the user can then browse any site that has been characterized by OBIWAN with his or her personal ontology, without reclassifying the documents. In other words, each site is characterized with the same structure, representing the user’s view of the world.
Some negative aspects identified in this approach are the manual characterization of the personal data (the user ontology) and the fact that OBIWAN must characterize the region of the Web to be taken into account for the analysis. As in the previous approach, this means a time-consuming task for the user and a very large processing capacity for the system.

Eliminating query ambiguity:
OntoSeek [111] is designed to improve content-based information retrieval from online yellow pages and product catalogues. Using the Sensus linguistic ontology (approximately 50,000 nodes), it assists the user in constructing precise and unambiguous descriptions of resource texts and in formulating unambiguous queries.

Image information retrieval using Ontologies:
Ontogator [32-R1] is a metadata-based system for image information retrieval.
Its main novelty lies in the idea of enhancing keyword search accuracy and usability by combining ontology-based knowledge representation with the view-based search method, adding the possibility of relating the hits to each other and to other images in the database, for example as recommendations. If the query contains the keyword “Sibelius” and the result set contains an image depicting Jean Sibelius, the Finnish composer of symphonies inspired by the Carelian scenery (Carelia being a part of Finland), then a relation to images of Carelia (not in the actual result set) could be of interest to the user. Furthermore, the notion of view-based searching is complemented with the idea of semantic browsing used in Topic Maps and recommender systems.

Unified semantic representation for query and documents:
ExtrAns [119] works by transforming documents and queries into a semantic representation called Minimal Logical Form (MLF) and derives the answers by logical proof from the documents. A full linguistic (syntactic and semantic) analysis, complete with lexical alternations (synonyms and hyponyms), is performed to expand queries. While documents are processed in an off-line stage, the query is processed on-line.
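The query expansion with lexical alternations can be sketched as follows. This covers only the expansion step, not the transformation into Minimal Logical Form or the proof procedure, and the tiny lexicon is a hypothetical stand-in for a real lexical resource.

```python
# Hypothetical toy lexicon; a real system would draw on a full lexical resource.
SYNONYMS = {"car": {"automobile"}}
HYPONYMS = {"car": {"taxi", "convertible"}}


def expand_query(terms, synonyms=SYNONYMS, hyponyms=HYPONYMS):
    """Replace each query term by the sorted list of its lexical alternatives.

    The result is a list of groups: a document matches when it contains at
    least one alternative from every group.
    """
    expanded = []
    for term in terms:
        alternatives = {term}
        alternatives |= synonyms.get(term, set())
        alternatives |= hyponyms.get(term, set())
        expanded.append(sorted(alternatives))
    return expanded
```

With this lexicon, the query ["car", "engine"] expands to one group of four alternatives for "car" and a single-element group for "engine", which has no entry in the lexicon.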

Limiting the broadcast of the query using P2P Networks:
[33-R1] proposes a P2P-based search on the HyperCuP network model, where the content of the peers is associated with particular topics arranged as concepts in a global ontology. The idea is to restrict the broadcast of a query message to peers that can potentially provide information related to the concept asked for in the query.
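The restriction of the broadcast can be illustrated with a small sketch. It assumes a global ontology given as child-to-parent links and a table that assigns each peer to one concept; both are hypothetical, and the sketch ignores the hypercube topology that HyperCuP uses for the actual message routing.

```python
# Hypothetical global ontology, stored as child -> parent links.
PARENT = {"jazz": "music", "rock": "music", "music": "arts", "painting": "arts"}

# Hypothetical peers, each registered under the concept its content belongs to.
PEER_CONCEPTS = {
    "peer_a": "jazz",
    "peer_b": "rock",
    "peer_c": "painting",
    "peer_d": "music",
}


def is_subconcept(concept, ancestor, parent=PARENT):
    """True if `concept` equals `ancestor` or lies below it in the ontology."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = parent.get(concept)
    return False


def target_peers(query_concept, peer_concepts=PEER_CONCEPTS, parent=PARENT):
    """Select only the peers whose concept falls under the queried concept."""
    return sorted(
        peer
        for peer, concept in peer_concepts.items()
        if is_subconcept(concept, query_concept, parent)
    )
```

A query for "music" is sent to the jazz, rock and music peers only, and the painting peer never receives it.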

Using Ontologies & Natural Language:
The OntoQuery Project [31-R1] addresses the retrieval of pertinent text segments based on the conceptual content of the text. The queries take the form of natural language expressions, and the system is primarily intended to retrieve text segments whose semantic content matches the content of the noun phrases in the query. This task requires that the system recognise not only lexical synonyms and morphological variants, but also paraphrases, including those expressing conceptual generalisations and specialisations. That involves a partial syntactic and semantic analysis of the natural language queries and of the queried texts.
The system is principally used for ranking matches.
This approach focuses on generating a suitable ontology to represent efficiently (using descriptors) the content of documents and queries.

Applying user profiles:
[24-R1] analyzes the user's navigation patterns to generate a user profile of categories, which is associated with a general profile of categories based on the ODP (Open Directory Project). A combination of both profiles is likely to reflect the user’s interests and to provide a proper context for the user query. For example, it could be used as a context tool to disambiguate words in the user query.
Some drawbacks of this approach are that the search engine must be able to acquire the special tree model of search records defined in this approach, and that an important parameter for evaluating the relevance of a document is the time the user spends on it before clicking a new link.

Extending the query with Ontologies:
[29-R1] uses an ontology structure in which weights are assigned to links to measure the strength of each relation. Spreading activation techniques are used to find related concepts in the ontology, given an initial set of concepts and their corresponding initial activation values, yielding a new group of concepts related to the original query.
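A minimal sketch of spreading activation over a weighted concept graph is given below. The decay factor, the threshold, the step limit and the rule of keeping the strongest activation per concept are simplifying assumptions chosen for illustration; they are not the parameters or the exact algorithm of [29-R1].

```python
def spread_activation(graph, initial, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from the initial concepts along weighted links.

    graph:   {concept: {neighbour: link_weight}}
    initial: {concept: initial_activation}
    Returns a dict mapping every reached concept to its activation value.
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_steps):
        next_frontier = {}
        for node, value in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = value * weight * decay
                if passed < threshold:
                    continue  # signal too weak to propagate further
                # Keep the strongest activation seen so far for each concept.
                if passed > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed
                    next_frontier[neighbour] = passed
        if not next_frontier:
            break
        frontier = next_frontier
    return activation
```

Starting from a single query concept, the concepts that receive activation above the threshold form the extended query. Strongly weighted links keep related concepts in, while weak or distant ones drop out as the signal decays.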

Understanding the user information needs:
SCORE [37-R1] uses automatic classification and information-extraction techniques together with metadata and ontology information to enable contextual multi-domain searches that try to understand the exact user information need expressed in a keyword query.

Guessing what the user is searching for:
Froogle [124] also presents an approach for product searches. It is a search engine specialized in querying for products, where the user describes the products he or she wants to find using keywords associated with the product (e.g. its brand, name or model). Froogle tries to guess the product the user is looking for by associating the keywords in the query with the metadata that describe the products in its knowledge base (ontology).

 

References

[39-R1] Moldovan, D. I. and Mihalcea, R. Improving the search on the Internet by using WordNet and lexical operators. IEEE Internet Computing, 4 (1) (2000) 34-43.

[32-R1] E. Hyvönen, S. Saarela, and K. Viljanen. Ontogator: combining view- and ontology-based search with semantic browsing. In Proceedings of the XML Finland 2003 conference. Kuopio, Finland, 2003.

[11-R1] Shneiderman, B., Byrd, D., and Croft, B.: Sorting out searching: A user-interface framework for text searches, Communications of the ACM 41, 4 (April 1998), 95-98.

[26-R1] David B. Leake, Ryan Scherle: Towards context-based search engine selection. Intelligent User Interfaces 2001: 109-112.

[33-R1] Mario Schlosser, Michael Sintek, Stefan Decker, Wolfgang Nejdl: Ontology-Based Search and Broadcast in HyperCuP (Abstract), International Semantic Web Conference, Sardinia, 2002.

[31-R1] Andreasen T., Jensen P., Nilsson J., Paggio P., Pedersen B., Thomsen H. "Content-based text querying with ontological descriptors". Data & Knowledge Engineering 48(2): 199-219, 2004.

[24-R1] Liu, F., Yu, C., and Meng, W. (2002). Personalized Web search by mapping user queries to categories. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02), USA, 558-565.

[29-R1] Rocha, Cristiano and Schwabe, Daniel and Poggi de Aragao, Marcus (2004) A Hybrid Approach for Searching in the Semantic Web. In Proceedings International WWW Conference, New York, USA.

[37-R1] Sheth A., Bertram C., Avant D., Hammond B., Kochut K., Warke Y. "Managing Semantic Content for the Web". IEEE Internet Computing 6(4): 80-87, 2002.

[27-R1] S. Luke, L. Spector, D. Rager, and J. Hendler. Ontology-based Web agents. In Proceedings of First International Conference on Autonomous Agents (AA’97), 1997.

[34-R1] Zhu, X., Gauch, S., Gerhard, L., Kral, N. and Pretschner, A. Ontology-Based Web Site Mapping for Information Exploration, Proc. of the Eighth International Conference on Information and Knowledge Management (CIKM ’99), Kansas City, MO, November 1999, 188-194.

[100] Koenemann, J., and Belkin, N. A case for interaction: A study of interactive information retrieval behavior and effectiveness. In Proceedings of CHI ’96, Human Factors in Computing Systems (Vancouver, B.C., Apr. 13-18), ACM Press, New York, 1996, pp. 205-212.

[101] Budzik, J. and Hammond, K. User interactions with everyday applications as context for just-in-time information access. In Proceedings of the 2000 International Conference on Intelligent User Interfaces (IUI2000) 44-51, 2000.

[102] Horvitz, E. The Lumiere project: Bayesian user modeling for inferring the goals and needs of software users. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence 256-265. Morgan Kaufmann, July 1998.

[103] Rhodes, B. J. Margin Notes: Building a contextually aware associative memory. In Proceedings of the 2000 International Conference on Intelligent User Interfaces (IUI2000) 219-224, 2000.

[104] Cassola, E. ProFusion Personal Assistant: An Agent for Personalized Information Filtering on the WWW. Master’s thesis, The University of Kansas, Lawrence, KS, 1998.

[105] Zhu, X., Gauch, S., Gerhard, L., Kral, N. and Pretschner, A. Ontology-Based Web Site Mapping for Information Exploration, Proc. of the Eighth International Conference on Information and Knowledge Management (CIKM ’99), Kansas City, MO, November 1999, 188-194.

[106] Lycos. “Lycos: Your Personal Internet Guide”, http://www.lycos.com , 2002.

[107] Hsu, Wen-Lin and Lang, Sheau-Dong. Classification Algorithms for NETNEWS Articles. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pp. 114-121, 1999.

[108] Gövert, N., Lalmas, M., and Fuhr, N. A Probabilistic Description-Oriented Approach for Categorising Web Documents. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pp. 475-482, 1999.

[109] Matsuda, K. and Fukushima, T. Task-Oriented World Wide Web Retrieval by Document Type Classification. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pp. 109-113, 1999.

[110] Larkey, L. Automatic Essay Grading Using Text Categorization Techniques. In Proc. 21st Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, Melbourne, Australia, 1998.

[111] Guarino N., Masolo C., and Vetere G., OntoSeek: Content-Based Access to the Web, IEEE Intelligent Systems 14(3), May/June 1999, pp. 70-80.

[112] Knight, K. and Luk, S. Building a Large Knowledge Base for Machine Translation. In Proc. Amer. Assoc. Artificial Intelligence Conf. (AAAI), pp. 773-778, 1999.

[113] Labrou, Y. and Finin T. Yahoo! As an Ontology: Using Yahoo! Categories to Describe Documents. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pp. 180-187, 1999.

[114] Yahoo!, http://www.yahoo.com, 2000.

[115] Crowder, G. and Nicholas, C. Resource Selection in Café: an Architecture for Networked Information Retrieval. In Proc. SIGIR’96 Workshop on Networked Information Retrieval, Zurich, 1996.

[116] Crowder, G. and Nicholas, C. Meta-Data for Distributed Text Retrieval. In Proc. First IEEE Metadata Conference, 1996.

[117] Pearce, C. and Miller, E. The Telltale Dynamic Hypertext Environment: Approaches to Scalability. In Advances in Intelligent Hypertext, Springer-Verlag, 1997.

[119] F. Rinaldi, J. Dowdall, M. Hess, K. Karljurand, M. Koit, K. Vider, N. Kahusk. Terminology as knowledge in answer extraction. In: A. Melby (Ed.), Proceedings of TKE '02: Terminology and Knowledge Engineering, INRIA, France, 2002.

[120] D. Fensel (ed.). The semantic web and its languages. IEEE Intelligent Systems, Nov/Dec 2000.

[121] E. Hyvönen, S. Kettula, V. Raatikka, S. Saarela, and Kim Viljanen. Semantic interoperability on the web. Case Finnish Museums Online. Number 2002-03 in HIIT Publications, pages 41-53. Helsinki Institute for Information Technology (HIIT), Helsinki, Finland, 2002. http://www.hiit.fi

[122] A. S. Pollitt. The key role of classification and indexing in view-based searching. Technical report, University of Huddersfield, UK, 1998. http://www.ifla.org/IV/ifla63/63polst.pdf

[123] M. Hearst, A. Elliott, J. English, R. Sinha, K. Swearingen, and K.-P. Lee. Finding the flow in web site search. CACM, 45(9):42-49, 2002.

[124] http://froogle.google.com

[125] http://www.google.com

[126] http://www.yahoo.com

[127] http://dmoz.org

[128] http://www.northernlight.com

[129] http://www.wisenut.com

[130] http://www.vivisimo.com

[131] http://www.teoma.com

[132] S. Gauch, G. Wang, M. Gomez. ProFusion: Intelligent Fusion from Multiple, Distributed Search Engines. Journal of Universal Computer Science, 2(9), 1996.

[133] A. E. Howe and D. Dreilinger. SavvySearch: A meta-search engine that learns which search engines to query. AI Magazine, 18(2), 1997.

[134] A. L. Powell, J. C. French, J. P. Callan and M. Connell. The impact of database selection on distributed searching. SIGIR, 2000.

[135] R. Dolin, D. Agrawal, A. El Abbadi and J. Pearlman. Using Automated Classification for Summarizing and Selecting Heterogeneous Information Sources. D-Lib Magazine, 1998.

[136] C. Yu, W. Meng, W. Wu and K. Liu. Efficient and Effective Metasearch for Text Databases Incorporating Linkages among Documents. ACM SIGMOD, 2001.

[137] M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: Identifying interesting Web sites," in Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 54-61.