Home Page
 PhD Work


Introduction
With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available on-line, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.
 

What is Web Mining?

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World-Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World-Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.
  • Web Content Mining
    Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content in a representation that could be exploited by machines. The usual approach to exploit known structure in documents is to use wrappers to map documents to some data model. Techniques using lexicons for content interpretation are yet to come.
    There are two groups of web content mining strategies: Those that directly mine the content of documents and those that improve on the content search of other tools like search engines.

  • Web Structure Mining
    The World-Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. This can be compared to bibliographical citations: when a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of this information conveyed by the links to find pertinent web pages. By means of counters, higher levels cumulate the number of artifacts subsumed by the concepts they hold. Counters of hyperlinks, into and out of documents, retrace the structure of the web artifacts summarized.

  • Web Usage Mining
    Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help understand the user behaviour and the web structure, thereby improving the design of this colossal collection of resources. There are two main tendencies in Web Usage Mining driven by the applications of the discoveries: General Access Pattern Tracking and Customized Usage Tracking.
    General access pattern tracking analyzes the web logs to understand access patterns and trends. These analyses can shed light on better structure and grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. We have designed a web log data mining tool, WebLogMiner, and proposed techniques for using data mining and OnLine Analytical Processing (OLAP) on treated and transformed web access files. Applying data mining techniques on access logs unveils interesting access patterns that can be used to restructure sites in a more efficient grouping, pinpoint effective advertising locations, and target specific users for specific selling ads.
    Customized usage tracking analyzes individual trends. Its purpose is to customize web sites to users. The information displayed, the depth of the site structure and the format of the resources can all be dynamically customized for each user over time based on their access patterns.
    While it is encouraging and exciting to see the various potential applications of web log file analysis, it is important to know that the success of such applications depends on what and how much valid and reliable knowledge one can discover from the large raw log data. Current web servers store limited information about the accesses. Some scripts custom-tailored for some sites may store additional information. However, for effective web usage mining, a substantial cleaning and data transformation step may be needed before analysis.
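The cleaning and transformation step mentioned for web usage mining can be illustrated with a small sketch. It is not WebLogMiner's actual procedure. The record layout, the list of ignored file suffixes and the 30-minute session timeout are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (host, timestamp, path, status).
RECORDS = [
    ("10.0.0.1", datetime(2000, 1, 1, 10, 0), "/index.html", 200),
    ("10.0.0.1", datetime(2000, 1, 1, 10, 1), "/logo.gif", 200),
    ("10.0.0.1", datetime(2000, 1, 1, 10, 5), "/papers.html", 200),
    ("10.0.0.2", datetime(2000, 1, 1, 10, 6), "/missing.html", 404),
    ("10.0.0.1", datetime(2000, 1, 1, 11, 30), "/index.html", 200),
]

# Assumptions: which requests count as noise, and how long an idle gap
# may last before a new session starts.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Keep successful page requests; drop embedded resources and errors."""
    return [
        r for r in records
        if 200 <= r[3] < 300 and not r[2].lower().endswith(IGNORED_SUFFIXES)
    ]


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group each host's requests into sessions split at long idle gaps."""
    by_host = defaultdict(list)
    for host, ts, path, _status in sorted(records, key=lambda r: (r[0], r[1])):
        by_host[host].append((ts, path))

    sessions = []
    for host, visits in by_host.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, path) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                # Idle gap exceeded: close the current session.
                sessions.append((host, current))
                current = []
            current.append(path)
        sessions.append((host, current))
    return sessions


if __name__ == "__main__":
    for host, pages in sessionize(clean(RECORDS)):
        print(host, pages)
```

Using the host address as a stand-in for a user is itself a simplification, since proxies and shared machines blur that mapping. This is one reason the raw log data limits what can be reliably discovered.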

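The link-based reasoning described under Web Structure Mining can be sketched in a few lines. The first function counts in-links and out-links per page. The second is a simplified power-iteration ranking in the spirit of PageRank, not the published PageRank or CLEVER implementations. The link graph, the damping factor of 0.85 and the iteration count are assumptions for illustration.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def link_counts(links):
    """Return (in-link counts, out-link counts) for every linking page."""
    out_links = {page: len(targets) for page, targets in links.items()}
    in_links = Counter(t for targets in links.values() for t in targets)
    return in_links, out_links


def simple_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style ranking by repeated rank propagation."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    in_links, out_links = link_counts(LINKS)
    print("in-links:", dict(in_links))
    print("out-links:", out_links)
    ranks = simple_rank(LINKS)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 3))
```

In this toy graph c.html receives the most in-links and also ends up with the highest rank, which matches the citation analogy: a page that many others point to is treated as important.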
 

People

 

 
Organisations
 

Projects

 

Software

 

  • Commercial Software
    Name: FastStats
    Company: Apteco Limited, United Kingdom
    Type: data mining tool
    Comments: Apteco have developed the FastStats range of marketing tools including data mining tools for better analysis of data.

    Name: DBMiner
    Company: Simon Fraser University, Canada
    Type: data mining tool
    Comments: Provides you with a powerful and affordable tool to mine large data warehouses and relational databases fast and efficiently using multiple mining functions. This version of the software uses Microsoft SQL Server 7.0 Plato to build the data cubes on which it performs mining tasks, a modification that dramatically improves the versatility and efficiency of DBMiner.

    Name: SpeedTracer
    Company: IBM
    Type: data mining tool
    Comments: "SpeedTracer is a Web usage mining and analysis tool which tracks user browsing patterns, generating reports to help webmasters refine Web site structure and navigation. The application uses innovative inference algorithms to reconstruct user traversal paths and identify user sessions. Advanced mining algorithms uncover users' movement through a Web site. The end result is a collection of valuable browsing patterns which help webmasters better understand user behavior. SpeedTracer generates three types of statistics: user-based, path-based and group-based. User-based statistics pinpoint reference counts by user and durations of access. Path-based statistics identify frequent traversal paths in Web presentations. Group-based statistics provide information on groups of Web site pages most frequently visited."

    Name: CommerceTrends
    Company: Web Trends
    Type: data mining tool
    Comments: CommerceTrends provides the most powerful eBusiness Intelligence reporting available, enabling customers to track, manage and optimize eBusiness strategies. CommerceTrends advanced functionality includes powerful, enterprise-scalable web traffic analysis, campaign management, eCommerce revenue forecasting, eMarketing ROI and web data warehouse capabilities, enabling customers to apply data warehouse principles to correlate web traffic data with other corporate information from CRM, ERP, and Personalization systems.

    Company: SPSS
    Type: data mining tool
    Comments: The application uses innovative inference algorithms to reconstruct user traversal paths and identify user sessions. Advanced mining algorithms uncover users' movement through a Web site. The end result is a collection of valuable browsing patterns which help webmasters better understand user behavior.

    Name: WUM
    Company: Humboldt University Berlin
    Type: data mining tool
    Comments: WUM is a sequence miner. Its primary purpose is to analyze the navigational behaviour of users in a web site, but it is appropriate for sequential pattern discovery in any type of log. It discovers patterns comprised of not necessarily adjacent events and satisfying user-specific criteria. WUM is an integrated environment for log preparation, querying and visualization. Its mining query language MINT supports the specification of criteria describing dominant or statistically rare patterns. Its visualization mechanism displays the nodes comprising the desired pattern and the different non-frequent paths located in-between. This is quite important when examining how the web site is really being navigated.

    Name: Sawmill
    Company: Flowerfire
    Type: log file analyzer
    Comments: Sawmill is a powerful, hierarchical log analysis tool for Windows 95/98/NT/2000, MacOS, UNIX, OS/2 and BeOS. It is particularly well suited to web server access and referrer logs, but can process almost any log. The reports that Sawmill generates are hierarchical, attractive, and heavily crosslinked for easy navigation. Complete documentation is built directly into the program.

    Name: Funnel Web
    Company: Active Concepts
    Type: log file analyzer
    Comments: Funnel Web 4.0 is the latest release of our classic intelligent analysis and internet reporting software. Designed with a whole new interface, version 4.0 is even easier to use and configure than previous versions of Funnel Web. Plus, this breakthrough product will feature a series of impressive new capabilities (like entirely web-based remote administration) plus much more! With an attractive, intuitive new interface and more power than ever, Funnel Web 4.0 is all you need to stay on top of your online empire.

    Name: KnowledgeSTUDIO
    Company: Angoss
    Type: data mining tool
    Comments: KnowledgeSTUDIO is a new generation of data mining software. It integrates advanced data mining techniques into corporate environments so that enterprises can achieve maximum benefits from their investment in data. KnowledgeSTUDIO is a data mining tool which includes the power of decision trees, cluster analysis, and several predictive models to allow users to mine and understand their data from many different perspectives. It includes powerful data visualization tools to support and explain the discoveries.

    Name: NetAnalysis
    Company: Net Genesis
    Type: data mining tool
    Comments: NetAnalysis, the award-winning online behavioral analysis solution from NetGenesis, provides the superior scalability and powerful extensibility required by e-business enterprises to excel in the dynamic, increasingly competitive online environment. With its heightened flexibility and functionality, NetAnalysis can be customized to meet any company's specific e-customer intelligence needs, while easily leveraging its supporting architecture.
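
Several of the tools listed here report frequent traversal paths. As a rough illustration of what such a path-based statistic computes, the sketch below counts contiguous page sequences across sessions. It is not the algorithm used by SpeedTracer, WUM or any other listed product. The sample sessions, the path length and the support threshold are assumptions.

```python
from collections import Counter

# Hypothetical user sessions, each an ordered list of visited pages.
SESSIONS = [
    ["/index.html", "/products.html", "/order.html"],
    ["/index.html", "/products.html", "/contact.html"],
    ["/index.html", "/about.html"],
]


def frequent_paths(sessions, length=2, min_support=2):
    """Return contiguous paths of `length` pages seen in >= min_support sessions."""
    counts = Counter()
    for pages in sessions:
        # Use a set so that a path is counted at most once per session.
        seen = {
            tuple(pages[i:i + length])
            for i in range(len(pages) - length + 1)
        }
        counts.update(seen)
    return [(path, n) for path, n in counts.most_common() if n >= min_support]


if __name__ == "__main__":
    for path, support in frequent_paths(SESSIONS):
        print(" -> ".join(path), support)
```

Only contiguous page sequences are counted here. Patterns made of events that are not necessarily adjacent, as described for WUM, would need a more general sequence-mining approach.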

 

  • Public Software
    Company: ST Software
    Type: Report and Statistics
    Comments: A set of CGI scripts (written in C) that produce HTML reports based on the access logs that the HTTP server keeps. It is suitable for almost all HTTP server software (Unix & Windows) and now supports three log formats (Common, Extended and IIS).

    Name: weblog_parse
    Company: ACME Labs Software
    Type: Logfiles Processing
    Comments: Extract specified fields from a web log file. Reads a web server log file, in either "Common Logfile Format" or "Combined Logfile Format", parses it, and writes out only the user-specified fields, separated by tabs for easier handling.

    Name: WebLog
    Company: Darryl C. Burgdorf
    Type: Logfiles Analysis Tool
    Comments: A comprehensive access log analysis tool. It allows you to keep track of activity on your site by month, week, day and hour, to monitor total hits, bytes transferred and page views, and to keep track of your most popular pages.

    Name: Analog
    Company: University of Cambridge Statistical Laboratory
    Type: Logfiles Analyzer
    Comments: Analog is a program to analyse the logfiles from your web server. It tells you which pages are most popular, which countries people are visiting from, which sites they tried to follow broken links from, etc.
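
The kind of field extraction described for weblog_parse can be sketched as follows. This is an independent illustration, not code from that tool. The regular expression follows the usual Common Logfile Format layout (host, ident, user, date, request, status, bytes), and the field names and sample line are assumptions.

```python
import re

# Common Logfile Format: host ident authuser [date] "request" status bytes
# The pattern is not anchored at the end, so Combined Logfile Format lines
# (which append referrer and user agent) are matched as well.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

# Hypothetical sample line in Common Logfile Format.
SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)


def extract_fields(line, fields):
    """Return the requested fields joined by tabs, or None if unparsable."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return "\t".join(match.group(name) for name in fields)


if __name__ == "__main__":
    print(extract_fields(SAMPLE, ["host", "status", "request"]))
```

Returning None for lines that do not match lets the caller decide whether to skip or report malformed entries, which is usually needed with real log files.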