| Introduction |
|
With the explosive growth of information sources available on the World
Wide Web, it has become increasingly necessary for users to rely on
automated tools to find the information resources they want, and to track
and analyze their usage patterns. These factors create the need for
server-side and client-side intelligent systems that can effectively mine
for knowledge. Web mining can be broadly defined as the discovery and
analysis of useful information from the World Wide Web. This covers the
automatic search of information resources available online, i.e., Web
content mining, and the discovery of user access patterns from Web
servers, i.e., Web usage mining. |
| |
|
What is Web Mining?
|
Web Mining is the extraction of interesting and potentially useful patterns
and implicit information from artifacts or activity related to the
World Wide Web. There are roughly three knowledge discovery
domains that pertain to web mining: Web Content Mining, Web Structure
Mining, and Web Usage Mining. Web content mining is the process of
extracting knowledge from the content of documents or their
descriptions. Web document text mining and resource discovery based on
concept indexing or agent-based technology may also fall into
this category. Web structure mining is the process of inferring
knowledge from the organization of the World Wide Web and the links
between references and referents in the Web. Finally, web usage mining,
also known as Web Log Mining, is the process of extracting interesting
patterns from web access logs.
- Web Content Mining
Web content mining is an automatic process that goes beyond keyword
extraction. Since the content of a text document presents no
machine-readable semantics, some approaches restructure the document
content into a representation that machines can exploit. The usual
approach to exploiting known structure in documents is to use wrappers
that map documents to some data model. Techniques using lexicons for
content interpretation are yet to come. There are two groups of web
content mining strategies: those that directly mine the content of
documents and those that improve on the content search of other tools
like search engines.
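The wrapper idea can be illustrated with a minimal sketch. It assumes a deliberately simple data model (title, headings, outgoing links); the names `PageRecord`, `PageWrapper` and `wrap` are hypothetical and do not come from any of the tools listed on this page. Only Python's standard `html.parser` module is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class PageRecord:
    """Hypothetical data model: one record per web document."""
    title: str = ""
    headings: list = field(default_factory=list)
    links: list = field(default_factory=list)


class PageWrapper(HTMLParser):
    """Maps known HTML structure (title, h1-h3, anchors) onto a PageRecord."""

    TEXT_TAGS = ("title", "h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.record = PageRecord()
        self._current = None  # tag whose text is currently being collected
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self._current = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.record.links.append(href)

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so the stored text is normalized.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.record.title = text
            else:
                self.record.headings.append(text)
            self._current = None


def wrap(html):
    """Run the wrapper over one document and return its record."""
    parser = PageWrapper()
    parser.feed(html)
    parser.close()
    return parser.record
```

Calling `wrap(page_source)` on each fetched document would yield records that can be loaded into a database and mined with ordinary tools. A real wrapper would be written against the specific structure of the target site.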
- Web Structure Mining
The World Wide Web can reveal more information than just the
information contained in documents. For example, links pointing to a
document indicate the popularity of the document, while links coming
out of a document indicate the richness or perhaps the variety of
topics covered in the document. This can be compared to bibliographical
citations. When a paper is cited often, it ought to be important. The
PageRank and CLEVER methods take advantage of this information conveyed
by the links to find pertinent web pages. By means of counters, higher
levels cumulate the number of artifacts subsumed by the concepts they
hold. Counters of hyperlinks, in and out documents, retrace the
structure of the web artifacts summarized.
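To make the citation analogy concrete, the following is a minimal sketch of a PageRank-style computation by power iteration. It is a simplified textbook form, not the production algorithm of any search engine; the damping factor of 0.85, the iteration count, and the uniform handling of pages without outgoing links are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank to every target.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Tiny hypothetical web: both "a" and "b" link to "c", and "c" links to "a".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_links)
```

In this toy graph the page with the most incoming links, `c`, ends up with the highest score, which mirrors the observation that an often-cited paper is likely to be important.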
- Web Usage Mining
Web servers record and accumulate data about user interactions whenever
requests for resources are received. Analyzing the web access logs of
different web sites can help understand the user behaviour and the web structure, thereby
improving the design of this colossal collection of resources. There
are two main tendencies in Web Usage Mining driven by the applications
of the discoveries: General Access Pattern Tracking and Customized
Usage Tracking.
General access pattern tracking analyzes the web logs to understand
access patterns and trends. These analyses can shed light on better
structure and grouping of resource providers. Many web analysis tools
exist, but they are limited and usually unsatisfactory. We have
designed a web log data mining tool, WebLogMiner, and proposed
techniques for using data mining and On-Line Analytical Processing
(OLAP) on treated and transformed web access files. Applying data
mining techniques on access logs unveils interesting access patterns
that can be used to restructure sites in a more efficient grouping,
pinpoint effective advertising locations, and target specific users for
specific selling ads.
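As a rough illustration of OLAP-style analysis on transformed log records, the sketch below counts requests in a small three-dimensional cube (hour of day, top-level site section, status code) and then rolls it up along chosen dimensions. It is not the WebLogMiner implementation; the record layout and the function names `build_cube` and `roll_up` are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime


def build_cube(records):
    """Count requests per (hour, top-level section, status) cell."""
    cube = Counter()
    for r in records:
        # "/products/a.html" -> "/products"; "/" stays "/".
        section = "/" + r["path"].lstrip("/").split("/", 1)[0]
        cube[(r["time"].hour, section, r["status"])] += 1
    return cube


def roll_up(cube, keep):
    """Aggregate away every dimension except the indexes listed in keep."""
    out = Counter()
    for cell, count in cube.items():
        out[tuple(cell[i] for i in keep)] += count
    return out


# Hypothetical, already-parsed log records.
sample_records = [
    {"time": datetime(2000, 10, 10, 13, 55), "path": "/products/a.html", "status": 200},
    {"time": datetime(2000, 10, 10, 13, 58), "path": "/products/b.html", "status": 200},
    {"time": datetime(2000, 10, 10, 14, 2), "path": "/about.html", "status": 404},
    {"time": datetime(2000, 10, 10, 14, 5), "path": "/", "status": 200},
]
sample_cube = build_cube(sample_records)
hits_per_section = roll_up(sample_cube, keep=(1,))  # roll up hour and status
hits_per_hour = roll_up(sample_cube, keep=(0,))     # roll up section and status
```

Rolling up to the section dimension shows which parts of a site attract traffic, while rolling up to the hour dimension shows when the traffic arrives. A real OLAP engine would precompute such aggregates over much larger cubes.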
Customized usage tracking analyzes individual trends. Its purpose is to
customize web sites to users. The information displayed, the depth of
the site structure and the format of the resources can all be
dynamically customized for each user over time based on their access
patterns.
While it is encouraging and exciting to see the various potential
applications of web log file analysis, it is important to know that the
success of such applications depends on what and how much valid and
reliable knowledge one can discover from the large raw log data.
Current web servers store limited information about the accesses. Some
scripts custom-tailored for some sites may store additional
information. However, effective web usage mining may require a
substantial cleaning and data transformation step before analysis.
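A cleaning and transformation step of this kind might look like the sketch below. It assumes log lines in the Common Log Format, drops failed requests and embedded resources such as images, and groups the remaining requests into sessions per host. The list of skipped suffixes and the 30-minute session timeout are assumptions chosen for illustration, and identifying a user by host alone is a simplification.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)
# Embedded resources that are not page views (an assumed, incomplete list).
SKIP_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def parse_line(line):
    """Return a record dict for one log line, or None if it does not parse."""
    m = CLF.match(line)
    if m is None:
        return None
    return {
        "host": m["host"],
        "time": datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "path": m["path"],
        "status": int(m["status"]),
    }


def clean(lines):
    """Yield records for successful page requests only."""
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["status"] >= 400:
            continue
        if rec["path"].lower().endswith(SKIP_SUFFIXES):
            continue
        yield rec


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group requests into per-host sessions split by an inactivity timeout."""
    sessions = []
    last = {}  # host -> (time of last request, that host's current session)
    for rec in sorted(records, key=lambda r: r["time"]):
        prev = last.get(rec["host"])
        if prev is None or rec["time"] - prev[0] > timeout:
            session = [rec["path"]]
            sessions.append(session)
        else:
            session = prev[1]
            session.append(rec["path"])
        last[rec["host"]] = (rec["time"], session)
    return sessions
```

The output of `sessionize(clean(log_lines))` is a list of page sequences, which is the kind of input that path analysis and sequence mining work on.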

|
| |
|
People
|
|
|
| |
| Organisations |
|
|
| |
|
Projects
|
|
|
| Software |
|
|
| Name | Company | Type | Comments |
| --- | --- | --- | --- |
| FastStats | Apteco Limited, United Kingdom | data mining tool | Apteco has developed the FastStats range of marketing tools, including data mining tools for better analysis of data. |
| DBMiner | Simon Fraser University, Canada | data mining tool | Provides a powerful and affordable tool to mine large data warehouses and relational databases quickly and efficiently using multiple mining functions. This version of the software uses Microsoft SQL Server 7.0 Plato to build the data cubes on which it performs mining tasks, a modification that dramatically improves the versatility and efficiency of DBMiner. |
| SpeedTracer | IBM | data mining tool | "SpeedTracer is a Web usage mining and analysis tool which tracks user browsing patterns, generating reports to help webmasters refine Web site structure and navigation. The application uses innovative inference algorithms to reconstruct user traversal paths and identify user sessions. Advanced mining algorithms uncover users' movement through a Web site. The end result is a collection of valuable browsing patterns which help webmasters better understand user behavior. SpeedTracer generates three types of statistics: user-based, path-based and group-based. User-based statistics pinpoint reference counts by user and durations of access. Path-based statistics identify frequent traversal paths in Web presentations. Group-based statistics provide information on groups of Web site pages most frequently visited." |
| CommerceTrends | Web Trends | data mining tool | CommerceTrends provides the most powerful eBusiness Intelligence reporting available, enabling customers to track, manage and optimize eBusiness strategies. CommerceTrends advanced functionality includes powerful, enterprise-scalable web traffic analysis, campaign management, eCommerce revenue forecasting, eMarketing ROI and web data warehouse capabilities, enabling customers to apply data warehouse principles to correlate web traffic data with other corporate information from CRM, ERP, and Personalization systems. |
| | SPSS | data mining tool | The application uses innovative inference algorithms to reconstruct user traversal paths and identify user sessions. Advanced mining algorithms uncover users' movement through a Web site. The end result is a collection of valuable browsing patterns which help webmasters better understand user behavior. |
| WUM | Humboldt University Berlin | data mining tool | WUM is a sequence miner. Its primary purpose is to analyze the navigational behaviour of users in a web site, but it is appropriate for sequential pattern discovery in any type of log. It discovers patterns comprised of not necessarily adjacent events and satisfying user-specific criteria. WUM is an integrated environment for log preparation, querying and visualization. Its mining query language MINT supports the specification of criteria describing dominant or statistically rare patterns. Its visualization mechanism displays the nodes comprising the desired pattern and the different non-frequent paths located in-between. This is quite important when examining how the web site is really being navigated. |
| Sawmill | Flowerfire | log file analyzer | Sawmill is a powerful, hierarchical log analysis tool for Windows 95/98/NT/2000, MacOS, UNIX, OS/2 and BeOS. It is particularly well suited to web server access and referrer logs, but can process almost any log. The reports that Sawmill generates are hierarchical, attractive, and heavily crosslinked for easy navigation. Complete documentation is built directly into the program. |
| Funnel Web | Active Concepts | log file analyzer | Funnel Web 4.0 is the latest release of our classic intelligent analysis and internet reporting software. Designed with a whole new interface, version 4.0 is even easier to use and configure than previous versions of Funnel Web. Plus, this breakthrough product will feature a series of impressive new capabilities (like entirely web-based remote administration) plus much more! With an attractive, intuitive new interface and more power than ever, Funnel Web 4.0 is all you need to stay on top of your online empire. |
| KnowledgeSTUDIO | Angoss | data mining tool | KnowledgeSTUDIO is a new generation of data mining software. It integrates advanced data mining techniques into corporate environments so that enterprises can achieve maximum benefits from their investment in data. KnowledgeSTUDIO is a data mining tool which includes the power of decision trees, cluster analysis, and several predictive models to allow users to mine and understand their data from many different perspectives. It includes powerful data visualization tools to support and explain the discoveries. |
| NetAnalysis | Net Genesis | data mining tool | NetAnalysis, the award-winning online behavioral analysis solution from NetGenesis, provides the superior scalability and powerful extensibility required by e-business enterprises to excel in the dynamic, increasingly competitive online environment. With its heightened flexibility and functionality, NetAnalysis can be customized to meet any company's specific e-customer intelligence needs, while easily leveraging its supporting architecture. |
|
|
| Name | Company | Type | Comments |
| --- | --- | --- | --- |
| | ST Software | Report and Statistics | A set of CGI scripts (written in C) that produce HTML reports based on the access logs that the HTTP server keeps. It is suitable for almost all HTTP server software (Unix & Windows) and now supports three log formats (Common, Extended and IIS). |
| weblog_parse | ACME Labs Software | Logfiles Processing | Extracts specified fields from a web log file. Reads a web server log file in either "Common Logfile Format" or "Combined Logfile Format", parses it, and writes out only the user-specified fields, separated by tabs for easier handling. |
| WebLog | Darryl C. Burgdorf | Logfiles Analysis Tool | A comprehensive access log analysis tool. It allows you to keep track of activity on your site by month, week, day and hour, to monitor total hits, bytes transferred and page views, and to keep track of your most popular pages. |
| Analog | University of Cambridge Statistical Laboratory | Logfiles Analyzer | Analog is a program to analyse the logfiles from your web server. It tells you which pages are most popular, which countries people are visiting from, which sites they tried to follow broken links from, etc. |
|
|
| |
|