The issues involved with Web Mining include:
- Lots of distributed data
- Volatile data
- Unstructured and redundant data
- Problems with quality of data
- Heterogeneous data
In comparison, the advantages are:
- Structural framework provided by HTML
- Link structure of the web
The web mining taxonomy is:
- Web Content Mining
  - Web Page Content Mining
  - Search Result Mining
- Web Structure Mining
- Web Usage Mining
  - General Access Pattern Tracking
  - Customized Usage Tracking
Our focus here is 'Web Structure Mining', which mines the structure (links, graph) of the web using the techniques PageRank and CLEVER.
PageRank is Google's "original" algorithm and a major reason for its success as a search engine. (Try YaGoohoo!gle)
It is the technique used to prioritize the pages returned from a search. The importance of a page is calculated from the number of pages that point to it, i.e. its backlinks. Weighting gives more importance to backlinks coming from important pages.
The formula used for calculating a PageRank can be stated as:
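In the form given in Brin and Page's original paper, the PageRank of a page A with backlinks from pages T1 ... Tn is:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where C(T) is the number of outgoing links on page T and d is a damping factor, usually set to 0.85.

As a rough illustration, here is a minimal Python sketch that applies this formula repeatedly to a tiny hand-made link graph. The function name `pagerank`, the dict-of-lists graph format, the starting rank of 1.0 and the fixed iteration count are all assumptions made for the example; this is not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over a small graph.

    links maps each page to the list of pages it links to (hypothetical format).
    """
    # Collect every page, including pages that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct outgoing links on each page.
    out_count = {page: len(set(links.get(page, ()))) for page in pages}

    # Backlinks: for each page, the pages that point to it.
    backlinks = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            backlinks[target].append(source)

    # Assumed starting point: every page begins with rank 1.0.
    ranks = {page: 1.0 for page in pages}

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each backlink contributes its source's rank, split across
            # all of that source's outgoing links.
            incoming = sum(ranks[src] / out_count[src] for src in backlinks[page])
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

In this toy graph, page C is linked from both A and B, so it ends up with the highest rank, while B, whose only backlink shares A's rank with another page, ends up lowest.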
There are concerns that Google's PageRank may not be comprehensively updated these days, as bloggers "mess things up" :)
Google's PageRankExplained is a must read.