It all sounds Geek to me! [Gautam Arora]

Thursday, April 28, 2005

The Mystery of Google's PageRank Algorithm

Chris Ridings (owner of http://www.searchenginesystems.net/), a custom search engine developer, wrote a paper about the enigmatic Google PR. The following contains parts of his original paper ("PageRank Explained or Everything you've always wanted to know about PageRank" v1.1, 9 Nov 2001):

PageRank is Google's method of measuring a page's "importance." When all other factors such as Title tag and keywords are taken into account, Google uses PageRank to adjust results so that sites that are more "important" will move up in the results page of a user's search accordingly.

That is, the order of ranking in Google works like this:

1. Find all pages matching the keywords of the search.
2. Rank accordingly using "on the page factors" such as keywords.
3. Calculate in the inbound anchor text.
4. Adjust the results by PageRank scores.


Getting listed at the top of search engines like Yahoo and Google depends on a synergy of many factors, and PR is one of them.

A few points to note before we move on:

1. PageRank is a number that assesses solely the voting ability of all incoming links to a page, and how much they recommend that page.
2. Every unique page of a site that is indexed in Google has a PageRank. People often, mistakenly, think of the PageRank of a site being the PageRank of that site’s home page.
3. Internal site links do count in passing PageRank to other pages of the site.
4. PageRank stands on its own; it's not tied in with the anchor text (titling) of links, etc. Sure, they're related, but saying they're the same thing is like saying Title tags are the same as keywords in text.


The closest we can get to knowing a page's PR is by using the Google Toolbar (but it's not completely accurate; don't expect Google to give out the real PR!).
Also note that the PR scale is not linear. Say, a jump from PR3 to PR4 is not as BIG as that from PR4 to PR5 (the closer you get to the peak, the tougher it is to climb higher!).

The formula for PR calculation is:

* PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where,

* PR(A) is the PageRank of Page A
* d is a damping factor (approx. 0.85)
* PR(T1) is the PageRank of a page pointing to Page A
* C(T1) is the number of links off that page
* PR(Tn)/C(Tn) means we do that for each page pointing to Page A
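
To see the formula in action, here is a minimal sketch that applies it repeatedly to a tiny, made-up link graph. The class name, the array representation of the links and the fixed iteration count are all my own assumptions for illustration; this is not how Google or the paper implements it.

```java
import java.util.Arrays;

// Hypothetical sketch: repeatedly applies
// PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) to a small graph.
class PageRankSketch {
    // links[i] lists the pages that page i links to (an assumed representation).
    static double[] pageRank(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0);                  // arbitrary starting guess
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - d);        // the (1-d) term
            for (int t = 0; t < n; t++) {
                for (int target : links[t]) {
                    // page t passes d * PR(t)/C(t) to every page it links to
                    next[target] += d * pr[t] / links[t].length;
                }
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Made-up graph: A -> B, A -> C, B -> C, C -> A
        int[][] links = { {1, 2}, {2}, {0} };
        double[] pr = pageRank(links, 0.85, 50);
        for (int i = 0; i < pr.length; i++) {
            System.out.printf("PR(%c) = %.3f%n", (char) ('A' + i), pr[i]);
        }
    }
}
```

In this toy graph every page has at least one outgoing link, so the PageRanks always add up to the number of pages; pages with no links out would need extra handling that this sketch leaves out.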

A crucial aspect of the PR algorithm is:
The PR of a page is a measure of its vote, which it can split between its links. Simply stated, if Page A has a PR of 1, then it can provide:

* 1 link-out of PR 0.85
* 2 link-outs of PR 0.425
* 3 link-outs of PR 0.283

Note: d=0.85
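
The split above is just d * PR / C from the formula. A tiny sketch (the class and method names are made up for this example) that reproduces the three cases:

```java
// Hypothetical helper: how much PageRank one outgoing link passes on.
class VoteShare {
    // d * PR / C, taken straight from the PageRank formula
    static double sharePerLink(double pageRank, int outLinks, double d) {
        return d * pageRank / outLinks;
    }

    public static void main(String[] args) {
        // A page with PR 1 and d = 0.85, with 1, 2 and 3 link-outs
        for (int c = 1; c <= 3; c++) {
            System.out.printf("%d link-out(s): %.3f each%n", c, sharePerLink(1.0, c, 0.85));
        }
    }
}
```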

Up to this point everything moves along smoothly, until the author introduces a parallel concept, MiniRank ("This should help us to better understand it. We'll call it MiniRank.").

The author then presents 2 iterations for 4 webpages, with the MiniRank calculation for each, and a detailed analysis of PageRank Feedback.

The extent of similarity between MR and PR is never discussed. The following issues were raised by Ian Rogers:
1. The equation for PR calculation is altered:

* PR(A) = PR(A’) + (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

which will never converge to settled values and will spiral ever upwards.
2. Erroneous analysis of Feedback Loops
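
To get a feel for the first issue, here is a rough sketch that runs both update rules on the same made-up three-page graph and compares the total PageRank. I am assuming that PR(A’) means the page's value from the previous iteration; the class, the method names and the graph are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch comparing the original update rule with the altered one.
class AlteredUpdateSketch {
    // One pass over the graph. With carryOver the previous value PR(A') is added in
    // (my reading of the altered equation); without it this is the original formula.
    static double[] step(int[][] links, double[] pr, double d, boolean carryOver) {
        int n = links.length;
        double[] next = new double[n];
        for (int a = 0; a < n; a++) {
            next[a] = (carryOver ? pr[a] : 0.0) + (1.0 - d);
        }
        for (int t = 0; t < n; t++) {
            for (int target : links[t]) {
                next[target] += d * pr[t] / links[t].length;
            }
        }
        return next;
    }

    // Sum of all PageRanks after the given number of iterations, starting from 1.0 each.
    static double totalAfter(int[][] links, double d, int iterations, boolean carryOver) {
        double[] pr = new double[links.length];
        Arrays.fill(pr, 1.0);
        for (int i = 0; i < iterations; i++) {
            pr = step(links, pr, d, carryOver);
        }
        double total = 0.0;
        for (double value : pr) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {0} };   // made-up graph: A -> B, A -> C, B -> C, C -> A
        for (int iterations : new int[] { 10, 20, 40 }) {
            System.out.printf("%d iterations: original total = %.3f, altered total = %.3f%n",
                    iterations,
                    totalAfter(links, 0.85, iterations, false),
                    totalAfter(links, 0.85, iterations, true));
        }
    }
}
```

With the extra term, every pass adds the old value back on top of the new one, so the total keeps growing instead of settling.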

The author also covers PR analysis for:
* Links to your site
* Links out from your site
* Internal structure and Linkages

The paper might not have completely solved the PR mystery, but it's a great start and a must-read, along with the original 'Google PR' paper by Sergey Brin and Lawrence Page (Google's founders).

Sunday, April 24, 2005

Web Mining and Google's PageRank

Web mining is about finding interesting content (or patterns) on the WWW. The concept is based on the data mining of large data warehouses.

The issues involved with Web Mining include:
  • Lots of distributed data
  • Volatile data
  • Unstructured and redundant data
  • Problems with quality of data
  • Heterogeneous data

But the advantages, in comparison, are:
  • Structural framework provided by HTML
  • Link structure of the web

The web mining taxonomy is:
  • Web Content Mining
    • Web Page Content Mining
    • Search Result Mining
  • Web Structure Mining
  • Web Usage Mining
    • General Access Pattern Tracking
    • Customized Usage Tracking

Let's keep our focus on 'Web Structure Mining', which mines the structure (links, graph) of the web and uses two techniques, PageRank and CLEVER.

PageRank is Google's "original" algorithm and the reason for its success as the most powerful search engine today and for years to come. (Try YaGoohoo!gle)

It's the technique used to prioritize the pages returned from a search. The importance of a page is calculated from the number of pages that point to it, i.e. its backlinks. Weighting gives more importance to backlinks coming from important pages.
The formula used for calculating a PageRank can be stated as:
  • PR(p) = c (PR(1)/N1 + ... + PR(n)/Nn)

where pages 1..n are the pages linking to p, Ni is the number of links out of page i, and c is a normalization constant.

There are concerns that Google's PageRanking may not be comprehensively updated these days as Bloggers "mess things up" :)

Google's PageRankExplained is a must read.

Saturday, April 23, 2005

Blogger Email Bug

The 'Mail-to-Blogger' functionality makes it easier to be a regular Blogger, but there seem to be some bugs that still need to be sorted out with this feature.

I have mailed the Blogger team about a bug discovered by Hrishikesh Thite (who had fallen to the dark side of 'MSN Spaces' due to the email feature, but I'm glad that he's back).

The mail to Blogger is below; it adds to my knowledge of the 'Inner Workings' of Blogger:

When I mail from my email account to Blogger, it publishes to the blog but also sends me a mail stating 'undelivered' when the post has actually been delivered. I checked the Known Issues, and this is in indirect conflict with the completed issue that 'Everyone should be able to blog via mail'. I am attaching the content of the mail below.
A Blogger Fan,
Gautam
---------------------------------------------------------
Subject: Mail Delivery System 'Undelivered Mail Returned to Sender'
*Mail Delivery System to me

This is the Postfix program at host blogger.com.

I'm sorry to have to inform you that your message could not be
be delivered to one or more recipients. It's attached below.

For further assistance, please send mail to

If you do so, please include this problem report. You can
delete your own text from the attached returned message.

The Postfix program

(expanded from
): Command died with status 1:
"IFS=' '&&exec /home/bloggermail/processmail2||exit 75 #bloggermail"


Final-Recipient: rfc822; bloggermail+gautamsarora.gforgeekpost@blogger.com
Original-Recipient: rfc822; gautamsarora.gforgeekpost@blogger.com
Action: failed
Status: 5.0.0
Diagnostic-Code: X-Postfix; Command died with status 1: "IFS=' '&&exec
/home/bloggermail/processmail2||exit 75 #bloggermail"

Post to Blogger via Email

How To:
1. Log in to Blogger
2. Click on the 'Change Settings' icon
3. Select the 'Settings' tab
4. Select the 'Email' option
5. Enter a Mail-to-Blogger Address. This is of the format "BloggerLoginName.Secret@blogger.com".
6. Save Settings and Republish

Friday, April 22, 2005

Of 'Tiger's and 'Mustang's

In what is "the most significant enhancement to the Java platform in the technology's nine-year history," Sun Microsystems introduced its new Java 2 Platform Standard Edition (J2SE) version 5.0, some time back.

J2SE v5.0 ("Tiger") is the latest major revision to the Java platform and language; it contains 15+ component JSRs with nearly 100 other significant updates developed through the Java Community Process (JCP).

Note: J2SE5.0==J2SE1.5=='Tiger'

The J2SE 5.0 release is focused along certain key themes:

  • Ease of Development
    • Metadata
    • Generic Types
    • Autoboxing and Auto-Unboxing of Primitive Types
    • Enhanced for Loop :)
    • Enumerated Types
    • Static Import
    • Formatted Output :)
    • Formatted Input :)
    • Varargs
    • Concurrency Utilities
    • rmic -- The RMI Compiler
  • Scalability and Performance
  • Monitoring and Manageability
    • New JVM Profiling API (JSR-163)
    • Improved Diagnostic Ability
  • Desktop Client
    • Swing toolkit enjoys a fresh new theme called Ocean :)
  • Miscellaneous Features
    • Core XML Support
    • Supplementary Character Support
    • JDBC RowSets
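
A few of the 'Ease of Development' items above are easiest to appreciate in code. The following is only an illustrative sketch (the class name and the values are made up) that squeezes generics, autoboxing, the enhanced for loop, enums, varargs and formatted output into one small program:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class touching several J2SE 5.0 language features.
class TigerTour {
    enum Release { TIGER, MUSTANG }                      // enumerated type

    // Varargs: callers may pass any number of ints
    static int sum(int... values) {
        int total = 0;
        for (int v : values) {                           // enhanced for loop
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> counts = new ArrayList<Integer>(); // generic type
        counts.add(15);                                  // autoboxing: int -> Integer
        counts.add(100);
        int first = counts.get(0);                       // auto-unboxing: Integer -> int
        int second = counts.get(1);
        System.out.printf("%s: %d + %d = %d%n",          // formatted output
                Release.TIGER, first, second, sum(first, second));
    }
}
```

Before Tiger, the same program would have needed casts on every list access, explicit Integer wrappers and an int[] argument for sum().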

The desktop Java client group at Sun is working on some great features and functionality for the next release of Java (code-named Mustang). The Tiger team proclaimed the 5.0 release the best; the Mustang group has more than 100 changes in mind already. Mustang is scheduled for final release in the middle of 2006, the first update to the core Java language under the group's new delivery model.

Note: J2SE6.0==J2SE1.6=='Mustang'

As part of the development process for the Mustang release of the Java 2 Platform, Standard Edition (J2SE 6.0), they are moving to a more transparent development model for both the specification and the implementation, and are seeking source code contributions to Mustang, bug reports, feature requests, etc.

This is our opportunity to contribute to the community and to help make the platform better. So we can make the world a better place and be publicly recognized for it too!

The doors are open. Source contributions are welcome. Let's participate!

Thursday, April 21, 2005

The Blogger-Java Connection

I wonder what I did, but I managed to crash Blogger (or maybe they helped me do it...).
But it brought out a very interesting result: Blogger uses a Java backend. Here is the proof:

java.lang.RuntimeException: can't load class com.google.blogger.base.User from database
at com.google.blogger.dbreader.SqlServer.query(SqlServer.java:1639)
at com.google.blogger.base.BaseUser.select(BaseUser.java:116)
at com.google.blogger.base.BaseUser.selectByUsername(BaseUser.java:95)
at com.google.blogger.base.User.load(User.java:146)
at com.google.blogger.frontend.LoginHandler.authenticateUser(LoginHandler.java:80)
at com.google.blogger.frontend.LoginHandler.perform(LoginHandler.java:111)
at com.google.servlet.handlers.ActionHandler.execute(ActionHandler.java:27)
at com.google.servlet.handlers.AuthenticatedHandler.execute
(AuthenticatedHandler.java:106)
at com.google.servlet.pipe.HandlerDispatchPipe.invoke(HandlerDispatchPipe.java:82)
at com.google.blogger.frontend.PyraJspDispatchPipe.invoke(PyraJspDispatchPipe.java:74)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.blogger.frontend.ProfileLogPipe.invoke(ProfileLogPipe.java:49)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.servlet.pipe.StatsPipe.invoke(StatsPipe.java:64)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.servlet.pipe.PendingRequestPipe.invoke(PendingRequestPipe.java:31)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.servlet.pipe.HttpRedirectPipe.invoke(HttpRedirectPipe.java:43)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.blogger.frontend.IdentityCookiePipe.invoke(IdentityCookiePipe.java:137)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.servlet.exceptionhandling.ExceptionHandlerPipe.invoke
(ExceptionHandlerPipe.java:99)
at com.google.servlet.pipe.ServletPipe.invokeNextPipe(ServletPipe.java:118)
at com.google.servlet.pipe.LocaleContextPipe.invoke(LocaleContextPipe.java:116)
at com.google.servlet.BaseServlet.doGet(BaseServlet.java:89)
at com.google.servlet.BaseServlet.doPost(BaseServlet.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter
(ApplicationFilterChain.java:247)
at org.apache.catalina.core.ApplicationFilterChain.doFilter
(ApplicationFilterChain.java:193)
at org.apache.catalina.core.StandardWrapperValve.invoke
(StandardWrapperValve.java:256)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContextValve.invoke
(StandardContextValve.java:191)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:643)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke
(AuthenticatorBase.java:494)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:641)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2415)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:643)
at org.apache.catalina.valves.ErrorDispatcherValve.invoke
(ErrorDispatcherValve.java:171)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:641)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:172)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:641)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext
(StandardPipeline.java:643)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:223)
at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:261)
at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:360)
at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:604)
at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:562)
at org.apache.jk.common.SocketConnection.runIt(ChannelSocket.java:679)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:619)
at java.lang.Thread.run(Thread.java:534)
Caused by: java.sql.SQLException: [JTurbo 3.0.2 JDBC 2.1 Driver]: Communication Link Failure. Connection refused
at com.newatlanta.jturbo.driver.DataSource.getConnection(DataSource.java)
at com.newatlanta.jturbo.driver.DataSource.getConnection(DataSource.java)
at com.newatlanta.jturbo.driver.PoolManagerDataSource.b(PoolManagerDataSource.java)
at com.newatlanta.jturbo.driver.PoolManagerDataSource.a(PoolManagerDataSource.java)
at com.newatlanta.jturbo.driver.PoolManagerDataSource.getPooledConnection
(PoolManagerDataSource.java)
at com.newatlanta.jturbo.driver.PoolManagerDataSource.getPooledConnection
(PoolManagerDataSource.java)
at com.google.blogger.dbreader.SqlServer.getConnection(SqlServer.java:259)
at com.google.blogger.dbreader.SqlServer.query(SqlServer.java:1626)
... 60 more


Leave a message if you've got more info about the 'connection'...

Add-Ons for a Geek Blog

Now a 'Geek Blog' cannot and should not look like just another blog...right?

So I went ahead and added some cool add-ons:

1. Statistics
2. Alexa Associate Tools
3. Firefox Ad Banner and Buttons
4. Chatterbox

and the search continues......

p.s. pointers to these add-ons were received from the blogs of Faelle and Ajay

A Simple Packet Sniffer using Java

This document will help you make your very own small and simple packet sniffer using Java or, more precisely, using the ‘waseda’ JPcap library.

We will develop a simple command-line packet sniffer application on the Windows platform using the WinPcap packet capture library (you can alternatively use libpcap for UNIX based machines).

The document will flow in the given style:

  • Definitions
  • Installing JPcap in Windows (which requires WinPcap installation)
  • Writing a Packet Sniffer

Definitions:


Packet Sniffing: is the process of capturing network traffic and inspecting it closely to determine what is happening on the network. A sniffer analyzes the data packets of common protocols and displays the network traffic in human-readable format.


Libpcap: a system-independent interface for user-level packet capture. It provides a set of functions, independent of the hardware and the operating system, that an application can use to capture packets from a network.

TcpDump: uses the functions exported by libpcap to capture packets, set packet filters and communicate with the network adapter.


WinPcap: is an architecture that adds to the operating systems of the Win32 family the ability to capture the data of a network using the network adapter of the machine (in other words, libpcap for Windows).


WinDump: is TcpDump for Windows from the user's point of view (the kernel part is Windows-specific and differs considerably between Windows flavors).


JPcap: is a Java class package which enables you to capture and send IP packets from a Java application. This package uses libpcap and the Raw Socket API.

P.S. Raw sockets and ICMP aren't available in Java natively, and this is where Jpcap comes to the rescue!

Installation:

1) Download and install the Java 2 Platform, Standard Edition (J2SE) JRE or SDK from java.sun.com/download

2) Download and install the latest WinPcap from http://winpcap.polito.it/install/default.htm

WinPcap 3.0 download: Windows 95/98/ME/NT/2000/XP WinPcap auto-installer (driver +DLLs)

P.S. Using the auto-installer will save you the effort of compiling it yourself

3) a. Download and extract the latest Jpcap. i.e. Jpcap ver.0.4 (Released on 4/1/03) jpcap-0.4.zip from http://netresearch.ics.uci.edu/kfujii/jpcap/doc/index.html

b. Copy "lib\Jpcap.dll" into "[JRE directory]\bin" or "[JRE directory]\lib\ext\x86"

c. Copy "lib\jpcap.jar" into "[JRE directory]\lib\ext"

d. If you installed J2SE SDK, you also need to copy "lib\jpcap.jar" into "[SDK directory]\jre\lib\ext".

Note: [JRE directory] is usually "C:\Program Files\Java\j2re*".
[SDK directory] is usually "C:\j2sdk*".
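
If you are not sure whether the jar ended up in the right place, a small check like the one below can help. It is only a sketch: the class JpcapCheck and its method are my own invention, and the class name "jpcap.Jpcap" is taken from the sniffer code in this post. It only confirms that the jar is visible to Java; it does not check that Jpcap.dll or WinPcap are installed correctly.

```java
// Hypothetical helper: checks whether a class can be found by the class loader.
class JpcapCheck {
    // Looks the class up without running its static initializers,
    // so no native library is loaded by this check.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, JpcapCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("jpcap.jar found: " + isOnClasspath("jpcap.Jpcap"));
    }
}
```

If it prints false, recheck steps 3c and 3d before moving on to the sniffer.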


Writing a packet sniffer

// 1. Import the jpcap library
import jpcap.*;

// 2. Create a class called JSniffer that implements JpcapHandler
// (this interface defines the method that analyzes the captured packets,
// which is used in Jpcap.handlePacket())
class JSniffer implements JpcapHandler
{
    // 3. The handlePacket() method is called every time a packet is captured,
    // and the parameter is the packet to be analyzed
    public void handlePacket(Packet packet)
    {
        System.out.println(packet);
    }

    // 4. The main comes now!
    public static void main(String[] args) throws java.io.IOException
    {
        // 5. getDeviceDescription() is a static method of class Jpcap
        // and can be called using the class name itself!
        // It returns the descriptions of the interfaces, which are saved in lists[]
        String[] lists=Jpcap.getDeviceDescription();
        System.out.println("\n\t\t***My Simple Network Sniffer***\n");
        System.out.println("Start capturing on "+lists[0]);

        // 6. openDevice() is a static method of the Jpcap class
        // and returns an instance of this class.
        // The parameters are in the following order:
        // (i)device (ii)snaplen (iii)promisc (iv)to_ms
        Jpcap jpcap=Jpcap.openDevice(Jpcap.getDeviceList()[0],1000,false,20);

        // 7. We use the instance returned by the openDevice() method to capture packets
        // using loopPacket(), which captures the specified number of packets consecutively.
        // The parameter list is: (i)count (ii)a Jpcap handler
        jpcap.loopPacket(-1,new JSniffer());
    }
}

  • Save the above file as JSniffer.java
  • Now go to the console window and move to the directory where you have saved JSniffer.java
  • # javac JSniffer.java
  • # java JSniffer

That’s it!

Your sniffer should now be able to sniff all packets on your network, if you are connected to one!

If you are not connected to a network, you could try ping-ing, telnet-ing or ftp-ing to local host to create a few packets that could be sniffed by the sniffer.

Next time, we will make a Simple Port Scanner!

Wednesday, April 20, 2005

Get Set ... Blog!

Here I am, trying my hand at it again... but this time more determined, more focused, more blah blah blah...

One thing about blogging I cannot do is write about myself; it's just not me!

But there are things I would like to write about, like World Peace, Animal Rights... wotever!

I am a Geek and I stand by it.

I will be dedicating my blogs to technology.

Hang on, the ride's about to begin...