WEB CRAWLER USING JAVA

Hi, today we’ll use Java to create a simple web crawler that fetches webpages recursively until it has collected 1000 URLs; this limit can be changed as per our need. The code extracts only the URL links out of the fetched pages, but it can be customized to pull out other resources instead, e.g. images or mp3 files (a rough sketch of this appears after the run instructions below).

Here is the code:

import java.net.*;
import java.io.*;

public class Crawler {
    public static void main(String[] args) throws Exception {
        final int MAX = 1000;                  // change this limit as per your need
        String[] urls = new String[MAX];       // queue of discovered links
        String url = "http://www.nandal.in";   // seed page to start from
        int j = 0, total = 0;                  // j = index of the next page to fetch
        int start, end, tmp;

        String webpage = Web.getWeb(url);
        // scan only the body of the page, skipping links in the header
        end = webpage.indexOf("<body");

        for (int i = 0; i < MAX; i++, total++) {
            start = webpage.indexOf("http://", end);
            if (start == -1) {
                // no more links on this page: fetch the next queued URL
                if (j >= i)                    // queue exhausted, stop early
                    break;
                try {
                    webpage = Web.getWeb(urls[j++]);
                } catch (Exception e) {
                    System.out.println("******************");
                    System.out.println(urls[j - 1]);
                    System.out.println("Exception caught \n" + e);
                    webpage = "";              // drop the stale page so it is not re-scanned
                }

                /* logic to fetch urls out of the body of the webpage only */
                end = webpage.indexOf("<body");
                if (end == -1)
                    end = 0;                   // no <body> tag: scan the whole page
                i--; total--;                  // reuse this slot for the new page's links
                continue;
            }
            // a link ends at the first quote (double or single) after it
            end = webpage.indexOf("\"", start);
            tmp = webpage.indexOf("'", start);
            if (tmp != -1 && (end == -1 || tmp < end))
                end = tmp;
            if (end == -1) {                   // unterminated link: skip the rest of this page
                webpage = "";
                end = 0;
                i--; total--;
                continue;
            }
            url = webpage.substring(start, end);
            urls[i] = url;
            System.out.println(urls[i]);
        }
        System.out.println("Total URLs fetched: " + total);
    }
}



/* This class contains a static method that fetches the webpage
   at the given URL and returns it as a single string. */
class Web {
    public static String getWeb(String address) throws Exception {
        StringBuilder webpage = new StringBuilder();   // faster than repeated string concatenation
        String inputLine;
        URL url = new URL(address);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        while ((inputLine = in.readLine()) != null)
            webpage.append(inputLine);
        in.close();
        return webpage.toString();
    }
}
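One thing the listing above does not do is check for duplicates: if several pages link to the same URL, it gets stored and fetched again each time. A simple way to guard against that is to remember the links we have already seen in a HashSet and skip repeats. Below is a minimal, self-contained sketch of the idea; the class name DedupDemo and the sample links are only for illustration, and the same seen.add(url) check can be dropped into the extraction loop of Crawler.

import java.util.HashSet;
import java.util.Set;

public class DedupDemo {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        String[] found = {
            "http://www.nandal.in",
            "http://example.com/a",
            "http://www.nandal.in"      // duplicate: will be skipped
        };
        for (String url : found) {
            // add() returns false when the set already contains url,
            // so each link is printed (and would be crawled) only once
            if (seen.add(url)) {
                System.out.println(url);
            }
        }
    }
}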

1. Save the code in a file named Crawler.java

2. javac Crawler.java # this compiles the program

3. java Crawler # this runs the program
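As mentioned at the top, the same scanning idea can be retargeted at other resources. Here is a rough sketch that pulls image sources instead of links, reusing the Web class from the listing above (so compile it in the same directory as Crawler.java). The class name ImageGrab and the markers it scans for (<img and src=") are my assumptions about typical HTML, not something the original code handles.

public class ImageGrab {
    public static void main(String[] args) throws Exception {
        // reuse the Web helper from the listing above
        String webpage = Web.getWeb("http://www.nandal.in");
        int start = 0, end;
        while ((start = webpage.indexOf("<img", start)) != -1) {
            // jump to this tag's src attribute
            start = webpage.indexOf("src=\"", start);
            if (start == -1) break;
            start += 5;                       // skip past src="
            end = webpage.indexOf("\"", start);
            if (end == -1) break;
            System.out.println(webpage.substring(start, end));
            start = end;                      // continue scanning after this image
        }
    }
}

The same pattern works for mp3 or any other resource: change the start marker to whatever attribute or extension you are after and cut at the closing quote, just as Crawler does for links.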

If in doubt, feel free to post your queries.
