Hi, today we’ll use Java to create a simple web crawler that fetches webpages recursively until it has fetched 1000 of them; this limit can be changed as needed. The code extracts only the URL links from the fetched pages, but it can be customized to fetch other resources, e.g. images or mp3 files.
Here is the code:
import java.net.*;
import java.io.*;

public class Crawler {
    public static void main(String[] args) throws Exception {
        String urls[] = new String[1000];
        String url = "http://www.nandal.in";   // seed URL to start crawling from
        int i = 0, j = 0, tmp = 0, total = 0, MAX = 1000;   // MAX is the fetch limit
        int start = 0, end = 0;
        String webpage = Web.getWeb(url);
        // only look for links inside the <body> of the page
        end = webpage.indexOf("<body");
        for (i = total; i < MAX; i++, total++) {
            start = webpage.indexOf("http://", end);
            if (start == -1) {
                // no more links on this page: fetch the next stored URL
                start = 0;
                end = 0;
                try {
                    webpage = Web.getWeb(urls[j++]);
                } catch (Exception e) {
                    System.out.println("******************");
                    System.out.println(urls[j - 1]);
                    System.out.println("Exception caught \n" + e);
                }
                // logic to fetch urls out of the body of the webpage only
                end = webpage.indexOf("<body");
                if (end == -1) {
                    end = start = 0;
                    continue;
                }
                // look for the first link on the freshly fetched page
                start = webpage.indexOf("http://", end);
                if (start == -1) {
                    continue;
                }
            }
            // the link ends at the first quote (single or double) after it
            end = webpage.indexOf("\"", start);
            tmp = webpage.indexOf("'", start);
            if (tmp != -1 && (end == -1 || tmp < end)) {
                end = tmp;
            }
            if (end == -1) {
                continue;   // no closing quote found, skip this fragment
            }
            url = webpage.substring(start, end);
            urls[i] = url;
            System.out.println(urls[i]);
        }
        System.out.println("Total URLs fetched: " + total);
    }
}

/* This class contains a static method that fetches the webpage at the
   given URL and returns it as a single string. */
class Web {
    public static String getWeb(String address) throws Exception {
        String webpage = "";
        String inputLine = "";
        URL url = new URL(address);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        while ((inputLine = in.readLine()) != null)
            webpage += inputLine;
        in.close();
        return webpage;
    }
}
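As mentioned above, the same idea can be pointed at other resources. As a rough sketch (the ImageLinkExtractor class and its regex are my own illustration, not part of the original crawler), here is one way to pull image links out of a fetched page using java.util.regex, reusing the Web helper class defined above:

import java.util.regex.*;

// Illustrative only: prints the src attribute of every <img> tag on a
// page. Compile alongside Crawler.java so the Web helper is available.
public class ImageLinkExtractor {
    public static void main(String[] args) throws Exception {
        String page = Web.getWeb("http://www.nandal.in");
        // match <img ... src="..."> with either quote style
        Pattern p = Pattern.compile("<img[^>]*src=[\"']([^\"']+)[\"']",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(page);
        while (m.find()) {
            System.out.println(m.group(1));   // the captured src value
        }
    }
}

The same pattern idea works for mp3s or anything else: just change the tag or file extension the regex looks for.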
1. Save the code in the file Crawler.java
2. Compile it: javac Crawler.java
3. Run it: java Crawler
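If you would rather not edit the source to change the seed URL or the 1000-page limit, here is a small variation (hypothetical; the CrawlerMain class name and the argument handling are mine, not part of the original post) that reads both from the command line and reuses the Web helper class from Crawler.java:

// Hypothetical variant: the seed URL and page limit come from the
// command line, e.g.  java CrawlerMain http://www.nandal.in 200
// Compile alongside Crawler.java so the Web helper class is available.
public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        String seed = (args.length > 0) ? args[0] : "http://www.nandal.in";
        int max = (args.length > 1) ? Integer.parseInt(args[1]) : 1000;
        String[] urls = new String[max];
        int found = 0, next = 0;
        String page = Web.getWeb(seed);
        int end = page.indexOf("<body");
        while (found < max) {
            int start = page.indexOf("http://", end);
            if (start == -1) {
                if (next >= found) break;      // nothing left to fetch
                try {
                    page = Web.getWeb(urls[next++]);
                } catch (Exception e) {
                    continue;                  // skip unreachable pages
                }
                end = page.indexOf("<body");   // search the body only
                continue;
            }
            // the link ends at the first quote after "http://"
            int dq = page.indexOf("\"", start);
            int sq = page.indexOf("'", start);
            end = (dq == -1 || (sq != -1 && sq < dq)) ? sq : dq;
            if (end == -1) { end = start + 7; continue; }
            urls[found] = page.substring(start, end);
            System.out.println(urls[found]);
            found++;
        }
        System.out.println("Total URLs fetched: " + found);
    }
}

This version only counts URLs it actually stored, so the final total matches what was printed.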
If in doubt, feel free to post your queries.