delphi3000.com - the free delphi knowledge platform
Extracting complete list of URL's from the web server
Product: Delphi 3.x (or higher)
Category: Internet / Web
Last Update: 02/27/2003
Uploader: Clever Components
Company: CleverComponents
Reference: CleverComponents.com
 
Question/Problem/Abstract:
This article describes how to extract a list of all web resources (URLs) from a web server, such as http://www.clevercomponents.com or http://www.borland.com, using the Clever Internet Suite components.
Answer:



There is considerable demand for applications that walk through a web server and collect a list of all web resources available from it. The best-known programs with this functionality are probably Teleport Pro and FlashGet.

The general idea, however, is very simple: download a web page, parse it, and extract all links and URLs.

In this article we discuss a very simple and generic algorithm based on recursively downloading and parsing web pages in asynchronous mode.

As mentioned above, the main steps of the algorithm are:
1. Download a web page / URL.
2. Parse the downloaded page and extract all URLs (you can use any method you like to parse pages).
3. Save the extracted URLs into the URL list.
4. Take the next URL from the URL list.
5. Go back to the first step, and repeat until the end of the URL list is reached.
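
Before bringing in any networking component, the five steps can be tried out as a plain loop. The following is only a minimal sketch under stated assumptions: FetchLinks is a hypothetical stand-in for steps 1 and 2 that returns the links of a tiny hard-coded site (the site.test URLs are made up), and the loop is synchronous, unlike the asynchronous version used in the rest of this article.

```pascal
program CrawlLoopSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical stand-in for steps 1 and 2: instead of downloading and
// parsing a real page, it returns the links of a tiny hard-coded site.
procedure FetchLinks(const AURL: string; ALinks: TStrings);
begin
  ALinks.Clear;
  if AURL = 'http://site.test/index.htm' then
  begin
    ALinks.Add('http://site.test/a.htm');
    ALinks.Add('http://site.test/b.htm');
  end
  else if AURL = 'http://site.test/a.htm' then
  begin
    ALinks.Add('http://site.test/index.htm'); // cross link back to the root
    ALinks.Add('http://site.test/b.htm');
  end;
end;

// Steps 3 to 5: the URL list doubles as the work queue, and Index marks
// the next URL to process. The loop ends when Index reaches the end.
procedure Crawl(const ARootURL: string; AURLList: TStrings);
var
  Index, i: Integer;
  Links: TStrings;
begin
  AURLList.Clear;
  AURLList.Add(ARootURL);
  Links := TStringList.Create;
  try
    Index := 0;
    while Index < AURLList.Count do
    begin
      FetchLinks(AURLList[Index], Links);
      for i := 0 to Links.Count - 1 do
        if AURLList.IndexOf(Links[i]) < 0 then // skip URLs already listed
          AURLList.Add(Links[i]);
      Inc(Index);
    end;
  finally
    Links.Free;
  end;
end;

var
  URLs: TStringList;
  n: Integer;
begin
  URLs := TStringList.Create;
  try
    Crawl('http://site.test/index.htm', URLs);
    for n := 0 to URLs.Count - 1 do
      WriteLn(URLs[n]);
  finally
    URLs.Free;
  end;
end.
```

Because new URLs are appended to the same list that is being walked, the loop ends on its own once no page contributes an unseen link.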

The first step can be implemented with the Clever Downloader component. In addition to the base functionality provided by other popular libraries (such as Indy, IPWorks and so on), Clever Downloader lets you download a web page / URL in asynchronous mode without interfering with the main application process. When the download completes, the component's OnProcessCompleted event occurs. To implement recursive downloading we also need the OnIsBusyChanged event: starting the next download from OnIsBusyChanged protects us from stack overflow while crawling through the server's URLs.

Here is the code for the first step of our algorithm:


clDownLoader: TclDownLoader;
memURLList: TMemo;  

...

procedure TForm1.btnStartClick(Sender: TObject);
begin
   if clDownLoader.IsBusy then Exit;
   memURLList.Lines.Clear();
   memURLList.Lines.Add(edtRootURL.Text);
   FCurrentURLIndex := -1;
   ProcessNextURL();
end;  

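procedure TForm1.ProcessNextURL();
begin
   // Advance to the next URL in the list that looks like an HTML page.
   // Note: the root URL goes through this check too, so it must contain
   // '.asp' or '.htm' or it will be skipped.
   repeat
      Inc(FCurrentURLIndex);
   until (FCurrentURLIndex >= memURLList.Lines.Count)
      or (Pos('.asp', memURLList.Lines[FCurrentURLIndex]) > 0)
      or (Pos('.htm', memURLList.Lines[FCurrentURLIndex]) > 0);
   if (FCurrentURLIndex < memURLList.Lines.Count) then
   begin
      // Start the asynchronous download of the selected URL.
      clDownLoader.URL := memURLList.Lines[FCurrentURLIndex];
      clDownLoader.Start();
   end else
   begin
      ShowMessage('Process Completed');
   end;
end;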


The main loop in the ProcessNextURL method searches for the next URL in the URL list that looks like an HTML page (in this example we just check the URL for a page extension). We simplified this method, but if you need more advanced analysis you can easily modify it according to your needs.
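
As one possible direction for such a modification, here is a hedged sketch of a slightly stricter check. LooksLikeHtmlPage is a hypothetical helper that is not part of the original sample, and the list of accepted extensions is our assumption. It looks only at the extension of the path, so a query string such as ?file=a.htm no longer causes a false match, and it ignores letter case.

```pascal
program PageFilterSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: True when the path part of AURL ends in one of the
// page extensions we want to download (the extension list is an assumption).
function LooksLikeHtmlPage(const AURL: string): Boolean;
var
  Path, Ext: string;
  P: Integer;
begin
  Path := AURL;
  // Cut off the fragment and the query string first.
  P := Pos('#', Path);
  if P > 0 then
    Path := Copy(Path, 1, P - 1);
  P := Pos('?', Path);
  if P > 0 then
    Path := Copy(Path, 1, P - 1);
  // Scan backwards for the last '.', stopping at the last '/'.
  Ext := '';
  for P := Length(Path) downto 1 do
  begin
    if Path[P] = '/' then
      Break;
    if Path[P] = '.' then
    begin
      Ext := LowerCase(Copy(Path, P, MaxInt));
      Break;
    end;
  end;
  Result := (Ext = '.htm') or (Ext = '.html') or (Ext = '.asp');
end;

begin
  WriteLn(LooksLikeHtmlPage('http://www.site.com/index.asp?id=1'));
  WriteLn(LooksLikeHtmlPage('http://www.site.com/Default.HTM'));
  WriteLn(LooksLikeHtmlPage('http://www.site.com/logo.gif'));
  WriteLn(LooksLikeHtmlPage('http://www.site.com/get.php?file=a.htm'));
end.
```

Like the original check, this rejects a bare host URL such as http://www.site.com, so the root URL still needs separate handling if it has no page name.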

When the download completes, the OnIsBusyChanged event occurs. Here is the code for this event handler:


procedure TForm1.clDownLoaderIsBusyChanged(Sender: TObject);
var
   List: TStrings;
begin
   if clDownLoader.IsBusy then Exit;
   if FileExists(clDownLoader.LocalFile) then
   begin
      List := TStringList.Create();
      try
         List.LoadFromFile(clDownLoader.LocalFile);
         ExtractURLS(List, memURLList.Lines);
      finally
         List.Free();
      end;
      DeleteFile(clDownLoader.LocalFile);
   end;
   ProcessNextURL();
end;  


You can extract URLs from the downloaded web page with whatever method you like most. In our example we used simple page parsing. The full source code can be downloaded as urlextractor.zip.


procedure TForm1.ExtractURLS(APage, AURLList: TStrings);
var
   i: Integer;
   List: TStrings;
begin
   List := TStringList.Create();
   try
      ParsePage(APage, List);
      for i := 0 to List.Count - 1 do
      begin
         if (AURLList.IndexOf(List[i]) < 0) then
         begin
            AURLList.Add(List[i]);
         end;
      end;
   finally
      List.Free();
   end;
end;
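
The ParsePage routine itself is only available in the downloadable source. To give an idea of what "simple page parsing" can look like, here is a minimal sketch that is not the ParsePage from urlextractor.zip: ExtractHrefs is a hypothetical helper that collects the values of double-quoted href attributes, one line at a time. It does not resolve relative links, and it misses attributes that use single quotes, no quotes, or that are split across lines.

```pascal
program HrefSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper, not the ParsePage routine from the sample:
// adds the value of every double-quoted href attribute to ALinks.
procedure ExtractHrefs(APage, ALinks: TStrings);
const
  Marker = 'href="';
var
  i, StartPos, EndPos: Integer;
  Line, Lower: string;
begin
  for i := 0 to APage.Count - 1 do
  begin
    Line := APage[i];
    Lower := LowerCase(Line); // same length as Line, used for searching
    StartPos := Pos(Marker, Lower);
    while StartPos > 0 do
    begin
      // Drop everything up to and including the marker.
      Delete(Line, 1, StartPos + Length(Marker) - 1);
      Delete(Lower, 1, StartPos + Length(Marker) - 1);
      EndPos := Pos('"', Line);
      if EndPos = 0 then
        Break; // unterminated attribute: give up on this line
      if EndPos > 1 then
        ALinks.Add(Copy(Line, 1, EndPos - 1));
      Delete(Line, 1, EndPos);
      Delete(Lower, 1, EndPos);
      StartPos := Pos(Marker, Lower);
    end;
  end;
end;

var
  Page, Links: TStringList;
  n: Integer;
begin
  Page := TStringList.Create;
  Links := TStringList.Create;
  try
    Page.Add('<a href="a.htm">A</a> <A HREF="http://x.test/b.htm">B</A>');
    Page.Add('<img src="x.gif">');
    ExtractHrefs(Page, Links);
    for n := 0 to Links.Count - 1 do
      WriteLn(Links[n]);
  finally
    Links.Free;
    Page.Free;
  end;
end.
```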


When parsing URL content, please pay attention to the following issues:

1. Check for duplicate URL entries before adding a URL to the URL list. Most web pages contain cross-linked references within the web site.
2. Check for links to foreign web sites. This prevents the crawler from jumping to another web server while crawling through the specified web site.
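
For the first point, the IndexOf call on the memo lines scans the whole list for every link. On a larger site a separate sorted list of already seen URLs may be worth considering. The sketch below is one possible way to do that, not the approach of the original sample: AddIfNew is a hypothetical helper, and dropping the #fragment before comparing is our own normalization choice. Note that TStringList compares case-insensitively by default, so URLs that differ only in letter case are treated as duplicates here.

```pascal
program SeenListSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: appends AURL to AQueue only if it was not seen
// before. ASeen must have Sorted = True so that Find can do a binary search.
function AddIfNew(const AURL: string; AQueue: TStrings;
  ASeen: TStringList): Boolean;
var
  Key: string;
  P, FoundAt: Integer;
begin
  Key := Trim(AURL);
  // page.htm and page.htm#top name the same resource: drop the fragment.
  P := Pos('#', Key);
  if P > 0 then
    Key := Copy(Key, 1, P - 1);
  Result := (Key <> '') and not ASeen.Find(Key, FoundAt);
  if Result then
  begin
    ASeen.Add(Key);  // kept sorted for fast lookups
    AQueue.Add(Key); // kept in discovery order for crawling
  end;
end;

var
  Queue, Seen: TStringList;
begin
  Queue := TStringList.Create;
  Seen := TStringList.Create;
  try
    Seen.Sorted := True;
    WriteLn(AddIfNew('http://site.test/a.htm', Queue, Seen));
    WriteLn(AddIfNew('http://site.test/a.htm#top', Queue, Seen));
    WriteLn(AddIfNew('http://site.test/b.htm', Queue, Seen));
    WriteLn(Queue.Count);
  finally
    Seen.Free;
    Queue.Free;
  end;
end.
```

Keeping two lists costs some memory, but it leaves the crawl order untouched, which matters because the article's algorithm walks the URL list by index.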

To check whether a link belongs to the web site that hosts the page, you can simply compare the host parts of both URLs (for example, http://www.site.com and http://www.site.com/index.asp have the same host). Clever Internet Suite has the TclURLParser class, designed specifically for this purpose.


function TForm1.IsURLNative(const AURL, ABaseURL: string): Boolean;
var
   URLParser, BaseURLParser: TclURLParser;
begin
   Result := False;
   URLParser := TclURLParser.Create();
   BaseURLParser := TclURLParser.Create();
   try
      BaseURLParser.Parse(ABaseURL);
      if (URLParser.Parse(AURL) <> '') then
      begin
         Result := (URLParser.Host = BaseURLParser.Host);
      end;
   finally
      BaseURLParser.Free();
      URLParser.Free();
   end;
end;  
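
If you use another internet library, the same host comparison can be approximated with plain string handling. The following is a rough sketch and not a full URL parser: HostOf and IsSameHost are hypothetical helpers, and they assume absolute URLs with a "scheme://" prefix. User info and a port number are stripped, IPv6 literals are not handled, and www.site.com and site.com count as different hosts.

```pascal
program HostCompareSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the host part of an absolute URL in lower
// case, or '' when AURL has no "scheme://" prefix.
function HostOf(const AURL: string): string;
var
  P: Integer;
  C: Char;
begin
  Result := '';
  P := Pos('://', AURL);
  if P = 0 then
    Exit;
  Result := Copy(AURL, P + 3, MaxInt);
  // The authority part ends at the first '/', '?' or '#'.
  for P := 1 to Length(Result) do
  begin
    C := Result[P];
    if (C = '/') or (C = '?') or (C = '#') then
    begin
      Result := Copy(Result, 1, P - 1);
      Break;
    end;
  end;
  // Strip optional "user:password@" and ":port".
  P := Pos('@', Result);
  if P > 0 then
    Delete(Result, 1, P);
  P := Pos(':', Result);
  if P > 0 then
    Result := Copy(Result, 1, P - 1);
  Result := LowerCase(Result);
end;

// True when both URLs are absolute and point to the same host.
function IsSameHost(const AURL, ABaseURL: string): Boolean;
var
  Host: string;
begin
  Host := HostOf(AURL);
  Result := (Host <> '') and (Host = HostOf(ABaseURL));
end;

begin
  WriteLn(HostOf('http://www.site.com/index.asp'));
  WriteLn(IsSameHost('http://www.site.com', 'http://www.site.com/index.asp'));
  WriteLn(IsSameHost('http://www.site.com/', 'http://www.borland.com/'));
end.
```

Relative links such as index.asp have no host of their own, so they need to be resolved against the page URL before a check like this can be applied.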


The given example is far from perfect and might require additional alterations. You can always extend this code to get the functionality you need.

We provided two examples for your convenience:
1. Delphi urlextractor.zip
2. C++ Builder urlextractor_bcb.zip

Enjoy,

Sergey S
Clever Components team.
Please write to info@clevercomponents.com








Comments to this article
Another Spam?
    Benoit Standaert (Sep 25 2003 5:11AM)

With the source of the sample, described in this article, you will need a component NOT included inside the sample. I SAID NO... stop this commercial attitude.

Posting Rules...
    Stefan Stefanov (Feb 26 2003 3:49PM)

Is this article OK with the posting rules? Maybe it is, I never read them. But it is 100% an advertisement for a commercial component. It is certainly not OK with my concept for this site...
Every commercial component has a demo - why not post all of them here?

RE: Posting Rules...
Sergey S (Feb 26 2003 5:18PM)

The main purpose of this article is to provide an algorithm for recursive URL extraction. You can implement it with any other internet library you like. In this article I used the Clever Internet Suite because it was more suitable for me.














 
 
   














 







     
  Copyright © 2000 - 2007 delphi3000.com - All rights reserved. Terms of use. || Privacy
delphi3000.com is a service by bluestep.com IT-Services GmbH (Vienna)