I’m working on a spider/bot type program that needs to gather information from the web.
Given a few links…
[info 1]
[info 2]
[info 3]
.
.
.
It should follow those links and extract information from the HTML pages they link to. These HTML pages themselves contain information queried from an SQL database.
Basically, I want to “intercept” each page so that it passes through my program for parsing. Any ideas on how to do this?
That shouldn’t be too hard of a task - what platform/language are you trying to use?
I’m working on a program to grab my library card data from the epl website. Can’t decide whether to use perl or python, but python looks something like this:
Just make sure you know enough about the robots.txt file to keep everyone happy.
I’ve done this in Cold Fusion, but wouldn’t want to try it with VB or other Windows programming languages.
There is a tag called cfhttp that will go to any address you define and capture the entire document. The document will come back as a string.
Once you have that string, all you have to do is find reliable start and end points for your data. You do a string find for the start point, set a marker, continue searching past that point for the end point, set another marker, then grab all the text between those two markers. Store that info (it may require trimming) and continue on until you have all the data elements you need.
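For anyone who wants to see the marker approach written out, here is a rough Python sketch of the same idea. The function name `extract_between` and the sample markup are made up for illustration; real pages will need markers chosen by looking at the actual HTML.

```python
def extract_between(text, start_marker, end_marker):
    """Return every trimmed substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        # Find the next start marker at or after the current position.
        start = text.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        # Find the end marker that follows it.
        end = text.find(end_marker, start)
        if end == -1:
            break
        results.append(text[start:end].strip())
        # Continue searching past the end marker.
        pos = end + len(end_marker)
    return results


# Hypothetical page fragment, standing in for the string returned by the HTTP fetch.
page = "<td class='title'> Dune </td><td class='title'>Emma</td>"
titles = extract_between(page, "<td class='title'>", "</td>")
```

This only works as long as the markers stay stable. If the site changes its layout, the markers have to be updated by hand.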
You can do this in other languages, but I can do it in 50 lines of Cold Fusion.
Actually, this is pretty simple in VB. You can use the wininet.dll functions InternetOpen, InternetOpenUrl, and InternetReadFile to open a connection, specify a URL, and read the contents of the data returned from that URL. Once you’ve got the returned page in a string, you can easily parse it for other links using VB string functions like InStr, Mid, etc. There are also a lot of third-party components usable from VB which make HTTP queries somewhat more transparent (e.g. Catalyst SocketTools), but they don’t do anything you can’t do using plain VB and wininet. FTR, one version of the wininet code I wrote to read and parse a URL is 47 lines, and most of that is boilerplate, so this compares well to CF if the number of lines is your criterion. You can do the same thing in Java using java.net.URLConnection. In Perl, you could use LWP or several similar libraries. My code samples in both Java and Perl are single-digit line counts.
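Since Python came up earlier in the thread, here is a hedged sketch of the same fetch-then-find-links step using only its standard library. The names `LinkCollector`, `extract_links`, and `fetch` are my own, and the example.com URLs are placeholders; treat this as one possible shape, not a finished spider.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute ones.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch(url):
    """Download a page and return it as a string (not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Placeholder markup so the link extraction can be tried without a network call.
sample = '<a href="/books?id=1">One</a> <a href="http://example.org/x">Two</a>'
links = extract_links(sample, "http://example.com/catalog/")
```

Using a real HTML parser instead of raw string searches costs a few more lines, but it copes better with attribute order and quoting differences.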
I dare say any language purporting to be useful in this decade has a method for making HTTP queries. For specific syntax, we need to know the language the OP would like to use.
This is a very important note. Especially if you’re retrieving content from a dynamically generated site (pages generated from a database), a bot that isn’t well-behaved can hammer a site in short order by trying to query every link. There have been several cases where database-driven sites which were never intended for heavy use were crashed by a bot, which is just rude when it’s so easy to make a bot respect the rules and/or throttle its queries to avoid overloading the target site. You want your bot to be unobtrusive so you’re not accused of launching a denial-of-service attack.
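To make the "respect the rules and throttle" advice concrete, here is a minimal Python sketch. It assumes the robots.txt text has already been downloaded; the user agent string, the `Throttle` class, and the sample rules are all hypothetical, and the delay value is something you would pick to suit the target site.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-spider"  # hypothetical bot name


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Throttle:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = None

    def wait(self):
        now = time.monotonic()
        if self.last_request is not None:
            remaining = self.delay - (now - self.last_request)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request = time.monotonic()


# Sample rules standing in for a site's real robots.txt.
robots_txt = """User-agent: *
Disallow: /private/
"""
rules = build_rules(robots_txt)
allowed = rules.can_fetch(USER_AGENT, "http://example.com/catalog/1")
blocked = rules.can_fetch(USER_AGENT, "http://example.com/private/1")
```

In a crawl loop you would check `can_fetch` before each URL and call `wait()` on a shared `Throttle` right before each request.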
I don’t want to get into CF vs. VB debate. I’ve developed in both for a long time now, and for a beginner, I have found that someone who knows HTML will pick up CF much faster than VB.
For example, in CF, it takes exactly 1 line and no components to load a URL into a string. Once you have the info out, it takes just one more line to toss it into a database. The middle stuff is variable. (FTR, I’ve done this in 5 lines with no extra components.)
With ASP/VBScript, it’s two lines. One to instantiate Microsoft’s XMLHTTP object and another to load the content from a URL. This is intended to facilitate XML transfer and web services, but it will work with any URL and any return data type.
I’m not trying to debate the advantages of any language either. Quite the contrary, I’m trying to point out that almost any modern language will have this functionality easily available. I only replied in the first place because your post gave the impression that CF had some sort of advantage in this area.
The minimalist example is two lines, but here’s a code sample I use that has a bit more. This is VBScript for use in ASP:
set http = Server.CreateObject("Microsoft.xmlhttp")
'set http = Server.CreateObject("msxml.xmlhttp")
'set http = Server.CreateObject("msxml2.xmlhttp")
'set http = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
url = "http://www.yahoo.com/"
' URL-encode the values so characters like & and = cannot break the key/value pairs
req = "field1=" & Server.URLEncode(myField1) & "&field2=" & Server.URLEncode(myField2)
method = "POST"
if method = "POST" then
    ' The third argument (False) makes the request synchronous
    http.open "POST", url, False
    http.setRequestHeader "Content-type", "application/x-www-form-urlencoded"
    http.setRequestHeader "Content-length", Len(req)
    http.send(req)
else
    http.open "GET", CStr(url & "?" & req), False
    http.send
end if
' The body of the response comes back as a string
ret = http.responseText
set http = nothing
Note the first block has three commented-out instantiations with different object names than the top line. Which one you use depends on the version of MSXML you have installed.
I included both GET and POST samples, but the choice to use POST is hard-coded in this case.
The “req” (for request) variable contains any data (POST or GET) you want to submit in the form of key/value pairs, just like you would put it on a URL querystring. The content of the page query is returned in the “ret” (for return) variable. In this case, if you did a Response.Write(ret), you would see the Yahoo! homepage.
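For comparison, here is a rough Python sketch of the same GET/POST choice with key/value pairs. The helper name `build_request` and the field values are invented for illustration. The sketch only builds the request objects; passing one to `urllib.request.urlopen` would actually send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, fields, method="POST"):
    """Build a GET or POST request carrying the given key/value pairs."""
    # urlencode escapes the values so they are safe in a query string or form body.
    query = urlencode(fields)
    if method == "POST":
        return Request(
            url,
            data=query.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    return Request(url + "?" + query, method="GET")


# Hypothetical field values, chosen to show the escaping.
fields = {"field1": "a b", "field2": "x&y"}
post_req = build_request("http://www.yahoo.com/", fields, "POST")
get_req = build_request("http://www.yahoo.com/", fields, "GET")
```

For POST the pairs travel in the request body, and for GET they are appended to the URL, which mirrors the two branches of the VBScript sample.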
If you are using ASP, you can write your script in Perlscript instead of VBScript and gain access to an enormous number of Perl libraries. If you have Perlscript available, I’ll be happy to provide code samples for that. In fact, you can mix Perlscript and VBScript within an ASP page and get the simplicity of VBScript for some functions and the power of Perl for harder bits.
What are you already comfortable with code-wise? I’ve done what you need in Perl, and it sounds like there are other code tidbits available in other languages here for the asking.
I know you’re just being snide, but this control was available in VB6, which has been around long enough that I don’t even have a release date handy, though it was circa 1998-9. The wininet functionality was available long before that; it has slightly more complicated syntax but gives more low-level control.