I need to capture the contents of a certain web page (single) every day at 7PM. What I need is a “script” that runs every week day (M-F) at 7PM, accesses the web page and saves the page on my PC as WP2003-11-12.htm (for example). I’m running XP with a DSL connection. Any ideas? Thanks.
If you were running Linux it would be very easy to use a cron job that calls wget.
Apparently wget also exists in a Windows version, and I believe most Windows versions in the NT branch ship with some kind of scheduler by default, but I have no idea how to use it.
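For reference, the crontab entry might look something like this (the save path is just a placeholder, and the backslashes are needed because cron treats a bare % specially):

# Run Monday to Friday at 7 PM, saving the page under a dated filename
0 19 * * 1-5 wget -O /home/user/pages/WP$(date +\%Y-\%m-\%d).htm http://www.straightdope.com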
The following script should do the job:
Dim oXMLHTTP, sHTML
Dim oFS, oTS, sFileName, sFilePath

' Fetch the page synchronously over HTTP
Set oXMLHTTP = CreateObject("MSXML2.XMLHTTP.3.0")
oXMLHTTP.open "GET", "http://www.straightdope.com", False
oXMLHTTP.send
sHTML = oXMLHTTP.responseText
Set oXMLHTTP = Nothing

' Build a dated filename like WP2003-11-12.htm (month and day zero-padded)
sFileName = "WP" & CStr(Year(Date)) & "-" & Right("0" & CStr(Month(Date)), 2) & "-" & Right("0" & CStr(Day(Date)), 2) & ".htm"
sFilePath = "c:\"

' Write the HTML out as a Unicode text file, overwriting any existing copy
Set oFS = CreateObject("Scripting.FileSystemObject")
Set oTS = oFS.CreateTextFile(sFilePath & sFileName, True, True)
oTS.Write CStr(sHTML)
oTS.Close
Set oTS = Nothing
Set oFS = Nothing
Cut and paste it into Notepad and save it with a ".VBS" extension. It's pretty clear where the source URL and the destination path are set up, so just edit those to be what you need and schedule the run with the Windows Task Scheduler.
Note that it will save the files in the system Unicode format, just in case you need to be scraping Chinese websites as well.
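If you'd rather set the schedule up from a command prompt than through the Control Panel applet, XP's AT command should also do the trick; a sketch, assuming you saved the script as c:\savepage.vbs:

at 19:00 /every:M,T,W,Th,F "cscript.exe //nologo c:\savepage.vbs"

That covers the Monday-to-Friday 7 PM requirement.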
WOW WOW WOW!!!
That works like a champ! I just need to get around a couple Norton complaints, but that’s not a problem.
Many humble gratitudes in your direction, Armilla.
Popup, thanks for your suggestion too. I'm an old Unix hack myself; I think I could have pulled this off on Unix fairly easily.
On Unix? It must be a problem that Perl can solve.
You’re welcome, ccwaterback.
If your software firewall is complaining, you'll need to give "cscript.exe", or maybe "wscript.exe", access to the internet. These are the two versions of the scripting host that run VBS files (cscript runs as a command-line interface, wscript runs as a windowed interface and shows messages as popups).
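You can also test-run the script by hand from a command prompt before scheduling it, e.g. (assuming it's saved as c:\savepage.vbs):

cscript //nologo c:\savepage.vbs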
After I “authorize” the script with Norton, it seems to run just fine. The only problem I have now is that the embedded links in the web page are getting scrunched.
If a link in the web page is (for example) http://www.server.com/q/a=b when the page is saved to my PC that link becomes /q/a=b. So the http://www.server.com is getting stripped somehow. Any ideas on fixing this problem? I appreciate it.
My guess is that the links are relative. Although your browser may display the absolute URL of the link when you move the mouse over it, most probably only the relative part of the link is actually given in the HTML code (view the source of the page and check). By default, your browser will use the URL of the page in order to resolve relative links, but you can override this behaviour if you insert the following line into the HEAD section of the HTML file:
<BASE HREF="http://www.straightdope.com" />
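If you'd rather not paste that in by hand every day, here's a minimal sketch of how the script could splice the tag in itself, assuming sHTML holds the downloaded page as in the script above:

Dim oBaseRE
Set oBaseRE = New RegExp
oBaseRE.IgnoreCase = True          ' the tag may appear as <head>, <HEAD>, etc.
oBaseRE.Pattern = "<head([^>]*)>"  ' match the opening HEAD tag, keeping any attributes
' Global defaults to False, so only the first occurrence is replaced
sHTML = oBaseRE.Replace(sHTML, "<head$1><BASE HREF=""http://www.straightdope.com"" />")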
The links are being saved that way because that is likely how they are coded in the HTML. Nothing's being changed during the save; rather, that is precisely how the code for the page looks. It's conventional to use relative hyperlinks, which point to a location based on the current page, like /q/a=b, as opposed to absolute hyperlinks, which give the full address, such as http://www.server.com/q/a=b
It's like in DOS: if you're in c:\porn\britney and you want to get to c:\porn\jenna, you could type "cd c:\porn\jenna" or you could just type "cd ..\jenna".
That kind of relative linking stops making sense away from the source location. Typing "cd ..\jenna" is useless in c:\cars\porsche.
One reason for relative hyperlinks is that they help with site portability. If the contents of your web site stay the same, but you put them on a different server, the relative hyperlinks still make sense. If you had coded with absolute links, you'd have to go through and change the server name on every single link on every page.
So the relative hyperlinks in the webpage only fail because there's no page in the proper relative place on your machine. But the page you're downloading is right.
Hope that helps…
When I do a "save as" from the browser, the links look fine.
<a href="http://www.server.com/q/a=b"> link </a>
But if I save the web page with the VBScript my links get scrunched.
<a href="/q/a=b"> link </a>
Strange.
I opened the saved html files with my text editor.
Here’s the target site:
Well, if you View Source on that page, the hyperlinks are definitely relative.
For example, the "NASDAQ" links look like this, obviously a relative link:
<a href="/a0?d=t&o=l:0">NASDAQ</a>
Your browser may automatically put the base server on the relative links when you save the page to your computer, so that the links still work.
What you want is for your script to do the same thing.
If you use wget, it has a command-line option --convert-links, which rewrites the links as necessary (if you download multiple files, it will keep relative links between the downloaded files, but change links to files that weren't downloaded into absolute ones, adding the hostname etc.).
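For example, for a single page (the URL here is just a placeholder, and wget derives the output filename from it unless told otherwise):

wget --convert-links http://www.example.com/page.html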
I added the following replacement, and it seems to work well now.
sNewHTML = Replace(CStr(sHTML), "/q", "http://finance.yahoo.com/q")
Thanks to all who scratched their heads.
Some of the links on the top of the page still don’t work, but I only care about the links in the “tables”.
I’ve done a couple of small tweaks to the script:
CONST URL = "http://finance.yahoo.com/a3?o=l:0&d=t"
CONST DEST_FOLDER = "C:\"

' Splits a URL into protocol, host, path and query string
CONST RE_URL = "([a-z].*?)://([^/\?]*)((.*?)(\?|$)){0,1}(.*$){0,1}"
' Matches root-relative hrefs (href="/...") that don't already carry a protocol
CONST RE_HREF_ROOTREL = "(<a(\s*?)href=)(""|')(?!(([a-zA-Z].*?):))/(.*?)\3"
' Matches document-relative hrefs (no leading slash or protocol); note that a
' relative href containing a colon anywhere is mistaken for absolute and left alone
CONST RE_HREF_REQREL = "(<a(\s*?)href=)(""|')(?!(([a-zA-Z].*?):))(?!/)(.*?)\3"

Dim oXMLHTTP, sHTML, oRE, oMatch
Dim oFS, oTS, sFileName, sFilePath
Dim sURL, sQS, i
Dim sProtocol, sHost, sPath

' Pull the source URL apart so we know what to prepend to relative links
Set oRE = New RegExp
oRE.IgnoreCase = True
oRE.Global = True
oRE.Pattern = RE_URL
If Not oRE.Test(URL) Then
    WScript.Echo "Please make sure to include the protocol in the url (http://, https:// etc)"
    WScript.Quit
End If
Set oMatch = oRE.Execute(URL).Item(0)
sProtocol = oMatch.SubMatches(0)
sHost = oMatch.SubMatches(1)
sPath = oMatch.SubMatches(3)
sQS = oMatch.SubMatches(5)

' Reduce the path to its directory portion (up to and including the last slash)
' so document-relative links resolve against the right folder
If InStrRev(sPath, "/") > 0 Then
    sPath = Left(sPath, InStrRev(sPath, "/"))
Else
    sPath = "/"
End If

' Fetch the page synchronously
Set oXMLHTTP = CreateObject("MSXML2.XMLHTTP.3.0")
oXMLHTTP.open "GET", URL, False
oXMLHTTP.send
sHTML = oXMLHTTP.responseText
Set oXMLHTTP = Nothing

' Expand relative hrefs into absolute ones (sPath starts and ends with "/")
oRE.Pattern = RE_HREF_REQREL
sHTML = oRE.Replace(sHTML, "$1$3" & sProtocol & "://" & sHost & sPath & "$6$3")
oRE.Pattern = RE_HREF_ROOTREL
sHTML = oRE.Replace(sHTML, "$1$3" & sProtocol & "://" & sHost & "/" & "$6$3")

' Build a dated filename like WP2003-11-12.htm (month and day zero-padded)
sFileName = "WP" & CStr(Year(Date)) & "-" & Right("0" & CStr(Month(Date)), 2) & "-" & Right("0" & CStr(Day(Date)), 2) & ".htm"
sFilePath = DEST_FOLDER

' Write the result out as a Unicode text file, overwriting any existing copy
Set oFS = CreateObject("Scripting.FileSystemObject")
Set oTS = oFS.CreateTextFile(sFilePath & sFileName, True, True)
oTS.Write CStr(sHTML)
oTS.Close
Set oTS = Nothing
Set oFS = Nothing
It should correctly expand relative links into their full path now. I ran it against the page you specified and it seems to bring back a version with all the links working.
I’ve also moved the URL and destination folders into constants at the top of the file, but it could just as easily accept them as command line parameters if that’s necessary.
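A minimal sketch of that variant, in case anyone wants it: drop the two CONSTs at the top and read the values from the arguments instead (the names stay the same so the rest of the script works unchanged; "savepage.vbs" is whatever you named the file):

Dim URL, DEST_FOLDER
If WScript.Arguments.Count >= 2 Then
    URL = WScript.Arguments(0)
    DEST_FOLDER = WScript.Arguments(1)
Else
    WScript.Echo "Usage: cscript savepage.vbs <url> <destination folder>"
    WScript.Quit
End If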
Hope this helps.
Thanks Armilla, you rock.