I need to capture the contents of a certain web page (single) every day at 7PM. What I need is a “script” that runs every week day (M-F) at 7PM, accesses the web page and saves the page on my PC as WP2003-11-12.htm (for example). I’m running XP with a DSL connection. Any ideas? Thanks.
If you were running Linux it would be very easy to use a cron job that calls wget.
Apparently wget also exists in a Windows version, and I believe most Windows versions in the NT branch ship with some kind of scheduler by default, but I have no idea how to use it.
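For reference, the crontab entry might look something like this (the save path is just a placeholder, and the backslashes are needed because cron treats a bare % specially):

# Run Monday to Friday at 7 PM, saving the page under a dated filename
0 19 * * 1-5 wget -O /home/user/pages/WP$(date +\%Y-\%m-\%d).htm http://www.straightdope.com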
The following script should do the job:
Dim oXMLHTTP, sHTML
Dim oFS, oTS, sFileName, sFilePath

' Fetch the page synchronously over HTTP
Set oXMLHTTP = CreateObject("MSXML2.XMLHTTP.3.0")
oXMLHTTP.open "GET", "http://www.straightdope.com", False
oXMLHTTP.send
sHTML = oXMLHTTP.responseText
Set oXMLHTTP = Nothing

' Build a dated filename like WP2003-11-12.htm (month and day zero-padded)
sFileName = "WP" & CStr(Year(Date)) & "-" & Right("0" & CStr(Month(Date)), 2) & "-" & Right("0" & CStr(Day(Date)), 2) & ".htm"
sFilePath = "c:\"

' Write the HTML out as a Unicode text file, overwriting any existing copy
Set oFS = CreateObject("Scripting.FileSystemObject")
Set oTS = oFS.CreateTextFile(sFilePath & sFileName, True, True)
oTS.Write CStr(sHTML)
oTS.Close
Set oTS = Nothing
Set oFS = Nothing
Cut and paste it into Notepad and save it with a ".VBS" extension. It's pretty clear where the source URL and the destination path are set up, so just edit those to be what you need and schedule the run with the Windows Task Scheduler.
Note that it will save the files in the system Unicode format, just in case you need to be scraping Chinese websites as well.
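If you'd rather set the schedule up from a command prompt than through the Control Panel applet, XP's AT command should also do the trick; a sketch, assuming you saved the script as c:\savepage.vbs:

at 19:00 /every:M,T,W,Th,F "cscript.exe //nologo c:\savepage.vbs"

That covers the Monday-to-Friday 7 PM requirement.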
WOW WOW WOW!!!
That works like a champ! I just need to get around a couple Norton complaints, but that’s not a problem.
Many humble gratitudes in your direction, Armilla.
Popup, thanks for your suggestion too. I'm an old Unix hack myself; I think I could have pulled this off on Unix fairly easily.
On Unix? It must be a problem that Perl can solve.
You’re welcome, ccwaterback.
If your software firewall is complaining, you'll need to give "cscript.exe", or maybe "wscript.exe", access to the internet. These are the two versions of the scripting host that run VBS files (cscript runs as a command-line interface, wscript runs as a windowed interface and shows messages as popups).
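You can also test-run the script by hand from a command prompt before scheduling it, e.g. (assuming it's saved as c:\savepage.vbs):

cscript //nologo c:\savepage.vbs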
After I “authorize” the script with Norton, it seems to run just fine. The only problem I have now is that the embedded links in the web page are getting scrunched.
If a link in the web page is (for example) http://www.server.com/q/a=b when the page is saved to my PC that link becomes /q/a=b. So the http://www.server.com is getting stripped somehow. Any ideas on fixing this problem? I appreciate it.
My guess is that the links are relative. Although your browser may display the absolute URL of the link when you move the mouse over it, most probably only the relative part of the link is actually given in the HTML code (view the source of the page and check). By default, your browser will use the URL of the page in order to resolve relative links, but you can override this behaviour if you insert the following line into the HEAD section of the HTML file:
<BASE HREF="http://www.straightdope.com" />
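If you'd rather not paste that in by hand every day, here's a minimal sketch of how the script could splice the tag in itself, assuming sHTML holds the downloaded page as in the script above:

Dim oBaseRE
Set oBaseRE = New RegExp
oBaseRE.IgnoreCase = True          ' the tag may appear as <head>, <HEAD>, etc.
oBaseRE.Pattern = "<head([^>]*)>"  ' match the opening HEAD tag, keeping any attributes
' Global defaults to False, so only the first occurrence is replaced
sHTML = oBaseRE.Replace(sHTML, "<head$1><BASE HREF=""http://www.straightdope.com"" />")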
The links are being saved that way because that is likely how they are coded in the HTML. Nothing's being changed during the save; rather, that is precisely how the code for the page looks. It's conventional to use relative hyperlinks, which point to a location based on the current page, like /q/a=b, as opposed to absolute hyperlinks, which give the full address, such as http://www.server.com/q/a=b
It's like in DOS: if you're in c:\porn\britney and you want to get to c:\porn\jenna, you could type "cd c:\porn\jenna" or you could just type "cd ..\jenna".
That kind of relative linking stops making sense away from the source location. Typing "cd ..\jenna" is useless in c:\cars\porsche.
One reason for relative hyperlinks is that they help with site portability. If the contents of your web site stay the same, but you put them on a different server, the relative hyperlinks still make sense. If you had coded with absolute links, you'd have to go through and change the server name on every single link on every page.
So the relative hyperlinks in the webpage only fail because there's no page in the proper relative place on your machine. But the page you're downloading is right.
Hope that helps…
When I do a "save as" from the browser, the links look fine.
<a href="http://www.server.com/q/a=b"> link </a>
But if I save the web page with the VBScript my links get scrunched.
<a href="/q/a=b"> link </a>
Strange.
I opened the saved html files with my text editor.
Here’s the target site:
Well, if you View Source on that page, the hyperlinks are definitely relative.
For example, the "NASDAQ" links look like this, obviously a relative link:
<a href="/a0?d=t&o=l:0">NASDAQ</a>
Your browser may automatically put the base server on the relative links when you save the page to your computer, so that the links still work.
What you want is for your script to do the same thing.
If you use wget, it has a command-line option --convert-links, which rewrites the links as necessary (if you download multiple files, it will keep relative links between the downloaded files, but change links to files that weren't downloaded into absolute ones, adding the hostname etc.).
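For example, for a single page (the URL here is just a placeholder, and wget derives the output filename from it unless told otherwise):

wget --convert-links http://www.example.com/page.html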
I added the following replacement, and it seems to work well now.
sNewHTML = Replace(CStr(sHTML), "/q", "http://finance.yahoo.com/q")
Thanks to all who scratched their heads.
Some of the links on the top of the page still don’t work, but I only care about the links in the “tables”.
I’ve done a couple of small tweaks to the script:
CONST URL = "http://finance.yahoo.com/a3?o=l:0&d=t"
CONST DEST_FOLDER = "C:\"

' Splits a URL into protocol, host, path and query string
CONST RE_URL = "([a-z].*?)://([^/\?]*)((.*?)(\?|$)){0,1}(.*$){0,1}"
' Matches root-relative hrefs (href="/...") that don't already carry a protocol
CONST RE_HREF_ROOTREL = "(<a(\s*?)href=)(""|')(?!(([a-zA-Z].*?):))/(.*?)\3"
' Matches document-relative hrefs (no leading slash or protocol); note that a
' relative href containing a colon anywhere is mistaken for absolute and left alone
CONST RE_HREF_REQREL = "(<a(\s*?)href=)(""|')(?!(([a-zA-Z].*?):))(?!/)(.*?)\3"

Dim oXMLHTTP, sHTML, oRE, oMatch
Dim oFS, oTS, sFileName, sFilePath
Dim sURL, sQS, i
Dim sProtocol, sHost, sPath

' Pull the source URL apart so we know what to prepend to relative links
Set oRE = New RegExp
oRE.IgnoreCase = True
oRE.Global = True
oRE.Pattern = RE_URL
If Not oRE.Test(URL) Then
    WScript.Echo "Please make sure to include the protocol in the url (http://, https:// etc)"
    WScript.Quit
End If
Set oMatch = oRE.Execute(URL).Item(0)
sProtocol = oMatch.SubMatches(0)
sHost = oMatch.SubMatches(1)
sPath = oMatch.SubMatches(3)
sQS = oMatch.SubMatches(5)

' Reduce the path to its directory portion (up to and including the last slash)
' so document-relative links resolve against the right folder
If InStrRev(sPath, "/") > 0 Then
    sPath = Left(sPath, InStrRev(sPath, "/"))
Else
    sPath = "/"
End If

' Fetch the page synchronously
Set oXMLHTTP = CreateObject("MSXML2.XMLHTTP.3.0")
oXMLHTTP.open "GET", URL, False
oXMLHTTP.send
sHTML = oXMLHTTP.responseText
Set oXMLHTTP = Nothing

' Expand relative hrefs into absolute ones (sPath starts and ends with "/")
oRE.Pattern = RE_HREF_REQREL
sHTML = oRE.Replace(sHTML, "$1$3" & sProtocol & "://" & sHost & sPath & "$6$3")
oRE.Pattern = RE_HREF_ROOTREL
sHTML = oRE.Replace(sHTML, "$1$3" & sProtocol & "://" & sHost & "/" & "$6$3")

' Build a dated filename like WP2003-11-12.htm (month and day zero-padded)
sFileName = "WP" & CStr(Year(Date)) & "-" & Right("0" & CStr(Month(Date)), 2) & "-" & Right("0" & CStr(Day(Date)), 2) & ".htm"
sFilePath = DEST_FOLDER

' Write the result out as a Unicode text file, overwriting any existing copy
Set oFS = CreateObject("Scripting.FileSystemObject")
Set oTS = oFS.CreateTextFile(sFilePath & sFileName, True, True)
oTS.Write CStr(sHTML)
oTS.Close
Set oTS = Nothing
Set oFS = Nothing
It should correctly expand relative links into their full path now. I ran it against the page you specified and it seems to bring back a version with all the links working.
I’ve also moved the URL and destination folders into constants at the top of the file, but it could just as easily accept them as command line parameters if that’s necessary.
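A minimal sketch of that variant, in case anyone wants it: drop the two CONSTs at the top and read the values from the arguments instead (the names stay the same so the rest of the script works unchanged; "savepage.vbs" is whatever you named the file):

Dim URL, DEST_FOLDER
If WScript.Arguments.Count >= 2 Then
    URL = WScript.Arguments(0)
    DEST_FOLDER = WScript.Arguments(1)
Else
    WScript.Echo "Usage: cscript savepage.vbs <url> <destination folder>"
    WScript.Quit
End If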
Hope this helps.
Thanks Armilla, you rock.