Can I use Excel (with Access?) to pull data online?

Let me explain what I need.

I’ve been asked to pull down some data from a website. It’s easy to access, and each individual page is a simple .html document. The data I need is there (it’s text strings), and it would be easy to pull if I had something that searches for a specific word (Title, Date, etc.). Is there a way to script something to automatically generate the Excel document I need? I’m lazy, and in any case I’m no fan of manual labor. More importantly, I don’t think there’s a chance I get this done manually in the time allotted.

Could you use the “text to columns” function of Excel to put it into a usable format? This function takes text that is in a single column (such as if you paste it in from a website) and separates it into columns each time it sees a comma, a colon, even a space - whatever you tell it to use.

In Excel 2007 this function is under the “data” menu. I can’t remember where it is in older versions, but if you can’t find it, then Help could certainly guide you there.
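If you end up scripting it anyway, the same operation is available from VBA. A minimal sketch, assuming the pasted text sits in A1:A100 and splits on commas:

' Split comma-delimited text in column A into separate columns
Range("A1:A100").TextToColumns Destination:=Range("A1"), _
    DataType:=xlDelimited, Comma:=True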

Is the data on the web page already tabular data? If so, you should be able to easily convert the data into an Excel spreadsheet.

Sadly, the data is not collected into one webpage. It’s on hundreds of separate ones, one for each entry. However, they are labelled in an extremely easy-to-understand manner, so if there’s a way for the script to jump down the list (I have a list of the pages I need and/or could generate it in any format)…

Once it gets in there, I only need three pieces of data from each page. However, I would need something to sift through for a specific text trigger, as the data is irregular. That is, there is a variable amount of text, though it is arranged into neat columns.

Try this:

http://www.lib.utk.edu/spcoll/manuscripts/

http://www.lib.utk.edu/spcoll/manuscripts/1764.html

This is what I’m using. It has the specific manuscript number (the data I have), the TITLE, DATE, and the number of boxes. The last I may have to fiddle with, but if there are a few I must go back and check, no problem.

Ugh, that sucks.

I’ll poke around the VBA forums for you. I might have a similar project (though with a lot fewer webpages) coming up, so I’m going to have to figure out how to do this too. I’m not seeing an easy way to do it without VBA programming, though. It might just be faster, although in no way easier, to just brute-force the whole project using Duckster’s linked instructions.

I would use the XMLHTTP object to make HTTP requests in VBA.

Here’s a little tutorial:
http://scriptorium.serve-it.nl/view.php?sid=40

Roughly: You have the manuscript number, so you can generate the URL that you need. From there, it looks like you could just use an InStr() to find, say “Title”, another to find the next </td>, and a Mid() to pull out the data that you need.
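Something like this, maybe (untested; the manuscript number, the “Title” label, and the output are just examples based on the page linked above):

Sub GrabManuscript()
    Dim oHttp As Object, sHTML As String, sURL As String
    Dim iStart As Long, iEnd As Long, sTitle As String

    ' Build the URL from the manuscript number (1764 as an example)
    sURL = "http://www.lib.utk.edu/spcoll/manuscripts/1764.html"

    ' Late-bound XMLHTTP, so no library references need to be set
    Set oHttp = CreateObject("MSXML2.XMLHTTP")
    oHttp.Open "GET", sURL, False
    oHttp.send
    sHTML = oHttp.responseText

    ' Find the label, then grab everything up to the next </td>.
    ' The chunk may still contain stray markup to trim off.
    iStart = InStr(1, sHTML, "Title", vbTextCompare)
    iEnd = InStr(iStart + 1, sHTML, "</td>", vbTextCompare)
    If iStart > 0 And iEnd > 0 Then
        sTitle = Mid(sHTML, iStart + Len("Title"), iEnd - iStart - Len("Title"))
        Debug.Print sTitle
    End If
End Sub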

I’m not sure if he’s looking for the table of collections at the bottom of the link he gave, but I’d go with “Container List” for the InStr() command.

I noticed that the manuscript numbers are not consecutive, so unfortunately he can’t use a For x To y loop to generate them. Guess those will have to be seeded by hand.

Hmm… I don’t see a way to grab each page. Is there a way in Excel 2007+ to use the import data command on the manuscripts index, then pull out the data I need? Sorry, I know how to run queries on raw data, but I don’t have a background for this beyond some C++ a while back.

Edit: Sorry, do I just create a webpage with this scriptorium code (properly modified to fit) and then run it?

Rereading the OP, “Title” and “Dates” look to be in the table at the top, then the number of boxes would be a row count of the “Container List” at the bottom.

The first part could be accomplished like I said in my last post. For the second part, it looks like you’d have to grab the HTML between “Container List” and ‘<div id=“footer-orange-outer”>’ and use an InStr(), Mid() loop to count the tables (each row appears to be a whole table).
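A sketch of that loop (assumes sHTML already holds the page source, as in the snippet above, and that each row really is its own table):

Dim iFrom As Long, iTo As Long, iPos As Long
Dim sBlock As String, nBoxes As Long

iFrom = InStr(1, sHTML, "Container List", vbTextCompare)
iTo = InStr(iFrom + 1, sHTML, "<div id=""footer-orange-outer""", vbTextCompare)
If iFrom > 0 And iTo > 0 Then
    sBlock = Mid(sHTML, iFrom, iTo - iFrom)
    ' Count the <table> tags between the two markers
    iPos = InStr(1, sBlock, "<table", vbTextCompare)
    Do While iPos > 0
        nBoxes = nBoxes + 1
        iPos = InStr(iPos + 1, sBlock, "<table", vbTextCompare)
    Loop
End If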

That’s VBA code; it needs to be entered in Excel’s VBA editor.

We can probably drop that. Turns out the box count is possibly wrong in two separate places, so the only way to be sure (assuming the THIRD isn’t also wrong) is to count the listed boxes plus any extra materials that are boxed but not labelled as boxes.

Yeah… that part we’ll have to do by hand. :wink:

But at least I can get the first two sections. And do this for everyone, so it should be much easier all around.

Right. Only potential catch is that not all of the webpages have a “Container List” section. The one that smiling bandit linked to does, but I looked at a couple other ones which don’t. An error exception handler should take care of that, though.
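Or just test for the section up front instead of trapping the error; something like this (same sHTML as in the earlier snippets):

' Flag pages with no "Container List" section instead of erroring out
If InStr(1, sHTML, "Container List", vbTextCompare) = 0 Then
    nBoxes = -1   ' no Container List; check this page by hand
Else
    ' ... run the counting loop from above ...
End If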

OK, trying to work this into VBA. So far, so good… except.

Uh, how do I set it to loop through the list of the hyperlinks I need? I’ve got it extracting the text, which is fine (it put out a huge list of every MS number when I set it to use the base directory hyperlink). OK, good. Now that I have all those links, how can I get it to run through each and every hyperlink and grab the text? This might take a bit to run, but I’m ultimately fine with grabbing the text: I can extract what I need with simple formulas at that point.

I am using the VBA script from Mbetter’s link. The key seems to be in the line…

sURL = "http://www.lib.utk.edu/spcoll/manuscripts/"

I can’t just use a range value, it seems. Can I set it to look at a specific cell and then set that cell as a variable which I loop over? Sorry, again, I haven’t programmed in a long time, and I think most programming syntax is written by autistic psychotics anyhow.

Yup. You can put your list of links into cells, and then use the Range() function and a For-Next loop to loop through them.

It will be something like:

For Each z In Range("A2:A200")
    z.Activate
    sURL = ActiveCell.Value

    ' ... blah blah blah, the rest of your script ...

Next z

That’s off the top of my head, so don’t expect it to work verbatim. There are other VBA methods to do the same thing, too. But basically you’re going to have to stick your code in a loop to step through the list of links.

Thanks. Haven’t programmed in years, and I think there’s a loop example in the code itself. Should be able to deal with this when I get back there Monday.

I was bored last night:

http://angband.org/~erasmus/tmp/catalog.xls

I did it in Perl.

Rather than deal with trying to pull down the pages from within the script, I just used wget to grab all the HTML files in the directory and then parsed them as local files. One less thing to deal with on the coding front.
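Roughly this, for anyone following along (exact flags may have varied):

wget -r -np -A html http://www.lib.utk.edu/spcoll/manuscripts/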

The kzw page took forever to parse for box information (there were over 3000 tables in the HTML), so I skipped parsing that bit. Generally, if you see a -1 or a 0 in the box count field, consider it suspicious.

The gap from 0096-0114 is because those pages seem to be in a completely different format as far as the HTML goes. Instead of using tables, it uses dd elements. Rather than worrying about a separate parser for a handful of cases, it’d be quicker to do those by hand.

Anyway, the files weren’t always consistent, so the results may be less than trustworthy. I often took the quick-and-dirty approach to parsing, but it seems to have paid off. If this is useful, I can go into more detail about what the limitations on parsing were and post the code.

Dude, you rock!

I’ve got to think about a way to reward you people.