Can I use Excel (with Access?) to pull data online?

Let me explain what I need.

I’ve been asked to pull down some data from a website. It’s easy to access, and each individual page is a simple .html document. The data I need is there (it’s text strings), and it would be easy to pull if I had something that searches for a specific word (Title, Date, etc.). Is there a way to script something to automatically generate the Excel document I need? I’m lazy, and in any case I’m no fan of manual labor. More importantly, I don’t think there’s a chance I get this done manually in the time allotted.

Could you use the “text to columns” function of Excel to put it into a usable format? This function takes text that is in a single column (such as if you paste it in from a website) and separates it into columns each time it sees a comma, a colon, even a space - whatever you tell it to use.

In Excel 2007 this function is under the “data” menu. I can’t remember where it is in older versions, but if you can’t find it, then Help could certainly guide you there.
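If you end up scripting it anyway, the same operation is available from VBA. A minimal sketch, assuming the pasted text sits in A1:A100 and splits on commas:

' Split comma-delimited text in column A into separate columns
Range("A1:A100").TextToColumns Destination:=Range("A1"), _
    DataType:=xlDelimited, Comma:=True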

Is the data on the web page already tabular data? If so, you should be able to easily convert the data into an Excel spreadsheet.

Sadly, the data is not collected into one webpage. It’s on hundreds of separate ones, one for each entry. However, they are labelled in an extremely easy-to-understand manner, so if there’s a way for the script to jump down the list (I have a list of the pages I need and/or could generate it in any format)…

Once it gets in there, I only need three pieces of data from each page. However, I would need something to sift through for a specific text trigger, as the data is irregular. That is, there is a variable amount of text, though it is arranged into neat columns.

Try this:

http://www.lib.utk.edu/spcoll/manuscripts/

http://www.lib.utk.edu/spcoll/manuscripts/1764.html

This is what I’m using. It has the specific manuscript number (the data I have), the TITLE, DATE, and the number of boxes. The last I may have to fiddle with, but if there are a few I must go back and check, no problem.

Ugh, that sucks.

I’ll poke around the VBA forums for you. I might have a similar project (though with a lot fewer webpages) coming up, so I’m going to have to figure out how to do this too. I’m not seeing an easy way to do it without VBA programming, though. It might just be faster, although in no way easier, to just brute-force the whole project using Duckster’s linked instructions.

I would use the XMLHTTP object to make HTTP requests in VBA.

Here’s a little tutorial:
http://scriptorium.serve-it.nl/view.php?sid=40

Roughly: You have the manuscript number, so you can generate the URL that you need. From there, it looks like you could just use an InStr() to find, say “Title”, another to find the next </td>, and a Mid() to pull out the data that you need.
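Something like this, maybe (untested; the manuscript number, the “Title” label, and the output are just examples based on the page linked above):

Sub GrabManuscript()
    Dim oHttp As Object, sHTML As String, sURL As String
    Dim iStart As Long, iEnd As Long, sTitle As String

    ' Build the URL from the manuscript number (1764 as an example)
    sURL = "http://www.lib.utk.edu/spcoll/manuscripts/1764.html"

    ' Late-bound XMLHTTP, so no library references need to be set
    Set oHttp = CreateObject("MSXML2.XMLHTTP")
    oHttp.Open "GET", sURL, False
    oHttp.send
    sHTML = oHttp.responseText

    ' Find the label, then grab everything up to the next </td>.
    ' The chunk may still contain stray markup to trim off.
    iStart = InStr(1, sHTML, "Title", vbTextCompare)
    iEnd = InStr(iStart + 1, sHTML, "</td>", vbTextCompare)
    If iStart > 0 And iEnd > 0 Then
        sTitle = Mid(sHTML, iStart + Len("Title"), iEnd - iStart - Len("Title"))
        Debug.Print sTitle
    End If
End Sub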

I’m not sure if he’s looking for the table of collections at the bottom of the link he gave, but I’d go with “Container List” for the InStr() command.

I noticed that the manuscript numbers are not consecutive, so unfortunately he can’t use a For x To y loop to generate them. Guess those will have to be seeded by hand.

Hmm… I don’t see a way to grab each page. Is there a way in Excel 2007+ to use the import data command on the manuscripts index, then pull out the data I need? Sorry, I know how to run queries on raw data, but I don’t have a background for this beyond some C++ a while back.

Edit: Sorry, do I just create a webpage with this scriptorium code (properly modified to fit) and then run it?

Rereading the OP, “Title” and “Dates” look to be in the table at the top, then the number of boxes would be a row count of the “Container List” at the bottom.

The first part could be accomplished like I said in my last post. For the second part, it looks like you’d have to grab the HTML between “Container List” and ‘<div id=“footer-orange-outer”>’ and use an InStr(), Mid() loop to count the tables (each row appears to be a whole table).
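A sketch of that loop (assumes sHTML already holds the page source, as in the snippet above, and that each row really is its own table):

Dim iFrom As Long, iTo As Long, iPos As Long
Dim sBlock As String, nBoxes As Long

iFrom = InStr(1, sHTML, "Container List", vbTextCompare)
iTo = InStr(iFrom + 1, sHTML, "<div id=""footer-orange-outer""", vbTextCompare)
If iFrom > 0 And iTo > 0 Then
    sBlock = Mid(sHTML, iFrom, iTo - iFrom)
    ' Count the <table> tags between the two markers
    iPos = InStr(1, sBlock, "<table", vbTextCompare)
    Do While iPos > 0
        nBoxes = nBoxes + 1
        iPos = InStr(iPos + 1, sBlock, "<table", vbTextCompare)
    Loop
End If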

That’s VBA code; it needs to be entered in Excel’s VBA editor.

We can probably drop that. Turns out the box count is possibly wrong in two separate places, so the only way to be sure (assuming the THIRD isn’t also wrong) is to count the listed boxes plus any extra materials that are boxed but not labelled as boxes.

Yeah… that part we’ll have to do by hand. :wink:

But at least I can get the first two sections. And do this for everyone, so it should be much easier all around.

Right. Only potential catch is that not all of the webpages have a “Container List” section. The one that smiling bandit linked to does, but I looked at a couple other ones which don’t. An error exception handler should take care of that, though.
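Or just test for the section up front instead of trapping the error; something like this (same sHTML as in the earlier snippets):

' Flag pages with no "Container List" section instead of erroring out
If InStr(1, sHTML, "Container List", vbTextCompare) = 0 Then
    nBoxes = -1   ' no Container List; check this page by hand
Else
    ' ... run the counting loop from above ...
End If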

OK, trying to work this into VBA. So far, so good… except.

Uh, how do I set it to loop through the list of the hyperlinks I need? I’ve got it extracting the text, which is fine (it put out a huge list of every MS number when I set it to use the base directory hyperlink). OK, good. Now that I have all those links, how can I get it to run through each and every hyperlink and grab the text? This might take a bit to run, but I’m ultimately fine with grabbing the text: I can extract what I need with simple formulas at that point.

I am using the VBA script from Mbetter’s link. The key seems to be in the line…

sURL = "http://www.lib.utk.edu/spcoll/manuscripts/"

I can’t just use a range value, it seems. Can I set it to look at a specific cell and then set that cell as a variable which I loop over? Sorry, again, I haven’t programmed in a long time, and I think most programming syntax is written by autistic psychotics anyhow.

Yup. You can put your list of links into cells, and then use the Range() function and a For-Next loop to loop through them.

It will be something like:

For Each z In Range("A2:A200")
    z.Activate
    sURL = ActiveCell.Value

    ' ... blah blah blah, the rest of your script ...

Next z

That’s off the top of my head, so don’t expect it to work verbatim. There are other VBA methods to do the same thing, too. But basically you’re going to have to stick your code in a loop to step through the list of links.

Thanks. Haven’t programmed in years, and I think there’s a loop example in the code itself. Should be able to deal with this when I get back there Monday.

I was bored last night:

http://angband.org/~erasmus/tmp/catalog.xls

I did it in Perl.

Rather than deal with trying to pull down the pages from within the script, I just used wget to grab all the HTML files in the directory and then parsed them as local files. One less thing to deal with on the coding front.
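Roughly this, for anyone following along (exact flags may have varied):

wget -r -np -A html http://www.lib.utk.edu/spcoll/manuscripts/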

The kzw page took forever to parse for box information (there were over 3000 tables in the HTML), so I skipped parsing that bit. Generally, if you see a -1 or a 0 in the box count field, consider it suspicious.

The gap from 0096-0114 is because those pages seem to be in a completely different format as far as the HTML goes. Instead of using tables, it uses dd elements. Rather than worrying about a separate parser for a handful of cases, it’d be quicker to do those by hand.

Anyway, the files weren’t always consistent, so the results may be less than trustworthy. I often took the quick-and-dirty approach to parsing, but it seems to have paid off. If this is useful, I can go into more detail about what the limitations on parsing were and post the code.

Dude, you rock!

I’ve got to think about a way to reward you people.