Trip’s Access Q #417: Converting HTML to XML (or importing HTML).

Yes Ladies & Gentlemen, it’s another fantastic episode of “Tripler’s got another MS Access/database Question!” :smiley: [sub]oooh, aaaaah![/sub]

There’s a list of properties (real estate and vehicles) out on the Web up for public auction, posted on an HTML-based webpage. I would like to import that information into MS Access for easier working/sorting, as I understand Access will import XML data. I can download the HTML page, basically by “Saving As” onto my desktop. From there, I can go in and delete the header/title page lines/etc., and then save with an “XML” file extension. Then I try to import that into Access, and that’s when I run into problems. . .

First part of the question: Am I doing this right? Does anyone have experience with this, and am I doing it the easiest/most straightforward way? Second part: Is XML the way to go? And if so, is there a more automated way to do it (i.e. software)?

Why Access? Because I know it well enough to generate forms and lists, and I can easily add data to it, like my personal comments or a Google Maps location. I could theoretically call the office and ask them for a downloaded copy burned onto CD-ROM, but that costs $20 a pop, and frankly, they keep updating the page so often that I’d end up spending $20 on a new CD every time they do—dang people keep paying their taxes. :mad:

Any ideas?

Tripler
For you, I offer a .05% consultant’s fee on any hot screamin’ deals I pick up on. :smiley:

I don’t think you can just make HTML into XML. XML looks like this:



<property>
  <address>123 House Street</address>
  <acres>.5</acres>
  <exterior>Vinyl</exterior>
</property>


HTML looks like this:



<table cellpadding="2" cellspacing="0" border="0">
  <tr>
    <td>
      <b>123 House Street</b><br/>
      <b>Acres:</b> .5<br/>
      <b>Exterior:</b> Vinyl<br/>
    </td>
  </tr>
</table>


Actually the HTML is probably way more complicated/different than that. How is Access supposed to know what’s an address and what’s acreage and what’s an exterior and what’s everything else?

XML is sort of a database written out in HTML-style markup. HTML is just…mess :slight_smile:

Does the site have an RSS feed? A feed would come over as proper XML, which you could just import into Access.

I think you’re down the wrong track. Don’t confuse the XML that defines the structure of the data you’re importing with the data itself. If you use XML to import your data, what that means is that you have an XML document which contains, inside it as data, the HTML you want to import. This means you’ll need to “escape” the text (in this case HTML) that you want to import so that it can be put in XML, keeping it distinct from the structure of the containing XML. This escaping is a very mechanical process and consists of replacing XML special characters like ‘<’ with their “entity” forms like ‘&lt;’. You would do this even if your original data were itself XML.
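To make that concrete, here’s a quick sketch of the escaping step using Python’s standard library (Python just because it’s handy; any language’s XML tools do the same):


from xml.sax.saxutils import escape

# Raw HTML that we want to carry as *data* inside an XML document.
raw_html = '<td><b>123 House Street</b><br/></td>'

# escape() replaces the XML special characters '&', '<', and '>'
# with their entity forms '&amp;', '&lt;', and '&gt;'.
escaped = escape(raw_html)
print(escaped)
# prints: &lt;td&gt;&lt;b&gt;123 House Street&lt;/b&gt;&lt;br/&gt;&lt;/td&gt;

# The escaped text can now sit inside an XML element as character data:
xml_doc = "<page>" + escaped + "</page>"
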

On re-read, it seems you actually want the database to interpret the contents of the HTML files? If so, it’s even worse than I thought. Getting meaningful data (such as addresses) out of HTML is a bottomless hole known as “scraping,” and it is very difficult to do right. If you can guarantee that the format of the HTML is predictable and simple and doesn’t rely on any sort of client-side scripting (e.g. because you own the process that generates it), then it’s doable, though still a pain in the butt. But if you have that much control over the HTML, you probably have access to the raw data that went into creating the pages, in which case you wouldn’t have this problem.

http://www.dapper.net/ is a site that lets you define scrapers for web pages that give you back data as XML, but last time I tried it, it failed miserably on some pretty straightforward stuff. Like I said, it’s a difficult problem.

Uffdah. :smack:

What I wanted to do was have Access extract the values from the text of the webpage and import them into the main table. I would interpret/modify the data later. . .

See, the problem is that the list is delineated by horizontal lines–an easy layout for me, as a human, to distinguish, but maybe not for Access. Anyone can save it as a text file, an Excel file, or a Word document, but then you have to massage each record into a format that pushes easily into Access. Man, it’d be much easier just to do this data pull right from the webpage.

I dunno, anyone got any ideas? I’m open to suggestions. I’ve utterly convinced myself it’s possible to do a data pull from a web page.

Tripler
I’ll offer a free coffee cup, in addition to a bonus .025% on the consulting fee. :smiley:

Do you want just address and price? Do you need the “Tax map number” (TMN)?

I just copied the first handful of records from the source code, starting at the first HR, and pasted them into OpenOffice Calc (the Excel equivalent). That got me the TMN, names, and prices all in a single cell, the street address in another, and the delinquency date in a third.

You could import that into Access (though I’m not sure how to NOT import the HR cells, unless you can come up with a script for that?)

Basically the most problematic cell is the first one: TMN, Name, and Price, which I’ll call Problem. I’d put that all into one table, one column. Since the TMN is always the same length, you can then do an update query into another table that extracts Left(“Problem”, 13), storing those first 13 characters in the new table under a TMN column.

Then do Right(“Problem”, 8) to get the last 8 characters from the Problem cell, and that can go into the new table under Price. Likewise, you can pull the name out with Mid(“Problem”, Start, Length), where the start is just past the TMN and the length stops short of the price: start at 14 and use something like Len(“Problem”) - 22 for the length, to be sure you don’t pick up the TMN or the Price.

Then you can strip out all the blank characters at the beginning and end of the strings using LTrim(Name) and RTrim(Name)… note that you probably want to do this to Price, too, since you pulled out 8 characters but I doubt any of the prices are that much!

All of this is done using update queries. Just add the street address cells from Excel to the same table.
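If you want to sanity-check that slicing before building the queries, here’s the same Left/Right/Mid logic as a quick Python sketch (the 13- and 8-character widths are the guesses from above, not verified against the county’s actual page):


# The same Left/Right/Mid logic as the Access update queries above.
# The field widths (13-char TMN, 8-char price) are guesses, unverified.
problem = "123-45-67-890     SMITH JOHN Q              $1234.56"

tmn   = problem[:13]            # Left("Problem", 13)
price = problem[-8:].strip()    # Trim(Right("Problem", 8))
name  = problem[13:-9].strip()  # Trim(Mid("Problem", 14, Len("Problem") - 22))

print(tmn, "|", name, "|", price)
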

I think, once you have the queries set up, it might still take a bit of work for you to get the data into Access first, but after that you can just run the queries and get new tables with the latest data, which you can then use forms on to your heart’s content.

Note: I have never tried this, and I don’t even have Access on my computer at home. It might turn out to be a much bigger PITA than you care to do, and it might not even work; but it’s a start!

Cue next poster coming up with a 3-step method…! :slight_smile:

Good luck!

ETA: you might be able to copy-paste from the source code directly to an Access table, too… I haven’t tried that!

My sum total knowledge of XML is the following joke:

This is possible, but not without writing your own program. I recommend Python. It should only take you about 10 hours above and beyond the time needed to learn a language.
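The core of it is only a dozen lines or so. As a rough sketch (assuming the records really are separated by <hr> tags; the file name is whatever you saved the page as):


# Rough sketch: split the saved page on its <HR> dividers and treat each
# chunk as one record. The file name and the <hr>-separated layout are
# assumptions about the county's page, not verified.
import re

with open("taxname.html") as f:
    html = f.read()

for chunk in re.split(r"(?i)<hr[^>]*>", html):
    text = re.sub(r"<[^>]+>", " ", chunk)   # crudely strip remaining tags
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        print(" | ".join(lines))            # one record per output line
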

I agree with ntucker that you probably want to scrape the HTML, and that this probably requires some programming to do right. At any rate, every time the HTML creator decides to change the page format, there’s a chance that you’ll have to modify the scraping steps.

Perl is a good choice for simple text-file processing. Here is an example Perl script that reads the HTML file (“taxname.html”, saved locally) and creates a CSV file “taxname.csv”, one record per row. You may be able to modify this to suit your needs.


#!/usr/bin/perl

open( HTML, "taxname.html" ) or die "can't open taxname.html: $!";
open( CSV, ">taxname.csv" ) or die "can't create taxname.csv: $!";
select CSV;

# skip headers
while ( <HTML> ) {
  last if /OWNERS, PROPERTY LOCATION AND DELINQUENT YEARS/;
}

# print header line, CSV format
print '"TAX MAP NO","AMOUNT","YEARS","OWNERS","ADDRESS..."', "
";

# read records
while ( <HTML> ) {
  if ( /^\<HR /i ) {
    if ( @lines ) { &parse(@lines); }
    @lines = ();
    next;
  }
  next if /^<A NAME=/i;
  push @lines, $_;
}

close HTML;

# parse and print one record
sub parse {
  my @lines = @_;
  my $line = shift @lines;
  my ($taxid,$own,$amt) = $line =~ /^([0-9-]+)\s+(\S.*\S)\s+\$([0-9.]*)\s*$/;
  my @own = ($own);
  while ( @lines ) {
    $line = shift @lines;
    $line =~ s/^\s*//;
    $line =~ s/\s*$//;
    push @own, $line;
  }
  grep { s/\"/\'\'/g; s/^/\"/; s/$/\"/; } @own;
  my $years = pop @own;
  $own = join ",", @own;
  print "$taxid,$amt,$years,$own
";
}

One question is whether you have to do this one time, on that particular page, or if you’re going to need something flexible enough to do this as a regular process, e.g. by scrounging through that page and/or a bunch like it on a periodic schedule.

If it’s a one-time thing, Omphaloskeptic’s approach is exactly what I’d do (although I don’t know perl, so I’d probably do it in ruby or maybe javascript that runs right in the browser and takes advantage of the fact that it’s already in a DOM). Quick n dirty and gets you your data, but probably not workable long-term, since the owner of that page is liable to decide the data looks better in a table layout instead of between <hr>'s. That’s the sort of tail-chasing that scraping leads to, unfortunately.

ETA: careful how you present the results if it’s a one-off. They’ll be so impressed with your ability to code up a way to pull data from that web page that they’ll expect you to keep doing it, and then your solution which was good enough for a one-off will have gotten you into a long-term problem. :wink:

For fun I decided to write some javascript to do this in-place in the page. If you have firefox, you can install the firebug extension, then open up the firebug console, paste this code in, and say “run”; it will work for a few seconds and then replace the page body with simple pipe-delimited records, one per line, that look like this:



309-00-00-066|A AND M CAROLINA HOLDINGS CORP / RE: MARSH OAKS MEDICAL PARK LLC / 1007 PHYSICIANS DR / 2007 / |$3920.45
470-08-00-136|A'MARIS LLC / OAKWOOD AVE / 2007 / |$1890.87
473-13-00-024|A'MARIS LLC / 1816 SHELTON ST / 2007 / |$825.01
473-13-00-025|A'MARIS LLC / 1824 SHELTON ST / 2007 / |$1439.13
473-13-00-148|A'MARIS LLC / 5519 DOBSON ST / 2007 / |$1315.67
320-00-00-023|ABBAPOOLA HOLDINGS LLC / RIVER RD / 2007 / |$309.41


The “convertlines” function adds the slashes to indicate where the address spanned lines, for lack of a better suggestion. Mold it to your will. :slight_smile:



function try_hr(elem)
{
    function try_elem(elem)
    {
        function strip(str)
        {
            return str.replace(/^ */, "").replace(/ *$/, "");
        }
        function convertlines(lines)
        {
            var f1, f2, f3;
            var i;
            for(i=0; i<lines.length; i++)
            {
                var pattern = new RegExp("(.{18})(.{63})(.*)");
                var result = pattern.exec(lines[i]);
                if(result)
                {
                    f1 = strip(result[1]);
                    f2 = strip(result[2]);
                    f3 = strip(result[3]);
                    i++;
                    break;
                }
            }
            for(; i<lines.length; i++)
            {
                if(strip(lines[i]).length > 0)
                    f2 += " / " + strip(lines[i]);
            }

            return f1 + "|" + f2 + "|" + f3;
        }
        if(elem.nodeType == 3)  // nodeType 3 is a text node
        {
            var lines = elem.nodeValue.split("\n");
            if(lines.length > 2)
            {
                 return convertlines(lines);
            }
        }
    }

    var sib = elem.nextSibling;
    while(sib && sib.tagName != "HR")
    {
        var result = try_elem(sib);
        if(result)
        {
            return result;
        }

        sib = sib.nextSibling;
    }

    return null;
}

var hrs = document.getElementsByTagName("HR");
var allresults = "";
for(var i=0; i<hrs.length; i++)
{
    var result = try_hr(hrs[i]);
    if(result) allresults += result + "\n";
}

document.body.innerHTML = "<pre>" + allresults + "</pre>";


ntucker, and Omphaloskeptic, I do thank you guys for your help!

I’m not very familiar with Perl (although I did save the script); I just don’t know how to run it. ntucker, I did download firebug, but where exactly am I to paste your code? I like how your code puts out the data, and after a few tries I may use yours, if only because I can swap in a delimiter like “&” or “+”, which ought to make it simpler for me (at least mentally).

This is going to be a multiple-run data pull. That county website is updated every few days or so, and once I do one pull, I can migrate whatever comments I’ve added to a new table in Access–I think I can figure it out.

I’m just lost on exactly how to use the awesome code you guys provided. Can you dumb it down a shade for me?

ETA: Ooooh, wait! I figured it out. Gotta paste the javascript into the <head> statements. Thanks!

Tripler
You guys: 0.06% consultant fee. :smiley:

It’s actually simpler than that…you don’t need to edit the HTML doc at all. Just load up the original page in your browser, then click on the little firebug checkmark in the bottom right of the window. A panel should pop up from the bottom, and there should be a “console” tab. Click on the console tab, put your cursor on the prompt line at the bottom, and paste. That should cause the prompt line to switch to being an input area over on the right (weird UI, I know, but it makes sense), at which point you click “run” to execute the code in the input area.

ETA: if you’re going to do any web development, Firebug is The Bomb. The console can be used for debugging and doing little explorations, but I actually find myself writing a decent amount of code in it, then once something works, saving it to a JS file that gets loaded by the page.

Here’s a way you can use that code that doesn’t require firebug, and theoretically works in IE too: tripler’s data munging button. I say “theoretically” because I just tried it and it didn’t work, but I’m too lazy to debug it now. Basically, the trick just smashes that code all into one line and sticks it in a “javascript:” link you can stick on your bookmarks toolbar.

Missed the edit window, but when I say I tried it and it didn’t work, I meant in IE. It works in firefox with or without firebug.

If you want to get the Perl code working, you need a Perl interpreter installed. The easiest is probably ActiveState Perl (it is a free download; click on “Get ActivePerl” in the right sidebar, then “Download” the standard distribution). Rename the Perl file with a .pl extension so that it is automatically associated with ActivePerl, and put it in the same directory with the HTML file “taxname.html”; then just double-clicking on the Perl file should create the .CSV.

This is less clean than ntucker’s JavaScript, but it allows more choices for output format and makes automation easier, if that’s an issue. (The script could be extended to automatically download the HTML file first, for example; you could also set it up to run periodically. See the sketch below.)
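For what it’s worth, the download step itself is only a couple of lines in any scripting language. A Python sketch (the URL is a placeholder; substitute the county’s actual page):


# Sketch of automating the download before parsing. The URL below is a
# placeholder, not the county's real address.
import urllib.request

URL = "http://www.example-county.gov/delinquent.html"  # hypothetical
urllib.request.urlretrieve(URL, "taxname.html")
# ...then run the parsing script against the freshly saved taxname.html.
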