How Do I Write This Computer Program? (Simplistic Question)

Yes, VB. It’s fully integrated into the family of dot net languages. The platform has a rich set of web services tools, including complete and startlingly simple XML classes. Today’s VB is fully object oriented, complete with inheritance, polymorphism, interfaces, abstract classes, and so on.

First, let me say that what you are trying to do is not trivial. I would use Perl if I were writing it, but then again, I’m pretty familiar with it. There are plenty of other languages out there that would work, some of which might be easier for you to use. Assuming you go with Perl, there are two Perl packages you should use (bundled with some distributions, otherwise available from CPAN): the LWP package and the HTML package. The tricky part is dealing with the possibility that the other website has changed something. Even if things look exactly the same, the text under the hood can change, with unpredictable results. You’ll need to get the LWP routines to fetch the right pages with the right parameters, and get the HTML package to parse the results. After that, you will need to know how to pull the right data out of the parse structure that is generated, which itself might require more LWP calls to get more web pages.

If you really want to try to do this, and want to tackle Perl, the canonical Perl book is the “Camel” book, available from O’Reilly Press. I’m not sure how easy it would be for someone in your situation to tackle. It seems there are also books out on LWP, but you probably want to make sure you understand Perl before you get into LWP.

Bill H., micco: Precisely what I was saying. Different sites offer different things, and different tactics may be employed on a site-by-site basis.

Small Clanger: Your first post was too short, and while I share your distaste for screen-scraping, better APIs are far from universal.

Precisely why I (and others) have such a distaste for screen-scraping in general. Your code’s functionality is entrusted to people who don’t know you, don’t know what you’ve assumed, and don’t care if they break your code.
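One small mitigation, sketched in shell: make the scraper complain loudly when its pattern stops matching, instead of silently printing nothing. This is only a rough sketch. The function name require_matches, the label argument, and the sample markup are all made up for illustration.

```shell
#!/bin/sh
# require_matches LABEL: pass stdin through unchanged, but fail with a
# message on stderr if the input is empty (i.e. nothing matched upstream).
# Hypothetical helper, not part of any standard tool.
require_matches() {
  out=$(cat)
  if [ -z "$out" ]; then
    echo "$1: no matches; has the page layout changed?" >&2
    return 1
  fi
  printf '%s\n' "$out"
}

# Old layout (assumed): the pattern still matches, so the line is printed.
printf '<a href="showthread.php?t=1">A title</a>\n' |
  grep 'showthread\.php' | require_matches titles

# New layout (assumed): nothing matches, so the guard reports a problem.
printf '<div class="new-layout">A title</div>\n' |
  grep 'showthread\.php' | require_matches titles ||
  echo "scraper needs attention"
```

It does nothing about a page that changes in a way that still matches, so it only catches the obvious breakage.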

Don’t try to learn Perl from the Camel (aka “Programming Perl”). Buy the Llama (that is, “Learning Perl”) first, and decide if you need anything else after reading that. It’s much better organized, and it’s surprisingly complete (in that it describes a very usable subset of Perl, covering nearly all of the Perl you see in most Perl programs).

I want to add that Perl might very well not be the language for you. Let me put it this way:


/(\d+)\s+(\S.*)$/ && $_[$1]{$2} =~ s/\S//;

is a perfectly reasonable Perl statement.

Punoqllads wrote

In general, yes, but it depends on the exact goals. For example, here’s a quick shell script that will print out all the current topics in GQ:



#!/bin/sh
wget -o /dev/null "http://boards.straightdope.com/sdmb/forumdisplay.php?f=3" |
  grep showthread.php gq |
  grep -v member.php |
  grep -v multipage\.gif |
  grep -v amp\;page= |
  sed -e 's/^.*t=[0-9]*\">//' \
      -e 's#.....$##' \
      -e 's/&quot;/"/g' \
      -e 's/&amp;/\&/g'


This is obviously less than beautiful in approach and cleanliness. But it took 5 minutes to write, and if/when SDMB changes, it’ll take another 5 minutes to fix.

And I daresay someone who’s written code before could read the man pages for sed, grep, and awk, and start writing in relatively short order.

Duh, five minutes -> ten minutes



#!/bin/sh
wget -o /dev/null -O gq "http://boards.straightdope.com/sdmb/forumdisplay.php?f=3"
grep showthread.php gq |
  grep -v member.php |
  grep -v multipage\.gif |
  grep -v amp\;page= |
  sed -e 's/^.*t=[0-9]*\">//' \
      -e 's#.....$##' \
      -e 's/&quot;/"/g' \
      -e 's/&amp;/\&/g'


You guys left me in the dust about 22 replies ago, but many thanks: I’m certain that (once I figure out exactly what I want and how to configure it), the alternatives and references you’ve supplied will be absolutely on point (once I look up the 30 or 40 acronyms of which I’m completely ignorant). Those who gently hinted that I might not have a clue about the subtleties of any of these programming languages/protocols/environments were all too right, so I’ll be trying to figure out which is the most neophyte-friendly approach that’s compatible with the functionality I’m trying to achieve . . . .

Okay, backing up a bit:

There are basically two different classes of web pages: ones that provide a web service interface and ones that do not. If they do not, then you need to do something called screen scraping, which means you look at the raw HTML a webpage outputs and pick out the relevant bits.
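To make “pick out the relevant bits” concrete, here is a minimal sketch in shell. The sample markup, the showthread.php link pattern, and the function name extract_titles are all assumptions made up for illustration; a real page will differ, and a real scraper would first fetch the HTML with something like wget or curl.

```shell
#!/bin/sh
# Hypothetical page fragment standing in for a downloaded forum index.
sample_html='<tr><td><a href="showthread.php?t=101">Why is the sky blue?</a></td></tr>
<tr><td><a href="member.php?u=7">some_user</a></td></tr>
<tr><td><a href="showthread.php?t=102">How do magnets work?</a></td></tr>'

# extract_titles: read HTML on stdin and print the link text of thread links.
# It keeps only lines containing a thread link, then strips everything up to
# the end of the opening tag and everything from the closing </a> onward.
extract_titles() {
  grep 'showthread\.php' |
    sed -e 's/^.*showthread\.php?t=[0-9]*">//' \
        -e 's/<\/a>.*$//'
}

printf '%s\n' "$sample_html" | extract_titles
```

Note how much the patterns depend on the exact markup: one link per line, a numeric t= parameter, and a double quote right after it. That dependence is the whole weakness of the approach.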

The vast majority of pages out there do not provide one, but the ones that do make coding for your task much, much easier, so you need to decide whether it’s worth your while to learn about web services. Have a look at the sites you want to retrieve from and see how many provide a web service interface. Another avenue is to look for third-party sites that have already built a screen scraper and provide a web service feed of their own.

Now, assuming you go the screen-scraping route, I would say doing it all in one language is not the best idea. My choice would be to learn Perl for the actual parsing, to get the webpage data into some usable form, and then layer another language such as VB on top to write the GUI. Frankly, Perl is horrible for writing GUIs, and VB is still painful for all but the most rudimentary string processing. A two-layered approach like this will save you a lot of grief.

Now, about regexes: people seem more intent on scaring you with them than showing you actual useful information. Regexes are simply rules about strings that help you match stuff inside them. You provide a “rule” to the regex engine, which then tries to match the input string against it. For example, the regex “\s*{...}\s+” means: find any number of alphabetical characters, followed by a brace, followed by three characters of any kind, followed by another brace, followed by one or more alphabetical characters. There’s practically a one-to-one mapping of regex codes to plain English like that, so while it looks daunting, it really isn’t all that hard to learn.

Well, yeah, I had very little time (like now). All I really had time to say was that web services exist and the OP might consider learning a modern language. Shalmanese’s post seems to be levelling things out a bit.

Gotta go - board is going down.

Well, no. The ‘\s’ metacharacter means whitespace, not alphabetical. The ‘\w’ metacharacter is the one that means alphanumeric (plus the underscore, usually). Other than that, you’re dead on.

And I never tried to scare anyone. I just said that they need a bit of thought to work right, and that they are more prone to single-character errors than other kinds of programming. If anything, you tried to scare him away from Perl for no good reason. (The snippet you provided is valid, but not typical. Little Perl code looks like that.)

(The best way to learn regular expressions is to get an implementation of grep for your system and learn by doing. Perl’s regular expressions are different, but only because they have more features. I think all grep regexes will work unchanged in Perl.)
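If you want to try that, here is a rough sketch using grep -E. Plain POSIX grep does not promise to understand Perl’s \s and \w shorthands (GNU grep accepts them as an extension), so the sketch spells them as the bracket classes [[:space:]] and [[:alnum:]_], and escapes the braces so they are taken literally. The helper name matches and the test strings are made up for illustration.

```shell
#!/bin/sh
# matches PATTERN STRING: succeed if STRING contains a match for the
# extended regular expression PATTERN. Hypothetical helper for experiments.
matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Portable spellings of the Perl shorthands discussed above:
#   \s  ->  [[:space:]]    (whitespace)
#   \w  ->  [[:alnum:]_]   (letters, digits, underscore)
ws_pattern='[[:space:]]*\{...\}[[:space:]]+'
word_pattern='[[:alnum:]_]+\{...\}[[:alnum:]_]+'

# Optional whitespace, a braced group of three characters, then whitespace.
matches "$ws_pattern" '  {abc} tail' && echo "whitespace version matches"

# Word characters on both sides of the braced group.
matches "$word_pattern" 'foo{abc}bar' && echo "word version matches"

# The word version needs a word character before the brace, not a space.
matches "$word_pattern" '  {abc} tail' || echo "word version rejects spaces"
```

Change one character of a pattern at a time and rerun it; that is the quickest way to see how each piece behaves.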