How Do I Write This Computer Program? (Simplistic Question)

Liberal · June 23, 2004, 8:24pm

Yes, VB. It’s fully integrated into the family of .NET languages. The platform has a rich set of web services tools, including complete and startlingly simple XML classes. Today’s VB is fully object oriented, complete with inheritance, polymorphism, interfaces, abstract classes, and so on.

Punoqllads · June 23, 2004, 8:31pm

First, let me say that what you are trying to do is not trivial. I would use Perl if I were writing it, but then again, I’m pretty familiar with it. There are plenty of other languages out there that would work, some of which might be easier for you to use. Assuming you go with Perl, there are two Perl packages, available from CPAN, that you should use: the LWP package and the HTML package. The tricky part is dealing with the possibility that the other website changes something. Even if things look exactly the same, the text under the hood can change, with unpredictable results. You’ll need to get the LWP routines to fetch the right pages with the right parameters, and get the HTML package to parse the results. After that you will need to know how to pull the right data out of the parse structure that is generated, which itself might require more LWP calls to get more web pages.

If you really want to try to do this, and want to tackle Perl, the canonical Perl book is the “Camel” book, available from O’Reilly Press. I’m not sure how easy it would be for someone in your situation to tackle. It seems there are also books out on LWP, but you probably want to make sure you understand Perl before you get into LWP.

Derleth · June 23, 2004, 8:37pm

Bill H., micco: Precisely what I was saying. Different sites offer different things, and different tactics may be employed on a site-by-site basis.

Small Clanger: Your first post was too short, and while I share your distaste for screen-scraping, better APIs are far from universal.

Derleth · June 23, 2004, 8:43pm

Precisely why I (and others) have such a distaste for screen-scraping in general. Your code’s functionality is entrusted to people who don’t know you, don’t know what you’ve assumed, and don’t care if they break your code.
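One way to soften that fragility is to make a scraper fail loudly when the text it depends on disappears. The sketch below is only an illustration: the function name `scrape_or_warn` and the sample HTML lines are made up for this example, not taken from any real site.

```shell
#!/bin/sh
# Hypothetical guard: read page text on stdin and print the lines that
# contain the marker string, or complain on stderr if none are found.
scrape_or_warn() {
  marker=$1
  # -F treats the marker as a fixed string; "|| true" keeps a no-match
  # result from being treated as a fatal error under "set -e".
  matches=$(grep -F -- "$marker" || true)
  if [ -z "$matches" ]; then
    echo "no lines matching '$marker'; has the page layout changed?" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}

# Old layout: the marker is present, so the line is printed.
printf '%s\n' '<a href="showthread.php?t=1">A thread</a>' |
  scrape_or_warn 'showthread.php'

# Changed layout: the marker is gone, so the function returns non-zero.
printf '%s\n' '<a href="/t/1">A thread</a>' |
  scrape_or_warn 'showthread.php' || echo "scrape failed"
```

This does not stop the other site from breaking your code, but it turns a silent empty result into an error you will notice.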

Don’t try to learn Perl from the Camel (aka “Programming Perl”). Buy the Llama (that is, “Learning Perl”) first, and decide if you need anything else after reading that. It’s much better organized, and it’s surprisingly complete (in that it describes a very usable subset of Perl, covering nearly all of the Perl you see in most Perl programs).

Punoqllads · June 23, 2004, 8:59pm

I want to add that Perl might very well not be the language for you. Let me put it this way:


/(\d+)\s+(\S.*)$/ && $_[$1]{$2} =~ s/\S//;

is a perfectly reasonable Perl statement.

Bill_H · June 23, 2004, 9:15pm

Punoqllads wrote

In general, yes, but it depends on the exact goals. For example, here’s a quick shell script that will print out all the current topics in GQ:



#!/bin/sh
wget -o /dev/null "http://boards.straightdope.com/sdmb/forumdisplay.php?f=3" |
  grep showthread.php gq |
  grep -v member.php |
  grep -v multipage\.gif |
  grep -v amp\;page= |
  sed -e 's/^.*t=[0-9]*\">//' \
      -e 's#.....$##' \
      -e 's/&quot;/"/g' \
      -e 's/&amp;/\&/g'

This is obviously less than beautiful in approach and cleanliness. But it took 5 minutes to write, and if/when SDMB changes, it’ll take another 5 minutes to fix.

And I daresay someone who’s written code before could read the man pages for sed, grep, and awk, and start writing in relatively short order.

Bill_H · June 23, 2004, 9:23pm

Duh, five minutes -> ten minutes



#!/bin/sh
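# Fetch the GQ forum index into the file "gq"; -o /dev/null discards wget's log.
wget -o /dev/null -O gq "http://boards.straightdope.com/sdmb/forumdisplay.php?f=3"
# Keep lines that mention thread links, then drop member and pagination lines.
grep showthread.php gq |
  grep -v member.php |
  grep -v multipage\.gif |
  grep -v amp\;page= |
  # Strip everything through the link's t=NNN">, drop the last five
  # characters of the line, then decode the &quot; and &amp; entities.
  sed -e 's/^.*t=[0-9]*\">//' \
      -e 's#.....$##' \
      -e 's/&quot;/"/g' \
      -e 's/&amp;/\&/g'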

Huerta88 · June 23, 2004, 9:46pm

You guys left me in the dust about 22 replies ago, but many thanks: I’m certain that (once I figure out exactly what I want and how to configure it), the alternatives and references you’ve supplied will be absolutely on point (once I look up the 30 or 40 acronyms of which I’m completely ignorant). Those who gently hinted that I might not have a clue about the subtleties of any of these programming languages/protocols/environments were all too right, so I’ll be trying to figure out which is the most neophyte-friendly approach that’s compatible with the functionality I’m trying to achieve . . . .

Shalmanese · June 23, 2004, 10:23pm

Okay, backing up a bit:

There are basically two different classes of web pages, ones that provide a web service interface and ones that do not. If they do not, then you need to do something called screen scraping which means you look at the raw HTML code a webpage outputs and pick out the relevant bits.

Since the vast majority of pages out there do not provide one, but the ones that do provide one make coding for your task much, much easier, you need to decide whether it’s worth your while learning about web services. Have a look at the sites you want to retrieve and see how many provide a web service interface. Another avenue is to look at third-party sites that might have already built a screen scraper and provide a web service feed on their page.
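As a rough illustration of the screen scraping described above, here is a minimal shell sketch that picks link titles out of raw HTML. The HTML sample and the function names `sample_html` and `extract_titles` are hypothetical; a real page would be fetched first and would almost certainly need different patterns.

```shell
#!/bin/sh
# Hypothetical sample of raw HTML, standing in for a fetched page.
sample_html() {
cat <<'EOF'
<html><body>
<h1>Forum index</h1>
<a href="showthread.php?t=101">First thread title</a>
<a href="member.php?u=7">some_user</a>
<a href="showthread.php?t=102">Second thread &amp; more</a>
</body></html>
EOF
}

# Keep only the thread links, then strip the markup around each title.
extract_titles() {
  grep 'showthread\.php' |
    sed -e 's/^.*t=[0-9]*">//' \
        -e 's#</a>.*$##' \
        -e 's/&amp;/\&/g'
}

# Prints the two thread titles, one per line.
sample_html | extract_titles
```

Everything here depends on the exact shape of the HTML, which is why a site redesign tends to break this kind of code.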

Now, assuming you go the screen scraping route, I would say doing it all in one language is not the best idea. My choice would be to learn Perl for the actual parsing bit, to get webpage data into some usable form, and then to layer another language like VB on top to write the GUI functions. Frankly, Perl is horrible for writing GUIs, and VB is still painful for all but the most rudimentary string processing. Using a two-layered approach like this will save you a lot of grief.

Now, about regexes: people seem more intent on scaring you with them than showing you actual useful information. Regexes are simply rules about strings which help you match stuff inside them. What you do is provide a “rule” to the regex engine, which intelligently tries to match the input string as closely as possible. For example, the regex “\s*{...}\s+” means: find any number of alphabetical characters, followed by a brace, followed by three characters of any kind, followed by another brace, followed by one or more alphabetical characters. There’s practically a one-to-one mapping of regex codes to plain English like that so, while it seems daunting to look at, it really isn’t all that hard to learn.

Small_Clanger · June 24, 2004, 8:15am

Well yeah, I had very little time (like now). All I really had time to say was that web services exist and the OP might consider learning a modern language. Shalmanese’s post seems to be levelling things out a bit.

Gotta go - board is going down.

Derleth · June 26, 2004, 12:31am

Well, no. The ‘\s’ metacharacter means whitespace, not alphabetical. The ‘\w’ metacharacter means alphanumeric (letters and digits, and the underscore, too). Other than that, you’re dead on.

And I never tried to scare anyone. I just said that they need a bit of thought to work right, and that they are more prone to single-character errors than other kinds of programming. If anything, you tried to scare him away from Perl for no good reason. (The snippet you provided is valid, but not typical. Little Perl code looks like that.)

(The best way to learn regular expressions is to get an implementation of grep for your system and learn by doing. Perl’s regular expressions are different, but only because they have more features. I think all grep regexes will work unchanged in Perl.)
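A small learn-by-doing sketch along those lines, with assumptions stated up front: the sample lines and the function names are made up, and POSIX bracket classes stand in for the backslash shorthands, since ‘\s’ and ‘\w’ are not part of POSIX grep syntax (GNU grep accepts them as extensions). `[[:space:]]` plays the role of ‘\s’ and `[[:alpha:]]` covers letters only.

```shell
#!/bin/sh
# Hypothetical sample lines to experiment on.
sample_lines() {
cat <<'EOF'
  {abc} tail
{ab} tail
foo{xyz}bar
{123}  end
EOF
}

# Whitespace version: optional leading whitespace, "{", any three
# characters, "}", then at least one whitespace character.
match_whitespace() {
  grep -E '^[[:space:]]*[{]...[}][[:space:]]+'
}

# Letters version: the same shape, but with letters around the braces.
match_letters() {
  grep -E '^[[:alpha:]]*[{]...[}][[:alpha:]]+'
}

echo "whitespace around the braces:"
sample_lines | match_whitespace
echo "letters around the braces:"
sample_lines | match_letters
```

The first pattern keeps the lines that have only whitespace around a three-character brace group; the second keeps only `foo{xyz}bar`. Swapping one character class for another changes which lines match, which is the kind of single-character difference mentioned earlier in the thread.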
