The only computer programming I know dates back a long way, to BASIC and so forth, and isn’t helping me even understand what I need to use, let alone how to write what I want.
I want to write a program that will do repeated automatic searches of online databases and output the results in a particular format. I guess it’s a sort of data mining I have in mind. The websites I want to query have all the data I want, but they’re not set up to do multiple queries/output of the sort I want, so I’d like to custom configure something to do it for me rather than have me do multiple searches.
Example: I want to define a set of cities and have “my” program go fetch the weather.com forecast for all those cities and display it in a given comparative format. (Assume for the moment that even the best custom-preferences setting in weather.com’s own options doesn’t allow me to get the type of output I want).
Or I want to run a bunch of searches to see what flights are actually available through Travelocity within a specific range of dates, between a given pair (or multiple pairs) of cities, but the Travelocity engine would require me to do multiple multi-step guesswork-queries to get all this information.
Or, my frequent flyer program allows me to check if award travel is available between two cities on a specific date, but their website isn’t configured to display, in one query, all the dates on which a given type of award is available, so I currently have to manually enter a bunch of arbitrary dates and compare the results.
What’s the language or program or toolkit with which I have to familiarize myself to have a shot at writing the type of automated multiple-search-and-display program I want? Which “For Dummies” book should I be buying?
The technology you need to be looking at is web services. If you’re a novice programmer, you’ve got a fair amount of work to do. Learning some Java couldn’t hurt; it’s a good fit with network/internet usage. Gotta go now before the SDMB goes down, maybe back later.
Er, that’d be Visual Basic, wouldn’t it, Lib? Got the impression that this guy was originally talking about basic BASIC.
Web Services is the way to go, here. A co-worker was recently working on the very Weather thing you mention… he made a Flash application that connected to a web service that provided current weather conditions for a given zip code, and would cycle through them, displaying them.
Visual Basic would probably be the easiest route. Assuming you know some programming, VB isn’t too hard to grasp, although the name is deceiving: you will use almost nothing you learned from BASIC.
When you say “web services” you typically mean that a site exposes a set of features above and beyond what you’d see just browsing the site. That is, they would publish an API which would allow you to conveniently query and get raw data back, and they’d probably publish something like a WSDL descriptor. That means the server has to cooperate if you want to do web services, and most sites don’t.
What the OP really wants is some screen scraping (which is just a term which means your program is grabbing the same output a person would see on the screen). This is pretty easy to do in almost any language. As other posters have said, VB or Java would be easy. But you could do it in Perl, Lisp or Cobol. No language really has any particular benefit here as long as it makes it easy to issue an HTTP query and parse results. The problem with screen scraping is that the data you get back from your query is a mess of HTML and other formatting and you may have to do a lot of coding to parse out the bits you care about. If that’s the case, you have to customize pretty heavily to individual results, so you have to do more work when you want to query a new site (or if an old site changes format). However, it’s all pretty straightforward. Start like this:
[ol]
[li]Pick your favorite programming language. Almost any one will do as long as it’s new enough to have some means of making an HTTP query.[/li]
[li]Write an app with a hardcoded query that dumps the results to output (screen or file) so you can see what you get back (there’s a sketch of this step just after the list). You can probably find sample code for this just by doing a Google search on the name of your language of choice plus “HTTP” or some such. If you use VB, you have several alternatives ranging from low-level wininet calls to almost transparent third-party components. If you use Java, there are several methods in the java.net library. If you use Perl, there’s LWP.[/li]
[li]Add some string processing on the results to filter out the bits you care about. You could strip all the HTML, or search for something that marks the start and end of the data you want so you can clip out the good stuff.[/li]
[li]As you get comfortable with the filtering required, make the queries more flexible and add other sites.[/li]
[li]Call Opal and tell her you’re done.[/li]
[/ol]
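Here’s a minimal Perl sketch of step 2, using LWP::Simple from CPAN. The URL is a made-up placeholder; substitute whatever query URL the site you’re scraping actually uses.

[code]
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;   # from CPAN; provides get()

# Hypothetical URL -- replace with the real query for your site of choice.
my $url = 'http://www.weather.com/weather/local/60601';

my $html = get($url) or die "Couldn't fetch $url\n";

# Dump the raw result so you can see exactly what comes back.
print $html;
[/code]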
micco’s right. Web services are a different beast.
If I were tackling this (and I have many times on other sorts of projects), I’d use shell scripts (bash, awk, sed, etc.) or perl. But you could just as well do it in straight basic.
And may I politely add that people who don’t know the answer to a question really have no business answering it. Most of the answers here are just flat wrong.
If you’re willing to write some regular expressions (not that hard, I promise), Perl is a wonderful language for screen-scraping because it can not only parse text easily (and HTML is text), but it can use the great CPAN modules dedicated to getting and parsing HTML.
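Just to make that concrete, here’s a minimal sketch of the fetch-and-parse cycle using LWP::Simple and HTML::TreeBuilder, both from CPAN. The URL and the choice of table cells are made up for illustration; point look_down() at whatever tag actually holds your data.

[code]
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;          # fetching pages
use HTML::TreeBuilder;    # parsing the HTML they return

# Hypothetical URL -- substitute the page you actually care about.
my $html = get('http://www.example.com/forecast?city=Chicago')
    or die "Fetch failed\n";

my $tree = HTML::TreeBuilder->new_from_content($html);

# Print the text of every table cell. Narrow the look_down() criteria
# (tag name, class, etc.) to target just the data you want.
print $_->as_text, "\n" for $tree->look_down(_tag => 'td');

$tree->delete;   # free the parse tree when you're done
[/code]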
Perl is a good language to know in general, because it’s such a polyglot. Learning Perl can be a short trip through multiple programming methods, from simple procedural to OO to simple functional and declarative. (Regular expressions, for example, are a very useful declarative language.) Being a polyglot also means that if you know any other language, you have a toehold in Perl. Going from BASIC to Java will be much more of a leap. (Java was written with C and C++ people in mind.)
I’ll second this. I didn’t push Perl in my post because I really wanted to make the point that there is no “right” answer when choosing a language for this project, but if there were a right answer, it would be Perl.
I clipped off the conditional “if you’re willing to do regular expressions” because, as daunting as they might seem, regexes are easier to use than the brute-force string parsing in other languages. Thankfully, VB has some rudimentary regexp support now, so you don’t have to do everything with string functions like Instr and Mid. I’ve done screen scraping in VB and Java and there’s no reason not to, but it is a lot easier in Perl. I constantly annoy my Perl-deficient colleagues by pointing out how much easier/shorter/faster/better/more elegant a given piece of code would have been in Perl.
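To show what I mean, here’s the same extraction done both ways, in Perl for brevity (the HTML fragment is made up):

[code]
my $html = '<td class="temp">72F</td>';

# Brute force, Instr/Mid style: hunt for delimiters, then cut.
my $start = index($html, '>') + 1;
my $end   = index($html, '<', $start);
my $temp  = substr($html, $start, $end - $start);   # "72F"

# Regex style: one line says the same thing.
my ($temp2) = $html =~ />([^<]+)</;                  # "72F"
[/code]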
You may politely add that - but no one here’s given a definitively wrong answer - except maybe you. Straight Basic? I thought MS abandoned support for regular Basic with the advent of Visual Basic. Does “straight” Basic even have web-interactivity capability?
For the specific example of the weather for the cities, the web-services approach is laughably easy and is provided to the general public.
Yes, he could write programs to hit various Weather Channel pages, parse the text, and extract the relevant information, only to have to rebuild the program whenever the Weather Channel changes its page layout. Or it could be done with a smart web-service interface that returns results in a standard, pre-parsed format.
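As a sketch only (the endpoint, namespace, and method name here are all invented for illustration; a real service publishes its own in its WSDL), a SOAP-style weather query from Perl with the CPAN module SOAP::Lite might look something like this:

[code]
#!/usr/bin/perl
use strict;
use warnings;
use SOAP::Lite;   # from CPAN

# Every identifier below is hypothetical.
my $result = SOAP::Lite
    ->uri('http://example.com/WeatherService')     # service namespace
    ->proxy('http://example.com/soap/endpoint')    # where calls are sent
    ->getForecast('60601');                        # remote method + zip code

die $result->faultstring if $result->fault;
print $result->result, "\n";   # standard, pre-parsed data, no HTML to scrape
[/code]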
Phew! Thanks for the support (I think). I work on self-contained applications, and where we communicate with web sites they are either our sites or customers’ sites that have to match specs. From my point of view, an “online database” that required you to do screen scraping would be beyond daft.
micco: Yes, regexes are much better than the alternatives, if only because the alternatives try to reinvent them when they need to attain full flexibility. They are also, like life in a Hobbesian state of nature, nasty, brutish, and short, at least to the untrained eye. A simple Perl regex I recently crafted looked like this:
{([^{}]+)}
(Stick a forward slash on both sides of that and a semicolon on the end and you have a valid piece of Perl, BTW.) To you or me, that’s simple: Match a pair of braces with one or more non-brace characters between them, and capture the non-brace characters in a backreference register. You can go character-by-character and figure that out. But, of course, you need to go character-by-character sometimes. And that can get confusing. Even if you’re a wizard.
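For the curious, here it is earning its keep in a couple of lines of Perl:

[code]
my $text = 'ignore {capture this} and also {this}';
while ($text =~ /{([^{}]+)}/g) {
    print "$1\n";   # prints "capture this", then "this"
}
[/code]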
Regexes are math.* They are a bunch of goofy symbols that go together in simple ways that can be finessed into complex relations that do powerful things. They save gobs and gobs of code every single time they are used. They are a problem-solver in the best, most general sense. But there’s a steep hill before the fundamentals sink in, and if the student isn’t prepared, he won’t make the climb.
*(Literally. Look up the work of Stephen Kleene. Wikipedia is its usual helpful self here.)
It honestly depends on the specific data sources the OP is interested in. Some are available via web services. Some have their own custom interfaces to deliver that info. Some, like the IMDb, you may have to screen-scrape.
That’s a pretty narrow perspective. I work on hundreds of sites. Every single one of them is database driven. Only one exposes a web services API. It depends on who your audience is, what they want to do with the data, and what you want them to do with the data. Why should someone pay to develop and support a web service interface to their website data when none of their customers gain any benefit from it? It might benefit their competitors who want to do easy price comparisons on the entire catalog, but that’s not a reason for the site owner to pony up for development. I’ve built a lot of custom web-service-like interfaces to websites to allow customers to upload PO documents instead of placing orders in a cart and stuff like that, but the customers in these cases had neither the technical skill nor the inclination to support the client side of a real web services interface, so it made no sense to build it. It’s not daft; it’s customer driven.
I’m sorry, but there’s just way too much ignorance here. For those that don’t get it (and I won’t name names), here is how it works:
A “web service” is a provider of information. A larger application may loosely be referred to as a web service, but inside the application, the “web service” is the part of the application providing information to someone or something else. Web services typically deliver information in a format known as XML, which is designed to be read by machines. Humans can read it (even novices can usually figure out what information is contained), but it’s not like reading a web site. It’s just data in a format that other computers can read. The web service may be front-ended by some GUI code that displays the data in a human-readable format, but that’s a different subject.
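For instance, a weather web service might hand back something like this. The payload is made up (a real service defines its own tags), but the shape is the point:

[code]
<?xml version="1.0"?>
<!-- hypothetical response from a weather web service -->
<forecast>
  <city>Chicago</city>
  <high units="F">78</high>
  <low units="F">61</low>
  <conditions>partly cloudy</conditions>
</forecast>
[/code]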
So how does a web service fit into the picture here? There are two possibilities. The OP could design his application as a web service to provide data to other entities. But that doesn’t seem to be his intent. Or he could be reading from sites that offer data as web services. But he was very clear that this wasn’t his intent either. He gave three specific examples where data was clearly not being delivered as a web service, but rather where data (flight request info, for example) is plugged into a form and an HTML page is generated with flight information.
The OP is clear that what he wants is to inject desired parameters, and scrape resultant screens, then deliver the information in a format he desires. Any programming language could do this, and I stand by my previous recommendations.
For more information on web services, please go read the work of the appropriate standards bodies here. In case you lose this page, you can use Google; it’s the first entry for “web services”.
Again, if you don’t know, don’t answer. You’re not doing anybody any favors, and you’re making yourself look silly.
Bill H., the only one who has a problem here is you. Maybe that’s meaningful.
What you appear to be missing is that sometimes, screen-scraping is rather absurdly wrong because a much nicer API exists. Google is the prime example here, and perhaps some of the examples the OP mentioned have similar services. APIs aren’t limited to RDF feeds.
If no API exists, you are reduced to screen-scraping. That is very true. But the OP thought (apparently) that screen-scraping was the only possible option for all sites, when this is clearly not the case.
Nobody here is wrong. You, however, are not being as polite as possible.
Accurately and succinctly put. If we knew the specific online databases the OP was interested in, we could give the specific best recommendation for each site.
I was making a point of staying out of it since Bill H. is more than capable of explaining his position, but I’d say he has a point. Telling the OP that the answer is web services is like telling someone who needs to change their oil that all they need is an 8mm socket wrench. It might be true on a few cars (sites) but it’s not generally true and it’s going to do the OP more harm than good to start down that path without a more complete answer.
Web services are great where they’re accessible, but they’re far from a general solution at this point. They certainly are implemented on a few large sites, some of which may be what the OP is after, but they’re not likely to solve the whole problem for the OP. On the other hand, it’s pretty simple to write a general-purpose screen scraper that will work everywhere with just minor modification.
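Case in point, a bare-bones general-purpose scraper in Perl runs about ten lines. This is a sketch, not production code; the URL and pattern come in on the command line, and the pattern is expected to contain one capture group:

[code]
#!/usr/bin/perl
# usage: perl scrape.pl URL 'REGEX-with-one-capture-group'
use strict;
use warnings;
use LWP::Simple;

my ($url, $pattern) = @ARGV;
die "usage: scrape.pl URL REGEX\n" unless defined $pattern;

my $html = get($url) or die "Couldn't fetch $url\n";

# Print every match of the caller's capture group. Pointing this at a
# new site means changing the arguments, not the code.
while ($html =~ /$pattern/g) {
    print "$1\n";
}
[/code]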
As far as I can tell, no one in this thread has given a wrong answer, but telling someone to use web services on the client side without first checking to see if the server even supports it is so incomplete as to be misleading at best.