suppose we wanted parser with better SQL syntax errors - how to go about that?

imagine you are a grad student working on a “make better SQL parser errors” thesis. You start out with the observation that errors being reported right now pretty much suck. Syntax error next to token so-and-so - not as informative as what we are used to seeing from Java compiler syntax check. So it would be nice if the parser had a grasp of the issue closer to human level and could recognize errors in human understandable terms. Right now it seems to just do early bailout.

So presumably for this project it would be nice to get a corpus of erroneous queries. A corpus of valid, legit queries (such as for validating a parser) is a separate issue in itself - I am not sure where even that could be obtained. But where do you get the erroneous queries?

Well, an obvious answer would be “log actual queries being input by human users into an RDBMS while writing their programs”. It seems natural to seek to understand human-relevant mistakes by studying what sort of errors real-life humans end up making. But, if you are that grad student, where do you go to get a corpus of logged queries from an RDBMS server that has enough human users to accumulate lots of queries, some of them erroneous?

Make a web page that shows a young woman who says she will remove her top if you can create a query the selects the correct results from a set of data.

Or you could just write a better parser that doesn’t bail at the first unacceptable character. Most syntax errors in highly used systems will just be of the ‘typo’ variety, and a library of similar mistakes will annoy the user just as much as the useless ‘syntax’ messages. But if you skip an offending character, or assume a missing delimiter was included, and can keep parsing through some limited number of syntax issues to produce a valid result, you can present the entire query with the mistakes highlighted. There are many techniques for ‘smart’ parsing.

you know what, TriPolar, I like your approach. Not the pron aspect of it (people who know SQL probably know easier ways to obtain such images without solving puzzles), but the recruit the teeming millions certainly makes sense. E.g. let’s consider all those “sir, sir, I am so experienced in all frameworks simultaneously” types of people, underemployed and easily accessible online for a low fee. Write up a bunch of SQL query puzzles / tasks, like “given so-and-so database, obtain so-and-so result set”. Get them to solve them by feeding sql queries into my RDBMS via a web interface and pay for each correct solution. Ergo, before they find the solution we will end up with lots of erroneous queries for research purposes. As long as effective hourly salary paid for this is kept low, this should be doable without being the second coming of Jimbo and Larry Sanger in terms of funding required.

One possible problem with this approach may be that sufficiently incompetent people make different kinds of errors from the competent ones. But, maybe they will gradually get the hang of it and become more competent in the process. Or, maybe with enough attempts even incompetent ones will generate all the sufficiently likely human errors.

As for the rest of your post, where the parser bails out is a secondary issue. The primary issue is that the parser needs to generate meaningful errors. It is of no benefit for me if the parser internally “understood” the entire query except for a few tokens, because what I really want is an understandable explanation of “what do I do now?” “what is my error and how do I fix it?”. E.g. it used to be in VC++ 6.0 that a single missing semicolon would lead to tens of meaningless errors after the line it happened on. Now Java/C# compilers just tell me “put missing semicolon on line so-and-so”. That’s the kind of improvement that is worth striving for.

If you can parse past the offending character, and make a reaonable statement based on a generic element, then you can present the entire statement with something like a drop down box to select valid elements. So f’instance, if you can scan past an unrecognized operator, you can present the whole statement with a drop down box where the operator goes to select from a list of valid operators. Some editors now offer auto-complete for individual elements, so in a similar manner, unrecognized table or field names could give a ‘did you mean’ type message to select from valid ones.

As you’ve seen in Java and C#, compilers are getting a little smarter. And I abhor compilers that stop dead on parsing when they hit the first error, especially when it’s so easy to pick up valid syntax following the next ; or eol. I’ve written a lot of parsers so maybe I’m too nitpicky on some of there concepts, but one of the failures in computer language development is a lack of testing on bad syntax. Typically only good syntax is tested, and bad syntax tested at a minimal level to see if it’s detected, and that’s it. Another pet peeve of mine is editors or studios that use a different parser/analyzer than the actual compiler. I’m working on a project now that involves resolution of a case like that.

Doing these things is worthwhile, because they can be useful as a pre-processing adjunct to existing systems.

Also, never underestimate the power of porn.

hmm, incidentally, I suspect that part of the problem is that an SQL statement is more “tangled” than a bunch of source code. So skipping over a single malformed line or even a single malformed method in Java may well be much easier than untangling just want went wrong with a bunch of joins, subqueries and so on. All the more so given that syntax rules of Java are easy enough for everybody, compiler writer included, to basically grasp whereas syntax rules of SQL might be unique to each major RDBMS and not well documented.

The above is to some extent a bunch of WAG, but the very fact that the Java error verification is so much ahead of the SQL one suggests that regular programming languages may be the “low hanging fruit” while SQL is more of a “who wants to climb the Everest” type of task.

Oh well. Programming language parsing and automated rewriting is a specialty of mine, but too bad that Craigslist is not peppered with wanted ads for it. So for the time being the CRUD seems to be where the money is at.

Some sort of Google search might give decent results. If anyone’s had to resort to public help, they’re probably having the type of problem that’s not immediately fixable.

I’ve actually been much impressed with the recent MSSQL Analyzer / studio whatever. I haven’t used it a whole lot, but it provides great type-ahead support that understands your current context. I think bringing the IDE in to your thesis and giving better syntax highlighting and typeahead support like MSSQL would be an improvement. You could go even further and have some sort of hint panel that showed what was available at any point in your subselects, autoindent them for better legibility, etc. Squash those errors before they get typed. :slight_smile:

Nanoda, I am not referring to “public help”. I am talking about an automated tool that would generate meaningful error messages for malformed SQL queries, much like Java compiler generates meaningful error messages for malformed Java code. The “public” aspect of it has to do with creating a corpus of erroneous queries that could be used to study the typical errors that the validating parser ought to be able to catch and explain to the user.

I agree that a sufficiently convenient and powerful IDE type tool that steers you through construction of complex queries is a good idea in the same vein and ought to exist alongside the other tools and techniques. Of course, here the quality of implementation is very important - it all too often happens with projects that sound like “let’s help a beginner” that the outcome ends up being a piece of useless crud. Plus maybe Bill Gates’ marriage to a charity-preoccupied lady :slight_smile: http://en.wikipedia.org/wiki/Microsoft_Bob

There’s a big gap between the idea of the syntax checker detecting the problem and it providing “easy” end-user feedback like “just change *this *and it’ll work”. The latter involves understanding the user’s intent, which only a mind reader can do. Computers are crappy mindreaders. The “just insert a semicolon between these two syntactically valid Java/C statements” message is little more than a child’s sleight-of-hand trick by comparison.

In general parsing problems end up in one of two error states:

  1. The structure tree cannot be built *at all *past a certain syntactical point.
  2. The structure tree cannot be built *unambiguously *past a certain syntactical point.

As you note, SQL syntax is rather messier than typical procedural language syntax. So beyond the simple case of pointing out rogue tokens in the statement, good error reporting is darn tough.

A technique I’d propose would be to abandon the “error message” mentality, and instead have the parser show you in a semi-graphical sense, how it has constructed the partial parse tree and the syntactic point where it first got lost. In this way the human might better understand the parser’s “thinking” and be able to apply the human’s knowledge of the actual intent to compare that to the machines erroneous understanding.

MS’s current “intellisense” and syntax highlighting are baby steps in this general direction. And their SQL query profiler is much farther down in the bowels, not really useful for this purpose.
Actually, for very complex queries (also also very complex spreadsheet algorithms), it’s often the case that while the procedure executes & returns results, the results are not in fact answering the question the operator was intending to ask. Or at least cannot be relied to do so for all possible future inputs.

So some sort of diagramming tool would be an aid to validating syntactically successful queries also.

This would be very far from “easy” to use, but users would quickly develop a better understanding of what they’re doing. I have plenty of experience debugging code & SQL written by people that just kept trying different things until something worked sorta for their trivial and unimaginative test cases. A tool like this would perhaps make understanding mandatory & weed out coders like that. Or not.

This is a pretty good idea, there some specific mistakes I make (leaving out the END for the CASE) that would be very apparent with that method.

The other thing would be to alter the syntax. Complex queries with multiple nested sub-queries and repeated queries can get very ugly. I can imagine some different syntax that allows for separately specified statements with alias that are all pulled together in a more organized fashion to produce the one actual query. Also, if I’ve specified some complex calculated field in the select, let me use that fields alias in the group by instead of having to copy the expression.

I think it could improve readability, decrease key strokes by not repeating some sub-queries, reduce errors and improve the ability for query engine to understand and optimize the query.

Well, since there are open source sql parsers (hello MySQL among others), how about having a look-see, figure out where the error messages you are not satisfied are generated, and perhaps why, and create new messages? Seems unlikely given the definition of the SQL language that you are going to be able to implement some radically different syntax checking and parsing scheme, so you are pretty much stuck with what you have.

Note this is why modern languages are designed to be parsable in a clean and regular fashion.