is copy-paste in a stored procedure a common thing?

A memorable feature of some of the India-originating code I have seen was stored procedures that handled, let’s say, 2 or 3 related cases using 2 or 3 blocks of copy-pasted SQL. Sometimes the blocks were pretty big, so I had to run diff to figure out which couple of words or lines actually differed.

How easy or hard it would have been to write it more compactly for the same cases, I am not sure - maybe in some places a CASE expression would have worked, and in others it would have taken dynamic SQL. Using a different breakdown of the cases apparently did not occur to the programmer.
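For concreteness, here is the sort of collapse I have in mind - a minimal sketch with hypothetical table and column names, nothing like the actual code I saw:

```sql
-- Hypothetical names. Two copy-pasted blocks that differed only in the filter column:
--   Block 1: ... WHERE ShipCountry = @value
--   Block 2: ... WHERE BillCountry = @value
-- can sometimes be folded into one query driven by a parameter:
DECLARE @mode varchar(10) = 'ship', @value nvarchar(40) = 'Germany';

SELECT OrderID, CustomerID, OrderDate
FROM dbo.Orders
WHERE @value = CASE @mode WHEN 'ship' THEN ShipCountry
                          WHEN 'bill' THEN BillCountry END;
```

Whether the optimizer can still use an index on either column after a trick like that is a separate question, and may be one honest reason people fall back on copy-paste or dynamic SQL.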

Question 1 - is this a common problem in low-quality stored procedures, or is it an uncommon signature style of whoever wrote those?

Question 2. If, hypothetically, this is a common thing and reflects some incompatibility between SQL syntax and the coding skills of lesser programmers (well, judging by some of the responses I have seen in other threads, some corners of the syntax are not a walk in the park for the best of us), would it make more sense to generate all stored procedures dynamically from a “stored procedure markup language” (let’s say a dialect of JSP - I don’t like PHP’s poor syntax verification) using a separate “stored procedure management studio”? That way there are no more worries about finer points of syntax preventing a single query from handling multiple cases - you just write the code that generates a separate query for each case based on a template, and the system reports if any of these queries are malformed. Then you fix that, and you store the scripts in version control instead of the stored procedures themselves.
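To make the idea a bit more concrete, here is a minimal sketch of the “generate one query per case from a template” part, using nothing fancier than T-SQL itself - the template, the case list and the procedure names are all made up:

```sql
-- Hypothetical sketch: expand one template into one procedure per case.
-- The server rejects any generated procedure that is malformed, at creation time.
DECLARE @template nvarchar(max) = N'
CREATE PROCEDURE dbo.GetOrdersBy_{CASE}
    @value nvarchar(40)
AS
    SELECT OrderID, CustomerID, OrderDate
    FROM dbo.Orders
    WHERE {COLUMN} = @value;';

DECLARE @cases TABLE (CaseName sysname, ColumnName sysname);
INSERT INTO @cases VALUES ('ShipCountry', 'ShipCountry'), ('BillCountry', 'BillCountry');

DECLARE @name sysname, @col sysname, @sql nvarchar(max);
DECLARE case_cursor CURSOR FOR SELECT CaseName, ColumnName FROM @cases;
OPEN case_cursor;
FETCH NEXT FROM case_cursor INTO @name, @col;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = REPLACE(REPLACE(@template, '{CASE}', @name), '{COLUMN}', @col);
    EXEC sys.sp_executesql @sql;   -- CREATE fails right here if the expansion is broken
    FETCH NEXT FROM case_cursor INTO @name, @col;
END;
CLOSE case_cursor; DEALLOCATE case_cursor;
```

In the hypothetical “studio” the template and the case list would of course live in version control rather than in a one-off script like this.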

Incidentally, I realize that the “studio” is not really needed and this could be done at the architectural level in the codebase itself. But I think that specialized tools are always a good idea for complex problems. Besides, the type of shops that would benefit most from this are also likely to be the ones deficient in the architecture department.

  1. It has nothing to do with stored procedures. Incompetent programmers the world over love copying and pasting code all over the place. Hard ideas like abstraction and factoring out common bits are alien to them.

  2. There’s rarely a good reason to generate any code from other code. If you find yourself doing that, it’s probably because you’re not using the appropriate abstractions. Or maybe because you’re trying to do something inappropriate given the tools you are using.

Who is the “you”, may I ask? I personally don’t like stored procedures and prefer using dynamically generated queries. But the folks before me put nearly every table access into a stored procedure.

It may (or may not) be the case that SQL is especially prone to copy-paste because it is harder to understand and hence harder to abstract. E.g., in what I have seen, the copy-paste in the stored procedures was a lot more egregious than in the rest of the codebase.

So this particular problem has not been salient in your experience? Or do you just fear the sheer hellish consequences of empowering the Indian programmers to unleash insufficiently abstracted, dynamically generated stored procedures upon the world?

Queries are one thing, but stored procedures are another beast entirely. When one talks of database queries, in my experience, that is restricted to purely- (or mostly-) declarative SQL. I don’t mind generating SQL dynamically for simple situations; though anything requiring more than a little bit of logic to assemble I would rather make into a view or stored procedure.

Stored procedures are generally written in a fully imperative superset of SQL, like PL/SQL or Transact-SQL, and you should be able to avoid almost all instances of copy-and-paste by abstracting out the common parts. (This is common to any programming language, and has nothing to do with databases specifically.)
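For instance - a hedged sketch with made-up names, not a prescription - the big shared query can live in one view, and the case-specific procedures reduce to thin wrappers, so a change to the common part happens in exactly one place:

```sql
-- Hypothetical: factor the shared query into a view...
CREATE VIEW dbo.OrderSummary AS
SELECT o.OrderID, o.CustomerID, o.OrderDate, o.ShipCountry, o.BillCountry,
       SUM(d.Quantity * d.UnitPrice) AS Total
FROM dbo.Orders o
JOIN dbo.OrderDetails d ON d.OrderID = o.OrderID
GROUP BY o.OrderID, o.CustomerID, o.OrderDate, o.ShipCountry, o.BillCountry;
GO
-- ...and keep only the genuinely different bit in each procedure.
CREATE PROCEDURE dbo.OrderSummaryByShipCountry @country nvarchar(40) AS
    SELECT * FROM dbo.OrderSummary WHERE ShipCountry = @country;
GO
CREATE PROCEDURE dbo.OrderSummaryByBillCountry @country nvarchar(40) AS
    SELECT * FROM dbo.OrderSummary WHERE BillCountry = @country;
GO
```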

In my experience, bad programmers will copy-and-paste anywhere, whether it be in SQL queries, stored procedures, front-end code, etc. Crucially, they also tend to make it more difficult to avoid copy-and-paste, by writing systems that don’t provide any modularity or well-defined APIs to allow for code reuse.

I don’t think I’ve ever seen an actual stored procedure that was dynamically generated, but I don’t think I would like it. :slight_smile:

Hmm, on second thought, I think those ugly stored procedures I am referring to actually consisted of just a single query each, sort of like a view rather than a true stored procedure. So the case issue could have been handled by getting rid of the SP and replacing it with a dynamically adjusted query sent from the business logic layer. At least, unless they happened to be referenced in other stored procedures.

If this particular type of problem is common in poorly developed stored procedure codebases, dynamically generating these SP queries could be thought of as a minor bandaid. It would not require any revolutionary refactoring but locally the code would become a bit more readable and maintainable.

Overall, regardless of the particular issue, I guess this points to “maintaining stored procedures” being a distinct type of activity, on the same level as “maintaining a Java or C# codebase”. And, interestingly enough, one for which the tools do not seem to be nearly as powerful as Eclipse or Visual Studio.

This doesn’t answer your primary question, but I would be careful about getting too clever with generating SQL for the sake of refactoring out redundancy at all costs.

It seems people have been arguing about stored procedures and their use for decades now, but off the top of my head I will say there are at least two advantages of using SPs:

  • compile-time checks when compiling your proc. Don’t have update permission on that table? You know right away. With generated SQL code, you don’t find out until runtime - which could be very bad if, in your testing, you missed the particular corner case that gave rise to the generation of that particular update statement. In a stored proc, all possible branches get compile-time validated (except for dynamic SQL within the proc, of course); a small sketch follows this list.
  • ease of query tuning for DBAs, and general code readability.
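
To illustrate the first point - a small hedged sketch, assuming SQL Server and made-up names:

```sql
-- Assuming dbo.Orders exists but has no column named OrderDat:
CREATE PROCEDURE dbo.BrokenProc AS
    SELECT OrderDat FROM dbo.Orders;   -- rejected at CREATE time (columns of existing tables are checked)
GO
-- The same mistake in generated SQL is only caught when this string actually runs:
DECLARE @sql nvarchar(max) = N'SELECT OrderDat FROM dbo.Orders;';
EXEC sys.sp_executesql @sql;
```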

Given the choice between a few stored procedures that are almost the same except for some differences in the WHERE clause, versus some code with tortured logic that generates the WHERE clauses piecemeal, where you really can’t be sure what the final SQL statement will be, I’d probably choose the former.

arseNal, your item 1 sounds interesting, all the more so for a codebase that is short on unit tests. It suggests to me that since there may exist a constituency that likes generated queries (let’s say myself, the people who wrote phpBB, and maybe various others out there), the development methodology for such people ought to emphasize verification of each and every possible query situation. The methodology can of course exist just in one’s head / in a to-do list, or it can be embodied in special-purpose tools developed for this purpose.

Item 2 may be very valid, but not being a DBA I am not competent to evaluate it. As a programmer, I tend to think of computer resources today as sort of infinite, whereas development time and code maintainability are always at a premium. But it may well be that in the real world, absent the commodification of Google-style distributed databases, query tuning beyond just putting indexes in the right places is indeed a big issue in many cases.

Regarding generation of stored procedures (or any other code generation): beware the Inner Platform Effect.

Code generation is often, but not always, a “smell”, as mentioned above.

It reminds me of the following pattern people often go through in their career of writing database applications. I’m not addressing the OP here, as he seems to be doing the right thing, which is thinking about this in a mature way (and consulting The Dope!). I loves me some abstraction, design patterns and generic code, but I’ve seen the following a lot, as young padawans make their way through the ranks to become a Jaded Old Geezer like me.

First system: Wow, this is cool! I can create tables with loads of data types, proper relations and foreign key constraints, everything. Normalization rocks! Stored procedures FTW!

Second system: (sophomore - just enough knowledge to be dangerous) You know, one issue with that first system: every time I wanted to make a schema change, I had to write scripts, update the production database, recompile code, and sometimes do a DB reorg. Wouldn’t it be cool to have an ultra-generic schema, where I can change anything I want at runtime? I’ll just have a single table, with name-value pairs, and maybe a varchar field for “data type”! This will rock – I may even publish. No one has ever thought of this before.

Third system: That Second System sucked. Every query ended up being a mess of joins. There was no validation of data by the RDBMS because everything was just name-value pairs. The thing ran slow as molasses, and when I hollered at the DBA, he looked at my schema and shook his head. I thought he’d be happy because there was only one table to manage, but he chewed me out because my name-value thing had defeated any attempt he made to add proper indexes to the table and rendered it impossible for the database to ensure any kind of data integrity. I’m going back to using the RDBMS the way it was designed to be used. I Am Enlightened.
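For anyone who has not met the Second System in the wild, it looks roughly like this - a hypothetical sketch; the real ones are usually worse:

```sql
-- The ultra-generic "one table for everything":
CREATE TABLE dbo.Attribute (
    EntityID  int          NOT NULL,
    AttrName  varchar(50)  NOT NULL,
    AttrValue varchar(255) NULL,
    DataType  varchar(20)  NULL,       -- 'int', 'date', ... enforced by nobody
    PRIMARY KEY (EntityID, AttrName)
);

-- Even "customers in Ohio with a credit limit over 5000" needs one self-join
-- per attribute, plus casts the RDBMS cannot validate:
SELECT state.EntityID
FROM dbo.Attribute AS state
JOIN dbo.Attribute AS climit
  ON climit.EntityID = state.EntityID AND climit.AttrName = 'credit_limit'
WHERE state.AttrName = 'state'
  AND state.AttrValue = 'OH'
  AND CAST(climit.AttrValue AS int) > 5000;
```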

As is always the case in software, or probably in life in general, the worst thing you can do is to have one style of doing things and insist on using that style for every application, rather than making judicious use of different techniques where appropriate. So I never said that I am always for the use of SPs, not by a long shot.

I’m trying to think of what data is involved in something like phpBB, and honestly, you have a little bit of user data, maybe some tables for private messaging, and then a whole mess of posts. So really there probably aren’t that many queries that need to be tuned compared to, say, an airline reservation system or a trading platform. Those systems will have tons of different types of data, with literally thousands of different queries that may need to be tuned for extreme performance - and in more mission-critical situations (at least more so than some bulletin board, most likely).

So I’m really stating the obvious mantra: be sure to use the right methods for the right situations. And to be fair, my initial post was more from the point of view of business apps rather than tools whose end users are often other developers.

As DarrenS also pointed out, up-and-coming developers often think they’re very clever in coming up with some ingenious way to do something that no one has ever thought of before. And their shit works nicely at first, but then, as they have to add a join here or throw in a conditional correlated subquery there, it often ends up becoming an unreadable, untunable, and possibly error-prone mess.

I also very much appreciate the comment of DarrenS, but for a different reason. A fundamental feature of my worldview is that “choice between A and B will usually suck one way or the other whereas a synthesis of A and B can be the best thing since sliced bread and more”. So let’s see what we can learn about A and B here.

DarrenS is describing two systems, System 1 and System 2, which both have distinct pros and cons. From a programmer’s standpoint, System 2 sounds pretty darn cool “in the here and now”. It is inherently more powerful than System 1. OTOH, System 1 will run faster and be less error-prone.

If we want to achieve a synthesis, we should pick one of these systems as a basis and then try to incorporate new features into it to achieve some of the benefits of the other. Why settle for running and troubleshooting as slow as molasses (System 2)? Why settle for initial development and subsequent modification as slow as molasses (System 1)? How about we try to achieve both at the same time?

To take a well-known example, imagine a flame-war argument between System 1 Lisp and System 2 C. C is easy to write in but can be pretty error-prone. Lisp is not so easy to write in but has automated memory management and various other powerful (though not always actually relevant) goodies. Well, so which of these should conquer the world? How about, drumroll, PHP/Java/C#, which are all dialects of C with a bunch of ideas imported from Lisp? That is, the synthesis won while the pure precursors (all of which had their strong proponents) lost.

Now back to our database-backed systems. Based on this thread so far, it sounds to me like System 2 is the more natural basis for the hypothetical “perfect synthesis”. It is also, obviously, the basis for a whole lot of not-so-perfect attempts at getting things to work; e.g., AFAIK the SAP framework is sort of a System 2 with a big company making big bucks from supporting its limitations.

However, the failure of attempts so far does not necessarily mean that the synthesis is inherently impossible. E.g., AFAIU it is a lot easier to automatically transform System 2 into something closer to System 1 than the other way around. Correct me if I am wrong, but such a transformation just might be called “OLAP” (or “data cube”, or something like that). Perhaps after this transformation it could start running a lot faster and become more amenable to automated data integrity verification.

Or perhaps automated data integrity verification for System 2 has nothing to do with OLAP whatsoever and is simply a not-yet-existing technology that will eventually get built, to accommodate all those sophomores out there.

Notice an important point in my above musings - my goal is first and foremost to optimize the process for the human user. If the human user is too dumb or lazy to build and maintain stuff according to the System 1 pattern, then I say, let us make a pattern that he would be able to use effectively (maybe not precisely System 2, but maybe some System 2A), and then let us delegate to the smart, tireless computer the task of handling the dirty laundry. Just as, as we can see from the discipline’s history, if the user is too dumb to manage his memory, we let the garbage collector do it for him.

Hmm, on second thought, I don’t think I understood DarrenS’s System 2 correctly. A key-value table sounds kinda crazy. It seems I was thinking more along the lines of System 2 being more highly normalized than usual, not messed up quite like that.

Maybe there is a continuum here, from the no normalization case all the way to the single table case, with reasonable stuff in the middle.

The truly interesting issue here would be: is there in fact the sort of tradeoff that I have hypothetically posited above, where adopting an unusually high level of normalization, coupled with additional (possibly not-yet-existing) tools, would provide higher programmer productivity than the more traditional level of normalization done today?

If you’re interested in reasonable database designs that don’t conform to the standard normalization models, take a look at Dimensional Modeling, which is used in data warehousing systems. In that case you’re dumping data out of a transactional system into deliberately denormalized tables which are optimized for read-efficiency.
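A bare-bones sketch of the dimensional idea, with hypothetical names: one fact table of measurements surrounded by deliberately denormalized dimension tables, so a typical report is a single star-join plus a GROUP BY rather than a forest of normalized joins.

```sql
-- Hypothetical star schema: one fact table, flat (denormalized) dimensions.
CREATE TABLE dbo.DimDate    (DateKey int PRIMARY KEY, CalendarDate date, MonthName varchar(10), YearNo int);
CREATE TABLE dbo.DimProduct (ProductKey int PRIMARY KEY, ProductName varchar(60), Category varchar(40), Brand varchar(40));
CREATE TABLE dbo.FactSales (
    DateKey     int NOT NULL REFERENCES dbo.DimDate(DateKey),
    ProductKey  int NOT NULL REFERENCES dbo.DimProduct(ProductKey),
    Quantity    int NOT NULL,
    SalesAmount decimal(12,2) NOT NULL
);

-- A typical warehouse question becomes one star-join:
SELECT d.YearNo, p.Category, SUM(f.SalesAmount) AS Revenue
FROM dbo.FactSales f
JOIN dbo.DimDate    d ON d.DateKey    = f.DateKey
JOIN dbo.DimProduct p ON p.ProductKey = f.ProductKey
GROUP BY d.YearNo, p.Category;
```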

niiice, so here comes the Canossa moment for this particular hypothesizer :slight_smile:

Apparently in my writings above I have demonstrated a fundamental misunderstanding of the arrow of normalization. We cannot automatically increase the normalization of a database, since that would mean the computer creating information out of nothing. Automated tools like OLAP decrease normalization. So there is no hope that those “easier to deal with” System 2 type setups could ever be better automatically verified for data consistency, data types or anything else. All such inferences require good normalization, and System 2 just ain’t that. Unless we effectively build up our very own RDBMS from scratch running on top of the old one, kind of like - I think I will call him - the “Transactions” guy.

Part of the reason for the misunderstanding is that apparently I used to associate “poorly normalized databases” with tables with lots of columns, whereas the key-value table superficially looks very different from that, even though in reality it may be isomorphic to some degenerate and especially egregious case of it. I don’t remember the details, but now I even remember the database theory prof talking about this sort of stuff. Except back then, without SDMB driving me to fight my ignorance, I found that stuff boring and pointless.

Nothing like good solid theory to clear up the ignorance of muddled mental models, huh.

Continued from above: and no, Toto, err, code_grey, SAP and similar business process automation systems do not run on denormalized database models either. LOL :slight_smile:

Which raises an interesting question: just how do they in fact run, and what is their place in database structure theory - both in their present form and in whatever hypothetical future form they may develop into in due time (if, in fact, the theory allows them any further room for development at all, unlike the denormalized NoSQL stuff)?

ERP systems generally sit at (on average) about 2.75 normal form. Pure 3rd normal form would be too inefficient in some cases, so shortcuts that seem reasonably safe are taken.

Regarding your OP: can you provide a more concrete example? It sounds like poor coding. With CASE statements, nested queries and temp tables, I don’t think I’ve ever had to duplicate a query of any decent level of complexity.

Dynamic “where”, “group by” and “order by” are all pretty easy to do elegantly in static code.
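For example, the usual optional-parameter trick - a sketch with made-up columns; whether the optimizer handles it gracefully is a separate question, and OPTION (RECOMPILE) is often the price of admission:

```sql
CREATE PROCEDURE dbo.SearchOrders
    @customerId int          = NULL,   -- each NULL parameter simply switches its filter off
    @country    nvarchar(40) = NULL,
    @sortBy     varchar(10)  = 'date'
AS
SELECT OrderID, CustomerID, OrderDate, ShipCountry
FROM dbo.Orders
WHERE (@customerId IS NULL OR CustomerID  = @customerId)
  AND (@country    IS NULL OR ShipCountry = @country)
ORDER BY CASE WHEN @sortBy = 'date'     THEN OrderDate  END DESC,
         CASE WHEN @sortBy = 'customer' THEN CustomerID END
OPTION (RECOMPILE);
```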

While that may not be a good example, there are many situations in which a higher level of “soft locking”, above the level of a DB transaction, is valuable. For example: long-running processes that rely on specific columns remaining unchanged while at the same time allowing some other columns to be updated.
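One common way to get that kind of column-level “soft lock” without holding a database transaction open - a hedged sketch with hypothetical names: remember the values (or a version counter) you read at the start, and make the eventual UPDATE conditional on them still being there, so concurrent updates to other columns are unaffected.

```sql
-- Hypothetical: a long-running process read Status and ApprovedBy at the start
-- and must bail out if either has changed since; other columns may change freely.
DECLARE @invoiceId        int         = 42,
        @statusAsRead     varchar(20) = 'APPROVED',   -- captured when the process started
        @approvedByAsRead int         = 7;

UPDATE dbo.Invoice
SET    Status      = 'POSTED',
       PostedAtUtc = SYSUTCDATETIME()
WHERE  InvoiceID   = @invoiceId
  AND  Status      = @statusAsRead
  AND  ApprovedBy  = @approvedByAsRead;

IF @@ROWCOUNT = 0
    RAISERROR('Invoice was modified by another process - aborting.', 16, 1);
```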