Oracle query help - subquery or join blowing up.

So I can’t remember my password at the Oracle sites, and my email is too jacked up to get a reset, so I was hoping someone here might be able to help.

To distill the essence of my problem simply, since the actual data model would take hours to explain:

Let’s say I have a base data type of WIDGET. I have a base table T_WIDGET with a primary key of WIDGET_ID and other data. I have hundreds of other tables foreign-keyed on WIDGET_ID.

So I created a view, V_CALC_KEY_WIDGET_DATA, that returns rows of key data for one type of OLTP transaction, with the foreign key WIDGET_ID.
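Just to give it a shape (this is purely illustrative — the table names below are made up and the real view joins far more tables):

```sql
-- Hypothetical sketch only; not the real view.
CREATE OR REPLACE VIEW V_CALC_KEY_WIDGET_DATA AS
SELECT w.widget_id,
       p.part_no,       -- T_WIDGET_PART and T_WIDGET_CALC are invented
       c.calc_value     -- stand-ins for the real child tables
FROM   t_widget      w
JOIN   t_widget_part p ON p.widget_id = w.widget_id
JOIN   t_widget_calc c ON c.widget_id = w.widget_id;
```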

If I write

Select * from V_CALC_KEY_WIDGET_DATA where widget_id = 500;

it returns quickly and has an explain plan cost of 2600.

But if I have anything other than a literal numeric value it blows the hell up.

Even a stupid simple tautological subquery is absurd.
for example
Select * from V_CALC_KEY_WIDGET_DATA where widget_id =
(select WIDGET_ID from T_WIDGET where WIDGET_ID = 500);

takes over 24 hours (I never let it finish) and has a plan cost of 45 billion.

Does anyone know how to instruct Oracle to run the damn subquery first and treat the return value the way it does a literal, using the same lookup plan?

A subquery isn’t really the main way I’d normally write it, but it seemed the best way to explain it here. But everything I try (joins, joins with WITH clauses, functions that return a number) has the same issue. It wants to build the entire V_CALC_KEY_WIDGET_DATA for millions of widgets, doing millions of table scans, before picking out the rows for the single widget I need.
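For reference, the kinds of things I’ve been trying all look something like this, and they all blow up the same way:

```sql
-- Plain join: same awful plan.
SELECT v.*
FROM   t_widget w
JOIN   V_CALC_KEY_WIDGET_DATA v ON v.widget_id = w.widget_id
WHERE  w.widget_id = 500;

-- WITH clause: same awful plan.
WITH k AS (SELECT widget_id FROM t_widget WHERE widget_id = 500)
SELECT v.*
FROM   V_CALC_KEY_WIDGET_DATA v
JOIN   k ON v.widget_id = k.widget_id;

-- Function returning a number (my_get_widget_id is a hypothetical
-- deterministic function): same awful plan.
SELECT * FROM V_CALC_KEY_WIDGET_DATA WHERE widget_id = my_get_widget_id();
```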

Thanks, before I go crazy.

I was once an Oracle database expert and I am an expert in relational databases in general. There isn’t anything wrong with your query assuming that the data isn’t absurdly large and it definitely should not take 24 hours to run.

I can tell you the answer with 99% confidence. Your view isn’t written correctly, which is easy to do with hundreds of tables. You are missing a join, or 100 of them, and that causes the database to try to produce the result using a Cartesian product, which is basically a massive loop that will eventually return a result, but it could take anywhere from a day to the heat death of the universe.

You have to go back through every single step of your view and rewrite it so that all of the joins are correct. I used to do it for other database developers and it generally takes hours even for the most experienced. Oracle will not do that part of the work for you. Once that is done, also look at creating intelligent indexes, because those can make a huge difference in performance.

Creating effective views and indexes in Oracle is not a click and go task. It requires true analytical work and that is the reason that the people that are good at it get paid the big bucks. You can do it too but you will need a schema with all of the tables to figure out what fields you need to join in your view.

I have personally brought queries that ran for several hours down to a few seconds many times just by getting the joins correct. Yours are not correct at the moment so you have to figure them out.

Yep. It’s not smart enough to connect the two tables as a first step.

You use join or connect/with …

Join and connect tell it to do an indexed or at least cached join…

Tried it, it doesn’t work.

How does the subquery run just by itself?

Alone the subquery is instant. Obviously that subquery is useless and nothing I would use. But just to illustrate that any subquery or join at all blows it up.

I presume that because you are using a view and a subquery there is no effective index for the optimizer to use to join the two sets of results and it is resorting to full table scans.

Without knowing what your output looks like, my first thought would be to produce a table rather than the view and index it, or do the same with the subquery and join it into the view. If that makes sense.
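A sketch of what I mean, assuming the view’s output keys on WIDGET_ID (table and index names are just examples):

```sql
-- Materialize the view once into a real table...
CREATE TABLE t_calc_key_widget_data AS
SELECT * FROM V_CALC_KEY_WIDGET_DATA;

-- ...index the key column...
CREATE INDEX ix_calc_key_widget ON t_calc_key_widget_data (widget_id);

-- ...then query the table instead of the view.
SELECT * FROM t_calc_key_widget_data WHERE widget_id = 500;
```

Obviously you’d have to refresh that table on whatever schedule the data changes, so it only works if some staleness is acceptable.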

Whenever things go strange like this, check table stats.

(check DBA_TABLES)

When was the last time stats were collected?
How many rows?
What was the sample size of the stats?
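Something like this answers all three at once (add the view’s base tables to the list):

```sql
-- When were stats last gathered, on how many rows, with what sample?
SELECT table_name, num_rows, sample_size, last_analyzed
FROM   dba_tables
WHERE  table_name IN ('T_WIDGET');  -- add your other base tables here
```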

ETA: I realized your WIDGET_ID is a FK column. In this case, as others have said, make sure there is an index.
But you also should know if the column data is skewed (uneven distribution) and also if there is a histogram present for the column (sometimes good, sometimes bad).
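A quick way to check both (the optimizer records histogram presence per column; check the big child tables with the FK too, not just T_WIDGET):

```sql
-- NUM_DISTINCT vs DENSITY hints at skew; HISTOGRAM shows NONE,
-- FREQUENCY, or HEIGHT BALANCED for the column.
SELECT column_name, num_distinct, density, histogram
FROM   user_tab_col_statistics
WHERE  table_name  = 'T_WIDGET'
AND    column_name = 'WIDGET_ID';
```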

Finally, what version of Oracle are you using?

It’s not a problem with the base view or indexes; those are fine. It’s an issue with the optimizer. With anything other than a numeric literal it refuses to treat it as a filter and is determined to do a full build of the million-row calculated view before doing a join. I’m trying to find a hint to tell it to resolve the subquery first and build an execution plan exactly the same way it does for a literal.

just simply
Select * from V_CALC_KEY_WIDGET_DATA where widget_id = 500;
takes about one second.

Select * from V_CALC_KEY_WIDGET_DATA where widget_id = (select 500 from dual);
takes days.
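The kinds of hints and workarounds I’m hunting for look something like these (untested sketches, no idea yet if either helps):

```sql
-- Ask the optimizer to push the join predicate into the view (PUSH_PRED),
-- or to merge the view into the outer query (MERGE):
SELECT /*+ PUSH_PRED(v) */ v.*
FROM   V_CALC_KEY_WIDGET_DATA v
WHERE  v.widget_id = (SELECT widget_id FROM t_widget WHERE widget_id = 500);

-- Or sidestep it entirely: resolve the value into a bind variable first
-- (SQL*Plus syntax), so the optimizer sees a plain equality just like
-- the literal case:
VARIABLE wid NUMBER
EXEC :wid := 500
SELECT * FROM V_CALC_KEY_WIDGET_DATA WHERE widget_id = :wid;
```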

What timing do you get if you use a SELECT DISTINCT clause or a SELECT TOP 1 clause with your database?

Since you mentioned the cost, you must have the explain plan … can you post that in its entirety? Also, the stats for the table in question (user_tables where table_name = ‘<>’)
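If it helps, something like this produces both in one go (substitute your real table name):

```sql
-- Generate the plan for the slow statement...
EXPLAIN PLAN FOR
SELECT * FROM V_CALC_KEY_WIDGET_DATA
WHERE  widget_id = (SELECT 500 FROM dual);

-- ...and display it in full, with predicates and projections.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(format => 'ALL'));

-- Plus the stats for the table in question.
SELECT num_rows, sample_size, last_analyzed
FROM   user_tables
WHERE  table_name = 'T_WIDGET';
```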

Are you sure? I have seen the problem you are describing a couple of times and I admit that it has left me scratching my head for a couple of days but the answer has always been an architectural problem with the underlying query with indexes playing a role as well.

You don’t give enough details for anyone to be very specific in their recommendations, but I believe it is time to examine everything from bottom to top in a step-by-step, logical manner until you identify the problem. I would stop depending on the optimizer to tell you what is wrong, because that is for fine tuning, not huge architectural problems like you apparently have.

I can fix it for you easily for a large fee or possibly even for free if you post a lot of detail and the specific SQL that makes up your view but otherwise, you are going to have to go back and take a look at everything from start to finish again with a firm understanding of what every table, subquery, join and index is doing.

I can tell you that you didn’t find a bug in Oracle. There are plenty of those but this type of thing is hardly ever one. It is some developer design mistake that could be as simple as a single missing join buried deep in some query or subquery but those can be catastrophic to performance.

Again, I spent years fixing other Oracle developers’ mistakes just like this one as a major part of my job. This is a classic one and I have seen and fixed different versions of it many times. Yours is extreme but, again, I have seen that too, and the fix usually only took a few lines of code to transform it from a disaster to an incredibly efficient beast of a query. It takes time and lots of experience to know where to put the bullseye, though.

The main problem is that many people expect Oracle to do most of the work for them, and it doesn’t work that way. It is really just a giant set theory problem that can be architected in many different ways, and the optimizer provided is so crude and unhelpful that I rarely use it at all. A competent human analyst is much better, especially if you have basic architectural issues, because the optimizer isn’t a mind reader and can’t really address those. You just get odd behavior and never-ending run times like you are seeing now.

What version of Oracle are you running?
The optimizer has substantial differences between 10g, 11g, and 12c. Even flavors of 11 have differences.

Have you run stats on the base tables of the views?
Whenever I see crazy stuff happening on a perfectly decent database, like you describe, it has been bad stats (especially if any of the tables are part of an ETL truncate-and-load process)

And, finally, as others have said:

What does your explain plan output look like?
You can’t begin to figure out what is happening until you look at an explain plan. People who are saying there is a missing join might be right, but you won’t know until you look at the explain plan and see “cartesian join” buried somewhere within.
Look at the cardinalities and cost estimates the optimizer has assigned at each level of the plan; if you see “100” and you know that an index lookup in that table should bring back 50,000 rows, then Oracle has missed the boat.

The optimizer is an amazing thing, but it is up to us to provide as much good information in the form of stats for the optimizer to choose the optimal plan.

Is there something you haven’t mentioned that is critical information, such as going over a database link or running parallel query mode?

Don’t think that just because a view is tuned, your usage of the view won’t cause the optimizer to choose a bad path. Oracle evaluates the whole stack, not using a plan baked into the view, but figuring out how to run the combined query+view.

And finally, those stats can bite you if you have funky distributions of data, such as an index on a column that has 100,000 distinct values, but 2 of them account for 50% of the rows. Imagine if that table had a million rows… without a good histogram on that column, Oracle would possibly choose a nested loops join with the million-row table as the driven table, assuming that each key will return 10 rows. But if someone’s query references one of the popular values matching 500,000 rows, Oracle ends up doing 500,000 index lookups followed by 500,000 scattered reads to get table data that would be far simpler and faster to get with a full table scan.
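Fixing that is a one-liner once you know the column is skewed. A sketch (the table name here is a hypothetical child table; on 10g/11g, 254 is the bucket maximum):

```sql
-- Gather stats with an explicit histogram on the skewed FK column,
-- so the optimizer knows which values are the popular ones.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'T_WIDGET_CHILD',                  -- illustrative name
    method_opt => 'FOR COLUMNS WIDGET_ID SIZE 254',  -- max buckets
    cascade    => TRUE);                             -- refresh index stats too
END;
/
```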

My 2 cents:
1 - When I used Oracle (9i), its optimizer was by far the worst of the 3 DBs I’ve used (the as400’s integrated db, Oracle, SQL Server). Even though I really like Oracle (due to its underlying architecture), I did not appreciate how frequently the optimizer got confused despite having good stats and indexes.

2 - All optimizers have limits - query complexity can exceed their ability to figure out a good method even with good stats and indexes. I guess in theory you could build one that iterated over enough possibilities that it always chose well, but the amount of time to do that for every query would create problems.

Your view could be completely correct, but with enough complexity in the joins criteria (meaning not simple one table to the next but complex web of connections) that the optimizer just can’t figure it out.

To solve these types of problems, I’ve typically broken down views and/or queries into multi-layered subqueries that operate on the data in a way that I know will give good results (the innermost subquery is a very selective, small set of precisely found data, then each higher level of subquery branches out to larger tables but with a specific index in mind, etc.).
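Applied to this thread’s example, a sketch of that layering might look like the following. The MATERIALIZE hint is one (undocumented but widely used) way to force Oracle to evaluate the small inner block first instead of merging everything into one giant plan:

```sql
-- Innermost block: pin down the tiny, selective set first.
WITH one_widget AS (
  SELECT /*+ MATERIALIZE */ widget_id
  FROM   t_widget
  WHERE  widget_id = 500
)
-- Outer level: branch out to the big view, driven by the small set.
SELECT v.*
FROM   one_widget o
JOIN   V_CALC_KEY_WIDGET_DATA v ON v.widget_id = o.widget_id;
```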

Well in case anybody is curious, the solution to the problem was… magic :confused:.

The whole damn thing was to support a business crosswalk lookup that was discovered late, before an upgrade from Oracle 10 to 11. There wasn’t time to fix the actual crosswalk process, so it was a bandaid to provide a manual report for a month or so. The optimizer was shitting on it for some reason in all the 10 environments, and in the 11 dev and QA environments.

But when the exact same 11 upgrade got deployed to Prod on Saturday, the optimizer there did its job great and had no problem figuring out what it needed to do to return results instantly. Then when that DB image was backloaded to QA and dev, they don’t see any problem with it anymore either.

I’m tempted to think it was the upgrade re-gathering stats or something related, but the QA DB was built with the exact same process eventually used on Prod, and it had the problem originally.

but…Ehh screw it, problem gone. :wink:

Stats. Whenever something magically goes crappy and then can be magically healed, it’s almost always stats.

11g has a better optimizer than 10g. The stats are better.
Besides, as part of the upgrade process some stale old stats from 2009 might have finally been gathered fresh, so Oracle now knows that there are 3M rows where it always thought there were 100K.

QA was completely rebuilt from Prod before the upgrade was run there. And nothing was changed that anybody knows of for the Prod upgrade. And I had the stats re-gathered on QA before the first post, as early diagnosis. I’m just at a loss for any explanation other than evil magical elves.

I know why your elves have left your database…

…they came to my database this week. And no thank you for that!

Enjoy your elf-free DB!