Database questions (do I really need one?)

I have some questions for the database guys. My current project uses Access as a back end. I’ve got about 20 tables in a beautifully normalized relational setup with referential integrity enforced and all that good stuff. But the more polished the app gets, the less I actually interact with the database. (As it should be.)

All calls to the database are in a single module, which keeps shrinking. Data demands are unusually low, which is one of the reasons Access is such a good choice. Of the 20 tables, I expect virtually all of them to have, at most, 50 records, with one or two maybe pushing a couple thousand.

Because of this incredibly low amount of data, I’ve been using UDT arrays like a madman. The data screens load the entire table into a UDT array when opened, let you save to the array, then when you close the form it commits any changes from the UDT array back to the table. All data integrity issues are validated before allowing the user to save to the UDT, well before the database gets written to.
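
To give a concrete (and entirely hypothetical) picture of the pattern, here's roughly what one of those table/array pairs looks like. Table, field, and type names are made up, and it assumes the DAO reference plus a module-level Database variable called db:

Private Type CustomerType
    CustomerID As Long
    CustName As String
End Type

Private Customers() As CustomerType

' Pull the whole table into the array when the form opens.
Private Sub LoadCustomers()
    Dim rs As DAO.Recordset, i As Long
    Set rs = db.OpenRecordset("SELECT * FROM Customers", dbOpenSnapshot)
    Do While Not rs.EOF
        ReDim Preserve Customers(i)      ' fine for tiny tables
        Customers(i).CustomerID = rs!CustomerID
        Customers(i).CustName = rs!CustName & ""
        i = i + 1
        rs.MoveNext
    Loop
    rs.Close
End Sub

' Commit the (already validated) array back to the table when the form closes.
Private Sub SaveCustomers()
    Dim rs As DAO.Recordset, i As Long
    Set rs = db.OpenRecordset("Customers", dbOpenDynaset)
    For i = LBound(Customers) To UBound(Customers)
        rs.FindFirst "CustomerID = " & Customers(i).CustomerID
        If rs.NoMatch Then rs.AddNew Else rs.Edit
        rs!CustomerID = Customers(i).CustomerID
        rs!CustName = Customers(i).CustName
        rs.Update
    Next i
    rs.Close
End Sub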

At this point I’m considering bailing on the backend altogether, and just using unix-style file operations to dump the UDT arrays directly to and from a file. I figure the easiest way would be to make a “system” UDT whose elements are all the UDT arrays, and dump that system UDT to a file in a single statement. It would certainly result in more compact code (in the data module) and saved files. I’m thinking the saved files would go from 800k to about 50k, tops.
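
Conceptually, something like this is what I'm after, reusing the hypothetical CustomerType from above, and assuming VB's file I/O can be talked into writing the whole nested type in one statement the way fwrite could (that's the big "if"):

' The "system" UDT: one dynamic array per table (only one shown here).
Private Type SystemType
    Customers() As CustomerType
    '...one member per table...
End Type

' One Put would dump the whole thing; one Get would read it back.
' Assumes Put/Get in Binary mode will cope with the nested dynamic arrays.
Private Sub SaveSystem(ByRef Sys As SystemType, ByVal Path As String)
    Dim f As Integer
    f = FreeFile
    Open Path For Binary Access Write As #f
    Put #f, , Sys
    Close #f
End Sub

Private Sub LoadSystem(ByRef Sys As SystemType, ByVal Path As String)
    Dim f As Integer
    f = FreeFile
    Open Path For Binary Access Read As #f
    Get #f, , Sys
    Close #f
End Sub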

Question #1: are there any hidden issues that might screw me up attempting this approach? I’d hate to get halfway through setting it up only to run into a brick wall and have to restore old code. I hate doing that.

It’s obviously more of a processing app than a data storage app. The processing used to be done in the data module, with recordsets and whatnot. (SLOOOOWWW.) I’ve since moved all the processing out of the data module and am using UDTs instead, where I just load (basically) all the data into memory and go nuts. (Peppy!) Long ago I transcribed a sorting algorithm, and just recently added the ultra-simple binary searching capability, so it’s a pretty functional data model as is. I just use two-dimensional arrays as indexes on my UDTs, and all I ever really need to do is sort and search. Other than the code looking a touch less readable, I see no real loss in functionality. It’s rather nice having all data available without needing to open any tables.
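
For the curious, the search side is nothing fancy. Something like this, assuming an index laid out as a two-dimensional Long array where row 0 holds the (sorted) key and row 1 holds the subscript into the UDT array:

' Returns the subscript of the matching record, or -1 if the key isn't there.
Private Function FindByKey(ByRef Idx() As Long, ByVal Key As Long) As Long
    Dim Lo As Long, Hi As Long, Mi As Long
    Lo = LBound(Idx, 2)
    Hi = UBound(Idx, 2)
    FindByKey = -1
    Do While Lo <= Hi
        Mi = (Lo + Hi) \ 2
        If Idx(0, Mi) = Key Then
            FindByKey = Idx(1, Mi)       ' found: return the record's subscript
            Exit Function
        ElseIf Idx(0, Mi) < Key Then
            Lo = Mi + 1
        Else
            Hi = Mi - 1
        End If
    Loop
End Function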

On the plus side, I could certainly lose a few references and components, reducing the install footprint and making me feel cool in the process. Anybody here ever write their own backend?

Question #2 applies specifically to Access as a backend. I’ve been on a real XP-compliant kick lately, trying to use all the proper folder locations and registry entries as per the Microsoft official guidelines. If I install the app as a shared application, then user rights may become an issue. Currently, an empty copy of the database resides in program files, which gets copied to the user’s application data area on first use. Since limited users can’t create files in the shared application data area, I’d need to copy the empty database there on install. Not a problem. The question is this: Can limited users read Access databases in All Users\Application Data\MyApp? (How does the *.ldb file get created if they don’t have rights to create a file there?)
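
(For reference, the first-use copy itself is trivial. A sketch, with hypothetical folder and file names, using the SHGetFolderPath API to find the per-user Application Data folder:)

Private Declare Function SHGetFolderPath Lib "shfolder.dll" Alias "SHGetFolderPathA" _
    (ByVal hwndOwner As Long, ByVal nFolder As Long, ByVal hToken As Long, _
     ByVal dwFlags As Long, ByVal pszPath As String) As Long

Private Const CSIDL_APPDATA As Long = &H1A     ' per-user Application Data

Private Function UserDbPath() As String
    Dim Buf As String * 260
    SHGetFolderPath 0, CSIDL_APPDATA, 0, 0, Buf
    UserDbPath = Left$(Buf, InStr(Buf, vbNullChar) - 1) & "\MyApp\MyApp.mdb"
End Function

Private Sub EnsureUserDatabase()
    If Dir$(UserDbPath) = "" Then
        On Error Resume Next                   ' folder may already exist
        MkDir Left$(UserDbPath, InStrRev(UserDbPath, "\") - 1)
        On Error GoTo 0
        FileCopy App.Path & "\Empty.mdb", UserDbPath
    End If
End Sub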

Powerloss. Threading. Downtime. Debugging.
If you’re bored and it doesn’t matter, certainly. But beware of threading and locking data, as well as figuring out ways to make sure your data doesn’t get lost due to things like sudden powerloss and so on. Or as the little boy said, “Mom, Dad, don’t touch that! It’s EEEEVVVVVVIIIILLLLLLLL!!!”
Generally my feeling would be that “I’ve got it working and tuned and it does everything I want perfectly without sacrificing speed nor data security!–so now I’m bored” isn’t a good reason to start a project.
So unless you are still getting horrible transaction times, I would say be happy when things work correctly. And if you are getting slow times, then I would say you’re better off to fiddle with your SQL, and make sure you remove all cases of “NOT IN.”
But indeed, it does appear that when all else fails, and you can’t optimise your SQL queries any further, the last resort is to just slurp it all out and process it yourself–but when possible you are better off not to, since a SQL query is a nice, easily read format for generically requesting a specific data subset, whereas your optimised mystery function of pwnz0r may not be so clear.

Haven’t the foggiest.

If all you’re doing is reading data from the DB at the start and pushing it back at the end, then no, there seems to be little point in using a DB to begin with. The real power of a DB is in being able to form complex relational queries, as well as niceties like atomic actions and multi-user support.

If your database queries are slow, then something is very wrong with your program. You ideally shouldn’t be seeing any performance problems with such tiny databases unless you’re doing serious data crunching. My instinct would be to figure out what your performance bug is and then move everything back into the database. After all, your code is already written.

Also, in case you ever decide to debug your data, Access may or may not provide a better GUI than a text editor for what you’re looking for.

Well, the program is strictly single-user, though there could conceivably be multiple XP users who could all use it on the same machine with limited shared data. But that’s on the “future versions” wish list.

It is some serious crunching. While I suppose I could open all the necessary recordsets before starting the main loop, I’m uncomfortable with that many recordsets, many of them pointing to the same tables. (In theory, I could set it up. I just don’t trust Access that much.) So it’s not the recordsets that are slow, it’s the opening and closing of them causing the bottleneck. (But how much faster is it, really, to refresh all the recordsets after one of them commits a change to the underlying tables?)

There was a “nice touch” calculation loop that took 10-20 seconds, depending on machine speed, which was unfortunate, as I wanted to run it inside a loop that iterated several thousand times. (On my P2 300 XP test machine it took almost two minutes!) But your instinct is correct. I rethought the concepts of how I was approaching the calculations, which removed the need to do the 10-20 second loop altogether, and now the whole thing takes a couple seconds. (Unfortunately, it still isn’t as nice as the ‘nice touch’ loop had it, but I think I can tweak it. I can’t reproduce the original method in a memory-only approach, as it generates much too much data to keep in memory.) Speed could be greatly improved by adding a couple dozen indexes, and opening everything in a table-type recordset, but that kind of defeats the purpose of recordsets to begin with, as it would mirror my UDT array approach exactly, but with the overhead of disk reads/writes. On top of that, the few indexes I do use already make me nervous in their reliability. Are Access indexes reliable? I know at one point they weren’t, but that may have been in the Windows 3.1 days.

Anyway, the current speed I’ve achieved with the UDT array approach is a good thing, as I want the user to be able to recalculate ad nauseam until they are happy with the results. Making a user wait more than a few seconds to look at a result they may frequently want to redo would result in a poor user experience indeed.

That’s pretty compelling, and I think I’ll stick with Access for that reason alone. One of the original reasons for Access is that it facilitates a document setup. As in, the “state” of the entire database - including all 20 tables - can be “saved” (copied) to a single file and given some unique extension, which is useful for emailing and whatnot.

I’m still not entirely comfortable using a non-native file extension for a known file type. But when they double-click one of these files, it really needs to open my program, NOT MS Access. (And why doesn’t NotifyIcon do anything from my install program?! grrr It annoys me to have to include a “run once” API call in my program.)

Perhaps I could split the difference and keep the Access database for the system data, and instead of copying the database, use a unix dump for creating the saved documents. That would still give me the benefit of being able to debug data while developing, and offer the smallest possible file size for the documents. And it would silence that voice in the back of my head whining about how the MS XP Logo requirements explicitly forbid using non-native file extensions. (And fuck them with their ‘design nice icons’ requirements. I’m a programmer, not a graphic designer.)

Love the reference, and thanks for the reply. You were both quite helpful.

One advantage to the UDT array approach, by the way, is that it is (in some respects) much more relational-friendly than flat recordsets. For example:


Private Type CityType
    CityID As Long
    '...etc...
End Type

Private Type CountyType
    CountyID As Long
    '...etc...
    Cities() As CityType
End Type

Private Type StateType
    StateID As Long
    '...etc...
    Counties() As CountyType
End Type

That is just way cool. While I’m not doing anything involving geography, this example illustrates the point pretty well. You can imagine the loop nesting and whatnot that is going on, and how the recordset approach would need several indexes and table-type recordsets to be reasonably fast at all.
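
The loop nesting I'm alluding to is literally just this (assuming a module-level States() array of StateType):

Dim s As Long, c As Long, t As Long

For s = LBound(States) To UBound(States)
    For c = LBound(States(s).Counties) To UBound(States(s).Counties)
        For t = LBound(States(s).Counties(c).Cities) To UBound(States(s).Counties(c).Cities)
            ' ...process States(s).Counties(c).Cities(t) here...
            ' every City is already "filtered" to its County and State
        Next t
    Next c
Next s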

They’ve added object orientation to Oracle (I believe it was), for this reason. So I think you can do this. Just that, looking at the reference, it seemed like should you actually want to figure out how to do it, you may as well prepare to tear your toenails out with tweezers, for all the ease of understanding their explanations.

Hm. That may be an issue of your design, if you are individually opening and closing things transparently to yourself in the code, within various functions and such. Not certain what you are doing, but in my experience, the only time I had any sort of slowness in the actual getting-data-from-there-to-here part was because I had done a SELECT * on a large table, and it was downloading large quantities of data. So you might want to see if there aren’t places where you are needlessly closing connections where you could have just opened once and reused from then on.

But I’ve only dealt with MySQL, PostgreSQL, and Oracle–so not quite sure about the status of recordsets, which don’t exist in those.

So my two bits of generic DB advice:

  1. In general, I figure that if Google can store the entirety of the internet and give you instant results, yet my DB app is slow, the problem is probably something I’m doing wrong, not the DB.

  2. All of the various magical keywords for table layout that all the DB optimization books explain–though they will get you a job as a DB guy–won’t speed your DB up anywhere near (if at all) the amount you will get by changing your SQL and stored procedure code. Replacing NOT IN with SELECT A MINUS SELECT B cut one thing from ten seconds to 0.1.

Given my generic example above, let’s say I wanted to process States, then Counties within each state, then Cities within each County. The original design kept opening recordsets at each level filtering out any records that didn’t meet the criteria, on the theory that the less data I pull, the better off in general I’ll be. Very slow.

So my other option is to open all 3 tables at the beginning, set the appropriate indexes as active, and seek the first match, then iterate through until the key field no longer matches. That’s fine, I’ve used that approach before, but that was way back in xBase world. And as I said before, I don’t really trust Access indexes to be 100% reliable.
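
In DAO terms, that approach would look something like this (the database path, index, and field names are made up):

Dim db As DAO.Database, rsCounties As DAO.Recordset
Dim lngStateID As Long

Set db = DBEngine.OpenDatabase(App.Path & "\data.mdb")
Set rsCounties = db.OpenRecordset("Counties", dbOpenTable)   ' table-type recordset
rsCounties.Index = "ixStateID"           ' must be an existing index on the table

lngStateID = 42
rsCounties.Seek "=", lngStateID          ' jump to the first county in that state
If Not rsCounties.NoMatch Then
    Do While Not rsCounties.EOF
        If rsCounties!StateID <> lngStateID Then Exit Do     ' key no longer matches
        ' ...process this county...
        rsCounties.MoveNext
    Loop
End If
rsCounties.Close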

With the UDT array approach, I get the speed of memory (my sorting and searching functions involve minimal comparisons, so they’re actually pretty effective), plus the added bonus of intrinsic filters, making looping much easier and more reliable.

Again, this is all predicated on the minimal data aspect of the project. With any real amount of data, of course I’m stuck using database calls.

Could you elaborate? In Access, a recordset is simply a dataset based on a SQL statement. As opposed to a table-type recordset, which opens (and points to) the entire table, including indexes. As such, if you need to seek, you can’t use a SQL statement. While the “ORDER BY” clause is quite useful, it is worthless if I need to sort by two different key fields at two different places in code, especially if I have to keep switching back and forth.

Here’s a generic question for you. Let’s say I’m in the interior of 3 nested loops. I have a single table holding “Transactions”, which has 50 fields and 10,000 records. Inside this nested loop, I need to do two things:

  1. Identify all transactions (or one specific transaction) made by a given customer, so that I can update something in the customer file if and only if the customer has made at least one transaction per day for the last 5 business days,

and

  2. Identify all transactions made by any customer, ordered chronologically, on the date of the most recent transaction of the customer referenced in 1), so that I can update something in the Branch file (or whatever).

Don’t worry about the details, this is just a (poorly constructed) hypothetical. But it has the requirement of needing to run thousands of times in a row, and involves small sections of a large table, but using two different filters and sort orders.

What’s the best way to accomplish this in Oracle? Opening and closing small recordsets (SQL statements) with the WHERE and ORDER BY clauses doing the work? Or can you just open the entire table and change the active indexes? Or is there some other method I’m unaware of?

Despite their massive server farm, this is a good and proper attitude to have. I trust you won’t mind me adopting that slogan?

Sounds like you’re trying to re-invent the wheel. If you really think you’re smarter than Microsoft, then fine, go ahead and do that. But the entire point of databases is that you don’t need to bother with sorting and searching through your data. That all happens in the SQL backend. If you absolutely must have that last ounce of speed and you know something that the database does not (which is pretty difficult if you’ve set it up right), then manually pawing through the data might be faster. Otherwise, look at how you’ve structured your SQL and try to give as many hints to the DB as possible.


And if you don’t trust your backend software, then either get some new software or learn to trust it.

Hrmm… I’m not sure of database specifics, but if you need to do something computationally expensive inside a triple nested loop, then you’re going to be slow no matter how you cut it. In this case, possibly the overhead of calling the DB might overwhelm the actual data processing. I don’t know about Access. Try doing a 1000-iteration loop of empty requests and see how fast it can process them. If it turns out to be unacceptably slow, then you probably have no choice but to move it into memory.
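
I don't know the Access object model well, but presumably something along these lines would give you a rough number (assumes an open DAO database and some trivially small table; names are made up):

Private Declare Function GetTickCount Lib "kernel32" () As Long

Private Sub TimeRoundTrips(ByVal db As DAO.Database)
    Dim rs As DAO.Recordset, i As Long, t As Long
    t = GetTickCount
    For i = 1 To 1000
        ' deliberately trivial query; we only care about the per-call overhead
        Set rs = db.OpenRecordset("SELECT CustomerID FROM Customers WHERE CustomerID = 0")
        rs.Close
    Next i
    Debug.Print (GetTickCount - t) & " ms for 1000 open/close round trips"
End Sub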

Then again, if you know the customers before you enter the loop, can’t you structure the SQL in some way to make it one big request that will give you everything?

Oh no, not me, I don’t have the brainpower for that. I simply transcribed the QuickSort technique from a book, and wrote my own binary search, as that’s about as simple as it gets. Both of these methods are the actual wheel, not reinventions. So to speak. How many different ways to sort and search are there, really? Not many good ones.

The real killer (for the SQL approach) computationally is the “last 5 business days” requirement. That’s not the actual requirement, but it is an on-point example. Occasionally you just run into situations where SQL doesn’t have an answer for you. I’m just thankful that the data demands in this instance are so low.

That was my question to you. Can that be done? Given that two different sort orders are required at different times in the same loop?

I suppose the best solution would be to make it different loops, with each one only working with one sort order/filter combo. Unfortunately, due to the specifics involved, that would significantly increase the size of the code and the complexity of the task. (Results of calculations involving the two sort orders are dependent on each other.)

God help me if the data demands ever increase. (Uh, buy more memory? hehheh)

This reminds me of the time I wrote a VB class to read (writing wasn’t needed or desired) Clipper DBFs that had thousands of fields, as no standard MS-approved approach could handle more than 255 fields in a single table. That was an eye-opening experience into how truly simple databases can be, and still have pretty good performance. (Certainly faster than Access.) Love reading those tables with unix-style file operations!
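
From memory, the dBASE III-style layout Clipper used is simple enough that something like this gets you the table structure with nothing but binary Gets. The offsets are as I recall them, so treat it as a sketch rather than gospel:

Private Sub DumpDbfStructure(ByVal Path As String)
    Dim f As Integer, i As Long, nFields As Long
    Dim lRecords As Long, iHeader As Integer, lHeader As Long
    Dim sName As String * 11, sType As String * 1

    f = FreeFile
    Open Path For Binary Access Read As #f
    Get #f, 5, lRecords                  ' offset 4: number of records
    Get #f, 9, iHeader                   ' offset 8: header size in bytes
    lHeader = iHeader And &HFFFF&        ' treat as unsigned; Clipper headers can exceed 32k
    nFields = (lHeader - 33) \ 32        ' 32-byte main header plus 1 terminator byte

    Debug.Print lRecords & " records, " & nFields & " fields"
    For i = 0 To nFields - 1
        Get #f, 33 + i * 32, sName       ' 11-byte field name, null padded
        Get #f, , sType                  ' 1-byte field type (C, N, D, L, M)
        Debug.Print Left$(sName, InStr(sName & vbNullChar, vbNullChar) - 1), sType
    Next i
    Close #f
End Sub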

As an aside, I really hated those Clipper tables. Anytime you have 150 records with 2000+ fields, that’s just annoying as shit. (The original designer had apparently never heard of one-to-one relationships, and his approach was entrenched in thousands of resulting applications spanning over a decade. Damn DOS developers.)

Databases in General
Remember the reason for the existence of relational databases is not speed but rather data integrity, especially in multi-user transaction oriented business systems.

The more you normalize, the slower querying data becomes. It’s a trade-off, but one that is worth it, because you only want to store a piece of information one time; otherwise you risk inconsistent results.
Indexes
An index is what allows you to get good performance in a database (generally). So yes you should build them. I haven’t used them in Access because my Access databases are typically small (<100,000 rows) so I can’t say for sure that Access makes good use of them. If it was Oracle, SQL Server or DB2, then of course, build them.

Misc Database/SQL Performance Rules
Perform actions in one SQL statement rather than looping through rows if possible.
Build indexes.
Align SELECTs and ORDER BY clauses with indexes so they will be used, or build more indexes.

Hypothetical Scenario
If you really need loops, as opposed to performing updates in single SQL statements, then open multiple cursors for the multiple order-by. This is not uncommon.
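
In VB/DAO terms that could be as simple as opening two recordsets up front, each with its own ORDER BY, and repositioning within them inside the loop instead of reopening anything. Table and field names here are hypothetical, and an open database variable db is assumed:

Dim rsByCustomer As DAO.Recordset, rsByDate As DAO.Recordset

Set rsByCustomer = db.OpenRecordset( _
    "SELECT * FROM Transactions ORDER BY CustomerID, TransDate", dbOpenSnapshot)
Set rsByDate = db.OpenRecordset( _
    "SELECT * FROM Transactions ORDER BY TransDate, CustomerID", dbOpenSnapshot)

' ...inside the nested loops, reposition with FindFirst/FindNext on whichever
'    recordset has the ordering you need, rather than opening and closing new ones...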

If You Really Want Guidance
Give us details.

What about the “in the last 5 days” causes you grief with SQL?

It’s sort of possible. Consider a table called “transactions” with a column “access_date”. In Oracle we can create a view thusly:


  1  create or replace view last5bizdaytx as
  2  select dt, count(*) cnt from
  3    (
  4    select trunc(dt) dt, dy from
  5      (
  6      select sysdate-rownum dt, to_char(sysdate - rownum, 'DY') dy
  7      from all_objects
  8      )
  9    where dy not in ('SAT','SUN') and rownum < 6
 10    ), transactions t
 11  where dt = access_date
 12* group by dt
SQL> /

View created.

SQL> select * from last5bizdaytx;

DT               CNT
--------- ----------
27-APR-05        626
28-APR-05        337
29-APR-05      12733
02-MAY-05       9801
03-MAY-05       8771


So here you have something that can dynamically take the system time and produce a resultset based on that. Granted, this is a naive definition of business days that doesn’t take into account holidays and such, but that’s going to be a problem for any solution, DBMS or otherwise. (Also, it’s not ANSI.)

At any rate, this isn’t exactly what you asked but you said it was hypothetical anyway. I’m just trying to show that there are sometimes non-obvious SQL solutions.

In general, I would agree with what RaftPeople said, especially wrt multi-user. It’s up to you to consider how likely “future wishlist” items can become requirements. If they do, using flat files as storage can get very messy if multiple users ever need to access the same ranges of data.

Incidentally, how do you propose storing your data as files, even in the single-user context? One big transaction file? A file for each customer?

It’s more akin to the latter. But thanks, that solution is pretty cool.

Flat files? No, the arrays are technically normalized, and would save as such. Unfortunately, I don’t seem to be able to find the binary dump function I remember from C in VB. So that shoots that idea all to hell.

You misunderstand the multi-user thing. It will never be a multi-user database, as that would be illogical for the application. The future wishlist comment referenced XP users, and the issues raised when installing for “All Users” as opposed to the “Current User Only”.

The entire access database simply gets copied and given a funky extension. These files are around 600k, with maybe 50k being actual data. (It amazes me how much space table definitions, indexes, and relationships take. The empty one, without a single record in any table, compacts to over 500k.) Each “document” (access database) the user saves/loads/whatever contains all the tables in the system populated with all the data.

Even though that may sound like a mess, it’s actually a pretty elegant solution. I’m just annoyed that I can’t dump a UDT array to a file in a single statement. That seems like such a useful thing to be able to do. I used that approach in a C project oh-so-many years ago, and it was beautiful. IIRC, my UDT array contained around 10,000 coordinate pairs. Dumping the whole thing to a floppy disk took less than a second.
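
(The save/open plumbing itself is nothing more than a file copy under a different extension; something like this, with hypothetical paths, and with the working database closed before the copy:)

' WorkingDbPath would point at the copy in the user's Application Data folder.
Private Sub SaveDocument(ByVal DocPath As String)
    FileCopy WorkingDbPath, DocPath          ' e.g. "C:\Documents\Plan.myx"
End Sub

Private Sub OpenDocument(ByVal DocPath As String)
    FileCopy DocPath, WorkingDbPath
    ' ...then reopen the DAO database against WorkingDbPath...
End Sub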

I have no idea what you are talking about. You seem to be describing some sort of manual access to the database, instead of SQL queries. With a SQL query there is no opening tables or iterating through them individually, nor requirement to access those tables separately.



SELECT names FROM Cities WHERE
myCounty IN (
   SELECT countyID FROM Counties WHERE
   myState IN (
      SELECT stateID FROM States WHERE
      abbreviation = "WA"
   )
)


With the above query, all searching, matching indexes, and filtering would be handled for you instantaneously and only the exactly correct values returned. Even on a personal PC with a million entries, this would be a near-instantaneous task for any database I have worked with.
If Access doesn’t allow such stuff, I would recommend switching to a proper one. But I would suspect that it does.

Well I was working with PHP, and there wasn’t any sort of magical object representing my data.

I’ve never "seek"ed for data, and my belief had been that relational databases killed such. Apparently Access left this open as an option, but if you are using it, you are probably best to kill that section of code–or at least to move it into a stored procedure so that data isn’t getting transmitted out of the database until it is done processing. Personally I would just kill it, seeing as you were already contemplating throwing the entire database away anyhow.

I’m not sure about your “order by” difficulty, though. You may be saying that you reuse the same data, ordered differently in the same page. Or you might mean that you are ordering by the results of one table when you do your seeking and searching on your second and third tables you process. If it is this second–well if you throw away the manual seeking and searching, this becomes moot.



SELECT alpha FROM omega WHERE
mergatroID IN (
   SELECT mergatroID FROM beta WHERE
   zed = "dead"
   ORDER BY id DESC
)
ORDER BY id ASC


Well I wouldn’t have branch files or any files for storing data outside of the database if possible. But assuming we were doing so anyhow.



SELECT customers.file_path, customers.name
FROM transactions, customers
WHERE
transactions.date > (CURRENT_DATE - INTERVAL '5' DAY) AND
customers.id = transactions.customer_id
GROUP BY customers.file_path, customers.name
HAVING COUNT(transactions.id) > 5


Something like that. Don’t have a database to check against. With more fiddling you could probably get it to require at least one transaction per day for five days, instead of more than five over five days–but as said, I don’t have a database I can use, nor all of my reference material.

Guaranteed that you can get it down to one query, and close enough to instantaneous that it’s not worth bothering about though.

It can be done. Know this now, and hit thy reference book.

Basic SQL is a bit light, but once you look at the Access-specific stuff, you will find that there are a good number of magical functions and constructs you can tie together to very specifically get exactly the data you want, regardless of how tight a constraint it may be. And I would note that a good deal of those magical items exist in other databases as well; they just go by different names.

Essentially queries work where you say give me data matching this filter, then filter it based on this, this, this, and this–until you are down to the correct data. As opposed to taking all of the data and looping it through all of the filters at once as you would in standard code.
One way to speed up your queries is to rearrange your filters such that the easier-processed ones (like an integer compare instead of a string compare) occur earlier, or by performing the filters which will remove the most slough earliest in the command.