Agreed. I wouldn’t count code that has been copied by hand. But if they’ve managed to keep their complete source control history, it’s possible to determine exactly when a given line of code was first introduced, even if it moved around later.
The oldest code I’ve written that’s still in our codebase is about 20 years old at this point (just a couple years after I started). The code is simple, but absolutely unique to us–neither C&Ped from somewhere else nor something so obvious that it’s been rewritten from scratch numerous times. I’m sure Windows has some code from nearly the very beginning.
Quicksort and other performance-critical logic will almost certainly have been rewritten several times over the years by people trying to squeeze out even more speed. Something boring and arcane, like printf(), is a different story: nobody wants to rewrite it. You’d much rather graft new format specifiers onto the existing logic than refactor and enter the hell of having to validate backwards compatibility with the previous version of printf().
I see plenty of (different!) documented Windows quicksorts; one of them, in the C Runtime Library, is labelled “06-22-84 RN” in its author line, with a bunch of subsequent revisions and optimizations.
Just an aside here: I’m always astounded to see sorts in use. I keep asking the same question: why didn’t you store the data in the right order to begin with? My work heavily involved B-trees and indexing, which is a completely different philosophy of data storage.
Non-programmers might be surprised at the suggestion that a sorting routine is the oldest in-use code in MS-Windows.
But if you go all the way back to the EDVAC, John von Neumann’s first handwritten program for it was a sorting routine. One might expect something to calculate sines or logarithms, or maybe solve a differential equation. Nope. Sorting.
Because there might be many different “right orders”. To take a simple example, suppose you’re displaying a list of the files in a directory (“folder”, for you young’uns, and get off my lawn). The user might want to see the files in order of name, or of date created, or of date modified, or of size, or probably a number of other possibilities. If you store them in all orders, then you’d need n different copies of the data, and if files are created quickly, it might be resource-intensive to insert each file in its proper place as it’s created (a slow-motion insertion sort).
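To make that concrete, here’s a minimal C sketch of the “many right orders” situation: one array of directory entries, and a different comparator handed to qsort() depending on which column the user clicked. The struct and field names are made up purely for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical record for one directory entry. */
struct file_entry {
    char   name[32];
    time_t modified;
    long   size;
};

/* One comparator per "right order"; date-created, file type, etc.
   would each get their own. */
static int by_name(const void *a, const void *b) {
    return strcmp(((const struct file_entry *)a)->name,
                  ((const struct file_entry *)b)->name);
}

static int by_size(const void *a, const void *b) {
    long x = ((const struct file_entry *)a)->size;
    long y = ((const struct file_entry *)b)->size;
    return (x > y) - (x < y);
}

int main(void) {
    struct file_entry files[] = {
        { "report.doc", 1700000000, 52341 },
        { "alpha.txt",  1710000000,   812 },
        { "music.mp3",  1690000000, 99999 },
    };
    size_t n = sizeof files / sizeof files[0];

    /* The data is stored once; the user's click picks the order. */
    qsort(files, n, sizeof files[0], by_name);   /* "sort by name" clicked */
    for (size_t i = 0; i < n; i++)
        printf("%-12s %8ld\n", files[i].name, files[i].size);

    qsort(files, n, sizeof files[0], by_size);   /* "sort by size" clicked */
    for (size_t i = 0; i < n; i++)
        printf("%-12s %8ld\n", files[i].name, files[i].size);
    return 0;
}
```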
To be slightly more precise, Dave Cutler was the architect of the Windows NT kernel, which became the basis of all Windows releases since Windows XP. Where this is relevant to the OP’s question is that Bill Gates demanded that, right from the start, the NT OS have the same GUI as the then-current Windows 3.1. That in many ways crippled the capabilities of NT and all subsequent versions of Windows, which to this day still suck at multi-tasking except to the extent that they’re assisted by the hardware itself with multiple cores and multi-threading. VMS (and before that, the PDP-10, and even RSTS/E and RSX-11M and D and IAS) managed just fine with a single CPU. So I’d bet that there is still a lot of ancient GUI code in Windows 11 that has its origins in the very earliest versions of Windows.
B-trees have a lot of overhead if you really just need to sort once at the end (or beginning). I use them all the time–they’re great if you have continuously incoming data–but for anything more batch-processing oriented, the sort will be faster (and use less memory).
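To put the batch-oriented side of that trade-off in code: append everything as it arrives and pay for ordering exactly once at the end, rather than paying a search-and-rebalance cost on every insert the way a tree would. (A toy sketch, not anyone’s production code.)

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* Batch style: just append each value as it arrives, in whatever order... */
    int batch[] = { 42, 7, 19, 3, 88, 55, 21, 64 };
    size_t n = sizeof batch / sizeof batch[0];

    /* ...then pay for ordering exactly once, at the end. A tree index would
       instead pay a search/rebalance cost on every single insert. */
    qsort(batch, n, sizeof batch[0], cmp_int);

    for (size_t i = 0; i < n; i++)
        printf("%d ", batch[i]);
    printf("\n");
    return 0;
}
```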
Indexes are for maintaining multiple right orders. And once you need multiple right orders, sorting on demand becomes even less efficient by comparison.
It’s the same issue with batches. For instance, you can do a sort-merge operation on unsorted batches, or you can just merge batches that are already sorted. Either one can be very efficient because you avoid sorting the big set of data all at once. In the modern world there will be multiple indexes on any large amount of valuable data, and maintaining them is where most of the processing time goes. And data is rarely stored contiguously in any true sense of the word; it will be spread all over the place and accessed through indexes anyway.
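For the “merge from sorted batches” case, something along these lines is all it takes; the point is that a linear merge never re-sorts either input. (Again, just a toy sketch.)

```c
#include <stdio.h>

/* Merge two already-sorted batches into one sorted output.
   Cost is linear in the total size; neither input gets re-sorted. */
static void merge_sorted(const int *a, size_t na,
                         const int *b, size_t nb,
                         int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void) {
    int batch1[] = { 2, 5, 9, 14 };      /* arrived already in order */
    int batch2[] = { 1, 6, 7, 20, 33 };  /* arrived already in order */
    int merged[9];

    merge_sorted(batch1, 4, batch2, 5, merged);
    for (size_t i = 0; i < 9; i++)
        printf("%d ", merged[i]);
    printf("\n");
    return 0;
}
```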
Sure, if you can control your data sources. But that data came from somewhere, and that somewhere probably wasn’t naturally sorted/indexed the way you want.
Spreadsheets are popular because there’s all kinds of data that came from god-knows-where, and massaging it into something useful (via sorting and otherwise) is a key feature.
Of course it doesn’t matter much for small data sources. The processing time is insignificant compared to the time spent on access and maintenance of large data sources. When I retired we were developing the means for unthinkably large databases spread across tens, hundreds, or even thousands of massive machines in non-stop operation. The future is going to be fun: data engines will be competing like Uber drivers to be the source fulfilling every request.
In my line of work, we have single processors with tens of thousands of processing units. However, the work is what’s called embarrassingly parallel. Small sorts are done all the time, possibly a billion times a second, but they’re used for things like computing histograms. B-trees wouldn’t be useful because, although the scale is there, there’s never enough data at once to warrant more advanced structures. Even quicksort is out because these simple processors don’t handle recursion well, but there are other sorting routines that do work.
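One example of the kind of routine that fits those constraints is odd-even transposition sort: no recursion, a fixed compare-and-swap pattern, and perfectly adequate for tiny inputs. A plain C rendering, purely for illustration (not the actual routine in question):

```c
#include <stdio.h>

/* Odd-even transposition sort: no recursion, and no control flow that
   depends on the data beyond the compare-and-swap itself, so it maps
   well onto very simple processing elements. O(n^2), fine for small n. */
static void odd_even_sort(int *a, int n) {
    for (int pass = 0; pass < n; pass++) {
        for (int i = pass % 2; i + 1 < n; i += 2) {   /* alternate odd/even pairs */
            if (a[i] > a[i + 1]) {
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
            }
        }
    }
}

int main(void) {
    int bins[] = { 9, 1, 7, 3, 8, 2 };   /* e.g. one tiny histogram slice */
    odd_even_sort(bins, 6);
    for (int i = 0; i < 6; i++)
        printf("%d ", bins[i]);
    printf("\n");
    return 0;
}
```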
That’s the problem with massive distributed databases: you can’t do the job with embarrassingly parallel processing, but there will be some sideways glances at the multiplication in size of the data.
There are many interesting things going on with Big Data. However, not everything is Big Data. In my line of work, a millisecond is an eternity, microseconds are low-hanging fruit, and even nanoseconds are worth counting. That excludes most forms of distributed processing.
OK, so you’ve got a database, and now someone decides to sort that data based on something that it never even occurred to anyone to sort it by. Sure, you can create a new index for that new whatever-it-is… but to assign that new index, you must do a sort, using quicksort or something equivalent.
Saying “indices mean you don’t need to do sorts” amounts to “problems never need to be solved, because you can always assume someone else already solved it”.
Now, maybe you can argue that any particular sort should never be performed more than once, because every time you do a new sort, you should record the indices from it to re-use if needed. But you do still need that first sort.
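In code, building that after-the-fact index really does boil down to one sort over (key, row position) pairs, followed by reading the table through the index. A hypothetical sketch with made-up table and field names:

```c
#include <stdio.h>
#include <stdlib.h>

/* A row in some existing table, and an entry in a brand-new index over it. */
struct row         { char city[16]; int population; };
struct index_entry { int key; size_t row; };

static int by_key(const void *a, const void *b) {
    int x = ((const struct index_entry *)a)->key;
    int y = ((const struct index_entry *)b)->key;
    return (x > y) - (x < y);
}

int main(void) {
    struct row table[] = {
        { "Springfield",  116000 },
        { "Shelbyville",   45000 },
        { "Capital City", 812000 },
    };
    size_t n = sizeof table / sizeof table[0];

    /* Nobody planned to query by population, so build that index now:
       copy out (key, row position) pairs... */
    struct index_entry idx[3];
    for (size_t i = 0; i < n; i++) {
        idx[i].key = table[i].population;
        idx[i].row = i;
    }
    /* ...and do the unavoidable first sort. */
    qsort(idx, n, sizeof idx[0], by_key);

    /* From here on, the index gives population order without touching the rows. */
    for (size_t i = 0; i < n; i++)
        printf("%-13s %d\n", table[idx[i].row].city, table[idx[i].row].population);
    return 0;
}
```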
And just to show even more diversity here… I ended up rewriting printf() about five years ago. OK, really snprintf(), specifically to be used in certain panic scenarios. It’s actually shipping in a product right now. I didn’t have access to a basic interrupt-safe version that could also handle the truly insane memory/cache/alignment needs of my system. The version available from my vendor was adequate for user code, but they wouldn’t provide the source so I could fix or leverage it.
My version is limited and not optimized, but it gets the job done.
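For a sense of the shape such a thing takes, here is a hypothetical sketch (emphatically not the shipping version): no heap, no locale, no recursion, no floating point, only a handful of conversions, and it always truncates safely and NUL-terminates.

```c
#include <stdarg.h>
#include <stddef.h>

/* Sketch of a panic-context formatter: supports only %s, %d and %x,
   never allocates, never recurses, and always NUL-terminates. */
static size_t panic_snprintf(char *buf, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t n = 0;

    if (cap == 0)
        return 0;

    va_start(ap, fmt);
    for (; *fmt && n + 1 < cap; fmt++) {
        if (*fmt != '%') { buf[n++] = *fmt; continue; }
        fmt++;
        if (*fmt == '\0')
            break;                              /* lone '%' at end of format */
        if (*fmt == 's') {
            const char *s = va_arg(ap, const char *);
            while (*s && n + 1 < cap) buf[n++] = *s++;
        } else if (*fmt == 'd' || *fmt == 'x') {
            char tmp[32];
            int i = 0;
            int val = va_arg(ap, int);
            unsigned v = (unsigned)val;
            unsigned base = (*fmt == 'x') ? 16u : 10u;
            if (*fmt == 'd' && val < 0) {
                buf[n++] = '-';
                v = -(unsigned)val;             /* magnitude, safe even for INT_MIN */
            }
            do { tmp[i++] = "0123456789abcdef"[v % base]; v /= base; } while (v);
            while (i > 0 && n + 1 < cap) buf[n++] = tmp[--i];
        } else {
            buf[n++] = *fmt;                    /* unknown specifier: echo it */
        }
    }
    va_end(ap);
    buf[n] = '\0';
    return n;
}
```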
You still don’t need to sort if you create the index in the right order to start with. If you need a new index, you build it in the correct order as you go. You might do something with every object in the dataset or select some of them based on a query of some sort, but either way your index will be a B-tree and the data is kept in the right order as you build the index. It’s just a matter of shifting the processing time around: either you maintain the B-tree as you build the data, or you collect all the data out of order and then sort it.
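As a toy illustration of “never out of order in the first place”: keep the index ordered on every single insert, so there is never a separate sort step. A real index would be a B-tree; the sorted array below is just a stand-in to show the idea.

```c
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 64

/* Stand-in for a B-tree index: a small sorted array kept in order
   on every insert. */
static int keys[MAX_KEYS];
static int nkeys = 0;

static void index_insert(int key) {
    /* binary search for the insertion point */
    int lo = 0, hi = nkeys;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (keys[mid] < key) lo = mid + 1; else hi = mid;
    }
    /* shift the tail and drop the key into its ordered position */
    memmove(&keys[lo + 1], &keys[lo], (size_t)(nkeys - lo) * sizeof keys[0]);
    keys[lo] = key;
    nkeys++;
}

int main(void) {
    int arriving[] = { 42, 7, 19, 3, 88 };   /* data shows up out of order */
    for (int i = 0; i < 5; i++)
        index_insert(arriving[i]);

    /* The index was never "sorted"; it was simply never out of order. */
    for (int i = 0; i < nkeys; i++)
        printf("%d ", keys[i]);
    printf("\n");
    return 0;
}
```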
If everything works fine with on-the-fly sorting and without the added overhead of indexes, then hey, everything’s working fine. If it’s not working fine, then people go in and use trees, heaps, etc.