How computer programs defrag your hard drive

I’m watching the display as my defragging program works on my drive. Simple person that I am, it seems to me that what it should do is sort data by time last modified and start writing the stuff that gets modified the least at the beginning of the disk, clearing space as it goes and making all files contiguous. However, it’s obviously not doing that. It’s moving stuff all over the place, and while it gradually seems to be grouping files into contiguous sections, it’s slow and I can’t figure out why it’s doing what it’s doing. When defragging a hard drive, what sort of algorithm does the program use? How does it choose what to move where? Why is my idea of how to defrag a hard drive not a good solution?

Thanks for all replies,
AllFree

One of the primary goals of defragmenting is limiting disk head movement. Thus, frequency of access is far more important than when the file was last updated. You want the most frequently-used files clumped together so the head travel is minimized.

Why should the modification time be a factor in where data is placed on the drive?

Also, given that most filesystems store only a creation and a last-modified date, there’s no reliable way to discover which files are modified most frequently. You could have a file created ten years ago and modified once yesterday, and a file created last week and modified a million times until it was last touched two days ago.

In the end, what matters is where the information is physically located on the drive. You want the most frequently accessed files on the outer tracks of the disk, because that is where read speeds are fastest. The next goal is to group data so that it sits near other data accessed with it, to minimize the drive head’s travel. Diskeeper (on XP) defragments along these lines. I don’t know whether Vista/Win7 and their constant background defragging follow the same strategy, but I’d assume so.

It varies by software.

From O&O Software, which I like.

You need to see what each defrag vendor says about its product. Sorting files alphabetically doesn’t turn out to be the best way to arrange your data for faster access.

Thanks, this is exactly the kind of information I wanted: why my ‘common sense’ solution didn’t make sense. I’d love it if anyone has more to contribute about the factors a defragging program considers when it moves data.

One thing that slows down some de-frags is the constant updating of the directory information and internal indices. A crash in the middle of a de-frag can be catastrophic otherwise.

There is something to your initial thought about modification frequency. Files which do not change size could be moved to the outer tracks so that the next de-frag doesn’t have to move them again. But if that isn’t combined with frequency of access, a rarely accessed file near the periphery is going to slow down access to all the other files.

The biggest factor is putting files in contiguous blocks on the disk. Files that are frequently accessed and read sequentially from start to finish, such as executables, are prime targets for this. In Windows, files with .exe or .dll extensions could be treated differently, but I don’t know if any de-frags consider that. Even files accessed randomly will see many accesses to different places in the file within a short time, and the operating system maintains a table of pending accesses so that they are processed in an order that minimizes head movement on the disk. So contiguous locations on the disk remain the greatest factor.
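
To give a feel for that reordering, here is a minimal sketch of the classic elevator (SCAN) idea in Python; the pending block numbers and head position are made up for illustration, and real OS and drive schedulers are considerably more sophisticated:

    # Minimal sketch of elevator (SCAN) ordering of pending disk requests.
    # The block numbers and head position are invented; real schedulers also
    # weigh rotational position, priorities and fairness.
    def elevator_order(pending_blocks, head_position):
        """Serve requests in one sweep outward, then sweep back."""
        outward = sorted(b for b in pending_blocks if b >= head_position)
        inward = sorted((b for b in pending_blocks if b < head_position), reverse=True)
        return outward + inward

    pending = [95, 12, 60, 3, 77, 30]          # hypothetical pending requests
    print(elevator_order(pending, head_position=50))
    # -> [60, 77, 95, 30, 12, 3]: one pass out, one pass back, minimal head travel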

One thing I wondered about was this: what do you do with files that are frequently accessed (prompting you to put them on the part of the disk with the quickest access) but also frequently modified (which means they will fragment quickly and no longer be contiguous)?

They won’t fragment unless they increase in size. Database systems will try to allocate a large amount of space when a file is first created, and if the operating system has the facility, contiguous space is requested. If they later grow beyond that, they will fragment, but the largest part of the file should have been contiguous at its creation, or since the last de-frag. When contiguous space isn’t an option at file-create time, the database can be created right after a de-frag so that the file ends up contiguous.
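
As an aside, on Linux an application can ask for that kind of up-front allocation itself. A minimal sketch, with a made-up filename and size (posix_fallocate only guarantees the space is reserved, not that it is contiguous, although extent-based filesystems will generally try):

    # Sketch: reserve a file's space up front so the filesystem has a chance to
    # allocate it in one piece. Filename and size are illustrative only.
    import os

    SIZE = 256 * 1024 * 1024                     # 256 MB, chosen arbitrarily
    fd = os.open("example.db", os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, SIZE)          # reserve the whole range now
    finally:
        os.close(fd)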

Thanks! I love finding this stuff out.

There is also a modest increase in free space on the hard disk: fragmented file blocks not only contain file data, but also information to direct the read/write head to the next data block in the sequence. An unfragmented file contains contiguous blocks of pure file data.

There are a few things to consider when optimising disk layout.

Access to blocks on a disk involves two operations - seeking to the right track, and then waiting for the disk to rotate the data under the head. Seek time has a mix of fixed and variable overhead - there is a fixed cost to accelerate the head and to decelerate and settle it over the new track, plus the time taken to move across the disk - longer seeks take more time. The maximum rotational latency is fixed - it is determined by the rotational speed of the platters - but you can wait anywhere from close to zero up to an entire rotational period for the data to come around. In general, seek times are longer than the rotational latency.
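
To put rough numbers on that, assuming a 7,200 rpm desktop drive with an average seek of about 8.5 ms (a typical figure, not a measurement of any particular drive):

    # Back-of-the-envelope access time for an assumed 7,200 rpm desktop drive.
    RPM = 7200
    AVG_SEEK_MS = 8.5                            # assumed typical average seek

    rotation_ms = 60_000 / RPM                   # one revolution: ~8.33 ms
    avg_rot_latency_ms = rotation_ms / 2         # on average, wait half a turn
    print(f"avg rotational latency: {avg_rot_latency_ms:.2f} ms")                # ~4.17 ms
    print(f"avg random access time: {AVG_SEEK_MS + avg_rot_latency_ms:.2f} ms")  # ~12.7 ms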

The best place to put the data for a large file is wherever lets you read it all without additional seeking. That means putting it all on one track - or, if you have multiple surfaces (two per platter), across the tracks on all the platters when the heads are in one position (this is known as a cylinder). Old disks (up until about 20 years ago) had a fixed number of blocks per track, and it was easy to reason about exactly where your data was. With modern disks you don’t have quite the fine-grained control of yore, but the principles remain.

The fastest place to put your data isn’t on the outermost tracks, it is near the point that halves the area of the disk. This is because the disk heads will typically pass over that point twice as often on their travels as the inner or outer tracks.
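
As a quick check on where that point sits, assuming the recording band runs between a made-up inner and outer radius roughly like a 3.5-inch platter:

    # Radius that splits the recording area of a platter in half.
    # The inner/outer radii are invented, roughly 3.5" platter sized.
    from math import sqrt

    r_inner, r_outer = 20.0, 47.0                # mm, assumed
    r_half_area = sqrt((r_inner**2 + r_outer**2) / 2)
    print(f"{r_half_area:.1f} mm")               # ~36.1 mm, outside the midpoint of head travel (33.5 mm)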

Disks themselves will reorder the requests they receive, taking notice of exactly where the heads are and how far the disk has rotated when calculating which request to satisfy next. Disk manufacturers don’t tend to discuss these algorithms.

Another thing that disks do is cache some of the data in the controller. A simple tactic is to cache more data than you have been asked to read - read a few more blocks after the ones you were asked for - since there is a good chance that data will be requested pretty soon. So the file system has an incentive to work with the disk cache and to try to allocate data storage in contiguous areas. Disks may also delay writes of data back to the media. This is scarier - it can result in inconsistent file systems - so it can only be done with additional care.
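
A program can also work with the caching one layer up, by telling the operating system what its access pattern will be. A minimal sketch on Linux, with a made-up filename (this hints the kernel’s read-ahead, which is a separate layer from the drive’s own cache):

    # Sketch: hint that a file will be read sequentially so the kernel can
    # read ahead aggressively. The filename is illustrative only.
    import os

    fd = os.open("big_video.mkv", os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # whole file, sequential
        # ... read the file here ...
    finally:
        os.close(fd)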

As noted above, a defragmenter must operate so that no matter when a failure occurs - even in mid-write - the file system remains consistent. This means it must do a great deal more work than just copying file blocks about. It must ensure that there is never a moment when there isn’t a valid copy of all the data and metadata reachable from the file system’s base structures. Typically this means a lot more writing of data than you might hope for.
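
In outline, a crash-safe block move looks something like the toy model below. Everything here is an in-memory stand-in (the dict "disk", the file table, the block numbers), not any real defragmenter’s code; the point is only the ordering, plus the fact that a real tool must also flush each step to the platters before the next:

    # Toy model: the old copy of the data is not released until the metadata
    # points at the new, contiguous copy. All structures here are invented.
    disk = {}                                    # block number -> data
    file_table = {"report.doc": [90, 12, 57]}    # file -> its (fragmented) blocks

    def relocate(name, new_blocks):
        old_blocks = file_table[name]
        for old, new in zip(old_blocks, new_blocks):
            disk[new] = disk[old]                # 1. copy data to the new location
        # a crash here is safe: metadata still points at the old, intact copy
        file_table[name] = list(new_blocks)      # 2. switch the metadata in one step
        # a crash here is safe: metadata now points at the new, intact copy
        for old in old_blocks:
            del disk[old]                        # 3. only now release the old blocks

    for i, b in enumerate([90, 12, 57]):
        disk[b] = f"part{i}"
    relocate("report.doc", [100, 101, 102])
    print(file_table, sorted(disk))              # the file now sits in blocks 100-102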

Not all file systems need defragmenting. Indeed, most don’t. All the way back to the BSD FFS, allocation tactics were used to keep fragmentation from building up in the first place. Modern journaling file systems make it even less relevant.

Once again, great info, that I love to know. Thank you.

In addition, defragmenting may not even be advantageous; you need to understand the typical access patterns for the data.

For example, because the AS/400 is a business machine - which typically means lots of reads of small amounts of data scattered randomly around the database - its strategy is to scatter data among all of the available disk drives. If you have 40 or 100 disk drives, it’s better to have all of them working in parallel instead of having every read of the orders table queue up on a single drive.
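
A toy illustration of that scatter-everything idea, with an invented drive count and a simple round-robin layout (real systems use far more elaborate schemes):

    # Toy round-robin striping: consecutive logical blocks land on different
    # spindles, so random reads are spread across all the drives in parallel.
    NUM_DRIVES = 8                               # invented for illustration

    def locate(logical_block):
        drive = logical_block % NUM_DRIVES       # which spindle holds it
        offset = logical_block // NUM_DRIVES     # where on that spindle
        return drive, offset

    for lb in range(6):
        d, off = locate(lb)
        print(f"logical block {lb} -> drive {d}, offset {off}")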

In this case, defragmentation is your enemy, not your friend.

How do I determine whether my hard drive is one that should or shouldn’t be defragmented? Also, when programs defrag your hard drive, do they analyze it, plan the whole thing, and then simply execute the plan? Or do they do things ‘on the fly’ as they go?

Today’s drives are so crazy fast that trying to scheme out where to put files on the platters for faster access is pretty much meaningless on a desktop PC.

Just for giggles…let’s flip through the December 1991 issue of PC World magazine. (Look to the future with a 486SX!) Ah, here’s a 105 MB drive with a 25ms access time. Could be yours for $1,399. Today, Amazon has a 1 TB drive for $45. Access time is such a non-issue, you have to dig to find any mention of it. With today’s mechanisms, it’s probably about 3 ms. Physically moving the heads is only part of the speed - that drive in 1991 moved data in and out at 1.5 MB per second. Today, drives can blast bits down a SATA connection at 3 Gbit per second.

The primary point of defragging now is just to keep files in one piece so the heads don’t have to flail about to find all the pieces.

On specific applications that you won’t have at home, like AS/400 and enterprise database servers, there are entirely different strategies to optimize disk access.

Whilst today’s disks are fast, they are not nearly as fast as you might hope. Indeed, compared to the rest of a computer system they have been getting relatively slower and slower. Twenty years ago drives typically ran at 3,600 rpm, so even the fastest drives available now have a rotational latency only about five times better, and average seek time has improved by a similar amount. So drives are typically only about five times faster for random accesses, while CPU clock speeds have increased well over a hundredfold. That means your CPU can do (say) twenty times as much work (or is prevented from doing twenty times as much) during a wait for a disk access than it could twenty years ago. Only with the advent of modern flash-based storage do we see another tier in the hierarchy that starts to fill the gap. Average transfer rate, on the other hand, scales with the square root of the areal density times the increase in rotation speed, so it has gone up much more.
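
Putting numbers on the rotational side of that, using the figures in the paragraph above (3,600 rpm then, 15,000 rpm for about the fastest spinning drive you can buy now):

    # Rotational latency then vs. now, from the rpm figures quoted above.
    for rpm in (3600, 15000):
        avg_latency_ms = (60_000 / rpm) / 2      # wait half a revolution on average
        print(f"{rpm:>6} rpm: avg rotational latency {avg_latency_ms:.2f} ms")
    # 3,600 rpm -> 8.33 ms; 15,000 rpm -> 2.00 ms: a little over a 4x improvement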

The difference in tactics for disk use usually comes down to a latency versus throughput tradeoff. A transaction processing engine is interested in throughput - it is much more useful to increase the number of transactions that can be processed per unit time than to improve the speed of an individual transaction. So many spindles, with accesses spread evenly across them, provide good scalability - even if any individual transaction is no faster. For a single-user PC, however, there is not a lot of value in this. Large file accesses - say, for multimedia editing - remain a bottleneck, and here file system layout can and will pay significant performance dividends. RAID arrays provide higher transfer rates, but no better minimum seek times. In the end, if your program is doing a lot of random accesses to large files, or to lots of small files, you are limited by seek times, and since those have only improved about fivefold in twenty years, it pays to optimise disk layout.

Storage density is greater on disks now, so even without an increase in rotational speed, more data can be read or written in one revolution. For large-scale applications involving more than one disk, strategies will vary and are application-dependent. These systems don’t use de-frag very much, for a variety of reasons. One of the main reasons is the time it takes to de-frag, which could make a system unavailable, or unreasonably slow, for many hours, while doing little to increase performance.

The center-of-the-drive strategy for locating data works in some cases, but not others. For applications which open and close many different files, and have to frequently access directory information in the outer tracks, keeping the most frequently used files in the outer tracks speeds performance. This is more important when there are a lot of writes and caching can’t substitute for disk access. When a few large files are kept open and accessed frequently, the middle tracks are better, minimizing the average seek time.

Virtually all disk drives have multiple platters now, and the term ‘cylinder’ is more appropriate than ‘track’ since all the heads move together on the same actuator. Using multiple heads creates a virtual cylinder, multiplying the amount of data accessible without head movement. Head-per-track drives were once available, reducing seek time to zero, but the cost of that many heads, set against the high track density of modern disks, has made that design rare.

Data compression is a new variable in the equation. Since processing speed has outpaced disk access time by so much, there is plenty of time available to compress and decompress data on the fly (especially if it’s done in the disk controller without impacting the CPU). This increases the density of data on a single track or cylinder, reducing the need to seek to a different track.
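
For a rough feel of what on-the-fly compression can buy, here is a quick test with Python’s zlib on some made-up, highly repetitive records; real controller-level compression works on raw blocks, and the ratio depends entirely on the data:

    # Highly compressible data packs into far fewer bytes, which means fewer
    # blocks on the track and less head travel. The sample records are invented.
    import zlib

    data = b"customer_id,order_id,status\n" + b"1001,2002,shipped\n" * 5000
    packed = zlib.compress(data, level=6)
    print(len(data), "->", len(packed), f"(about {len(data) / len(packed):.0f}x smaller)")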

The OP seems to be about a PC which usually has a single disk, many files, very random access patterns, and relatively slow drives. De-fragging is still a holdover from earlier times though. Most modern PCs gain only small advantages from de-fragging in common usage, because very few users maintain large random access files. Most of the performance increases seen are from frequent access to small files which get scattered and fragmented.

Of course, in a few years flash drives and similar technology will relegate disk drives to the category of back-up storage, and eventually they will disappear in favor of totally non-volatile recording methods like the ones DVDs provide.

That was my first PC. By fall of 1993 it had become the bargain entry-level machine at Radio Shack.

MB/GB or Mb/Gb?