Why are similar-length MS Word documents so different in file size?

I teach college, and my students all submit their papers online. I’m not interested in making them worship at the Church of Microsoft, so i offer the opportunity to submit in all sorts of file formats: .doc, .docx, .pdf, .rtf, .odt, and even .txt.

Even so, the ubiquity of Word means that most of the documents are submitted in .docx format.

One thing i find interesting is that two of these documents, which might appear to be almost identical when viewing the actual content, can have radically different file sizes.

Right now, for example, i have two of my students’ papers open. One is 1570 words long, and the other 1603 words. Each document contains about 15 footnotes. The only discernible difference is that one document contains a header with the student’s name and a page number, while the other does not. Yet one shows as 14KB in Windows Explorer, while the other is 126KB. And there’s a third document in the same folder that is only 620 words long, with no header or footer, that weighs in at a whopping 487KB.

What gives? What sort of background information in the file causes that large a discrepancy?

Word used to only record the changes made to the document. So, if you had a large amount of text, and then deleted it, it would simply write the “delete” to the file, leaving the original text hidden in the document. I’m not sure if modern Word still does this, but I wouldn’t be surprised if it does.

There is a ton of stuff in Word documents that you just don’t see. The actual text of a document is the least of it. Take a Word doc, copy the text, paste it into something like Notepad, save that as a .txt file, and you’ll see the difference.
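For .docx files you can get a rough version of that comparison without opening Word at all. Here is a minimal Python sketch, assuming the file is a normal .docx (a Zip archive whose body text lives in word/document.xml). The function names are made up for illustration, and it only counts body text, not footnotes or headers, which live in separate parts of the archive.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used for paragraph (w:p) and text (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def visible_text(path):
    """Return the body text of a .docx, one line per paragraph (rough count)."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # Join the text runs inside this paragraph.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)

def text_vs_file_size(path):
    """Return (bytes of plain text as UTF-8, bytes of the .docx on disk)."""
    text_bytes = len(visible_text(path).encode("utf-8"))
    return text_bytes, os.path.getsize(path)
```

Calling text_vs_file_size("paper.docx") (a hypothetical file name) gives the two numbers side by side. Keep in mind the .docx is compressed and the plain text is not, so the gap is, if anything, understated.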

Styles, templates, graphics (if any), xml schema, etc., are all invisible parts of a Word document. A cluttered-up template or document with lots of extraneous junk in it will result in a larger file than a clean document, even if both have the same number of words.

Word is a junk collector and does not store information in the most efficient and compact manner.

Performing a “Save As” and saving to a new file usually causes Word to create a new file without the accumulated cruft. This is one thing to do to check on the source of some of the idiotic rubbish in the file. (Never ever give someone a Word file that you have not done this for - otherwise it is quite possible to uncover stuff that you thought you had deleted from the text. Sometimes with very unfortunate results.)

But even the accumulated rubbish of changes doesn’t account for all the madness. As above, someone could have used a very complex and messy template which includes all manner of rubbish.

It could have embedded images, fonts, styles, or a whole lot of other non-visible things that would make a substantial difference to the file size. The docx format, being based on XML, isn’t really the most compact storage mechanism either.

Thanks folks.

I was aware that Word docs tend to be considerably larger than their plain-text counterparts. I also suspected a lot of what you said about templates and other background stuff. I guess i was surprised because i assume that most non-expert users would simply open a document and begin writing, using Word’s default settings.

487KB is massive for a 600-word .docx file (they’re compressed - Zip format). It might have embedded images or fonts.
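Since a .docx is just a Zip archive, you can look inside and see which part is eating the space. A minimal Python sketch, using only the standard zipfile module; the function names are made up for illustration, and the folder names mentioned (word/media for images, word/fonts for embedded fonts) are where Word usually puts those things, not something guaranteed for every file.

```python
import sys
import zipfile

def docx_part_sizes(path):
    """Return (name, compressed_bytes, uncompressed_bytes) per part, largest first."""
    with zipfile.ZipFile(path) as zf:
        parts = [(i.filename, i.compress_size, i.file_size) for i in zf.infolist()]
    return sorted(parts, key=lambda p: p[1], reverse=True)

def summarize_by_folder(parts):
    """Total compressed bytes per area, e.g. 'word/media' or 'word/styles.xml'."""
    totals = {}
    for name, compressed, _ in parts:
        pieces = name.split("/")
        # Group files nested two levels deep (word/media/image1.png -> word/media).
        key = "/".join(pieces[:2]) if len(pieces) > 2 else name
        totals[key] = totals.get(key, 0) + compressed
    return totals

if __name__ == "__main__" and len(sys.argv) > 1:
    for name, compressed, uncompressed in docx_part_sizes(sys.argv[1]):
        print(f"{compressed:>10} {uncompressed:>10}  {name}")
```

Run it on the 487KB file and the culprit should be obvious from the first line or two of output. If the top entry is under word/media, it is an image; if it is under word/fonts, it is an embedded font.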

We often do. But the default settings include a lot of baggage.

It’s been a while since I’ve torn apart a Word file down to the byte level to see what makes it tick, but years ago I did, and was surprised to find not only batches of text that had been deleted, but the entire font data file. Use 10 fonts, you have 10 embedded font files. Not sure if this still happens.

Also, Word does not link to images, but stores them internally in the data file, or at least it used to. All of this makes the file expand considerably.

It’s different between doc and docx; docs (and xls, and so forth) did that, docx doesn’t. A trick: if you’re working in doc, copy-pasting the whole contents of the document to a brand-new one produces a document in the same original format but taking up a lot less space (1/3 was pretty normal for the team where I learned that).

Also, pictures. Let’s say you want to add a cropped picture to a doc or ppt. If you crop the picture in another program and paste it directly, it will take a lot less space than if you crop it inside Word; same for any editing (resizing, adding numbers, whatever). Cropping it in another program is equivalent to taking scissors to a printout; cropping it inside Word is equivalent to painting the edges white.