What kind of malicious things can be distributed in .doc and .xls files?

And are there ways to purge any such malicious additions from a file that I suspect is infected? Would purging them require opening the file in its associated program (presumably on a safe machine dedicated to that purpose, to avoid damage), or can they be purged using third-party apps without exposing the machine to any threat?

You can embed trojans and viruses in Microsoft Office documents because they can run active content.

A good anti-virus scan will look for these exploits.

This is mostly a theoretical threat these days; it’s been at least ten years since a virus exploited Office programs. The most common issue was a macro virus: a virus that changed the .dot template file so that its macro was added to all new files. It was more a pain than anything malicious.

Since virus and malware makers are interested in other things, they concentrate on other methods of propagation (a virus in a .doc or .xls would spread so slowly these days that antivirus would find it easy to keep up).

So it’s possible, but no one would go to the trouble.

You may embed damned near anything in DOC or XLS if you have the desire and the human resources to write a computer application of low to moderate complexity.
THOU SHALT SEPARATE DATA AND EXECUTABLE STREAMS was a good idea, but for the past 14 years, it ain’t been done anymore.

AFAIK an “anti-virus scan” is about detecting things that are not allowed, especially when the “not allowed” thing happens to match a known virus.

Now, could we make a scan, specifically targeted at .doc, that would only allow the “allowed” things? After all, I know what sort of things I expect to see in .doc file - the text, the pics, the tables, that’s it. Maybe footnotes. So could we make an app, whether running on top of Word or separately, that would sort of go through the document with a checklist and either detect everything that’s not explicitly allowed or also purge it on the spot?
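To make the checklist idea concrete, here is a minimal Python sketch that lists the named entries inside a binary .doc/.xls container (the OLE2 compound file format) and reports any that are not on an allow-list. The names in ALLOWED_ENTRIES are my assumption about what a plain Word 97-2003 document typically contains, not an authoritative list, and the parser ignores large files that need extra DIFAT sectors. It only detects; it does not purge anything, and it says nothing about whether an allowed stream is itself malformed.

```python
import struct

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ENDOFCHAIN = 0xFFFFFFFE  # values at or above this are not real sector numbers

# Assumption: entries expected in a plain text/pictures/tables Word document.
ALLOWED_ENTRIES = {
    "Root Entry", "WordDocument", "0Table", "1Table", "Data",
    "\x05SummaryInformation", "\x05DocumentSummaryInformation", "\x01CompObj",
}


def list_ole_entries(data):
    """Return the names of all directory entries in an OLE2 compound file."""
    if data[:8] != OLE_MAGIC:
        raise ValueError("not an OLE2 compound file")
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)
    difat = struct.unpack_from("<109I", data, 76)  # header part of the DIFAT only

    def sector(n):
        start = (n + 1) * sector_size  # sector 0 starts right after the header
        return data[start:start + sector_size]

    # Build the FAT (a sector-number -> next-sector-number table).
    fat = []
    for fat_sector in difat:
        if fat_sector >= ENDOFCHAIN:  # unused DIFAT slot
            continue
        fat.extend(struct.unpack("<%dI" % (sector_size // 4), sector(fat_sector)))

    # Walk the directory chain; each directory entry is 128 bytes.
    names, seen, current = [], set(), first_dir_sector
    while current < len(fat) and current not in seen:  # 'seen' guards against loops
        seen.add(current)
        chunk = sector(current)
        for off in range(0, len(chunk) - 127, 128):
            name_len, obj_type = struct.unpack_from("<HB", chunk, off + 64)
            if obj_type == 0 or not 2 <= name_len <= 64:
                continue  # unallocated or nonsensical entry
            raw = chunk[off:off + name_len - 2]  # drop the trailing UTF-16 NUL
            names.append(raw.decode("utf-16-le", "replace"))
        current = fat[current]
    return names


def unexpected_entries(data, allowed=ALLOWED_ENTRIES):
    """Return every entry name that is not explicitly allowed."""
    return [name for name in list_ole_entries(data) if name not in allowed]
```

Usage would be something like unexpected_entries(open("suspect.doc", "rb").read()). A truncated or deliberately malformed file may raise an exception instead of returning a list, which for this purpose is probably the right outcome: treat the file as not passing the checklist.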

The closest you’re going to get to that on DOC and XLS files is going to be using something that does solid heuristic analysis[1] on your files. It’s not a new concept.
You simply cannot whitelist MS Office document files and have users who continue to be able to modify them, so that would be your best bet.
Aside from that, you could disable certain functions in the documents, including ActiveX, hyperlinks, VBScript, etc.
None of that disabling can be done without reducing functionality.

[1] Heuristic analysis - Wikipedia

Mr_Slant,

well OF COURSE I want to reduce functionality! Like I said, the only functionality I want in a .doc file I get from somebody else is reading the text, looking at pictures and maybe tables. I don’t want the script functionality, the ActiveX functionality, the trojan functionality etc.

Ok, now how about we take a potentially infected .doc file, open it inside Microsoft Word emulator running on Linux in total lock down (i.e. a place where we don’t care about trojans), do CTRL-A CTRL-C CTRL-V on its contents and paste them into another new .doc file. Will that be equivalent to a total purge of all malicious embedded stuff (along with whatever non-malicious things that might not get copied over and hence axed)?

Without going into technical specifics, in broad terms what you’re talking about would work, yes.
You could achieve most of what you want by installing Sun VirtualBox, then installing a copy of Ubuntu Linux including OpenOffice.

Here’s a rough guess as to a solution:
Download the document using the Ubuntu install, open the document in OpenOffice, convert the file to a more primitive format than Word’s .DOC, close OpenOffice (probably a needless step here), re-open the document with OpenOffice, and save the file back into Word’s .DOC format.

That’s a starting point. I have not tested the above solution. It’s possible that the idea is unworkable or unsafe, or missing steps, or has needless ones.
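For what it's worth, the round trip could be scripted rather than clicked through. The sketch below assumes LibreOffice's soffice binary and its --headless, --convert-to and --outdir options; OpenOffice builds may not accept the same flags. The choice of RTF as the "more primitive" intermediate format is mine, and the function names are made up for illustration. Like the steps above, it is untested against real malware and should only be run inside the throwaway VM.

```python
import subprocess
from pathlib import Path


def build_roundtrip_commands(src, workdir, intermediate="rtf", soffice="soffice"):
    """Return the two conversion commands: .doc -> intermediate -> .doc."""
    src, workdir = Path(src), Path(workdir)
    middle = workdir / (src.stem + "." + intermediate)
    clean_dir = workdir / "clean"  # keeps the rebuilt file apart from the original
    base = [soffice, "--headless", "--convert-to"]
    return [
        base + [intermediate, "--outdir", str(workdir), str(src)],
        base + ["doc", "--outdir", str(clean_dir), str(middle)],
    ]


def run_roundtrip(src, workdir, intermediate="rtf", soffice="soffice"):
    """Run both conversions in order; raises if either soffice call fails."""
    commands = build_roundtrip_commands(src, workdir, intermediate, soffice)
    (Path(workdir) / "clean").mkdir(parents=True, exist_ok=True)
    for command in commands:
        subprocess.run(command, check=True)
    return Path(workdir) / "clean" / (Path(src).stem + ".doc")
```

Calling run_roundtrip("suspect.doc", "work") should leave the rebuilt file at work/clean/suspect.doc. Whether every unwanted embedded object is dropped along the way depends on the intermediate format, so inspect the result before trusting it.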

I doubt this is what you want but if you only needed to get text out of the file (no images, tables, etc), in a Linux terminal try: strings oldfile > newfile. The strings command will pull text out of a binary, so it should pull text out of an infected, rich text file.
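If strings isn't handy, a rough Python equivalent is below. It is only a sketch: it pulls out runs of printable ASCII, plus runs of the same characters stored as 16-bit little-endian text, on the assumption that a binary .doc may hold its text either way. The function name and the 4-character minimum are my own choices, and non-English text will be missed.

```python
import re


def extract_strings(data, min_len=4):
    """Return printable text runs found in raw bytes, in file order."""
    # Runs of printable 8-bit ASCII characters.
    narrow = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # The same characters stored as UTF-16LE (each followed by a zero byte).
    wide = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    found = [(m.start(), m.group().decode("ascii")) for m in narrow.finditer(data)]
    found += [(m.start(), m.group().decode("utf-16-le")) for m in wide.finditer(data)]
    return [text for _, text in sorted(found)]
```

Something like "\n".join(extract_strings(open("oldfile", "rb").read())) gives roughly what strings would, along with a fair amount of junk such as font names and metadata. With GNU strings, the -e l option asks for 16-bit little-endian text as well.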

Fubaya,

doesn’t the .doc format encrypt the text? E.g. when I opened .doc files in Notepad, I don’t think I saw any legible text. So can the “strings” command actually parse text out of .doc?

Looks like you’re right, they can be text or binary files. I just assumed they were some sort of rich text. It looks like there are several command line tools that can convert them to plain text, but I don’t know if they would have any advantage over a GUI tool like OpenOffice.

The potential hazards in doc (or similar) files are of two types.

  1. The file contains a macro which does evil when executed. All Word versions from 2003 on (maybe 2000 on) have an “are you sure you want to enable the macros in this file?” prompt during opening. Click [No] and you’re safe, period, amen.

Also, as noted by others above, this is a pretty obsolete attack vector these days. But if your dangerous doc dates from 1996, it could be harboring an evil macro. Again, click [No] and it’s not going to run. And you can then remove it by simply opening the macro editor feature in Word and deleting it/them.

  2. The file is malformed as a document and this malformation exploits a bug in the viewing app (e.g. MS Word) with the result of hijacking Word’s exe to execute the malcode buried in the document.

The cure for this is to open the document in any non-Word program which can read doc format files and then copy the contents to a fresh doc.

Or to run a trusted anti-malware tool against the file to remove, or at least denature, the malware.
Don’t forget that for suitably ancient versions of Windows, JPGs can host malware too. So even if the doc is clean, any embedded images are not certain to be clean.

Once you decide to be suspicious of a file, there is really no way to ever get back to the state of (ignorant?) blind trust we place in all the other files we use. Sorta like doing radioactive decontamination. How can you ever really be sure there isn’t a single radioactive molecule left in the area? You can’t, and expecting that degree of certainty is silly.