Recognizing File Type with Bogus Extension

Suppose someone wanted to protect sensitive information and simply changed the file extension to an incorrect or non-existent type. How easy would it be for another (tech-savvy) person to recognize the correct file type, and how would they go about doing this?

It depends on the original data type of the file. First, of course, you need some reason to suspect that the file contains interesting data.

Then, there are a number of simple heuristics that one can apply to the data to determine its true format. Every standard image format has a file header with known values, so it’s easy to look at those. The same goes for the formats used by various word processor and office applications, compression formats (gzip, bzip2), archive formats (tar), etc. Plain English ASCII text of course contains English words and no bytes above 127; it’s just as easy to detect extended ASCII or Unicode encodings. Many applications use XML-based file formats now, which are human-readable. The Unix ‘file’ utility uses all these techniques and more to make a best guess at the file type.
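To make those heuristics concrete, here is a minimal Python sketch of the idea, not how ‘file’ itself is implemented. The signature table is a small hand-picked sample, and the names `SIGNATURES`, `guess_type` and `guess_file` are made up for illustration.

```python
# A handful of well-known leading byte signatures ("magic numbers").
# Real tools know thousands; this list is only a sample.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"%!PS", "PostScript document"),
    (b"PK\x03\x04", "ZIP archive (also used by many office formats)"),
    (b"\x1f\x8b", "gzip compressed data"),
    (b"BZh", "bzip2 compressed data"),
]


def guess_type(data: bytes) -> str:
    """Guess a file type from its raw bytes, ignoring any file name."""
    if not data:
        return "empty file"
    # 1. Compare the start of the data against known signatures.
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # 2. tar keeps its marker ("ustar") at offset 257, not at the start.
    if data[257:262] == b"ustar":
        return "tar archive"
    # 3. Plain ASCII text: no bytes above 127 (and no NUL bytes).
    if all(b < 128 for b in data) and 0 not in data:
        return "ASCII text"
    # 4. Otherwise, see whether the bytes decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8 text"
    except UnicodeDecodeError:
        return "unknown binary data"


def guess_file(path: str) -> str:
    """Read a whole file (fine for a sketch) and guess its type."""
    with open(path, "rb") as handle:
        return guess_type(handle.read())
```

Calling `guess_file` on a renamed file gives the same answer as on the original, because the extension is never consulted.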

At the end of the day, there’s always opening it up in a hex editor and examining it byte by byte.

Here’s one program that tries to figure it out for you.

In my experience, if you go through a file with a hex editor, the header usually contains enough information to tell you what type of file you’re dealing with. For example, all JPEGs start with the hex sequence FF D8 FF. Here’s more on file signatures. Sometimes you don’t even need to know the signatures, as the file header will contain plaintext, visible in the hex editor, that gives you clues to what kind of file it is.
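If you don’t have a hex editor handy, a few lines of Python will show the same leading bytes. This is just a sketch; `hex_header` is a name invented for this example.

```python
def hex_header(data: bytes, count: int = 16) -> str:
    """Format the first `count` bytes as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in data[:count])


# The first bytes of a typical JPEG, typed in by hand for illustration.
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF"
print(hex_header(sample, 4))  # prints "FF D8 FF E0"
```

For a real file, read the first bytes with `open(path, "rb").read(16)` and pass them in.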

Simpler than that, even: Anything at all can be opened in a text editor. The meat of most files will show up as incomprehensible gobbledygook, but the headers will often still be plaintext, so you’ll see something like <short section of chickenscratch>Created with Adobe Photoshop<more chickenscratch>.
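The same trick can be automated by pulling the readable runs out of the binary noise, roughly what the Unix `strings` tool does. A small sketch, assuming runs of four or more printable ASCII characters are worth showing; `printable_runs` is a hypothetical helper name.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least `min_len` printable ASCII characters."""
    # 0x20-0x7E covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up bytes standing in for a file with a plaintext header comment.
blob = b"\x00\x01\x9fCreated with Adobe Photoshop\xff\xfe ab\x00"
print(printable_runs(blob))
```

Here the only run long enough to report is the "Created with Adobe Photoshop" text; the shorter fragment after it is dropped.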

Not hard at all. The standard Unix terminal program “file” will recognise a wide range of file types based on magic numbers in the file header. For instance, even after removing the extension from an arbitrary file on my desktop, file successfully recognises its file type:



dpm@dpm-laptop:~$ mv /home/dpm/Desktop/head.ps /home/dpm/Desktop/head
dpm@dpm-laptop:~$ file ~/Desktop/head
/home/dpm/Desktop/head: PostScript document text conforming DSC level 2.0