Comparing multiple documents

I’m looking for a tool that seems to fall between two chairs. I don’t want to check a document against an online database to prevent plagiarism, and I don’t want to run a side-by-side comparison of two chosen documents on my hard drive.

I want to run a plagiarism search on a small set of documents on my hard drive. Let’s say I have files A-Z and alpha. I’m, for whatever reason, sure alpha is plagiarising one or more of the files A-Z, and I want to know which of the files A-Z alpha is a copy of, in part or in full.

Anyone know of a good tool for that purpose?

A half dozen work/study students.

Seems like Anti-Twin will let you specify a match %.

I expect a byte-by-byte comparison won’t work well when looking for duplicate word-processor document content.

There are a number of blacklining programs on the market, and most of them (or at least the ones with which I’m familiar) can compare one document to a number of other documents.

They’re probably really expensive – the market for these programs is law firms and corporations, not individual users. Two that come to mind are DeltaView and Litera’s ChangePro.

MS Word has blacklining capability built in, but I don’t think it can handle multiple (i.e., more than two) documents at once.

Were you planning on contributing something to the discussion?

You could probably set up a batch process to compare document alpha to each of the other suspected plagiarism sources one at a time. Since the OP just wants to find out if/which one was plagiarized, that should be enough.
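For what it’s worth, the one-at-a-time batch idea could be sketched roughly like this in Python. This is only a sketch under assumptions: it assumes the documents have already been saved as plain text, it uses the standard library’s difflib as a stand-in for a real blacklining tool, and the names similarity, rank_sources and load_texts are made up for illustration.

```python
from difflib import SequenceMatcher
from pathlib import Path


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two texts, compared word by word."""
    # autojunk=False stops difflib from ignoring very common words in long texts.
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def rank_sources(suspect_text: str, sources: dict[str, str]) -> list[tuple[str, float]]:
    """Compare the suspect text to every source text, most similar first."""
    scores = [(name, similarity(suspect_text, text)) for name, text in sources.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def load_texts(paths: list[Path]) -> dict[str, str]:
    """Read plain-text files into a {file name: contents} mapping (hypothetical helper)."""
    return {p.name: p.read_text(encoding="utf-8", errors="replace") for p in paths}
```

You would read alpha’s text, call load_texts on the files A-Z, and pass both to rank_sources; whatever comes out on top is the first place to look. Note that a whole-document ratio like this can stay low when only a short passage was lifted from a long source.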

Or do it manually: concatenate all the documents into one big document, then use MS Word’s comparison on that. Once you locate the plagiarism, you can find out which document it was in.

This sounds like what you want, but it’s for instructors and it’s web based so probably won’t help. However I’d send them an email. I’m sure they can help you.

As long as you don’t expect anything too robust (i.e. able to determine that a hastily reworded phrase is similar to the original), this is a pretty simple script to run on plaintext. .docs and .pdfs are a bit more complicated, but only insofar as you have to extract the text as plaintext first. In fact, I wrote such a script as a project in Haskell for a programming class. More specifically, I wrote a script to compare two documents, but there’s no reason it would be difficult to extend it to take a list of file names, compare a single input against each in sequence, and print out the similarity to each. (Mine in fact printed a local HTML document highlighting exactly where the similarities occurred.)
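To give an idea of how simple such a script can be, here is a rough Python sketch (not the Haskell program mentioned above, and not necessarily the same method). It assumes plain-text input and swaps in a common technique, overlapping n-word sequences ("shingles"), to estimate how much of the suspect text also appears in a source. The function names and the default of five words are arbitrary choices for illustration.

```python
import re


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in text, lowercased, punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(suspect: str, source: str, n: int = 5) -> float:
    """Fraction of the suspect's n-word sequences that also occur in source."""
    suspect_shingles = shingles(suspect, n)
    if not suspect_shingles:
        # The suspect text is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(suspect_shingles & shingles(source, n)) / len(suspect_shingles)
```

The measure is deliberately one-sided: it asks how much of alpha shows up in a given source, so a short alpha copied wholesale from a long document still scores close to 1. As said above, it will not catch reworded passages.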

Sadly, I can’t offer more than to say it’s a rather simple school-level program, and that I’d be absolutely shocked if there weren’t one out there somewhere that does what you want relatively cheaply. (Obviously ignoring TurnItIn, since you don’t want a web DB.)

I recommend hiring this guy: xkcd: Regular Expressions