Figuring out an encoding error based on the scrambled characters

Okay, this is a rather technical and odd question, but bear with me.

I’m not going to name and shame the programmer, but I’m using a program to keep track of a whole lot of data about genealogy and DNA. From time to time I download a big file from a website and import it into this program; only some of the data will be new, and the rest is duplicates that should just be merged or ignored.

At some point this program stopped importing data containing non-Latin characters correctly. Bøb Båbcatcher becomes B√∏b B√•bcatcher, with all the issues that involves*. The programmer blames the platform, and it’s certainly possible the platform is at fault, but according to the XOJO webpage it uses UTF-8 internally if nothing else is specified, and my text editor says the downloaded file is UTF-8 and displays it correctly as such.

I’m hoping there’s just one step where the encoding is handled incorrectly, and that changing the encoding of the file before import is a simple fix, but after trying a handful of options I still haven’t found the correct encoding.

So my actual question is: Is there a way to look at the resulting scrambled characters and figure out the encoding the program is using during import?

I mean, a UTF-8 ø is read, something happens (probably a single encoding mismatch), and UTF-8 √∏ is displayed.

Experimental use of other encodings gives different scramblings, so I’m sure Alan Turing could work it out, but I’m hoping someone here can do it quicker than I can create my own Bletchley Park.
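To show what I mean by “different scramblings”, here’s a rough Python simulation of the mismatch I suspect (the codecs named below are just examples I picked for illustration, not anything I know the program to be using):

```python
# Take genuine UTF-8 bytes and deliberately decode them with the wrong
# single-byte codec, which is what I suspect happens somewhere during import.
raw = "Bøb Båbcatcher".encode("utf-8")

for wrong_codec in ("latin-1", "cp437"):
    print(wrong_codec, "->", raw.decode(wrong_codec))

# latin-1 -> BÃ¸b BÃ¥bcatcher
# cp437   -> B├╕b B├Ñbcatcher
```

Neither of those matches the √∏ I’m actually seeing, which is exactly the puzzle.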

*It looks ugly, it prevents searches on names containing these characters, it makes it impossible for the program to merge duplicates when mass importing data, etc.

I don’t have a complete answer, but I note that the Unicode for ø is U+00F8 and the UTF-8 encoding of that is C3 B8. What you’re seeing is √∏, so I’m guessing that something is interpreting the UTF-8 bytes as individual characters. The second error is consistent with this: the Unicode for å is U+00E5 and the corresponding UTF-8 is C3 A5. You see √• instead – again the C3 is displayed as a square root sign. The only thing I’m unsure of is what character set is being used for that final display – it’s not Latin-1. Maybe some Microsoft code page.

If you have any examples of a character getting changed into three garbage characters, we could confirm that the UTF-8 for the original consists of 3 bytes. That would be good support for this theory.
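If it helps, one way to hunt for that mystery character set is to brute-force it: take the UTF-8 bytes of ø and see which single-byte codec renders them as the √∏ being displayed. A rough Python sketch (the candidate list is only my guess at plausible suspects, and the program may well use a table Python doesn’t ship):

```python
# Which single-byte codec, applied to the raw UTF-8 bytes of "ø",
# produces the "√∏" that the program displays?
utf8_bytes = "ø".encode("utf-8")   # b'\xc3\xb8'

for codec in ("latin-1", "cp1252", "cp437", "cp850", "koi8-r", "mac_roman"):
    try:
        decoded = utf8_bytes.decode(codec)
    except UnicodeDecodeError:
        continue   # some codecs have holes and can't decode every byte
    flag = "  <-- candidate" if decoded == "√∏" else ""
    print(f"{codec:10} {decoded}{flag}")
```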

Macintosh Roman.

The euro symbol € gets turned into three characters ‚Ç¨

I was about to post “No, that didn’t work”, and then I tried it again. Bingo!

The dope is just amazing.

Yep, the euro symbol is U+20AC, which encodes to three UTF-8 bytes: E2 82 AC.
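For what it’s worth, Python reproduces the whole chain, assuming its built-in mac_roman codec matches whatever table the program is using:

```python
euro = "€"
utf8 = euro.encode("utf-8")       # the three UTF-8 bytes
print(utf8.hex())                 # e282ac
print(utf8.decode("mac_roman"))   # ‚Ç¨
```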

More weirdness, just adding this for completeness: although this worked for a ton of “foreign” characters, it did not work for all. For instance, Cyrillic is, unsurprisingly, not in Mac Roman; neither, apparently, is ð. But it’s not my program to fix, so I’m washing my hands of it all.
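Just to put a bow on it: if (and this is my assumption about how the workaround is applied) the fix amounts to re-encoding the import file as Mac Roman before the program sees it, then anything Mac Roman simply doesn’t contain can’t make the trip. A quick Python check:

```python
# ø survives because Mac Roman has a slot for it; eth and Cyrillic don't
# exist in the Mac Roman table at all, so they can't be re-encoded that way.
for ch in ("ø", "ð", "Д"):
    try:
        ch.encode("mac_roman")
        print(ch, "is in Mac Roman")
    except UnicodeEncodeError:
        print(ch, "is NOT in Mac Roman")
```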