Please help the Mac guy with a baffling Windows font problem

Grateful thanks in advance to those with the patience to wade through this story. I’ll try to keep it as succinct as possible.

  1. I set up a database in FileMaker Pro for one of our clients, which happens to be the Catholic high school the owners of my company are alumni of. (No doubt, you’re now hearing the phrase pro bono echoing from somewhere!)

  2. This database, intended to track recruitment efforts aimed at prospective students and their families, was populated from two sources: a purchased list, and the school’s existing database of elementary and middle-school students.

  3. Getting the purchased list into the database was no problem. The school’s student list, however, was maintained in an ancient and long-orphaned proprietary database system. All efforts to pry the data out of that system and export it into something that could be used in the new database failed.

  4. So in desperation, I printed out a list of the students and their addresses, scanned all 32 pages into .pdf documents, performed OCR and ultimately prepared a text file that could be imported into my database.

  5. This I did successfully, and on my Mac system, everything in the database I created looks and works just fine.

  6. However, when I put the FileMaker Pro database on the school’s Windows system, a portion of the records from the school’s database (the ones I scanned) appeared with gobbledegook (nonsense characters) on some of the school’s PCs. On others, the info is there, but there is a space between every character in each field of each record.

  7. Changing the font applied to the fields in FileMaker’s Layout mode has no effect on this.

  8. What’s especially weird is that this problem occurs in only a subset (about 540 records) of the school’s records I scanned and imported. The other entries from the school appear normally on the school’s PCs (and remember, EVERYTHING appears normal on my Mac). I double-checked the source files, and ALL of these entries were scanned and prepared in exactly the same way, and appear to be uniform on my Mac.

  9. I re-exported the faulty records from my database into a plain text file, which again appears normal when I open it on my Mac system. But when I opened it in Notepad on the school’s system, I again saw the same weird font with spaces between every character. So I didn’t even bother to import this back into the school’s FileMaker database.
    Is there some sort of Character Set setting or something like that on a PC that would account for this? I’ve been unable to fix this by doing anything within FileMaker Pro.

Any other ideas?

Thanks again for any help.

The computers don’t have the font?

But again, the majority of the records imported from all sources appear normally. It’s only this subset of 540 that show up weird. So I don’t think that’s it.

Hmmm…

Doing some more experimentation in my own database, I find that if I change the font in the layout to Arial, I see the same phenomenon (spacing between the letters) I do on the PCs. I also got it when changing the font to Times Roman.

I’ve also changed it to other fonts that have no effect on the spacing. Weirdness again…Arial is a sans-serif font, but Times Roman is serif. So it’s not a question of that.

In all cases though, when the spaces crop up, they show only in that subset of records. All other records appear normally…and if I pick the right font, every one of them appears normally.

I could have sworn I tried changing the fonts in layout on the PCs, but maybe I didn’t. Does anyone know what voodoo would cause some fonts to arbitrarily insert spaces between letters that aren’t actually there? These are real spaces, by the way. I can put my cursor in the field and move it between letters, and it moves across spaces, not just letters that are widely spaced. And they seem arbitrary…there’s one space between some characters, more between others.

Update #2 (obviously, I should have done some more experimentation before posting this!):

It turns out that the spaces really are there in the faulty records regardless of what font is used, as confirmed using the arrow keys to move through the text. However, some fonts by their nature apparently don’t tend to show them. Viewing them at 12-point type in certain fonts, the faulty records are indistinguishable from the ones that don’t have spaces between the letters.

So apparently for whatever reason, when these particular records (this group only) were scanned, the OCR inserted spaces between the letters of this subset…why, I have no idea. And the spaces are viewable only when you use certain fonts.

I will go back to the original scans and see what can be done to correct this. Even if I get the faulty records to appear properly by changing the font, obviously searches in the database won’t work on names or other information with spaces in them, so I’ll have to fix it on my end.

Sorry to trouble everyone with this!

Export the database as a CSV, then go through it with a text/hex editor like Notepad++ or UltraEdit or whatever equivalent there is for OS X. Do a find/replace on the extra spaces (which are probably some other character code, not regular spaces) and just erase them all.
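
For anyone who would rather script that find/replace, here is a rough Python sketch. Which character is actually being inserted is not known at this point in the thread, so the sample string and the no-break space (U+00A0) in it are only stand-ins, and `describe_odd_chars` and `strip_chars` are made-up helper names.

```python
from collections import Counter


def describe_odd_chars(text):
    """Count every character that is not printable ASCII or ordinary whitespace."""
    return Counter(
        ch for ch in text
        if not (32 <= ord(ch) < 127) and ch not in "\r\n\t"
    )


def strip_chars(text, unwanted):
    """Return text with every character listed in `unwanted` removed."""
    return text.translate({ord(ch): None for ch in unwanted})


# Hypothetical record: a no-break space stands in for the mystery character.
sample = "S\u00a0m\u00a0i\u00a0t\u00a0h,Jo hn"

# Step 1: find out which odd characters are present, and how many of each.
for ch, count in describe_odd_chars(sample).items():
    print(f"U+{ord(ch):04X} appears {count} times")

# Step 2: remove only those characters; the ordinary space in "Jo hn" is kept.
cleaned = strip_chars(sample, "\u00a0")
print(cleaned)
```

Listing the odd characters first is worth the extra step: it tells you what to erase, and it avoids wiping out the legitimate spaces in names and addresses.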

I apparently will have to do something like this.

Still more weirdness: I checked the original .txt files I imported into FileMaker, and there was no OCR error. Those files do NOT have the extra spaces in them. Both the “good” records and the “bad” ones are identical in nature. So somehow, the extra spaces had to have been inserted by FileMaker Pro during the import process.

I’m now newly baffled as to how and why this would have happened.

You probably just don’t see the extra spaces in the text file. I bet if you look at them in a hex editor you’ll see differences.

If you are allowed to send me an excerpt of your file, I can look at it for you.

As for the reason why OCR software does this, sometimes it’s not entirely sure when a word is a word, and it just inserts semi-visible spaces to maintain the character spacing that it thinks was there in the original. If there is a mode to scan as plain text (instead of formatted text, etc.), it can help alleviate this, because hopefully then the software won’t use special spacing characters.

You’re quite right! This is uncharted territory for me, but I downloaded a hex editor and looked at the good and bad files. The bad one does indeed contain the extra characters that showed up as spaces. There are lots of “00” bytes in the hex code that look as though they correspond to the extra spaces.
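
If you don’t have a hex editor handy, the same comparison can be sketched in a few lines of Python. The two byte strings below are invented for illustration (the real files aren’t available here); the “bad” one simply assumes a 00 byte after each letter, as seen in the hex view, and `hex_dump` is a hypothetical helper name.

```python
def hex_dump(data, width=16):
    """Return an offset / hex bytes / printable-ASCII dump of `data`."""
    pad = width * 3 - 1  # column width of a full row of hex bytes
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as itself and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# Invented sample data: the same name, without and with the extra 00 bytes.
good = b"Smith"
bad = b"S\x00m\x00i\x00t\x00h\x00"

print(hex_dump(good))
print(hex_dump(bad))
print("zero bytes in good:", good.count(0))
print("zero bytes in bad: ", bad.count(0))
```

To run it on a real export, read the file in binary mode, for example `open(path, "rb").read()`, so nothing gets decoded or “helpfully” converted before you look at it.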

What’s weird, though, is that when you go through the file itself in a text editor or word processing program, the arrow keys go right by these extra characters and just move from letter to letter, as if there’s nothing between them.

So in the end it WAS an OCR error, but one that wasn’t at all apparent until you dug deeper.

Thanks very much for your help with this!

I suggest you download TextWrangler.
Then use it to “show invisibles” (View->Text Display->Show Invisibles), and see if your mysterious characters show up. If so, you can search for them and replace them with nothing in one step.

That’s almost certainly data in Unicode UTF-16 (16-bit chars rather than 8-bit; for English chars, one byte of each pair is zero and the other matches the ASCII value). It has nothing to do with the font; one source was producing Unicode data and the others weren’t, and something in the process didn’t notice or handle it correctly.
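
If that diagnosis is right, the cleaner fix is to decode the bad data as UTF-16 rather than to delete the zero bytes by hand. Here is a minimal Python sketch of the idea. The function names and sample strings are made up, the “every other byte is zero” check is only a heuristic that suits plain English text, and the UTF-8 fallback is an assumption about how the good records are encoded.

```python
import codecs


def guess_utf16(data):
    """Return a UTF-16 codec name if `data` looks like UTF-16, else None."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the byte-order mark and drops it
    if len(data) < 2 or len(data) % 2:
        return None
    if all(b == 0 for b in data[1::2]):
        return "utf-16-le"  # zero byte comes second in each pair
    if all(b == 0 for b in data[0::2]):
        return "utf-16-be"  # zero byte comes first in each pair
    return None


def to_text(data, fallback="utf-8"):
    """Decode bytes as UTF-16 when they look like it, otherwise use `fallback`."""
    return data.decode(guess_utf16(data) or fallback)


# Made-up sample: the same record encoded two different ways.
bad = "Mary Smith".encode("utf-16-le")
good = "Mary Smith".encode("utf-8")

print(len(good), len(bad))           # the UTF-16 version is twice as long
print(to_text(bad) == to_text(good))
```

Once both kinds of records are decoded to ordinary text, they can be written back out in a single encoding before anything is imported into FileMaker.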

Sure. Most non-programmers’ text editors don’t handle invisible (or non-ASCII) characters very well.

Word has a rudimentary “show hidden characters” feature that can sometimes help with situations like this, but it’s still easier just to look at the raw data with a hex editor and find/replace at that lower level before FileMaker or any higher-level app touches it.

Funny, I already have TextWrangler but didn’t think to use it in this case.

I have used it in the past, however, to pry text out of very old files I can’t otherwise open — most notably, the late and lamented WriteNow, which I used right up until the moment I finally abandoned OS 9 for good. Still the greatest pure word processor ever made for the Mac.