My misfortune is needing to get data out of a big Excel spreadsheet (40 columns and 11,000 rows). I’m not an Excel user and can barely poke around a little as needed to get blocks of cells copied and pasted into a text editor, or export a sheet as a .csv file.
It seems important to include a column that is filled with numbers each of which has a circle immediately around it (the number itself, not the cell). Copy/paste and exporting to .csv just gives a question mark. I am guessing these are some kind of two or three byte constructs. I wish there were some kind of transform inside Excel that would turn these into distinguishable text, even if it’s the hex code for a Unicode character. What is most important is to be able to separate them into unique groups, i.e. to know that two sixteens in circles are the same thing and are different from a seventeen in a circle, regardless of knowing that there was a 16 or 17 per se in there.
I’m using SAS and can go into a binary or other file and read byte values if needed, though I bet the complete structure of a binary file is going to be more complicated than it’s worth.
I am using Excel 2013 and I do not know what version of Excel the file was created in. It also makes heavy use of Japanese characters, different text colors, different background colors, different fonts, and all kinds of other stuff that might be encoding information.
Thank you for any help or insights! This kind of thing comes up every few years and I wouldn’t mind having a better idea of what is going on.
I’ve never seen such a thing in Excel. I’d guess it’s simply characters in some font. But they can’t be the regular numbers in a different font or they’d just show up as numbers in a csv I believe. And AFAICT copy/paste simple strips out the font characterization when pasting into a pure ASCII file so that alpha’s become a’s, etc.
A question mark is usually how programs render characters they don’t have an image for so I’m guessing this is a higher order ASCII character in some dif
So I think they must be so file would then simply put the character in some font. If you click the mouse in one such cell, the entry box just above the spreadsheet itself should show you what standard character is there.
What’s the formatting of these cells? Does Excel think these are text, numbers, or something else?
You might try converting the column to either text or numbers to see how that changes things. The Text To Columns function is an easy way to do this. You’ll probably find a button for this on the Data tab.
Another thought…are these cells formulas or just values?
There are fonts that show circles with numbers inside - Wingdings is one such. You can try changing the font on the column to something standard and see if it’s readable. You can also use the CODE function in an empty column to convert it to ASCII: =CODE(A1).
Circled numerals are legitimate Unicode text, but definitely not ASCII. I suppose you could apply a transformation like ⑱ -> 18, but the OP said that was not necessary.
Re fonts, try something like NOTO or Unifont or Cyberbit, not Wingdings
Excel thinks they are text. Changing the column to number format doesn’t seem to have any effect. I have no idea how to use “Text To Columns” or any other function, or buttons on the Data tab. I really have little understanding of Excel.
Excel says the font is Cambria (Headings). I tried changing it to Courier New and saw no difference in the cells, though the description in Row 3 changed. I’m afraid I have no idea how to use the CODE function or use an empty column.
When I click in a cell, the empty box above the spreadsheet shows a circled number that looks similar (but a bit bigger).
Unfortunately I work in a corporate environment where admin rights to install software are locked down and there’s a lengthy process for getting software approved, even if it’s free.
However, the link you gave about converting Asian characters is very encouraging, and I will check it out! Thanks!
The font was Cambria (Headings) and I tried Courier New, to no avail. NOTO, Unifont, and Cyberbit do not appear in the font list.
Thank you everybody! I may need to find an Excel user to sit with to try some of the above suggestions…
Are we approaching this from the wrong direction? A quick search revealed this page. It says that you can change the ENCODING option in SAS so that it can deal with Unicode text (which includes circled digits and Japanese) correctly. That sounds better than mangling the original file in Excel.
After you export to csv, if there is doubt it was done correctly I would open up the csv in a text editor or hex editor to confirm the exact text encoding. ASCII is pretty foolproof, but Unicode has alternatives like UTF-8 versus UTF-16.
ETA it was not clear to me whether the problematic question marks were real ASCII question marks or not, but that should be clear via the text/hex editor.
Great idea. I had already tried looking at hex codes in a text editor for the copied text, and they are 3F, the question mark. I think I did this with the .csv as well but have to say I’m not sure.
SAS can do things with Unicode, but reading the binary version of the Excel file is a huge mess.
I finally won, with a combination of two of your suggestions!
This was the big key. I was able to save the sheet as tab delimited Unicode text per the linked article. I then did all my editing of the resulting text file with Notepad as the article suggests, which displays and handles Unicode characters.
This was the other piece. I didn’t apply a transformation (if you mean as an operation in Excel), but I found an instance of each circled number and did Find and Replace All. I made circle 18 become plain old 18 – not that I needed the numbers per se, but that was the easiest way to create a unique replacement for each.I did like wise with all the Japanese character strings, which perhaps unsurprisingly only occurred in a few different forms even though there were hundreds of them.
There were other problems. Exporting the sheet as text produced several more columns than are visible in the sheet; I guess they are hidden or something. Some of the cell values contain unprintable characters, which took quite a while to diagnose and weed out. Many of the numeric values had embedded commas, which cause Excel to then encapsulate that cell content with space delimited doublequotes because commas are also potential delimiters themselves, and since this only happened in values that earned commas I had to use some logic to sort all them out. Eventually I created two separate text files out of several of the leftmost rows, and out of a block of rows near the right side of the sheet, because there was some other problem midway in the width of the sheet. Then I read these and merged them together by their line number. This thing is just a huge mess.
I have SAS Access but had the same experience trying it in this project as I did the last few times. It throws a flurry of exceptions and tells the user to contact tech support.
Longer term I have to find a cleaner way of dealing with this type of source!
I have no idea how you’re gathering this data but if it’s in excel, there are some data validation tools you can use to force people to input values within constraints you set. Helpful if you need to or want to do automated analysis on it.
Good to know. At this point I am discovering and exploiting existing data sets, but I think in the future we need to drive how the recording happens.
In a related issue we also need to make the data accessible for years to come, and I don’t think a proprietary format like a given version of Excel is very good for this. I think human-readable low ASCII files are the way to go, but we’ll see…
check the format of those columns. If they are set to “number” or “general”, they shouldn’t add the double quotes when exporting.
text fields do add the double quotes if they include commas, inch symbols and others. When I have these, most commonly the inch symbol, I substitute it before export. What I do is use something which I know isn’t found anywhere else in the document. It can be the letter ñ, or the € symbol, or even a word: substitute every " by DOG (exact format).
Then I open the exported text file and substitute it back. Tadaaaa!
Otherwise and after multiple exports we end up having beauties such as materials called
“”“”““STEEL BALL 2” DIN”“”“”"