regexp help using quote as text vs quote as data identifier

I need to import (using DTS) some comma delimited text files into sql server 2000.
I have discovered that the quotation mark is getting dual use. Not only is some of column data surrounded by quotes, but sometimes the quote marks are part of the data itself, and that’s causing problems.

i.e.
1234, “some data here”, “John “jack” Smith”, 1234, 1234

What’s the conventional wisdom on how to handle that? It looks like something I could key on using regular expressions.

Any suggestions?

Thanks,

E3

I don’t think any regex is going to help you. If the data is coming out like that, the process creating the file is broken. Quotes inside the text fields need to be escaped in the file; I don’t think there’s any logical way of parsing the file in all circumstances if they aren’t.

I was afraid of that. I wasn’t sure. In my mind I thought if I could look for a pair of quotation marks not followed by a comma, we could assume I’d stumbled across an ‘inner quotation mark’. I had a few other thoughts too, that are a little to verbose to mention here… anyway…

If there are an odd number of quotation marks in any entry, this won’t work.

I think a regexp or two could identify problem records, but you would have to probably fix them by hand. You could look for any records that have a closing quote that isn’t followed by either a comma or end-of-line, or opening quotes that aren’t preceded by a comma or beginning of line. You could also look for records that have too many fields, which would indicate an extra quote mark and comma.

ETA: An odd number of quotation marks would be another indication of a bad record.

Well, if you know that the text data never has commas, you can split on them into an array and then escape quotes that aren’t at the ends of a string. If the text data can have commas, but each row has the same number of columns and extra commas are rare, you can take out the rows with commas in the data and fix them by hand.

I’d use a simple search and replace.

Find and replace every occurrence of quote-comma-quote, comma-quote, and quote-comma with some series of characters not in the file, such ##$$##$$ for quote-comma-quote, ##%%##%% for quote-comma, etc. Do the quote-comma-quote first.

Now there should be no quotes in the file delineating fields. Find and replace all remaining quotes with some non-offending character such as single-quotes or even blanks.

Then reverse the search & replaces you did initially to return the field delineators to their proper characters.

As J Cubed noted, this assumes your data doesn’t have commas.