regexp help using quote as text vs quote as data identifier

Enright3 · February 18, 2008, 11:02pm

I need to import (using DTS) some comma delimited text files into sql server 2000.
I have discovered that the quotation mark is getting dual use. Not only is some of column data surrounded by quotes, but sometimes the quote marks are part of the data itself, and that’s causing problems.

i.e.
1234, “some data here”, “John “jack” Smith”, 1234, 1234

What’s the conventional wisdom on how to handle that? It looks like something I could key on using regular expressions.

Any suggestions?

Thanks,

E3

arseNal · February 18, 2008, 11:37pm

I don’t think any regex is going to help you. If the data is coming out like that, the process creating the file is broken. Quotes inside the text fields need to be escaped in the file; I don’t think there’s any logical way of parsing the file in all circumstances if they aren’t.

Enright3 · February 18, 2008, 11:58pm

I was afraid of that. I wasn’t sure. In my mind I thought if I could look for a pair of quotation marks not followed by a comma, we could assume I’d stumbled across an ‘inner quotation mark’. I had a few other thoughts too, that are a little to verbose to mention here… anyway…

ultrafilter · February 19, 2008, 12:22am

If there are an odd number of quotation marks in any entry, this won’t work.

Manduck · February 19, 2008, 12:33am

I think a regexp or two could identify problem records, but you would have to probably fix them by hand. You could look for any records that have a closing quote that isn’t followed by either a comma or end-of-line, or opening quotes that aren’t preceded by a comma or beginning of line. You could also look for records that have too many fields, which would indicate an extra quote mark and comma.

ETA: An odd number of quotation marks would be another indication of a bad record.

J_Cubed · February 19, 2008, 6:13am

Well, if you know that the text data never has commas, you can split on them into an array and then escape quotes that aren’t at the ends of a string. If the text data can have commas, but each row has the same number of columns and extra commas are rare, you can take out the rows with commas in the data and fix them by hand.

JimOfAllTrades · February 19, 2008, 7:05am

I’d use a simple search and replace.

Find and replace every occurrence of quote-comma-quote, comma-quote, and quote-comma with some series of characters not in the file, such ##$$##$$ for quote-comma-quote, ##%%##%% for quote-comma, etc. Do the quote-comma-quote first.

Now there should be no quotes in the file delineating fields. Find and replace all remaining quotes with some non-offending character such as single-quotes or even blanks.

Then reverse the search & replaces you did initially to return the field delineators to their proper characters.

As J Cubed noted, this assumes your data doesn’t have commas.

Topic		Replies	Views
What's with the double commas? In My Humble Opinion	8	566	March 28, 2022
Punctuation: "inside," or "outside", quotation marks? In My Humble Opinion	17	2368	December 13, 2003
Use of " " and punctuation Factual Questions	11	913	October 30, 1999
Filemaker: how to remove quotes from imported data? Factual Questions	2	2422	September 15, 2009
Anybody else having problems cutting and pasting when quoting? About This Message Board	5	702	March 28, 2004

regexp help using quote as text vs quote as data identifier

Related topics