Regular Expression Help

OK, I’m a bit rusty on my regular expressions (like 10 years rusty).

I have a comma delimited text file, and I want to find any value immediately following the 7th comma that does NOT equal “M”, “F”, or “U”.

For example, I want to find the highlighted value ("X") in the following sample string:
"11111111","","Level","Practice","","","","X","01/01/1900",""," "

What would the regular expression be to find that instance?

E3

Something like this:

.*,.*,.*,.*,.*,"[^MFU]",.*

should work, I think. I tried it and it worked in my tests.

Sorry that should have been

.*,.*,.*,.*,.*,.*,.*,"[^MFU]",.*

I left out a few commas

I’d do it with grep and awk, but I’m not a programmer:

echo "$LINE" | awk -F"," '{print $8}' | grep '[A-EG-LN-TV-Z]'

You could probably do that a helluva lot more efficiently with a real regex, but I’m lazy.

If your regex engine supports it, this is more readable:


([^,]+,){7}"[^MFU]",.*

The Perl regex engine and many others support that notation now.

You might also make the first set of parens noncapturing by adding ?: to the very front of the expression inside them. Perl supports that, but I don’t know about any others.

RegExp uses greedy matching so this will not work on a line that has more than 8 commas. I don’t think the OP said how many commas were on a line. You need to anchor the beginning with

^

and instead of

.*

use

[^,]*

Note that Derleth uses this in the more economical version.

Using .* to match a single field is bad; that will match one or more fields. Using [^,]*, will match exactly one field.
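To make the difference concrete, here is a minimal sketch in Python's re module (my choice of engine, not something the OP specified). The sample line and the names greedy, anchored, and line are made up for illustration. Field 8 holds a valid "M", but a later field holds "X".

```python
import re

# Unanchored version built from .* pieces, as posted earlier in the thread.
greedy = re.compile(r'.*,.*,.*,.*,.*,.*,.*,"[^MFU]",.*')

# Anchored version: exactly seven comma-terminated fields, then the field to test.
# (?: ... ) is the noncapturing group mentioned above.
anchored = re.compile(r'^(?:[^,]*,){7}"[^MFU]",')

# Hypothetical line: field 8 is "M" (fine), field 10 is "X".
line = '"1","","Level","Practice","","","","M","01/01/1900","X",""'

# The .* pieces can swallow extra commas, so the greedy pattern
# latches onto the "X" in field 10 and reports a false hit.
print(bool(greedy.search(line)))

# The anchored pattern can only ever look at field 8, so it reports nothing.
print(bool(anchored.search(line)))
```

The anchored pattern still only catches single-character values, which is the limitation discussed in the next reply.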

The replies so far will only find single quoted characters. (So they will find a seventh field “X” but not “XX” or “”.) To find multiple-character strings too, you have to be more careful. If you know that the fields are always double-quoted, you can use something like
"(|[^,]{2,}|[^MFU,])"
to match a quoted string with either zero or at least two characters between the quotes, or with a single character which is not M, F, or U.

This won’t find fields like M or X, though. If you want to find unquoted and malformed fields like M and "M"M as well, add extra conditions to the whole thing:
"(|[^,]{2,}|[^MFU,])"|[^,"][^,]*|[^,]*[^,"]
(this checks for fields which either start or end with a nonquote). Now you must anchor the end to make sure you get the whole field; you end up with something like


/^([^,]*,){7}("(|[^,]{2,}|[^MFU,])"|[^,"][^,]*|[^,]*[^,"])(,.*)?$/

(the desired field is now in $2).
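For anyone who wants to try this out, here is a rough sketch of the same pattern in Python's re module, assuming the second alternative is meant to be [^,"][^,]* and that fields never contain commas. The helper name find_bad and the test prefix are made up for illustration.

```python
import re

# The anchored pattern from above; group 2 holds the offending field.
bad_field = re.compile(
    r'^([^,]*,){7}("(|[^,]{2,}|[^MFU,])"|[^,"][^,]*|[^,]*[^,"])(,.*)?$'
)

def find_bad(line):
    """Return the field after the 7th comma if it is flagged, else None."""
    m = bad_field.match(line)
    return m.group(2) if m else None

# Seven dummy fields, then the field under test, then one trailing field.
prefix = 'a,b,c,d,e,f,g,'
for field in ['"M"', '"X"', '"XX"', '""', 'M']:
    print(field, '->', find_bad(prefix + field + ',rest'))
```

With this pattern only the properly quoted "M" comes back as None; the unquoted M is flagged because it is not quoted at all.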

Really, finding exceptional cases like this is probably better done with actual logic instead of coding some write-only regexp though. Use a regexp to check for proper quoting; then split the line into fields and check each one in code.
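As a sketch of the "actual logic" route, here is one way it might look in Python. Note that this swaps in the standard csv module to do the splitting instead of a quoting-check regexp, so it will not tell a quoted "M" from an unquoted M. The function name bad_rows, the VALID set, and the second sample row are my own inventions.

```python
import csv
import io

VALID = {"M", "F", "U"}  # allowed values, from the original question

def bad_rows(text, index=7):
    """Return (line_number, value) for rows whose field at `index` is not valid.

    `index` is zero-based, so 7 is the value after the 7th comma.
    A row that is too short is reported with value None.
    """
    found = []
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        value = row[index] if len(row) > index else None
        if value not in VALID:
            found.append((reader.line_num, value))
    return found

# First row is the OP's sample; the second is a made-up row with a quoted comma.
sample = (
    '"11111111","","Level","Practice","","","","X","01/01/1900",""," "\n'
    '"abc,def","","Level","Practice","","","","M","01/01/1900",""," "\n'
)
print(bad_rows(sample))
```

Only the first row is reported, since the csv reader treats "abc,def" as a single field and so finds the "M" in the right place on the second row.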

I’m not a real regex guru, but it sure looks to me like everybody’s examples to date will fail if any of the earlier field entries are of the format “abc,def”.

It looks to me like they’d miscount a single field “abc,def” as if it were two fields.

If the source file grammar has the " wrappers for field values optional it gets even messier. Omphaloskeptic’s fine entry covers for this case on the 7th field. But I think you’d want similar logic on fields 1-6.

Good point; I’d been thinking of the commas as outer delimiters, but if quotes can quote commas then [^,]* won’t work. You’d need something like
(("[^"]*"|[^",]*|),){7}
and there would be some changes to the pattern for the field of interest as well.
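
Here is a small sketch of how the two prefixes behave in Python's re module, assuming quoted fields never contain an escaped quote (neither pattern handles that). The names naive and quote_aware and the sample line are made up, and the captured group is simplified to "any quoted field" just to show which field each prefix lands on.

```python
import re

# Prefix that treats every comma as a delimiter.
naive = re.compile(r'^(?:[^,]*,){7}("[^",]*")')

# Prefix that lets a quoted field contain commas.
quote_aware = re.compile(r'^(?:(?:"[^"]*"|[^",]*),){7}("[^"]*")')

# Hypothetical line whose first field contains a comma; the real 8th field is "X".
line = '"abc,def","","Level","Practice","","","","X","01/01/1900"'

# The naive prefix counts "abc and def" as two fields, so it stops one field early.
print(naive.match(line).group(1))

# The quote-aware prefix skips "abc,def" as one field and lands on "X".
print(quote_aware.match(line).group(1))
```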