Regular expression: matching "everything but"

chorpler · April 9, 2009, 12:38am

I’m editing a big HTML file with a bunch of CSS classes. I want to convert everything that says:

<p class=StyleXXX>

where XXX is a number from 1-652, into

<p class=Normal>

Except for classes 115, 121, 122, 130, 134, 137, 140, 146, 147, 148, and 149, which I want to convert to

<p class=Blockquote>

I figured out how to match up those classes with the following regex-using perl program:



open(INPUT,"<$ARGV[0]") or die;
@input_array=<INPUT>;
close(INPUT);
$input_scalar=join("",@input_array);

$input_scalar =~ s/(<p class=Style(115|121|122|130|134|137|140|146|147|148|149)>)(.*?)(<\/p>)/<p class=Blockquote>$3<\/p>/g;

print($input_scalar);

This takes an input file and prints it to standard out; I usually redirect it to a file after I determine that it works.

That handles all those “exceptional” classes. But how can I invert that to come up with my other search and replace? There are way too many numbers to put them all in like I did with that one…

Thanks for any help.

friedo · April 9, 2009, 12:45am

Perl supports a zero-width negative look-ahead assertion (which also happens to be my favorite instance of regex jargon.)

So you can do this:



s/<p class=Style(?!(115|121|122|130|134|137|140|146|147|148|149))\d{3}>/<p class=Normal>/g;

What this says is:

Match <p class=Style
not followed by the list of numbers
Followed by three digits

I think that should do it.

BTW, you are using strict and warnings, right?

ultrafilter · April 9, 2009, 12:47am

It’s not elegant, but you can use one regex to make sure that you have a tag of the form <p class=StyleXXX>, and then use another regex to decide whether that three-digit sequence is in that list. If so, replace it with Blockquote; if not, replace it with Normal.
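A rough sketch of that two-step idea, assuming the tags look like <p class=StyleN>. It swaps the second regex for a hash lookup, and uses the /e modifier so the replacement is evaluated as Perl code. The sample input is made up for illustration.

```perl
use strict;
use warnings;

# Exceptional class numbers from the original question, stored as hash keys.
my %blockquote = map { $_ => 1 }
    qw(115 121 122 130 134 137 140 146 147 148 149);

# Hypothetical sample input; the real file is not shown in the thread.
my $html = '<p class=Style12>a</p><p class=Style121>b</p><p class=Style300>c</p>';

# Step 1: match any <p class=StyleN> tag and capture N.
# Step 2: /e runs the replacement as code, which picks the new class name
# depending on whether N is in the exception list.
$html =~ s{<p class=Style(\d{1,3})>}
          { '<p class=' . ($blockquote{$1} ? 'Blockquote' : 'Normal') . '>' }ge;

print "$html\n";
```

Because the substitution runs on the whole string with /g, the rest of the file passes through untouched and can be printed as before.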

chorpler · April 9, 2009, 12:59am

Thanks, friedo! I had found the zero-width negative look-ahead operator before, but I left off the trailing \d{1,3} (just \d{3} won’t do it because, as I forgot to mention, the styles go from Style1 to Style652, rather than Style001) and it was only matching the Style text and was leaving the numbers unchanged, so I thought I was misunderstanding “zero-width negative look-ahead.” But of course, I just had it wrong. Thanks!
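For anyone following along, here is a small sketch of the look-ahead with \d{1,3}. The sample input is hypothetical, and putting the > inside the look-ahead is my own addition so that only exact numbers are excluded.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real file is not shown in the thread.
my $html = join "\n",
    '<p class=Style7>seven</p>',
    '<p class=Style115>quoted</p>',
    '<p class=Style652>last</p>';

# The exceptional class numbers listed in the original question.
my $except = '115|121|122|130|134|137|140|146|147|148|149';

# (?!...) checks what comes next without consuming it, so \d{1,3} still has
# to match the digits themselves. Leaving \d{1,3} out is why only the
# "Style" text matched and the numbers were left behind.
(my $out = $html) =~ s/<p class=Style(?!(?:$except)>)\d{1,3}>/<p class=Normal>/g;

print "$out\n";
```

Style7 and Style652 become Normal, while Style115 is left alone for the Blockquote substitution to pick up.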

ultrafilter, I thought about doing something like that, but it’s been so long since I used Perl that I’ve completely forgotten it, so I couldn’t tell how to put the output of one regex into the next and still output the whole file the way my current search-and-replace statement does.

Omphaloskeptic · April 9, 2009, 5:29am

Alternately, if you do all of the exceptional replacements first, then you should just be able to follow them with a catchall


$input_scalar =~ s/(<p class=Style\d{1,3}>)(.*?)(<\/p>)/<p class=Normal>$2<\/p>/g;

since there won’t be any of the exceptional values left to match.
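Put together, the two passes might look like the sketch below. The sample input is made up, and only the opening tag is rewritten here, which should give the same result since the paragraph text and closing tag are left as they are.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $html = "<p class=Style130>quote</p>\n<p class=Style42>body</p>\n";

# Pass 1: the exceptional classes become Blockquote.
$html =~ s/<p class=Style(?:115|121|122|130|134|137|140|146|147|148|149)>/<p class=Blockquote>/g;

# Pass 2: any StyleN tag still left becomes Normal.
$html =~ s/<p class=Style\d{1,3}>/<p class=Normal>/g;

print $html;
```

The order matters: running the catchall first would turn the exceptional classes into Normal as well.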

Also, the (.*?) in your regexp won’t match multiline constructs (e.g., where the <p> and </p> tags are on separate lines) unless you use the /s modifier. You may want to use /i as well (or [Pp]), if the tags might be <P>…</P>.
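A quick way to see the difference, using a made-up paragraph with upper-case tags split across two lines:

```perl
use strict;
use warnings;

# Hypothetical paragraph: upper-case tags, text spanning two lines.
my $html = "<P class=Style42>first line\nsecond line</P>\n";

# Without /s, "." stops at the newline, so the pattern cannot reach </P>.
my $plain = ($html =~ /<p class=Style\d{1,3}>(.*?)<\/p>/i) ? 1 : 0;

# With /s, "." also matches newlines; /i ignores the case of the tags.
(my $out = $html) =~ s/<p class=Style\d{1,3}>(.*?)<\/p>/<p class=Normal>$1<\/p>/gsi;

print "matched without /s: $plain\n";
print $out;
```

Note that /s only helps when the whole file is in one string, as in the full program above; with perl -pe each line is processed separately, so a paragraph split across lines is never seen in one piece.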

Something I find useful for these quick search-and-replace operations is perl -pe <perl-command> <file>. This sets $_ equal to each line of <file> and runs <perl-command> on it, then prints the result, so you can try different things quickly. When you get it to work, use perl -p -i.bak -e, which then saves the original file as <file>.bak and puts the modifications in <file>.

Reply · April 9, 2009, 6:10am

Wow. Regexes rule. The Dope rules more!

:worship:

chorpler · April 9, 2009, 6:51am

Thanks Omphaloskeptic. I was doing perl -pe at first, but I couldn’t get it to work with multiple lines (as you mention), so I switched to a full perl program. Then I realized I didn’t really WANT my file to be multiline, so I consolidated it so that every statement in the HTML file is on a single line, separated by two CR-LF’s, and just turned on line wrapping in my editor. That way I didn’t have to ever worry about the multiline problem. I appreciate the extra information for the future, though – undoubtedly it will come in handy at some point, probably soon.
