Regular expression: matching "everything but"

I’m editing a big HTML file with a bunch of CSS classes. I want to convert everything that says:

<p class=StyleXXX>

where XXX is a number from 1-652, into

<p class=Normal>

Except for classes 115, 121, 122, 130, 134, 137, 140, 146, 147, 148, and 149, which I want to convert to

<p class=Blockquote>

I figured out how to match up those classes with the following regex-using perl program:



open(INPUT,"<$ARGV[0]") or die;
@input_array=<INPUT>;
close(INPUT);
$input_scalar=join("",@input_array);

$input_scalar =~ s/(<p class=Style(115|121|122|130|134|137|140|146|147|148|149)>)(.*?)(<\/p>)/<p class=Blockquote>$3<\/p>/g;

print($input_scalar);


This takes an input file and prints it to standard out; I usually redirect it to a file after I determine that it works.

That handles all those “exceptional” classes. But how can I invert that to come up with my other search and replace? There are way too many numbers to put them all in like I did with that one…

Thanks for any help.

Perl supports a zero-width negative look-ahead assertion (which also happens to be my favorite instance of regex jargon.)

So you can do this:



s/<p class=Style(?!(115|121|122|130|134|137|140|146|147|148|149))\d{3}>/<p class=Normal>/g;


What this says is:

  1. Match <p class=Style
  2. Not followed by any number in the list
  3. Followed by three digits and the closing >

I think that should do it.

BTW, you are using strict and warnings, right? :wink:
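For what it's worth, here is a rough sketch of how the program from the first post might look under strict and warnings. The sub name to_blockquote is made up for illustration, and the three-argument open with a lexical filehandle is my own choice rather than anything from the original; treat it as one possible shape, not the definitive version.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites the "exceptional" Style classes to Blockquote.
sub to_blockquote {
    my ($html) = @_;
    $html =~ s{<p class=Style(?:115|121|122|130|134|137|140|146|147|148|149)>(.*?)</p>}
              {<p class=Blockquote>$1</p>}g;
    return $html;
}

# Only touch the filesystem when a file name is given on the command line.
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    local $/;                 # slurp mode: read the whole file at once
    my $html = <$in>;
    close($in);
    print to_blockquote($html);
}
```

With strict on, every variable has to be declared with my, and warnings will flag things like \3 in a replacement, where Perl prefers $3.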

It’s not elegant, but you can use one regex to make sure that you have a tag of the form <p class=Style(\d\d\d)>, and then use another regex to decide whether that three-digit sequence is in that list. If so, replace it with Blockquote; if not, replace it with Normal.
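One way to sketch that two-step idea in a single substitution is the /e modifier, which evaluates the replacement as Perl code, so the captured number can be looked up before the new tag is built. The names %is_blockquote and restyle are invented for this example, and \d{1,3} assumes the class numbers are not zero-padded.

```perl
use strict;
use warnings;

# The exceptional class numbers from the original post, as a lookup table.
my %is_blockquote = map { $_ => 1 }
    qw(115 121 122 130 134 137 140 146 147 148 149);

# Hypothetical helper: one regex finds the tag, the hash decides the new class.
sub restyle {
    my ($html) = @_;
    $html =~ s{<p class=Style(\d{1,3})>}
              {'<p class=' . ($is_blockquote{$1} ? 'Blockquote' : 'Normal') . '>'}ge;
    return $html;
}
```

Because the substitution runs over the whole string with /g, the rest of the file passes through unchanged and can be printed as before.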

Thanks, friedo! I had found the zero-width negative look-ahead operator before, but I left off the trailing \d{1,3}, so it was only matching the Style text and leaving the numbers unchanged. That made me think I was misunderstanding “zero-width negative look-ahead,” but of course I just had it wrong. (Just \d{3} won’t do it because, as I forgot to mention, the styles go from Style1 to Style652, rather than Style001.) Thanks!
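To make the point about the trailing digits concrete, here is a small sketch of the look-ahead version with \d{1,3}. The look-ahead itself consumes nothing, so the digits still have to be matched afterwards or they stay in the output. The sub name to_normal is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: every Style class except the exceptional ones becomes Normal.
sub to_normal {
    my ($html) = @_;
    # (?!...) only checks what comes next; \d{1,3}> then consumes the number and bracket.
    $html =~ s{<p class=Style(?!(?:115|121|122|130|134|137|140|146|147|148|149))\d{1,3}>}
              {<p class=Normal>}g;
    return $html;
}
```

Called on a string containing Style1, Style42, or Style652 tags, this should rewrite them to Normal while leaving something like Style115 alone for the Blockquote pass.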

ultrafilter, I thought about doing something like that, but it’s been so long since I used Perl that I’ve completely forgotten it, so I couldn’t tell how to put the output of one regex into the next and still output the whole file the way my current search-and-replace statement does.

Alternately, if you do all of the exceptional replacements first, then you should just be able to follow them with a catchall


$input_scalar =~ s/(<p class=Style\d{1,3}>)(.*?)(<\/p>)/<p class=Normal>$2<\/p>/g;

since there won’t be any of the exceptional values left to match.
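Put together, the two passes might look roughly like this. The sub name convert_styles is hypothetical, and for brevity this sketch rewrites only the opening tag, since the closing </p> is the same either way.

```perl
use strict;
use warnings;

# Hypothetical helper: pass 1 handles the exceptions, pass 2 sweeps up the rest.
sub convert_styles {
    my ($html) = @_;
    # Pass 1: the exceptional classes become Blockquote.
    $html =~ s{<p class=Style(?:115|121|122|130|134|137|140|146|147|148|149)>}
              {<p class=Blockquote>}g;
    # Pass 2: any Style tag still left must be an ordinary one.
    $html =~ s{<p class=Style\d{1,3}>}{<p class=Normal>}g;
    return $html;
}
```

The order matters: run the catchall first and it would swallow the exceptional classes too.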

Also, the (.*?) in your regexp won’t match multiline constructs (e.g., where the <p> and </p> tags are on separate lines) unless you use the /s modifier. You may want to use /i as well (or [Pp]), if the tags might be <P>…</P>.
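A quick way to see the effect of /s is to run the same substitution with and without it on a paragraph that spans several lines. The sample string and variable names below are invented for the demonstration.

```perl
use strict;
use warnings;

# Sample paragraph whose tags sit on separate lines and use upper-case P.
my $html = "<P class=Style115>\nquoted text\n</P>";

# Without /s, "." stops at newlines, so this pattern does not match at all.
(my $plain = $html) =~ s{<p class=Style115>(.*?)</p>}{<p class=Blockquote>$1</p>}gi;

# With /s, "." also matches newlines, so the multiline paragraph is rewritten.
(my $fixed = $html) =~ s{<p class=Style115>(.*?)</p>}{<p class=Blockquote>$1</p>}gis;
```

In both cases /i lets the lower-case pattern match the upper-case <P> tags; note that the replacement writes the tags back in lower case.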

Something I find useful for these quick search-and-replace operations is perl -pe <perl-command> <file>. This sets $_ equal to each line of <file> and runs <perl-command> on it, then prints the result, so you can try different things quickly. When you get it to work, use perl -p -i.bak -e, which then saves the original file as <file>.bak and puts the modifications in <file>.

Wow. Regexes rule. The Dope rules more!

:worship:

Thanks Omphaloskeptic. I was doing perl -pe at first, but I couldn’t get it to work with multiple lines (as you mention), so I switched to a full perl program. Then I realized I didn’t really WANT my file to be multiline, so I consolidated it so that every statement in the HTML file is on a single line, separated by two CR-LF’s, and just turned on line wrapping in my editor. That way I didn’t have to ever worry about the multiline problem. I appreciate the extra information for the future, though – undoubtedly it will come in handy at some point, probably soon.