I need to process a long string that contains a bunch of keywords. Each “record” begins with a keyword, then depending on THAT keyword, other keywords follow. For example:
BEGIN PART1 blah blah blah BEGIN PART99 blah blah blah
So, how do I loop through and deal with each different sort of item? I started with a foreach loop, but then I have to keep a flag set for the type of item and that seems clumsy. Then I did a for ($i=0, $i++…) loop and that wasn’t much better. There has to be a way to deal with this, but I’m not seeing it.
$_ = 'some really long string';
while (length) {
    if (s/^BEGIN PART1 (\w+) (\w+) (\w+) //) {
        # do something with $1, $2 and $3
    }
    elsif (s/^BEGIN PART99 (\w+) (\w+) (\w+) //) {
        # do something with the parts
    }
    # etc.
    else {
        die "No match found starting at $_";
    }
}
Basically, this chops any matching section off the front of the string in $_ and repeats the loop until the string is empty or no match is found.
Chop it up into individual records with the **split** operator (or something similar), then load a hash with each PARTXX:substring pair as the key:value (this assumes a 1:1 correspondence; otherwise push the values onto a sub-array). Then use a switch-style conditional in a foreach loop over the hash keys. If you have anything like a tree structure, hashes are your friend.
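A minimal sketch of that approach, using the BEGIN PART example from the OP (the %records name and sub-array layout are my own assumptions):

```perl
my $data = 'BEGIN PART1 blah blah blah BEGIN PART99 bleh bleh bleh';

# Split on the record keyword; drop the empty field before the first BEGIN.
my @chunks = grep { length } split /BEGIN /, $data;

# Hash keyed by PARTXX; each value is a sub-array of that record's words,
# so repeated PARTXX keywords simply accumulate.
my %records;
for my $chunk (@chunks) {
    my ($type, @fields) = split ' ', $chunk;
    push @{ $records{$type} }, \@fields;
}

# Dispatch on the key, switch-style:
for my $type (sort keys %records) {
    if    ($type eq 'PART1')  { }   # handle PART1 records
    elsif ($type eq 'PART99') { }   # handle PART99 records
}
```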
Any chance you can include some of the actual string, and expected outputs? Your OP’s a bit nebulous.
Sheesh, I gotta get more sleep. Or more caffeine. I’ll be more specific.
So, I’ve already loaded everything into an array. The keyword that starts a sequence is “DEFINE” and then the type of item comes next, followed by the name of the item. I’m going to load an associative array for each item name, and depending on the type of item that array will have different data.
So:
DEFINE JUNK Junk12 stuff stuff stuff DEFINE FUBAR fubar33 foo bar etc.
JUNK type items have a different set of properties from FUBAR type items, so as soon as I find JUNK I go one way, FUBAR takes different processing. I wasn’t planning on subroutines because I’m weak on passing an array to a sub and then knowing where I was when I get back.
Maybe I create a different array for each DEFINE and then process each of those individually?
for ($i = 0; $i <= $#line; $i++) {
    if ($line[$i] =~ /^DEFINE/) {
        $i++;
        # $line[$i] is now the item type
        if ($line[$i] =~ /JUNK/) {
            $i++;
            # now $line[$i] is the item name
            # create an assoc array for JUNK with item name as index?
        } elsif ($line[$i] =~ /FUBAR/) {
            …
        }
        # every following $line item is part of the above array,
        # so my loop continues and I load that up
        # a new DEFINE starts a new assoc array
    }
}
Now loop through each assoc array?
Seems complex, but maybe it has to be.
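For what it’s worth, the index-bumping version can be made to work; here is a sketch assuming the DEFINE TYPE NAME layout described above (%items and its fields are names I made up for illustration):

```perl
my @line = qw(DEFINE JUNK Junk12 stuff1 stuff2 DEFINE FUBAR fubar33 foo bar);

my %items;     # item name => { type => ..., data => [...] } (assumed layout)
my $current;   # ref to the hash we are currently filling

my $i = 0;
while ($i <= $#line) {
    if ($line[$i] eq 'DEFINE') {
        my $type = $line[$i + 1];
        my $name = $line[$i + 2];
        $items{$name} = { type => $type, data => [] };
        $current = $items{$name};
        $i += 3;                          # step past DEFINE, type, and name
    }
    else {
        push @{ $current->{data} }, $line[$i];   # property of the open item
        $i++;
    }
}
```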
Shoot, I don’t know how to do spacing that will show up. It looks like crap (even worse than my code generally is) left-justified.
Just curious – why don’t you split your original data on "DEFINE "? That is:
@defines = split(/DEFINE /, $_);
foreach $define (@defines) {
    next unless length $define;   # split leaves an empty field before the first DEFINE
    # process each DEFINE similar to your proposed if/elsifs
    # after further splitting each $define into another array
}
The raw data is multiple lines split in random spots. I’m currently reading each line, splitting on spaces, and putting the words into one big array. I suppose I could join the array back into a string, split that on DEFINE, and then split each of those on spaces.
Is there a way to directly split an array into multiple arrays?
my @lines = <ARGV>;
my @words = map { split } @lines;
to avoid the string concatenation. (This assumes that a word never spans multiple lines.) Alternatively, you can read the entire file into a single string by undefining the record separator:
{   # braces for localization of $/
    local $/;
    $_ = <ARGV>;   # $_ contains the entire file (assumes a single file in @ARGV)
}
@words = split;
In both of these cases you are reading the entire file into memory before processing, which may be an issue if the file is very large. If memory usage is an issue you can maintain a buffer containing the last partial record read, trimming this down whenever a record becomes complete.
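A rough sketch of that buffering idea, assuming every record starts with DEFINE (the @incoming array here stands in for line-at-a-time reads from <ARGV>):

```perl
# Stand-in for reading lines one at a time, e.g. while (my $line = <ARGV>):
my @incoming = ("DEFINE JUNK Junk12 stuff stuff\n",
                "more DEFINE FUBAR fubar33 foo bar\n");

my $buf = '';
my @records;                    # completed records, trimmed out of the buffer
for my $line (@incoming) {
    $buf .= $line;
    # A record is complete once the NEXT "DEFINE" shows up in the buffer.
    # (This assumes the word DEFINE never appears inside a record's data.)
    while ($buf =~ s/^\s*(DEFINE\s.*?)(?=DEFINE\s)//s) {
        push @records, $1;      # only the partial last record stays in $buf
    }
}
push @records, $buf if $buf =~ /\S/;   # flush the final record at EOF
```

Only one record’s worth of text is ever held in $buf, so memory stays bounded no matter how large the file is.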
Words are not split across lines, thank bog. Sorry, “random” was a bit overenthusiastic. Records are split across lines, but at word boundaries.
So if I give the file name as an argument, and then do $_ = <ARGV> I'll get the entire file in one big $_ string? Nice! Then I’ll use “split on DEFINE” and have my easy-to-process records.
Right. But if you change the definition of the record separator $/ you change what Perl thinks of as the end of a line. In particular, **undef $/; $_ = <ARGV>;** will read to the end of the current file, not just to the first newline. The braces around this block in my code example above were so that the value of $/ was not changed for the rest of the program (since other places in the code may reasonably expect line-oriented reads).
One more clarification: The string $_ will contain newlines. It won’t be “one, big line” so much as “one
big
string.” So you should be prepared for arguments separated with whitespace other than just spaces, when you do your record processing.
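To that last point: Perl’s default split already treats any run of whitespace, newlines included, as a single separator, so the usual idiom needs no change:

```perl
# A multi-line record string, as read with $/ undefined:
$_ = "DEFINE JUNK Junk12\nstuff stuff\nDEFINE FUBAR fubar33";

my @words = split;   # default split: any whitespace run, including newlines
```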