I need to process a long string that contains a bunch of keywords. Each “record” begins with a keyword, then depending on THAT keyword, other keywords follow. For example:
BEGIN PART1 blah blah blah BEGIN PART99 blah blah blah
So, how do I loop through and deal with each different sort of item? I started with a foreach loop, but then I have to keep a flag set for the type of item and that seems clumsy. Then I did a for ($i=0, $i++…) loop and that wasn’t much better. There has to be a way to deal with this, but I’m not seeing it.
$_ = 'some really long string';
while (length) {
    if (s/^BEGIN PART1 (\w+) (\w+) (\w+) //) {
        # do something with $1, $2 and $3
    }
    elsif (s/^BEGIN PART99 (\w+) (\w+) (\w+) //) {
        # do something with the parts
    }
    # etc.
    else {
        die "No match found starting at $_";
    }
}
Basically, this chops any matching section off the front of the string in $_ and repeats the loop until the string is empty or no match is found.
Chop it up into individual records with the **split** operator (or something similar), then load a hash with each PARTXX:substring pair as the key:value (this assumes a 1:1 correspondence; otherwise push the values onto a sub-array). Then use a switch-style conditional in a foreach loop over the hash keys. If you have anything like a tree structure, hashes are your friend.
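A minimal sketch of that approach, using the BEGIN PART example from the OP (the %records name and sub-array layout are my own assumptions):

```perl
my $data = 'BEGIN PART1 blah blah blah BEGIN PART99 bleh bleh bleh';

# Split on the record keyword; drop the empty field before the first BEGIN.
my @chunks = grep { length } split /BEGIN /, $data;

# Hash keyed by PARTXX; each value is a sub-array of that record's words,
# so repeated PARTXX keywords simply accumulate.
my %records;
for my $chunk (@chunks) {
    my ($type, @fields) = split ' ', $chunk;
    push @{ $records{$type} }, \@fields;
}

# Dispatch on the key, switch-style:
for my $type (sort keys %records) {
    if    ($type eq 'PART1')  { }   # handle PART1 records
    elsif ($type eq 'PART99') { }   # handle PART99 records
}
```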
Any chance you can include some of the actual string, and expected outputs? Your OP’s a bit nebulous.
Sheesh, I gotta get more sleep. Or more caffeine. I’ll be more specific.
So, I’ve already loaded everything into an array. The keyword that starts a sequence is “DEFINE” and then the type of item comes next, followed by the name of the item. I’m going to load an associative array for each item name, and depending on the type of item that array will have different data.
So:
DEFINE JUNK Junk12 stuff stuff stuff DEFINE FUBAR fubar33 foo bar etc.
JUNK type items have a different set of properties from FUBAR type items, so as soon as I find JUNK I go one way, FUBAR takes different processing. I wasn’t planning on subroutines because I’m weak on passing an array to a sub and then knowing where I was when I get back.
Maybe I create a different array for each DEFINE and then process each of those individually?
for ($i = 0; $i <= $#line; $i++) {
    if ($line[$i] =~ /^DEFINE/) {
        $i++;
        # $line[$i] is now the item type
        if ($line[$i] =~ /JUNK/) {
            $i++;
            # now $line[$i] is the item name
            # create an assoc array for JUNK with item name as index?
        } elsif ($line[$i] =~ /FUBAR/) {
            …
        }
        # every following $line item is part of the above array,
        # so my loop continues and I load that up
        # a new DEFINE starts a new assoc array
    }
}
Now loop through each assoc array?
Seems complex, but maybe it has to be.
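For what it’s worth, the index-bumping version can be made to work; here is a sketch assuming the DEFINE TYPE NAME layout described above (%items and its fields are names I made up for illustration):

```perl
my @line = qw(DEFINE JUNK Junk12 stuff1 stuff2 DEFINE FUBAR fubar33 foo bar);

my %items;     # item name => { type => ..., data => [...] } (assumed layout)
my $current;   # ref to the hash we are currently filling

my $i = 0;
while ($i <= $#line) {
    if ($line[$i] eq 'DEFINE') {
        my $type = $line[$i + 1];
        my $name = $line[$i + 2];
        $items{$name} = { type => $type, data => [] };
        $current = $items{$name};
        $i += 3;                          # step past DEFINE, type, and name
    }
    else {
        push @{ $current->{data} }, $line[$i];   # property of the open item
        $i++;
    }
}
```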
Shoot, I don’t know how to do spacing that will show up. It looks like crap (even worse than my code generally is) left-justified.
Just curious – why don’t you split your original data on "DEFINE "? That is:
@defines = split(/DEFINE /, $_);
foreach $define (@defines) {
    next unless length $define;   # split leaves an empty field before the first DEFINE
    # process each DEFINE similar to your proposed if/elsifs
    # after further splitting each $define into another array
}
The raw data is multiple lines split in random spots. I’m currently reading each line, splitting on spaces, and putting the words into one big array. I suppose I could join the array back into a string, split that on DEFINE, and then split each of those on spaces.
Is there a way to directly split an array into multiple arrays?
my @lines = <ARGV>;
my @words = map { split } @lines;
to avoid the string concatenation. (This assumes that a word never spans multiple lines.) Alternatively, you can read the entire file into a single string by undefining the record separator:
{   # braces for localization of $/
    local $/;
    $_ = <ARGV>;   # $_ contains the entire file (assumes a single file in @ARGV)
}
@words = split;
In both of these cases you are reading the entire file into memory before processing, which may be an issue if the file is very large. If memory usage is an issue you can maintain a buffer containing the last partial record read, trimming this down whenever a record becomes complete.
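A rough sketch of that buffering idea, assuming every record starts with DEFINE (the @incoming array here stands in for line-at-a-time reads from <ARGV>):

```perl
# Stand-in for reading lines one at a time, e.g. while (my $line = <ARGV>):
my @incoming = ("DEFINE JUNK Junk12 stuff stuff\n",
                "more DEFINE FUBAR fubar33 foo bar\n");

my $buf = '';
my @records;                    # completed records, trimmed out of the buffer
for my $line (@incoming) {
    $buf .= $line;
    # A record is complete once the NEXT "DEFINE" shows up in the buffer.
    # (This assumes the word DEFINE never appears inside a record's data.)
    while ($buf =~ s/^\s*(DEFINE\s.*?)(?=DEFINE\s)//s) {
        push @records, $1;      # only the partial last record stays in $buf
    }
}
push @records, $buf if $buf =~ /\S/;   # flush the final record at EOF
```

Only one record’s worth of text is ever held in $buf, so memory stays bounded no matter how large the file is.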
Words are not split across lines, thank bog. Sorry, “random” was a bit overenthusiastic. Records are split across lines, but at word boundaries.
So if I give the file name as an argument, and then do $_ = <ARGV> I'll get the entire file in one big $_ string? Nice! Then I’ll use “split on DEFINE” and have my easy-to-process records.
Right. But if you change the definition of the record separator $/ you change what Perl thinks of as the end of a line. In particular, **undef $/; $_ = <ARGV>;** will read to the end of the current file, not just to the first newline. The braces around this block in my code example above were so that the value of $/ was not changed for the rest of the program (since other places in the code may reasonably expect line-oriented reads).
One more clarification: The string $_ will contain newlines. It won’t be “one, big line” so much as “one
big
string.” So you should be prepared for arguments separated with whitespace other than just spaces, when you do your record processing.
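To that last point: Perl’s default split already treats any run of whitespace, newlines included, as a single separator, so the usual idiom needs no change:

```perl
# A multi-line record string, as read with $/ undefined:
$_ = "DEFINE JUNK Junk12\nstuff stuff\nDEFINE FUBAR fubar33";

my @words = split;   # default split: any whitespace run, including newlines
```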