Simple PERL programming question

pulykamell · May 3, 2006, 8:11pm

This is probably trivial, but I’m pretty new to PERL, and I can’t figure out what the neatest way to accomplish the following is:

I have one file that is a list of strings that need to be found. Let’s call it KEYWORDS.TXT. The file contains about 100 or so keywords, listed one per line in the format KEYWORD\n (that is, simply a word followed by a newline character).

I have another file that contains the text that needs to be searched. Let’s call it TEXT.TXT.

What I need to do is iterate over each line of text, and check whether it contains any of the keywords. So, basically, while (<INFILE>) if $_ contains an item from KEYWORDS.TXT print OUTFILE, else loop.

So, do I need to read KEYWORDS.TXT into an array and iterate through each item of the array for every cycle of the main loop? I’m not quite sure how all the I/O works in PERL, so I don’t know how to read this into an array.

Or is there some simpler solution I’m missing? I’m fairly new to this, so go easy on me.

dre2xl · May 3, 2006, 8:18pm

Well, you could do the double iteration, but another way is to read KEYWORDS.TXT into a hash.

Such as this:

my %keywordsHash;
$keywordsHash{$keywordsFileLines[0]} = 1;
$keywordsHash{$keywordsFileLines[1]} = 1;

and so on.

Then, when you’re reading TEXT.TXT, do this:

foreach (@textFile)
{
    # $_ is the current line; it has to equal a hash key exactly
    if ($keywordsHash{$_})
    {
        print OUTFILE $_;
    }
}
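
Put together, the hash approach might look like the following minimal sketch. It assumes each line of TEXT.TXT is a single word that must equal a keyword exactly. The helper names load_keywords and matching_lines are made up for illustration, and the demo reads from in-memory strings rather than the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: read keywords (one per line) into a hash for fast lookup.
sub load_keywords {
    my ($fh) = @_;
    my %keywords;
    while (my $line = <$fh>) {
        chomp $line;
        $keywords{$line} = 1 if length $line;   # ignore blank lines
    }
    return \%keywords;
}

# Hypothetical helper: return the lines whose whole content is a keyword.
sub matching_lines {
    my ($fh, $keywords) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line if exists $keywords->{$line};
    }
    return @hits;
}

# Demo on in-memory data; real code would open KEYWORDS.TXT and TEXT.TXT instead.
my $keyword_data = "apple\nbanana\n";
my $text_data    = "apple\ncherry\nbanana\n";
open my $demo_kwfh, '<', \$keyword_data or die "Can't open keyword data: $!";
open my $demo_tfh,  '<', \$text_data    or die "Can't open text data: $!";

my $demo_keywords = load_keywords($demo_kwfh);
print "$_\n" for matching_lines($demo_tfh, $demo_keywords);   # prints apple, then banana
```

Each lookup is a single hash access, so the cost per line does not grow with the number of keywords.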

Meros · May 3, 2006, 8:59pm

It looks like that would only work if your text.txt had one word per line. Otherwise, the key for your hash would contain the entire line and you would end up with a lot of false results.

If there’s more than one word per line of the file to be scanned, then I would do a double iteration. (please ignore my sloppy code, i’m writing this quickly, god i love perl )



open FH, "<keywords.txt" or die "Can't open keywords.txt: $!";

while(<FH>){
    chomp;                  # chomp returns a count, not the string, so trim $_ in place
    push @keywords, $_;
}
close FH;

That loads your keywords into an array to step through. Then, for the other file (for the purposes of this, I’m going to dump the lines that match into another variable for writing to a file):



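open FH, "<text.txt" or die "Can't open text.txt: $!";
while(<FH>){

    $line=$_;
     foreach $key (@keywords){
         # match the keyword literally; a "/$key/" string would look for actual slashes
         if($line =~ /\Q$key\E/){
               $newtxt=$newtxt.$line;
               last;   # stop at the first hit so the line is added only once
         }
    }
}
close FH;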

Again, this as it stands may or may not be functional (haven’t tested it, but i’ve been writing code like this all day), but hopefully I got the gist across.

friedo · May 3, 2006, 9:05pm

First iterate over the keywords and build an array containing each keyword as an element:



use strict;
use warnings;

my @keywords;
open my $kwfh, "KEYWORDS.TXT" or die $!;
while( <$kwfh> ) { 
    chomp;
    push @keywords, $_;
}

Then iterate over the lines in TEXT and check to see if each keyword appears:



open my $tfh, "TEXT.TXT" or die $!;
while( <$tfh> ) { 
    foreach my $kw( @keywords ) { 
        if( /\Q$kw/ ) { 
            print;
            last;
        }
    }
}
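
As a variation on the loop above (not something posted in this thread), the inner loop can be folded into a single alternation regex, so each line is tested once. This is a minimal sketch: build_keyword_regex is a made-up helper name and the sample data is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: combine keywords into one regex that matches any of them.
sub build_keyword_regex {
    my @keywords = grep { length($_) } @_;   # drop empty keywords
    return unless @keywords;                 # nothing to match against
    my $alternation = join '|', map { quotemeta($_) } @keywords;
    return qr/$alternation/;
}

my $demo_re = build_keyword_regex('foo', 'a.b', '');
for my $line ("some foo here\n", "axb\n", "a.b literally\n") {
    # "axb" is not printed: quotemeta makes the dot in 'a.b' literal
    print $line if defined $demo_re && $line =~ $demo_re;
}
```

quotemeta plays the same role here as \Q in the loop version: keywords are treated as plain text rather than as regex syntax.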

dre2xl · May 3, 2006, 9:10pm

Yeah, I was picturing one word per line in TEXT.TXT. For multiple words per line, your solution’s good =)

pulykamell · May 3, 2006, 9:21pm

friedo:

First iterate over the keywords and build an array containing each keyword as an element:

use strict;
use warnings;

my @keywords;
open my $kwfh, "KEYWORDS.TXT" or die $!;
while( <$kwfh> ) { 
    chomp;
    push @keywords, $_;
}

Then iterate over the lines in TEXT and check to see if each keyword appears:

open my $tfh, "TEXT.TXT" or die $!;
while( <$tfh> ) { 
    foreach my $kw( @keywords ) { 
        if( /\Q$kw/ ) { 
            print;
            last;
        }
    }
}

I tried it once myself, and I tried it once using your exact code.

For some reason, for both your & my program, it’s spitting everything out at me. Every single line matches. I can’t for the life of me figure out why.

I commented out the second half of the program, to make sure the array is being read in right, and the array itself is fine. 134 items, each one consisting of one element, everything is okay.

So something in the for loop is causing everything to match.

friedo · May 3, 2006, 9:31pm

Does your KEYWORDS.TXT have any blank lines? If so, an empty string as a keyword would match everything. You could modify the keyword-reading loop to filter them out:



my @keywords;
open my $kwfh, "KEYWORDS.TXT" or die $!;
while( <$kwfh> ) { 
    chomp;
    next unless length $_;
    push @keywords, $_;
}
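
A slightly more defensive sketch of the same filter, assuming the keyword file might also contain whitespace-only lines or Windows-style line endings (neither is mentioned in this thread). read_keywords is a hypothetical helper name, and the demo reads from an in-memory string.

```perl
use strict;
use warnings;

# Hypothetical helper: read keywords, skipping blank and whitespace-only lines.
sub read_keywords {
    my ($fh) = @_;
    my @keywords;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;          # strip an LF or CRLF line ending
        next unless $line =~ /\S/;     # skip empty or whitespace-only lines
        push @keywords, $line;
    }
    return @keywords;
}

# Demo on in-memory data, including a trailing blank line.
my $demo_data = "alpha\r\n   \nbeta\n\n";
open my $demo_fh, '<', \$demo_data or die "Can't open in-memory data: $!";
my @demo_keywords = read_keywords($demo_fh);
print scalar(@demo_keywords), " keywords: @demo_keywords\n";   # prints "2 keywords: alpha beta"
```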

pulykamell · May 3, 2006, 9:36pm

friedo:

Does your KEYWORDS.TXT have any blank lines? If so, an empty string as a keyword would match everything. You could modify the keyword-reading loop to filter them out:

my @keywords;
open my $kwfh, "KEYWORDS.TXT" or die $!;
while( <$kwfh> ) { 
    chomp;
    next unless length $_;
    push @keywords, $_;
}

:smack:

I had a blank line at the very end of the file. See, this is why I’m not a programmer. Stuff like this would drive me bonkers. Last time I was in this position, it was substituting a “=” for a “==”. Took me an hour to figure out. Only to figure out that I was still wrong and needed an “eq”.

Argh.

Thanks all!

Punoqllads · May 4, 2006, 12:46am

You can do it without a nested loop using slices if the files are small enough to fit into memory.



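open(KEYWORDS, "KEYWORDS.TXT") or die "Can't open KEYWORDS.TXT: $!";
open(TEXTFILE, "TEXT.TXT") or die "Can't open TEXT.TXT: $!";
open(OUTFILE, ">OUTPUT.TXT") or die "Can't open OUTPUT.TXT: $!";

my (@keywords, @lines, %lineNum, %outlines);

@keywords = <KEYWORDS>;
@lines = <TEXTFILE>;

# Note that it's "@lineNum", not "%lineNum" or even "$lineNum".
# We're using slices.

# lineNum is so that we can print out the valid lines in the same order they
# were in the input text file. Identical lines share one hash key, so a
# repeated line is printed only once.
@lineNum{@lines} = (0 .. $#lines);

chomp @keywords;

foreach my $word (@keywords)
{
  # Skip blank keywords, which would match every line.
  next unless length $word;
  # See previous note on slices. Only the keys matter, and \Q makes the
  # keyword match literally.
  @outlines{grep(/\Q$word/, @lines)} = ();
}

print( OUTFILE sort { $lineNum{$a} <=> $lineNum{$b} } keys %outlines );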
