This is probably trivial, but I’m pretty new to PERL, and I can’t figure out what the neatest way to accomplish the following is:
I have one file that is a list of strings that need to be found. Let’s call it KEYWORDS.TXT. The file contains about 100 or so keywords, listed one per line (that is, simply a word followed by a newline character).
I have another file that contains the text that needs to be searched. Let’s call it TEXT.TXT.
What I need to do is iterate over each line of text, and check whether it contains any of the keywords. So, basically, while (<INFILE>) if $_ contains an item from KEYWORDS.TXT print OUTFILE, else loop.
So, do I need to read KEYWORDS.TXT into an array and iterate through each item of the array for every cycle of the main loop? I’m not quite sure how all the I/O works in PERL, so I don’t know how to read this into an array.
Or is there some simpler solution I’m missing? I’m fairly new to this, so go easy on me.
It looks like that would only work if your text.txt had one word per line. Otherwise, the key for your hash would contain the entire line and you would end up with a lot of false results.
If there’s more than one word per line of the file to be scanned, then I would do a double iteration. (please ignore my sloppy code, i’m writing this quickly, god i love perl)
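open my $fh, '<', 'keywords.txt' or die "Cannot open keywords.txt: $!";
my @keywords;
while (my $line = <$fh>) {
    # chomp returns the number of characters removed, not the string,
    # so strip the newline first and then store the line itself.
    chomp $line;
    push @keywords, $line;
}
close $fh;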
Do that to load your keywords into an array to step through. Then, for the other file, for the purposes of this I’m going to dump the lines which match into another variable for writing to a file.
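The second loop could look something like the sketch below. It is only a guess at the shape of that loop: the helper name matching_lines and the sample data are made up for illustration, and it uses index() for a plain substring test instead of a regex, so characters like "." or "*" in a keyword are taken literally.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that contain at least one keyword.
sub matching_lines {
    my ($keywords, $lines) = @_;
    my @matches;
    LINE: for my $line (@$lines) {
        for my $kw (@$keywords) {
            next unless length $kw;          # skip blank keywords
            if (index($line, $kw) >= 0) {    # literal substring test
                push @matches, $line;
                next LINE;                   # one hit is enough for this line
            }
        }
    }
    return @matches;
}

# Made-up sample data standing in for the two files.
my @keywords = ('apple', 'pear');
my @lines    = ("I ate an apple\n", "nothing here\n", "pear and apple\n");
print matching_lines(\@keywords, \@lines);
```

With real files you would fill @lines from the text file (or call the inner check once per line inside a while loop) and print the result to an output filehandle instead of STDOUT.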
Again, this as it stands may or may not be functional (haven’t tested it, but i’ve been writing code like this all day), but hopefully I got the gist across.
I tried it once myself, and I tried it once using your exact code.
For some reason, for both your & my program, it’s spitting everything out at me. Every single line matches. I can’t for the life of me figure out why.
I commented out the second half of the program, to make sure the array is being read in right, and the array itself is fine. 134 items, each one consisting of one element, everything is okay.
So something in the for loop is causing everything to match.
Does your KEYWORDS.TXT have any blank lines? If so, an empty-string as a keyword would match everything. You could modify the loop to read the keywords to filter them out:
my @keywords;
open my $kwfh, '<', 'KEYWORDS.TXT' or die "Cannot open KEYWORDS.TXT: $!";
while( <$kwfh> ) {
    chomp;
    next unless length $_;   # skip blank lines
    push @keywords, $_;
}
close $kwfh;
I had a blank line at the very end of the file. See, this is why I’m not a programmer. Stuff like this would drive me bonkers. Last time I was in this position, it was substituting a “=” for a “==”. Took me an hour to figure out. Only to figure out that I was still wrong and needed an “eq”.
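For anyone else bitten by the same mix-up: in Perl "=" assigns, "==" compares as numbers, and "eq" compares as strings. A tiny sketch of the difference follows; the helper names same_number and same_string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub same_number { my ($x, $y) = @_; return $x == $y ? 1 : 0 }   # numeric comparison
sub same_string { my ($x, $y) = @_; return $x eq $y ? 1 : 0 }   # string comparison

print same_number('1.0', '1'), "\n";   # prints 1: both are the number 1
print same_string('1.0', '1'), "\n";   # prints 0: the strings differ
```

So when comparing keywords or lines of text, "eq" is the one to reach for.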
You can do it without a nested loop using slices if the files are small enough to fit into memory.
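open(KEYWORDS, '<', 'KEYWORDS.TXT') or die "Cannot open KEYWORDS.TXT: $!";
open(TEXTFILE, '<', 'TEXT.TXT') or die "Cannot open TEXT.TXT: $!";
open(OUTFILE, '>', 'OUTPUT.TXT') or die "Cannot open OUTPUT.TXT: $!";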
my (@keywords, @lines, %lineNum, %outlines);
@keywords = <KEYWORDS>;
@lines = <TEXTFILE>;
# Note that it's "@lineNum", not "%lineNum" or even "$lineNum".
# We're using slices.
# lineNum is so that we can print out the valid lines in the same order they
# were in the input text file.
@lineNum{@lines} = (0 .. $#lines);
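chomp @keywords;
# Drop blank keywords; an empty one would match every line.
@keywords = grep { length } @keywords;
foreach my $word (@keywords)
{
    # See previous note on slices.
    # \Q...\E makes the keyword match literally instead of as a regex.
    @outlines{grep(/\Q$word\E/, @lines)} = ();
}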
print( OUTFILE sort { $lineNum{$a} <=> $lineNum{$b} } keys %outlines );
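If hash slices are new to you, here is the same idea boiled down to a small self-contained sketch. The helper name lines_in_order and the sample data are made up for illustration; it is not a drop-in replacement for the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep lines containing any keyword, in input order.
sub lines_in_order {
    my ($lines, $keywords) = @_;
    my (%line_num, %keep);
    # Hash slice: map every line to its position in one assignment.
    @line_num{@$lines} = (0 .. $#$lines);
    for my $kw (@$keywords) {
        next unless length $kw;    # skip blank keywords
        # Hash slice: create a key in %keep for every matching line.
        @keep{ grep { /\Q$kw\E/ } @$lines } = ();
    }
    return sort { $line_num{$a} <=> $line_num{$b} } keys %keep;
}

my @lines = ("zebra\n", "apple pie\n", "plain\n", "pear\n");
print lines_in_order(\@lines, ['pear', 'apple']);
```

One thing to be aware of: because the lines themselves are used as hash keys, a line that appears more than once in the input comes out only once.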