Need help extracting info from a text file

I have a friend who asked me to help him get some information out of a text file. The file is a collection of E-mails and the common information is

Name=<somedata>
Address=<somedata>
City=<somedata>
State=<somedata>
ZIP=<somedata>
email=<somedata>

with no spaces at all. Is there a simple way to do this in Word? I do have the original Mozilla mail file it was made from.

I don’t want to go through it by hand because it is over 15,000 pages long.

Email it to me. If the formatting is consistent, I should be able to do a simple program to extract the data to a file for you.

The file is 56 megs so I can’t E-mail it. I will send you a sample of the data

What exactly do you mean by “get some information out of” the text file? To what kind of format must the information be extracted? If it’s something simple, it can be done in a line or two of Perl.

Friedo, it can stay in pure text format. The problem is that I don’t know Perl, and the length of the data to be removed (leaving just the contact information) varies, so I can’t use a looping script.

Keep it poorly nourished (low protein), sleep deprived, and in low light conditions for about a week. Following that, restrain it in an upright position for 45-60 minutes and make it watch you read and/or eat a very well-prepared omelette. Following this, speak to it in randomly kind/angry/pleading tones about completely irrelevant topics. Berate it for not knowing obscure baseball statistics, implore it to please give you its Aunt Gladys’ potato salad recipe, ask it its name, rank and ID number–and laugh that it is lying. Vomit the omelette into a trash can and return the file to its cell. After about 5 days or so, the file will give up its info. Clown paint may or may not get speedier results, but by no means are you to come into direct physical contact with the file. That’s illegal.

Of course you can, it’s just a matter of defining the loop. :slight_smile:

So post an example of some actual data, and an example of what you want the end result to look like. I can probably at least get you pointed in the right direction.

Diet Peach Snapple hurts when it comes out your nose. I turned and avoided the keyboard so you don’t owe me a new one. :smiley:

In Word? Probably not.

You could probably do this in Excel. Perl would be my preference, and there are probably dozens of people including myself on this board who could write an extractor loop in 5 lines or less. Don’t be frightened, you could learn something really useful.

But we’re missing a big piece of information… Ordinarily this is the nice, clean kind of format that you’d want to put text INTO. So what do you want the OUTPUT to look like? If you specify that, and provide some sample data, I’m sure one of us will take up the task due to boredom. Perl people are like that :slight_smile:

I want to thank everyone who helped me here, Q.E.D and FRIEDO. I was able to get the information out that my friend’s father needed, and I look like a hero.

Just out of curiosity, can we see the solution?

Sure, but the solution doesn’t work 100% because I noticed the file I was given uses two different types of formatting. I think I can figure out how to get it to pull out the second type.



#!/usr/bin/perl

use strict;
use warnings;

my $input = $ARGV[0];

# Three-argument open, so an odd file name cannot change the open mode.
open IN, '<', $input or die "Could not open $input for reading: $!";

while(<IN>) {
        if (/^(.*?)=(.*?)$/) {
                my $key = $1;
                my $val = $2;

                print "$val," if $key eq 'Name';
                print "$val," if $key eq 'Address';
                print "$val," if $key eq 'City';
                print "$val," if $key eq 'State';
                print "$val," if $key eq 'ZIP';
                print "$val\n" if $key eq 'email';
        }
}


It pulls out the info from the following format fine

OtherLocation=
Other:
Other
Other:
Beds=
Baths=
Name=
Address=
City=
State=
ZIP=
email=

but doesn’t process the second form of data he gets which looks like

Below is the result of your feedback form. It was submitted by
<name> (<email address>) at <IP> on Sunday, October 17, 2004 at 20:57:11 EST

The Wildwoods:

Single:

Duplex:

Townhouse:

Condo:

Beds:

Baths:

Price Range:

Dates:

Address:

City:

State:

ZIP:

Home_Phone:

Best_Time_To_Call_Home: PM




I think I can use 

print "$val," if $key eq 'City:';



to get the second type of data, but I don’t know how to get the name and E-mail associated with it, since that appears at the top with no identifier that I can understand.

Nope… you wouldn’t change the key, you would change the regular expression. What I mean by regular expression is the piece of code that describes a format.

Your original RE was:


/^(.*?)=(.*?)$/

The “slash” characters just mark the beginning and end of the RE.
This expression means “at the beginning of the line, match the shortest possible strings that are separated by an = sign, followed by an end of line” (roughly).

So to capture the different format with the colon, it should work if you just change the = sign to a :
good luck…
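If it helps, here is a rough sketch of one pattern that accepts either separator, so the same loop could read both formats. The helper name parse_field and the sample values are made up for illustration, and it assumes the key itself never contains = or :. Note that the date line in the form header also contains colons, so it will match too; comparing the key against the names you want filters it out.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into (key, value) at the first
# "=" or ":" and trim surrounding whitespace. Returns an empty list
# when the line has no separator.
sub parse_field {
    my ($line) = @_;
    return unless $line =~ /^([^=:]+?)\s*[=:]\s*(.*?)\s*$/;
    return ($1, $2);
}

# Sample lines in both formats; the values are invented.
for my $line ('City=Anytown', 'City: Anytown', 'no separator here') {
    my ($key, $val) = parse_field($line);
    print defined $key ? "$key -> $val\n" : "skipped: $line\n";
}
```

Trimming the whitespace matters for the colon format, because "City: Anytown" would otherwise give a value with a leading space.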

So you mean copy the script then change it to




while(<IN>) {
        if (/^(.*?):(.*?)$/) {
                my $key = $1;
                my $val = $2;

                print "$val," if $key eq 'Name';
                print "$val," if $key eq 'Address';
                print "$val," if $key eq 'City';
                print "$val," if $key eq 'State';
                print "$val," if $key eq 'ZIP';
                print "$val\n" if $key eq 'email';
        }
}



Is that correct? If so, how do I also copy the name and E-mail address, which are in this format:

Below is the result of your feedback form. It was submitted by
<name> (<email address>) at <IP> on Sunday, October 17, 2004 at 20:57:11 EST

Yes, try the above that you wrote… it couldn’t hurt. Put a few print statements in there to see what you’re getting out of the match, play with it until you get it right.
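For example, a quick way to see what the match is capturing is to print the pieces inside brackets, so stray spaces show up. This is only a sketch; describe_match is a made-up name and the sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: describe what the colon pattern captured from one line.
sub describe_match {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^(.*?):(.*?)$/) {
        return "key=[$1] val=[$2]";
    }
    return "no match: [$line]";
}

print describe_match($_), "\n" for ('City: Anytown', 'Condo:', 'The Wildwoods');
# key=[City] val=[ Anytown]
# key=[Condo] val=[]
# no match: [The Wildwoods]
```

The first line of output shows the leading space that comes along with the value in the colon format.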

As to the second question, is there really a line break after “It was submitted by”? If so, it takes a little thought… if not, it’s fairly straightforward. Here’s one untested solution:

if (/It was submitted by (.*?) \((.*?\@.*?)\) at ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/) {
        my ($name, $email, $ip) = ($1, $2, $3);
        # do what you like with $name, $email, $ip
}

There are some big problems with the above code… for example it doesn’t match all possible email addresses, it will not catch first and last names, it will count invalid IPs as valid ones, etc. It’s just for illustrative purposes.
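If the line break after “It was submitted by” really is there, one possible approach is to match against both lines at once and let \s+ stand for either a space or a newline. This is a sketch under that assumption; parse_submitter is a made-up name, the name, address and IP below are invented sample values, and the pattern is still far from a real email or IP validator.

```perl
use strict;
use warnings;

# Hypothetical helper: pull name, email and IP out of the form header.
# Assumes the header looks like the sample posted in this thread:
#   ... It was submitted by
#   <name> (<email address>) at <IP> on <date>
sub parse_submitter {
    my ($text) = @_;
    # \s+ after "by" matches a space or a line break.
    return unless $text =~ /It was submitted by\s+(.*?)\s*\(([^()\s]+\@[^()\s]+)\)\s+at\s+(\d{1,3}(?:\.\d{1,3}){3})\b/;
    return ($1, $2, $3);
}

# Invented sample header spanning two lines.
my $header = "Below is the result of your feedback form. It was submitted by\n"
           . "Jane Doe (jane\@example.com) at 192.0.2.10 on Sunday, October 17, 2004 at 20:57:11 EST\n";
my ($name, $email, $ip) = parse_submitter($header);
print "$name,$email,$ip\n";
# Jane Doe,jane@example.com,192.0.2.10
```

In a line-by-line loop you would first have to glue the following line onto the current one whenever a line ends with “It was submitted by”, and then hand the joined text to the helper.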

You can easily teach yourself how to do this, you just need to spend some time on it. Otherwise… I sense that this is for business purposes, so email me to talk about a fee if you want me to take this any further. It wouldn’t be steep and it sounds like it would be worth your while.

OMG, MannyL, you didn’t do this, did you? You didn’t actually send a 56 MB file of names and email addresses to Q.E “Mr Spammer” D?

Your inbox, and the inbox of everyone you know, is history, pal.

:wink:

Well, it’s not really for business purposes, as I am not going to be making any money off it. I have a friend whose father does something with real estate, and he asked me to get data off the file. It’s just a friend doing a friend a favor.

I used fake addresses. Well, mostly fake: I replaced all the addresses with yours :slight_smile:

Your friend’s father does real estate work free of charge? That’s charitable. :dubious:

You see my point… information has value. Someone’s going to make money off of this work - just not you or any of us, it seems.

Ugh. Why didn’t you just type:



if ($key eq "email") { print "$val\n" } else { print "$val," }


? I literally couldn’t write the code you posted, simply because the line I put is so much more elegant. But I’m geeky like that.
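One caveat, as far as I can tell: the one-liner prints the value of every key that matches, including ones like Beds and Baths, not just the six contact fields. A rough alternative is to collect each record into a hash and print only the wanted fields in a fixed order. The helper name record_to_csv and the sample record are invented, and values that contain commas are not quoted here.

```perl
use strict;
use warnings;

# Fields to keep, in output order (taken from the first format in this thread).
my @wanted = qw(Name Address City State ZIP email);

# Hypothetical helper: turn the "key=value" lines of one record into a CSV row.
sub record_to_csv {
    my (@lines) = @_;
    my %field;
    for my $line (@lines) {
        $field{$1} = $2 if $line =~ /^(.*?)=(.*?)\s*$/;
    }
    # Missing fields become empty columns so the rows stay aligned.
    return join(',', map { defined $field{$_} ? $field{$_} : '' } @wanted);
}

# Invented sample record; "Beds" is dropped because it is not in @wanted.
print record_to_csv('Beds=2', 'Name=Jane Doe', 'City=Anytown', 'email=jane@example.com'), "\n";
# Jane Doe,,Anytown,,,jane@example.com
```

Since email is the last field of each record in the first format, the main loop could gather lines until it sees email= and then call the helper.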

Yes, his father does this work free of charge to help out someone who helped him out when he was younger. That person may be, and most likely is, making money, but he spent a lot to help out my friend’s dad. That said, I will be contacting you for some paid teaching so I can learn more and ask for compensation next time they ask me to do something at 0 hour.