Text utility required: harvesting email addresses from text file

Okay, this is driving me nuts, folks. I turn to the collective wisdom of the Straight Dope.

I have a rather ginormous text file composed of a huge number of emails clumped together (3000+). I need to export all email addresses from this file into another text file (preferably an Excel document).

I know there’s a fairly straightforward way to do this, but I’m coming up dry.

Any thoughts? I have a strong feeling that it’s something really obvious that a three-year-old could have pointed out to me. Such is life. :slight_smile:

Assuming it is a delimited text file, all you need to do is start Excel and select the text file to open.

See the Wikipedia article on delimiters for an explanation.

Can you post a snippet in a code block?

Can’t you just grep for a regular expression and pipe the output into a new file? :wink:

Seconded. Depending on the formatting, I can probably write a simple program to do this for you (assuming you’re using Windows).

My Perl and sed fingers are itching for a sample. This is the kind of thing I get paid The Big Bucks for. And sometimes I do it for fun!

I am a deranged individual.

You guys rock. :slight_smile:

Okay, here’s a code snippet, anonymized a bit. The full version has something like 3400 of these.

Yeah, Lotus server. Whee. :slight_smile:

Each message starts with either “,” a line break and “Principal:” or “,” a line break and “In_Reply_To:”. Messages themselves have a fairly arbitrary length and an arbitrary number of fields (thank you, Lotus).

Just looking at it, it seems you could just search for “words” that contain an @ symbol and then export the results of that search. There’d be some false hits with it, but not enough to be really annoying.



,
Principal:  CN=John Smith/O=AAA
$AltPrincipal:  CN=John Smith/O=AAA
ForwardedFrom:  CN=John Smith/O=AAA
ForwardedDate:  11/09/2006 03:15:00 PM
OriginalModTime:  11/09/2006 03:26:03 PM
Logo:  stdNotesLtr0
useApplet:  True
DefaultMailSaveOptions:  1
ExpandPersonalGroups:  1
tmpImp:  
Sign:  
Subject:  Fw: My Rhinoceros Is Lonely
SendTo:  JSmith@sample.com
CopyTo:  
InetSendTo:  .
InetCopyTo:  
$StorageTo:  .
$Orig:  7D526ECE365A675F85257221006F68B1
$Mailer:  Lotus Notes Release 7.0 August 18, 2005
$MessageID:  <OF7D526ECE.365A675F-ON85257221.006F68B1-85257221.00703FB7@LocalDomain>
From:  CN=John Smith/O=AAA
INetFrom:  JSmith@sample.com
PostedDate:  11/09/2006 03:26:00 PM
Encrypt:  0
SpamSentinelVerified:  1
SpamS_Version:  4.0.3.9
$UpdatedBy:  CN=AAALOTUS/O=AAA
Categories:  
$Revisions:  
RouteServers:  CN=AAALOTUS/O=AAA
RouteTimes:  11/09/2006 03:27:56 PM-11/09/2006 03:27:57 PM
DeliveredDate:  11/09/2006 03:28:02 PM
BlindCopyTo:  CN=John Smith/O=AAA
InetBlindCopyTo:  MBerkey@ukcdogs.com
$StorageBcc:  1
$MiniView:  

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec id sem. Integer ultrices metus eget tellus. Donec auctor odio ut metus. Phasellus lobortis tempus orci. Ut vitae felis. Mauris faucibus. Donec eget odio eget justo lobortis dapibus. Etiam scelerisque aliquet diam. Maecenas porta egestas neque. Suspendisse laoreet faucibus mauris. Proin rutrum libero ac leo. Etiam pulvinar ligula ac magna. Phasellus quis dolor non turpis interdum semper. Aliquam a augue. Fusce ac magna. Mauris quis justo vitae nisl mattis lobortis. Praesent in mi. Curabitur sed velit.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec id sem. Integer ultrices metus eget tellus. Donec auctor odio ut metus. Phasellus lobortis tempus orci. Ut vitae felis. Mauris faucibus. Donec eget odio eget justo lobortis dapibus. Etiam scelerisque aliquet diam. Maecenas porta egestas neque. Suspendisse laoreet faucibus mauris. Proin rutrum libero ac leo. Etiam pulvinar ligula ac magna. Phasellus quis dolor non turpis interdum semper. Aliquam a augue. Fusce ac magna. Mauris quis justo vitae nisl mattis lobortis. Praesent in mi. Curabitur sed velit.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec id sem. Integer ultrices metus eget tellus. Donec auctor odio ut metus. Phasellus lobortis tempus orci. Ut vitae felis. Mauris faucibus. Donec eget odio eget justo lobortis dapibus. Etiam scelerisque aliquet diam. Maecenas porta egestas neque. Suspendisse laoreet faucibus mauris. Proin rutrum libero ac leo. Etiam pulvinar ligula ac magna. Phasellus quis dolor non turpis interdum semper. Aliquam a augue. Fusce ac magna. Mauris quis justo vitae nisl mattis lobortis. Praesent in mi. Curabitur sed velit.

John Smith,
Generico, Inc.

----- Forwarded by Jane Smith/Generico on 11/09/2006 08:48 AM -----

"Bradley, Bill" <bbradley@sampletech.ca> 
11/07/2006 02:04 PM	
	
	To
	<jnsmith@sample.com>
	cc
	
	Subject
	Internet Classified Ad's
	
	
	
	
	
Duis ornare, sapien nec congue ultrices, lacus lorem facilisis erat, ac laoreet risus lectus vel nunc. Aenean tincidunt tincidunt ante. Ut facilisis justo vel dui. Aliquam ultricies sem non neque. Nunc nec diam a ante egestas tristique. Donec gravida. Aliquam vitae mi. Sed ultrices faucibus nulla. Aenean vulputate urna sed odio. Etiam at enim ut lectus elementum commodo. Cras lorem pede, suscipit vitae, mattis id, aliquet ac, magna. Proin accumsan consectetuer massa.

Thanks, 
Bradley, Bill
Staff Scientist 
Spumco, Incorporated
Address of arbitrary length here
 
,
In_Reply_To:  <20060425173544.53698.qmail@myghettoserver.com>
$Orig:  ABB76DBCA89F18228525715B00614160
OriginalModTime:  04/25/2006 01:41:55 PM
MIMEMailHeaderCharset:  2031619
$StorageCc:  
$StorageTo:  .
InetCopyTo:  
InetSendTo:  .
AltCopyTo:  
InheritedFrom:  John Doe <jdoe@sample.net>
InheritedAltFrom:  John Doe <jdoe@sample.net>
Logo:  stdNotesLtr0
DefaultMailSaveOptions:  1
Principal:  CN=John Smith/O=AAA
ExpandPersonalGroups:  1
origStat:  
tmpImp:  
Sign:  
SendTo:  John Doe <jdoe@sample.net>
CopyTo:  
Subject:  Re: Inbox full at Generico Message Boards
MIME_Version:  1.0
$Mailer:  Lotus Notes Release 7.0 August 18, 2005
$MessageID:  <OFABB76DBC.A89F1822-ON8525715B.00614160-8525715B.006138CA@LocalDomain>
From:  CN=John Smith/O=AAA
INetFrom:  JSmith@sample.com
PostedDate:  04/25/2006 01:42:47 PM
Encrypt:  0
SpamSentinelVerified:  1
SpamS_Version:  4.0.3.9
RoutingState:  
$UpdatedBy:  CN=AAALOTUS/O=AAA
Categories:  
$Revisions:  
RouteServers:  CN=AAALOTUS/O=AAA
RouteTimes:  04/25/2006 01:43:58 PM-04/25/2006 01:44:00 PM
DeliveredDate:  04/25/2006 01:44:00 PM
BlindCopyTo:  CN=John Smith/O=AAA
InetBlindCopyTo:  .
$StorageBcc:  .
$MiniView:  
$MIMETrack:  MIME-CD by Notes Client on John Smith/Generico(Release 7.0|August 18, 2005) at 03/19/2007 04:02:44 PM,MIME-CD complete at 03/19/2007 04:02:44 PM
$PaperColor:  1

Curabitur ante nisi, lobortis sed, accumsan in, aliquet quis, nibh. Suspendisse potenti. Nulla ut massa eget tortor facilisis luctus. Aenean mi. Pellentesque mollis, tortor luctus sagittis consequat, urna tortor euismod augue, eget imperdiet enim mi eget nulla. Aliquam laoreet tincidunt dolor. Aliquam cursus. Suspendisse potenti. Curabitur pretium, felis in facilisis consequat, ipsum tortor iaculis nisi, ut lobortis mi nisi vitae magna. Suspendisse ac lacus. Suspendisse sit amet arcu sed purus sollicitudin ornare. Pellentesque arcu risus, dapibus a, gravida eu, tempor ac, elit. Suspendisse gravida, purus vitae feugiat lacinia, dolor arcu laoreet nulla, id auctor eros felis in pede. Nunc eu elit. Morbi id nisl nec lorem adipiscing ultricies. Maecenas nisi. Proin erat. Pellentesque eros. Integer suscipit bibendum dui. Fusce congue sapien nec metus.

John Smith,
Generico, Inc.


So are you looking for just the email addresses in a specific field, or all the email addresses? Either way, I think I would just grep the file like MrDibble said.

Pretty much all of them, I think. Grep it is. :slight_smile:
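For anyone without grep handy, here is a rough Python sketch of the "find every word with an @ in it" idea from earlier in the thread. The function name `words_with_at` and the set of punctuation it trims are assumptions made for illustration, and the sample lines are copied from the snippet above. Note that it happily returns the Lotus message ID too, which is exactly the kind of false hit mentioned.

```python
def words_with_at(text):
    """Return unique whitespace-separated words containing '@', in first-seen order."""
    hits = []
    for word in text.split():
        # Assumption: addresses may be wrapped in <...>, quotes, or trailing commas,
        # so trim that punctuation from both ends before checking.
        cleaned = word.strip("<>\"'(),;:")
        if "@" in cleaned and cleaned not in hits:
            hits.append(cleaned)
    return hits


# A few lines taken from the sample posted above.
sample = """SendTo:  JSmith@sample.com
$MessageID:  <OF7D526ECE.365A675F-ON85257221.006F68B1-85257221.00703FB7@LocalDomain>
InheritedFrom:  John Doe <jdoe@sample.net>
INetFrom:  JSmith@sample.com
"""

hits = words_with_at(sample)
for hit in hits:
    print(hit)
```

On the real file you would read the text with something like `open(path).read()` and write `hits` out one per line; the duplicate check here is a plain list lookup, which should be fine for a few thousand messages.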

Process the file so that each word is on a separate line, and then grep (with -E, since the pattern uses extended syntax) for the following regular expression:


^[-!#$%&'*+/0-9=?A-Z^_a-z{|}~](\.?[-!#$%&'*+/0-9=?A-Z^_a-z{|}~])*@[a-zA-Z](-?[a-zA-Z0-9])*(\.[a-zA-Z](-?[a-zA-Z0-9])*)+$

Or this regular expression, of course:


@

Depending on whether false positives or programmer time is the bigger annoyance.
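If you would rather skip the word-per-line step, the stricter pattern can be applied word by word in Python and the result written as a one-column CSV that Excel opens directly. This is only a sketch: the names `ADDRESS`, `extract_addresses` and `write_csv` are made up for the example, trimming `<>` and similar punctuation before matching is an assumption, and so is treating addresses that differ only in case as duplicates. The sample lines come from the snippet posted earlier.

```python
import csv
import io
import re

# The long pattern from the post above, unchanged apart from being split in two.
ADDRESS = re.compile(
    r"^[-!#$%&'*+/0-9=?A-Z^_a-z{|}~](\.?[-!#$%&'*+/0-9=?A-Z^_a-z{|}~])*"
    r"@[a-zA-Z](-?[a-zA-Z0-9])*(\.[a-zA-Z](-?[a-zA-Z0-9])*)+$"
)


def extract_addresses(text):
    """Return words that match ADDRESS, first occurrence only, ignoring case."""
    seen = set()
    found = []
    for word in text.split():  # same effect as putting each word on its own line
        # Assumption: strip wrapping punctuation such as <addr> before matching.
        candidate = word.strip("<>\",;()")
        key = candidate.lower()
        if ADDRESS.match(candidate) and key not in seen:
            seen.add(key)
            found.append(candidate)
    return found


def write_csv(addresses, handle):
    """Write a header row plus one address per row to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["email"])
    writer.writerows([address] for address in addresses)


sample = """SendTo:  JSmith@sample.com
$MessageID:  <OF7D526ECE.365A675F-ON85257221.006F68B1-85257221.00703FB7@LocalDomain>
"Bradley, Bill" <bbradley@sampletech.ca>
In_Reply_To:  <20060425173544.53698.qmail@myghettoserver.com>
INetFrom:  JSmith@sample.com
"""

addresses = extract_addresses(sample)
buffer = io.StringIO()  # stands in for a real output file
write_csv(addresses, buffer)
print(buffer.getvalue())
```

The `@LocalDomain` message ID drops out because the pattern insists on at least one dot in the domain, but the qmail message ID still gets through, so some hand cleanup remains. To write a real file, pass `open("addresses.csv", "w", newline="")` instead of the `StringIO` buffer (the file name is just a placeholder).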

Job done. Thanks a bunch!

For your reward I will tell the pretty redhead who runs that department (and who has legs like whoa and vaguely resembles Kari Byron) how much you guys rock. :smiley: