Scanning a site for emails.

JJOHNSON · October 11, 2017, 9:20pm

I work for a company that asked me to scan our website for exposed emails. What other free tools besides Online Email Extractor can I use? I don’t want to go through our purchasing department to get this task done.

Leaffan · October 11, 2017, 9:29pm

I have no clue what this means.

JJOHNSON · October 11, 2017, 9:37pm

@leaffan

They want to make sure no page on our website has a visible email address, so robots can't scrape it and send spam to people in the company.

Duckster · October 12, 2017, 10:06pm

Not knowing the details of your website, you should be able to easily scan the site on the web server itself, looking for telltale email signature components. If you are running a CMS site, checking the data in the various database tables should suffice.

XT · October 12, 2017, 10:57pm

I’m guessing that the person asking you doesn’t know what they are asking for, but my WAG is they mean a hosted (or on-prem) email filtering service. There are tons of them. We used to use an on-prem service from Sophos, before that it was a hosted service from Symantec, and now we are going to Mimecast. Basically, you point your email (MX record) either at a hosted site that filters your email for viruses, spam and the like and then forwards it to you, or at an on-prem appliance that filters it and sends it on to your email server (Exchange or whatever). The same goes in the other direction: your email server sends outgoing mail either through your on-prem appliance or through your hosted service offsite. I would estimate this cut down on our email spam by something like 95%, maybe more. It went from being a huge issue to a minor annoyance. I could get you a list of vendors, but if you just Google ‘hosted email filtering’ you should get a ton, or ask your software vendor for a list if you trust them.

Chronos · October 13, 2017, 2:37am

I’m not sure why you would assume that’s what they meant, instead of assuming that they meant scanning for exposed emails, which is after all what they said, and is a perfectly reasonable thing to do.

septimus · October 13, 2017, 2:48am

My local machine has, more or less, a copy of my website. Is this normal to have? I would just type, from the appropriate directory on my local machine, something like
grep @ {.,*}/*htm*
and manually ignore “false hits” on the “@.”

edwardcoast · October 13, 2017, 3:45am

[quote=“septimus, post:7, topic:798595”]

My local machine has, more or less, a copy of my website. Is this normal to have? I would just type, from the appropriate directory on my local machine, something like
grep @ {.,*}/*htm*
and manually ignore “false hits” on the “@.”
[/quote]

This won’t be usable for a website built on a CMS; you’d have to dump the MySQL database and do the search there. A CMS is usually written in a programming language like PHP and stores the content of the web pages in a database such as MySQL.

Chronos · October 13, 2017, 1:37pm

It’s still HTML by the time it’s sent to the user, though. You could download static HTML copies through a web browser and do the search on those.
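A rough Python sketch of that idea: once a page's rendered HTML is in hand (saved from a browser, for instance), the search can run on the decoded markup. The function name `find_emails` and the sample markup are made up for illustration, and the pattern is deliberately loose, so it will miss unusual addresses.

```python
import html
import re

# Loose pattern for illustration only; real addresses can be more varied.
EMAIL_RE = re.compile(
    r"[a-z0-9_.+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}", re.IGNORECASE
)


def find_emails(page_source: str) -> list[str]:
    """Return the distinct addresses found in one page's HTML, sorted."""
    # Decode entities such as &#64; so lightly obfuscated addresses still match.
    decoded = html.unescape(page_source)
    return sorted({m.group(0).lower() for m in EMAIL_RE.finditer(decoded)})


# Hypothetical page fragment: one mailto: link and one entity-encoded address.
sample = '<a href="mailto:Info@example.com">Contact</a> or sales&#64;example.com'
print(find_emails(sample))
```

Decoding entities first matters because a scraper can do the same, so an address written with `&#64;` in place of `@` is still exposed.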

markn_1 · October 13, 2017, 2:26pm

A better grep pattern would be something like this:

grep -Ei '[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+' ...

If you know that all the domain names you care about end in .com, you could be even more selective with

grep -Ei '[a-z0-9_.-]+@[a-z0-9_.-]+\.com' ...

And if you know exactly what domain name the email addresses are using, you could use

grep -Ei '[a-z0-9_.-]+@mydomain\.com' ...
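For anyone who would rather script this than use grep, here is a minimal sketch of the same three patterns in Python. The sample text and variable names are invented for the example; `re.IGNORECASE` stands in for grep's `-i`.

```python
import re

# The three patterns from the post, translated directly.
ANY_DOMAIN = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+", re.IGNORECASE)
DOT_COM = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.com", re.IGNORECASE)
ONE_DOMAIN = re.compile(r"[a-z0-9_.-]+@mydomain\.com", re.IGNORECASE)

# Made-up sample: one company address, one outside address, one bare "@".
text = "Mail jane.doe@mydomain.com or help@example.org; follow us @ the usual places."

for name, pattern in [("any", ANY_DOMAIN), ("com", DOT_COM), ("one", ONE_DOMAIN)]:
    # findall returns the matched addresses, not whole lines as grep would.
    print(name, pattern.findall(text))
```

Unlike a plain search for "@", none of the three patterns reports the bare "@" in the sample, which is the kind of false hit the stricter patterns are meant to avoid.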

md2000 · October 13, 2017, 2:48pm

I think I’d wonder why the heck your company website would have anything but company emails on it anyway - so a scan for “@mycompany.com” should be all you need.

To get what people see, a “roll up the web” type website download program would help. Then scan the results. The only problem, of course, is that with so much active content composed on the fly from user input, some websites don’t have simple output that can be scanned. If you have custom-generated pages, can they produce an email address as part of the content? For example, if you have a database of press releases, do the releases include email addresses? If they are Word files or the like, could an address be hidden in the metadata? Even if not, if I know that John Smith is an employee because his name is in a press release, and your addresses are @mycompany.com, then I’d try john.smith@…, jsmith, johns, johnsmith… So it’s not as if the addresses will never be known to the outside world at large.
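On the Word-file point, a .docx file is a ZIP archive of XML parts, so its metadata can be searched the same way as a web page. The sketch below is only an illustration: the function name `emails_in_docx` is made up, the demo archive is a tiny stand-in rather than a real document, and addresses in the body text can be missed when Word splits them across several XML runs.

```python
import io
import re
import zipfile

EMAIL_RE = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+", re.IGNORECASE)


def emails_in_docx(data: bytes) -> dict[str, list[str]]:
    """Map each XML part inside a .docx archive to the addresses found in it."""
    hits = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            # Only look at the XML parts and relationship files.
            if not name.endswith((".xml", ".rels")):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            found = sorted(set(EMAIL_RE.findall(text)))
            if found:
                hits[name] = found
    return hits


# Build a minimal stand-in archive in memory (not a valid Word document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "docProps/core.xml", "<dc:creator>john.smith@mycompany.com</dc:creator>"
    )
    archive.writestr("word/document.xml", "<w:t>No address here</w:t>")

print(emails_in_docx(buffer.getvalue()))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`.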

If they mean actual mail - why the heck would anyone put their email server (or email mailbox, like a PST file) on a public web server? Put as little as possible on the web server itself. More likely you want to check for other vulnerabilities - most companies have web access to their emails. Is this open to everyone? Do some user IDs have known or simple passwords? Are there generic mailboxes like info@mycompany.com where someone might be able to guess a password?

scooter_trash · October 14, 2017, 2:19pm

Been there!

I used free software called httrack to mirror the website locally, then wrote some Perl code to parse every file and look for email addresses using regular expressions. It wasn’t hard at all.
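The original Perl isn't shown, so what follows is not that script, just a minimal Python sketch of the same approach: walk a locally mirrored copy of the site and record which files contain which addresses. The function name `scan_mirror` and the demo files are invented for the example, and the mirror directory is assumed to have been produced already by whatever download tool you use.

```python
import re
import tempfile
from pathlib import Path

EMAIL_RE = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+", re.IGNORECASE)


def scan_mirror(root: str) -> dict[str, set[str]]:
    """Map each address to the set of files (relative to root) that contain it."""
    found = {}
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        # Read everything as text; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        for address in EMAIL_RE.findall(text):
            relative = path.relative_to(base).as_posix()
            found.setdefault(address.lower(), set()).add(relative)
    return found


# Demo on a throwaway directory standing in for a mirrored site.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "index.html").write_text(
        "<p>Write to info@mycompany.com</p>", encoding="utf-8"
    )
    (site / "about").mkdir()
    (site / "about" / "team.html").write_text(
        "Info@MyCompany.com, jsmith@mycompany.com", encoding="utf-8"
    )
    result = scan_mirror(tmp)

for address, files in sorted(result.items()):
    print(address, sorted(files))
```

Lowercasing the addresses keeps differently capitalized copies of the same address from being reported twice.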

JJOHNSON · October 15, 2017, 4:21pm

Thank you all for the positive feedback. I will try the suggestions made. If I still have any problems, I will be back in this forum. I needed help and I now know I went to the right place.

edwardcoast · October 16, 2017, 9:04am

That’s not an efficient or comprehensive method. If the website were a CMS, there could be parts of it that are currently unpublished which could contain the email address you don’t want to appear in text. Doing a dump of the database and using Linux utilities like ‘grep -i’ would reveal whatever the website contains.

DPRK · October 16, 2017, 4:08pm

Business (ideally, all) email addresses need to be screened against spam and viruses anyway. And internal email addresses should not be accessible from outside the internal network so it should not be possible to spam those.

In any case, is it truly your intention that nobody be able to contact your company over email? What if the addresses were displayed as images rather than text?

CookingWithGas · October 16, 2017, 7:40pm

With all due respect, they are asking you to scan for exposed email addresses.

Really_Not_All_That_Bright · October 16, 2017, 8:44pm

Whether or not they’re screened against spam and viruses, some (particularly the former) will still get through. My e-mail address is a matter of public record. Mimecast catches about 80% of spam, and Outlook’s internal filter catches much of the rest, but it’s still a pain.
