Scanning a site for emails.

I work for a company that asked me to scan our website for exposed emails. What other free tools besides Online Email Extractor can I use? I don’t want to go through our purchasing department to get this task done.

I have no clue what this means.


They want to make sure no page on our website has a visible email address, so robots can't scrape it and send people in the company spam.

Not knowing the details of your website, you should be able to easily scan the site on the web server itself, looking for telltale email signature components. If you are running a CMS site, checking the data in the various database tables should suffice.

I’m guessing that the person asking you doesn’t know what they are asking for, but my WAG is they mean a hosted (or on-prem) email filtering service. There are tons of them. We used to use an on-prem service from Sophos; before that it was a hosted service from Symantec, and now we are going to Mimecast. Basically, you either point your MX record at a hosted service that filters your email for viruses, spam and the like and then forwards it to you, or you point your MX record at an on-prem appliance that filters it and sends it on to your email server (Exchange or whatever). The same goes in reverse: your email server sends outgoing email either through your on-prem appliance or through your hosted service offsite. I would estimate this cut down on our email spam by something like 95%, maybe more. It went from being a huge issue to a minor annoyance. I could get you a list of vendors, but if you just Google ‘hosted email filtering’ you should get a ton, or ask your software vendor for a list if you trust them.

I’m not sure why you would assume that’s what they meant, instead of assuming that they meant scanning for exposed emails, which is after all what they said, and is a perfectly reasonable thing to do.

My local machine has, more or less, a copy of my website. Is this normal to have? I would just type, from the appropriate directory on my local machine, something like
grep @ {.,}/htm*
and manually ignore “false hits” on the “@.”

[quote=“septimus, post:7, topic:798595”]

My local machine has, more or less, a copy of my website. Is this normal to have? I would just type, from the appropriate directory on my local machine, something like
grep @ {.,}/htm*
and manually ignore “false hits” on the “@.”
[/quote]

This won’t be usable for a website built on a CMS; you’d have to dump the MySQL database and do a search there. A CMS is usually written in a programming language like PHP and stores the content of the web pages in a database such as MySQL.

It’s still HTML by the time it’s sent to the user, though. You could download static HTML copies through a web browser and do the search on those.

A better grep pattern would be something like this:

grep -Ei '[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+' ...

If you know that all the domain names you care about end in .com, you could be even more selective with

grep -Ei '[a-z0-9_.-]+@[a-z0-9_.-]+\.com' ...

And if you know exactly what domain name the email addresses are using you could use

grep -Ei '[a-z0-9_.-]+@mydomain\.com' ...
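For anyone who would rather script this than grep it, here is a minimal Python sketch of the same patterns. It parses the HTML first, so an address inside a mailto: link or written with a character entity (such as &#64; for the @) is still caught, which a plain text search of the markup can miss. The names build_pattern and emails_in_html are my own, and the pattern is just the one from the grep commands above, so it has the same blind spots.

```python
import re
from html.parser import HTMLParser


def build_pattern(domain=None):
    """Compile the thread's email pattern; with a domain, match only that domain."""
    local = r"[a-z0-9_.-]+"
    host = r"[a-z0-9_.-]+\.[a-z]+" if domain is None else re.escape(domain)
    return re.compile(local + "@" + host, re.IGNORECASE)


class _Collector(HTMLParser):
    """Collects addresses from text, attribute values and comments."""

    def __init__(self, pattern):
        # convert_charrefs=True decodes entities such as &#64; before we see the text.
        super().__init__(convert_charrefs=True)
        self.pattern = pattern
        self.found = set()

    def _scan(self, text):
        self.found.update(m.lower() for m in self.pattern.findall(text))

    def handle_starttag(self, tag, attrs):
        # Attribute values cover href="mailto:..." links.
        for _name, value in attrs:
            if value:
                self._scan(value)

    def handle_data(self, data):
        self._scan(data)

    def handle_comment(self, data):
        self._scan(data)


def emails_in_html(page, domain=None):
    """Return the sorted, lower-cased addresses found in one HTML page."""
    collector = _Collector(build_pattern(domain))
    collector.feed(page)
    collector.close()
    return sorted(collector.found)
```

Calling emails_in_html(page) corresponds to the general pattern, and emails_in_html(page, "mydomain.com") to the last grep command, with mydomain.com standing in for your real domain.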

I think I’d wonder why the heck your company website would have anything but company emails on it anyway - so a scan for “” should be all you need.

To get what people see, a “roll up the web” type website download program would help. Then scan the results. The only problem, of course, is that with so much active content, composed on the fly based on user input, some websites don’t have simple output that can be scanned. If you have custom generated pages, do they have any way of producing an email as part of the content? (I.e. if you have a database of press releases, do they include emails in the press release? If they are Word files or such, would there be an email hidden in the metadata?) Even if not, if I know that John Smith is an employee because his name is in a press release, and I know your company’s email domain, then I’d try john.smith@…, jsmith, johns, johnsmith… So it’s not like email will never be known to the outside world at large.
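On the Word metadata question: a .docx file is a ZIP archive of XML parts, with document properties such as the author typically stored under docProps/. A rough Python sketch that lists every address found in those XML parts is below. The function name emails_in_docx is made up for illustration, older binary .doc files are not handled, and an address in the body text can be split across formatting runs and missed, so treat this as a first pass only.

```python
import re
import zipfile

# Same pattern as the grep commands earlier in the thread.
EMAIL_RE = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+", re.IGNORECASE)


def emails_in_docx(path):
    """Map each address found in a .docx to the archive parts that contain it."""
    found = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            # Only the XML parts are text; skip embedded images and other media.
            if not name.endswith((".xml", ".rels")):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            for address in EMAIL_RE.findall(text):
                found.setdefault(address.lower(), set()).add(name)
    return {address: sorted(parts) for address, parts in found.items()}
```

A hit in a part such as docProps/core.xml would mean the address sits in the document properties rather than in the visible text.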

If they mean actual mail - why the heck would anyone put their email server (or email mailbox, like a PST file) on a public web server? Put as little as possible on the web server itself. More likely you want to check for other vulnerabilities - most companies have web access to their emails. Is this open to everyone? Do some user IDs have known passwords or simple ones? Are there generic mailboxes like info@ where someone might be able to guess a password?

Been there!

I used free software called HTTrack to mirror the website locally, then wrote some Perl code to parse every file and look for email addresses using regular expressions. It wasn’t hard at all.
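The original was Perl and isn't shown here, so the following is only a Python sketch of the same idea: walk a locally mirrored copy of the site and report which files contain addresses. The function name scan_mirror and the list of file extensions are assumptions on my part; adjust the extensions to whatever your mirror actually contains. Expect some false hits, for example image names like logo@2x.png referenced in the HTML.

```python
import os
import re

# Same pattern as the grep commands earlier in the thread.
EMAIL_RE = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+", re.IGNORECASE)

# Assumed set of text-like files worth scanning in a mirror.
TEXT_EXTENSIONS = (".html", ".htm", ".txt", ".css", ".js", ".xml")


def scan_mirror(root):
    """Return {relative file path: sorted addresses} for files under root."""
    hits = {}
    for folder, _dirs, files in os.walk(root):
        for filename in files:
            if not filename.lower().endswith(TEXT_EXTENSIONS):
                continue
            path = os.path.join(folder, filename)
            # errors="replace" keeps the scan going on oddly encoded pages.
            with open(path, encoding="utf-8", errors="replace") as handle:
                found = {a.lower() for a in EMAIL_RE.findall(handle.read())}
            if found:
                hits[os.path.relpath(path, root)] = sorted(found)
    return hits
```

Point it at the directory the mirroring tool wrote to and print the resulting dictionary to get a per-page list to hand back to whoever asked for the scan.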

Thank you all for the positive feedback. I will try the suggestions made. If I still have any problems, I will be back in this forum. I needed help and I now know I went to the right place.

That’s not an efficient or comprehensive method. If the website were a CMS, there could be parts of it that are currently unpublished which could contain email addresses you don’t want to appear in text. Doing a dump of the database and using Linux utilities like ‘grep -i’ would reveal whatever the website contains.
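If you go the database dump route, a small script can also tell you which table each address came from, which helps when tracking down unpublished content. This Python sketch assumes a plain-text SQL dump where each row batch starts with INSERT INTO followed by the table name, as mysqldump typically writes it; the function name scan_dump is made up. It reads line by line, so a large dump does not need to fit in memory.

```python
import re

# Same pattern as the grep commands earlier in the thread.
EMAIL_RE = re.compile(r"[a-z0-9_.-]+@[a-z0-9_.-]+\.[a-z]+", re.IGNORECASE)

# Assumed dump layout: INSERT INTO `table_name` VALUES (...),(...);
INSERT_RE = re.compile(r"^INSERT INTO `?([A-Za-z0-9_]+)`?")


def scan_dump(lines):
    """Group addresses by the table whose INSERT line contains them.

    `lines` is any iterable of text lines, for example an open file object.
    """
    by_table = {}
    for line in lines:
        match = INSERT_RE.match(line)
        table = match.group(1) if match else "(other)"
        for address in EMAIL_RE.findall(line):
            by_table.setdefault(table, set()).add(address.lower())
    return {table: sorted(found) for table, found in by_table.items()}
```

To use it, open the dump with open(path, encoding="utf-8", errors="replace") and pass the file object to scan_dump. Dumps escape newlines inside text as a backslash followed by n, so an address at the start of a line in the content may pick up a stray leading n; check any odd-looking hits by hand.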

Business (ideally, all) email addresses need to be screened against spam and viruses anyway. And internal email addresses should not be accessible from outside the internal network so it should not be possible to spam those.

In any case, is it truly your intention that nobody be able to contact your company over email? What if the addresses were displayed as images rather than text?

With all due respect, they are asking you to scan for exposed email addresses.

Whether or not they’re screened against spam and viruses, some (particularly the former) will still get through. My e-mail address is a matter of public record. Mimecast catches about 80% of spam, and Outlook’s internal filter catches much of the rest, but it’s still a pain.