Preventing website "harvesting"?

I know this can be done and I could do some searching to refresh my fuzzy memory of how, but I’d appreciate some short answers to speed the plow, as I’m still draggy from surgery last week.

I have a popular hobby-topic website that includes a ton of media - photos, sound clips, video clips and links - and in general I don’t protect or want to protect any of that material from access or download.

However, I have perpetual data overhead from links to my content, and worse, about once a month I get an ‘excess activity warning’ from my host that appears to be someone slurping the entire site for their own files.

How do I:

[ol]
[li]Limit outside use of my media resources, or at least track such use so I can block selected hogs and/or encourage them to host the file themselves?[/li]
[li]Block the easy roads to pulling down several gigs of site data at once?[/li]
[/ol]
Thanks. ETA: The site is on SiteGround, if that shapes or directs any of the answers.

Step one would be to read this I think: https://www.siteground.com/tutorials/cpanel/hotlink_protection.htm

I’m sorry for your surgery.

However, I really doubt there's anything that can stop website copiers like HTTrack or people using wget etc.; there are little JavaScript tricks to stop image copying that merely annoy, and they are easily defeated (with extensions such as… 'Enable Right Click' etc., or by simply turning JavaScript off).

I didn't know people were still concerned with hotlinking; it takes one back to a different time (and since hosting is pretty cheap these days there's certainly no excuse for hotlinking now), but it's easily blocked by modifying your .htaccess file.
Or, more simply: find the stop-hotlinking option in your cPanel. Your host should have suggested this.
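For the record, the .htaccess version is only a few lines. A minimal sketch, assuming Apache with mod_rewrite enabled and with example.com standing in for your own domain (adjust the file extensions to match your media):

```apache
RewriteEngine On
# Let through requests with no referrer at all (direct visits, some proxies).
RewriteCond %{HTTP_REFERER} !^$
# Let through requests referred by your own pages.
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Anything else asking for media gets a 403 Forbidden.
RewriteRule \.(jpe?g|png|gif|mp3|mp4)$ - [F,NC]
```

Note the empty-referrer exception: without it you'd also block visitors whose browsers or privacy tools strip the Referer header.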

I have a more fun option to stop hotlinking. Once, a Seattle-based blog was hotlinking an image on my site. I switched the image to a logo for the Oklahoma City Thunder and the hotlinking stopped very quickly thereafter. NBA fans will understand.

You can take a look at referrer logs to see where downloads are coming from to detect someone who has a link to an image or some large file on your site, and then either ban that referrer or move the image.
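If your host gives you raw access logs, you can eyeball the top referrers from a shell. A sketch assuming a standard combined-format Apache/Nginx log; the log path and the sample lines below are made up for illustration:

```shell
# Fake sample log (stand-in for your real access log).
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /pics/a.jpg HTTP/1.1" 200 5120 "http://hotlinker.example/page" "Mozilla/5.0"
1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /pics/b.jpg HTTP/1.1" 200 5120 "http://hotlinker.example/page" "Mozilla/5.0"
5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 1024 "http://mysite.example/" "Mozilla/5.0"
EOF

# Split on double quotes: field 2 is the request line, field 4 the referrer.
# Count referrers for media requests only, busiest first.
awk -F'"' '$2 ~ /\.(jpe?g|png|gif|mp3|mp4)/ {print $4}' /tmp/access.log \
  | sort | uniq -c | sort -rn | head -20
```

In this sample it prints the hotlinking referrer with a count of 2 and ignores the plain HTML hit, which is exactly the shape of report you want before deciding whom to block.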

If someone is using a program like wget to download your entire site, you could try to block them in any number of ways, but you’re not likely to be successful. They can probably work around whatever protections you put into place.
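The crude DIY version is refusing known bulk-downloader user-agents in .htaccess. A sketch, again assuming Apache with mod_rewrite, and trivially defeated by anyone who changes their user-agent string:

```apache
RewriteEngine On
# Case-insensitive match on a few well-known site copiers.
RewriteCond %{HTTP_USER_AGENT} (wget|httrack|webzip) [NC]
RewriteRule .* - [F]
```

It only stops the lazy scrapers who run the tools with default settings, but those are often the majority.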

Anyone remember when SDMB threads were harvested to make an “all chickens, all the time” website? Back about '05? Cockadoodledoo forum?

QtM remembers . . .

I’d forgotten that. I think there were a few occasions when other boards stole content from us. Nice to see though that Cockadoodledoo is clucking no more.

There was some popular site (was it Slashdot?) that defeated hotlinking by substituting really nasty porn images.

There was some web site (not Slashdot) many years back, and I mean back in the "almost everybody is on dial-up" days, that was having its photos hotlinked by a school, which even used them in lessons. The site owner contacted the school and asked them to stop; they essentially told him to bite them (but more politely, claiming something like fair use or some other hot air), so he swapped out porn for them. They stopped hotlinking.

Google prevents this by blocking your IP if you try to access too many pages too quickly. They're probably extra miserly when the user-agent looks a bit iffy as well.

Are you sure the slurper isn't some sort of spider from a search engine? You can rein those in with a robots.txt file and/or .htaccess settings.
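For a well-behaved crawler, a robots.txt at the site root is the standard knob. A sketch; the /media/ path and sitemap URL are only examples, and note that Crawl-delay is honored by some crawlers (Bing, for instance) but ignored by Google:

```
# robots.txt -- served from the site root
User-agent: *
Crawl-delay: 10
Disallow: /media/

Sitemap: https://example.com/sitemap.xml
```

None of this is enforced, of course; it only works on bots polite enough to read it, which rules out the kind of slurper causing the bandwidth warnings.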

There are web application firewalls that can prevent webscraping, but such solutions do not come cheap.

Actually, there is one that is free and works amazingly well: CloudFlare. I cannot believe how well it works. It offers basic blocking by default, including bot prevention and basic content-scraping protection. It also offers free SSL and globally anycast DNS. BUT… on top of all that, it gives you a free global CDN to hide your origin server behind, so if you're serving mostly static content (which is what Barbarian's content sounds like) it could take 50% or more of the bandwidth usage away, depending only on how frequently you clear your most popular cached objects. That's all under their free tier. As a bonus, if your site is a PHP site (or generated using some other server-side code), CloudFlare will speed it up a huge amount for your visitors just by virtue of caching the already-rendered content.

Moving from Distil (another WAF-type service) to CloudFlare saved us hundreds a month, and CloudFlare offered much, much better performance; the free plan worked fine for us for quite a while.

TLDR: Put the whole site behind CloudFlare’s free plan and stop worrying about it.

I’ve never really looked at what Cloudflare actually offer, just at the connections our customers get from their services (usually just CDN aggregation).

Interesting.

Yeah, it's really ridiculous what they're able to offer for free or cheap ($18/mo goes VERY far) compared to Distil, Rackspace, Amazon, Azure, Imperva, etc. We considered all of those options and several others, gave CloudFlare a shot – initially rather skeptical – but it soon proved to work so well that we redid our entire stack to move off Amazon CloudFront, took Distil away, and refactored everything to work better with CloudFlare. They are beta-testing automatic failover across regions now, which was the last thing Rackspace offered that they didn't.

I imagine they will be very disruptive in the years to come. Distil begged and begged us to come back, but they just couldn't match CloudFlare's performance-to-cost ratio, and CloudFlare never gave us the horrendous false positives that Distil did either. Distil is much better on paper at blocking bots, but what they don't tell you (and what they are still unable to fix) is that they also block a lot of real humans, which turns into customer-support headaches. Anyway, that's neither here nor there, but Amateur Barbarian: just don't go so far into the paranoia that you end up with false positives who can't view your website at all. CloudFlare's been fine for us in this regard, especially if you limit it to challenging violators with a CAPTCHA or JavaScript test rather than banning them outright. Distil, a competitor, does not have that kind of granularity.

Just as a note: hotlinking is not a copyright violation, but copying the picture to use on your own website is.

Sure. But this is the net we’re talking about, not reality. :slight_smile:

Thanks for all replies and discussion. I really do understand most of the basics of redirectors and all that; I was looking for the one really simple answer that was eluding my foggy brain.

I don’t much care about hotlinking and I’m pleased to have my resources used. But the periodic site-scrapes that garner me bandwidth warnings are annoying.

When I build a website, I use links rather than copying the content into mine, because I respect IP rights.

With both full attribution prominently displayed on your site and written permission of the copyright holder and the website owner on file. Right?

I’m not sure how an unattributed hot link is either within copyright or fair use or showing respect.

Because links are not a violation of copyright.