Figuring out which traffic is legitimate and which isn’t can be tricky. If a (ro)bot script is written well, and often it is, it appears to be just like any other normal visitor.
The challenge that we’re presented with is being able to spot the bad bots and isolate their traffic from the “good” traffic – the traffic we want to keep.
Bot, or Not?
Bots don’t load pages randomly. They do it for a reason, and that reason depends on the bot’s purpose and what it wants to achieve.
If the bot is a web crawler, like Google or Bing, then the reason is clear.
But other bot traffic can have more nefarious motives.
It could be a bot designed to test your login. It may try a dictionary-based attack to “guess” login credentials to gain access to the admin area.
It could be a SPAM bot whose sole mission in life is posting SPAM comments to your blog articles.
Or it could simply be probing your site, building up a picture of it so it has a better idea of how your site is structured and what it’s “made of”. The reasons for doing this are endless, but nothing good is likely to come from it.
Shield Security has protection against various types of requests and uses different tools to thwart bots depending on what exactly is happening. It uses the particular nature of the requests to identify them as coming from a bot, or not.
But in the case of a “probing” bot that’s just browsing your site, it can be difficult to isolate this traffic and “see” a bot request when it looks just like a normal visitor.
We don’t want our sites probed. We don’t want bots building up a picture of our site for later use.
So then, how can we reliably identify bots that don’t look like bots?
Legitimate Bots Follow The Rules; Bad Bots Don’t Care
There’s a file on all your WordPress sites called robots.txt. You can see ours here: robots.txt.
It’s not a real file, but one that WordPress creates on-the-fly as it’s needed.
The purpose of robots.txt is to provide a list of rules for bots and web crawlers, like Google and Bing. It tells them what they’re allowed to index, and what they need to ignore. The search engines will only index pages that are permitted, and won’t even scan the ones that are disallowed.
Legitimate bots and crawlers will honour these directives.
Since malicious bots don’t care about the rules, they’ll completely ignore your robots.txt and do whatever they like.
That’s outrageous, you might say, but we can use this to our advantage, as we’ll see a bit later.
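As an example, the virtual robots.txt that a default WordPress site serves looks something like this:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

The Disallow line tells crawlers to keep out of /wp-admin/. A legitimate crawler will honour it; a bad bot will happily request those URLs anyway.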
There’s another rule that web crawlers look at when they’re indexing your site: rel="nofollow". This property is added to web links to tell crawlers not to index the page behind the link – to basically ignore it.
Again, good bots should adhere to these rules; bad bots aren’t going to care.
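In practice, rel="nofollow" is just an attribute on an ordinary link (the URL here is purely illustrative):

```html
<a href="https://example.com/private-page/" rel="nofollow">Private Page</a>
```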
How to use robots.txt rules and rel="nofollow" to identify bad bots
Now that we know bad bots will blatantly ignore the rules, we can use this to our advantage.
Imagine you placed a link on your page:
- which was hidden, so only bots could see it and human visitors were oblivious to it
- which had rel="nofollow" on it
- which linked to a page on your site that was disallowed by your robots.txt rules.
Now imagine that you saw traffic to that link. How could that traffic possibly have been generated?
The only traffic that would go to that link is from bots that could see the link on your page and ignored all the rules that said “don’t go there”.
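To make that concrete, such a trap link might look something like the following – a purely illustrative sketch, not Shield’s actual markup (the path and class name are made up):

```html
<!-- Hidden from humans by CSS; disallowed for crawlers by robots.txt -->
<style>.mouse-trap { display: none; }</style>
<a class="mouse-trap" href="/mouse-trap-link/" rel="nofollow">trap</a>
```

Paired with a matching Disallow: /mouse-trap-link/ rule in robots.txt, the only visitors that will ever request that URL are bots that broke every rule in the book.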
Mouse Trap for Bad Bots comes to Shield Security Pro 7.3
Shield Security Pro 7.3+ has a Mouse Trap feature for detecting bad bots. This new feature arrives along with a few other bot detection signals which we’ll discuss elsewhere in more detail.
The Mouse Trap works by leaving a bit of “cheese” in the form of a link that alerts Shield to a bad bot whenever it’s accessed.
It’s a simple matter of switching on the option and Shield will take care of the rest. When enabled, Shield Security will (see the sketch after this list for a rough idea of the mechanics):
- automatically update your robots.txt file to indicate the virtual links that are not to be indexed.
- automatically include an invisible, fake link in the footer of all pages on your site. This is the cheese to tempt the bad bots.
Options for handling bots that nibble on the link cheese
There are 4 options for dealing with any bots that access the fake link in the mouse trap:
- Log the event in the Activity Log
- Increment the offense counter for that IP (by 1)
- Double increment the offense counter for that IP (by 2)
- Immediately block the IP address
How your site responds is entirely up to you. To err on the side of caution we recommend option #1. But if your patience is a little thin for bad bots, choose option #4 and immediately block them from further probing.
We definitely advocate a more lenient approach in the beginning. Test this feature on your site by assigning black marks against an IP address several times before outright blocking it. In this way you can be sure the feature works well for your site before firming up your protection and instantly blocking offenders.
Please do leave us comments and questions below and we’ll get right back to you.
[At the time of writing, Shield 7.3 hasn’t been released as it’s undergoing final testing. It won’t be long though… ]
[2019/04/10] Update regarding SEO implications
We received some comments about this feature, its potential impact on SEO, and the view Google might take of it with respect to “cloaking”. While this, of course, isn’t malicious cloaking, it’s still cloaking – the act of hiding or changing content based on whether a visitor is a Google bot.
As a result, we’ve changed how the feature operates. Now, it’ll always print the link regardless of whether or not the visitor is an official web crawler such as Googlebot. And since legitimate bots honour the directives set out in robots.txt, they’ll not follow the links anyway.
Paul, I love this new idea. Will this help for those bots that are trying to find your login username? And just wondering if this will use an IP block or other?
Heya Charlie,
Yep, this will all use the IP block system and there will be separate options that deal with invalid login usernames etc. More on that to come though…
Cheers!
Paul.
Great stuff! Thank You Paul. This new feature is really cool.
Great! Delighted to hear you like the feature 🙂
Hi Paul,
I like the idea of setting up a trap, but not so much that you will put links on all the pages of a site.
This could cause problems in the long term. I cannot be sure without more checks, but:
– SEO issue. Will Google see this as an attempt to get links and hide them? I know that’s not the case, but will the Google bot understand?
– In case Google considers this OK for SEO, could the Google bot consider that your website has been hacked? Some hacks do exactly that: put links on your site that only bots can see.
I think the idea is not bad, but one link on one page (the login page, for example), or via a shortcode on the front page or wherever the admin wants it to appear, would work too and wouldn’t be as risky should a legitimate bot make a mistake.
Have you tested all this on your websites? For how long? Did you contact Google to make sure this is OK with them?
Thank you!
Hi Angel, thanks for the question.
With SEO and Google – legitimate bots are automatically filtered out from any Shield processing. This means that these links won’t ever appear to the search engines in the first place, and so can’t have any effect on your SEO.
These links will NOT appear if any of the following are true:
– you’re logged in
– your IP is whitelisted
– the IP manager module is disabled
– the request comes from a legitimate web crawler such as Google/Bing (see update above). Hope this helps.
It helps but I do not agree. It is too much risk.
Because then you are hiding the link from them too, so if they change the bot name in the future this could be a big issue.
I won’t activate this unless I can select which pages show the links.
I will prepare a mouse trap with a page not indexed by Google and then put the link there, but I need to control that.
Add a shortcode or a selector to choose the page where we want the link to appear, and then I could use it.
Bots are not a threat important enough to risk the whole SEO of the website. Even if the risk is small, I want 0 risk on that matter.
Also, from Google:
https://support.google.com/webmasters/answer/66355?hl=en&ref_topic=6001971
The problem could be the cloaking of the link: even if the link is a “nofollow”, having hidden links could be a problem.
I prefer to do this on a page I know my nice bot won’t reach, just in case.
Hi Paul,
I share the same concerns as Angel has! I am an SEO specialist and am afraid that Google will see this as a trick to try to get higher rankings. They don’t like invisible links and will punish you when they see them!
Fred
Hi Fred. As you’ll see above in the reply to Angel, this can’t have any effect on your SEO because legitimate search engines won’t ever see the hidden links.
I like the overall concept and it appears you have it figured out regarding how you will handle the legitimate bots like Bing and Google. Like you say, legitimate bots will never see the links if they follow the rules.
Hi Roy. That’s exactly right. Shield already whitelists legitimate bots and doesn’t apply any processing when they visit, so these links won’t ever appear.
Hi Paul,
And what about when you use a plugin like WP Fastest Cache with the setting “Create the cache of all the site automatically” on?
When is the mouse trap injected into the page?
Hi Fred,
In this case, it’ll likely never appear. But this is typical of the trade-off that needs to be made when using Page Cache in what is an inherently dynamic platform: https://www.icontrolwp.com/blog/page-caching-wordpress/
Hi Paul,
Thanks! Great article which makes a lot of sense!
Love your solutions and explanations.
Fred
Hi Paul,
You state that “Now, it’ll always print the link regardless of whether or not the visitor is an official web crawler such as Googlebot.”
Could you please explain how and where the link is printed?
Thanks!
Fred
Heya Fred, sure. The link is printed right at the end of the page, just before the closing </body> tag. It’ll have a bit of CSS which hides the link from normal visitors, so only bots would “see” it. Google (etc.) bots won’t follow it because robots.txt will have told them not to. Hope that helps, Fred.
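Something along these lines, purely as an illustration (the actual markup, path and styling Shield prints will differ):

```html
  ...page content...
  <a href="/mouse-trap-link/" rel="nofollow" style="display:none;">trap</a>
</body>
```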
Cheers.
So just confirming: this won’t have any adverse effect on SEO, and Google will NOT see this as bad behaviour?
I really need this feature and I will get the plugin as early as tomorrow if you can put my mind at ease. I am about to launch a website which I have worked on for months, including its SEO and everything else. I would hate for it to start on the wrong foot vis-à-vis Google.
I read the above comments but am just making 100% sure that there should be no issue. You haven’t heard anything confirming that this feature can be harmful, have you?
Waiting for your response. Btw, I love the plugin. I replaced Sucuri with your plugin – that’s how much I love it. And, as mentioned, I am very close to buying the Pro version.
Thanks again!
Hi,
Thanks for the comment and question.
So yes, you are correct – there will be no adverse SEO effects here. We listened to the feedback on this and decided that it was better to include all the links whether being accessed by a Google/search engine bot or not. The robots.txt indicates what should/shouldn’t be followed, so legitimate bots that respect robots.txt won’t be affected by this whatsoever.
Furthermore, Shield will never block a legitimate search engine spider because it verifies them when they present themselves and we whitelist them from any Shield protections.
If we’d heard anything about SEO harm being caused here, we’d have adjusted the feature immediately. This would affect us also, of course.
Thank you again for your question!
Paul.