W3C Link Checker and Robots.txt Exclusion

I was ready. I had just finished updating the last few links (and some content) on my client's website, and I was ready to check for broken links. I opened the trusty W3C Link Checker and got the following error:

Error: 403 Forbidden by robots.txt

This particular site is a WordPress site, and I use the Google XML Sitemaps plugin to handle all of that, which in this case defaults to letting WordPress generate the robots.txt dynamically. I could have allowed search engines access in the WordPress settings, but this was the staging server, and I couldn't have robots crawling and indexing it.
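For context, when search engines are blocked in the WordPress privacy settings, the virtual robots.txt that WordPress serves typically looks something like the following (the exact output can vary by WordPress version and by plugins such as Google XML Sitemaps), and that is what the link checker was obeying when it reported the 403:

User-agent: *
Disallow: /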

What I needed was an exception that would allow the W3C Link Checker to crawl the site.

To solve this problem, I first copied the exact text of the dynamically generated robots.txt file by navigating to it at the site root: http://yourclientsdomain/robots.txt. Next, I created a temporary robots.txt file in the root of the site and pasted in the text from the first step. Then, following the W3C Link Checker documentation, I added an exclusion rule that grants access only to their checker. Like so:

User-Agent: *
Disallow: /

User-Agent: W3C-checklink
Disallow:
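If you would rather script those steps than copy and paste by hand, here is a minimal sketch in Python. It assumes a hypothetical staging domain of staging.example.com and that it runs from the site's document root; adjust both to suit.

# Grab the dynamically generated robots.txt exactly as WordPress serves it,
# then write a static robots.txt with the W3C Link Checker exception appended.
# A physical file in the document root takes precedence over the virtual one.
from urllib.request import urlopen

SITE = "http://staging.example.com"  # hypothetical staging domain

current = urlopen(f"{SITE}/robots.txt").read().decode("utf-8")

# The exception from the W3C Link Checker documentation: allow only W3C-checklink.
exception = "\nUser-Agent: W3C-checklink\nDisallow:\n"

with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(current.rstrip("\n") + "\n" + exception)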

This allows the link checker access to the site while continuing to block all other robots. Once I finished link checking, I renamed the robots.txt file in the root to robots_backup.txt. I wanted to keep it for the time being, since I will be link checking again once we go live; eventually I will delete it. Happy coding.
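To double-check that the exclusion behaves as intended before running the checker, Python's standard urllib.robotparser can read the live robots.txt. A quick sanity check, again against the hypothetical staging.example.com:

# Verify that W3C-checklink gets through while every other robot stays blocked.
from urllib.robotparser import RobotFileParser

SITE = "http://staging.example.com"  # hypothetical staging domain

parser = RobotFileParser(f"{SITE}/robots.txt")
parser.read()

print(parser.can_fetch("W3C-checklink", f"{SITE}/"))  # expected: True
print(parser.can_fetch("SomeOtherBot", f"{SITE}/"))   # expected: False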
