I was ready. I had just finished updating the last few links (and some content) on my client's website, and it was time to check for broken links. I opened the trusty W3C Link Checker and got the following error:
Error: 403 Forbidden by robots.txt
This particular site is a WordPress site, and I use the Google XML Sitemap plugin to handle all of that, which in this case defaults to letting WordPress generate the robots.txt dynamically. I could have granted search engines access in the WordPress settings, but this was the staging server, and I couldn't let robots start crawling and indexing it.
What I needed was an exception that allowed the W3C Link Checker, and only the W3C Link Checker, to crawl the site.
To solve this, I first copied the exact text from the dynamically generated robots.txt file by navigating to it at the site root: http://yourclientsdomain/robots.txt. Next, I created a temporary, physical robots.txt file in the root of the site and pasted in the text from the first step. Then, following the W3C Link Checker documentation, I added a robots exclusion rule that grants access only to their checker. Like so:
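A minimal version of that file might look like the following sketch. It assumes `W3C-checklink` as the user-agent name, which is what the W3C Link Checker identifies itself as; on a real staging site, your copied rules would replace or sit alongside the blanket block at the bottom:

```text
# Let the W3C Link Checker crawl the whole site
User-agent: W3C-checklink
Disallow:

# Block every other robot from the staging site
User-agent: *
Disallow: /
```

The more specific `W3C-checklink` record comes first and declares an empty `Disallow`, meaning nothing is off-limits to that crawler, while the wildcard record keeps everything off-limits to everyone else.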
This allows the link checker access to the site while continuing to block all other robots. Once finished, I renamed the robots.txt file in the root to robots_backup.txt. I wanted to keep it for the time being, since I will be link checking again once we go live; eventually I will delete it. Happy coding.