So I have a site http://www.example.com.
The JS/CSS/images are served from a CDN - http://xxxx.cloudfront.net OR http://cdn.example.com; they are both the same thing. Now the CDN just serves any type of file, including my PHP pages. Google somehow started crawling that CDN site as well - two sites actually: cdn.example.com AND http://xxxx.cloudfront.net. Considering:
- I am NOT trying to set up a subdomain OR a mirror site. If that happens, it is a side effect of me trying to set up a CDN.
- The CDN is some web server, not necessarily Apache. I do not know what type of server it is.
There is no request processing on the CDN; it just fetches things from the origin server. As far as I can tell, you cannot put custom files out there on the CDN directly; whatever you need to put on the CDN has to come from the origin server.
How do I prevent the crawling of PHP pages?
- Should I allow crawling of images from cdn.example.com OR from example.com? The links to images inside the HTML all point to cdn.example.com. If I allow crawling of images only from example.com, then there is practically nothing to crawl - there are no links to such images. If I allow crawling of images from cdn.example.com, does that not leak away the SEO benefits?
Some alternatives I have considered, based on Stack Overflow answers:
- Write a custom robots_cdn.txt and serve it based on HTTP_HOST. This is as per many answers on Stack Overflow (a sketch is shown after this list).
- Serve a new robots.txt from the subdomain. As I explained above, I do not think the CDN can be treated like a subdomain.
- Do a 301 redirect to www.example.com when HTTP_HOST is cdn.example.com (also sketched below).
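For the first alternative, here is a minimal sketch, assuming requests for /robots.txt on the origin are routed to a PHP script (e.g. by a rewrite rule) and that the CDN forwards the original Host header to the origin (CloudFront replaces it with the origin hostname unless configured otherwise). The asset directories and the robots_main.txt file name are placeholders:

```php
<?php
// robots.php - serve different robots rules depending on the requesting host.
header('Content-Type: text/plain');

$host = isset($_SERVER['HTTP_HOST']) ? strtolower($_SERVER['HTTP_HOST']) : '';

if ($host === 'cdn.example.com' || substr($host, -15) === '.cloudfront.net') {
    // CDN hostname: keep crawlers out of everything except static assets.
    // Drop the Allow lines if images should only be crawled from the main domain.
    echo "User-agent: *\n";
    echo "Disallow: /\n";
    echo "Allow: /images/\n"; // placeholder asset directories
    echo "Allow: /css/\n";
    echo "Allow: /js/\n";
} else {
    // Main domain: serve the normal rules.
    readfile(__DIR__ . '/robots_main.txt'); // placeholder file name
}
```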
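And a sketch of the 301-redirect alternative, assuming it is included at the top of the PHP pages on the origin (static assets never reach PHP, so they would keep being served from the CDN):

```php
<?php
// cdn_redirect.php - bounce requests that arrive via a CDN hostname back to
// the canonical www.example.com with a permanent redirect.
$host = isset($_SERVER['HTTP_HOST']) ? strtolower($_SERVER['HTTP_HOST']) : '';

if ($host === 'cdn.example.com' || substr($host, -15) === '.cloudfront.net') {
    $uri = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/';
    header('Location: http://www.example.com' . $uri, true, 301);
    exit;
}
```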
Suggestions?
Related questions, e.g. "How to disallow a mirror site (on a sub-domain) using robots.txt?"
Comments
You can put a robots.txt in your root directory so that it will be served at cdn.yourdomain.com/robots.txt. In this robots.txt you can disallow all the crawlers with the setting below.
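I.e. the standard disallow-all rules:

```
# block all crawlers (as suggested in the comment above)
User-agent: *
Disallow: /
```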