Disallow crawling of the CDN site

So I have a site http://www.example.com.

The JS, CSS, and images are served from a CDN at http://xxxx.cloudfront.net or http://cdn.example.com; both point to the same thing. The CDN will serve any type of file, including my PHP pages. Google has somehow started crawling the CDN as well, as two separate sites: cdn.example.com and http://xxxx.cloudfront.net. Some things to consider:

  1. I am NOT trying to set up a subdomain OR a mirror site. If that happens, it is a side effect of setting up the CDN.
  2. The CDN is some web server, not necessarily Apache. I do not know what type of server it is.
  3. There is no request processing on the CDN; it just fetches things from the origin server. I do not think you can put custom files on the CDN; whatever needs to be on the CDN comes from the origin server.

  4. How do I prevent the crawling of PHP pages?

  5. Should I allow crawling of images from cdn.example.com OR from example.com? The image links inside the HTML all point to cdn.example.com. If I allow crawling of images only from example.com, then there is practically nothing to crawl, because no links point to those images. If I allow crawling of images from cdn.example.com, does that not leak the SEO benefits away from example.com?

Some alternatives that I considered, based on Stack Overflow answers:

  1. Write a custom robots_cdn.txt and serve it instead of robots.txt based on HTTP_HOST. Many answers on Stack Overflow suggest this.
  2. Serve a new robots.txt from the subdomain. As I explained above, I do not think that the CDN can be treated like a subdomain.
  3. Do 301 redirects to www.example.com when HTTP_HOST is cdn.example.com.
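
To make alternatives 1 and 3 concrete, here is a minimal sketch of the host-based decision, written in Python for brevity; the same logic could be ported to PHP using HTTP_HOST. The function names, the `STATIC_EXTENSIONS` list, and the robots.txt bodies are hypothetical and only for illustration. It also assumes the CDN passes the original Host header through to the origin, which depends on how the CDN is configured; if it does not, the origin cannot tell CDN requests apart this way. The `path` argument is assumed to exclude any query string.

```python
# Hypothetical host names taken from the question; adjust to the real setup.
CDN_HOSTS = {"cdn.example.com", "xxxx.cloudfront.net"}
CANONICAL_HOST = "www.example.com"

# Body served when the request arrives under a CDN host name.
ROBOTS_CDN = "User-agent: *\nDisallow: /\n"
# Body served on the main site: an empty Disallow permits everything.
ROBOTS_MAIN = "User-agent: *\nDisallow:\n"

# Assumed list of static file types that should stay on the CDN.
STATIC_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")


def normalize_host(host_header: str) -> str:
    """Lower-case the Host header and drop any :port suffix."""
    return host_header.strip().lower().split(":", 1)[0]


def robots_for_host(host_header: str) -> str:
    """Alternative 1: choose the robots.txt body from the Host header."""
    if normalize_host(host_header) in CDN_HOSTS:
        return ROBOTS_CDN
    return ROBOTS_MAIN


def redirect_for(host_header: str, path: str):
    """Alternative 3: return a 301 target for non-static paths on a CDN host, else None."""
    if normalize_host(host_header) not in CDN_HOSTS:
        return None
    # Leave robots.txt and static assets alone so the CDN keeps serving them.
    if path == "/robots.txt" or path.lower().endswith(STATIC_EXTENSIONS):
        return None
    return f"http://{CANONICAL_HOST}{path}"
```

For example, `redirect_for("cdn.example.com", "/index.php")` yields a redirect target on www.example.com, while a request for an image on the same host is left alone. Note that `Disallow: /` in `ROBOTS_CDN` also covers the images on the CDN host, so it would need narrowing if those images should stay crawlable.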

Suggestions?

Related questions, e.g. How Disallow a mirror site (on sub-domain) using robots.txt?

Comments


  • Morbid

    You can put a robots.txt in your root directory so that it is served at cdn.-yourdomain-.com/robots.txt. In this robots.txt you can disallow all crawlers with the setting below:

    User-agent: *
    Disallow: /
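
As a quick sanity check of what these two lines do, the sketch below feeds them to Python's standard `urllib.robotparser` and asks whether a crawler may fetch a URL. The helper name `is_allowed` and the sample URLs are hypothetical. Keep in mind that, since the CDN only fetches from the origin, a file placed in the origin's root would presumably also be served at www.example.com/robots.txt unless it is varied per host.

```python
from urllib.robotparser import RobotFileParser

# The rules from the comment above.
rules = """User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_allowed(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


# Both PHP pages and images are blocked under these rules.
print(is_allowed("http://cdn.example.com/index.php"))
print(is_allowed("http://cdn.example.com/img/logo.png"))
```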
    
