User-agent: *
Disallow: /

User-agent: Bingbot
Disallow: /white-label/

User-agent: Googlebot
Disallow: /white-label/

User-agent: Slurp
Disallow: /white-label/

User-agent: DuckDuckBot
Disallow: /white-label/
This is my robots.txt file. I want to block all crawlers from the entire website and allow only these four bots to crawl the whole site, with the exception of /white-label/. Is this the correct approach?
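Under the robots.txt precedence rules, a crawler obeys only the most specific User-agent group that matches it, so a named bot should never see the catch-all Disallow: /. One way to sanity-check that behavior locally is Python's standard-library urllib.robotparser; below is a minimal sketch with the file trimmed to one named bot (individual crawlers may interpret matching slightly differently):

```python
import urllib.robotparser

# The robots.txt above, trimmed to one named bot for brevity.
ROBOTS = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /white-label/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Googlebot matches its own group, so only /white-label/ is blocked for it.
print(rp.can_fetch("Googlebot", "/some-page/"))        # True
print(rp.can_fetch("Googlebot", "/white-label/page"))  # False

# Any other bot falls back to the catch-all group and is blocked everywhere.
print(rp.can_fetch("SomeOtherBot", "/some-page/"))     # False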
Comments
I am scraping a site with Scrapy 2.0.1 using SitemapSpider with sitemap_urls = ['<site_root>/robots.txt'], and the response.text field contains garbage. On other sites the same code returns a readable response.text, but this one site returns garbage. If I open the problem site's robots.txt in a browser, it loads fine and is perfectly readable. I rotate user-agents from a list of well-known user-agents and use the scrapy-rotating-proxies package, which seems to be working fine.
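One common cause of unreadable bytes like this is a response body that arrives compressed but never gets decompressed, for example gzip data served without a Content-Encoding: gzip header, which Scrapy's HttpCompressionMiddleware relies on. A minimal diagnostic sketch, with a placeholder spider name and URL standing in for <site_root>:

```python
import gzip

import scrapy


class RobotsCheckSpider(scrapy.Spider):
    """Fetch robots.txt directly to inspect the raw bytes Scrapy receives."""

    # Hypothetical name and URL; substitute the real <site_root>.
    name = "robots_check"
    start_urls = ["https://example.com/robots.txt"]

    def parse(self, response):
        body = response.body
        # b"\x1f\x8b" is the gzip magic number. If it shows up here, the
        # body arrived still compressed, e.g. because the server sent
        # gzip data without a Content-Encoding: gzip header, so Scrapy's
        # HttpCompressionMiddleware never decompressed it.
        if body[:2] == b"\x1f\x8b":
            body = gzip.decompress(body)
        self.logger.info("First 200 bytes: %r", body[:200])
```

If the body does not start with the gzip magic bytes, it may be deflate- or brotli-compressed instead, which is worth checking against the response headers.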
Has anybody had a similar experience?
Thank you.