Scrapy gets garbage when scraping robots.txt

User-agent: *
Disallow: /

User-agent: Bingbot
Disallow: /white-label/

User-agent: Googlebot
Disallow: /white-label/

User-agent: Slurp
Disallow: /white-label/

User-agent: DuckDuckBot
Disallow: /white-label/

This is my robots.txt file. I want to block all crawlers from the entire website and allow only these four bots on the entire website, with the exception of /white-label/. Is this the correct way to do that?
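
One way to sanity-check the intended behaviour locally is to feed the rules to Python's standard urllib.robotparser, which, like the major crawlers, applies the most specific matching user-agent group and falls back to the "*" group otherwise. This is only a sketch: the example.com URLs are placeholders, and only two of the five groups are reproduced since the others are identical in form.

import urllib.robotparser

# Two of the five groups are enough to check the logic; the Bingbot, Slurp and
# DuckDuckBot groups are identical in form to the Googlebot one.
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /white-label/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A named bot matches its own group, so it may crawl everything except /white-label/:
print(rp.can_fetch("Googlebot", "https://example.com/pricing"))        # True
print(rp.can_fetch("Googlebot", "https://example.com/white-label/x"))  # False
# Any other crawler falls back to the "*" group and is blocked everywhere:
print(rp.can_fetch("RandomBot", "https://example.com/pricing"))        # False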

Comments


  • Insane

    I am scraping a site with Scrapy 2.0.1 using SitemapSpider with sitemap_urls = ['<site_root>/robots.txt'], and the response.text field contains garbage. On other sites the same code returns a readable response.text, but this one site returns garbage. If I open that site's robots.txt in a browser, it loads fine and is perfectly readable.

    I rotate user agents from a list of well-known user agents and use the 'scrapy-rotating-proxies' package, which seems to be working fine.
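
    One way to narrow it down (a sketch only, with example.com standing in for the real site) is a bare spider that fetches robots.txt directly and logs the response headers and the first raw bytes; for example, a Content-Encoding that is not being decoded, or an unexpected Content-Type, would explain a garbled response.text:

        import scrapy


        class RobotsCheckSpider(scrapy.Spider):
            """Fetch robots.txt directly and log the headers that usually explain a
            'garbage' body (compression / content type). example.com is a placeholder."""
            name = "robots_check"
            start_urls = ["https://example.com/robots.txt"]
            custom_settings = {"ROBOTSTXT_OBEY": False}  # fetch the file itself, don't obey it

            def parse(self, response):
                self.logger.info("Status: %s", response.status)
                self.logger.info("Content-Type: %s", response.headers.get("Content-Type"))
                self.logger.info("Content-Encoding: %s", response.headers.get("Content-Encoding"))
                # gzip bodies start with b'\x1f\x8b'; a readable file starts with b'User-agent'
                self.logger.info("First bytes: %r", response.body[:40])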

    Has anybody had a similar experience?

    Thank you.
