What does this robots.txt mean?

After a long fight trying to make the meta tags work with Vue 3, I am now facing another issue. I have a domain that contains several mini projects or websites, including this one. The problem I noticed is that when I test my website locally through a tunnel (if you are interested in what I use, check out ngrok) and then run it through metatags.io, I get all the info and everything looks fine there. But in production it doesn't work: the crawlers always pick up the main domain. The difference is that when testing locally, all the files sit at the root of an exclusively generated domain, whereas in real life the deployed version (after enabling the production path in my vue.config) has to live under a folder:

www.main-domain/

  • myproject folder
    • my website files
  • another project
  • another project
  • another pro..

And then when testing, I don't get any information from the mini website, only from the main domain. I read that the robots.txt file might help, so I tried enabling the bots on the domain, but it didn't work. In general I think I do need to allow them, but that enables the bots for the whole main domain. How do you specify that they should only be enabled at a deeper level, if that is even possible?
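From what I've read, path-scoped rules might look something like this, but I haven't been able to confirm it works for my case (here /my-subfolder/ only stands in for my real folder name):

User-agent: Twitterbot
User-agent: Facebot
User-agent: LinkedInBot
Allow: /my-subfolder/
Disallow: /

User-agent: *
Disallow: /

If I understand it correctly, crawlers that support the Allow directive apply the most specific matching rule, so the social bots could reach the subfolder while everything else stays blocked, but I may be missing something.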

Just in case, here is my vue.config.js:

const webpack = require('webpack');
const PrerenderSpaPlugin = require('prerender-spa-plugin');
const path = require('path');

module.exports = {
  // Serve from the subfolder in production, from the root in development
  publicPath: process.env.NODE_ENV === 'production' ? '/my-subfolder/' : '/',
  outputDir: 'dist',
  devServer: {
    host: 'localhost'
  },
  configureWebpack: config => {
    if (process.env.NODE_ENV !== 'production') return

    return {
      plugins: [
        new PrerenderSpaPlugin(
          // Absolute path to compiled SPA
          path.resolve(__dirname, 'dist'),
          // List of routes to prerender
          ['/', '/path/dummy-path', '/path-two/dummy-path-two', '/path-three', '/path-four', '/path-five/dummie-path-three'],
        ),
      ]
    }
  }
}

And my current robots.txt:

User-agent: Twitterbot
User-agent: Facebot
User-agent: LinkedInBot

Disallow:

User-agent: *
Disallow: /

Thank you in advance for any advice

Comments


  • Abdon

    A 403 status code is in the client-error range; it does not necessarily mean the website forbids you from extracting data. To get rid of a 403 error you usually need to inject something into the request, such as headers, and most of the time (but not always) just injecting a User-Agent header will solve the problem. Here is an example of how to inject a User-Agent using the requests module with BeautifulSoup.

    import requests
    from bs4 import BeautifulSoup

    # Pretend to be a regular browser so the server is less likely to answer with 403
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    }

    response = requests.get("Your url", headers=headers)
    print(response.status_code)

    #soup = BeautifulSoup(response.content, "lxml")
    
  • Aabbye

    There's a website that I need to crawl; I have no financial purpose, it's just for study.

    I checked the robots.txt and it was as follows.

    User-agent: *
    
    Allow: /
    
    Disallow: /*.notfound.html
    

    Can I crawl this website using requests and BeautifulSoup?

    I checked that crawling without a header causes a 403 error. Does this mean that crawling is not allowed?
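    For what it's worth, here is a rough sketch of how a rule like that could be checked programmatically with Python's built-in urllib.robotparser (the example.com URL is just a placeholder for the real site); whether the server answers 403 is a separate question from what robots.txt allows:

    from urllib.robotparser import RobotFileParser

    # example.com is only a placeholder for the site in question
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # True if the rules allow the given user agent to fetch the URL.
    # Note: the stdlib parser does simple prefix matching, so it may not
    # interpret the * wildcard in "Disallow: /*.notfound.html" the way
    # big crawlers such as Googlebot do.
    print(rp.can_fetch("*", "https://example.com/some-page"))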
