Computers · August 13, 2021

Robots Txt File

robots.txt file specification

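  • The robots.txt file must sit in the top-level directory of the host; crawlers do not check subdirectories for robots.txt files.
  • A robots.txt file applies only to paths under the same host (domain), protocol, and port number from which it is served.
  • A robots.txt file on a subdomain is valid only for that subdomain.
  • FTP hosts can also provide a robots.txt file.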
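The location rules above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler does it: the helper name `robots_txt_url` and the example.com URLs are made up for the example, and it assumes the page URL carries no user-info part.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that would govern page_url.

    Only the scheme and the network location (host plus optional port)
    are kept. The path, query, and fragment are dropped because the
    file has to sit at the top level of the host.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep in a subdirectory still maps to the top-level file.
print(robots_txt_url("https://example.com/a/b/page.html?x=1"))
# A different subdomain or port yields a different robots.txt.
print(robots_txt_url("http://shop.example.com:8080/cart"))
# The same derivation works for an FTP URL.
print(robots_txt_url("ftp://example.com/pub/file.txt"))
```

Because the scheme, host, and port are all part of the result, `https://example.com/` and `http://shop.example.com:8080/` never share a robots.txt file in this sketch.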

Formal syntax definition

  robotstxt = *(group / emptyline)
  group = startgroupline                    ; We start with a user-agent
          *(startgroupline / emptyline)     ; ... and possibly more user-agents
          *(rule / emptyline)               ; followed by rules relevant for UAs

  startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

  rule = *WS ("allow" / "disallow") *WS ":" *WS (path-pattern / empty-pattern) EOL

  ; parser implementors: add additional lines you need (for example, Sitemaps), and
  ; be lenient when reading lines that don’t conform. Apply Postel’s law.

  product-token = identifier / "*"
  path-pattern = "/" *(UTF8-char-noctl)    ; valid URI path pattern; see 3.2.2
  empty-pattern = *WS

  identifier = 1*(%x2d / %x41-5a / %x5f / %x61-7a)
  comment = "#" *(UTF8-char-noctl / WS / "#")
  emptyline = EOL
  EOL = *WS [comment] NL         ; end-of-line may have optional trailing comment
  NL = %x0D / %x0A / %x0D.0A
  WS = %x20 / %x09

  ; UTF8 derived from RFC3629, but excluding control characters
  UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
  UTF8-1-noctl    = %x21 / %x22 / %x24-7F  ; excluding control, space, '#'
  UTF8-2          = %xC2-DF UTF8-tail
  UTF8-3          = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                    %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
  UTF8-4          = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                    %xF4 %x80-8F 2( UTF8-tail )
  UTF8-tail       = %x80-BF
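  • Grouping of lines and rules
  • Order of precedence for user agents
  • Group-member rules
  • URL matching based on path values
  • Non-group-member lines supported by Google
    A line of the form sitemap: [absoluteURL] can be added to the robots.txt file to specify a sitemap.
  • Order of precedence for group-member lines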
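To make the grammar and the topics above more concrete, here is a rough Python sketch of a lenient parser plus a precedence check. It is an illustration under stated assumptions, not Google's implementation: the names `parse_robots` and `is_allowed` are hypothetical, user agents are compared by exact case-insensitive name, paths are matched as plain prefixes (the `*` and `$` wildcards are not handled), the longest matching path is assumed to win, and `allow` is assumed to win a tie.

```python
import re

# One directive line: "<key> : <value>", after any "# comment" is removed.
LINE_RE = re.compile(r"^\s*([A-Za-z_-]+)\s*:\s*(.*?)\s*$")


def parse_robots(text: str):
    """Parse robots.txt text into (groups, sitemaps).

    groups is a list of (user_agents, rules) pairs, where rules is a list
    of ("allow" | "disallow", path) tuples. Lines that do not fit the
    expected shape are skipped rather than treated as errors.
    """
    groups, sitemaps = [], []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]            # drop a trailing comment
        match = LINE_RE.match(line)
        if not match:
            continue                           # empty or malformed line
        key, value = match.group(1).lower(), match.group(2)
        if key == "user-agent":
            if rules:                          # rules already seen: new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow"):
            if agents:                         # ignore rules outside a group
                rules.append((key, value))
        elif key == "sitemap":
            sitemaps.append(value)             # not tied to any group
    if agents:
        groups.append((agents, rules))
    return groups, sitemaps


def is_allowed(groups, user_agent: str, path: str) -> bool:
    """Decide whether path may be fetched (assumptions in the lead-in)."""
    ua = user_agent.lower()
    named = [rs for agents, rs in groups if ua in agents]
    # A group that names the agent takes precedence over the "*" group.
    selected = named or [rs for agents, rs in groups if "*" in agents]
    best = ("allow", "")                       # no matching rule: allowed
    for kind, prefix in (rule for rs in selected for rule in rs):
        if not prefix or not path.startswith(prefix):
            continue                           # empty pattern matches nothing
        longer = len(prefix) > len(best[1])
        tie_allow = len(prefix) == len(best[1]) and kind == "allow"
        if longer or tie_allow:
            best = (kind, prefix)
    return best[0] == "allow"


SAMPLE = """\
User-agent: *
Disallow: /private/   # keep crawlers out
Allow: /private/public/

User-agent: examplebot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots(SAMPLE)
print(sitemaps)                                          # ['https://example.com/sitemap.xml']
print(is_allowed(groups, "anybot", "/private/x"))          # False
print(is_allowed(groups, "anybot", "/private/public/a"))   # True
print(is_allowed(groups, "ExampleBot", "/index.html"))     # False
```

The sitemap line is collected separately from the groups because it is a non-group-member line. The matching behaviour here is simplified, so the official documentation remains the reference for edge cases.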

Copying all of these rules out here would serve little purpose; consult Google's official documentation page as needed.