How to Keep Crawlers from Indexing Your Website

phx


1. Block crawlers with a robots.txt file (sample):

User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Mediapartners-Google
Disallow: /

User-agent: Adsbot-Google
Disallow: /

User-agent: Feedfetcher-Google
Disallow: /

User-agent: Yahoo! Slurp
Disallow: /

User-agent: Yahoo! Slurp China
Disallow: /

User-agent: Yahoo!-AdCrawler
Disallow: /

User-agent: YoudaoBot
Disallow: /

User-agent: Sosospider
Disallow: /

User-agent: Sogou spider
Disallow: /

User-agent: Sogou web spider
Disallow: /

User-agent: MSNBot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: Tomato Bot
Disallow: /

User-agent: *
Disallow: /
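
Before deploying a robots.txt like this, it can help to confirm that the rules say what you intend. The sketch below uses Python's standard urllib.robotparser to test a few user agents against an abbreviated copy of the sample. The helper name is_allowed, the example.com URL, and the agent "SomeOtherBot" are illustrative assumptions, not part of the original post.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the sample rules; in practice, read the real robots.txt.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a made-up agent that only matches the "*" group.
    for agent in ("Baiduspider", "Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/a/2.html"))
```

This only checks what the file declares. It says nothing about whether a given crawler actually honors it.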

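2. Block crawlers with a meta tag: add a robots meta tag to the <head> of every page, typically in this form:

<meta name="robots" content="noindex, nofollow">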
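
Because the tag has to be present on every page, it can be worth auditing pages automatically. The following is a minimal sketch using Python's standard html.parser, assuming the tag takes the common form <meta name="robots" content="noindex, nofollow">. The names RobotsMetaParser and blocks_indexing are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def blocks_indexing(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex (or none)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "none" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(blocks_indexing(page))
```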

3. Filter the IP ranges of spiders/robots directly in the server configuration (for example, Linux/nginx).
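
For nginx, one way to do this is with the deny directive of its access module, which accepts a single address or a CIDR range. The sketch below turns a list of ranges into deny lines that could be pasted into an http, server, or location block. The function name nginx_deny_rules is hypothetical, and the addresses are placeholders from the documentation ranges, not real crawler IPs.

```python
import ipaddress


def nginx_deny_rules(ranges):
    """Return nginx `deny` directives for the given IPs or CIDR ranges.

    Adjacent and overlapping ranges are merged so the list stays short.
    """
    networks = [ipaddress.ip_network(r, strict=False) for r in ranges]
    rules = []
    # collapse_addresses cannot mix IPv4 and IPv6, so handle each separately.
    for version in (4, 6):
        same_version = [n for n in networks if n.version == version]
        for net in ipaddress.collapse_addresses(same_version):
            rules.append(f"deny {net};")
    return rules


if __name__ == "__main__":
    # Placeholder addresses; replace with ranges taken from your own logs.
    bad_ranges = ["192.0.2.0/25", "192.0.2.128/25", "203.0.113.7"]
    for rule in nginx_deny_rules(bad_ranges):
        print(rule)
```

With the placeholder input above, the two adjacent /25 ranges merge into a single /24 rule and the lone address becomes a /32 rule.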

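Note: methods 1 and 2 only work against "gentlemen"; stopping "villains" takes method 3. ("Gentlemen" and "villains" here mean spiders/robots that do and do not honor the robots.txt protocol, respectively.) So once the site is live, keep tracking and analyzing the access logs to single out the IPs of these bad bots.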
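
As a rough illustration of that log analysis, the sketch below scans access-log lines and counts, per IP, the requests that self-identified crawlers made for anything other than /robots.txt. With the blanket Disallow rules from method 1, such requests suggest a bot that ignores robots.txt. It assumes the combined log format; the pattern names, the keyword list in BOT_HINT, and the sample lines are all made up for illustration.

```python
import re
from collections import Counter

# Combined log format:
# ip identity user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed keywords that mark a user agent as a self-identified crawler.
BOT_HINT = re.compile(r"bot|spider|crawl|slurp", re.IGNORECASE)

# Made-up sample lines, not taken from a real server.
SAMPLE_LOG = [
    '198.51.100.9 - - [19/Sep/2019:09:02:19 +0800] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.7 - - [19/Sep/2019:09:02:20 +0800] "GET /a/2.html HTTP/1.1" 200 5120 "-" "BadSpider/1.0"',
    '203.0.113.7 - - [19/Sep/2019:09:02:21 +0800] "GET /a/3.html?x=1 HTTP/1.1" 200 4096 "-" "BadSpider/1.0"',
    '192.0.2.10 - - [19/Sep/2019:09:02:22 +0800] "GET /a/2.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def crawler_violations(lines):
    """Count, per IP, crawler requests for paths other than /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        parts = match["request"].split()
        path = parts[1].split("?", 1)[0] if len(parts) >= 2 else ""
        if path != "/robots.txt" and BOT_HINT.search(match["agent"]):
            counts[match["ip"]] += 1
    return counts


if __name__ == "__main__":
    for ip, hits in crawler_violations(SAMPLE_LOG).most_common():
        print(ip, hits)
```

Bots that disguise themselves with a browser user agent will not be caught by this keyword check, so request volume per IP is worth watching as well.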


Category: Network
Tags: spider, network
Published: 2019-09-19 09:02:19
Link: https://phx99.com/a/2.html
