This article explains how to configure Nutch to simulate a browser and get past anti-crawler restrictions, a point that often causes confusion in day-to-day work. The method below is compiled from several sources and is simple to apply.
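When Nutch is configured to crawl http://yangshangchuan.iteye.com, the content of every fetched page is: 您的訪問請求被拒絕 ...... ("Your access request was denied"). This is the simplest anti-crawler strategy: the server reads the value of the HTTP request header User-Agent to decide whether the client is a person (a browser) or a crawler. To get around this restriction, we only need to configure Nutch to simulate a web browser.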
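To make the strategy concrete, here is a minimal sketch of what such a server-side check might look like. The rule itself (require a "Mozilla/" prefix and reject crawler-like tokens) and the names UserAgentFilterSketch and looksLikeBrowser are assumptions for illustration; the rule the site actually uses is not known.

```java
import java.util.Locale;

/** Hypothetical server-side filter; the real site's rule is not known. */
class UserAgentFilterSketch {

    // Assumed rule: accept only agents that start with "Mozilla/" and
    // contain none of a few crawler-like tokens.
    static boolean looksLikeBrowser(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return false; // no User-Agent header at all
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        boolean crawlerToken =
                ua.contains("nutch") || ua.contains("bot") || ua.contains("spider");
        return ua.startsWith("mozilla/") && !crawlerToken;
    }

    public static void main(String[] args) {
        // A typical Nutch-style agent string versus a browser-style one.
        System.out.println(looksLikeBrowser("MyCrawler/Nutch-1.7"));
        System.out.println(looksLikeBrowser(
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A check of this kind inspects nothing but the header text, which is why changing the string Nutch sends is enough to pass it.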
nutch-default.xml contains five properties related to the User-Agent:
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
The class nutch2.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are assembled into the User-Agent:
this.userAgent = getAgentString(
    conf.get("http.agent.name"),
    conf.get("http.agent.version"),
    conf.get("http.agent.description"),
    conf.get("http.agent.url"),
    conf.get("http.agent.email") );
private static String getAgentString(String agentName,
                                     String agentVersion,
                                     String agentDesc,
                                     String agentURL,
                                     String agentEmail) {

  if ( (agentName == null) || (agentName.trim().length() == 0) ) {
    // TODO : NUTCH-258
    if (LOGGER.isErrorEnabled()) {
      LOGGER.error("No User-Agent string set (http.agent.name)!");
    }
  }

  StringBuffer buf= new StringBuffer();

  buf.append(agentName);
  if (agentVersion != null) {
    buf.append("/");
    buf.append(agentVersion);
  }
  if ( ((agentDesc != null) && (agentDesc.length() != 0))
    || ((agentEmail != null) && (agentEmail.length() != 0))
    || ((agentURL != null) && (agentURL.length() != 0)) ) {
    buf.append(" (");

    if ((agentDesc != null) && (agentDesc.length() != 0)) {
      buf.append(agentDesc);
      if ( (agentURL != null) || (agentEmail != null) )
        buf.append("; ");
    }

    if ((agentURL != null) && (agentURL.length() != 0)) {
      buf.append(agentURL);
      if (agentEmail != null)
        buf.append("; ");
    }

    if ((agentEmail != null) && (agentEmail.length() != 0))
      buf.append(agentEmail);

    buf.append(")");
  }
  return buf.toString();
}
The class nutch2.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java writes the User-Agent request header. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isErrorEnabled()) {
    Http.LOG.error("User-agent is not set!");
  }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}
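The two excerpts above can be tried outside Nutch with a small standalone sketch. It restates the concatenation rule (name, then "/" and the version, then an optional parenthesised group) and builds the resulting header line. The class AgentStringSketch and its methods are hypothetical names, not part of Nutch, and the sample values such as "MyCrawler" are made up. The sketch is also a simplification: it joins only the non-empty parts, so it will not reproduce the edge cases in which the quoted code appends a trailing "; " before an empty URL or email.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone illustration of how the five properties combine; not Nutch code. */
class AgentStringSketch {

    static boolean hasText(String s) {
        return s != null && !s.isEmpty();
    }

    // Simplified restatement of getAgentString: name[/version][ (desc; url; email)]
    static String compose(String name, String version,
                          String desc, String url, String email) {
        StringBuilder buf = new StringBuilder().append(name);
        if (version != null) {
            buf.append('/').append(version);
        }
        List<String> extras = new ArrayList<>();
        if (hasText(desc)) extras.add(desc);
        if (hasText(url)) extras.add(url);
        if (hasText(email)) extras.add(email);
        if (!extras.isEmpty()) {
            buf.append(" (").append(String.join("; ", extras)).append(')');
        }
        return buf.toString();
    }

    // Mirrors the header line that HttpResponse appends to the raw request.
    static String headerLine(String userAgent) {
        return "User-Agent: " + userAgent + "\r\n";
    }

    public static void main(String[] args) {
        // Only name and version set: no parenthesised group is added.
        System.out.print(headerLine(
                compose("MyCrawler", "Nutch-1.7", "", "", "")));
        // Description and URL set: they appear in parentheses after the version.
        System.out.print(headerLine(
                compose("MyCrawler", "Nutch-1.7", "test crawler",
                        "http://example.com/bot", "")));
    }
}
```

The point to take away is that the name and version are always joined by a single "/", and that the parenthesised group is added only when at least one of the description, URL, or email is non-empty.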
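From the analysis above, adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Because Nutch joins http.agent.name and http.agent.version with a "/", each browser's User-Agent string is split at a slash between the two properties: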
1. Simulating Firefox:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>20100101 Firefox/27.0</value>
</property>
2. Simulating IE:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>6.0)</value>
</property>
3. Simulating Chrome:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>537.36</value>
</property>
4. Simulating Safari:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>534.57.2</value>
</property>
5. Simulating Opera:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>19.0.1326.59</value>
</property>
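The five configurations above follow one pattern: a full browser User-Agent string is cut at a "/" so that Nutch's own name + "/" + version concatenation puts it back together. As a sketch of that pattern for other browser strings, the hypothetical helper below (AgentSplitSketch is not part of Nutch) cuts at the last slash and prints the two properties. It assumes http.agent.description, http.agent.url, and http.agent.email stay empty, as they are in the nutch-default.xml excerpt; otherwise Nutch would append a parenthesised group after the version.

```java
/** Hypothetical helper: derive the two nutch-site.xml values from a full UA string. */
class AgentSplitSketch {

    // Cut at the last '/', so that name + "/" + version rebuilds the input exactly.
    static String[] split(String fullUserAgent) {
        int slash = fullUserAgent.lastIndexOf('/');
        if (slash <= 0 || slash == fullUserAgent.length() - 1) {
            throw new IllegalArgumentException(
                    "User-Agent needs a '/' with text on both sides: " + fullUserAgent);
        }
        return new String[] {
            fullUserAgent.substring(0, slash),
            fullUserAgent.substring(slash + 1)
        };
    }

    // Minimal XML escaping for characters that may occur in a UA string.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String property(String name, String value) {
        return "<property>\n  <name>" + name + "</name>\n  <value>"
                + escape(value) + "</value>\n</property>\n";
    }

    static String toProperties(String fullUserAgent) {
        String[] parts = split(fullUserAgent);
        return property("http.agent.name", parts[0])
                + property("http.agent.version", parts[1]);
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
        System.out.print(toProperties(chrome));
    }
}
```

Cutting at the last slash matches how the IE, Chrome, Safari, and Opera examples are split. The Firefox example cuts at an earlier slash instead, which works equally well because any slash in the string rebuilds the same header.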
Postscript: ways to check a browser's User-Agent:
1. http://www.useragentstring.com
2. http://whatsmyuseragent.com
3. http://www.enhanceie.com/ua.aspx