您好,登錄后才能下訂單哦!
這篇文章給大家分享的是有關python分布式爬蟲中scrapy.spider的使用示例的內容。小編覺得挺實用的,因此分享給大家做個參考。一起跟隨小編過來看看吧。
源碼:
""" Base class for Scrapy spiders See documentation in docs/topics/spiders.rst """ import logging import warnings from scrapy import signals from scrapy.http import Request from scrapy.utils.trackref import object_ref from scrapy.utils.url import url_is_from_spider from scrapy.utils.deprecate import create_deprecated_class from scrapy.exceptions import ScrapyDeprecationWarning from scrapy.utils.deprecate import method_is_overridden class Spider(object_ref): """Base class for scrapy spiders. All spiders must inherit from this class. """ name = None custom_settings = None def __init__(self, name=None, **kwargs): if name is not None: self.name = name elif not getattr(self, 'name', None): raise ValueError("%s must have a name" % type(self).__name__) self.__dict__.update(kwargs) if not hasattr(self, 'start_urls'): self.start_urls = [] @property def logger(self): logger = logging.getLogger(self.name) return logging.LoggerAdapter(logger, {'spider': self}) def log(self, message, level=logging.DEBUG, **kw): """Log the given message at the given log level This helper wraps a log call to the logger within the spider, but you can use it directly (e.g. Spider.logger.info('msg')) or use any other Python logger too. """ self.logger.log(level, message, **kw) @classmethod def from_crawler(cls, crawler, *args, **kwargs): spider = cls(*args, **kwargs) spider._set_crawler(crawler) return spider def set_crawler(self, crawler): warnings.warn("set_crawler is deprecated, instantiate and bound the " "spider to this crawler with from_crawler method " "instead.", category=ScrapyDeprecationWarning, stacklevel=2) assert not hasattr(self, 'crawler'), "Spider already bounded to a " \ "crawler" self._set_crawler(crawler) def _set_crawler(self, crawler): self.crawler = crawler self.settings = crawler.settings crawler.signals.connect(self.close, signals.spider_closed) def start_requests(self): cls = self.__class__ if method_is_overridden(cls, Spider, 'make_requests_from_url'): warnings.warn( "Spider.make_requests_from_url method is deprecated; it " "won't be called in future Scrapy releases. Please " "override Spider.start_requests method instead (see %s.%s)." % ( cls.__module__, cls.__name__ ), ) for url in self.start_urls: yield self.make_requests_from_url(url) else: for url in self.start_urls: yield Request(url, dont_filter=True) def make_requests_from_url(self, url): """ This method is deprecated. """ return Request(url, dont_filter=True) def parse(self, response): raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)) @classmethod def update_settings(cls, settings): settings.setdict(cls.custom_settings or {}, priority='spider') @classmethod def handles_request(cls, request): return url_is_from_spider(request.url, cls) @staticmethod def close(spider, reason): closed = getattr(spider, 'closed', None) if callable(closed): return closed(reason) def __str__(self): return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self)) __repr__ = __str__ BaseSpider = create_deprecated_class('BaseSpider', Spider) class ObsoleteClass(object): def __init__(self, message): self.message = message def __getattr__(self, name): raise AttributeError(self.message) spiders = ObsoleteClass( '"from scrapy.spider import spiders" no longer works - use ' '"from scrapy.spiderloader import SpiderLoader" and instantiate ' 'it with your project settings"' ) # Top-level imports from scrapy.spiders.crawl import CrawlSpider, Rule from scrapy.spiders.feed import XMLFeedSpider, CSVFeedSpider from scrapy.spiders.sitemap import SitemapSpider
其中需要關注的是name(爬蟲名字)、start_urls(抓取的起始url列表)、allowed_domains(限定抓取的url所在域名)、start_requests(開始抓取的方法)
name、start_urls、allowed_domains是屬性,在創建創建項目的時候已經建好了,稍作修改即可。start_requests是起始的抓取方法,一般是默認的遍歷start_urls列表生成Request對象,在scrapy中需要登錄的時候可以復寫該方法,這個比較簡單不在贅述。
感謝各位的閱讀!關于python分布式爬蟲中scrapy.spider的使用示例就分享到這里了,希望以上內容可以對大家有一定的幫助,讓大家可以學到更多知識。如果覺得文章不錯,可以把它分享出去讓更多的人看到吧!
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。