
Python Crawler in Practice: Scraping Douban Images with Scrapy

Use Scrapy to scrape all of the personal photos of a film star on Douban.

We take Monica Bellucci (莫妮卡·貝魯奇) as the example.


1. First, open a terminal in the directory where you want the project to live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows.

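This is the standard scaffold produced by scrapy startproject (main.py and the spider module are added by hand in the following steps):

banciyuan/
├── scrapy.cfg            # deploy configuration
└── banciyuan/            # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spiders live here
        └── __init__.py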

2. To make it convenient to run the Scrapy project from PyCharm, create a main.py:

from scrapy import cmdline

cmdline.execute('scrapy crawl banciyuan'.split())
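cmdline.execute runs the given command exactly as if scrapy crawl banciyuan had been typed in a terminal, so the script must be started with the working directory set to the project root (the directory containing scrapy.cfg).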

Then open Edit Configurations in PyCharm.

[Screenshot: PyCharm Edit Configurations menu]

Configure the run settings as shown: point the configuration's script path at main.py and set the working directory to the project root. After that, running main.py runs the Scrapy project.

[Screenshot: run configuration for main.py]

3. Analyze the HTML of the photo list page and create the corresponding spider.

[Screenshot: HTML structure of the Douban photo list page]

from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/celebrity/1025156/photos/']
    url = 'https://movie.douban.com/celebrity/1025156/photos/'

    def parse(self, response):
        # The last direct link in the paginator is the number of the last page.
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        print(num)
        for i in range(int(num)):
            # Each list page shows 30 photos, so page i starts at i * 30.
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the link to each photo's detail page on this list page.
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # Extract the image URL and the page title from the detail page.
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]'
            '//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]  # the ImagesPipeline expects a list of URLs
        yield item
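Before running the full crawl, the XPath expressions can be checked interactively with scrapy shell. Start it from inside the project directory so the USER_AGENT from settings.py is applied (Douban tends to reject Scrapy's default user agent); the comments describe what each query should return:

scrapy shell 'https://movie.douban.com/celebrity/1025156/photos/'

>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
# the number of the last pagination page, as a string
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
# the list of photo detail-page URLs on this page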

4. Define the item fields in items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here
    src = scrapy.Field()    # list containing the image URL to download
    title = scrapy.Field()  # page title, used to build the save path

Next, implement the download pipeline in pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image, passing the whole item along in the meta dict.
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image as <first word of the title>/<original file name>,
        # relative to IMAGES_STORE.
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
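As a quick sanity check of the file_path logic, here is the same computation traced with made-up values (both the URL and the title below are hypothetical):

# hypothetical values, for illustration only
src = 'https://img1.doubanio.com/view/photo/m/public/p12345.jpg'
title = '莫妮卡·贝鲁奇 Monica Bellucci的图片'

image_name = src.split('/')[-1]                     # 'p12345.jpg'
path = '%s/%s' % (title.split(' ')[0], image_name)  # '莫妮卡·贝鲁奇/p12345.jpg'

Because file_path returns a path relative to IMAGES_STORE, this image would end up at ./images/莫妮卡·贝鲁奇/p12345.jpg.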

Finally, configure settings.py:

# Scrapy settings for banciyuan project
#
# Only the settings changed from the generated defaults are shown here;
# see https://docs.scrapy.org/en/latest/topics/settings.html for the rest.

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the
# user-agent. Note that USER_AGENT must be a plain string, not a dict.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36')

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'banciyuan.pipelines.BanciyuanPipeline': 1,
}

# Root directory where the ImagesPipeline saves downloaded files
IMAGES_STORE = './images'
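One practical note: Scrapy's ImagesPipeline depends on the Pillow imaging library, so install it before starting the crawl:

pip install Pillow

With everything in place, run main.py from PyCharm (or scrapy crawl banciyuan from the project root) to start downloading.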

5. Crawl results

[Screenshot: crawl results, with images saved into per-title folders under ./images]


That is all for this walkthrough of scraping Douban images with Scrapy.
