I am learning Scrapy and wanted to scrape some items from this page: https://www.gumtree.com/search?sort=date&search_category=flats-houses&q=box&search_location=Vale+of+Glamorgan

To get around the robots.txt policies and the like, I saved the page to my hard drive and tested my xpaths in scrapy shell. They seem to work as expected. But when I run my spider with the command scrapy crawl basic (as recommended in the book I am reading), I got the following output:

2017-09-27 12:05:02 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: properties)
2017-09-27 12:05:02 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozila/5.0', 'SPIDER_MODULES': ['properties.spiders'], 'BOT_NAME': 'properties', 'NEWSPIDER_MODULE': 'properties.spiders'}
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-27 12:05:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-27 12:05:03 [scrapy.core.engine] INFO: Spider opened
2017-09-27 12:05:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-27 12:05:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-09-27 12:05:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/albert/Documents/programming/python/scrapy/properties/properties/tests/test_page.html> (referer: None)
2017-09-27 12:05:04 [basic] DEBUG: title: 
2017-09-27 12:05:04 [basic] DEBUG: price: 
2017-09-27 12:05:04 [basic] DEBUG: description: 
2017-09-27 12:05:04 [basic] DEBUG: address: 
2017-09-27 12:05:04 [basic] DEBUG: image_urls: 
2017-09-27 12:05:04 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-27 12:05:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 262,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 270547,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 27, 9, 5, 4, 91741),
 'log_count/DEBUG': 7,
 'log_count/INFO': 7,
 'memusage/max': 50790400,
 'memusage/startup': 50790400,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 27, 9, 5, 3, 718976)}
2017-09-27 12:05:04 [scrapy.core.engine] INFO: Spider closed (finished)
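
For reference, my scrapy shell test looked roughly like this (a sketch using the saved file path and two of the xpaths from the spider below; in the shell these do return non-empty lists):

    $ scrapy shell file:///home/albert/Documents/programming/python/scrapy/properties/properties/tests/test_page.html
    >>> # same selectors as in the spider; in the shell they return data
    >>> response.xpath("//h2[@class='listing-title' and not(span)]/text()").extract()
    >>> response.xpath("//meta[@itemprop='price']/@content").extract()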

Here is my items.py:

from scrapy.item import Item, Field


class PropertiesItem(Item):
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    images = Field()
    location = Field()

    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
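
A scrapy.Item behaves like a dict that only accepts its declared fields, so a quick sanity check in a Python shell might look like this (the values are hypothetical):

    >>> from properties.items import PropertiesItem
    >>> item = PropertiesItem()
    >>> item['title'] = 'Example listing'  # fine: 'title' is a declared Field
    >>> item['bogus'] = 'x'                # raises KeyError: field not declared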

And here is the spider, basic.py:

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html']

    def parse(self, response):
        self.log('title: '.format(response.xpath(
            "//h2[@class='listing-title' and not(span)]/text()").extract()))
        self.log('price: '.format(response.xpath(
            "//meta[@itemprop='price']/@content").extract()))
        self.log("description: ".format(response.xpath(
            "//p[@itemprop='description' and not(span)]/text()").extract()))
        self.log('address: '.format(response.xpath(
            "//span[@class='truncate-line']/text()[2]").re('\|(\s+\w+.+)')))
        self.log('image_urls: '.format(response.xpath(
            "//noscript/img/@src").extract()))

The xpaths are a bit clumsy, but they work. However, no items are being collected, and I would like to know why.

Albert, 27 Sep 2017 at 13:07

2 answers

Best answer

Your problem is that you never gave the format function a place in the string to insert its output. Change 'title: ' to 'title: {}' so that format actually inserts the value. Also, use extract_first() instead of extract(), so you get a single string instead of a list.
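
The difference is easy to see in a plain Python shell:

    >>> 'title: '.format('foo')    # no {} placeholder, so the argument is dropped
    'title: '
    >>> 'title: {}'.format('foo')  # {} is replaced by the argument
    'title: foo'

With the placeholders added and extract_first() in place, the spider becomes: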

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html']

    def parse(self, response):
        self.log('title: {}'.format(response.xpath(
            "//h2[@class='listing-title' and not(span)]/text()").extract_first()))
        self.log('price: {}'.format(response.xpath(
            "//meta[@itemprop='price']/@content").extract_first()))
        self.log("description: {}".format(response.xpath(
            "//p[@itemprop='description' and not(span)]/text()").extract_first()))
        self.log('address: {}'.format(response.xpath(
            "//span[@class='truncate-line']/text()[2]").re('\|(\s+\w+.+)')))
        self.log('image_urls: {}'.format(response.xpath(
            "//noscript/img/@src").extract_first()))
Tarun Lalwani, 27 Sep 2017 at 11:22

I have not tried Scrapy on a local file, but if you want to scrape something, you first need to instantiate the Item, then assign to it like a Python dict, and finally yield the item to the pipeline:

import scrapy
from properties.items import PropertiesItem

class BasicSpider(scrapy.Spider):
    name = 'basic'
    start_urls = ['file:///home/albert/Documents/programming/python/scrapy/properties/properties/site/test_page.html']

    def parse(self, response):
        item = PropertiesItem()     # instantiate the Item
        # assign each field, dict-style, using the xpaths from the question
        item['title'] = response.xpath("//h2[@class='listing-title' and not(span)]/text()").extract()
        item['price'] = response.xpath("//meta[@itemprop='price']/@content").extract()
        item['description'] = response.xpath("//p[@itemprop='description' and not(span)]/text()").extract()
        item['address'] = response.xpath("//span[@class='truncate-line']/text()[2]").re('\|(\s+\w+.+)')
        item['image_urls'] = response.xpath("//noscript/img/@src").extract()
        # yield the populated item so the engine can pass it on
        yield item
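
For the yielded items to actually be processed, you also need a pipeline; a minimal sketch (the class and module names here are just examples, and the pipeline must be enabled in settings.py):

    # properties/pipelines.py
    class PropertiesPipeline(object):
        def process_item(self, item, spider):
            # called once for every item the spider yields;
            # returning it passes it on to the next pipeline, if any
            return item

    # settings.py
    ITEM_PIPELINES = {'properties.pipelines.PropertiesPipeline': 300}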
zhongjiajie, 27 Sep 2017 at 11:31