I wrote a script using the concurrent.futures library to print the results of the fetch_links function. When I use a print statement inside the function, I get the results as expected. What I want to do now is print the results of that function while using a yield statement instead.

Is there any way I can modify the main function to print the results of fetch_links while keeping the function as it is, i.e., keeping the yield statement?

import requests
from bs4 import BeautifulSoup
import concurrent.futures as cf

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=4&pagesize=50"
]

base = 'https://stackoverflow.com{}'

def fetch_links(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select(".summary .question-hyperlink"):
        # print(base.format(item.get("href")))
        yield base.format(item.get("href"))

if __name__ == '__main__':
    with requests.Session() as s:
        with cf.ThreadPoolExecutor(max_workers=5) as exe:
            future_to_url = {exe.submit(fetch_links,s,url): url for url in links}
            cf.as_completed(future_to_url)
MITHU, 12 Oct 2020 at 20:51

1 answer

Best answer

Your fetch_links is a generator, so you also have to iterate over it to get the results:

import requests
from bs4 import BeautifulSoup
import concurrent.futures as cf

links = [
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50",
    "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=4&pagesize=50"
]

base = 'https://stackoverflow.com{}'


def fetch_links(s, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select(".summary .question-hyperlink"):
        yield base.format(item.get("href"))


if __name__ == '__main__':
    with requests.Session() as s:
        with cf.ThreadPoolExecutor(max_workers=5) as exe:
            future_to_url = {exe.submit(fetch_links, s, url): url for url in links}
            for future in cf.as_completed(future_to_url):
                for result in future.result():
                    print(result)

Output:

https://stackoverflow.com/questions/64298886/rvest-webscraping-in-r-with-form-inputs
https://stackoverflow.com/questions/64298879/is-this-site-not-suited-for-web-scraping-using-beautifulsoup
https://stackoverflow.com/questions/64297907/python-3-extract-html-data-from-sports-site
https://stackoverflow.com/questions/64297728/cant-get-the-fully-loaded-html-for-a-page-using-puppeteer
https://stackoverflow.com/questions/64296859/scrape-text-from-a-span-tag-containing-nested-span-tag-in-beautifulsoup
https://stackoverflow.com/questions/64296656/scrapy-nameerror-name-items-is-not-defined
https://stackoverflow.com/questions/64296201/missing-values-while-scraping-using-beautifulsoup-in-python
https://stackoverflow.com/questions/64296130/how-can-i-identify-the-element-containing-the-link-to-my-linkedin-profile-after
https://stackoverflow.com/questions/64295959/why-use-scrapy-or-beautifulsoup-vs-just-parsing-html-with-regex-v2
https://stackoverflow.com/questions/64295842/how-to-retreive-scrapping-data-from-web-to-json-like-format
https://stackoverflow.com/questions/64295559/how-to-iterate-through-a-supermarket-website-and-getting-the-product-name-and-pr
https://stackoverflow.com/questions/64295509/cant-stop-asyncio-request-for-some-delay
https://stackoverflow.com/questions/64295244/paginate-with-network-requests-scraper
and so on ...
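One subtlety worth keeping in mind: because fetch_links is a generator, exe.submit returns a future whose result is the generator object itself, and the body of the generator (including the HTTP request) only runs when that object is iterated, which in the code above happens in the main thread, not in the worker. If you want the workers to do the actual fetching, one option is to materialize the generator inside the submitted callable. A minimal sketch with a hypothetical gen helper standing in for fetch_links:

```python
import concurrent.futures as cf

def gen(n):
    # The generator body runs only when iterated, not when called.
    for i in range(n):
        yield i * i

with cf.ThreadPoolExecutor(max_workers=2) as exe:
    # Submitting a generator function completes almost immediately:
    # the future's result is the generator object, and the real work
    # happens later, in whatever thread iterates it.
    fut = exe.submit(gen, 3)
    g = fut.result()              # a generator, not a list
    print(list(g))                # iteration (the work) happens here

    # To make the worker thread do the work, consume the generator
    # inside the submitted callable:
    fut2 = exe.submit(lambda n: list(gen(n)), 3)
    print(fut2.result())          # already a fully built list
```

Applied to the answer above, that would mean submitting something like `lambda u: list(fetch_links(s, u))` instead of `fetch_links` directly, so each page is downloaded and parsed inside the pool rather than sequentially in the main loop.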
baduker, 12 Oct 2020 at 18:04