I'm new to web scraping and am using BeautifulSoup and Selenium. I'm trying to scrape data from the following web page:

    https://epl.bibliocommons.com/item/show/2300646980

I'm trying to scrape the section "Staff Lists that include this Title". In particular, I want to get the number of <li> tags, since I only need the number of items/links in that staff list.

I have tried the following against the HTML shown when inspecting the page. This is the block of HTML I'm trying to scrape:

<div class="ugc_bandage">
  <div class="lists_heading clearfix">
    <h3 data-test-id="ugc-lists-heading">
      Listed
    </h3>
    <div class="ugc_add_link">
      <div class="dropdown saveToButton clearfix" id="save_to_2300646980_id_7a3ateh0panp1uv0he1v7aqmj9" data-test-id="add-to-list-dropdown-container">
  <a href="#" aria-expanded="false" aria-haspopup="true" class=" dropdown-toggle dropdown-toggle hide_trigger_icon" data-test-id="add-to-list-save-button" data-toggle="dropdown" id="save_button_2300646980_id_7a3ateh0panp1uv0he1v7aqmj9" rel="nofollow">
       <i aria-hidden="true" class=" icon-plus"></i>
<span aria-hidden="true">Add</span><span class="sr-only" data-js="sr-only-dropdown-toggle" data-text-collapsed="Add, collapsed" data-text-expanded="Add, expanded">Add, collapsed</span><span aria-hidden="true" class="icon-arrow"></span></a>  
  <ul class="dropdown-menu">
      <li>
        <a href="/user_lists/new?bib=2300646980&amp;origin=https%3A%2F%2Fepl.bibliocommons.com%2Fitem%2Fload_ugc_content%2F2300646980" class="newList">Create a New List</a>
      </li>
      <li>
        <a href="/lists/add_bib/mine?bib=2300646980_fangirl" data-js="cp-overlay" id="more_lists_id_7a3ateh0panp1uv0he1v7aqmj9">Existing Lists »</a>
      </li>

  </ul>
</div>

    </div>
  </div>
  <h4 data-test-id="staff-lists-that-include-this-title">Staff Lists that include this Title</h4>
  <div data-analytics="{ &quot;SubFeature&quot;: &quot;Lists that include this title&quot; }" class="expand clearfix" id="all_lists_expand" testid="text_listsincluding">
    <ul class="further_list">
      <li> [LIST ENTRIES START HERE, BUT THERE ARE SO MANY THAT THEY WOULD MAKE THIS POST TOO LONG.] </li>

  1. I tried scraping the code above using the XPath copied from inspecting the staff list section (id="all_lists_expand"):
    element = driver.find_elements_by_xpath('//*[@id="rightBar"]/div[3]/div')
  2. I tried scraping the section using the class name:
    element = driver.find_element_by_class_name('expand clearfix')
  3. I also tried scraping using the CSS selector:
    element = driver.find_element_by_css_selector('#all_lists_expand')

I have also tried other variants of the code above, looking up classes of the element's parents, XPaths, etc.

All of the attempts above return None. I'm not sure what I'm doing wrong. Am I supposed to trigger an event or something with Selenium? I'm not even clicking any of the links in the list, or keeping a list of the links; I just need to count how many links there are to begin with.

2
B.M. Corwen Sep 30, 2019 at 21:33

3 answers

Best answer

You don't need the overhead of Selenium. You can make the same GET request the page makes to load that content, extract the HTML from the returned JSON, parse it with bs4, and extract the links:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://epl.bibliocommons.com/item/load_ugc_content/2300646980').json()
soup = bs(r['html'], 'lxml')
links = [i['href'] for i in soup.select('[data-test-id="staff-lists-that-include-this-title"] + div [href]')]
print(len(links))
print(links)
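
The parsing step above can be tried offline. The following is a minimal sketch, assuming the endpoint returns JSON with an "html" key as in the snippet above; SAMPLE_RESPONSE, its shortened markup, and the helper name staff_list_links are made up for illustration, and html.parser is used only to avoid the lxml dependency.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for the JSON the endpoint returns.
# The real payload is much larger; only the "html" key is assumed here.
SAMPLE_RESPONSE = json.dumps({
    "html": (
        '<div class="ugc_bandage">'
        '<ul class="dropdown-menu">'
        '<li><a href="/user_lists/new">Create a New List</a></li>'
        '</ul>'
        '<h4 data-test-id="staff-lists-that-include-this-title">'
        'Staff Lists that include this Title</h4>'
        '<div id="all_lists_expand"><ul class="further_list">'
        '<li><a href="/list/share/1_a/10_first">First</a></li>'
        '<li><a href="/list/share/2_b/20_second">Second</a></li>'
        '<li class="extra"><a href="/list/share/3_c/30_third">Third</a></li>'
        '</ul></div>'
        '</div>'
    )
})


def staff_list_links(response_text):
    """Return the hrefs found in the div that directly follows the staff-list heading."""
    payload = json.loads(response_text)
    soup = BeautifulSoup(payload["html"], "html.parser")
    # "+ div" picks the div immediately after the h4; "[href]" keeps only
    # descendants of that div that carry an href attribute.
    selector = '[data-test-id="staff-lists-that-include-this-title"] + div [href]'
    return [tag["href"] for tag in soup.select(selector)]


links = staff_list_links(SAMPLE_RESPONSE)
print(len(links))
print(links)
```

Because the selector is anchored on the heading's data-test-id, the "Create a New List" link in the dropdown menu is not counted.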
2
QHarr Sep 30, 2019 at 19:43

To get all the anchor tags under Staff Lists that include this Title, use WebDriverWait with presence_of_all_elements_located(). This will give you 100 links.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver=webdriver.Chrome()
driver.get("https://epl.bibliocommons.com/item/show/2300646980")
elements=WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.XPATH,'//h4[@data-test-id="staff-lists-that-include-this-title"]/following::div[1]//li/a')))
print(len(elements))
for ele in elements:
    print(ele.get_attribute('href'))

Output:

https://epl.bibliocommons.com/list/share/114110843_schoolcorps1/1495892159_native_american,_rl_k-3,_spanish_middle_amp_high_school_multcolib_assignments
https://epl.bibliocommons.com/list/share/1467158627_stpl_crystal/1491354799_am_i_seeing_double
https://epl.bibliocommons.com/list/share/568630227_vpl_childrens_teens_info/1490175639_books_just_for_you_-_thought_provoking_amp_charming_ya_reads
https://epl.bibliocommons.com/list/share/1176606007_overdue_finds/1485773789_overdue_finds_episode_39_guilty_pleasures
https://epl.bibliocommons.com/list/share/1312082177_aloha_youthservices/1468001367_its_okay_to_not_be_okay_for_teens
https://epl.bibliocommons.com/list/share/631739687_eplpersonalpicks2/1484211504_epl_personal_picks_ya_novels
https://epl.bibliocommons.com/list/share/186066773_jclemmaf/837858917_favorite_and_my_best
https://epl.bibliocommons.com/list/share/569286917_oplteenbooklists/1476340687_teen_lit_chat_booklist_august_2019
https://epl.bibliocommons.com/list/share/569286917_oplteenbooklists/1459365327_astrology_teen_booklist_books_you_might_like_if_youre_a_virgo
https://epl.bibliocommons.com/list/share/1058529507_pplteen/1258199057_best_back_to_school_reads
https://epl.bibliocommons.com/list/share/1216909347_anna_libraryt/1478214359_ya_novels_about_school
https://epl.bibliocommons.com/list/share/106274081_wplstaffpicks/1477722487_wpl_summer_reads_2019
https://epl.bibliocommons.com/list/share/173100305_jclangelicar/1226682237_amazing_reads_for_teens_and_up
https://epl.bibliocommons.com/list/share/73092242_pickeringteens/1117926097_tag_recommends_continued
https://epl.bibliocommons.com/list/share/73092242_pickeringteens/744582537_tag_recommends_2018
https://epl.bibliocommons.com/list/share/73092242_pickeringteens/1184991797_lets_talk_mental_health
https://epl.bibliocommons.com/list/share/73092242_pickeringteens/822272858_ppl_teens_love,_loss,_and_all_the_feels
https://epl.bibliocommons.com/list/share/73092242_pickeringteens/692256398_aampe_picks
https://epl.bibliocommons.com/list/share/73977058_jclbeckyc/1385964387_the_best_books_of_2019
https://epl.bibliocommons.com/list/share/1059338207_readingadviser_sally/1439607877_books_for_20_somethings-fvrl-2019
https://epl.bibliocommons.com/list/share/279600817_lpl_readersservices/1457670767_2019_squad_goals_read_a_book_set_on_a_college_or_university_campus
https://epl.bibliocommons.com/list/share/631739687_eplpersonalpicks2/1458857587_epl_personal_picks_just_a_little_bit_of_love
https://epl.bibliocommons.com/list/share/1275085237_beaverton_teens/1291469057_female_pov
https://epl.bibliocommons.com/list/share/104627853_princetonpl/1128194327_susans_picks
https://epl.bibliocommons.com/list/share/69155564_kantoniw/376769097_teen_-_terrific_titles
https://epl.bibliocommons.com/list/share/1275085237_beaverton_teens/1292121977_realistic_fiction
https://epl.bibliocommons.com/list/share/1300215227_beaverton_iand/1303358407_books_where_the_parents_are_cool
https://epl.bibliocommons.com/list/share/215214545_multcolib_dianaa/1450141617_casting_a_wide_net_for_tammy_from_multcolib_my_librarian_diana
https://epl.bibliocommons.com/list/share/681590123_scl_kaylin/1030053197_kaylins_picks
https://epl.bibliocommons.com/list/share/173530091_jclhebaha/1171128547_hebahs_staff_picks
https://epl.bibliocommons.com/list/share/1275085237_beaverton_teens/1288931697_recommended_reads_11-12
https://epl.bibliocommons.com/list/share/275252227_martinregionalreads/1369306597_diversity_teenya_books
https://epl.bibliocommons.com/list/share/72152117_steacy_library/1204064657_classic_teen_reads
https://epl.bibliocommons.com/list/share/700233957_snoislelib_suggests/1436626997_harry_potter_y_la_piedra_filosofal
https://epl.bibliocommons.com/list/share/235700377_pomolibrary/1436872057_pomo_picks_-_teen_-_tsrc_2019_-_book_that_is_not_in_a_series_-_grades_9,_10,_11,_12
https://epl.bibliocommons.com/list/share/694280209_kimberlyreads/752020447_level_up_your_reading_-_books_for_gamers_(teen_edition)
https://epl.bibliocommons.com/list/share/1216909347_anna_libraryt/1220688167_ya_reads_for_reluctant_readers
https://epl.bibliocommons.com/list/share/569286917_oplteenbooklists/1405453637_teen_book_chat_april_2019
https://epl.bibliocommons.com/list/share/223261407_burien_teens_read/1424507527_srp_book_talk_glendale_lutheran_8th_grade
https://epl.bibliocommons.com/list/share/1216909347_anna_libraryt/1412382807_top_10_ya_coming-of-age_reads
https://epl.bibliocommons.com/list/share/80402800_vpl_booksjustforyou11/1413011449_vpl_-_books_just_for_you_-_biography,_humour,_inspiration,_short_stories,_and_animal_fiction
https://epl.bibliocommons.com/list/share/760546357_scteenprogramming/1411563307_cmlibrary_suggests_imagicon_2019
https://epl.bibliocommons.com/list/share/1078894377_lisadempster/1411364207_celebrate_your_inner_geek
https://epl.bibliocommons.com/list/share/682768697_arapahoekati/1055224107_published_nanowrimo_authors
https://epl.bibliocommons.com/list/share/1382187347_mollywally/1404738807_mental_health
https://epl.bibliocommons.com/list/share/568630227_vpl_childrens_teens_info/1395459037_books_just_for_you_-_ya_contemporary_amp_mystery
https://epl.bibliocommons.com/list/share/550038607_spl_brittany/1322718057_one_word_titles
https://epl.bibliocommons.com/list/share/1170754297_sppl_recommends/1383661857_no,_you_cant_read_these_books
https://epl.bibliocommons.com/list/share/639095537_sausalito_staff_erin/1377322417_ya_realistic_fiction_for_middle_schoolers
https://epl.bibliocommons.com/list/share/1060442917_readingadviser_jacque/1364177797_teen_favorites
https://epl.bibliocommons.com/list/share/69193241_pepl_knoeske/269126130_ya_reads
https://epl.bibliocommons.com/list/share/155181971_surreylibraries_teens/385766437_hilarity_ensues
https://epl.bibliocommons.com/list/share/1136103357_hfxpl_teens/1374745777_hey_what_are_you_reading
https://epl.bibliocommons.com/list/share/155181971_surreylibraries_teens/1349496509_valentines_day_2019_young_adult_fiction
https://epl.bibliocommons.com/list/share/138070021_surreylibraries_reads/1304148677_staff_picks_what_we_loved_in_2014
https://epl.bibliocommons.com/list/share/80402800_vpl_booksjustforyou11/1365444807_vpl_-_new_adult_-_top_picks
https://epl.bibliocommons.com/list/share/715647058_st8ceyw8/1365437547_recommendations_for_teen_girls
https://epl.bibliocommons.com/list/share/1131250757_lvccld_saharawest/1363494177_geeks_rule_books_for_teens
https://epl.bibliocommons.com/list/share/548538121_spl_merley/1358151383_help_for_anxious_teens
https://epl.bibliocommons.com/list/share/679797892_dbrl_idaf/1355664913_matryoshka_fiction
https://epl.bibliocommons.com/list/share/1315907392_indypl_kirstenw/1315916377_staff_recommendations_great_reads_for_teens
https://epl.bibliocommons.com/list/share/1303998627_tigard_teens/1351425041_put_a_heart_on_it
https://epl.bibliocommons.com/list/share/515946100_tacomalibrary/1343962909_a_book_about_books,_as_part_of_the_extreme_reader_challenge
https://epl.bibliocommons.com/list/share/1216909347_anna_libraryt/1342688089_ya_with_geek_themes
https://epl.bibliocommons.com/list/share/1282688857_indypl_katieb/1285699927_nanowrimo-_a_survival_guide
https://epl.bibliocommons.com/list/share/104627853_princetonpl/1333071229_libfaves
https://epl.bibliocommons.com/list/share/550038607_spl_brittany/1329175977_fresh_starts,_new_beginnings_and_second_chances
https://epl.bibliocommons.com/list/share/710260400_annag_kcmo/1322113517_fandoms
https://epl.bibliocommons.com/list/share/558294898_jclemilyd/1326533547_monticello_youth_services_recommendsya_books
https://epl.bibliocommons.com/list/share/429022740_loganlib_meg/1324424287_2019_reading_challenge
https://epl.bibliocommons.com/list/share/95681271_samcmar/1318184807_mpl_2019_reading_challenge_-_a_one_word_title
https://epl.bibliocommons.com/list/share/768705057_dcpl_teens/1322057871_if_you_like_dumplin
https://epl.bibliocommons.com/list/share/803717002_adult_custom_reading_list/1321396267_omaha_custom_list_page-turners_122018
https://epl.bibliocommons.com/list/share/134340301_vpl_booksjustforyou/1160285087_vpl_-_books_just_for_you_-_fun_reads
https://epl.bibliocommons.com/list/share/1303998627_tigard_teens/1320248908_do_you_ship_them
https://epl.bibliocommons.com/list/share/768705057_dcpl_teens/1030069518_a_fandom_life_for_me
https://epl.bibliocommons.com/list/share/1066057257_mcpl_readerslounge/1314212917_woodneath_staff_picks_babysitters_club_reads
https://epl.bibliocommons.com/list/share/1081387957_pacl_teens/1313796687_tlab_recommends_romance_for_teens
https://epl.bibliocommons.com/list/share/768695927_dcpl_adults/1311059977_dcpl_staff_picks_for_2018
https://epl.bibliocommons.com/list/share/186066773_jclemmaf/1313674757_ya_books_about_teen_writers
https://epl.bibliocommons.com/list/share/888940897_cmlibrary_corvolunteens/1306009547_calians_favorites
https://epl.bibliocommons.com/list/share/344916587_chapel_hill_teenstaff/687974851_unusual_formats
https://epl.bibliocommons.com/list/share/1204935759_jclmegb/1303553797_teen_reads_to_tickle_your_funny_bone_amp_warm_your_heart
https://epl.bibliocommons.com/list/share/95796007_jessicagma/1302711427_book_smack_j%C3%B3lab%C3%B3kafl%C3%B3%C3%B0i%C3%B0_2018_jessica
https://epl.bibliocommons.com/list/share/219559045_kclsaarene/1302650609_best-selling_nanowrimo_winners
https://epl.bibliocommons.com/list/share/569520567_hholley/710149067_opl_staff_picks
https://epl.bibliocommons.com/list/share/491055517_cals_readers/1298323449_nanowrimo_books_that_got_published
https://epl.bibliocommons.com/list/share/73877511_jcltracim/1296589167_nanowrimo_-_published_wrimos
https://epl.bibliocommons.com/list/share/219559045_kclsaarene/1296304497_pizza_and_books_einstein_ms_november_2018
https://epl.bibliocommons.com/list/share/104627853_princetonpl/1295497427_nanowrimo
https://epl.bibliocommons.com/list/share/675410617_orlreads/1295410127_orl_recommends_-_nanowrimo_reads
https://epl.bibliocommons.com/list/share/768705057_dcpl_teens/1294054347_family_stories
https://epl.bibliocommons.com/list/share/1165043747_sppl_teens/1282475677_lets_talk_about_mental_health
https://epl.bibliocommons.com/list/share/685936385_arapahoebridget/723765118_breaking_out_of_nanowrimo_writers_block
https://epl.bibliocommons.com/list/share/1106377937_mckenzingtonc/1277464857_disability_awareness
https://epl.bibliocommons.com/list/share/105396413_youthcollection/1260776227_fall_2018_must-read_ya_novels
https://epl.bibliocommons.com/list/share/105396413_youthcollection/1261651207_ya_books_about_social_anxiety
https://epl.bibliocommons.com/list/share/1244999997_jcls_youth_services/1259372807_libraries_rock_talent_teen_five_star_books
https://epl.bibliocommons.com/list/share/79828372_vpl_informationservice/1254087617_vpl_-_new_adult_fiction
https://epl.bibliocommons.com/list/share/308506797_kclsreads/1253264637_to_all_the_boys_ive_loved_before
0
KunduK Sep 30, 2019 at 19:42

I scraped your page and wrote an XPath that will find all the li elements under 'Staff Lists that include this Title'. Updated to include a wait so that all the relevant li elements are present.

WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[h4[text()='Staff Lists that include this Title']]/div[2]/ul/li[@class='']")))
driver.find_elements(By.XPATH, "//div[h4[text()='Staff Lists that include this Title']]/div[2]/ul/li[not(contains(@class, 'extra'))]")

This XPath first selects the parent div that contains all the li elements, found through its h4 child whose text is 'Staff Lists that include this Title'. We then select div[2], which contains the relevant li elements. The final step selects li elements with an EMPTY class name. As we can see in the page source, there are many hidden li elements with the attribute class='extra'. We don't want those li elements, so we filter with not(contains(@class, 'extra')) to get the li elements without the extra class name.
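
The effect of that filter can be seen without a browser. This is a rough sketch rather than the Selenium code from this answer: it swaps in BeautifulSoup and the CSS selector li:not(.extra) as an analogue of the XPath predicate, and SAMPLE_HTML is a made-up fragment that only assumes the hidden entries carry class="extra" as described above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two visible entries plus two hidden "extra" entries.
SAMPLE_HTML = """
<div>
  <h4>Staff Lists that include this Title</h4>
  <div id="all_lists_expand">
    <ul class="further_list">
      <li class=""><a href="/list/share/1_a/10_first">First</a></li>
      <li class=""><a href="/list/share/2_b/20_second">Second</a></li>
      <li class="extra"><a href="/list/share/3_c/30_third">Third</a></li>
      <li class="extra"><a href="/list/share/4_d/40_fourth">Fourth</a></li>
    </ul>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every li in the section, hidden or not.
all_items = soup.select("#all_lists_expand li")

# Only the li elements that do not have the "extra" class.
visible_items = soup.select("#all_lists_expand li:not(.extra)")

print(len(all_items), len(visible_items))
```

Note that .extra matches the class token exactly, while contains(@class, 'extra') in XPath matches any class value containing that substring, so the two are only equivalent when no other class name includes "extra".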

If the XPath above doesn't work, I also modified another XPath that you posted in your original question:

WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//*[@id='rightBar']/div[3]/div/div[2]/ul/li[not(contains(@class, 'extra'))]")))
driver.find_elements(By.XPATH, "//*[@id='rightBar']/div[3]/div/div[2]/ul/li[not(contains(@class, 'extra'))]")

For the URL you provided, both queries returned 5 results:

[Image: XPath query]

1
Christine Sep 30, 2019 at 19:16