Advanced/annoying web scraping, Shadow DOM Anon 07/25/2024 (Thu) 13:35 No.10739
If you put shadow DOMs in your webpages, then you are certainly going to hell.

>>10737
>Horrible web design at https://archive.org/details/@[username_here] = each item tile is wrapped in #shadow-root so you may need to do some nerdy shit
IA user profile web pages weren't shit like that months ago. Those shadow roots get attached by JavaScript at runtime, so they aren't in the HTML that wget/curl/grab-site/HTTrack can download; you have to use Selenium or mitmdump. Selenium is worth understanding if you want to get better at web scraping and archiving. Install it:
>$ pip install selenium # or "pip3 install selenium"
Download a webpage:
>$ python3 -c "from selenium import webdriver; driver = webdriver.Chrome(); driver.get('https://stackoverflow.com/questions/42900214/how-to-download-a-html-webpage-using-selenium-with-python'); print(driver.page_source)" > 42900214.htm
Look inside the nested shadow roots, then dump the contents of the non-nested shadow roots under each item tile (everything hanging off shadowroot4 here). Save the following as a script (e.g. ia_tiles.py) and run "$ python3 ia_tiles.py > items.txt":
$ python3 -c "import time; from selenium import webdriver; \
from selenium.webdriver.common.by import By; from selenium.webdriver.c\
hrome.service import Service; from selenium.webdriver.chrome.options i\
mport Options; from selenium.webdriver.support.ui import WebDriverWait\
; from selenium.webdriver.support import expected_conditions as EC; dr\
iver = webdriver.Chrome(); driver.get('https://archive.org/details/@od\
dgrenadier'); shadowhost = driver.find_element(By.XPATH, '//app-root')\
; shadowroot = driver.execute_script('return arguments[0].shadowRoot',\
 shadowhost); shadowhost2 = shadowroot.find_element(By.CSS_SELECTOR, '\
user-profile'); shadowroot2 = driver.execute_script('return arguments[\
0].shadowRoot', shadowhost2); shadowhost3 = shadowroot2.find_element(B\
y.CSS_SELECTOR, 'collection-browser'); shadowroot3 = driver.execute_sc\
ript('return arguments[0].shadowRoot', shadowhost3); shadowhost4 = sha\
dowroot3.find_element(By.CSS_SELECTOR, 'infinite-scroller'); shadowroo\
t4 = driver.execute_script('return arguments[0].shadowRoot', shadowhos\
t4); i = 1
while i < 9999:
 tiles = shadowroot4.find_element(By.CSS_SELECTOR, \"article[aria-posin\
set='\" + str(i) + \"']\"); time.sleep(5); sh = tiles.find_element(By.T\
AG_NAME, 'tile-dispatcher'); sr = driver.execute_script('return argumen\
ts[0].shadowRoot', sh); print(sr.find_element(By.CSS_SELECTOR, '#contai\
ner').get_attribute('innerHTML')); i += 1" > items.txt
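Every hop down is the same find_element + execute_script pair, so you could probably collapse the chain into a small helper; a sketch under that assumption ("pierce" and the selector list are my own names, nothing official from Selenium or IA):

# Hypothetical helper: walk a chain of shadow hosts and return the last shadowRoot
from selenium import webdriver
from selenium.webdriver.common.by import By

def pierce(driver, start, selectors):
    node = start
    for sel in selectors:
        host = node.find_element(By.CSS_SELECTOR, sel)  # find the next shadow host
        node = driver.execute_script('return arguments[0].shadowRoot', host)  # step into its shadow root
    return node

driver = webdriver.Chrome()
driver.get('https://archive.org/details/@oddgrenadier')
shadowroot4 = pierce(driver, driver, ['app-root', 'user-profile', 'collection-browser', 'infinite-scroller'])

This only keeps working as long as IA doesn't rename those custom elements, so treat the selector list as a snapshot of how the page looks right now.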
Helpful: https://scrapeops.io/selenium-web-scraping-playbook/python-selenium-find-elements-css/ . The biggest problem I have right now with web scraping via Selenium is timing ( https://www.selenium.dev/documentation/webdriver/waits/ ). Without "time.sleep(5)" you will get
>selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"tag name","selector":"tile-dispatcher"}
> (Session info: chrome=125.0.6422.141); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Even with a 5-second wait ("time.sleep(5)") the results are inconsistent: one run got a bunch of nodes, the next run only got a few of them. Another problem: nodes only seem to load when you scroll to them, which then unloads other nodes. Also, the initial infinite-scroller/shadowroot4 might not contain much.
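One possible workaround for both the timing and the lazy loading, going off that waits page: swap the fixed time.sleep(5) for a WebDriverWait that polls until the tile exists, and scroll each tile into view with execute_script so infinite-scroller actually bothers to load it. This is an untested sketch that assumes the driver and shadowroot4 handles from the script above; the stock expected_conditions can't see into shadow roots, hence the plain lambda:

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

def grab_tile(driver, shadowroot4, i, timeout=30):
    sel = "article[aria-posinset='" + str(i) + "']"
    # Poll up to `timeout` seconds for the i-th tile instead of sleeping a fixed 5 s
    tile = WebDriverWait(driver, timeout).until(
        lambda d: next(iter(shadowroot4.find_elements(By.CSS_SELECTOR, sel)), False))
    # Nudge the lazy loader: scroll the tile into view, then give it a moment to render
    driver.execute_script('arguments[0].scrollIntoView();', tile)
    time.sleep(1)
    sh = tile.find_element(By.TAG_NAME, 'tile-dispatcher')
    sr = driver.execute_script('return arguments[0].shadowRoot', sh)
    return sr.find_element(By.CSS_SELECTOR, '#container').get_attribute('innerHTML')

i = 1
while True:
    try:
        print(grab_tile(driver, shadowroot4, i))
    except TimeoutException:
        break  # assume the listing is done once a tile never shows up
    i += 1

No idea whether scrolling makes the earlier element handles go stale; if it does, re-running the shadow-root chain inside the loop is probably the fix.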