Anon 03/12/2024 (Tue) 04:28 No.9816
Not blocked yet:
start -> 2024-03-11T10:11:57.051663083Z
now ---> 2024-03-12T03:36:41.353799992Z
unix1 -> 1710151917 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/id" and normal stat says 2024-03-11 10:11:57.553640017 +0000)
unix2 -> 1710215173 ("stat --format=%Y ./www.canterlot.com-2024-03-11-9b3a5a01/wpull.log" and normal stat says 2024-03-12 03:46:03.762802180 +0000)
size --> 237 MB /z9/warc/012/www.canterlot.com-2024-03-11-9b3a5a01
200x3 -> https://www.canterlot.com/gallery/image/8158-yama-san-from-the-mountains/ + https://www.canterlot.com/gallery/album/1167-ondrea + https://www.canterlot.com/gallery/image/8380-rainpng/ (all recent)
est. --> 100 GB final size (with many image files, it could be 200 GB)
ran ---> 63,256 seconds (1710215173 - 1710151917)
down --> 3.747 KB/s (237,000 KB / 63,256 s)
left --> roughly 99 GB (99,000,000 KB)
eta ---> roughly 26,421,137 seconds or 306 days (99000000/3.747 and 26421137/60/60/24; re-derived below)
notes -> The delay file can be changed to contain "3000" (or whatever number) while grab-site is still running, and it will then use that delay instead; doing that seems to cause no problems. grab-site option of interest = --permanent-error-status-codes STATUS_CODES = "A comma-separated list of HTTP status codes to treat as a permanent error and therefore *not* retry (default: 401,403,404,405,410)". The wpull.db file can be opened by running "sqlite3 -column -header -csv 'wpull.db'"; then view tables by running ".tables"; then view rows by running "select * from tablename;" (sketched below). What's the fate of this grab? Probably my computer will crash or reboot and then I won't return to it, so I'll just get a small portion of that site, which requires a delay between requests. Or I could keep working on it in various ways. A delay of 5000ms-10000ms will take 306 or 612 days; let's say it will take about a year. A smaller delay might take "only" 150 days or so to download all of that website. I wish that grab-site were more fault-tolerant. Apparently Common Crawl has a lot of www.canterlot.com, but it doesn't have the content.invisioncic.com outlinks or recent data.
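For reference, a quick re-derivation of the rate/ETA numbers above with bc (just a sketch; the 237 MB downloaded so far and the ~100 GB final size are estimates from this post, not exact totals):
$ echo '237000 / 63256' | bc -l          # KB so far / seconds ran ~= 3.747 KB/s
$ echo '99000000 / 3.747' | bc -l        # KB left / KB/s ~= 26,421,137 seconds
$ echo '26421137 / 60 / 60 / 24' | bc -l # ~= 306 days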
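And a sketch of the runtime tweaks mentioned in the notes (the directory is the one from this grab; the "urls" table name and its "status" column are assumptions about wpull's schema, so check ".tables" / ".schema" first):
$ echo 3000 > ./www.canterlot.com-2024-03-11-9b3a5a01/delay   # takes effect while grab-site is still running
$ sqlite3 -column -header ./www.canterlot.com-2024-03-11-9b3a5a01/wpull.db
sqlite> .tables
sqlite> select status, count(*) from urls group by status;    -- rough progress, assuming the table is named "urls"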

Anyone rsync millions of files? It was a drag that bash deleted my paused job that was doing that:
>$ utc; rsync -a --info=progress2 /d1/path1/ /d2/path1/; utc # ~2,072,198 items
>2024-03-10T14:47:03.267513346Z
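One way to avoid losing a long rsync like that when the shell goes away (a sketch, not what I actually ran; same command and paths as above) is to run it detached, e.g. in tmux; and since rsync -a skips files that already match, it can also just be re-run after an interruption:
$ tmux new -s bigcopy
$ utc; rsync -a --info=progress2 /d1/path1/ /d2/path1/; utc
# detach with Ctrl-b d, then later:
$ tmux attach -t bigcopy
or with nohup so it survives the terminal closing:
$ nohup rsync -a --info=progress2 /d1/path1/ /d2/path1/ >rsync.log 2>&1 &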
