Anon 12/11/2023 (Mon) 01:44 No.9027
>>8980
>>9019
Getting closer. (Flow: Thread download -> Extractor -> Analyzer; works for two formats, all in Bash.)

>[Done] Bash implementation for: 4chan rendered thread webpage ctrl+a,ctrl+c,ctrl+v -> TXT file -> extractor -> text files per-post
Doesn't really matter much, since live threads represent a small fraction of all threads. Still, making an analyzer for that was a bit of a learning experience.
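
For reference, a minimal sketch of how that splitter could work (not the actual implementation; assumes every pasted post header carries a "No.<digits>" token, and "thread.txt" is a made-up filename):

# Split a ctrl+a,ctrl+c,ctrl+v dump of a rendered 4chan thread into one
# text file per post, named after the post number. May misfire if a post
# body line itself contains a "No.<digits>" token.
awk '
  /No\.[0-9]+/ {                     # post header -> switch output files
    if (out) close(out)
    match($0, /No\.[0-9]+/)
    out = substr($0, RSTART + 3, RLENGTH - 3) ".txt"
  }
  out { print > out }                # text before the first header is dropped
' thread.txt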

>[Done] Bash implementation for: Desuarchive thread webpage source code -> HTML file -> extractor -> HTML files per-post
- Extractor: could use more work, or a different implementation, if the goal is speed; as-is it runs quickly enough, roughly one to two minutes per 300-post thread (a sketch of one alternative splitting approach follows this list). Current extractor: https://ipfs.io/ipfs/bafkreiafoznatynuosk5xfn7ue5uvg6fcndjb5w3blmbtxm2rcli5u7wyi?filename=1how1.txt
- Analyzer: see below.
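
In case anyone wants to compare approaches, a rough sketch of a per-post splitter for the saved page source (assumptions: FoolFuuka-style markup where each post sits in its own <article ... id="NNNN"> element with tags on their own lines; "thread.html" is a made-up filename; the linked extractor above may work completely differently):

# Split a saved Desuarchive thread page into one HTML file per post.
# Line-based, so it breaks on minified single-line HTML; a real parser
# (e.g. htmlq, discussed below) would be more robust.
awk '
  /<article[^>]*id="[0-9]+"/ {       # opening tag of a post -> new file
    if (out) close(out)
    match($0, /id="[0-9]+"/)
    out = substr($0, RSTART + 4, RLENGTH - 5) ".html"
  }
  out { print > out }
  /<\/article>/ { if (out) close(out); out = "" }   # post ends here
' thread.html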

Analyzer: I was thinking of using htmlq, which is like jq for HTML. Considered using it to parse the HTML properly, but you have to install it via "sudo apt install cargo", then "cargo install htmlq", then optionally add it to PATH via 'export PATH="$PATH:$HOME/.cargo/bin"' in .bashrc or wherever. Didn't feel like installing something extra just for this, so grep it is (a sketch of the htmlq route is after the output below). Here's what I got:
> grep -Pail "class=\"greentext\".*?>" *.html | xargs -d "\n" sh -c 'for f do tlines=$(expr $(grep -Paoi "<br" "$f" | wc -l) + 1); pct=$(echo "$(grep -Paoi "class=\"greentext\".*?>" "$f" | wc -l) * 100 / $tlines" | bc -l | cut -c1-6); printf "%s%% for %s which has %s line(s)\n" "$pct" "$f" "$tlines"; done' _ | sort -n
>[...]
>93.939% for 20364871.html which has 33 line(s)
>95.652% for 20409825.html which has 23 line(s)
>100.00% for 20282509.html which has 1 line(s)
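
And for comparison, what the htmlq route might look like once installed (a sketch, not tested: assumes htmlq reads stdin and prints each matched element on its own line, so wc -l approximates the span count):

# Same per-post greentext percentage, but selecting class="greentext"
# elements with a CSS selector instead of grepping raw HTML.
export PATH="$PATH:$HOME/.cargo/bin"   # where "cargo install htmlq" puts the binary
for f in *.html; do
  greens=$(htmlq '.greentext' < "$f" | wc -l)
  tlines=$(( $(grep -aoi '<br' "$f" | wc -l) + 1 ))
  pct=$(echo "$greens * 100 / $tlines" | bc -l | cut -c1-6)
  printf '%s%% for %s which has %s line(s)\n' "$pct" "$f" "$tlines"
done | sort -n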
