page version: July 2, 2024
archiving stack overflow (for the forest)
First Iteration
(Note: I call any stack exchange site “stack overflow”.)
Stack overflow has some phenomenal content. I won’t wax on what precisely is great about stack overflow, but one issue is that the comments are often too good.
This is an issue because comments are meant to be more ephemeral, and I’m already generally worried about broken hyperlinks: the existence of comments is not stable.
The wayback machine is essentially the solution, except it seems to break some math rendering on stack overflow. I’m not sure how deep of an issue this is, but I’d like an archival solution that doesn’t have this shortcoming.
wget seems to be the tool for the job, so this post is really to collect the wget flags we should use for our own web snapshotting.
wget --timestamping --convert-links --page-requisites --wait=1 --random-wait LINK
This will download a single html page, with (hopefully) any resources required for that page, in a hierarchical directory structure.
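To snapshot several links in one go, the command above can be wrapped in a small script. This is only a minimal Ruby sketch, assuming wget is on the PATH; the names WGET_FLAGS, wget_command, and snapshot are made up for illustration.

```ruby
# Flags copied from the wget command above.
WGET_FLAGS = %w[
  --timestamping
  --convert-links
  --page-requisites
  --wait=1
  --random-wait
].freeze

# Build the argument vector for one link. Passing the arguments to `system`
# as separate strings bypasses the shell, so a URL containing `&` or `?`
# needs no quoting.
def wget_command(link, extra_flags = [])
  ['wget', *WGET_FLAGS, *extra_flags, link]
end

# Run wget once per link and return the links whose download failed
# (`system` returns false or nil when the command fails or cannot start).
def snapshot(links)
  links.reject { |link| system(*wget_command(link)) }
end

# Example (not run here): snapshot(ARGV) would archive every link given on
# the command line.
```

Further flags can be appended through the hypothetical extra_flags parameter without touching the base list.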
Second Iteration
So I was completely and utterly wrong.
I just spent a few hours realising that the command should also have --adjust-extension (which appends .html to saved HTML pages whose names lack it), so the full wget command should be
wget --timestamping --convert-links --page-requisites --adjust-extension --wait=1 --random-wait LINK
Except that the stack overflow “Show \(n\) more comments” button uses javascript to load the rest of the comments, which means any wget solution fails to save the collapsed comments.
And much of the point was to save the (content-rich) comments.
So we instead need a tool that expands the comments by running the javascript click events, then saves the page as it appears.
The method we settled on and hacked up is to use selenium webdriver to run javascript to expand all comments, then save the page with Firefox’s “Save Page As…”, scripted with xdotool.
The slightly less gross way would be to dump the page from selenium, but for some reason the stack overflow MathJax initially loads fine, then disappears with “math processing error”.
Yeah… not the way I wanted this to go but I think this second version works.
#!/usr/bin/env ruby
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :firefox
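abort "usage: #{$PROGRAM_NAME} URL OUTPUT_DIR" unless ARGV.length == 2
url = ARGV[0]
# the last path component of the URL doubles as the saved file's name;
# any trailing ?query or #fragment is left out of it
post_name = url[%r{/([^/?#]+)(?:[?#].*)?$}, 1]
fail "could not extract a post name from #{url}" unless post_name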
puts "post #{post_name}"
driver.get(url)
# https://meta.stackoverflow.com/questions/268400/is-there-a-get-query-string-to-expand-all-comments-by-default#comment555104_268456
# wait for the page to load (yes, I know this isn't the proper way, but I really don't know how to tell if the page is fully loaded)
sleep 3
driver.execute_script(
"javascript:for(x of document.getElementsByClassName('js-add-link comments-link'))x?.click()"
)
# We get MathJax rendering errors for some reason, so let's just use the
# browser's "Save Page As"
#puts driver.page_source
# For some reason, the selenium action for C-s doesn't work, but things like
# C-a, tab, S-tab, work
# Not that it matters much, since we wouldn't be able to interact with the
# popup anyways
#driver.action.key_down(:control).key_down('s').key_up(:control).key_up('s').perform
#
# somehow, we always come back to xdotool...
# stack overflow has a giant sign-in pop-up we want to close
# let's just spam Esc a couple times to get it to go away
3.times do
  sleep 1
  `xdotool key Escape`
end
# how to close the other smaller pop-ups:
# https://www.w3schools.com/jsref/met_html_click.asp
sleep 1
# close cookies request pop-up
driver.execute_script(
"javascript:document.getElementById('onetrust-reject-all-handler')?.click()"
)
sleep 1
# close google sign-in pop-up
driver.execute_script(
"javascript:document.getElementById('credential_picker_container')?.remove()"
)
# and finally, we save by simulating <C-s> with xdotool
sleep 1
`xdotool key ctrl+s`
sleep 1
system('xdotool', 'type', File.expand_path(File.join(ARGV[1], post_name)))
sleep 1
`xdotool key alt+s`
sleep 1
driver.quit
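The fixed sleep 3 near the top of the script could in principle be replaced by polling for a condition. Here is a minimal sketch in plain Ruby. wait_until is a hypothetical helper, not part of selenium-webdriver, and which condition to poll for is an open question: the document.readyState check in the comment is only an assumption about what "loaded" means, and it says nothing about whether MathJax has finished.

```ruby
# Poll the block until it returns a truthy value or `timeout` seconds pass.
# Returns the block's value on success and raises RuntimeError on timeout.
def wait_until(timeout: 10, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Possible use in place of `sleep 3` (an untested assumption, see above):
#   wait_until { driver.execute_script('return document.readyState') == 'complete' }
```

selenium-webdriver also ships its own Selenium::WebDriver::Wait class with an until method that works along the same lines, so the helper above is mostly useful for seeing what such a wait does.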