How to Spider web.archive.org to recover your old website/webpage

There are probably a million blog posts telling you how to do it, but none of those guides worked for me. Sites differ, and there are other "unknowns", so there is no guarantee this works for you or for every site.

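First we'll use curl to query the archive's CDX index for all the relevant files for the domain, up to the snapshot timestamp given in the to= parameter (20190416204741 in the command below, in YYYYMMDDhhmmss form). Replace yourdomain.com and the timestamp with your own values.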

curl -s "https://web.archive.org/cdx/search/cdx?url=yourdomain.com/*&to=20190416204741&output=txt&fl=timestamp,original" | awk '{ print "https://web.archive.org/web/" $1 "id_/" $2 }' | sort -u > urls.txt
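To see what the awk and sort stages of that pipeline do without touching the network, here is a minimal sketch that runs them on a few made-up index rows. The file names sample_cdx.txt and sample_urls.txt and the rows themselves are assumptions for illustration only. Each row is "timestamp original", and the id_ suffix after the timestamp asks the archive for the capture as it was stored, without the rewritten links and toolbar it normally adds.

```shell
# Three made-up CDX rows in "timestamp original" form (hypothetical sample data).
# The first and last rows are identical on purpose, to show sort -u removing duplicates.
cat > sample_cdx.txt <<'EOF'
20180102030405 http://yourdomain.com/
20180102030405 http://yourdomain.com/about.html
20180102030405 http://yourdomain.com/
EOF

# Same awk and sort stages as the curl pipeline, applied to the sample file.
awk '{ print "https://web.archive.org/web/" $1 "id_/" $2 }' sample_cdx.txt | sort -u > sample_urls.txt

# Show the resulting download URLs, one per line.
cat sample_urls.txt
```

The result is one download URL per unique timestamp and original URL pair, which is the form wget needs in the next step.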

The above writes its output to urls.txt. Next we'll feed urls.txt to wget to retrieve all of the URLs.

wget --input-file=urls.txt --force-directories --protocol-directories --adjust-extension --convert-links --no-clobber
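The index query can return several captures of the same page taken at different times, and each one becomes its own line in urls.txt. If only the newest capture of each page is wanted, one option is to filter the "timestamp original" rows before they are turned into URLs. The sketch below is an assumption, not part of the original commands: keep_latest is a hypothetical helper, and the sample rows and the file name latest.txt are made up. It relies on the timestamps being fixed-width, so sorting them as text also sorts them by time.

```shell
# keep_latest: read "timestamp original" rows on stdin and print only the
# newest row for each original URL (hypothetical helper for illustration).
keep_latest() {
  # Group rows by original URL, newest timestamp first within each group,
  # then keep the first row seen for each URL.
  sort -k2,2 -k1,1r | awk '!seen[$2]++'
}

# Made-up rows: the home page was captured twice, the about page once.
keep_latest > latest.txt <<'EOF'
20170101000000 http://yourdomain.com/
20180102030405 http://yourdomain.com/
20180102030405 http://yourdomain.com/about.html
EOF

# Show the surviving rows: one per original URL.
cat latest.txt
```

In the real pipeline the filter would sit between the curl output and the awk step that builds the download URLs, so that urls.txt holds one line per page.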

What you end up with is a directory tree with roughly the same paths and contents as the original site. It's not perfect: sometimes a page was never archived, even though indexed pages link to it.

This is very helpful for businesses that had a disruption but have no saved content of their own. It can restore content that was there previously, but don't expect a copy that is 24-48 hours fresh; think in terms of weeks or months.
