How to Spider web.archive.org to recover your old website/webpage

There are probably a million blog posts telling you how to do this, but none of those guides worked for me.  Sites differ and there are other "unknowns", so there is no guarantee this will work for you or for every site.

First we'll use curl to query the archive's CDX index for every archived file under the domain, up to the snapshot timestamp given in the to= parameter (20190416204741 in the command below).

curl -s "https://web.archive.org/cdx/search/cdx?url=yourdomain.com/*&to=20190416204741&output=txt&fl=timestamp,original" | awk '{ print "https://web.archive.org/web/" $1 "id_/" $2 }' | sort -u > urls.txt
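The CDX query returns one row per capture, so a page that was archived many times shows up many times in the list. Here is a minimal sketch for the case where you only want the newest capture of each page. The function name latest_snapshot_urls and the sample rows are made up for illustration; it expects the same two fields requested with fl=timestamp,original.

```shell
#!/bin/sh
# Hypothetical helper: reads "timestamp original" rows on stdin and prints
# one Wayback URL per original URL, keeping only the newest timestamp.
latest_snapshot_urls() {
    awk '
        # 14-digit timestamps of equal length compare correctly as strings
        !($2 in newest) || ($1 "") > (newest[$2] "") { newest[$2] = $1 }
        END {
            for (url in newest)
                print "https://web.archive.org/web/" newest[url] "id_/" url
        }
    ' | sort
}

# Made-up sample rows standing in for real CDX output
printf '%s\n' \
    '20180101000000 http://yourdomain.com/' \
    '20190301120000 http://yourdomain.com/' \
    '20181115093000 http://yourdomain.com/about.html' \
    | latest_snapshot_urls
```

To try it on real data, the function could take the place of the awk and sort stages in the curl pipeline. I have only run it on sample rows like the ones shown, so treat it as a starting point.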

The curl command writes its results to urls.txt, which we then feed to wget to retrieve all of the URLs.

wget --input-file=urls.txt --force-directories --protocol-directories --adjust-extension --convert-links --no-clobber
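A long list means a lot of requests in a row, and the archive may slow down or refuse some of them. Below is a hedged sketch that wraps the same wget call and adds wget's standard --wait, --random-wait and --tries options. The function name fetch_archive, the DRY_RUN variable and the values 1 and 3 are my own assumptions, not something the archive requires.

```shell
#!/bin/sh
# Hypothetical wrapper around the wget call above. With DRY_RUN=1 it only
# prints the command, so it can be inspected before any request is made.
fetch_archive() {
    list=${1:-urls.txt}
    # Original options first, then the added pacing and retry options
    set -- wget --input-file="$list" \
        --force-directories --protocol-directories \
        --adjust-extension --convert-links --no-clobber \
        --wait=1 --random-wait --tries=3
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Set DRY_RUN=0 (or leave it unset) to actually download
DRY_RUN=1
fetch_archive urls.txt
```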

What you end up with is a directory tree with roughly the same paths and contents as the original site.  It's not perfect: some pages were never archived, even though pages that were indexed link to them.
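To get a rough idea of how much came back, you can compare the number of URLs requested with the number of files saved. This is a minimal sketch, not an exact check; the function name coverage_report and the demo paths are hypothetical, and you pass in the download directory yourself because its name depends on the wget options used.

```shell
#!/bin/sh
# Hypothetical helper: count the URLs in the list and the files under the
# download directory, then report the difference.
coverage_report() {
    list=$1
    dir=$2
    requested=$(( $(grep -c . "$list") ))        # non-empty lines only
    saved=$(( $(find "$dir" -type f | wc -l) ))  # regular files only
    echo "requested=$requested saved=$saved missing=$(( requested - saved ))"
}

# Self-contained demo with made-up data in a temporary directory
demo=$(mktemp -d)
printf '%s\n' \
    'https://web.archive.org/web/20190416204741id_/http://yourdomain.com/' \
    'https://web.archive.org/web/20190416204741id_/http://yourdomain.com/a.html' \
    > "$demo/urls.txt"
mkdir -p "$demo/site"
: > "$demo/site/index.html"
coverage_report "$demo/urls.txt" "$demo/site"   # prints: requested=2 saved=1 missing=1
rm -rf "$demo"
```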

This is very helpful for businesses that had a disruption and have no saved copy of their own content.  It can restore content that was there previously, but don't expect a copy that is 24-48 hours fresh; think in terms of weeks or months.

