I have two dynamic sites running on a server I’d like to decommission; however, I want to make a backup of the sites as they look on their final day. Part of me can’t bear to shut them down, even though they haven’t been touched in nearly a decade. The bigger motivator now is that I don’t want to keep paying to host them. Making static mirrors of the sites is the next best thing: I keep the pages and the content for nostalgia, but I can shut down the hosting server.
Initially, I searched for the term web site scraper, and then web site crawler, but what I really wanted was more like a web site cloner. I thought I’d need a complex toolchain to make this work. Or at least write some code myself. Turns out this can be done by a program you may already know:
wget—the non-interactive network downloader. Here’s the command I used to create the mirror:
wget -mpEk http://example.com
The flags are what make this powerful. The command recurses through the entire site to create a mirror; downloads all the HTML, images, JS, and CSS each page uses; saves each dynamic URL to a static file; and, lastly, converts all the links on the site to refer to the files saved by wget. This last feature lets you navigate the mirrored site as seamlessly as if you were on the live site itself. See the shell command explained or wget --help for a detailed explanation.
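For readability, the same command can be written with wget’s long-form options, which spell out what each of the four short flags does:

```shell
# Same mirror command, with each flag spelled out:
#   --mirror            turn on recursion and time-stamping (-m)
#   --page-requisites   download the images, CSS, and JS each page needs (-p)
#   --adjust-extension  save files with matching .html/.css extensions (-E)
#   --convert-links     rewrite links to point at the local copies (-k)
wget --mirror --page-requisites --adjust-extension --convert-links http://example.com
```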
The downside is that on Scrawlpoint some content is only visible when logged in. Does wget support crawling a website while logged in? Indeed, it does. Here are the commands I used to create the mirror as a particular user.
# Log in and save the credentials.
wget --save-cookies cookies.txt \
     --keep-session-cookies \
     --post-data 'name=username&password=secret' \
     --delete-after \
     http://example.com/login.php

# Use the credentials to create the mirror.
wget -mpEk --load-cookies cookies.txt http://example.com
The first command logs in as a user and saves the cookies to a file. These cookies are then given to a second command that creates the mirror. See the wget save cookies and wget load cookies shell commands explained for details. This worked nicely to back up a WordPress site and Scrawlpoint, a hobby site I created when first learning web development. Now, I’ve got two copies of Scrawlpoint: one as an existing user, one as a new visitor.
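A quick way to sanity-check a mirror is to serve the downloaded directory locally and click through it. A minimal sketch, assuming Python 3 is available and that wget saved the site into a directory named after the host (example.com here):

```shell
# Serve the mirrored files at http://localhost:8080 and click around.
# The directory name is an assumption; wget names it after the host.
cd example.com
python3 -m http.server 8080
```

Because -k rewrote the links to be relative, every page should navigate cleanly without touching the original server.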
It would be possible to keep both sites available by hosting the static mirrors I just made once the original server is turned off. That’s cool! My next projects are to back up the code (this was before I knew about version control) and databases before shutting the server down permanently.
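If I do end up hosting the mirrors, a plain web server is all that’s needed, since everything is now static files. A minimal nginx sketch, where the server name and root path are assumptions rather than details from my actual setup:

```nginx
server {
    listen 80;
    server_name example.com;

    # Point the web root at the directory wget created.
    root /var/www/example.com;
    index index.html;
}
```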
I’m always thankful for the information I find that helps point me in the right direction. I used a blog post and Stack Overflow questions to help me with this project.