EDIT: Call off the dogs. Matt was able to recover everything after a lot of hard work. We can stop cache-hunting now -- although I have to state that I'm pretty impressed we were able to collect over 525 posts in the inbox of our cache-dump email account. Excellent crowdsourcing. Sorry it ended up being unnecessary, but had the website been unrecoverable, it was pretty important that we get things from the caches before they expired. Thanks to everyone who was a part of this, and my apologies to anyone who feels it was a waste of time.
I woke up today and went to Hardwood Paroxysm, intending to look up an old piece I read every now and then for inspiration. Imagine my surprise when I found, well... nothing. I immediately checked Twitter and heard the news -- server got hacked, entire blog was deleted, things looked grim. Very sad story. I've actually had limited experience trying to recover lost websites before. Specifically, I had a forum I ran in high school whose website was unexpectedly wiped. We tried to save as many posts as we could, but we didn't get much. Most of it (including the tales of Spiderdude, a bro-ified Spiderman knock-off that only a high schooler like me would find funny) was lost to the endless ether of the internet. In trying to recover everything, though, I became at least a little more knowledgeable about how to recover a site when the server-side data unexpectedly vanishes. For the uninitiated, here are two key points to keep in mind.
- Caches have everything. ... Sort of. There are three main cache servers that spider virtually everything on the web and keep records for varying lengths of time. Google, Yahoo, and the Wayback Machine are my three mainstays -- there are quite a lot more, but those tend to have everything you need (with the others coming into play only later in the process). The process of accessing files cached by Google is simple -- you search for something, hover over it, then click on the "Cached" link that comes up on the right side of the page. As seen below, on the far right side of the image.
- Time is of the essence. This is why I say "sort of." Caches have a catch. They've got a relatively quick churn rate, and because of this, a webpage that no longer exists only stays cached in Google for a limited amount of time. The time varies based on how popular the website is -- I'm not sure what the algorithm is, exactly, but after a certain amount of time, if the webpage no longer exists, the Google cache picks up on it and removes the file. The Wayback Machine doesn't work like that; however, it picks up historical data quite a bit less often than the Google/Yahoo caches, so it may not be as useful for this exercise.
Why is this relevant? We can still back up Hardwood Paroxysm. There are two ways we can do this -- either by sifting through the RSS feeds of people who don't delete old articles, or by downloading articles based on cache data. I've already started the second process, but given the incredible amount of material amassed by the Hardwood Paroxysm crew, there's absolutely no way I can do it alone. And that's where you come in. After the jump, I outline the ways that you can help save Hardwood Paroxysm's archives and preserve the content of one of the best basketball blogs to ever grace the web. Let's get to it.
• • •
From my local drive, I was able to save many of the key style elements from HP's page -- the logo, the CSS stylesheets, et cetera. I was also able to save the text of HP's last 10 posts, which is good, because most caches don't seem to have them. When thinking about how best to organize the task of sifting through caches for over a thousand posts, I came up with what I think is a relatively good structure for backing up HP before the content churns out of the cache. We start by searching the Google cache for specific HP authors, crowdsourcing the task to one or two authors per person so that work isn't duplicated. We can collect the document text by copying the cached page text into emails and sending them to an email account set up specifically to take old HP articles, so they're saved in a place we know they won't vanish any time soon. Then we can turn the archives over to the HP writers so they can undertake the task of repopulating the blog with content. That may have been a bit hard to follow, so here's an easier-to-follow instruction manual:
STEP 1: PICK AN AUTHOR
Here is an incomplete list of Hardwood Paroxysm authors, provided to me by Matt Moore in no particular order. Italicized authors are ones whose work is already backed up:
- Rob Mahoney -- (Articles to be found by Mogias)
- David Sparks -- (Articles backed up.)
- Zach Harper -- (Articles to be found by AJ)
- Jared Wade -- (Articles to be found by Alex Dewey)
- Matt Moore -- (Articles to be found by Iz)
- Scott Leedy -- (Articles to be found by Adam Koscielak)
- Curtis Harris -- (Articles to be found by Jordan White)
- James Herbert -- (Articles backed up.)
- Jovan Buha
- Steve McPherson -- (Articles backed up.)
- Sean Highkin -- (Articles to be found by Jordan White)
- Danny Chau -- (Articles to be found by Blake Potosh)
- Connor Huchton -- (Articles to be found by Moglas)
- Jared Dubin -- (Articles backed up.)
- Jon Nichols
- Amin Vafa
- Eric Maroun -- (Articles backed up.)
- Noam Schiller -- (Articles to be found by Ian Dougherty)
- Conrad Kaczmarek -- (Articles backed up.)
- Andrew Lynch -- (Articles backed up.)
- Joey Whelan -- (Articles to be found by Aaron McGuire)
- Josh Tucker -- (Articles to be found by Tim)
Please comment on this post with your name and the name of an author, if you'd like to take the task of helping to back up their work. We'll put your name here, so that nobody else duplicates your work in getting their articles back. Once you've got an author, move on to Step 2.
STEP 2: LOOK UP THEIR WORK
Go to Google and search for the following string:
site:http://www.hardwoodparoxysm.com "[author name]"
This should bring up several pages of results, with each of their posts on Hardwood Paroxysm.
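If you'd rather not retype the search string for each author, here's a rough sketch of how it could be generated. It only builds the query text from Step 2 and a Google search URL you can paste into your browser. The function names are made up for this example, and the `https://www.google.com/search?q=` URL form is an assumption on my part rather than anything specific to this project.

```python
from urllib.parse import quote_plus

# The site prefix is copied from the search string in Step 2.
SITE_PREFIX = "site:http://www.hardwoodparoxysm.com"


def build_query(author):
    """Return the Step 2 search string for one author (hypothetical helper)."""
    return f'{SITE_PREFIX} "{author}"'


def build_search_url(author):
    """Return a Google search URL for that query (URL form is an assumption)."""
    return "https://www.google.com/search?q=" + quote_plus(build_query(author))


if __name__ == "__main__":
    # A few names from the Step 1 list, purely as sample input.
    for name in ["Jovan Buha", "Jon Nichols", "Amin Vafa"]:
        print(build_query(name))
        print(build_search_url(name))
```

This doesn't fetch anything; you still click through to the "Cached" links by hand in Step 3.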
STEP 3: EXAMINE THAT CACHE, DOGGS
So, you have a list of articles. By hovering over the article link, you'll get a menu on the right side of the screen with a screenshot of what the article once looked like and a link that reads "Cached". Click the link. You may notice that it takes a long time to load -- if that's the case, just click through to the text-only version, which'll be in the right corner of the box on the top of the screen. Like this:
STEP 4: EMAIL
This is the important part. Copy the text of the article -- including the title, author, and date-stamp -- into your email program. Then send it to email@example.com, with the email titled like so:
[Article Author] - [Article Title]
Then go back to the original Google search from Step 2, go to their next article, and repeat.
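For anyone who wants to script the email part of Step 4, here's a minimal sketch using Python's standard `email.message.EmailMessage`. It only assembles the message with the subject format above; it does not send anything. The function name, the argument names, and the body layout are assumptions made up for this example, so adjust them to taste.

```python
from email.message import EmailMessage

# Destination address as given in Step 4.
ARCHIVE_ADDRESS = "email@example.com"


def build_article_email(author, title, date_stamp, text):
    """Assemble one archive email (hypothetical helper; nothing is sent)."""
    msg = EmailMessage()
    msg["To"] = ARCHIVE_ADDRESS
    # Subject follows the "[Article Author] - [Article Title]" format.
    msg["Subject"] = f"{author} - {title}"
    # Body layout is an assumption: title, author, date-stamp, then the text.
    body = f"{title}\n{author}\n{date_stamp}\n\n{text}"
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    # Placeholder values only; paste in the real cached article details.
    sample = build_article_email(
        "Author Name", "Article Title", "date-stamp here", "Article text here."
    )
    print(sample["Subject"])
    print(sample.get_content())
```

Actually sending the message would need your own mail setup, so copying the printed subject and body into your email program works just as well.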
• • •
It's a tedious, mind-numbing process. But it's probably the easiest and most organized way to get HP's content stored before it gets churned out from the cache. If anyone has better ideas, I'd definitely be up for editing this post or altering the strategy. But I thought it'd be good to start something now, before articles start dropping like flies and the cache gets emptied. It's all at our fingertips right now, if we can organize enough and get it before it goes. This whole thing sucks, but hopefully we can minimize the damage and recover as much as we can. Good luck, campers.