Help Us Save Hardwood Paroxysm: a Bloggissist's Plea

EDIT: Call off the dogs. Matt was able to recover everything after a lot of hard work. We can stop cache-hunting now -- although I have to say I'm pretty impressed that we collected over 525 posts in the inbox of our cache-dump email account. Excellent crowdsourcing. Sorry it ended up being unnecessary, but if the website had turned out to be unrecoverable, it was pretty important that we grab things from the caches before they expired. Thanks to everyone who was a part of this, and my apologies to anyone who feels it was a waste of time.

I woke up today and went to Hardwood Paroxysm, intending to look up an old piece I read every now and then for inspiration. Imagine my surprise when I found, well... nothing. I immediately checked Twitter and heard the news -- server got hacked, entire blog was deleted, things looked grim. Very sad story. I've actually had limited experience trying to recover lost websites before. Specifically, I had a forum I ran in high school whose website was unexpectedly wiped. We tried to save as many posts as we could, but we didn't get much. Most of it (including the tales of Spiderdude, a bro-ified Spiderman knock-off that only a high schooler like me would find funny) was lost to the endless ether of the internet. In trying to recover everything, though, I became at least a little more knowledgeable about how to recover a site when the server-side data unexpectedly vanishes. For the uninitiated, here are two key points to keep in mind.

  • Caches have everything. ... Sort of. There are three main services that spider virtually everything on the web and keep records for varying lengths of time. Google, Yahoo, and the Wayback Machine are my three mainstays -- there are quite a lot more, but those tend to have everything you need (with the others coming into play only later in the process). Accessing pages cached by Google is simple -- you search for something, hover over the result, then click the "Cached" link that comes up on the right side of the page (shown on the far right side of the image below).

  • Time is of the essence. This is why I say "sort of." Caches have a catch. They've got a relatively quick churn rate, and because of this, a webpage that no longer exists only stays cached in Google for a limited amount of time. The time varies based on how popular the website is -- I'm not sure what the algorithm is, exactly, but once a webpage has been gone for a certain amount of time, the Google cache picks up on it and removes the file. The Wayback Machine doesn't work like that. However, it picks up historical data quite a bit less often than the Google/Yahoo caches, so it may not be as useful for this exercise.
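For anyone who would rather script the Wayback Machine check than click through by hand, here is a rough Python sketch. It assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and its usual JSON shape. The function names and the `example.com` address are placeholders of mine, not anything from HP, so swap in the real post URLs. It only covers the Wayback Machine, not the Google or Yahoo caches.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Internet Archive availability endpoint (assumed reachable; no API key needed).
API = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the lookup URL for one page. timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived copy's URL from the JSON reply, or None if there isn't one."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Ask the Wayback Machine for the snapshot closest to the given date."""
    with urlopen(build_query(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder address; replace with an actual post URL.
    print(lookup("example.com"))
```

A `None` result just means the Wayback Machine never grabbed that page, which is exactly the gap the Google and Yahoo caches have to fill.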

Why is this relevant? We can still back up Hardwood Paroxysm. There are two ways we can do this -- either by sifting through the RSS feeds of people who don't delete old articles, or by downloading articles based on cache data. I've already started the second process, but given the incredible amount of material amassed by the Hardwood Paroxysm crew, there's absolutely no way I can do it alone. And that's where you come in. After the jump, I outline the ways that you can help save Hardwood Paroxysm's archives and preserve the content of one of the best basketball blogs to ever grace the web. Let's get to it.

 • • •

From my local drive, I was able to save many of the key style elements from HP's page -- the logo, the CSS stylesheets, et cetera. I was also able to save the text of HP's last 10 posts, which is good, because most caches don't seem to have them. When thinking about how best to organize the task of sifting through caches for over a thousand posts, I came up with what I think is a relatively good structure for backing up HP before the content churns out of the cache. We start by searching the Google cache for specific HP authors, crowdsourcing the task at one or two authors per person so that work isn't duplicated. We collect the document text by copying the cached page text into emails and sending them to an email account set up specifically to take old HP articles, so they're saved in a place we know they won't vanish any time soon. Then we turn the archives over to the HP writers so they can undertake the task of repopulating the blog with content. That may have been a bit hard to follow, so here's an easier-to-follow instruction manual:
