I was recently issued a challenge: back up a Blogsome blog and the content of a Slashdot journal and merge them into a single database. I foolishly accepted this challenge, knowing that Blogsome is based on WordPress. As it turns out, Blogsome doesn’t allow you to back up or export your WordPress content, and Slashdot doesn’t provide a way to export or back up your journal either. The prize was sweet: a brand new, fairly expensive, unlocked mobile phone.
If you want to make a mirror of your Blogsome blog, you can use a single, very powerful command to generate a snapshot of it from any Linux machine, or from Windows with Cygwin installed:
wget -k -m -r http://url
But this will only create a static HTML mirror of your website. It won’t allow you to manipulate content or put it into another database. That leaves only one way to do it – request the entire site page by page, and parse each page individually. RSS is not reliable here, as most people have it set to only 15 or so items, and parsing an enormous page may make PHP or your server run out of memory or allotted script execution time.
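To give a feel for what "parse each page individually" means, here is a minimal sketch in Python using only the standard library. It is an illustration, not the actual script: the markup it expects (a `div` with class `post` and an id like `post-123`, whose first link is the permalink and title) is an assumption about a typical WordPress-style theme, and the names `EntryParser` and `parse_entries` are made up for this example. A real theme will almost certainly need different rules.

```python
from html.parser import HTMLParser


class EntryParser(HTMLParser):
    """Collect post id, permalink and title from one listing page.

    Assumed (hypothetical) markup per entry:
        <div class="post" id="post-123"><h2><a href="URL">Title</a></h2>...</div>
    """

    def __init__(self):
        super().__init__()
        self.entries = []
        self._current = None    # entry currently being built, if any
        self._in_title = False  # True while inside the entry's first link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "post" in classes:
            post_id = (attrs.get("id") or "").replace("post-", "")
            self._current = {"post_id": post_id, "url": None, "title": ""}
        elif tag == "a" and self._current is not None and self._current["url"] is None:
            # The first link inside the post block is taken as the permalink.
            self._current["url"] = attrs.get("href")
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            # Title link closed: the entry is complete for our purposes.
            self._in_title = False
            self.entries.append(self._current)
            self._current = None


def parse_entries(html):
    """Return a list of dicts (post_id, url, title) found in one page of HTML."""
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    return parser.entries
```

You would fetch each listing page in turn, hand its HTML to `parse_entries`, and stop when a page comes back with no entries. Working one page at a time keeps memory use small, which is the whole reason for not parsing one giant feed.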
It’s a multi-step process, to be certain. It was actually painful to go through the process. Requesting over and over, debugging the script line by line. It takes several steps to get things right. But, eventually, I did it. I was able to export a Blogsome blog in its entirety – every entry, all the categories, all the comments with emails and websites …everything.
Slashdot was the same. It took some tinkering, but eventually, I was able to back up a very lengthy Slashdot journal. Again, in its entirety. I got every post, the date, etc. It was not simple, but it worked well. And I not only got them backed up, I merged them into a single database serving up… Small Axe 0.6 (which was a whole adventure of its own, taking current blog.adamscheinberg.com code and “neutralizing” it). Suffice it to say, when I saw all entries working and served in Small Axe, I had a huge smile. It turns out that the person I was doing this for decided against Small Axe (only because it doesn’t yet offer all the bells and whistles WordPress does, even if it is a beast). But that was irrelevant – the hard part was getting the data properly, and that’s done. Migrating to *any* blog database is possible if you have the time, inclination, and skill to write a SQL export/import script.
Here’s how it works:
* Cycle through each page of the Blogsome blog. On each page, we get the entries, the URL, the postid, and other relevant info. We set a flag on each item to 0.
* As we retrieve the items, we correct the path to images, spacing, smilies, etc.
* Then we cycle through each entry's page individually. We have all the URLs already, so we go through each one and parse comments. It’s important to know that comments by the blog owner are marked up differently. As we get the comments, we update the flag and let the script run on its own. Our 900+ items took about an hour by meta-refreshing the fully rendered page every 3 seconds.
* As we go through the comments, we strip tags we don’t want, we fix emoticons, we fix internal links, spacing, etc. We must expose emails temporarily if we want them to transfer over.
* Finally, we import them all into a central database with an agreed schema.
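The flag is what makes the long run survivable: if the script dies halfway, it picks up at the first item still marked 0. Here is a rough Python sketch of that idea with SQLite. It is not the actual script, and it swaps the meta-refresh for a plain function you call repeatedly. The table layout, the column name `done`, and the functions `init_db`, `add_entry` and `process_next` are all invented for this example, and `fetch_comments` stands in for whatever code downloads and parses one entry's comments.

```python
import sqlite3


def init_db(conn):
    """Create a small, hypothetical central schema for entries and comments."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entries (
               post_id INTEGER PRIMARY KEY,
               url     TEXT NOT NULL,
               title   TEXT,
               body    TEXT,
               done    INTEGER NOT NULL DEFAULT 0  -- the flag: 0 = comments not fetched yet
           )"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS comments (
               id      INTEGER PRIMARY KEY AUTOINCREMENT,
               post_id INTEGER NOT NULL REFERENCES entries(post_id),
               author  TEXT,
               email   TEXT,
               website TEXT,
               body    TEXT
           )"""
    )


def add_entry(conn, post_id, url, title, body):
    """Store one entry with its flag at 0; re-adding the same post is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO entries (post_id, url, title, body) VALUES (?, ?, ?, ?)",
        (post_id, url, title, body),
    )


def process_next(conn, fetch_comments):
    """Handle one unflagged entry. Returns False when nothing is left to do.

    fetch_comments(url) must return a list of dicts with the keys
    author, email, website and body.
    """
    row = conn.execute(
        "SELECT post_id, url FROM entries WHERE done = 0 ORDER BY post_id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    post_id, url = row
    comments = fetch_comments(url)
    # Save the comments and flip the flag together, so a crash never
    # leaves an entry half-imported.
    with conn:
        conn.executemany(
            "INSERT INTO comments (post_id, author, email, website, body) "
            "VALUES (?, ?, ?, ?, ?)",
            [(post_id, c["author"], c["email"], c["website"], c["body"]) for c in comments],
        )
        conn.execute("UPDATE entries SET done = 1 WHERE post_id = ?", (post_id,))
    return True
```

Each call to `process_next` plays the role of one page refresh: do one entry, mark it, stop. Looping until it returns `False` (with a polite pause between requests) gets you through the whole blog.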
If you have a Blogsome blog and/or a Slashdot journal you need backed up, I can help you do it. It’s not a simple process, but it is very accurate and preserves whatever data is exposed via HTML. So for the right barter, I would be very motivated to help. If I can simplify the process, I may create an open script to do this. But for now, I’ve got the code.
On a related note, I’ll probably release an updated version of Small Axe sometime in the not too distant future, because the amount of changing I’ve done and all the features I’ve just implemented are killer. Small Axe is FAR from WordPress-caliber tested, but it’s SUPER simple and can do all the basics of a normal blog, including templating, smart per-domain caching, blocking by IP, username, email, or keyword, Gravatar support, tons of configuration options, RSS and Atom support, threaded commenting, post locking, post expiring, browser identification, slug-based permalinking, and much more.
Sounds interesting. Any chance we’ll see this code as a library in the near future?
Maybe, but not in the very near future. For one, it’s not fully automated. Also, the process has to be tweaked depending on the “theme.” And lastly, it’s going to take some practice to get it right… and practice means I need some blogs to practice on. So, maybe one day, but for now, it’s still fairly manual parsing.