Migrating blogs to Jekyll

Published: Wed, Jul 2, 2014
categories: blog
tags: django, jekyll, blogging, python

I finally got fed up with how slowly I was publishing. I tried out Jekyll on github.io and liked it enough that I decided to migrate this blog from Django Zinnia and some custom models over to Jekyll. The blog has been running happily in Django for a while, and I have been pleased with the functionality and performance, but the difficulty of posting was slowing me down, so I was looking for something with a lower barrier to entry. I stumbled across a number of Jekyll and Octopress blogs, and since Jekyll is easy to try out on github.io, I experimented with it for a week or so. Needless to say, I was happy with the ease of use and that I can write in Markdown and easily post content.

Once I decided to make it happen, it took a few hours to export the content and confirm it was properly migrated. The first step was to get the data out of Django, so I did a quick dumpdata followed by a loaddata to get a local copy to work with.

vl4rl.com [~/public_html]# python manage.py dumpdata --indent=4 --exclude=contenttypes --exclude=auth.Permission  > data-prod.20140630-4.json

Now that I have the data exported, I download it to my development environment and import it.

(ve-blog) sgwilbur@gura:~/workspaces/vl4rl.com$ python manage.py flush
(ve-blog) sgwilbur@gura:~/workspaces/vl4rl.com$ python manage.py loaddata ../data-prod.20140630-4.json
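
Before loading a dump like this, it can be handy to check what it actually contains. Here is a minimal sketch, assuming Python 3 and the usual dumpdata JSON layout (a list of objects, each carrying a "model" key such as "zinnia.entry"). The function name count_models is made up for illustration and is not part of Django:

```python
import json
from collections import Counter


def count_models(fixture_path):
    """Count how many objects of each model a dumpdata JSON fixture holds."""
    with open(fixture_path, encoding="utf-8") as f:
        objects = json.load(f)  # a fixture is a JSON list of serialized objects
    return Counter(obj["model"] for obj in objects)


if __name__ == "__main__":
    import sys

    # Print one "model: count" line per model found in the fixture.
    for model, count in sorted(count_models(sys.argv[1]).items()):
        print("%s: %d" % (model, count))
```

Running it against the dump file gives a per-model count, which is a quick sanity check that the entries and pages made it into the export.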

With a working local setup to play with in my interactive console, I was able to figure out what I needed to extract. After a bit more playing around I wrote this quick helper script to assist.

#!/usr/bin/env python

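import os
import sys

# Python does not expand "~" in paths, so expand it explicitly.
sys.path.insert(0, os.path.expanduser("~/workspaces/virtualenvs/vl4rl"))
sys.path.insert(1, os.path.expanduser("~/workspaces/vl4rl.com/"))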

import os, django, cgi
from django.utils.text import slugify

os.environ['DJANGO_SETTINGS_MODULE'] = 'vl4rl.settings-dev'

from django.conf import settings
from django.db.models import Model
from zinnia.models.entry import Entry
from pages.models import Page

# open() will not expand "~" either, so resolve the output directory up front.
output_dir = os.path.expanduser('~/workspaces/vl4rl.com_jekyll/_posts')

es = Entry.objects.all()

for e in es:
  post = {}
  
#  published = e.published # couldn't get directly from the queryset, so I skipped it...
  post['title']     = e.title
  post['tags']      = ' '.join(e.tags_list)
  post['excerpt']   = cgi.escape( e.excerpt )
  post['slug']      = e.slug
  post['content']   = e.content.replace("\r", "")
  post['created']   = '{:%Y-%m-%d}'.format(e.creation_date)
  post['modified']  = '{:%Y-%m-%d %H:%M}'.format(e.last_update)
  
  post_output = '''---
layout: post
title: %(title)s
description: %(excerpt)s
published: true
created: %(created)s
modified: %(modified)s
category:
tags: %(tags)s
---

%(content)s
''' % post

  output_file = output_dir + '/' + post['created'] + '-' + post['slug'] + '.md'
  print "Creating post: ", output_file
  
  with open( output_file, 'w+') as f:
    f.write( post_output.encode('UTF-8') )
    
ps = Page.objects.all()

for p in ps:
  page = {}
  page['title']     = p.title
  page['slug']      = slugify( p.title )
  page['content']   = p.content.replace("\r", "")
  page['created']   = '{:%Y-%m-%d}'.format(p.pub_date)
  page['modified']  = '{:%Y-%m-%d %H:%M}'.format(p.mod_date)
  
  page_output = '''---
layout: post
title: %(title)s
published: true
created: %(created)s
modified: %(modified)s
category:
tags:
---

%(content)s
''' % page

  output_file = output_dir + '/' + page['created'] + '-' + page['slug'] + '.md'
  print "Creating post: ", output_file
  
  with open( output_file, 'w+') as f:
    f.write( page_output.encode('UTF-8') )

So this did the lion’s share of the work and got my content out of Django and ready to use. Unfortunately, even though I had converted the blog content to Markdown style, it is still a bit hit and miss, so I needed to go through and inspect each post. That is when I noticed that a few characters had been HTML-escaped. I could have gone back and corrected the export, but I had already started the content cleanup, so I pulled out my old friends grep and sed to find the issues and do some in-place edits. The end result was these three sed commands, plus one more to remove the backups. Hopefully I didn’t miss any other characters, but since they are simple Markdown files, it will be easy to clean up in the future if that is the case.

sgwilbur@gura:~/workspaces/vl4rl.com_jekyll$ sed -i.bak "s/&quot;/\"/g" _posts/*.md
sgwilbur@gura:~/workspaces/vl4rl.com_jekyll$ sed -i.bak "s/&lt;/</g" _posts/*.md
sgwilbur@gura:~/workspaces/vl4rl.com_jekyll$ sed -i.bak "s/&gt;/>/g" _posts/*.md
sgwilbur@gura:~/workspaces/vl4rl.com_jekyll$ rm _posts/*.md.bak
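
If you would rather not chase entities one at a time, the same cleanup can be sketched in Python 3 with the standard library's html.unescape. This is only a sketch and not what I ran: the function name unescape_posts is made up, and unlike the three sed commands it converts every HTML entity it finds (including &amp;amp; and numeric ones), so it may also touch entities you wrote on purpose, for example inside code samples.

```python
import html
from pathlib import Path


def unescape_posts(posts_dir):
    """Unescape HTML entities in every .md file in posts_dir.

    Returns the list of paths that were actually rewritten.
    """
    changed = []
    for path in sorted(Path(posts_dir).glob("*.md")):
        original = path.read_text(encoding="utf-8")
        fixed = html.unescape(original)  # &quot; -> ", &lt; -> <, &gt; -> >, ...
        if fixed != original:
            path.write_text(fixed, encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    for path in unescape_posts("_posts"):
        print("Unescaped:", path)
```

Since it rewrites files in place without keeping a .bak copy, it is best run on a committed working tree so the changes can be reviewed with a diff.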

There were a few other minor issues to clean up, and I found that you can’t have an unquoted : character in your front matter description, which makes sense since YAML uses a colon followed by a space to separate keys from values. So I am up and running in a day and already posting some new content :)
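
One way to sidestep the colon problem at export time is to quote the values before they go into the front matter. A minimal Python 3 sketch, relying on the fact that a JSON string is also a valid double-quoted YAML scalar; the helper name yaml_quote is made up for illustration and was not part of my export script:

```python
import json


def yaml_quote(value):
    """Return value as a double-quoted YAML scalar, safe for front matter."""
    # json.dumps adds the surrounding quotes and escapes embedded quotes
    # and backslashes; ensure_ascii=False keeps non-ASCII text readable.
    return json.dumps(value, ensure_ascii=False)


if __name__ == "__main__":
    print("description: %s" % yaml_quote("Jekyll: first impressions"))
```

In the export script this would mean wrapping the title and excerpt with the helper before substituting them into the template.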

I am also working on migrating a few other blogs into this new format, and found an old MySQL dump of my SimpleLog blog from a while back. First I had to figure out what the schema looked like:

mysql> show tables;
+---------------------+
| Tables_in_simplelog |
+---------------------+
| authors             |
| blacklist           |
| comments            |
| pages               |
| posts               |
| preferences         |
| schema_info         |
| sessions            |
| tags                |
| tags_posts          |
| updates             |
+---------------------+
11 rows in set (0.00 sec)

mysql> desc posts
    -> ;
+---------------------+--------------+------+-----+---------+----------------+
| Field               | Type         | Null | Key | Default | Extra          |
+---------------------+--------------+------+-----+---------+----------------+
| id                  | int(11)      | NO   | PRI | NULL    | auto_increment |
| author_id           | int(11)      | NO   |     | 0       |                |
| created_at          | datetime     | NO   |     | NULL    |                |
| modified_at         | datetime     | NO   |     | NULL    |                |
| permalink           | varchar(128) | YES  |     | NULL    |                |
| title               | varchar(255) | YES  |     | NULL    |                |
| synd_title          | varchar(255) | YES  |     | NULL    |                |
| summary             | text         | YES  |     | NULL    |                |
| body_raw            | text         | YES  |     | NULL    |                |
| extended_raw        | text         | YES  |     | NULL    |                |
| body                | text         | YES  |     | NULL    |                |
| extended            | text         | YES  |     | NULL    |                |
| is_active           | tinyint(1)   | YES  |     | 1       |                |
| custom_field_1      | varchar(255) | YES  |     | NULL    |                |
| custom_field_2      | varchar(255) | YES  |     | NULL    |                |
| custom_field_3      | varchar(255) | YES  |     | NULL    |                |
| body_searchable     | text         | YES  |     | NULL    |                |
| extended_searchable | text         | YES  |     | NULL    |                |
| text_filter         | varchar(255) | YES  |     | NULL    |                |
| comment_status      | int(11)      | YES  |     | 0       |                |
+---------------------+--------------+------+-----+---------+----------------+
20 rows in set (0.00 sec)

I loaded it up and adapted the script above to use the MySQL Python Connector:

#!/usr/bin/python
import cgi
import re
import mysql.connector

output_dir = "./_posts"

db = mysql.connector.connect(host="localhost", user="root", passwd="***", db="simplelog")

cur = db.cursor()
cur.execute("SELECT id, title, created_at, modified_at, body_raw, summary FROM posts")

for row in cur.fetchall() :
  post = {}
  
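  # title, summary and body_raw are nullable in the schema, so fall back to ''
  post['title'] = row[1] or ''
  post['excerpt'] = cgi.escape( row[5] or '' )
  post['slug'] = re.sub(r'[-\s]+', '-', re.sub(r'[^\w\s-]', '', row[1] or '' ).strip().lower() )
  post['content'] = (row[4] or '').replace("\r", "")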
  post['created'] = row[2].strftime('%Y-%m-%d')
  post['modified'] = row[3]
  
  post_output = '''---
layout: post
title: %(title)s
description: %(excerpt)s
published: false
created: %(created)s
modified: %(modified)s
category:
tags:
---
  
%(content)s

__Migrated: from simplelog 2014-07-03__

##### Reference:

''' % post

  output_file = output_dir + '/' + post['created'] + '-' + post['slug'] + '.md'
  print "Creating post: ", output_file
  
  with open( output_file, 'w+') as f:
    f.write( post_output.encode('UTF-8') )
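
The slug line in that script is the densest part, so here it is pulled out on its own. This is a Python 3 sketch of the same two regular expressions; the function name make_slug is made up for illustration. Note that in Python 3 \w also matches accented letters, so unlike Django's slugify with its default settings this will not reduce titles to plain ASCII.

```python
import re


def make_slug(title):
    """Turn a post title into a lowercase, hyphen-separated slug."""
    # Drop everything that is not a word character, whitespace or a hyphen.
    no_punct = re.sub(r'[^\w\s-]', '', title)
    # Collapse each run of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', no_punct.strip().lower())


if __name__ == "__main__":
    print(make_slug("Hello, World!"))  # hello-world
```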

Only after I was done did I realize that I had forgotten the tags. No big deal, since I updated each post manually to check the formatting and just added some relevant tags anyway. And I got a few earlier posts onto the new blog; it’s kinda like time traveling seeing some of this old stuff ;)
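
For the record, pulling the tags along would probably have looked something like the sketch below. I never ran this, and the column names are assumptions: I am guessing that tags has id and name columns and that tags_posts links them through post_id and tag_id, so check with desc tags and desc tags_posts first. The names TAGS_QUERY and group_tags are made up for illustration.

```python
from collections import defaultdict

# ASSUMED schema: tags(id, name) and tags_posts(post_id, tag_id).
# Only the table names are known; verify the columns before running this.
TAGS_QUERY = """
SELECT tp.post_id, t.name
FROM tags_posts tp
JOIN tags t ON t.id = tp.tag_id
ORDER BY tp.post_id, t.name
"""


def group_tags(rows):
    """Group (post_id, tag_name) rows into {post_id: [tag_name, ...]}."""
    tags_by_post = defaultdict(list)
    for post_id, name in rows:
        tags_by_post[post_id].append(name)
    return dict(tags_by_post)


if __name__ == "__main__":
    # Stand-in rows; in the real script these would come from cur.fetchall().
    sample = [(1, "django"), (1, "python"), (2, "jekyll")]
    print(group_tags(sample))
```

In the migration script the rows would come from running TAGS_QUERY on the same cursor before the main loop, and each post could then look up its tags by the id in row[0] and join them with spaces for the tags line.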

I probably have a few more sources out there that I will slowly bring back into the fold, to see if I can really get down to one location for them all.