Jupyter Notebooks in Jekyll

May 30, 2021


        
Tags: python, auto, data-science

4 min read.

Jupyter notebooks are a very important instrument in a data scientist's arsenal. They allow quick prototyping and testing without having to set up a development environment, and with cloud services such as Google Colab and Kaggle it's much easier to start experimenting with exploratory data analysis, machine learning and deep learning. Another advantage of Jupyter notebooks is their format interoperability: notebooks can be converted to HTML, PDF and, my favorite, Markdown.

Most of my data science work these days is done in notebooks on Kaggle or Google Colab, and I have finished entire projects in just a notebook (don't judge me). Over time I have built up a collection of those notebooks, and now that I have time to work on side projects, why not add them to my personal website?

At first I searched for ways to embed a PDF in Jekyll and found a few, but none of them seemed like a good option because PDFs cannot be styled. The next option was to just convert to HTML. That worked, but I faced a similar issue: HTML exported by Jupyter ships its own CSS, which makes it hard to override globally. This website is built with Jekyll, which uses Markdown as the primary format for posts (projects are just posts with project set to true), so I settled on Markdown in the end.

Why spend 1 hour converting a few notebooks when I can spend a few hours automating the whole process?

I decided to just script it. The initial steps are very clear:

- create a folder to keep all notebook files (make sure to add it to .gitignore).
- go through all notebooks and convert them to markdown.
- move all markdown files to _posts/
- move all assets to site assets

for file in filenames:
    # get filename without directory prefix and extension
    # (basename/splitext also handles names that contain dots)
    name = os.path.splitext(os.path.basename(file))[0]
    new_name = f"{notebooks_dir}{current_date}-{name}.md"
    # convert
    os.system(f"jupyter nbconvert --to markdown {file}")

    # rename with date
    os.system(f"mv {notebooks_dir}{name}.md {new_name}")

    # move file and assets
    os.system(f"mv {new_name} {post_dir}")
    os.system(f"rm -rf {assets_dir}{name}_files")
    os.system(f"mv {notebooks_dir}{name}_files {assets_dir}")
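
The move steps in the loop above can also be sketched without shelling out to `mv` and `rm`, which avoids quoting problems when a notebook name contains spaces. This is only a sketch under assumptions: the helper names `dated_post_name` and `move_post_and_assets` are hypothetical (they are not part of the script), and it assumes `jupyter nbconvert` has already written `<name>.md` and `<name>_files/` next to the notebook and that the target directories exist.

```python
import shutil
from pathlib import Path


def dated_post_name(notebook, date):
    """Build the Jekyll post filename for a notebook, e.g. 2021-05-30-eda.md."""
    # Path.stem drops the directory and only the final extension
    return f"{date}-{Path(notebook).stem}.md"


def move_post_and_assets(notebook, date, post_dir, assets_dir):
    """Move the converted markdown to post_dir and its images to assets_dir."""
    notebook = Path(notebook)
    markdown = notebook.with_suffix(".md")
    files_dir = notebook.with_name(f"{notebook.stem}_files")

    # move the post under its dated name
    target = Path(post_dir) / dated_post_name(notebook, date)
    shutil.move(str(markdown), str(target))

    # drop any stale assets folder, then move the fresh one if it exists
    new_assets = Path(assets_dir) / files_dir.name
    shutil.rmtree(new_assets, ignore_errors=True)
    if files_dir.is_dir():
        shutil.move(str(files_dir), str(new_assets))
    return target
```

A call such as `move_post_and_assets("notebooks/eda.ipynb", current_date, post_dir, assets_dir)` would stand in for the three `mv`/`rm` lines, with the conversion itself still done by `jupyter nbconvert`.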

Hiccups

The asset image paths need to be updated in each post:
- open the post
- add front matter
- update assets path
- paste the rest of the post

# read the converted post, fix the image paths, then rewrite it with front matter
with open(new_name, "r+") as f:
    content = [
        line.replace("![png](", "![png](/assets/images/") for line in f
    ]
    f.seek(0)
    f.write(front_matter.format(name.replace("-", " ").title()))
    f.writelines(content)
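
The replacement above only touches images labelled `![png](`. In case a notebook produces other image types or labels, the path fix can be sketched with a regular expression instead. This is a minimal sketch, not the script's method: the name `fix_image_paths` is hypothetical, and it assumes every relative image target should live under `/assets/images/`.

```python
import re

# matches the opening of a markdown image whose target is a relative path
# (targets starting with "/" or "http(s):" are left alone)
RELATIVE_IMAGE = re.compile(r"!\[([^\]]*)\]\((?!/|https?:)")


def fix_image_paths(markdown, prefix="/assets/images/"):
    """Prefix relative image targets so they resolve from the site root."""
    return RELATIVE_IMAGE.sub(lambda m: f"![{m.group(1)}]({prefix}", markdown)
```

Because already-prefixed paths start with `/`, running it twice on the same text leaves the text unchanged.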

Prevent regenerating posts for notebooks that have already been converted: the conversion can take a while with large notebooks, so we don't want to run it twice for the same notebook. To avoid that, we check whether the posts directory already has a file whose name contains the current notebook name.

def check_exists(file):
    """check if a post from a notebook has already been created"""
    # post_dir already ends with "/", so no extra slash is needed
    for post in glob.glob(post_dir + "*.md"):
        if file in post:
            return True
    return False
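
One caveat of the substring check is that a notebook named `eda` would be treated as already converted if a post for `eda-advanced` exists. A stricter variant could compare the part after the date prefix exactly. This is a sketch only: `post_exists` is a hypothetical name, and it assumes posts are always named `YYYY-MM-DD-<name>.md`, as the script produces them.

```python
from pathlib import Path


def post_exists(name, post_dir="_posts/"):
    """Return True if post_dir holds a post named YYYY-MM-DD-<name>.md."""
    date_prefix_len = len("YYYY-MM-DD-")
    for post in Path(post_dir).glob("*.md"):
        # compare only the part after the date prefix, and compare it exactly
        if post.stem[date_prefix_len:] == name:
            return True
    return False
```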

Fix transparent image backgrounds: most graphs have a transparent background, which is not ideal on the web, especially on a site with a dark background. ImageMagick can easily add a background to any image and remove its transparency.

# add background color so text is visible in images
os.system(f"mogrify -background white -flatten {notebooks_dir}{name}_files/*")

Fix table header text alignment: by default the tables in the exported Markdown are HTML, with the header text aligned to the right and the table taking the full width of the page. To fix these 2 issues, set the table's display to block so its width follows its content, and force the header text to be aligned left.

table {
  display: block;
}
.dataframe thead th {
  text-align: left !important;
}

And with that everything is set. The script will do all the work for us, and all that is left is to push the changes (not lazy enough to automate that yet). Example results.

Full script

#!/usr/bin/env python

'''
notebook_converter.py

Copyright 2021 Abubakar Yagoub
Contact: blacksuan19.tk

This script converts a jupyter notebook to markdown
and moves converted markdown to posts and images to assets.
Requirements:
- jupyter (for converting notebooks to md)
- imagemagick (for adding background to images)
'''

import glob
import os
from datetime import datetime

post_dir = "_posts/"
notebooks_dir = "notebooks/"
assets_dir = "assets/images/"
current_date = datetime.today().strftime('%Y-%m-%d')
front_matter = """---
title: {}
layout: post
project: true
permalink: "/projects/:title/"
image: /assets/images/ds.jpg
source:
tags:
  - data-science
  - machine-learning
  - project
---\n\n"""

# get all notebook files
filenames = glob.glob(notebooks_dir + '*.ipynb')

def check_exists(file):
    """check if a post from a notebook has already been created"""
    # post_dir already ends with "/", so no extra slash is needed
    for post in glob.glob(post_dir + "*.md"):
        if file in post:
            return True
    return False

for file in filenames:
    # get filename without directory prefix and extension
    # (basename/splitext also handles names that contain dots)
    name = os.path.splitext(os.path.basename(file))[0]
    new_name = f"{notebooks_dir}{current_date}-{name}.md"
    if check_exists(name):
        print(f"post for {name} has already been created.")
        continue

    os.system(f"jupyter nbconvert --to markdown {file}")

    # rename with date
    os.system(f"mv {notebooks_dir}{name}.md {new_name}")

    # read the converted post, fix the image paths, then rewrite it with front matter
    with open(new_name, "r+") as f:
        content = [
            line.replace("![png](", "![png](/assets/images/") for line in f
        ]
        f.seek(0)
        f.write(front_matter.format(name.replace("-", " ").title()))
        f.writelines(content)

    # add background color so text is visible in images
    os.system(
        f"mogrify -background white -flatten {notebooks_dir}{name}_files/*")

    # move file and assets
    os.system(f"mv {new_name} {post_dir}")
    os.system(f"rm -rf {assets_dir}{name}_files")
    os.system(f"mv {notebooks_dir}{name}_files {assets_dir}")
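
Since the script relies on two external tools (jupyter and imagemagick, per its requirements list), a small preflight check could catch a missing one before any notebook is touched. The helper below is a hypothetical addition, not part of the script, and it assumes the tools are exposed on PATH as `jupyter` and `mogrify`.

```python
import shutil


def missing_tools(tools=("jupyter", "mogrify")):
    """Return the names of required command line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` at the top of the script and exiting with a message when the returned list is not empty would be enough.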