Case study: removing not-so-obvious duplication

I detected a not-so-obvious case of code duplication in my blog's compiler and had to rethink my architecture to handle it. Here's the story.

Before we start, a bit of context: I write my articles in markdown files and use a Python program to build the blog from them. It takes .md files as input and produces .html files as output.

For future reference, here's what a typical markdown file looks like. The block at the beginning is called the frontmatter.

my-article.md
---
title: Markdown preview
tags: markdown, introductory
category: dummy
---

This is the article's content. Markdown syntax looks like this: 
*italic* ; **bold** ; `some_code()` ; [a link](https://julienharbulot.com).
My generator will also parse [internal links]({filename}my-other-article.md)

By the way, programs that generate websites from source files are commonly called static site generators.

[Image: static site generator]

A pipeline to generate the blog

I decided on a whim to code my own blog generator, set aside a few hours one night, and started coding. Given the time constraints, I went with a very simple architecture: the generator takes the input files in markdown format, applies a series of transformations to gradually modify them, and produces HTML files as output.

To emphasize modularity, each step of the pipeline is implemented in a separate function. I initially thought about using classes that have a process(articles, context) method, but then remembered that:

If it has two methods and one of them is __init__: it's not a class.

Back to the pipeline. The most important steps are:

compile.py
processors = [
    load_markdown_files,
    read_frontmatter,
    parse_markdown,
    render_html,
    write_html_files,
]

articles, context = {}, {}
for p in processors:
    p(articles, context)
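
To give an idea of what a processor looks like, here is a sketch of what read_frontmatter might do. The real implementation isn't shown in this post, so treat this as an illustration of the shared signature only; it assumes the loading step stored the raw markdown on each article as article.text.

# Sketch only: assumes the loader stored the raw markdown in article.text
# and that articles support both item and attribute access.
def read_frontmatter(articles, context):
    for article in articles.values():
        # split "---\n...frontmatter...\n---\n...body..." into its parts
        _, frontmatter, body = article.text.split('---\n', 2)
        for line in frontmatter.strip().splitlines():
            key, _, value = line.partition(':')
            article[key.strip()] = value.strip()
        article.text = body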

The actual pipeline is a lot longer. Here's what it looked like before I removed the nasty duplication:

compile.py
processors = [
    load_from_disk(dictclass=super_dict),
    read_frontmatter,
    remove_drafts,
    add_default_metas(default_meta),
    add_common_meta,
    compute_output_filepath,
    sort_by_creation_date,
    rename_meta('category', 'categories'),
    convert_meta_str_to_list(['tags', 'categories']),
    parse_meta_list('tags'),
    parse_meta_list('categories'),
    cross_ref('tags', 'categories'),
    add_meta_for_categories,
    add_articles_to_context,
    parse_youtube_tags,
    parse_markdown,
    translate_links,
    parse_table_of_content,
    render_html,
    generate_articles,
    generate_categories,
    generate_index,
    copy_images,
    remove_old_files
]

articles, context = {}, {}
for p in processors:
    p(articles, context)

Apart from some configuration details, what you see above was the full code in my compile.py file. That's it. All the logic is implemented in the processors. And if I want to change anything, I just have to add or remove one.

I find this idea very appealing: a simple pipeline at the core of my compiler, with all the logic implemented in processors that behave like small plugin functions.

Hacks to fit everything into the pipeline

Up until generate_articles, the pipeline is just a series of transformations applied to the articles. For instance, render_html uses Jinja2 templates to produce the HTML, and generate_articles writes it to HTML files:

def render_html(articles, context):
    jinja2 = setup_skipped_for_brevity()

    for article in articles.values():
        template = jinja2.get_template(article.template)
        article.html = template.render(article=article, site=context)

def generate_articles(articles, context):
    for article in articles.values():
        content = article.html
        filepath = article.output_path
        filepath.open('w').write(content)
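
One detail worth noting: article.template and article.html work even though each article is just a dictionary, because load_from_disk builds the articles with dictclass=super_dict. That class isn't shown in the post; a minimal attribute-access dict along those lines might look like this (my stand-in, not the actual code):

# Hypothetical stand-in for super_dict: a dict whose keys double as attributes.
class super_dict(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value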

The last few processors, however, are pseudo-processors: they ignore the articles argument and just do something else. That's a neat trick I found to keep everything in the pipeline.

For instance, the copy_images processor takes the images in the input folder and copies them into the HTML output folder:

from shutil import copytree

# notice how the first argument is ignored
def copy_images(_, context):
    copytree(context.input_dir_img, context.output_dir_img)
    # okay, that's not the full code, but you get the idea.
    # ps: input_dir is a pathlib.Path object so using / is valid Python code

And generate_index uses another Jinja2 template to create the index page. It looks like this:

# notice how the first argument is ignored
def generate_index(_, context):
    additional_data = skipped_for_brevity()
    jinja2 = setup_skipped_for_brevity()

    template = jinja2.get_template('index.template')
    html = template.render(data=additional_data, site=context)
    open('index.html', 'w').write(html)

So far, so good. I can add unused parameters to my functions to fit them into the pipeline... But wait.

Identifying duplication

Duplication doesn't necessarily mean copy & paste. I consider that duplication happens when we have to change the code in more than one place in order to change a single piece of functionality. It is error-prone because we might forget one of those places. And it's a waste of time.

Duplication within the same short function is not that bad: we can use macros or search & replace to edit it. But duplication across different functions or different files? The worst.

The snippet for generate_index shown above is full of duplication. First, the function does two distinct things: it renders the HTML and writes it to a file. These steps live in separate processors in the articles pipeline, so why are they lumped together here?

It might not seem like a big deal. After all, it's only three lines of code. But what if I switch to another library for rendering HTML? Or what if I implement a new processor named remove_old_files that requires every newly written file to be registered in the context, to make sure it won't get removed? Then one duplicated line for writing a file becomes two duplicated lines:

open('index.html', 'w').write(html)
context.written_files.append('index.html')

And what if I add a cache, and this cache requires... Well, you get the idea.
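
To make the remove_old_files scenario concrete, here is one hypothetical way such a processor could work. This is not the actual implementation: context.written_files comes from the example above, and context.output_dir is a name I'm assuming for the output directory.

from pathlib import Path

# Hypothetical sketch: every processor that writes a file is expected to
# register it in context.written_files; anything left over in the output
# directory is considered stale and gets deleted.
def remove_old_files(_, context):
    written = {Path(f).resolve() for f in context.written_files}
    for path in Path(context.output_dir).rglob('*'):
        if path.is_file() and path.resolve() not in written:
            path.unlink()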

Handling duplication

I already have code to render HTML in render_html and to write files in generate_articles. Can I reuse it to render and write my index page?

The HTML-rendering code has exactly the same structure in both functions, and I can reuse the file-writing code as-is. So if my index were in the articles dictionary, I could transparently reuse both. Let's do that.

def create_index(articles, context):
    articles['index.html'] = dict(
        output_path = 'index.html',
        template = 'index.template',
        ...
    )

Likewise, I can rename generate_articles to simply write_html_files:

def write_html_files(articles, context):
    for article in articles.values():
        content = article.html
        filepath = article.output_path
        filepath.open('w').write(content)

And the pipeline becomes:

compile.py
processors = [
    ...,
    create_index,
    render_html,
    write_html_files,
    ...
]

Actually, since my pipeline now handles generic pages and not just articles, let's rename the variable articles to pages:

pages, context = {}, {}
for p in processors:
    p(pages, context)

Hello again, MVC

After these modifications, a Model / View / Controller (MVC) architecture emerges:

  • My controller is the for loop that executes the pipeline.
  • My model is the (pages, context) pair that contains the data for all pages.
  • My views are the Jinja2 templates used by render_html.

In other words, I have an abstract concept of pages that can be an article, an index page, a category page, or whatever. Then I have a pipeline that modifies these pages' attributes until they're fit to be written to disk.

By contrast, before the refactoring, generate_index wrote a new HTML file to the output directory in a completely ad hoc way.

Benefits of a clean architecture

When I decided to implement a cache, this new code structure came in handy:

  • Each page's version is identified by a hash of what it depends on, computed when the page is created. For the articles, it's a hash of their markdown content; for the index, it's a hash of the concatenated metadata used to build the listing.
  • So I can add a processor remove_if_cached that uses this hash to check whether the page's output is already in the cache. If it is, the page is simply removed from the pipeline (a sketch follows the snippet below).

processors = [
    ...,
    create_index,
    remove_if_cached,  # <---
    render_html,
    write_html_files,
    ...
]
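
Here is a rough sketch of what remove_if_cached could look like. It is not the real code: it assumes each page was given a page.hash when it was created (for instance a digest of the article's markdown or of the index's metadata) and that context.cache_dir points to a directory of previously built outputs keyed by that hash.

from pathlib import Path

# Rough sketch: page.hash and context.cache_dir are assumptions, not the
# actual implementation. A page whose output is already cached is dropped
# from the pipeline, so render_html and write_html_files never see it.
def remove_if_cached(pages, context):
    cache_dir = Path(context.cache_dir)
    for name, page in list(pages.items()):
        if (cache_dir / page.hash).exists():
            del pages[name]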

And this processor handles the articles and the newly injected index page indiscriminately, whereas with my previous generate_index I would have had to duplicate the cache logic to cache the index.

Final words

My simple static generator is far from perfectly designed and will evolve organically as I add new features. If I implement something and end up having to make modifications in multiple functions or files, I'll know that I have another case of duplication and will fight it as it deserves.

That's it for today! I hope this case study inspired you to find duplication in your own codebases and relentlessly hunt it.