Loading config files in Python

Brett Weir, May 29, 2023

Config files are everywhere, and there are lots of reasons your app might need one.

Whatever the reason, the structure of the config is very important and likely long-lived. Mistakes in your config syntax will be hard to undo, so it pays to have a plan upfront and to design the format to be extended and documented.

In this article, we'll learn how to load YAML config files in a way that is clean, easy to support, and easy to extend. We'll do this by creating our own YAML task automation syntax, which we'll call taskbook files:

# taskbook.yml

group: # name of group

tasks: # list of tasks
  - name: # name of task
    module: # module to use
    options:
      # key / value options

  # ...

We'll write a program to read them, which we'll call Taskable*. When finished, it will be easy to determine which fields are supported, validate config values safely, add more fields for future needs, and even access config values within our program as properties.

*Any similarity to Ansible playbook syntax, real or imagined, is purely coincidental. 😂

Create a command line tool

Let's create a file called taskable.py, to contain our implementation of Taskable:

# taskable.py
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", type=argparse.FileType("r"))
    args = parser.parse_args()

if __name__ == "__main__":
    main()

This provides the scaffolding for an argparse command line interface (for more info, see our article on Python CLIs).

You can run the script as follows:

$ python3 taskable.py
usage: taskable.py [-h] file
taskable.py: error: the following arguments are required: file

To be able to read in a file, we need to create the file first, which we'll do next.

Create a taskbook file

I'll be using YAML for the config files because it's easy to read and I'm comfortable with it, but you can just as easily support JSON or TOML, since their Python loaders offer similar APIs.
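As a rough sketch (using hypothetical taskbook.json and taskbook.toml files), the equivalent loading step with the standard library could look like this; note that tomllib ships with Python 3.11+ and requires a binary file handle:

# alternative loaders (sketch)
import json
import tomllib  # standard library in Python 3.11+

with open("taskbook.json") as f:
    data = json.load(f)        # JSON accepts a text-mode file

with open("taskbook.toml", "rb") as f:
    data = tomllib.load(f)     # TOML requires a binary-mode file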

Create a taskbook.yml file and add the following:

# taskbook.yml
group: localhost
tasks:
  - name: copy file.txt to the place
    module: saucy.copy
    options:
      source: file.txt
      dest: /etc/file.txt

  - name: install a package
    module: cheesy.package
    options:
      name:
        - fzf
        - tree
      upgrade: true

  - name: enable the service
    module: lettuce.service
    options:
      enable: true
      start: true

At this point, we'll be able to run the following:

python3 taskable.py taskbook.yml

However, nothing will happen because our app doesn't print anything yet.

Read in the YAML file

YAML files are easy to read with Python. There are multiple libraries available, but pyyaml is the de facto standard and is often installed on whatever system you're already on.

If you don't have pyyaml (or you're using a virtual environment because you're awesome), install it now:

pip install pyyaml

Then, in your taskable.py file, import the yaml package and read in the YAML file:

# taskable.py
import argparse
import yaml

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", type=argparse.FileType("r"))
    args = parser.parse_args()
    data = yaml.safe_load(args.file)

if __name__ == "__main__":
    main()

At this point, you will be able to read in the YAML file, but there's still no output just yet. We could stop here and access its values as nested dictionaries and arrays, like so:

data["tasks"][0]["module"]

...but there are a couple problems with this.

First, there's no validation at all, so a malformed config file produces unpredictable results. Second, string keys are opaque to tooling: IDE auto-completion won't work, renaming a field means manually searching the code for every usage, and a misspelled key won't be caught until it fails at runtime.
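To make that second problem concrete, here's a small sketch of how a typo'd key only blows up at runtime, at the point of use:

# raw dict access fails late (sketch)
data = {"group": "localhost", "tasks": [{"name": "copy file", "module": "saucy.copy"}]}

module = data["tasks"][0]["modle"]  # typo: raises KeyError here, at runtime
# ...and nothing warned us while we were writing it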

No, we can do a lot better, and we will, starting by building a model of our data in the next section.

Create the data model

We need a way to express our data format in code so we can work with it reliably. For this purpose, I prefer to use attrs, which gives us data validation, makes our classes more performant, lets us access our fields as properties with dramatically less boilerplate, and more.
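As a quick illustration of the validation piece, attrs supports opt-in, per-field validators (it doesn't type-check fields by default). A minimal sketch, using a hypothetical Example class:

# opt-in attrs validators (sketch)
from attrs import define, field
from attrs.validators import instance_of

@define
class Example:
    name: str = field(validator=instance_of(str))  # raises TypeError on a non-str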

Let's install attrs:

pip install attrs

Then add the following to your taskable.py file:

# taskable.py
import argparse
from typing import Any

import yaml
from attrs import define, field

@define
class Task:
    name: str
    module: str
    options: dict[str, Any] = field(factory=dict)

@define
class Taskbook:
    group: str
    tasks: list[Task]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", type=argparse.FileType("r"))
    args = parser.parse_args()
    data = yaml.safe_load(args.file)

if __name__ == "__main__":
    main()

These two classes—Task and Taskbook—fully express the taskbook format. We won't instantiate them ourselves though, because we'll learn a method to do so automagically in the next section.
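For contrast, building a Taskbook by hand would look something like this; this is exactly the busywork we're about to automate:

# manual instantiation (for contrast)
taskbook = Taskbook(
    group="localhost",
    tasks=[
        Task(
            name="copy file.txt to the place",
            module="saucy.copy",
            options={"source": "file.txt", "dest": "/etc/file.txt"},
        ),
    ],
)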

Structurize into models

"Structurize" is a $6 word (that I may have made up) that translates to, "load all your data into fancy model classes." I'm using it because "de-serialize" sounds awful and is harder to type. 😝

The easiest way to structurize your YAML data into attrs classes is by using the cattrs package. The simplest usage looks like this:

import cattrs
taskbook = cattrs.structure(data, Taskbook)

Let's add it to our taskable.py file:

# taskable.py
import argparse
from typing import Any

import cattrs
import yaml
from attrs import define, field

@define
class Task:
    name: str
    module: str
    options: dict[str, Any] = field(factory=dict)

@define
class Taskbook:
    group: str
    tasks: list[Task]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", type=argparse.FileType("r"))
    args = parser.parse_args()
    data = yaml.safe_load(args.file)
    taskbook = cattrs.structure(data, Taskbook)

if __name__ == "__main__":
    main()

That's all you need! Given only the expected top-level class, which here is Taskbook, cattrs recursively loads the data into our attrs classes.

If you need to tweak the behavior, cattrs provides a hook mechanism. It's a bit cumbersome, but it's easier than writing all the structurization code from scratch.
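For example (purely hypothetical, not something our format needs), a hook could let a task be written as a bare module-name string, as shorthand for a mapping with no options:

# hypothetical structure hook (sketch)
import cattrs

def structure_task(value, _type):
    if isinstance(value, str):
        # treat a bare string as "run this module with no options"
        return Task(name=value, module=value)
    return Task(
        name=value["name"],
        module=value["module"],
        options=value.get("options", {}),
    )

cattrs.register_structure_hook(Task, structure_task)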

In the next section, we'll work on doing something useful with our data.

Use the data

At this point, we've fully structurized our data into classes, which means we can access our config data like this:

taskbook.tasks[0].module

This makes our code much easier to read and work with. Now we'll try using it to do stuff.

"Run" tasks

What good is our script if it can't run tasks? Let's simulate "running" our hypothetical tasks by updating taskable.py as follows:

# taskable.py
import argparse
from typing import Any

import cattrs
import yaml
from attrs import define, field

@define
class Task:
    name: str
    module: str
    options: dict[str, Any] = field(factory=dict)

@define
class Taskbook:
    group: str
    tasks: list[Task]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", type=argparse.FileType("r"))
    args = parser.parse_args()
    data = yaml.safe_load(args.file)
    taskbook = cattrs.structure(data, Taskbook)
    print("group", taskbook.group)
    for task in taskbook.tasks:
        print(f"run {task.module}: {task.name}")

if __name__ == "__main__":
    main()

Running our script will output the following:

$ python3 taskable.py taskbook.yml
group localhost
run saucy.copy: copy file.txt to the place
run cheesy.package: install a package
run lettuce.service: enable the service

It's not hard to imagine connecting this skeleton to real module implementations to drive real task execution.
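One hypothetical way to wire that up: a dictionary mapping module names to Python callables, with each task's options splatted in as keyword arguments:

# hypothetical module registry (sketch)
def copy(source=None, dest=None):
    print(f"copying {source} -> {dest}")

MODULES = {
    "saucy.copy": copy,
    # "cheesy.package": ..., "lettuce.service": ...
}

for task in taskbook.tasks:
    handler = MODULES.get(task.module)
    if handler:
        handler(**task.options)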

List used modules

Maybe we'd like to inspect our taskbook to find out what modules it uses. This would be useful, for example, to install necessary modules before running our tasks.

Let's add a -l / --list option to list used modules and exit without running the tasks:

# taskable.py
import argparse
from typing import Any

import cattrs
import yaml
from attrs import define, field

@define
class Task:
    name: str
    module: str
    options: dict[str, Any] = field(factory=dict)

@define
class Taskbook:
    group: str
    tasks: list[Task]

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", "--list", action="store_true")
    parser.add_argument("file", type=argparse.FileType("r"))
    args = parser.parse_args()
    data = yaml.safe_load(args.file)
    taskbook = cattrs.structure(data, Taskbook)
    if args.list:
        used_modules = sorted({task.module for task in taskbook.tasks})
        for module in used_modules:
            print(module)
        return
    print("group", taskbook.group)
    for task in taskbook.tasks:
        print(f"run {task.module}: {task.name}")

if __name__ == "__main__":
    main()

Running taskable.py with the list mode enabled:

$ python3 taskable.py -l taskbook.yml
cheesy.package
lettuce.service
saucy.copy

Woot! Static analysis! And it was easy to implement because our data model is so well-defined.

Summary

In this tutorial, we've built up a versatile config loading mechanism.

This setup works just as well for tiny command line utilities as it does for large, complex data formats like task workflows, specifications, and so on. You can keep growing your application by adding new fields and new data models, and avoid the malignant technical debt that springs from a muddled early config implementation.
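For example, a new optional field with a default won't break existing taskbooks; a hypothetical when condition on Task might look like this:

# hypothetical new field with a safe default (sketch)
@define
class Task:
    name: str
    module: str
    options: dict[str, Any] = field(factory=dict)
    when: str | None = None  # new and optional, so old taskbooks still load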

The best part? Your configuration will be stable and serve as the bedrock of your application, now and in the future. In the words of Eric S. Raymond:

"Smart data structures and dumb code works a lot better than the other way around."

Stay smart, people! 😄