Pedro Rodriguez

Research Scientist in Natural Language Processing

Debugging Machine Learning Code

Developing new machine learning code is often error-prone and takes many iterations of the write-run-debug loop. In this context, I specifically mean saving time fixing errors that crash the program, not those that cause models to be incorrect in subtle ways (for that, see Andrej Karpathy's blog post). Here are a few tricks I use to preserve my sanity while developing new model code written in Python.

A quick outline of the tips:

  1. Debuggers
  2. Document tensor shapes
  3. Verbose logging with the logging module
  4. Debugging dataset
  5. Unit tests

Use Debuggers

Modern machine learning code often contains several abstraction levels (a good thing!), which unfortunately makes it more difficult to dig into the plumbing to fix data loading or tensor shape errors. Debuggers are exceptionally useful in these cases. I use them in one of two ways.

If I know that the program fails, I can start a debugger on failure by using one of:

  1. import pdb; pdb.set_trace() or breakpoint() on the line that fails
  2. If it fails on only some iterations (e.g., one data point is bad), then python -m pdb to start a session and press c to continue the program. The interpreter will drop you into a debug session when the failure occurs.
  3. For allennlp specifically, this command works well: ipython -m ipdb $(which allennlp) -- train config.jsonnet
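For failures that only show up at a particular iteration, a conditional breakpoint() is a lightweight variant of option 2. Here is a sketch; the toy loop, the bad_step argument, and the PYBREAK guard are all hypothetical names made up for illustration:

```python
import os

def train(num_steps, bad_step=None):
    """Toy training loop that can drop into pdb right before a chosen step.

    The debugger only opens when the PYBREAK environment variable is set,
    so normal runs are unaffected.
    """
    losses = []
    for step in range(num_steps):
        if step == bad_step and os.environ.get("PYBREAK") == "1":
            breakpoint()  # inspect locals right before the suspect step
        losses.append(1.0 / (step + 1))  # stand-in for a real loss value
    return losses

losses = train(5, bad_step=3)  # runs to completion unless PYBREAK=1
```

This avoids stepping through thousands of healthy iterations by hand before reaching the one that breaks.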

Side note: I use ipdb instead of pdb since it's more similar to the IPython terminal.

Document Tensor Shapes

It's extremely helpful to know tensor shapes during development, and shape annotations reduce the time it takes to understand the code when you look at it again later. Here is a sample forward pass of a PyTorch model with shape annotations:

def forward(self, text, length):
    output = {}
    # (batch_size, seq_length)
    text = text['tokens'].cuda()

    # (batch_size, 1)
    length = length.cuda()

    # (batch_size, seq_length, word_dim)
    text_embed = self._word_embeddings(text)

    # (batch_size, word_dim)
    text_embed = self._word_dropout(text_embed.sum(1) / length)

    # (batch_size, hidden_dim)
    hidden = self._encoder(text_embed)

    # (batch_size, n_classes)
    output['logits'] = self._classifier(hidden)
    return output
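Shape comments are documentation only, so a cheap way to keep them honest is to assert shapes at key points. Here is a NumPy sketch of the same sum-then-normalize pooling step; the dimensions are made up for illustration:

```python
import numpy as np

batch_size, seq_length, word_dim = 4, 10, 8

# (batch_size, seq_length, word_dim): embedded tokens
text_embed = np.ones((batch_size, seq_length, word_dim))

# (batch_size, 1): number of tokens in each example
length = np.full((batch_size, 1), float(seq_length))

# (batch_size, word_dim): sum over the sequence dimension, then normalize
pooled = text_embed.sum(axis=1) / length
assert pooled.shape == (batch_size, word_dim), pooled.shape
```

An assert that fires immediately at the offending line is far easier to act on than a shape mismatch error raised several layers deeper in the model.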

Verbose Logging to Terminal and Files

I often see print statements, but not much usage of the Python logging module in model code. Although it takes some setup, there are several benefits to using it over print.

  1. Timestamps are logged "for free," which is helpful for understanding where most of the execution time is spent.
  2. Logging can be configured to output the module a statement comes from, which makes debugging faster.
  3. Logging can also be configured to write to a file. This has saved me a few times when I didn't expect to need print output when I ran the model, but later needed it.
  4. This leads me to: be verbose in what you log. I love that the logging in allennlp includes things like model parameters (Sample Log).

I typically include this code in my package for logging to standard error and a file:

# In a file like util.py
import logging

def get(name):
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)

    if len(log.handlers) < 2:
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

        fh = logging.FileHandler('mylogfile.log')
        fh.setFormatter(formatter)
        log.addHandler(fh)

        sh = logging.StreamHandler()
        sh.setFormatter(formatter)
        log.addHandler(sh)

    return log

# In my model file
from mypackage import util

log = util.get(__name__)
log.info("hello world")

Small Debug Dataset

Another common issue is debugging an error when dataset loading takes a long time. It's very annoying to debug shape issues when your dataset takes twenty minutes to load. One trick (aside from using pdb) is to make a small debug dataset and test with it before using the full dataset. If your dataset is in a line-delimited format like jsonlines, it may be as easy as $ head -n 100 mydata.jsonl > debug_dataset.jsonl

Unit Tests

Last but not least, writing a few unit tests is often helpful. Specifically, I like writing unit tests for data or metric code that is not obviously correct by inspection. PyTest has worked very well for this purpose since it's easy to use and configure.

Here is a simple example of my configuration (pytest.ini):

[pytest]
testpaths = awesomemodel/tests/
filterwarnings =

A sample test in awesomemodel/tests/

from numpy.testing import assert_almost_equal
import pytest

from awesomemodel.numbers import AlmostZero

def test_zero():
    zero = AlmostZero()
    assert_almost_equal(0.0, zero())

And running the test

$ pytest
======================================================================================= test session starts ========================================================================================
platform linux -- Python 3.8.0, pytest-5.3.1, py-1.8.0, pluggy-0.13.1
rootdir: /tmp/src, inifile: pytest.ini, testpaths: awesomemodel/tests/
collected 1 item                                                                                                                                                                                   

awesomemodel/tests/ .                                                                                                                                                            [100%]

======================================================================================== 1 passed in 0.13s =========================================================================================
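When metric code has several edge cases, pytest.mark.parametrize keeps the tests compact: one test function runs once per row of inputs. A sketch with a hypothetical accuracy metric (the function and values are made up for illustration):

```python
import pytest

def accuracy(preds, labels):
    """Fraction of predictions that match their labels (hypothetical metric)."""
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

@pytest.mark.parametrize("preds,labels,expected", [
    ([1, 0, 1], [1, 0, 1], 1.0),    # all correct
    ([1, 1, 1], [1, 0, 0], 1 / 3),  # one of three correct
    ([0, 0], [1, 1], 0.0),          # none correct
])
def test_accuracy(preds, labels, expected):
    assert accuracy(preds, labels) == pytest.approx(expected)
```

Each parametrized case shows up as a separate item in the pytest report, so a failure points directly at the offending input.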

For reference, the directory structure

$ tree
├── awesomemodel
│   ├──
│   ├── tests
│   │   ├──
│   │   └──
│   └──
└── pytest.ini

2 directories, 5 files

Hopefully you'll find some of these tricks helpful for efficiently developing your own models! Thanks to Joe Barrow for the discussion that inspired this post and to Shi Feng for edits and comments. In my next post, I'll briefly describe how I use Semantic Scholar to write literature reviews and related work sections in papers.