Pedro Rodriguez

Research Scientist in Natural Language Processing


Here is a collection of software, tips, and resources I've found useful. I like to keep them on a publicly accessible webpage, and I hope they're helpful to others as well. I've framed some of these as FAQs to improve discoverability.

What are useful python cheatsheets?


What software is useful for writing research papers?

How do you do X in Plotnine?

Operating Systems

  • I use Arch Linux on machines I own.
  • For cloud instances, I use Ubuntu.

What are some awesome (rust-based) command line tools?

How can I install software on linux without root?

What is some software for managing ML experiments?

What are ways to make git better?

What are good python debugging tools/tricks?

  • ipdb and pdb are fantastic for command line debugging
  • To drop into a debugger when allennlp errors, run: ipython -m ipdb (which allennlp) -- train config.jsonnet and press c to continue when the debugger prompt appears
  • If Anaconda pip installations from source packages cause g++ errors like "file format not recognized", rename Anaconda's ld to ld_ so that pip uses the system version
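
As a rough sketch of the same "debug on error" idea from inside a script, the standard library's pdb.post_mortem can be called when an exception escapes. The helper name run_with_postmortem and the function buggy_division are hypothetical, made up for illustration; ipdb.post_mortem should be usable in place of pdb.post_mortem if ipdb is installed.

```python
import pdb
import sys
import traceback


def run_with_postmortem(func, *args, debugger=pdb.post_mortem, **kwargs):
    """Call func; if it raises, print the traceback and open a debugger on it.

    `debugger` is any callable that accepts a traceback object. It defaults to
    pdb.post_mortem and is a parameter so it can be swapped out.
    """
    try:
        return func(*args, **kwargs)
    except Exception:
        traceback.print_exc()
        # sys.exc_info()[2] is the traceback of the exception being handled.
        debugger(sys.exc_info()[2])
        raise


def buggy_division(numerator, denominator):
    # Hypothetical function that exists only to trigger an error.
    return numerator / denominator


# Usage (opens an interactive pdb prompt at the failing frame):
# run_with_postmortem(buggy_division, 1, 0)
```

Inside the prompt, commands such as p denominator, up, and down inspect the state at the point of failure.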

How can I search for types of Wikipedia pages?

What is some software for data analytics/distributed computing?

What are good python libraries for creating websites?

  • For small APIs, or websites where you don't need/want a pre-made user system, use FastAPI
  • For something more "out of the box", but more opinionated, use Django
  • For static sites (like this page), use Pelican

What are some good NLP libraries?

  • AllenNLP: An amazing library for research in natural language processing, use it!
  • spaCy: Fantastic, easy to use tools for tokenization, dependency parsing, named entity recognition and more, often used in other NLP software.

What data formats should I use?

  • Unless you have a very good reason and have purely numerical data, never use CSV; knowing that a file is "CSV" does not tell you the delimiter, quoting, escaping, or encoding, so it is insufficient information to parse the file
  • Default to JSON
  • For large JSON files that are table-like (the root object is an array, and its elements look like rows), consider JSON lines/jsonl. Large JSON objects can be expensive to parse, and make it difficult to run parallel jobs (e.g., Apache Spark uses line-delimited rows from text files)
  • For data you expect to analyze, you might consider creating a read-only SQLite database and running analysis in SQL.
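
A minimal sketch of the last two suggestions, using only the Python standard library: rows are written as JSON lines, streamed back one line at a time, loaded into SQLite, and then queried through a read-only connection. The table name questions, its columns, and the helper names are all made up for illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Yield rows lazily, so the whole file is never parsed at once."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def build_sqlite(db_path, rows):
    """Create the (hypothetical) questions table and load the rows into it."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE questions (id INTEGER PRIMARY KEY, category TEXT, score REAL)"
        )
        conn.executemany(
            "INSERT INTO questions (id, category, score) VALUES (:id, :category, :score)",
            rows,
        )
    conn.close()


def open_read_only(db_path):
    """Open the database with SQLite's mode=ro URI option, so writes fail."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Illustrative data only.
example_rows = [
    {"id": 1, "category": "history", "score": 0.5},
    {"id": 2, "category": "science", "score": 0.75},
    {"id": 3, "category": "science", "score": 0.25},
]
workdir = Path(tempfile.mkdtemp())
jsonl_path = workdir / "questions.jsonl"
db_path = workdir / "questions.db"

write_jsonl(jsonl_path, example_rows)
build_sqlite(db_path, read_jsonl(jsonl_path))

conn = open_read_only(db_path)
mean_by_category = dict(
    conn.execute("SELECT category, AVG(score) FROM questions GROUP BY category")
)
conn.close()
```

Because each line is an independent JSON document, the jsonl file can also be split by line and processed in parallel.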


How can I improve my ergonomics and avoid repetitive strain injuries?

  • I bought a Vari electric sit/stand desk and love it. It encourages me to stretch, improves my posture, and gets me moving throughout the day.

How can I improve my home internet?

AllenNLP Tips

  • allennlp sets random seeds deterministically, which helps improve reproducibility of experiments. Occasionally, when doing things like running multiple trials of identical hyperparameters, this behavior causes results for each trial to be identical. In these cases, it's helpful to manually specify a random seed; for example, using the trial number as the random seed.
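
One way to do this, sketched under assumptions: to my understanding, AllenNLP configurations accept top-level random_seed, numpy_seed, and pytorch_seed keys, and allennlp train accepts -s and --overrides; please check both against the version you have installed. The helper names, the base_seed parameter, and the trial_N directory layout are hypothetical.

```python
import json


def seed_overrides(trial, base_seed=0):
    """Build a JSON overrides string that seeds every generator from the trial number.

    Assumes the config keys random_seed, numpy_seed, and pytorch_seed.
    """
    seed = base_seed + trial
    return json.dumps(
        {"random_seed": seed, "numpy_seed": seed, "pytorch_seed": seed}
    )


def train_command(config_path, output_dir, trial):
    """Return an argument list for one trial; pass it to subprocess.run to launch."""
    return [
        "allennlp",
        "train",
        config_path,
        "-s",
        f"{output_dir}/trial_{trial}",  # hypothetical per-trial directory layout
        "--overrides",
        seed_overrides(trial),
    ]


# One command per trial, each with a different seed.
commands = [train_command("config.jsonnet", "experiments", t) for t in range(3)]
```

Recording the seed alongside each trial's results keeps every run individually reproducible.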

Tips from Others


Where do you keep your configuration files for applications?

Where can I find resources related to the UMD CLIP lab?


Where can I store files at UMD?

UMIACS offers long-term file storage and hosting through object stores, accessed with a set of s3-like utilities. For the clip-quiz group specifically, you should mirror the layout of /fs/clip-quiz in the clip-quiz bucket to make storing and restoring files easy. For example, moving the directory /fs/clip-quiz/code/old-big-project/ could be done using: cpobj -V -r -f /fs/clip-quiz/code/old-big-project clip-quiz:code/


What format should figures be in?

  • Create PDF versions of figures

What is ~ (the non-breaking space) in LaTeX?

  • ~ is a non-breaking space: LaTeX will not break the line between alpha and beta in alpha~beta


How should I advertise my work on Twitter?