Pedro Rodriguez

PhD Candidate in
Artificial Intelligence, Machine Learning, and Natural Language Processing


Over my career I've made many mistakes, occasionally learned from them, and sometimes found useful software, tips, and resources. I don't expect to remember all or even most of these, so I compile everything here for quick and easy reference on the web; hopefully it turns out to be helpful for others as well.


Data Formats

  • Unless you have a very good reason and purely numerical data, never use CSV; saying a file is in CSV format is not enough information to parse it, because the delimiter, quoting, escaping, and header conventions all vary
  • Default to JSON
  • For large JSON files that are table-like (the root object is an array whose elements look like rows), consider JSON Lines (jsonl). Large JSON objects can be expensive to parse and make it difficult to run parallel jobs (e.g., Apache Spark reads line-delimited rows from text files)
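As a minimal sketch of the JSON Lines idea, the helpers below write one JSON object per line and read the rows back lazily, so a large file never has to be parsed in one piece. The names write_jsonl and read_jsonl and the example rows are made up for illustration; only the standard library is used.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write each row as one JSON document on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # json.dumps without indent never emits a raw newline,
            # so every row occupies exactly one line.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield rows one at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical rows, used only to exercise the helpers.
    rows = [{"id": 1, "text": "first"}, {"id": 2, "text": "second\nline"}]
    path = os.path.join(tempfile.mkdtemp(), "rows.jsonl")
    write_jsonl(path, rows)
    for row in read_jsonl(path):
        print(row)
```

Because each line is an independent document, the file can be split on line boundaries and handed to parallel workers without any of them parsing the rest.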




Tips from Others



UMD CLIP Resources



UMIACS offers long-term file storage and hosting through object stores, accessed with a set of S3-like utilities. For the clip-quiz group, mirror the layout of /fs/clip-quiz in the clip-quiz bucket so that storing and restoring files is easy. For example, the directory /fs/clip-quiz/code/old-big-project/ could be copied to the bucket using: cpobj -V -r -f /fs/clip-quiz/code/old-big-project clip-quiz:code/
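To make the mirroring rule concrete, here is a small sketch that derives the bucket destination from a path under /fs/clip-quiz and builds the same cpobj invocation as the example above. The function name cpobj_command is hypothetical, the flags are copied from the example rather than from cpobj's documentation, and the handling of paths directly under the root is an assumption.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("/fs/clip-quiz")  # local directory being mirrored
BUCKET = "clip-quiz"                   # bucket with the same layout


def cpobj_command(local_path):
    """Build the cpobj argument list that mirrors local_path into the bucket.

    Raises ValueError if local_path is not under ROOT.
    """
    path = PurePosixPath(local_path)  # also drops any trailing slash
    parent = path.relative_to(ROOT).parent
    # Assumption: a path directly under ROOT maps to the bucket root.
    prefix = "" if str(parent) == "." else f"{parent}/"
    # Flags (-V -r -f) are taken verbatim from the example in these notes.
    return ["cpobj", "-V", "-r", "-f", str(path), f"{BUCKET}:{prefix}"]


if __name__ == "__main__":
    print(" ".join(cpobj_command("/fs/clip-quiz/code/old-big-project/")))
```

The list form can be passed to subprocess.run directly, which avoids shell quoting problems for paths with spaces.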


  • What is ~? A non-breaking space: LaTeX will not break the line between alpha and beta in alpha~beta
  • Create PDF versions of figures


  • ipdb and pdb are fantastic for command line debugging
  • To start the debugger if allennlp errors: ipython -m (which allennlp) -- train config.jsonnet and press c to continue when the terminal starts. Here (which allennlp) is fish-style command substitution; in bash, write $(which allennlp)
  • If pip installations of source packages under Anaconda cause g++ errors like "file format not recognized", rename Anaconda's ld to ld_ so that the build uses the system version
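The same "drop into the debugger on error" workflow can be sketched inside a script with the standard library's pdb.post_mortem. The wrapper name run_with_post_mortem and its interactive flag are made up for this example; swapping in ipdb.post_mortem should work the same way if ipdb is installed.

```python
import pdb
import sys
import traceback


def run_with_post_mortem(func, *args, interactive=True, **kwargs):
    """Call func; on an exception, print the traceback and open pdb there."""
    try:
        return func(*args, **kwargs)
    except Exception:
        traceback.print_exc()
        if interactive:
            # Opens the debugger in the frame where the exception was raised.
            pdb.post_mortem(sys.exc_info()[2])
        raise  # re-raise so callers still see the failure


if __name__ == "__main__":
    def buggy(x):
        return 1 / x

    # interactive=False keeps this demo from blocking on a debugger prompt.
    try:
        run_with_post_mortem(buggy, 0, interactive=False)
    except ZeroDivisionError:
        print("caught ZeroDivisionError after printing the traceback")
```

With interactive=True the prompt appears at the failing frame, where the usual pdb commands (p, up, down, c) are available.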