Pedro Rodriguez


Research Scientist in Natural Language Processing

Tips

Here is a collection of software, tips, and resources I've found useful. I like to keep them on a publicly accessible webpage, and I hope they're helpful to others as well. I've framed some of these as FAQs to improve discoverability.

What are useful python cheatsheets?

Software

What software is useful for writing research papers?

How do you do X in Plotnine?

Operating Systems

  • I use Arch Linux on machines I own.
  • For cloud instances, I use Ubuntu.

What are some awesome (rust-based) command line tools?

How can I install software on linux without root?

What is some software for managing ML experiments?

What are good python debugging tools/tricks?

  • ipdb and pdb are fantastic for command line debugging.
  • To start the debugger when allennlp errors, run: ipython -m ipdb (which allennlp) -- train config.jsonnet and press c to continue when the debugger prompt appears.
  • If Anaconda pip installations of source packages cause g++ errors like "file format not recognized", rename Anaconda's ld to ld_ so that pip uses the system version: https://github.com/pytorch/pytorch/issues/16683#issuecomment-459982988
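The same "debug on error" idea can also be set up from inside a script with the standard library's pdb. This is only a minimal sketch: the names debug_on_error and install_debug_hook are hypothetical, and it assumes you want the debugger to open only when running in an interactive terminal.

```python
import pdb
import sys
import traceback


def debug_on_error(exc_type, exc_value, tb):
    """Print the traceback, then open a post-mortem debugger at the failing frame."""
    traceback.print_exception(exc_type, exc_value, tb)
    # Only drop into the debugger when attached to an interactive terminal,
    # so batch jobs and CI runs still exit normally.
    if sys.stdin is not None and sys.stdin.isatty():
        pdb.post_mortem(tb)


def install_debug_hook():
    """Route uncaught exceptions through debug_on_error (hypothetical helper)."""
    sys.excepthook = debug_on_error
```

Calling install_debug_hook() near the top of a script means any uncaught exception lands in a pdb prompt at the frame that raised it. Swapping pdb for ipdb should work the same way if ipdb is installed.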

How can I search for types of Wikipedia pages?

What is some software for data analytics/distributed computing?

What are good python libraries for creating websites?

  • For small APIs or websites where you don't need/want a pre-made user system, use FastAPI
  • For a more "out of the box" but more opinionated framework, use Django
  • For static sites (like this page), use Pelican

What are some good NLP libraries?

  • AllenNLP is an amazing library for research in natural language processing, use it!
  • spaCy: Fantastic, easy-to-use tools for tokenization, dependency parsing, named entity recognition and more, often used in other NLP software.

What data formats should I use?

  • Unless you have a very good reason and have purely numerical data, never use CSV; saying a file is in CSV format is not enough information to parse it (the delimiter, quoting, and escaping rules are all left unspecified).
  • Default to JSON.
  • For large JSON files that are table-like (the root object is an array whose elements look like rows), consider JSON lines/jsonl. Large JSON objects can be expensive to parse and make it difficult to run parallel jobs (e.g., Apache Spark uses line-delimited rows from text files).
  • For data you expect to analyze, you might consider creating a read-only SQLite database and running analysis in SQL.
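As a rough illustration of the last two points, here is a small sketch using only the Python standard library. The rows, file names, and table schema are made up for the example; it writes table-like data as JSON lines, reads it back line by line, loads it into SQLite, and reopens the database read-only for analysis.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical table-like data: every row has the same keys.
rows = [
    {"id": 1, "text": "first example", "score": 0.5},
    {"id": 2, "text": "second example", "score": 0.9},
]

workdir = Path(tempfile.mkdtemp())
jsonl_path = workdir / "rows.jsonl"
db_path = workdir / "rows.db"

# JSON lines: one complete JSON object per line.
with jsonl_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Each line parses on its own, so the file can be streamed or split across workers.
with jsonl_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

# Build the SQLite database once.
conn = sqlite3.connect(db_path)
with conn:  # commits the transaction on success
    conn.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, text TEXT, score REAL)")
    conn.executemany("INSERT INTO rows VALUES (:id, :text, :score)", loaded)
conn.close()

# Reopen it read-only (URI mode) so analysis code cannot modify the data.
ro = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)
mean_score = ro.execute("SELECT AVG(score) FROM rows").fetchone()[0]
ro.close()
```

Opening with mode=ro means an accidental INSERT or UPDATE during analysis fails with an error instead of silently changing the data.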

Hardware

How can I improve my ergonomics and avoid repetitive strain injuries?

  • I bought a Vari electric sit/stand desk and love it. It encourages me to stretch, improves my posture, and gets me moving throughout the day.

How can I improve my home internet?

AllenNLP Tips

  • allennlp sets random seeds deterministically, which helps improve reproducibility of experiments. Occasionally, when doing things like running multiple trials with identical hyperparameters, this behavior causes the results of every trial to be identical. In these cases, it's helpful to manually specify a random seed, for example by using the trial number as the random seed.
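One way to do this without editing the config for every trial is to pass the seeds as overrides on the command line. The sketch below only builds the commands. It assumes your AllenNLP version reads the top-level random_seed, numpy_seed, and pytorch_seed config keys and supports the --overrides flag of allennlp train. The trial_command helper and the experiments/ directory layout are hypothetical.

```python
import json
import shlex


def trial_command(config_path, trial, base_dir="experiments"):
    """Build an `allennlp train` command that uses the trial number as every seed."""
    # Assumed top-level seed keys in the AllenNLP config.
    seeds = {"random_seed": trial, "numpy_seed": trial, "pytorch_seed": trial}
    return [
        "allennlp", "train", config_path,
        "--serialization-dir", f"{base_dir}/trial_{trial}",
        "--overrides", json.dumps(seeds),
    ]


# Print one shell-quoted command per trial; run them however you launch jobs.
for trial in range(3):
    print(shlex.join(trial_command("config.jsonnet", trial)))
```

Each trial gets its own serialization directory and its own seeds, so repeated runs of the same hyperparameters no longer produce identical results, while any single trial stays reproducible.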

Tips from Others

Docs

Where do you keep your configuration files for applications?

Where can I find resources related to the UMD CLIP lab?

Links

Where can I store files at UMD?

UMIACS offers long-term file storage and hosting through object stores, accessed with a set of s3-like utilities. Specific to the clip-quiz group, you should keep the layout of /fs/clip-quiz and the contents of the clip-quiz bucket mirrored to make storing/restoring files easy. For example, moving the directory /fs/clip-quiz/code/old-big-project/ could be done using: cpobj -V -r -f /fs/clip-quiz/code/old-big-project clip-quiz:code/

LaTeX

What format should figures be in?

  • Create PDF versions of figures

What is ~ (the non-breaking space) in LaTeX?

  • LaTeX will not break lines between alpha and beta in alpha~beta

Research

How should I advertise my work on Twitter?