Guidelines on using research tools

What is this document?

This document contains short, informal guidelines and notes on various research tools and coding skills, based on my personal experience and prepared mainly for new graduate students. These are not complete tutorials or carefully prepared manuals, so read and use them at your own risk. I do not endorse any product, website, tool, or company. Happy research coding!

Last update: April 2019

Secure and share your source code using git

Use an online git server for convenience:

  • Register with a reliable git server that allows private repositories (for free). For instance, bitbucket.org offers an academic plan.
  • Create a private repository for your new project, and configure it to give access to your collaborators.
Alternatively, you can set up a private git server, if necessary.

A very short tutorial on git (please look at online tutorials/stackoverflow to understand the details and find answers to your questions):

  • git add "file" Add a file to your repository.
  • git add . Add everything under the current folder recursively to your repository.
  • git commit Take a local backup of the changes that you have made (collaborators cannot access these changes yet). Feel free to commit as frequently as you want; these are just backups.
  • git push Upload all your recent commits to the server. When collaborators do git pull, they will receive these updates.
  • git pull Receive updates from your collaborators.
  • I use the bash alias alias gitsync='(git add . && git-commit-silent && git pull && git push)' to quickly run the steps above, where alias git-commit-silent='git diff-index --quiet HEAD || git commit'
  • Use the file .gitignore to avoid committing large data files to your repository. What you want to commit depends on your project. For example, you definitely do not want to commit dataset images and model files that can be downloaded from the web or regenerated using your code. Instead, you can provide scripts to download training images, pretrained models, etc. from a particular server. (In my own code, I keep such files in a remote folder and make the root data folder a parameter instead.) If the repository holds the LaTeX source of a paper, however, you should commit the images, PDFs, etc. that are needed to compile it.
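The basic add/commit/push/pull cycle above can be demonstrated end-to-end with a throwaway local "server" (a bare repository standing in for bitbucket/github); all file and directory names below are made up for illustration:

```shell
# A bare repository stands in for the online git server.
work=$(mktemp -d)
git init -q --bare "$work/server.git"

# Your local clone of the project.
git clone -q "$work/server.git" "$work/mine" 2>/dev/null
cd "$work/mine"
git config user.email you@example.com    # local identity, for the demo only
git config user.name "You"

echo 'data/' > .gitignore                # keep large data files out of the repo
echo 'print("hello")' > run.py
git add .                                # stage the new files
git commit -qm "first experiment script" # local backup of the changes
git push -q origin HEAD                  # upload the commit to the server

# A collaborator's clone now receives the update.
git clone -q "$work/server.git" "$work/theirs" 2>/dev/null
cat "$work/theirs/run.py"                # -> print("hello")
```

In real use the server URL comes from your hosting provider, and the collaborator would run git pull inside an existing clone rather than cloning fresh each time.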

"Check out" an old git commit for read-only investigation.

  • For simplicity, make sure the git repo is up-to-date, all changes have been committed, and all commits have been pushed.
  • git checkout "hash" will give you that version in a detached state.
  • git switch - will return you to the up-to-date state.
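A minimal sketch of this read-only round trip, in a throwaway repository (file names are made up for illustration):

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email you@example.com   # local identity, for the demo only
git config user.name "You"

echo v1 > notes.txt
git add .; git commit -qm "first version"
old=$(git rev-parse HEAD)               # remember the old commit's hash
echo v2 > notes.txt
git commit -qam "second version"

git checkout -q "$old"                  # detached state: look, but don't commit
cat notes.txt                           # -> v1
git switch -q -                         # back to the up-to-date branch
cat notes.txt                           # -> v2
```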

Partial-color-blindness friendly color palette

The explicit color names/values (adapted from source)

    import matplotlib.pyplot as plt
    import numpy as np

    Palette = [('006BA4', 'Cerulean/Blue'),
               ('FF800E', 'Pumpkin/Orange'),
               ('ABABAB', 'Dark Gray/Gray'),
               ('595959', 'Mortar/Grey'),
               ('5F9ED1', 'Picton Blue/Blue'),
               ('C85200', 'Tenne (Tawny)/Orange'),
               ('898989', 'Suva Grey/Grey'),
               ('A2C8EC', 'Sail/Blue'),
               ('FFBC79', 'Macaroni And Cheese/Orange'),
               ('CFCFCF', 'Very Light Grey/Grey')]

    th = np.linspace(0, 2*np.pi, 128)
    fig, ax = plt.subplots(figsize=(3, 3))
    for j in range(len(Palette)):
        ax.plot(th, np.cos(th) - j, color='#' + Palette[j][0], label=f'C{j}')
    ax.legend()

Effectively use the collaboration tools

We use Slack, Dropbox Paper, Zoom and Google Docs (and sometimes other similar tools) heavily for coordinating research and taking logs in our group. Some suggestions for Slack:

  • Make sure you enable all notifications.
  • Immediately pin messages for ideas, to-dos, tasks that should not be forgotten.
  • Star messages that you personally want to remember.
  • Move pinned items to the main planning and results document (typically in Dropbox Paper or Google Docs), unless they will be done very soon (in a few hours).
  • Use the results document wisely: take detailed notes so that someone else can fully understand the experiments that you have done.

When using online collaborative TeX editors, such as overleaf.com: git-clone the repository on your local machine, cache your git password, and take frequent backups via watch -n 600 git pull, in case the website goes down.

The programming language and deep learning research

There are several very good libraries for deep learning research. We prefer PyTorch, but TensorFlow is also fine.

In some specific cases, a good starting point might be a high-quality public source code relevant to your project.

Love the linux ecosystem

You should progressively become comfortable with using the linux development environment in the terminal.

  • Learn to use vim or emacs very well. The time you spend learning these editors pays off over time.
  • You don't have to become a Makefile or Bazel expert. But you do want to understand what they do, and why people use them for compilation-heavy projects.
  • You do not need root access to compile and use libraries in most cases. If you are compiling with a configure script, then usually you can use its --prefix option to target the installation into an arbitrary folder. Most libraries (including the gcc compiler) can be installed locally.
  • When linking to your locally installed library, you may need to use some of the following parameters/environment variables at compile/run time:
    • gcc/g++ -Wl,-rpath -Wl,"libpath" to hard-code the linking path
    • $LD_LIBRARY_PATH to prioritize your library directories at run time

What OS you use on your laptop is not that important. Linux distros already come with the core utilities that you need. macOS is Unix-based, so it already has native terminal support, and you can install an X11 server if you need one. You can also install various Linux utilities via Homebrew. On Windows, use Cygwin or the Windows Subsystem for Linux to get a Linux-like environment.

Simultaneously running long experiments, changing your code and keeping track of your experimental results

A brief guideline for managing multiple versions of the code, while running experiments at the same time:

  • Avoid making changes in a source code directory while an experiment is running based on it.
  • Instead, use git branching for each new experiment:
    • Create a new local copy of the repository (eg, you may copy-paste the original local repository, or, create a new clone from the original server; just be careful about local git-ignored data files). Suppose that the new directory is called src_newidea.
    • Create a new branch and switch to it via git switch --create new_awesome_idea (run inside the src_newidea folder)
    • Make changes, commits, etc. inside this branch and run your experiment in it.
    • You may switch to this branch later using git switch new_awesome_idea.
  • You may also want to merge the branches where it makes sense to do so:
    • Go to src_newidea
    • Switch to master branch: git checkout master
    • Merge your new code to the master branch and delete the temp branch (if you want): git merge new_awesome_idea, and, git branch -d new_awesome_idea
    • Push your master branch to the server, and pull it from all other git clones of the repository.
  • To see the list of all available branches, use: git branch -a.
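The branch-per-experiment cycle above, condensed into a runnable sketch in a throwaway repository (file and branch names are illustrative):

```shell
work=$(mktemp -d); cd "$work"
git init -q
git config user.email you@example.com   # local identity, for the demo only
git config user.name "You"
echo baseline > model.py
git add .; git commit -qm "baseline"
base=$(git symbolic-ref --short HEAD)   # master or main, depending on your git

git switch -q -c new_awesome_idea       # create the experiment branch
echo "tweak" >> model.py
git commit -qam "try the new idea"      # commits stay on the experiment branch

git switch -q "$base"                   # back to the base branch
git merge -q new_awesome_idea           # fold the experiment into the base
git branch -d new_awesome_idea          # delete the temporary branch
git branch -a                           # list the remaining branches
```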

You can keep track of the experimental results by tagging (naming) the log files and your own notes with the date and the git hash of the code that ran each of your experiments.

  • In this way, you can easily check out the exact source code to (approximately) reproduce your experimental results (assuming you are using reproducible datasets and properly setting random seeds).
  • Here, you may find this experiment-ID generator useful.
  • This approach is also useful if you'd like to evaluate a pre-trained model, where you may need to re-define the test-mode of the computational graph carefully (eg, this seems necessary if you're using BatchNorm or input queues in TensorFlow), which may require model-specific test code. You can easily check out the right version of the source code based on the git-hash-based experiment tag.
  • You may also find the scripts git-writehash and git-checkhash useful for ensuring that the right version of the source code is being used at test time.
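A minimal experiment-tag sketch along these lines (the naming convention below is my own illustration, not a standard): combine a timestamp with the short git hash of the code currently checked out.

```shell
stamp=$(date +%Y%m%d_%H%M%S)
# Short hash of the code being run; fall back gracefully outside a repository.
hash=$(git rev-parse --short HEAD 2>/dev/null || echo nogit)
expid="${stamp}_${hash}"

logdir="$(mktemp -d)/logs/$expid"       # tag the log directory with the ID
mkdir -p "$logdir"
echo "experiment id: $expid"
```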

Learn how to use remote servers

It is important to understand how to use a remote server via ssh in a fluid way. There are several tools and tricks that can make your life easier:

  • Use screen to create sessions that do not terminate when you disconnect. In this manner, for instance, you can keep your code running for several days.
    • I find the following bash definitions useful:
    • alias newscreen='/usr/bin/screen -S' # create a new screen session via newscreen NewScreenName
    • alias screen='/usr/bin/screen -dr' # attach to an existing screen session via screen ExistingScreenName
    • You can find my .screenrc here.
  • If your connection is not fast enough to write code directly on the server, you can use auto-sync tools (Dropbox, Google Drive, etc.) to implement locally and test remotely in a quick, interactive loop. git is a must for backups and collaboration, but its commit/push/pull cycle is not appropriate for this purpose. mosh can also be handy.
  • Typically, writing image files to a folder that auto-syncs or under public_html is sufficient to inspect your code and its graphical outputs.
  • You can use X11 forwarding to see interactive graphics from remote processes. For this, understand the environment variable $DISPLAY, which lets you redirect the output of a python/lua/matlab interpreter.
  • You can alternatively use VNC to see interactive graphics from remote processes. Again, understand the environment variable $DISPLAY.
  • Other alternatives also exist. For example, see the display package for Torch.

How to structure your source code and deal with tens of parameters in your code

  • Make your code parametric (ie. it should take a list of parameters in a dictionary/struct as an input). Avoid hard-coded constants. Especially when parameters are coupled, hard-coded constants easily lead to bugs.
  • But over-parameterizing your code can lead to overly complicated code (it happened to me!). If something will most likely remain a constant, just keep it as is. Plus, at times, it may be easier to clone and rewrite parts of your source code from scratch rather than adding several new parameters to a bulky pipeline. While keeping source code clean can be considered a form of art, being pragmatic in research programming is important.
  • To avoid code duplication, consider writing isolated functions in separate utility packages. Writing several packages where each does one thing well is preferable over a single mixed and cluttered utility package.
  • Do frequent commits with meaningful commit messages to be able to go back in time. Quite frequently, new students find themselves unable to reproduce their old results!
  • Another ingredient of reproducible results: manually set random seeds to fixed values in your code.
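As a small illustration of the points above, here is a hypothetical run script that keeps every knob, including the random seed, in one place and logs the exact configuration next to the results (train.py and its flags are made up, so the trainer invocation is commented out):

```shell
SEED=42                                  # fixed seed for reproducibility
LR=0.001                                 # an example hyper-parameter
OUT="$(mktemp -d)/run_lr${LR}_seed${SEED}"
mkdir -p "$OUT"

# Log the exact configuration next to the results.
echo "seed=$SEED lr=$LR" > "$OUT/params.txt"
# python train.py --seed "$SEED" --lr "$LR" --out "$OUT"   # hypothetical trainer
cat "$OUT/params.txt"                    # -> seed=42 lr=0.001
```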

Vampire-friendly rendering for LaTeX

  • Add the following lines to your latex preamble.
                \IfFileExists{darkmode.tex}{
                    \usepackage{pagecolor} 
                    \definecolor{myfg}{gray}{0.94}
                    \definecolor{mybg}{gray}{0}
                    \input{darkmode.tex} % myfg and mybg can be altered in darkmode.tex using definecolor.
                    \pagecolor{mybg}
                    \color{myfg}
                }{} 
  • Add darkmode.tex to .gitignore
  • Create / delete darkmode.tex to locally create darkmode rendering without affecting the human collaborators.