Research Power Tools is a work-in-progress book project, written by Philipp Winter, that discusses tools for more productive computer science research.

The book will be available by Summer 2022. In the meantime, please take a look at the open review version below. You are able (and highly encouraged!) to leave comments. Let me know what you find confusing, interesting, or what you would like to read more about, and I will update the draft accordingly.

Finally, if you like this project, please tell a friend about it!

# Research Power Tools

Open review version. Please leave feedback!

> Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
>
> – Abraham Lincoln

# Introduction

Research is messy. Our body of knowledge is scattered across countless journals, presentations, blog posts, and tweets, making it difficult to get up to speed in a field. Data collection often requires numerous iterations and is frequently poorly documented. Subsequent data analysis is sometimes powered by fragile code and obscure plotting systems. The resulting research paper is first called paper.doc, then paper-new.doc, paper-new-2.doc, paper-final.doc, and eventually paper-final-FINAL.doc.

It doesn’t have to be like that. A well-structured research project is not only possible but well within your reach – as long as you know how. That’s what this book is about. The following chapters introduce effective tools for doing research, in particular related to organising, versioning, reading, writing, programming, visualising, automating, and communicating. These tools are software, processes, or ways of thinking.

In addition to discussing these tools, I will explain why they are effective. You will realise that logging your time will give you freedom, good writing is the opposite of academic writing, and having a Twitter account isn’t all about posting cat pictures.

Building an effective tool set has a significant return on investment in terms of time, sanity, and research quality. You will save time by automating parts of your research pipeline and putting them under version control; you will keep your sanity by having the peace of mind that your pipeline is robust; and your research will increase in quality because you’re now less likely to make unnecessary mistakes.

## Who is this book for?

I wrote this book primarily for new graduate students in computer systems. The field of computer science consists, roughly, of “theory” and “systems.” I myself work in systems, and you will find this book most useful if you are a systems researcher – for example in a field like security, programming languages, or networking.

If this includes you, the entire book should be of interest. The secondary audience is people who program, write in LaTeX, or do research – basically anyone in a STEM field. This crowd will find a subset of this book useful.

## Why did I write this book?

I had little research experience when I started my Ph.D. I faced a steep learning curve in the first couple of years, and research frequently felt overwhelming: in addition to reading hundreds of papers and trying to carve out your own area of research, you have to teach, take courses, and learn how to do research in the first place. I’ve always been curious about how other people work and cope, so I wrote the book that I wanted to read as a young student.

Perverse incentives in research place too much value on the number of papers published and citations accrued. As a result, people cut corners to maximise their research output – often sacrificing rigor. Poorly documented workflows and sloppy code can lead to mistakes that jeopardise the correctness of a project. By adopting strict and effective workflows, we can minimise these mistakes and save time.

An unfortunate amount of knowledge in any scientific field is implicit, meaning that it’s rarely spelled out. Pinker calls this phenomenon the “curse of knowledge” (Pinker 2015, chap. 3): Years of experience in a field make it difficult to put yourself in the shoes of a newcomer. Examples of (mostly) implicit knowledge are the reputations of conferences, collaboration etiquette in research projects, and effective organisation of your time. People do write and talk about these topics, but you may have a hard time finding a course or book that adequately addresses them. In this book, I spell out aspects of research that aren’t elaborated on very much elsewhere – which means that parts of this book may seem obvious to you.

## How should you read this book?

All of the chapters are self-contained, so dive into whatever subject appeals to you. The chapter on versioning is particularly important and is referenced several times in subsequent chapters. Finally, this is a hands-on book and I strongly encourage you to read it while in front of your laptop – ideally with an open terminal. You’ll retain more if you put what you read into practice.

# Organising

A combination of stubbornness and luck got me through my Ph.D. and postdoctoral training without a todo list or schedule. Each day, I would simply continue work where I stopped the day before. My lack of organisation didn’t get me into trouble because, for the most part, I was involved in only one or two projects at a time, which was manageable. This changed after my postdoc. I suddenly found myself juggling software projects, monthly reports, research papers, and blog posts, all on tight deadlines. I had no choice but to become more organised.

You may – like me – get away with poor organisational skills during your Ph.D., but why not improve these skills before it’s absolutely necessary? Why not become more efficient in the process, and also more prepared for your post-Ph.D. life, which will most likely be more demanding and require juggling several projects in parallel? This chapter discusses organisational tools and behavioural hacks that will help you stay on track and make steady progress. After all, research is a marathon, not a sprint.

## Incorporating new skills

Incorporating new organisational tools requires forming new habits. When you adopt new habits – such as organising, exercising, or eating healthily – you may find that you can keep it up for a few days, but then drift back into your old pattern. In his book Atomic Habits (Clear 2018), James Clear provides insight into why that is: the key to forming good habits (and eliminating bad ones) is to i) create an obvious cue that triggers the action, ii) make the action attractive, iii) make the action easy to perform, and iv) make its outcome immediately satisfying.

## Curate a todo list

As a Ph.D. student (and even as a postdoc), I spent most of my days working on very few tasks – so few, in fact, that I was able to keep my todo list in my head. Most days, I worked on moving my research project forward, interrupted only by the occasional paper review, presentation, or work with students. I had few tasks and deadlines to keep in mind at any given time. Once I transitioned from my postdoc into the “real world”, I quickly realised that I needed a better way to keep track of tasks. I suddenly found myself having to push forward several small projects: writing papers, fixing bugs in complex code bases, organising workshops, applying for grants, and analysing data sets. It was no longer possible to keep everything in my head, so I started experimenting with todo list tools. Console tools were a bit too cumbersome, and web tools were inconvenient and clunky, so I eventually settled on maintaining my list in a simple text file.

In essence, a todo list helps you keep track of the things that you need to get done. The trick is to make it a seamless part of your workflow. If adding a new item to your todo list involves spending thirty seconds finding todo.docx on your hard drive, waiting five seconds for Microsoft Office to open, and another three seconds to scroll to the end of the file to add a new task, you will soon give up.

If you feel the need to curate a todo list, but find yourself unable to do so, work on minimising friction. First, make it so that you can open your todo list as quickly as possible. For example, configure a keyboard shortcut that opens your file. Then, if a new task is assigned to you during a meeting, you can add it to your list within five seconds. Once a new task is in your list, you can forget about it, freeing you of the cognitive load of having to remember it.
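As a minimal sketch of what this could look like in a terminal-centric setup (the alias name and file path are assumptions for this example):

```
# Hypothetical ~/.bashrc alias: open the todo list in vim; the bare "+"
# tells vim to jump to the last line of the file.
alias todo='vim + ~/doc/todo.md'
```

Bound to a keyboard shortcut in your window manager, the same command gets you from meeting to todo list in a second or two.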

If you work on multiple devices and need your todo list on all of them, you will have to sync it somehow. In this case, browser tools may be your best bet because they can do the syncing for you. Manual syncing is out of the question – it’s not sustainable and you will wind up with out-of-sync lists.

The todo list format I eventually adopted is a markdown-formatted section that lives in the same text file as my work log (see the section on work logs). Some tasks are more pressing than others. To reflect this, I use three sections: “today”, “this week”, and “eventually”. Here is an excerpt of what my todo list currently looks like:

```
# TODO

## Today

* work on presentation for hagenberg
* write monthly team report
* figure out how to move forward with #32126

## This week

* finish python 3 port of bridgedb (#30946)
* write introduction and impacts section for ttp grant
* read salmon research paper (#29288)

## Eventually

* refactor emma with dcf's feedback (#30794)
* wrap up expired /keys issue (#17548)
* come up with a solution to bridgedb's broken captcha (#24607)
* look into moat "password" idea (#28015)
```

When I finish a task, I move it from my todo list to my work log, which is in the same file, so it literally takes seconds. This minimises friction and makes curating my todo list easy enough that I actually stick with it.

Of course, what I do doesn’t matter: the best todo list is the one that works for you. I’m describing my workflow in the hope that it helps you discover your own. My workflow is heavily centered around text files and command line tools. You may be more of a browser person, in which case a pinned browser tab may work significantly better. Experiment with different workflows until you find one that you can sustain.

A habit that I picked up relatively recently, after reading Cal Newport’s excellent book Deep Work (Newport 2016), is to plan my day. Each morning, right after I turn on my laptop, I take a look at my todo list and make a plan for the day, using 30-minute blocks. I draw these blocks on my tablet, but pen and paper work just as well.

A day without a schedule risks devolving into yak shaving: imagine you want to fix that annoying bug in your code that has been messing with your experiments. While thinking about how to best fix the bug, you notice that the function that contains it has poor documentation. So you spend a moment updating the documentation. While doing that, you realise that your functions follow an inconsistent documentation style, which really annoys the perfectionist in you. So you harmonise the way functions are documented in your code, and in the process, learn that the documentation tool you’re using has released a new version with convenient new features. However, your operating system doesn’t have the newest packages yet, so you set out to compile it manually. Three hours later, you find yourself hunched over your laptop, covered in sweat, finally with the new version of your documentation tool. That bug that you originally intended to fix? It’s still there.

I used to come to the office each day without a clear idea of what needed to get done. I would typically continue work where I left off the day before, or turn my attention to whatever seemed the most urgent. This approach may suffice if most of your time goes into a single project whose details you can keep in your head, but it falls apart in the face of more complex responsibilities.

You may think that a detailed plan for each day impedes your creativity. In fact, the opposite is true. By spending a few minutes planning in the morning, you reduce your cognitive load throughout the day. You won’t have to think about what to work on next, or when it’s time to switch tasks. You already took care of that at the start, leaving the rest of the day for deep thinking with minimal context switches.

Perhaps the biggest benefit of a planned day is that it helps you stay on track. I have the annoying habit of polishing finished work more than is necessary or even useful. Having a daily plan in front of my nose serves as a reminder that the perfect is the enemy of the good, and that many other tasks are still waiting to get done. This helps me move on to the next task quickly, which makes me more productive throughout the day. Higher productivity means more happiness. At the end of the day, I feel that I have accomplished what I wanted to, making it easy to get out of “work mode.” When I feel I haven’t accomplished enough, I find it difficult to leave work behind because I keep thinking about unfinished tasks. Needless to say, this is far from productive and only prevents me from relaxing and recharging. A detailed plan for the day is a good antidote that helps draw a clear line between work and personal life.

Do you know that feeling of taking a break from work, only to catch yourself an hour later watching obscure YouTube videos? Or the feeling of having spent a full day working but ending up with little or nothing to show for it? As if the day has passed and you’ve accomplished nothing? The solution to these problems is to establish a tight grip on your most precious possession: your time.

I use Time Tracker on Debian Linux. This lightweight tool lives in my system tray, allowing me to quickly open it and take note when I’m switching from one task to another. Each time I switch between tasks, I open Time Tracker and jot down what I’m going to work on next. Examples of tasks are “answering email,” “changing database API,” and “reading research paper.” Tasks like “writing” or “programming” are likely too general, while tasks like “adding second paragraph to introduction of research paper” are too specific. On a typical day, I end up with five to ten tasks. At the end of the day, I know exactly what I did and can contrast it with what I was supposed to do.

Tracking your time at such granularity may feel oppressive and stressful. After all, it’s yet another thing to remember and worry about. That’s exactly what I thought until I started doing it, but I learned that policing yourself helps you stay focused. It’s easy to feel busy all day long without really getting anything done: you can spend several hours going over meeting notes, mulling over the next email, or stressing about all the things that you need to accomplish, and still have little to show for it. Keeping track of exactly what I’m doing throughout the day helps me notice when I’m actually productive: I’m productive when I finish a handful of well-defined tasks. If you are working on a big project, try to split it into tasks, so you can accomplish a handful of them each day.

Part of my day job consists of working on development tickets for sponsors. These tickets consist of software bugs, feature requests, or small projects. My employer, The Tor Project, is mostly funded through grants, and we need to have a good understanding of how much time a specific development task takes. How long does it take to set up a testbed to evaluate a new pluggable transport protocol? An hour? A day? A week? It’s important to have both experience and data for time estimation because our intuition isn’t always the most reliable predictor. I use my time tracker to record how many hours it took me to complete each bug tracking ticket, which I then compare to the hours I projected it would take. Over time, my estimates have come closer and closer to the time I actually needed (minus the occasional outlier, obviously).
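As a toy sketch of this feedback loop – the file name and all numbers below are made up for illustration – you could record estimates and actuals in a small CSV file and let awk report how far off you typically are:

```
$ cat estimates.csv
ticket,estimated_hours,actual_hours
30946,8,13
31874,16,15
17548,4,9

$ awk -F, 'NR > 1 { sum += $3 / $2; n++ } END { printf "actual/estimated: %.2f\n", sum / n }' estimates.csv
actual/estimated: 1.60
```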

I am a big advocate of keeping a log of what I have accomplished throughout the day. Did you finally manage to finish the introduction of your latest research paper? Mention that in your log. Did you finish refactoring the data processing pipeline in your prototype? This should go straight into your log. Right after I complete a task worth writing down, I spend approximately five seconds adding it to my log and then move on. (Occasionally, I forget to add a completed task right away. I then add it later, or sometimes even the day after.) I don’t bother getting punctuation or even grammar right.

On a typical day, I jot down somewhere between five and ten tasks in my log. I don’t log every single email I write, but I do sometimes log emails if they’re both lengthy and important. Here’s what my Oct 14, 2019 work log looked like:

```
# 2019-10-14

* filed #32064 for improved search results for "download tor" and to incorporate gettor links in our website descriptions
* replied to [redacted] and asked him if he's willing to run default bridges for orbot
* created wiki page to formalise our "support ngos with private bridges" process
* thought about design for system that can scan the reachability of PT bridges (#31874)
* created summary of obfs4 work for quarterly race report and updated our obfs4 ticket (#30716) with current project status
* reviewed #31384 (snowflake.tp.o language switcher)
* reviewed #31253 (webext packaging target)
* responded to email with points worth communicating at otf summit
* started working on #17548 (deal with expired pgp keys)
```

You can see that my phrasing is rough around the edges and that’s okay: you are the primary consumer of this log and you will likely remember what you meant after the fact. You will interact with your work log several times a day, so minimise the friction of adding tasks to it. My work log is always open on a virtual desktop, so I don’t spend any time opening it. I also added a shortcut to my editor, vim, to quickly add today’s date to the bottom of the file. The format of my work log is markdown, which facilitates conversion into other document formats such as HTML or PDF. As always, do what works best for you. I’m a terminal person and enjoy fast, lightweight, and robust console tools. You may be more of a browser person, in which case it’s worth looking at web services that help you log your progress.
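The date shortcut need not be editor-specific. As a sketch, assuming the log lives in ~/doc/log.md and uses one markdown heading per day (both assumptions for this example), a shell one-liner does the same:

```
# Append today's date as a new heading to the end of the work log.
echo -e "\n# $(date -u '+%Y-%m-%d')" >> ~/doc/log.md
```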

Using pandoc, converting a markdown-formatted file to PDF is as simple as running:

```
pandoc log.md -o log.pdf
```

As a student, I would send my log for the past month to my advisor at the end of the month. He appreciated seeing what I’d been up to at a level of granularity that was neither too coarse nor too detailed. Sharing your work log with your advisor also serves as insurance: your advisor won’t be able to ever complain that they were not kept in the loop.

Actually writing down and seeing what you’ve done throughout the day can be surprising – in either a good or bad way, depending on how much you have accomplished. A progress log is important because it allows us to monitor ourselves. If I see one or two days in a row with little progress, I realise it’s time to make a conscious effort to improve my productivity. Progress logs make it less likely that you’ll drift into a slump and perform poorly for many days or even weeks without noticing – something that happened several times during my Ph.D.: I wasted many weeks going down rabbit holes and losing sight of the big picture. I mulled over attractive research ideas that were ultimately infeasible but that I was too stubborn to give up on. I was no longer on track and I didn’t realise that I needed to take a step back and re-evaluate my direction. Looking at your work log makes it easier to notice when you’re off track; sending your work log to your advisor means that they should also be there to help.

Note that tracking your time and logging your work are very similar but serve different goals. Time tracking allows you to monitor yourself on a micro scale while a work log helps on a macro scale. It’s possible to be on the right track but spend a significant part of your day watching YouTube videos (a time tracker would reveal this bad habit) or be efficient in your day-to-day business but not spend time doing the right things (ideally, a work log would help you see this).

## Take meeting notes

I had the privilege of collaborating with 25 people over the course of my research career. Several of the projects I was involved in were not led by me, so I took a back seat. In some of these projects I was surprised by the lack of note taking during meetings. We would meet and discuss the project, as I was used to, but nobody took notes (at least not for the entire group). The implicit expectation was that everyone would remember what was said and what was left to do. Needless to say, this didn’t always work out. Over the following weeks, people forgot what was discussed, only to end up covering some of the very same topics again at the next meeting, and misunderstandings along the lines of “wait, I thought you were supposed to do that?” were common.

You can avoid these issues by consistently taking notes. Before each meeting begins, designate a note taker. In fact, multiple people can take notes simultaneously in a Google Document or a Riseup Pad. The note taker(s) jot down key points during the meeting and todo items for each person. At the end of the meeting, all of the participants get a copy of the notes. If anyone disagrees with any of the notes, they speak up. This way, everyone is on the same page, there is a written record of what was discussed, and it’s easy to go back to find out when a certain task was covered. I can guarantee that your collaborators will love you for taking notes.

Note taking isn’t just for high-stakes meetings with important collaborators. I take notes almost every time I interact with somebody – including with my advisor, during my Ph.D. days. I created a simple shell function that facilitates the creation of a new file for each meeting. I simply type meet alice into a terminal, and the command automatically creates a new file, 2019-12-21-alice.md, and opens it in my text editor. Here’s the script, which you can add to your ~/.bashrc:

```
meet () {
    d=$(date -u '+%Y-%m-%d')
    file="${HOME}/doc/meeting/${d}-${1}.md"
    vim "$file"
}
```

Below is a Makefile that compiles a LaTeX paper with the tool rubber:

```
DOCUMENTS = $(wildcard *.tex)

all: pdf

pdf: $(DOCUMENTS) $(FIGURES)
	GS_OPTIONS=-dPDFSETTINGS=/prepress rubber -f --pdf -Wrefs -Wmisc $(PAPER)

clean:
	rubber --clean $(PAPER)
```

The environment variable GS_OPTIONS ensures that all fonts that the paper uses are embedded, so the pdf looks the same on each machine, no matter what fonts are installed. This is a requirement of many conferences and generally a best practice. When using this Makefile, the indented lines containing the two rubber commands must be prefixed by a tab character and not by spaces. Take a look at chapter TBA to learn more about creating Makefiles.

Makefiles are powerful and great for tasks that involve repeated processing of files. I use a Makefile to compile this book from the markdown format to HTML, epub, and pdf, and also to automatically publish new drafts. The Makefile’s target is index.html – the HTML file I want to create. The prerequisites are book.md, pandoc.css, and references.bib – the source files that are necessary to produce the HTML file. Finally, the recipe is an invocation of the tool pandoc, which converts my markdown file to an HTML file.

```
all_input = book.md pandoc.css references.bib metadata.xml
html_output = index.html
epub_output = ebook.epub
all_output = $(html_output) $(epub_output)
publish_files = index.html pandoc.css img
publish_dir = ~/web/nymity.ch/book
pandoc_flags = --toc --standalone --css=pandoc.css \
	--bibliography=references.bib --filter pandoc-citeproc

all: $(all_output)

$(html_output): $(all_input)
	pandoc $(pandoc_flags) book.md -o $(html_output)

$(epub_output): $(all_input)
	pandoc $(pandoc_flags) --epub-metadata=metadata.xml book.md -o $(epub_output)

.PHONY: clean
clean:
	-rm -f $(all_output)

.PHONY: publish
publish: $(html_output)
	@cp -r $(publish_files) $(publish_dir)
	~/web/nymity.ch/deploy_website.sh
```

Whenever I add more content, I type make, which compiles the source files into an HTML file that I have open in my browser. If I type make and nothing has changed since the last build, I see:

```
$ make
make: 'index.html' is up to date.
```

A Makefile can also contain rules that are not about compiling input into output files. To share drafts of my book, I upload them to my personal web server. This involves copying the relevant files into a directory that contains my websites, and then invoking a script that syncs web content from my laptop to my web server. All of this happens simply by running make publish. If the book’s output formats don’t currently exist, make will first compile them (hence the prerequisite on $(html_output)). Then, an invocation of cp copies the book’s HTML files to another directory on my laptop and, finally, I invoke the script that uses rsync to sync all files to my web server.

If you are not a fan of command line tools, you can still benefit from LaTeX by using an online editor. Overleaf has been popular among some of my collaborators.

### A LaTeX template

Below is a LaTeX template that I use for research papers. When submitting a paper to a conference, you typically have to use the conference style. You can add that to the template, but you may also have to change or remove parts of the template, depending on how restrictive the conference style is.

Note that \input{introduction} is replaced with the contents of introduction.tex. I find it convenient to outsource sections to separate files because it makes the paper easier to manage. It also helps with version control if multiple people are working on the paper.

```
\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[scaled=0.8]{beramono}
\usepackage[T1]{fontenc}

% For pretty tables.
\usepackage{booktabs}
% Also for pretty tables.
\usepackage{multirow}
% For using colours.
\usepackage{xcolor}
% For clickable links and back-references in the references.
\usepackage[pagebackref=true]{hyperref}
% For smart spacing in custom commands.
\usepackage{xspace}
\usepackage{amsmath}
% For embedded figures.
\usepackage{tikz}
\urlstyle{tt}

% Bibliography.
\usepackage[backend=biber,backref=true]{biblatex}
\bibliography{literature}
\renewcommand*{\bibfont}{\footnotesize}

% Add custom text right before backreferences in literature.
\renewcommand*{\backref}[1]{}
\renewcommand*{\backrefalt}[4]{%
  \ifcase #1
    Not cited.
  \or
    (Cited on p.~#2)
  \else
    (Cited on pp.~#2)
  \fi}

\definecolor{darkblue}{rgb}{0,0,0.4}
\definecolor{lightgray}{rgb}{0.93,0.93,0.93}

\renewcommand{\title}{Paper Title}
\renewcommand{\author}{Alice and Bob}

\hypersetup{
  urlcolor=darkblue,
  citecolor=darkblue,
  pdftitle={\title},
  pdfauthor={\author},
  pdfkeywords={foo, bar},
}

\begin{document}

\input{introduction}

...

\printbibliography

\end{document}
```

### Pre-submission paper checks

Conferences and journals almost always have specific requirements that paper submissions need to satisfy. It’s frustrating to have your paper rejected for unnecessary reasons like formatting violations, so it’s a good idea to spend five minutes checking the conference’s requirements before pressing the “submit” button.

• Make sure that your paper is within the page limit. The page limit sometimes includes and sometimes excludes references or appendices, so read carefully.

• LaTeX shows broken references as question marks. Do a Ctrl + F for strings like “[?]” or “??” to find broken citations and references.

• Make sure that all fonts were properly embedded in your pdf. On Linux, I use the tool pdffonts which is part of the Debian package poppler-utils. I run it as pdffonts file.pdf and it displays a column called “emb,” which shows whether a given font is embedded. While using pdffonts to write this paragraph, I realised to my dismay that one of my old papers did not embed all of its fonts:

```
$ pdffonts Winter2012a.pdf
name                           type    encoding     emb sub uni object ID
------------------------------ ------- ------------ --- --- --- ---------
GJYVBN+NimbusRomNo9L-Medi      Type 1  Custom       yes yes no      100 0
NLMFQI+NimbusRomNo9L-Regu      Type 1  Custom       yes yes no      101 0
XNJNRQ+NimbusRomNo9L-ReguItal  Type 1  Custom       yes yes no      102 0
ZZEWFV+CMSY10                  Type 1  Builtin      yes yes no      103 0
UIPGCJ+CMTT8                   Type 1  Builtin      yes yes no      127 0
Helvetica                      Type 1  Custom       no  no  no      174 0
Helvetica                      Type 1  Custom       no  no  no      180 0
HNYWOO+StandardSymL-Slant_167  Type 1  Builtin      yes yes no      203 0
JHYTSG+CMR10                   Type 1  Builtin      yes yes no      204 0
CUJHND+CMMI10                  Type 1  Builtin      yes yes no      205 0
ZapfDingbats                   Type 1  ZapfDingbats no  no  no      211 0
Helvetica                      Type 1  Custom       no  no  no      212 0
Helvetica                      Type 1  Custom       no  no  no      218 0
XEQPPW+CMTT10                  Type 1  Builtin      yes yes no      242 0
```

## Use git

LaTeX files are all text files, which makes them prime candidates for version control. I recommend putting all of your LaTeX source files into a git repository. (It doesn’t matter if you prefer subversion, CVS, or mercurial over git. What matters is that you use some sort of version control. I like git because it has emerged as the most popular system, and with that comes great documentation and tooling. Also, most people you collaborate with will have at least some understanding of it.) Having your paper under version control has several advantages:

• No writing is ever lost. Whatever you remove during editing is part of git’s history and can always be recovered.

• You can easily determine the differences between two versions of your paper, making it easy to produce a pdf that highlights them.

• You can tell who changed what.

### Use tags for milestones

A specific git commit can be assigned a “tag,” which is an arbitrary label. Git tags are often used for version numbers – when you publish a new version of your software, you assign the latest commit a tag like “0.2.4.” But you can use tags for other purposes. I like to tag important milestones of my writing, for example when I submit a paper to a conference or to the arXiv, or when I publish the final camera-ready version. You can even assign a tag to remember when you sent your paper to your advisor for feedback.

```
* 5de077a - (tag: ndss17-camera-ready) added cs to my email (3 years, 7 months ago) <laurar>
...
* 2cd29b1 - (tag: arXiv-resubmission-1) fixed last paragraph of internet scale section based on corrected plots (3 years, 9 months ago) <laurar>
...
* fabf1e3 - (tag: arXiv-submission) Turn passive into active voice. (3 years, 10 months ago) <Philipp Winter>
...
* 2187ef7 - (tag: NDSS-submission) Minor style harmonization and spelling fixes. (3 years, 11 months ago) <Philipp Winter>
```

### Learn who changed what

With multiple people working on the same project, you will occasionally notice mistakes in the writing that may require discussion. Instead of asking all of your collaborators who’s responsible for a given piece of writing, you can find out yourself by using git’s “blame” functionality. When you run git blame FILE, the output is the text file, annotated with when each line was last changed, by whom, and as part of which commit.

### Help git do its job

Remember to make one change per commit. Here are a few examples in the context of research papers:

• Fix one or more typos. If somebody is proof-reading an entire paper, it’s fine to have a single commit that fixes many (or all) typos in the paper.

• Add a reference. Many claims need to be supported by references. Such a commit may add a new reference to the BibTeX file and then cite it in the corresponding LaTeX file.

• Rephrase a paragraph or section. You may not like the way a paragraph (or entire section) is phrased. Rephrasing this paragraph or section should go in one commit. If you want to rephrase several pages worth of writing, consider using multiple commits.

• Add more writing. Adding a coherent argument, paragraph, or section should go into a single commit. Adding two independent paragraphs to two separate sections should go into two commits.

• Delete text to meet a page limit. Papers must sometimes be trimmed to meet a page limit. Unless it severely cripples the paper, it’s fine to do this in a single commit.

Note that making small changes is not always possible or reasonable. As you are rewriting a paragraph, you may realise that the rewrite only makes sense if you also rewrite the paragraphs before and after. This is fine. The above recommendations are just that: recommendations.

I personally find it helpful if paragraphs of text are broken into several lines spanning a maximum of 80 characters, instead of a single line of text. This makes it easier to inspect diffs and understand what change was made. Consider the following example:

```
@@ -1 +1 @@
-This is a paragraph that consists of a single, continuous line of text. Such long lines can make it cumbersome to determine what has changed in a lengthy diff. Instead, consider breaking a single long line into multiple lines that end at, say, 80 characters.
+This is a paragraph that consists of a single, continuous line of text. Such long lines can make it cumbersome to determine what has changed in a lengthy diff! Instead, consider breaking a single long line into multiple lines that end at, say, 80 characters.
```

Only a single character changed in this paragraph, which is formatted as one line. It’s difficult to see what changed because the line is so long.

```
@@ -1,4 +1,4 @@
 This is a paragraph that consists of a single, continuous line of text. Such
 long lines can make it cumbersome to determine what has changed in a lengthy
-diff. Instead, consider breaking a single long line into multiple lines that
+diff! Instead, consider breaking a single long line into multiple lines that
 end at, say, 80 characters.
```

Here, the same paragraph (and the same change) is formatted as separate lines. It’s easier to see what character was changed in this commit.

# Programming

Analogous to freeware and shareware, there exists the term conferenceware – a mildly derogatory term for the type of software that’s typically published as part of a research paper. Conferenceware is abandoned, outdated, poorly documented, and written in haste. It’s often frustrating to use someone else’s conferenceware. (The pride and accomplishment that academics experience when somebody reads their paper quickly turns into shame and defensiveness when somebody studies their code.) Worst of all, badly written code jeopardizes the correctness of the science: a simple bug can result in incorrect data and misleading conclusions.

This chapter begins by introducing general guidelines for programming that are helpful regardless of what kind of programming you do, followed by advice specifically for data analysis and systems building. Roughly speaking, academic programming falls into one of these two categories.

Data analysis projects start with the collection of data sets (which may already involve some programming), followed by the analysis of the data, which typically involves code to parse, clean, and process the data set. Systems building projects invent new systems or improve existing ones. Examples are the creation of a new routing algorithm, the addition of new security technology to the Linux kernel, or the invention of a distributed system for file sharing. In data analysis projects, one’s focus is to measure a phenomenon by collecting and analysing data. This requires a slightly different skill set than building systems, where the focus is to write complex prototypes and potentially integrate them into existing, even-more-complex software.

## Best practices

This section distills a number of best practices that I consider essential in academic programming. The advice comes from having made all of those mistakes myself, and from reading other people’s code that was better than mine.

### Avoid functions that do too many things at once

Imagine you are working on a parser that takes as input a file and returns structured data. There are several ways to write the code that accomplishes this, and one may be tempted to squeeze all of the functionality into a single function, as illustrated below:

```
def analyse_file(file_name):
    total = 0
    with open(file_name) as fd:
        for line in fd:
            line = line.strip()
            if line == "":
                continue
            total += int(line)
    print(total)

analyse_file("filename")
```

The problem with the above code is that one function does everything: it opens the file, parses its content, and processes the parsed data. That may be fine for a quick prototype, but if your code is going to evolve over time – and data analysis code has a habit of doing so – it’s best to break up the code into several functions because doing so allows for faster and safer code changes. For example, the way your code ingests data may change from reading a file to reading from the network, which requires a comprehensive change to the analyse_file function. Instead of cramming several vaguely related tasks into one function, split them up into several functions. In our example, one can intuitively split the monolithic function into three separate functions that read, parse, and analyse our data:

```
def read_file(file_name):
    with open(file_name) as fd:
        return fd.readlines()

def parse_file(raw_data):
    content = []
    for line in raw_data:
        line = line.strip()
        if line == "":
            continue
        content.append(int(line))
    return content

def process_file(parsed_data):
    total = 0
    for elem in parsed_data:
        total += elem
    print(total)

process_file(parse_file(read_file("filename")))
```

Observe that the control flow now resembles an actual pipe: we call read_file, pass its output as input to parse_file, and do the same with process_file. The modular code is significantly more reusable and safer to modify. For example, if the format of your data changes, you can jump straight to modifying parse_file because that’s where the parsing happens; you don’t have to go through the error-prone process of finding the relevant parsing code in a long function. (This may not be bad in our strawman example, but real-world data analysis code is significantly more complex.) Similarly, if you want to change how your code ingests data – perhaps the data should come over the network instead of from a file – you can implement a new read_file function; the remaining code need not be touched.

Note that the above is standard programming advice, often referred to as functional decomposition: the process of breaking down a complex function into smaller, simpler components. This is generally best practice, but I find it particularly important in the setting of data analysis.

### Document encountered issues

Whenever you work with code, you are bound to run into issues. It’s not a matter of if but when. You may encounter bugs, library conflicts, or code that only works on a specific architecture. Whenever you spend more than five minutes solving one of those problems, document it. It only takes a minute to add a few sentences to your personal work log. All that’s necessary is something along the lines of “tried to get library X to work but it didn’t work because of Y. I then tried Z and managed to get it to work.” There’s a non-zero chance that you (or a colleague) will run into the same, or at least a reasonably similar, problem in the future, and it’s better to rely on documentation than on the accuracy (or lack thereof) of your memory.

### Organise your directory structure

If your code has grown beyond a very simple prototype, avoid placing all files in the same root directory because that gets messy very quickly, and it will be difficult to find specific files. One way to organise files is by their purpose, e.g.:

• bin/ for executable files. Your users should be able to run your tool by executing a file in this directory.

• doc/ for documentation. This directory typically contains extensive documentation like technical specifications or automatically-generated source code documentation.

• src/ for source code.

• test/ for unit and integration tests.

• README for an overview and usage instructions.

My Python tool exitmap uses this directory structure and shows how a Python project can be structured that way.

### Make your code public

It is common in academic research to keep code and writing secret until publication. Code only exists on the laptops of the researchers or in private git repositories – otherwise somebody would take it and rush to publish a paper before you; or so the folklore goes. If somebody asks for the code, the best that one can typically expect is a copy of it, together with a plea to not share it further. The fear of getting scooped incentivises researchers to be very careful about who they talk to about their work. This concern doesn’t come out of nowhere: the intense pressure to publish can bring out the worst in people, to the point of “stealing” others’ ideas or fabricating data.

Scooping is not the only issue. It is also intimidating to publish code – or anything, really. By making your brainchild available for everyone to inspect, you are exposing yourself and taking a risk. What if people find your code inefficient and judge you for it? The good news is twofold. First, it is very uncommon to attract negative attention for publishing free software. Second, you do get used to publishing work. As intimidating as it may seem in the beginning, it does get easier, to the point where it is routine. The work output of my last two jobs at Brave Software and The Tor Project happened almost entirely in the open. We coordinate over email, IRC, and bug trackers, and code is free by default. I eventually grew used to this philosophy, but it used to be foreign and intimidating to me.

Imposter syndrome is widespread, and few things are more intimidating than voicing an idea in public, surrounded by competent people who call bullshit when they see it.

Here’s another way to look at it: free software is a community effort. You don’t get to complain about software that’s freely available; if you are unhappy with it, fix it. The software is free, after all. People have been unhappy with my research prototypes many times, and I am fortunate to have received numerous patches over the years. I was always flattered to learn that somebody cared enough about my software to write a patch for it. Maintaining a popular library or measurement tool can not only be fulfilling but provide you with a very real advantage: people will turn to you for help (and as a result, you may end up co-authoring papers) or at the very least cite your work in their papers. There aren’t a lot of people who publish their code proactively, and doing so sets you apart from your peers.

You may be rationalising your secrecy by telling yourself that nobody would ever be interested in using or reading your code. You may think your code is too niche, too naive, too slow, or too clunky. Give yourself more credit than that. If nobody has done before what you are doing now, consider publishing your code: someone may face the same problem in the future, and this person will be very glad to have stumbled upon your code. Sure, it may be missing a feature, or be a bit too slow or buggy, but some code is typically better than no code. Unless there are good reasons not to, consider making your code public from the very beginning. The longer you wait, the more reasons you will find not to publish it. Don’t wait – take the risk. I promise, it is worth it.

### Use libraries

Whenever I faced the need to parse data, speak a network protocol, or anything, really, I used to feel tempted to write the code myself, from scratch. While that was often fun and educational, I frequently discovered later that a library existed that did the exact same thing. To add insult to injury, the library was usually faster, more complete, and less buggy. I could have saved a substantial amount of time and headache by using that library. Today, before I sit down to implement anything, I spend a few minutes searching for code that already does what I need. Most code ends up on GitHub these days, which makes the site a natural first choice when looking for libraries.

You will find that libraries are no panacea and differ widely in their quality, ranging from intuitive, well-documented, and equipped with example usage, to outdated, undocumented, and simply broken. Worse, it’s not always immediately clear which category a given library falls into. To quickly assess a library’s quality (or to choose between multiple libraries), I recommend the following heuristics:

• Is the library still maintained and actively developed? If so, it will be easier to get help, and the library is less likely to be outdated or broken. Take a look at the most recent git commit, or the latest issue that was filed in the bug tracker. If there has been no activity in years, the code is either of stellar quality or has been abandoned.

• Is the API documentation comprehensive? Poor documentation will make it difficult to use the library, and is a telltale sign of poor code.
Similarly, if the README file is riddled with spelling mistakes, the code is more likely to be riddled with programming mistakes.

• Does the documentation provide usage examples? Examples help with getting up to speed and are a sign of careful maintenance.

## …for data analysis

Let us now look at tips specifically for measurement code, i.e., code whose purpose is to measure a phenomenon or a system. An example is Internet measurement, e.g., projects that set out to understand the complex nuances of, say, how a distributed system recovers after failure.

### Use precise timestamps

Whenever you measure something, use the most granular timestamps possible, which is typically millisecond or nanosecond resolution. Timestamps that are only accurate to the second are often not granular enough to inform you about the phenomenon that you are measuring. On a related note, consider always using the UTC time zone when creating timestamps. You may end up comparing timestamps that were created by multiple systems, some of which don’t share the same time zone. That’s generally not a big problem because one can account for time zone differences, but I find it more convenient to always deal with the same time zone.

### Automate your processing pipeline

We all use some degree of automation, but few automate their entire data pipeline, from data collection, to processing, to embedding results in the research paper. In my experience, it is helpful and very convenient to be able to run the entire pipeline by invoking a single command, and to be able to run specific steps of the pipeline in isolation, e.g., to plot the data. Measurement code is often a loose collection of Python programs – one to collect the actual data, one to plot the results – and adding a plot to the research paper is often a manual step. I typically combine those Python program invocations in a shell script, to limit the manual work to the running of a script.

Considering that I have multiple shell scripts that do similar things (e.g., operate on the same data set), it makes sense to create another shell script whose purpose is to serve as a configuration file. All it does is define variables that are used by the other shell scripts. That makes all configurable variables available in a central, easy-to-modify place, so you don’t have to remember what variable is where. Below is a small example of a bash-based configuration file, called config.sh:

```
#!/bin/bash

# Path to dataset; used by all analysis scripts.
data_set=/path/to/dataset.csv
```

Bash provides straightforward mechanisms to load the content of another shell script, e.g., the source keyword. Once a configuration file is sourced, the calling bash script can access its variables as if they were defined in the calling script:

```
#!/bin/bash

# Load variables from config file.
source /path/to/config.sh

echo "$data_set"
```

Simply source the configuration file from all your processing scripts and you will be able to configure them conveniently and centrally as your measurement scripts evolve over time.

Speaking of evolving over time: At some point, you may wonder how a chart changed over time. For example, were throughput numbers better before you refactored the concurrency logic of your network service? To answer questions like this, there is merit in keeping a simple archive of charts. I recommend encoding a chart’s birthday in its filename, and have a filesystem link that always points to the latest chart, e.g.:

```
$ ls -1
syscall-timing-latest.pdf -> 2021-12-06_11:07_syscall-timing.pdf
2021-12-06_11:07_syscall-timing.pdf
2021-12-03_09:32_syscall-timing.pdf
2021-12-01_15:10_syscall-timing.pdf
```

Whenever your shell script creates a new chart, it can assign the chart the proper filename and create a new link to the latest chart, which your research paper can then embed. Below is sample code that accomplishes this:

```
#!/bin/bash

# Create experimental data by running our data gathering script.
python3 /path/to/data/gathering.py > data.csv
ret=$?
if [ $ret -ne 0 ]
then
    echo "Data gathering script failed with exit code ${ret}." >&2
    exit $ret
fi

# Create the chart's desired filename.
date_time_prefix=$(date +"%F_%T")
chart_name="${date_time_prefix}_chart.pdf"

# Create a chart.
python3 /path/to/plot.py data.csv > "$chart_name"

# Create a link to the chart.
ln -sf "$chart_name" "latest_chart.pdf"

# Finally, rebuild the paper.
cd /path/to/paper
make
```

The automation of your data processing pipeline saves you time, prevents frustration, and is more robust to errors because there is minimal manual work. Once you have an automation pipeline that you are happy with, you can re-use and adapt it across projects. The marginal cost of re-using your automation pipeline is negligible, making it a good time investment.

### Make your processing pipeline verbose

No matter your area of research, you will likely be doing data analysis as part of a research project. Conceptually, data analysis is often done with a sieve-like processing model: the raw, unfiltered data goes into your code, which then removes broken data points, special cases, and outliers, and does some processing, classification, what have you, until only valid data points remain. At any point during this analysis pipeline, it is helpful to log the number of data points that your code is dealing with. This will help you realise early on, by simply glancing at your code’s output, if something went wrong. Otherwise, it is all too easy to end up in a situation where you present somebody with your data, only to be asked “what happened to the other 85% of your data?” Below is command line output that illustrates the idea. The log shows at a glance that 75% of all original elements survive the filtering criteria.

```
[2019-10-16 11:02:43] Processing raw data with 54,329 elements.
[2019-10-16 11:02:44] Discarding 2,199 (4.0%) broken elements.
...
[2019-10-16 11:03:12] Writing 41,049 (75.6%) elements to disk.
```

### Collect data as raw as possible

Assume you are working on a network measurement project for which you need a large number of UDP headers. UDP’s simple 8-byte header is easily written to a database, so that’s what you do. Once the data is collected, you realise that not every UDP header is valid: the recipient rejected some of the UDP datagrams because the surrounding IP header was corrupt. Unfortunately, you are not able to figure out which headers were affected because you don’t have the full packet capture. For this reason, it often pays off to store your data as raw as possible – storage permitting, of course. In this case, one should have stored the full packet capture in pcap format, which contains the surrounding IP header and even the link layer. The same applies to other types of data sources: if Python code fails, log its full stack trace instead of just the last error message. You never know when you will need the extra data.

There is however an important exception to this rule, and that is privacy. If you are collecting data that pertains to people and their privacy, the opposite is called for: collect only as little data as you need to answer your research question.

### Linux tools for data analysis

When first presented with a new data set, one typically seeks to understand the data at a high level. We don’t yet care about every single outlier; rather, we are interested in broader trends, which exploratory analysis is meant to uncover. Exploration is meant to be quick and easy to adapt. What’s more, data sets are often encoded in a structured, textual representation like CSV or JSON. A small set of Linux command line tools is all that’s necessary for quick exploration. To me, the holy trinity of quick and dirty analysis is cut, grep, and sort, which I will briefly introduce below.
#### cut

Assume you have a CSV file – or any kind of text file that has one data record per line, where each record consists of separate fields. Cut helps you select specific fields. Take the following example: a CSV-formatted data set that maps a timestamp to the number of requests seen at that timestamp.

```
time,requests
2019-10-27,582938
2019-10-28,582938
2019-10-29,519301
2019-10-30,502318
2019-10-31,510329
```

Given this file format, how can you select only the requests and discard the time column? Here’s how:

```
$ cut -d , -f 2 file.csv
requests
582938
582938
519301
502318
510329
```

The argument -d , tells cut to use a comma to distinguish between columns and the argument -f 2 selects the second column, which contains “requests”.

Do you want to discard the first line, which contains the file header? Pipe the output of cut into tail, to only display lines starting with line number two:

```
$ cut -d , -f 2 file.csv | tail +2
582938
582938
519301
502318
510329
```

#### sort and uniq

Once you have combined command line tools in a way that prints numeric data to the console, you can sort that data by piping it into the sort tool. For example, to sort the numbers above, we append sort -n to the command pipeline. The -n flag tells the tool to sort the numbers numerically rather than lexicographically.

```
$ cut -d , -f 2 file.csv | tail +2 | sort -n
502318
510329
519301
582938
582938
```

We often want to know about duplicate elements in a data set. The tool uniq can help us with that. To eliminate all duplicates, we need to pipe the output into uniq:

```
$ cut -d , -f 2 file.csv | tail +2 | sort -n | uniq
502318
510329
519301
582938
```

Finally, when given the option -c, uniq can tell us how many duplicates it found:

```
$ cut -d , -f 2 file.csv | tail +2 | sort -n | uniq -c
1 502318
1 510329
1 519301
2 582938
```

The first column shows how many times the value in the second column appears in the data set. The last line reveals that the value 582938 showed up twice.

#### grep

Instead of the above example, we might have a considerably larger data set, spanning many months, and we may find ourselves wanting to filter the data set for specific months, e.g., August 2019. The tool grep makes this straightforward. The below example selects only lines that begin with the string “2019-08”:

```
$ grep '^2019-08' file.csv
```

grep accepts regular expressions as input and supports a large number of configuration options, making it by far the most powerful tool in this collection. Think of grep as the Swiss Army knife of data filtering. We will now go through an example.

#### An example

Let’s take a look at a practical problem that requires a combination of the tools discussed above: what are the most frequently used words in this book? To answer this question, let’s begin by splitting the lines in this book into words by using the tool tr, which replaces spaces with newlines:

```
$ cat book.md | tr ' ' '\n'
---
title:
Effective
tools
for
computer
systems
research
subtitle:
"DRAFT
...
```

The resulting list contains many lines that are markdown control characters and other fragments that aren’t English words. To select only words, I’m using grep with the regular expression ^[a-zA-Z]\+$ to select only lines that consist of the letters A-Z and a-z.

```
$ cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$'
Effective
tools
for
computer
systems
research
Please
do
not
author
...
```

Now that we have a list of all the words in this book, let’s sort those words to make it easier to count them. This is as simple as piping the output of our existing tool chain into sort:

```
$ cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$' | sort
a
a
a
a
a
a
a
a
a
a
...
```

Finally, it’s time to count each word. The tool uniq does this for us. It expects as input a sorted list of strings, one per line, and, if given the argument -c, shows the number of times it encountered each string:

$ cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$' | sort | uniq -c
166 a
2 A
2 able
24 about
1 above
2 academic
1 academics
1 accept
1 access
1 accessible
...

Wouldn’t it be useful to have uniq’s output sorted? Again, this is as simple as piping our current output into sort yet again. This time, we pass the argument -n to sort, instructing it to sort numerically instead of lexicographically. We also use the argument -r to reverse the output, so the most common strings show up first:

$ cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$' | sort | uniq -c | sort -n -r
201 to
168 a
159 the
139 you
128 of
126 your
102 and
75 in
72 I
70 that
...

There are numerous other command line tools that help with simple data processing. A quick Web search for “how to do X in bash” typically suffices to discover how a specific problem can be solved.

## …for systems building

Let us move on to programming tips for systems building.

Imagine you are building a TCP proxy that modifies data in flight. As you work on the code that shoves data from one socket into the other, you realize that the proxy does not close its open connections properly, so you set out to fix that issue, and in the process, you restructure other parts of the code. Throughout all those steps, you run the risk of breaking the code that’s already in place, and perhaps without even noticing.

It may pay off to write a set of unit and integration tests for your code. Modern languages like Go make that reasonably easy, and the benefit is substantial: you can make complex changes to your code, run the tests and – assuming they pass – be reasonably sure that your changes did not break existing source code. You don’t have to test every single line of each function; focus on code that’s critical to the overall architecture and write a few unit tests that ensure that the code is working as expected.
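To give you a sense of how little ceremony this takes in Go, here is a minimal sketch of a unit test. The rewrite function is a hypothetical stand-in for whatever transformation your proxy applies to data in flight:

```go
package proxy

import (
	"bytes"
	"testing"
)

// rewrite is a hypothetical stand-in for the proxy's data
// transformation: here, it upgrades http:// URLs to https://.
func rewrite(b []byte) []byte {
	return bytes.ReplaceAll(b, []byte("http://"), []byte("https://"))
}

// TestRewrite verifies that rewrite transforms its input as expected.
func TestRewrite(t *testing.T) {
	got := rewrite([]byte("GET http://example.com/"))
	want := []byte("GET https://example.com/")
	if !bytes.Equal(got, want) {
		t.Errorf("rewrite() = %q; want %q", got, want)
	}
}
```

Put tests like this in a file ending in _test.go and run go test; after any refactoring, a passing test suite gives you the reassurance described above.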

### Consider a design principle

Simple proof-of-concept projects that consist of a couple thousand lines of code can be built without paying much attention to architectural design, but if your code is meant to grow beyond that, you may want to think about adopting an architectural design principle like The Clean Architecture (Martin 2012). Such principles make it easier to keep complexity under control as software grows. A sketch of the core idea follows below.
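To illustrate the idea shared by such principles – business logic at the centre, infrastructure at the edges – here is a minimal, hypothetical Go sketch. The names and the Store interface are my own invention, not part of Martin’s formulation:

```go
package report

// Store is the narrow interface that the business logic depends on.
// Concrete storage (a database, a flat file, an in-memory fake for
// tests) implements it elsewhere and is injected from the outside.
type Store interface {
	Requests(day string) (int, error)
}

// Peak returns the day with the most requests. It knows nothing
// about where the numbers come from, so the storage backend can
// be swapped without touching this code.
func Peak(s Store, days []string) (string, error) {
	best, max := "", -1
	for _, d := range days {
		n, err := s.Requests(d)
		if err != nil {
			return "", err
		}
		if n > max {
			best, max = d, n
		}
	}
	return best, nil
}
```

Because Peak depends only on the interface, you can unit-test it with a fake Store and replace the storage layer later without restructuring the business logic.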

### Tackle the riskiest component first

When designing and implementing a new system, it’s often not clear if it can be done. Your project may require combining several existing systems, which can lead to issues that are difficult to predict. Some components of your new system are bound to be riskier and more likely to fail than others. If possible, try to work on those first, so you can abort the project early if you run into insurmountable obstacles. It’s much better to realize five hours in that your system cannot work than after days or weeks of programming.

### Learn to navigate large code bases

Most systems researchers will sooner or later have to understand and modify large code bases that have been written by others. That could be the Linux kernel, device drivers, or browsers. They all have in common that they cannot be understood in their entirety, which makes it difficult to make changes. There are, however, a number of strategies that can speed up the process of navigating a large third-party code base.

Before diving into the code, try to get it set up and running locally. Once the code is running, resist the temptation to immediately hunt for the part of the code that you think will hold your changes. It pays off to obtain a high-level understanding of the code before diving into the details. Crucially, you should build that understanding top-down instead of bottom-up: start with the code at a very high level, meaning class hierarchies, control flow, and interfaces. Instead of focusing on specific functions, focus on function call graphs; instead of focusing on class attributes, focus on the class hierarchy. Well-documented software occasionally has architectural diagrams that help with understanding how the code is organised. Such diagrams are an excellent place to start. If you can obtain a semi-comprehensive understanding of the code – or at least of the subset that concerns you – it will be much easier to see where your changes should land, and you are less likely to run into costly dead-ends.

Another aspect that can slow you down substantially is a sloppy development setup. Complex code bases can take a while to compile, test, and set up, which makes it cumbersome and error-prone to test small changes to the source code. Be sure to spend some time setting up a proper development environment and consider reaching out to the project’s development community to get advice on efficient development. And as always, automate commands that you type repeatedly. For example, instead of running ./configure, make, and make install separately, you can run a single command that aborts if any of those steps fails: ./configure && make && make install.

Finally, large code bases are too complex to keep in your head. Don’t be afraid to reach for pen and paper to understand a program’s call graph or its startup sequence. I take extensive notes on complex code that I am studying; these notes resemble a call graph, including source code files and line numbers.

## Summary

• To make your code more reliable and speed up the development process, use high-quality libraries.

• To make your code more structured, consider following a design principle.

• To make your work more useful to peers (and gain recognition), publish your code.

# Communicating

Regardless of what research you do, a substantial part of your job will be communication, mostly with your peers, but ideally, also with the general public. We communicate constantly, by writing papers, sending emails, talking to advisors, presenting our work, and complaining on Twitter. Being an outstanding researcher goes a long way, but to truly excel, we also have to master communication.

Effective communication creates numerous opportunities by 1) exposing your research to people who would otherwise not see it, 2) saving time, 3) “selling” your work, and 4) earning the respect of your collaborators.

In this chapter, I will encourage you to create project pages, publish pre-prints, present effectively, engage in popular science writing, and use social media to your advantage. Regarding the more “intimate” communication with your peers, this chapter also discusses socialising, managing your collaborators, proper email etiquette, picking the right communication mode, and the reasons for communicating openly.

## …with the world

### Project pages

Have you ever stumbled upon a promising research paper that mentions that you can get its source code by emailing the authors, only to find that their email addresses no longer work? Or that they don’t respond to your email? Or that, when they do get back to you, they can’t find their source code anymore? The main output of a research project is the resulting scientific paper, and once it’s published, there is little incentive for authors to do more.

Early on in my Ph.D. life, I made it a habit to create project pages for almost every research project I have been involved in.

The workload in research can be overwhelming, and adding yet another task to a project may sound daunting. But creating a project page doesn’t take long – maybe one afternoon, if you take your time. Once you have a template, you can re-use it for your next project, minimising the marginal cost of each new project page.

I recommend that project pages have at least the following sections:

• Project summary: Start with a paragraph that summarises your project. Similar to an abstract, it should convey (i) what problem your project solves, (ii) how it solves the problem, and (iii) what the results are. Try to write the project summary for a broad audience; write it the way you would explain your research to someone in another department, or to someone in the grocery store. In other words: use simple language and avoid jargon.

• Datasets: If your research uses a dataset, then your project page should link to the data. You may not want to host datasets yourself, especially large ones. Consider using the Internet Archive to archive your dataset; link to your Internet Archive page from your project page.

• Code: Your code matters because it allows others to reproduce your work. We therefore have an obligation to publish our code. Code is never perfect, so don’t ever be embarrassed about your code’s quality. No reasonable person will judge you by the quality of your code. As with datasets, there is no need to host code yourself: feel free to link to a GitHub or GitLab repository.

• Papers: Papers are the main outcome of a research project, so we should all make our research papers and other write-ups available on our project pages. Be sure to make your paper openly accessible instead of linking to a paywalled portal. Research papers behind a paywall are an injustice and prevent less wealthy scientists from engaging in the scientific discourse. If you are worried about legal consequences of publishing a paper outside a paywall imposed by the publisher of record: don’t be. I have yet to hear of a single case of a scientist getting into trouble for making their own work available.

• Contact information: Consider providing contact information to make it easy for fellow researchers to reach out to you. Try to use email addresses that will still work five years from now – even if this means using your personal address instead of your university email address.

I recommend keeping your project pages under your control, so you can edit them whenever you need to. It’s difficult to update the page if it’s hosted at university.edu/project/ and you are no longer employed by your former university. At some point I decided to host all my project pages on my own web server, nymity.ch, which gives me full control. But this control comes at a price: responsibility. If you host your own web server, it is now your responsibility to keep it alive, and to refresh your domain names and HTTPS certificates. If you want the same control with less responsibility, I recommend hosting your pages on a service like GitHub Pages.

It is increasingly common to buy fancy domains for project pages, often ending in the desirable “.io” top level domain. There is nothing wrong with that, but if you let these domains expire, your project page will disappear. Are you still going to pay that yearly $15 fee for myproject.io ten years from now? If not, then don’t go that route.

To get you started with project pages, feel free to use the following template, which gives you a simple, fast, and decent-looking project page in little time.

<!doctype html>
<html lang="en">
<head>
  <title>TODO: Page title</title>
  <meta charset="utf-8">
  <meta name="description" content="TODO: Web page description">
  <meta name="author" content="TODO: Your name">
  <style>
    .toc { justify-content: space-between; display: flex; }
    body { width: 60%; font-family: sans-serif; }
  </style>
</head>
<body>
  <h1>Your project's title</h1>
  <div class="toc">
    <a href="#overview">Overview</a>
    <a href="#writing">Writing</a>
    <a href="#code">Code</a>
    <a href="#data">Data</a>
    <a href="#contact">Contact</a>
  </div>
  <hr/>
  <h2><a id="overview">Overview</a></h2>
  <p>This is the project overview</p>
  <h2><a id="writing">Writing</a></h2>
  <p>An overview of what writing you published.</p>
  <h2><a id="code">Code</a></h2>
  <p>Links to your source code</p>
  <h2><a id="data">Data</a></h2>
  <p>Links to datasets</p>
  <h2><a id="contact">Contact</a></h2>
  <p>Contact information</p>
  <hr/>
  <p><i>Last update: YYYY-MM-DD</i></p>
</body>
</html>

You can think of project pages as documentation of a finished piece of work, but I prefer to think of them as living documents that evolve as a research project progresses. The earlier you can share information about your work, the better. Research papers are often preceded by workshop papers, posters, abstracts, or presentations. All of these are worth making available early on, on a project page. In fact, a project page can serve as documentation for yourself, to keep track of your project’s output.

I am not suggesting you create project pages for purely altruistic reasons; you get something out of it too:

• You learn about your audience by looking at your web server logs. I used to regularly check the visitor logs of my project pages. It was interesting to see which universities and departments my readers came from. In fact, it was gratifying to realise that anyone at all was interested in reading my work.

• You expose your research to a broader audience. Research papers follow a style of writing and presentation that can be alienating to a general audience. Project pages mitigate this problem. Somebody who would not read your paper may read your project page – and perhaps then decide to take a look at the paper too.

• It signals to potential employers that you go the extra mile and care about the presentation of your work, even when you don’t have to.

### Publish preprints

For fear of getting scooped, researchers typically keep projects confidential until the publication of a peer-reviewed paper. But getting a paper through peer review can take many months, if not years, because it is common for a paper to be submitted multiple times for review. Throughout all this time, your work could have been useful to others. In a fast-moving field like computer science, this antiquated publication model causes frustrating and unnecessary delays. It does not have to be this way. While we can’t get around publishing peer-reviewed papers – they are academia’s currency, after all – we can publish a technical report before the final, peer-reviewed version of a paper is out.
If you are still not convinced: Correa et al. (Correa et al. 2020) provide (not yet peer-reviewed) evidence that openly accessible papers are cited more than closed-access papers.

Originally created for the publication of physics pre-prints, the arXiv has become computer science’s most popular pre-print publication platform too. You “publish” your work on the arXiv by uploading your research paper’s LaTeX code (be sure to first remove all cuss words in the comments). After a moderator reviews your submission, your article will appear on the arXiv – typically after one or two days. Conveniently, the arXiv provides a notification system that informs subscribers about new reports in their area of interest. This means that a non-trivial number of people who subscribe to the field “computer networks” will get a notification after the publication of your new report in computer networks.

A frequent concern about the arXiv is that many conferences don’t allow paper submissions that have previously been published in a peer-reviewed venue. Fortunately, the arXiv is not peer-reviewed, so a report on the arXiv typically does not count as published. In my field of computer security, all top-tier conferences accept papers that previously appeared on the arXiv. Regardless, in case of doubt, ask a conference’s program chairs to clarify their policy regarding previously published (but not yet peer-reviewed) technical reports.

“But Philipp,” you may ask, “why go to the extra trouble of uploading my report to the arXiv?” It’s all about exposure. Once your report is published, many of your peers will come across it: through Google Scholar, which crawls the Internet for research papers; via the arXiv’s in-house notification system; or through other aggregators. Early exposure can result in citations, potential collaborations, or at least people hearing of your work.

### Presenting

A good conference presentation opens doors. Science journalists may approach you to write a popular science article about your work (or, in the time-tested academic tradition of unpaid labour, ask you to do it for them), people from industry may wonder how to deploy your research, and other academics may suggest projects to collaborate on. A great presentation can elevate your research from obscure insignificance to something that people talk about. Even if your research is not spectacular, a great presentation sets you apart from other presenters.

Take presentations seriously. Most conference talks I attend are missed opportunities. The average academic talk is difficult to follow, poorly structured, and dispassionate. Entire books have been written on effective presenting and I won’t try to compete with them here. Instead, I’ll distill my advice into a few key points:

• Rehearse your talks. Some people believe the myth that great presenters are born instead of made. This is wrong. My best talks were the result of numerous (up to a dozen) rehearsals – that’s why they were my best talks. With practice comes confidence. You will know what to say, so you’ll have fewer “ehms,” poor transitions, and awkward pauses, because you won’t have to make sense of your own slides on the fly. Consider recording yourself to learn how to use your voice more effectively, improve your body language, and become mindful of – and eliminate – fillers like “ehm,” “you know,” and “like.”

• Capture your audience’s attention. Don’t dive right into the research. Try to start with a lighthearted joke, an interesting anecdote, or anything else that gets people engaged. As an example:
I once presented a paper on Sybil attacks. Curiously, my name was listed twice on the conference’s list of accepted papers, so I used that fact to start my presentation with a joke that got a few laughs.

• Focus on what matters. It is very common for presenters to ramble on about irrelevant details. Keep in mind that what your audience can take away from your presentation is very limited. Ask yourself: what are the two or three most important points that I want my audience to remember? Build your presentation around these points.

• Have a narrative. Every sentence you say should be directly connected to the previous one. If you jump from one topic to another without a proper transition, you will gradually lose your audience. Even with a proper narrative it can be difficult to follow a talk, so recapitulate occasionally, e.g., by saying “now that we’ve looked at X and Y, it’s time to talk about Z.”

If you would like to learn more, take a look at Patrick Winston’s excellent lecture on “How To Speak”.

A good presentation uses slides sparingly but effectively. Here are my suggestions for optimal slide use:

• Minimise the number of words on slides and avoid clutter. Your audience is going to read what’s on your slides, and while they are reading, they cannot pay attention to you. Your slides are supporting material and are not supposed to keep your audience busy reading.

• Use slide numbers, which allow people to reference specific slides during the Q&A.

• Make sure that the font (including in diagrams) is big enough that even people in the back rows can clearly read it. Most presenters get this one wrong.

• When presenting charts, guide the audience. Explain the axes, discuss how to read the chart, and highlight important insights.

• Optimise your slides for the 16:9 widescreen format, which is now supported by all modern projectors. To be safe, consider exporting a second slide set for the (outdated and increasingly rare) 4:3 aspect ratio.

• Use a sans-serif font (e.g. Arial). Avoid serif fonts (e.g. Times New Roman) because they are optimised for reading large amounts of text. Needless to say, this is not critical advice, but I find it useful nonetheless.

### Science Twitter

Twitter has a (sometimes deserved) reputation for being a time sink fueled by conflict and outrage, but in my experience, the platform is all about who you follow. If you follow the right people, you will learn a lot. Over the years, I’ve compiled a set of Twitter accounts that share sharp insights and thoughtful commentary. There is high-quality discourse on Twitter, similar to dinner conversations at conferences. The best thing about Twitter is that you don’t have to pay $800 in registration fees to participate in these conversations.

In deciding whom to follow, I use a simple heuristic: I follow somebody for a few days or weeks, and if I don’t learn much from them, I unfollow. Twitter is as much a marketing platform as it is a discussion platform, and some people push the marketing aspect a little too far for my taste.

While you’re at it, use the opportunity to follow people outside your field. And by that, I don’t mean someone in programming language design if you’re in computer vision. Follow people in psychology, economics, or biology. You can learn a lot by observing what problems scientists in other fields struggle with, and how cultures and methods differ.

Twitter can be a great option for staying in the loop on various topics:

• Conferences and workshops have reputations. (For the non-computer scientist: original research in computer science is typically submitted to conferences instead of journals.) When I was new to research, I did not know that. Eventually, I got a feeling for this (sometimes informal) “ranking” and for which conferences carry the most prestige. Listening to researchers talk about conferences will help you get a sense of where you should submit your work.

• Professors occasionally talk about faculty hiring processes, what they look for in Ph.D. or grant applications, and their opinions on the peer review process. While this knowledge does not necessarily generalise to all of academia, it is still helpful.

• Some researchers openly talk about their paper rejections, which serves as an important reminder that the people you admire deal with rejection just as much as (if not more than) you do. This helps calibrate your perspective.

• By engaging in discussions, you will eventually build a following, allowing you to promote your own work more effectively.

By no means do you need to use Twitter to be successful in your field, but controlled use can be an advantage. However, avoid Twitter fights – they make you look like a combative fool to bystanders. Also, make an effort to post interesting and insightful content; don’t just advertise your latest paper.

### Teach the public

Communicating about your work does not have to end with your peers. We have a responsibility to make our work accessible to the broader public. To that end, I have published two articles in The Conversation. (I was not paid to write these articles and have no financial interest in this site; I merely mention The Conversation because I have some experience with it.) I was originally contacted by an editor who encouraged me to explain my research by publishing an article there. The site does not pay its authors, but I still experienced it as an interesting endeavour because I had never worked with an editor before. For both articles, I created a first draft and my editor provided plenty of suggestions and advice. After three or four more iterations, each article was ready to be published.

There are many other outlets that encourage scientists to explain their research to the public. Your research group or department’s blog platform is another great opportunity to practice these skills.

## …with collaborators

### Socialise

Academic conferences are where one forms new connections and collaborations. Conferences can be intimidating and uncomfortable, particularly for sufferers of impostor syndrome. You find yourself surrounded by accomplished and smart people, and believe that your research pales in comparison to theirs. I know the feeling.

While most of the networking at a conference happens in the “hallway track” during breaks, there are more networking opportunities in the evening. People often head out for dinner and drinks, which creates a less formal environment that makes it easier to strike up conversations and meet new collaborators. Consider tagging along with a group so you don’t miss out on this opportunity.

### Manage collaborators

Eventually, you are going to lead a research project. This involves coordinating collaborators, organising meetings, and keeping everyone in the loop on the project’s progress. Things inevitably get messy when people with different personalities, cultures, and communication styles work together. The following tips help make the process smoother:

• Your advisor is a collaborator too and needs “management.” Advisors differ significantly in their style and range from entirely hands-off to micromanagers. To learn more, take a look at Nick Feamster’s excellent blog post on the matter.

• If you want a collaborator to work on something, ask specific questions and provide clear instructions. Don’t expect them to realise how busy you are and offer help – they are likely too busy to notice.

• Keep people up-to-date on the project’s progress. Some people like to use email for this; others schedule regular calls to discuss progress. Consider using email for short updates and have an occasional call when there’s more to discuss.

• Don’t be afraid to express frustration, but do so respectfully and with the intent to improve the collaboration rather than assign blame. For example, if half of your team always misses meetings, strike up a conversation on how to collaborate in ways that work for everyone.

• Even more important than expressing frustration is the expression of gratitude. Let your collaborators know when they did a good job! We all love to feel appreciated.

• Conflict among collaborators is a common occurrence. If you find it difficult to resolve a conflict yourself, consider involving your advisor as mediator.

• Whenever there is something to discuss, involve all collaborators unless you have a good reason not to. Your collaborators will feel respected for being kept in the loop. (More on that below.)

• As if all of the above were not difficult enough, the average team consists of researchers from several cultures that have different customs regarding communication. Give people the benefit of the doubt and try to be clear and respectful in your communication.

### Email etiquette

Pick descriptive email subjects that make it clear what your email is about. I occasionally prefix an email subject with “FYI:” or “Action needed:” to let the recipient know that an email can be ignored or that it requires action.

• Good: “FYI: Paper got accepted”

• Good: “Action needed: Commit missing code to repository”

• Bad: “Need help with code” – Good: “Please commit missing code to repository”

Try to avoid top-posting when dealing with long and complicated emails because it makes it difficult to follow an email discussion:

I have strong opinions about your email. You are wrong about X, Y, and Z.

On Tue, Jan 05, 2021 at 07:35:54PM +0000, John Doe wrote:
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
> incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
> nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
> fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
> culpa qui officia deserunt mollit anim id est laborum.

Instead, try to quote and respond to specific parts of the original email:

On Tue, Jan 05, 2021 at 07:35:54PM +0000, John Doe wrote:
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
> incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
> nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

This I agree with.

> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
> fugiat nulla pariatur.

I believe we should do X instead, because of Y.

> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
> deserunt mollit anim id est laborum.

Well said.

Use To: and Cc: wisely. Put everyone whose attention you require into the To: field and the remaining collaborators into the Cc: field. Create an email alias to make it easy to reach all of your collaborators. For more email writing tips, take a look at Philip Guo’s excellent article.

### Pick the right communication mode

Find the right balance between the least invasive and the most convenient communication method. It may be convenient for you to call your collaborator each time you need something, but they may experience this as distracting and invasive.

To discuss complicated research designs, you typically need a synchronous meeting: either a phone call or an in-person meeting. For topics that require less back-and-forth, asynchronous communication methods like email are a better fit. If you need something right now, an instant message or phone call may be most appropriate.

Also, keep in mind that everyone’s communication preferences differ. Some people enjoy video calls while others prefer texting. Collaboration often requires compromising; try to find communication methods that work for everyone.

Regarding specific communication tools, Slack (or its free software alternative Mattermost) is useful because it allows collaborators to self-select which communications they want to participate in.

### Communicate openly

Imagine a small research project consisting of three collaborators: Alice, Bob, and Eve. There are four possible communication channels – assuming nobody talks to themselves:

1. Alice ↔︎ Bob
2. Alice ↔︎ Eve
3. Bob ↔︎ Eve
4. Alice ↔︎ Bob ↔︎ Eve

Four collaborators have eleven possible communication channels, while five collaborators have a whopping twenty-six possible communication channels! (To count the channels among n collaborators, sum the binomial coefficient “n choose k” – the number of possible groups of size k – over all group sizes k from 2 to n.)
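That sum has a neat closed form; as a sanity check, n = 3 gives 4, n = 4 gives 11, and n = 5 gives 26:

$$\sum_{k=2}^{n} \binom{n}{k} = 2^n - n - 1$$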

A project with five collaborators is by no means unusual – in fact, the top four academic security conferences now average five authors per paper (Balzarotti 2020).

The good news is that you don’t have to ponder which one of the twenty-six communication channels to opt for before writing an email. Unless you have a good reason not to, err on the side of inclusion when communicating. That is, include everyone in your email CC list by default. If any one of your collaborators feels overwhelmed by the communication, they can request to be omitted from future correspondence, or they can simply ignore your emails. Typically, it should be your collaborator’s decision what to participate in – not yours. In my experience, collaborators appreciate being kept in the loop – even if they rarely respond to email threads.

As a young Ph.D. student, I mistakenly believed that I was doing my collaborators a favour by not including them unless I really needed their help. After all, isn’t everyone busy and don’t they have better things to do? This is a fallacy. Collaborators exist to help each other and they generally like to know what’s going on. Give them the opportunity! Besides, leaving people out of communication can quickly lead to a culture of distrust. Junior collaborators, especially, will wonder if there are ulterior motives for them being left out.

However, not everything needs to be discussed with all of your collaborators. Do you need your advisor’s signature on a document? Your collaborators won’t care. The same is true if one of your collaborators is unable to log into a machine that you use for experiments. When it comes to the actual research, however, you need to have a good reason to not include someone.

I know first-hand that it’s often tempting to initiate one-on-one communication. For example, you may feel insecure about an idea and want to run it by someone before you share it further. Try to avoid this. The more you communicate in the open, the better for you and the project, and your collaborators will respect you for it.

Balzarotti, Davide. 2020. “System Security Circus 2019.” January 2020. https://s3.eurecom.fr/~balzarot/notes/top4_2019/.

Chacon, Scott, and Ben Straub. 2014. Pro Git. Apress. https://git-scm.com/book/en/v2.

Clear, James. 2018. Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones. Avery.

Correa, Juan C., Henry Laverde-Rojas, Fernando Marmolejo-Ramos, and Julian Tejada. 2020. “The Sci-Hub Effect: Sci-Hub Downloads Lead to More Article Citations.” 2020. https://arxiv.org/pdf/2006.14979.pdf.

Diffie, Whitfield, and Martin E. Hellman. 1976. “New Directions in Cryptography.” Transactions on Information Theory 22 (6). https://ee.stanford.edu/~hellman/publications/24.pdf.

Keshav, Srinivasan. 2007. “How to Read a Paper.” SIGCOMM Computer Communication Review 37 (3). http://ccr.sigcomm.org/online/files/p83-keshavA.pdf.

Martin, Robert C. 2012. “The Clean Architecture.” 2012. https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html.

Newport, Cal. 2016. Deep Work: Rules for Focused Success in a Distracted World. Grand Central Publishing.

Pinker, Steven. 2015. The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century. Penguin Books.

Pollan, Michael. 2009. In Defense of Food: An Eater’s Manifesto. Penguin Books.

Walker, Matthew. 2018. Why We Sleep. Scribner.