Research Power Tools is a work-in-progress book project, written by Philipp Winter, that discusses tools for more productive computer science research.

The book will be available by Summer 2022. In the meantime, please take a look at the open review version below. You can (and are highly encouraged to!) leave comments. Let me know what you find confusing, interesting, or what you would like to read more about, and I will update the draft accordingly.

Finally, if you like this project, please tell a friend about it!


Research Power Tools

Open review version. Please leave feedback!

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

Abraham Lincoln

Introduction

Research is messy. Our body of knowledge is scattered across countless journals, presentations, blog posts, and tweets, making it difficult to get up to speed in a field. Data collection often requires numerous iterations and is frequently poorly documented. Subsequent data analysis is sometimes powered by fragile code and obscure plotting systems. The resulting research paper is first called paper.doc, then paper-new.doc, paper-new-2.doc, paper-final.doc, and eventually paper-final-FINAL.doc.

It doesn’t have to be like that. A well-structured research project is not only possible but well within your reach – as long as you know how. That’s what this book is about. The following chapters introduce effective tools for doing research, in particular related to organising, versioning, reading, writing, programming, visualising, automating, and communicating. These tools are software, processes, or ways of thinking.

In addition to discussing these tools, I will explain why they are effective. You will realise that logging your time gives you freedom, that good writing is the opposite of academic writing, and that having a Twitter account isn’t all about posting cat pictures.

Building an effective tool set has a significant return on investment in terms of time, sanity, and research quality. You will save time by automating parts of your research pipeline and putting them under version control; you will keep your sanity by having the peace of mind that your pipeline is robust; and your research will increase in quality because you’re now less likely to make unnecessary mistakes.

Who is this book for?

I wrote this book primarily for new graduate students in computer systems. The field of computer science is often divided into “theory” and “systems.” I work in systems myself, and you will find this book most useful if you are a systems researcher – for example in a field like security, programming languages, or networking.

If this includes you, the entire book should be of interest. The secondary audience is people who program, write in LaTeX, or do research – basically anyone in a STEM field. This crowd will find a subset of this book useful.

Why did I write this book?

I had little research experience when I started my Ph.D. I faced a steep learning curve in my first couple of years, and research frequently felt overwhelming: in addition to reading hundreds of papers and trying to carve out your own area of research, you have to teach, take courses, and learn how to do research in the first place. I’ve always been curious about how other people work and cope, so I wrote the book that I wanted to read as a young student.

Perverse incentives in research place too much value on the number of papers published and citations accrued. As a result, people cut corners to maximise their research output – often sacrificing rigour. Poorly documented workflows and sloppy code can lead to mistakes that jeopardise the correctness of a project. By adopting strict and effective workflows, we can minimise these mistakes and save time.

An unfortunate amount of knowledge in any scientific field is implicit, meaning that it’s rarely spelled out. Pinker calls this phenomenon the “curse of knowledge” (Pinker 2015, chap. 3): Years of experience in a field make it difficult to put yourself in the shoes of a newcomer. Examples of (mostly) implicit knowledge are the reputations of conferences, collaboration etiquette in research projects, and effective organisation of your time. People do write and talk about these topics, but you may have a hard time finding a course or book that adequately addresses them. In this book, I spell out aspects of research that aren’t elaborated on very much elsewhere – which means that parts of this book may seem obvious to you.

How should you read this book?

All of the chapters are self-contained, so dive into whatever subject appeals to you. The chapter on versioning is particularly important and is referenced several times in subsequent chapters. Finally, this is a hands-on book and I strongly encourage you to read it while in front of your laptop – ideally with an open terminal. You’ll retain more if you put what you read into practice.

Organising

A combination of stubbornness and luck got me through my Ph.D. and postdoctoral training without a todo list or schedule. Each day, I would simply continue work where I stopped the day before. My lack of organisation didn’t get me into trouble because, for the most part, I was involved in only one or two projects at a time, which was manageable. This changed after my postdoc. I suddenly found myself juggling software projects, monthly reports, research papers, and blog posts, all on tight deadlines. I had no choice but to become more organised.

You may – like me – get away with poor organisational skills during your Ph.D., but why not improve these skills before it’s absolutely necessary? Why not become more efficient in the process, and also more prepared for your post-Ph.D. life, which will most likely be more demanding and require juggling several projects in parallel? This chapter discusses organisational tools and behavioural hacks that will help you stay on track and make steady progress. After all, research is a marathon, not a sprint.

Incorporating new skills

Incorporating new organisational tools requires forming new habits. When you adopt new habits – such as organising, exercising, or eating healthily – you may find that you can keep to them for a few days, but then drift back into your old patterns. In his book Atomic Habits (Clear 2018), James Clear provides insight into why that is: the key to forming good habits (and eliminating bad ones) is to i) create an obvious cue that triggers the action, ii) make the action attractive, so that you look forward to doing it, iii) make the action easy to perform, and iv) make its outcome satisfying, so that you want to repeat it.

Curate a todo list

As a Ph.D. student (and even as a postdoc), I spent most of my days working on very few tasks – so few, in fact, that I was able to keep my todo list in my head. Most days, I worked on moving my research project forward, interrupted only by the occasional paper review, presentation, or work with students. I had few tasks and deadlines to keep in mind at any given time. Once I transitioned from my postdoc into the “real world”, I quickly realised that I needed a better way to keep track of tasks. I suddenly found myself having to push forward several small projects: writing papers, fixing bugs in complex code bases, organising workshops, applying for grants, and analysing data sets. It was no longer possible to keep everything in my head, so I started experimenting with todo list tools. Console tools were a bit too cumbersome and web tools were inconvenient and clunky, so I eventually settled on maintaining my list in a simple text file.

In essence, a todo list helps you keep track of the things that you need to get done. The trick is to make it a seamless part of your workflow. If adding a new item to your todo list involves spending thirty seconds finding todo.docx on your hard drive, waiting five seconds for Microsoft Office to open, and another three seconds to scroll to the end of the file to add a new task, you will soon give up.

If you feel the need to curate a todo list, but find yourself unable to do so, work on minimising friction. First, make it so that you can open your todo list as quickly as possible. For example, configure a keyboard shortcut that opens your file. Then, if a new task is assigned to you during a meeting, you can add it to your list within five seconds. Once a new task is in your list, you can forget about it, freeing you of the cognitive load of having to remember it.
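
For example, if you work in a terminal, a single shell alias is enough – a minimal sketch, assuming your list lives in ~/doc/todo.md:

alias todo='vim ~/doc/todo.md'

Add this to your ~/.bashrc and a new task is never more than one command away; most desktop environments similarly let you bind a keyboard shortcut to an arbitrary command.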

If you work on multiple devices and need your todo list on all of them, you will have to sync it somehow. In this case, browser tools may be your best bet because they can do the syncing for you. Manual syncing is out of the question – it’s not sustainable and you will wind up with out-of-sync lists.

The format I eventually adopted is a markdown-formatted text file that lives in the same file as my work log (see the section on work logs). Some tasks are more pressing than others; to reflect this, I use three sections: “today”, “this week”, and “eventually”. Here is an excerpt of what my todo list currently looks like:

# TODO

## Today

* work on presentation for hagenberg

* write monthly team report

* figure out how to move forward with #32126

## This week

* finish python 3 port of bridgedb (#30946)

* address cohosh's feedback of sharknado (#30716)

* write introduction and impacts section for ttp grant

* read salmon research paper (#29288)

## Eventually

* refactor emma with dcf's feedback (#30794)

* wrap up expired /keys issue (#17548)

* come up with a solution to bridgedb's broken captcha (#24607)

* look into moat "password" idea (#28015)

When I finish a task, I move it from my todo list to my work log, which is in the same file, so it literally takes seconds. This minimises friction and makes curating my todo list easy enough that I actually stick with it.

Of course, it doesn’t matter what I do. The best todo list is the one that works for you. I’m giving you an idea of my workflow in the hope that it helps you discover what will work for you. My workflow is heavily centered around text files and command line tools. You may be more of a browser person, in which case a pinned browser tab may work significantly better. Experiment with different workflows until you find one that you can sustain.

Plan your day

A habit that I picked up relatively recently, after reading Cal Newport’s excellent book Deep Work (Newport 2016), is to plan my day. Each morning, right after I turn on my laptop, I take a look at my todo list and make a plan for the day in 30-minute blocks. I draw these blocks on my tablet, but pen and paper work just as well.

This is an example of my daily schedule. I reserve my mornings for design and development, and other tasks that require intense concentration.

Note

A day without a schedule risks devolving into yak shaving: imagine you want to fix that annoying bug in your code that has been messing with your experiments. While thinking about how to best fix the bug, you notice that the function that contains it has poor documentation. So you spend a moment updating the documentation. While doing that, you realise that your functions follow an inconsistent documentation style, which really annoys the perfectionist in you. So you harmonise the way functions are documented in your code, and in the process, learn that the documentation tool you’re using has released a new version with convenient new features. However, your operating system doesn’t have the newest packages yet, so you set out to compile it manually. Three hours later, you find yourself hunched over your laptop, covered in sweat, finally with the new version of your documentation tool. That bug that you originally intended to fix? It’s still there.

I used to come to the office each day without a clear idea of what needed to get done. I would typically continue work where I left off the day before, or turn my attention to whatever seemed the most urgent. This approach may suffice if most of your time goes into a single project whose details you can keep in your head, but it falls apart in the face of more complex responsibilities.

You may think that a detailed plan for each day impedes your creativity. In fact, the opposite is true. By spending a few minutes planning in the morning, you reduce your cognitive load throughout the day. You won’t have to think about what to work on next, or when it’s time to switch tasks. You already took care of that at the start, leaving the rest of the day for deep thinking with minimal context switches.

Perhaps the biggest benefit of a planned day is that it helps you stay on track. I have the annoying habit of polishing finished work more than is necessary or even useful. Having a daily plan in front of my nose serves as a reminder that the perfect is the enemy of the good, and that many other tasks are still waiting to get done. This helps me move on to the next task quickly, which makes me more productive throughout the day. Higher productivity means more happiness. At the end of the day, I feel that I have accomplished what I wanted to, making it easy to get out of “work mode.” When I feel I haven’t accomplished enough, I find it difficult to leave work behind because I keep thinking about unfinished tasks. Needless to say, this is far from productive and only prevents me from relaxing and recharging. A detailed plan for the day is a good antidote that helps draw a clear line between work and personal life.

Track your time

Do you know that feeling of taking a break from work, only to catch yourself an hour later watching obscure YouTube videos? Or the feeling of having spent a full day working but ending up with little or nothing to show for it? As if the day has passed and you’ve accomplished nothing? The solution to these problems is to establish a tight grip on your most precious possession: your time.

I use Time Tracker on Debian Linux. This lightweight tool lives in my system tray, allowing me to quickly open it and take note when I’m switching from one task to another. Each time I switch between tasks, I open Time Tracker and jot down what I’m going to work on next. (Examples of tasks are “answering email,” “changing database API,” or “reading research paper.” Tasks like “writing” or “programming” are likely too general, while tasks like “adding second paragraph to introduction of research paper” are too specific.) On a typical day, I end up with five to ten tasks, and at the end of the day, I know exactly what I did and can contrast it with what I was supposed to do.

A typical day consists of a handful of tasks I worked on. The numbers next to the tasks correspond to bug tracker IDs.

Note

Tracking your time at such granularity may feel oppressive and stressful. After all, it’s yet another thing to remember and worry about. That’s exactly what I thought until I started doing it, but I learned that a bit of self-policing helps me stay focused. It’s easy to feel busy all day long without really getting anything done: you can spend several hours going over meeting notes, mulling over the next email, or stressing about all the things that you need to accomplish, and still have little to show for it. Keeping track of exactly what I’m doing throughout the day helps me notice when I’m actually productive: I’m productive when I finish a handful of well-defined tasks. If you are working on a big project, try to split it into tasks, so you can accomplish a handful of them each day.

Part of my day job consists of working on development tickets for sponsors. These tickets consist of software bugs, feature requests, or small projects. My employer, The Tor Project, is mostly funded through grants, and we need to have a good understanding of how much time a specific development task takes. How long does it take to set up a testbed to evaluate a new pluggable transport protocol? An hour? A day? A week? It’s important to have both experience and data for time estimation because our intuition isn’t always the most reliable predictor. I use my time tracker to record how many hours it took me to complete each bug tracking ticket, which I then compare to the hours I projected it would take. Over time, my estimates become closer and closer to how much time I really needed (minus the occasional outlier, obviously).

Log your progress

I am a big advocate of keeping a log of what I have accomplished throughout the day. Did you finally manage to finish the introduction of your latest research paper? Mention that in your log. Did you finish refactoring the data processing pipeline in your prototype? This should go straight into your log. Right after I complete a task worth writing down, I spend approximately five seconds adding it to my log and then move on. (Occasionally, I forget to add a completed task right away; I then add it later, sometimes even the day after.) I don’t bother getting punctuation or even grammar right.

On a typical day, I jot down somewhere between five and ten tasks in my log. I don’t log every single email I write but I do sometimes log emails if they’re both lengthy and important. Here’s what my Oct 14, 2019 work log looked like:

  • 2019-10-14
    • filed #32064 for improved search results for “download tor” and to incorporate gettor links in our website descriptions
    • replied to [redacted] and asked him if he’s willing to run default bridges for orbot
    • created wiki page to formalise our “support ngos with private bridges” process
    • thought about design for system that can scan the reachability of PT bridges (#31874)
    • created summary of obfs4 work for quarterly race report and updated our obfs4 ticket (#30716) with current project status
    • reviewed #31384 (snowflake.tp.o language switcher)
    • reviewed #31253 (webext packaging target)
    • responded to email with points worth communicating at otf summit
    • started working on #17548 (deal with expired pgp keys)

You can see that my phrasing is rough around the edges and that’s okay: you are the primary consumer of this log and you will likely remember what you meant after the fact. You will interact with your work log several times a day, so minimise the friction of adding tasks to it. My work log is always open on a virtual desktop, so I don’t spend any time opening it. I also added a shortcut to my editor, vim, to quickly add today’s date to the bottom of the file. The format of my work log is markdown, which facilitates conversion into other document formats such as HTML or PDF. (As always, do what works best for you. I’m a terminal person and enjoy fast, lightweight, and robust console tools. You may be more of a browser person, in which case it’s worth looking at web services that help you log your progress.)
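
The date shortcut, for instance, can be a one-line mapping in your ~/.vimrc. The key binding below is an assumption – a minimal sketch to adapt to your own setup:

" Append a blank line and today's date to the end of the work log.
nnoremap <leader>d :call append(line('$'), ['', strftime('%Y-%m-%d')])<CR>G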

Using pandoc, converting a markdown-formatted file to PDF is as simple as running:

pandoc log.md -o log.pdf

As a student, I would send my log for the past month to my advisor at the end of the month. He appreciated seeing what I’d been up to at a level of granularity that was neither too coarse nor too detailed. Sharing your work log with your advisor also serves as insurance: your advisor won’t be able to ever complain that they were not kept in the loop.

Actually writing down and seeing what you’ve done throughout the day can be surprising – in either a good or bad way, depending on how much you have accomplished. A progress log is important because it allows us to monitor ourselves. If I see one or two days in a row with little progress, I know it’s time to make a conscious effort to improve my productivity. Progress logs make it less likely that you’ll drift into a slump and perform poorly for days or even weeks without noticing – something that happened several times during my Ph.D.: I wasted many weeks going down rabbit holes and losing sight of the big picture, mulling over attractive research ideas that were ultimately infeasible but that I was too stubborn to give up on. I was no longer on track and I didn’t realise that I needed to take a step back and re-evaluate my direction. Looking at your work log makes it easier to notice when you’re off track; sending your work log to your advisor means that they should also be there to help.

Note that tracking your time and logging your work are very similar but serve different goals. Time tracking allows you to monitor yourself on a micro scale while a work log helps on a macro scale. It’s possible to be on the right track but spend a significant part of your day watching YouTube videos (a time tracker would reveal this bad habit) or be efficient in your day-to-day business but not spend time doing the right things (ideally, a work log would help you see this).

Take meeting notes

I had the privilege of collaborating with 25 people over the course of my research career. Several of the projects I was involved in were not led by me, so I took a back seat, and in some of them I was surprised by the lack of note taking during meetings. We would meet and discuss the project, as I was used to, but nobody took notes (at least not for the entire group); the implicit expectation was that everyone would remember what was said and what was left to do. Needless to say, this didn’t always work out. Over the following weeks, people forgot what was discussed, only to end up discussing some of the very same topics again at the next meeting, and misunderstandings along the lines of “wait, I thought you were supposed to do that?” crept in.

You can avoid these issues by consistently taking notes. Before each meeting begins, designate a note taker. In fact, multiple people can take notes simultaneously in a Google Document or a Riseup Pad. The note taker(s) jot down key points during the meeting and todo items for each person. At the end of the meeting, all of the participants get a copy of the notes. If anyone disagrees with any of the notes, they speak up. This way, everyone is on the same page, there is a written record of what was discussed, and it’s easy to go back to find out when a certain task was covered. I can guarantee that your collaborators will love you for taking notes.

Note taking isn’t just for high-stakes meetings with important collaborators. I take notes almost every time I interact with somebody – including with my advisor, during my Ph.D. days. I created a simple shell function that facilitates the creation of a new file for each meeting. I simply type meet alice into a terminal, and the command automatically creates a new file, 2019-12-21-alice.md, and opens it in my text editor. Here’s the script, which you can add to your ~/.bashrc:

meet () {
    # Date-stamp in UTC, e.g., 2019-12-21.
    d=$(date -u '+%Y-%m-%d')
    # One markdown file per meeting, e.g., ~/doc/meeting/2019-12-21-alice.md.
    file="${HOME}/doc/meeting/${d}-${1}.md"
    vim "$file"
    echo "Edited $file"
}

I take notes in markdown, which is both expressive and very simple. All my meeting notes live in the same directory, ~/doc/meeting/. Sometimes I’m looking for something that was said in a past meeting but don’t remember which one, so I grep all my meeting notes for a keyword that I remember being present (as simple as running grep keyword * in that directory). Again, this is meant to minimise friction – if taking meeting notes were laborious, I would be too lazy to do it.
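
For example, to find every meeting in which some keyword – say, a hypothetical project name – came up:

grep -i "snowflake" ~/doc/meeting/*.md

The -i flag makes the search case-insensitive; add -l if you only want the matching file names.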

Persevere

Graduate work takes its toll. Long work hours bring with them loneliness and isolation; paper rejections chew on your self-worth; witnessing colleagues excel fuels imposter syndrome; and seeing childhood friends buy homes and have children makes you wonder if graduate school really is the right choice. It comes as no surprise that poor mental health is troublingly common among graduate students, yet rarely talked about.

As a Ph.D. student, I was – and still am, to some extent – struggling with insecurities. I would read outstanding papers in my field and immediately feel discouraged by the depth of the work and how it combined concepts I had not even heard of before. How would I ever be able to compete with that?

It is important to understand that your seemingly perfect colleagues are anything but, and often struggle with the very same issues. I was both surprised and relieved when a friend – whose work I greatly respect – mentioned that he, too, struggles with imposter syndrome. I realised that if someone as knowledgeable and capable as him is plagued by these feelings, it’s significantly more widespread than I had thought.

Mental health takes place in your mind, but a healthy mind lives in a healthy body. There are a few things you can do that greatly affect your mental well-being.

  • Maximise your intake of whole foods (broccoli, potatoes, beans, coffee, apples) and avoid processed foods (cereal, pasta, candy, soda). (Pollan 2009)

  • Exercise regularly by making it a fixed part of your daily schedule. Don’t try to squeeze exercise in between other responsibilities; squeeze other responsibilities around your mandatory exercise. Convince a friend to do the same and hold each other accountable.

  • Go to sleep and wake up at regular times. Get at least seven hours of sleep. It’s fashionable to brag about how little sleep one gets, and in times of pressure it’s easy to believe that sleep is a “waste of time” but the opposite is true. What may take a tired brain three hours to accomplish can be a matter of 30 minutes for a well-rested brain. (Walker 2018)

  • Get a bit of sun every day. I like to go for a run around noon, or sometimes go for a brief walk while listening to a podcast. When I get back, I’m full of energy, calm, and ready to get back to work.

  • Practice meditation. It occupies only ten minutes of my day – right after waking up, with a hot cup of black tea in my hands. I enjoy Sam Harris’s Waking Up app. Using an app to facilitate meditation may sound ironic but there is merit in guided meditation – even if it’s just a few spoken sentences every other minute.

  • Procrastinate productively. Deep thinking requires creativity; you can’t force that.

Summary

  • Minimise friction by making organisational tasks as quick and pleasant as possible.
  • Experiment with different workflows until you find one that you can sustain.
  • Log your time to become more efficient.
  • Log your progress to become more effective.
  • Take notes during meetings to create a “paper trail” and clearly jot down todo items.

Reading

Graduate work requires an awful lot of reading. Course material, blog posts, textbooks, emails, and most importantly: research papers. This chapter (i) shows that there’s more to reading a research paper than working your way from one page to the next; (ii) presents strategies to organise your reading; (iii) shows how you can learn about relevant new papers; and (iv) explains how you can access papers that are locked away behind paywalls.

How to read a research paper

I’ve spent too much time reading research papers like novels: cover-to-cover, in the mistaken belief that that’s how I would get the most out of them. I eventually learned that my time is not always best spent trying to understand the minute details of a paper’s method section – especially if I never intend to apply that method myself. Nobody awards you a medal if you fight your way through a paper that already lost you on page two. Sometimes, there is no need to read a paper’s method section at all – you may only be interested in the conclusion, or the section on data collection. Before diving into a paper, know what you want to get out of it. Needless to say, there’s nothing wrong with reading a paper cover-to-cover. I still do it all the time, but understand that this is not universally the best approach to reading a paper.

Throughout your research career, you will look at hundreds of research papers. You may not read them all cover to cover, but you will at least skim them. When is a paper worth reading cover to cover? When is it best to only skim? And what exactly does skimming mean? There are heuristics that help answer these questions and save you time. (In computer science, researchers curiously started referring to their method as their “methodology,” which is the study of methods. If you are writing a paper that compares and contrasts methods in your field, refer to that as methodology; otherwise, call it a method.)

When opening a paper for the first time, you will read its title. Not all titles are descriptive or provide a good idea of what a paper is about, but as long as it sounds vaguely interesting, you want to read the abstract too. Abstracts are generally short, and can be read in one or two minutes. Ideally, an abstract reveals the problem that the paper tries to solve, explains how it solves the problem, and what the results are. A well-written abstract provides enough information for you to decide if the paper is worth diving into.

Next up is typically the introduction in which authors provide context on what research problem they are addressing (the problem statement) and on why it matters (the paper’s motivation). I often skip the introduction of papers that are in my field because I understand the context and I have already bought into the paper’s motivation. For papers far outside my field of expertise, the introduction can be the most interesting part. I occasionally read the introductions (and nothing else) of cryptography research papers because I cannot be bothered to dig into their proofs and mathematical models. The introductions, however, help me understand why a topic matters and how it relates to a broader field.

A handful of other sections can separate a paper’s introduction from its “meat.” Many papers have a dedicated background section which explains technical concepts that the reader may not be familiar with. A good rule of thumb is this: if one is unlikely to encounter a concept in a standard computer science curriculum, it may warrant a few paragraphs in the background section. A section on related work is also common, in which the authors put their own work into the context of existing work in the field. Well-written related work sections are as helpful as they are rare. Many papers approach their related work as thoughtless lists of references without any context. “Person A did this; person B did that; person C did this.” That completely misses the point. A related work section is meant to answer questions like these:

  • How does this work compare to similar work that was published in the field?

  • What advantages and disadvantages does this paper have over others?

  • How do papers overlap and complement each other?

What your readers really want to read is: “Person A did X but we decided to do Y. While X comes with stellar performance benefits, we believe that Y provides security benefits that are needed in our threat model. Future work should study a hybrid approach that incorporates both X and Y.”

After several sections of introduction and background, the actual research begins, typically introduced by a “method” section. A method can make or break a research project. It’s what reviewers pay the most attention to, because this is where flaws and biases have the biggest effect. If a given paper is very similar to your own research, you probably want to read its method. In particular, you will want to know if the paper’s method addresses issues that you failed to consider (or the other way around).

Another crucial aspect of a method is its assumptions. Each research project makes assumptions about the format of data, the number of users in a system, the way users interact with a system, the performance of underlying hardware, and so on. Ideally, these assumptions are spelled out explicitly, but sometimes they are implicit and therefore harder to spot. Assumptions are often flawed: a paper may assume a normal distribution of its data when the data is more likely to follow a power law, in which case the paper’s statistical analysis tools may be inappropriate for studying the data. When reading a paper, you want to get a good idea of what its assumptions are – finding a flawed assumption can quickly invalidate a project’s results.

The above is a very brief overview of what to pay attention to when reading a paper. A 2007 article by Keshav goes into more detail (Keshav 2007).

Adopt a reviewer mindset

Your mindset is a key aspect of how you approach a paper. It’s tempting to read papers the way we read textbooks: by assuming that each paragraph is an untouchable source of truth that is not to be questioned. (Arguably, even textbooks don’t deserve that trust – they contain mistakes and have errata – but we are conditioned from an early age to believe what’s printed in books.)

A significant part of graduate training consists of unlearning this mindset and replacing it with perpetual questioning. We wouldn’t be where we are today if Galileo hadn’t questioned the geocentric world view. Adopt the mindset that you are reviewing rather than reading a paper: you aren’t passively absorbing but actively verifying information. Your null hypothesis should be to distrust a paper’s results; only sound reasoning and rigorous methods should change your mind.

Peer review is meant to weed out critical flaws in papers but that does not mean that peer-reviewed papers are free of flaws. Peer review is not perfect and has false positives (flawed papers pass the filter) and false negatives (decent papers get rejected – often because of reviewer antics). Be vigilant and question everything, all the time. Throughout my career, the smartest and most capable people in the room all shared one trait: they had a well-calibrated bullshit filter and would not fall for “smooth talkers.” We all need to strive for that.

Don’t just pay attention to what a paper says; pay attention to what it doesn’t say. Do you believe that a dataset requires an important preprocessing step that a paper never mentions? Can you think of an evaluation scenario that’s not discussed in the paper? Is a paper conveniently ignoring a performance analysis of the proposed database? These are examples of problems that arise through omission. Keep in mind, however, that papers are subject to page limits and therefore must omit some content. Bad reviewers often forget that and criticise papers for missing their favourite kind of analysis. The trick is to include what matters and omit the rest, which is easier said than done.

I highly recommend taking notes when reading a paper – with pen and paper, on a tablet, or in a text file on your laptop. Here are a number of questions that can get you started with thinking critically about a research paper:

  • Briefly summarise the paper’s method.

    • Is the method section detailed enough to facilitate reproduction?
  • What are the paper’s assumptions?

    • How robust and realistic are these assumptions?
    • Can you think of counter-examples to these assumptions?
  • What are the paper’s key results?

    • Do these results contradict or confirm existing work?
  • How could the paper (its method, presentation, or results) be improved?

  • What follow-up research questions come to mind?

  • What’s the paper’s conclusion?

    • Do the results support the conclusions? (Competition at high-ranked journals and conferences is fierce, which creates an incentive for researchers to overstate the importance of their own work. Be wary of that.)

Read and engage in actual peer review

There’s only so much you can do to train your mind to be more vigilant. Luckily, there are ways to draw on the knowledge of more experienced researchers. Some conferences publish peer-reviewed papers together with their reviews. As a young Ph.D. student without any experience in peer review, I found it fascinating to get a glimpse of how peer review works in practice. The IMC conference used to publish paper reviews (the last iteration to do so was the Internet Measurement Conference 2013) but eventually stopped because it constituted a significant burden for its reviewers and the extra work was not worth the use people got out of it.

Other conferences implement a “shadow program committee” that gives graduate students the opportunity to participate in an “alternate universe” reviewing process, which ultimately leads to a “shadow conference program.” (Several conferences run or have run shadow program committees, for example the IEEE Symposium on Security & Privacy, the Internet Measurement Conference, EuroSys, and USENIX NSDI.) This is a great resource and I highly recommend participating at least once.

Once you get your feet wet with reviewing papers, you may want to join an actual technical program committee (TPC) – the set of people who review a conference’s paper submissions. People usually wind up on TPCs after being invited by the conference chairs. To get invited, the chairs need to know you or your research, which is a challenge for most junior researchers. To work around that, ask your advisor if they can get you onto a TPC. Alternatively, your advisor can hand you one or two papers to review for the conference they are reviewing for.

Join or organise a reading group

We get the most out of reading a paper by discussing it with colleagues. Many university departments hold regular reading groups for this purpose. Reading group meetings are similar to book clubs, but revolve around a paper: all members are supposed to read the paper in advance and then discuss it during the meeting. Your department may already have a reading group, but if not, why not organise one? It’s as simple as saying to your colleagues, “Let’s meet next Tuesday for an hour to discuss the attached research paper.” (Ask your advisor or your department chair if they are willing to “sponsor” the reading group by ordering pizza. Try to keep the culinary incentive small – if the food is too good, people will show up for the food without having read the paper.)

N+1 brains are strictly better than N brains because everyone brings a unique perspective to the table. Every reading group I’ve ever attended has left me with a significantly better understanding of the paper and its context than I could have acquired on my own. Besides, you will refine your sense of what to pay attention to when reading a paper – it’s a great way to flex your “reviewer mindset” muscle.

Here are some guidelines for successful reading groups:

  1. Nominate a session lead who picks the paper, perhaps gives a brief summary of it before the discussion, and then moderates the discussion. Session leads can change with each meeting. It can be a humbling experience to explain a paper to someone else: we often don’t know whether we truly understand something until we try to explain it.

  2. Go into the reading group with a set of questions to discuss:

    • What did you like about the paper?
    • What did you dislike about the paper?
    • What are follow-up research questions?

    Reading groups always drift off topic. Without any guidance, you will find yourself discussing the advantages of vim over emacs after five minutes. The session lead should steer the conversation back into productive territory, which is easier when keeping a few questions in mind.

  3. In my experience, the most effective reading group has fewer than a dozen attendees. As the number of participants increases, discussions become disorganised and the reading group less productive.

  4. Seek to organise reading groups around specific research areas and not entire fields – for example, “privacy and security” rather than just “computer science”.

Organise your reading

You are likely to read hundreds of research papers throughout your Ph.D. studies. If you’re anything like me, you won’t remember them all. Early on in my Ph.D. career, I was struggling with the sheer number of papers waiting to be read. How should I archive these papers? How could I make the list of papers I had already read easy to search? How could I show my Ph.D. advisor the progress I had made? Eventually, I built a small web page that lists all of the papers I have read: title, authors, proceedings, year, publisher, BibTeX record, and a pdf copy. (In the hope that it would be useful to somebody else, I published this page shortly after building an initial version, some time in 2012. Eight years later, I am still curating it because it ended up being a useful resource to me and others.)

My bibliography on Internet censorship-related research papers.

Note

To facilitate the curation of this web page, I wrote bibliograpy, a Python tool that takes as input a .bib file and turns it into an HTML bibliography. There’s a good chance that you will write your research papers in LaTeX, so a BibTeX file is a convenient way to keep track of your reading. Whenever you write a new paper, you can include your BibTeX file and have your citations readily available. The BibTeX format supports custom fields that are typically not used when generating a bibliography. You can use these fields to take notes about papers. Here’s an example:

@book{Smith2020a,
  author = {John Smith},
  title = {Towards Better Computers},
  year = {2020},
  custom_note = {A bit boring at times but some useful suggestions. Section 2
  seems relevant!},
}

Once you have your own, growing BibTeX file, I would encourage you to build a bibliography – and ideally to publish it, so your colleagues can benefit from it too. Once you finish an initial version of your bibliography, maintenance requires little effort. I regularly take a look at the proceedings of relevant conferences (in my field, these are USENIX Security, ACM CCS, IEEE Security & Privacy, the Internet Society’s NDSS, and a few others) and add new papers to my bibliography’s .bib file. I then run a script, deploy_website.sh, which builds the web page and uploads it to my web server. Automation is key here: I would never bother to curate my bibliography if adding new papers took more than five minutes. Occasionally, I receive patches from colleagues who stumble upon papers that I wasn’t aware of.
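
Such a deployment script can be just a few lines of shell. The sketch below assumes that bibliograpy writes its HTML to standard output and uses an illustrative server path – adjust both to your setup:

#!/bin/sh
# deploy_website.sh: rebuild the HTML bibliography and upload it.
bibliograpy references.bib > index.html
rsync -avz index.html user@example.com:/var/www/bibliography/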

Remember, the best system is the one that works for you. Some people like to print papers and take notes with a pen. I used to read papers on my laptop, using the tool Mendeley. (I no longer use or recommend Mendeley: it is now owned by the company Elsevier, a long-standing opponent of the open access movement, and Elsevier does not deserve our support.) At some point I got a tablet, which I now routinely use instead. I like its high-resolution screen, and it’s less distracting than my laptop, which allows me to better focus on the paper. Experiment with a few ways of organising your reading and stick with whatever you like.

Learn about new papers

Every conference cycle brings with it a new set of papers with the potential to affect your research. It’s important to stay up to date on these new papers because (i) you need to know if a research group has been working on the same problem and has published before you (this is called getting “scooped”); (ii) you can learn about new research directions that you may want to pursue; and (iii) you may learn about ways to improve your own work, e.g., by building on better methods or datasets. My favourite ways to find out about new papers are to regularly skim conference proceedings, use arXiv subscriptions, and set up Google Scholar notifications.

Skim conference proceedings

The primary source of new papers in your field should be its top conferences and journals. Each scientific discipline has a few venues that are considered “top-tier” and your advisor will likely encourage you to publish in them. It’s not easy for a newcomer to tell which venues are of the highest quality, so ask your advisor or colleagues. In my field of computer security, these are the Network and Distributed System Security Symposium (NDSS), the Conference on Computer and Communications Security (CCS), the USENIX Security Symposium, and the IEEE Symposium on Security & Privacy. Each of these conferences takes place once a year – one in every quarter. Once one of these conferences publishes its research papers, I spend thirty minutes skimming the list of papers to see what’s worth reading. (I first skim the titles and then look at the abstracts of the papers whose titles seem promising; depending on the abstract, I may then decide to skim or read the paper.) There are always a few papers that I find really interesting. I cannot over-emphasise the importance of following your top venues: they are the source of new research in your field.

Sign up for Google Scholar

In addition to conferences, there’s another handy way to learn about new papers. Google operates a service, Google Scholar, that provides several useful features for academics, one of which is a notification system that emails you about new research papers in your field. Give Google Scholar a handful of author names and it will notify you each time one of those people publishes new work; after a few months of reading papers, you will be able to name a handful of researchers whose work you find particularly relevant or insightful, so consider adding their names. You can also get email alerts for keywords that Google Scholar extracts from papers – the more specific the keywords, the more useful the alerts will be. I currently have an alert for the keywords “censorship”, “system”, “anonymity”, and “tor.”

Every other day, I get an email with about a dozen new papers. Most are uninteresting and some are amusingly unrelated, but occasionally Google Scholar sends me highly relevant papers that I would not have found otherwise, which makes the service absolutely worth it to me. The value I get from the occasional true positive far outweighs the annoyance of the regular false positives. Google Scholar once helped me learn about an important research paper that nobody in my field knew about because it was published at a somewhat atypical conference.

In addition to research papers, Google Scholar tracks the citation count of each paper. At least among computer scientists, Google Scholar has turned into the source of truth for somebody’s h-index – the largest n for which one can say “I published n papers that were each cited at least n times.” (A researcher whose five papers were cited 9, 4, 3, 2, and 1 times, for example, has an h-index of 3.) Academic hiring committees frequently use the h-index to quantify the “productivity” and “impact” of a researcher. I put these two words in quotation marks because, like all metrics, the h-index is a poor approximation of one’s productivity, and people have learned to game it by forming citation rings and engaging in “salami publishing.”

It’s not healthy to obsess over one’s h-index but it can sometimes come in handy: I once had to provide my h-index as part of my green card application for the United States. I applied for a green card using the “national interest waiver” track, which requires the applicant to prove that they have an established research record. Providing one’s h-index is part of this proof.

Use arXiv subscriptions

The arXiv is a database of preprints – papers that have not yet been peer reviewed – originally conceived by the physics community. It has since gained popularity in many other fields, including computer science, and some disciplines have their own version of the arXiv, like bioRxiv in the biological sciences. Many researchers post their papers to the arXiv before (or in parallel to) submitting them to a journal or conference. Note that most journals do not accept work that has been previously published; this rule typically does not cover preprints, but in case of doubt, check with the editors.

The arXiv manages an email subscription service that allows you to subscribe to research topics you’re interested in. Similar to Google Scholar, this email service will alert you to new papers.

Circumvent paywalls

Most research papers can be found in the databases of academic publishers like IEEE, ACM, Springer, Elsevier, Sage, or Wiley. It is the very job of these publishers to make research available to researchers. Unfortunately, access to these databases is not universally free. One can pay to access single articles but universities generally subscribe, which allows university employees to access (a subset of) all of a publisher’s articles. The cost of these subscriptions is exorbitant and has prompted numerous universities to cancel them. In 2019, after months of unfruitful negotiations, the University of California system decided to cancel its Elsevier subscription.

A paywall at the ACM Digital Library. ACM non-members have to pay $15 to purchase this article. A simple Google search for the paper title typically uncovers a freely accessible version of the same paper.

Note

Researchers without a university affiliation, or with an employer that cannot afford subscriptions, are shut out by paywalls. I mentioned earlier that Ph.D. students typically skim hundreds of research papers throughout their education. Even if that’s just 200 papers, at an average price of $15 per paper, this amounts to $3,000 – unaffordable to many.

Thankfully, there are ways around paywalls. The most popular such service these days is Sci-Hub, the brainchild of the scientist Alexandra Elbakyan, who still curates the service. The site’s minimalistic interface has a search bar that lets you look up a paper by URL or DOI. For the above paper, Sci-Hub promptly opened the pdf for the URL https://dl.acm.org/purchase.cfm?id=2517856. Note that publishers deem Sci-Hub a threat and keep trying to have its domains taken down; if sci-hub.se ever stops working, the website whereisscihub.now.sh can point you to its latest domain.

Sci-Hub’s website, available at sci-hub.se as of February 2020. Add a paper’s URL or DOI in the search box and get instant access.

Note

If Sci-Hub is unable to find a paper for you, then take a look at Library Genesis (often abbreviated as Libgen). The project has similar goals, providing a web frontend for a database full of scholarly articles and books. Another option is social media. On Twitter, people have started using the hashtag #icanhazpdf to ask for papers that other Twitter users will make available. The /r/scholar subreddit is also used to request scholarly literature.

The #icanhazpdf Twitter hash tag in action.

Note

Finally, you can often find a paper (or at least its preprint) by simply searching the web for the paper’s title. Many authors (including myself) make their research papers available on their personal websites, and I encourage you to do the same. People occasionally hesitate to put their papers online because it may violate the copyright agreement with their publisher; that may be true, but I’m not aware of anyone ever getting into trouble for it.

If all else fails, email the authors of the paper you need, and ask for a copy. Don’t worry about crossing a line: if anything, the authors will feel flattered that someone is showing interest in their work. Every author I’ve personally asked for a copy of their work (most of whom I have never met) has generously sent me one. Talent is everywhere but opportunity is not. Many of our colleagues cannot afford to keep up with the literature because science is being held hostage by greedy publishers that have failed to adapt to the internet. It is our duty to support our colleagues by making our work freely available.

Summary

  • Don’t feel obliged to read a research paper cover to cover. Know what you want to get out of it and focus on what matters to you.

  • Train yourself to question everything, all the time. Join reading groups and engage in (shadow) program committees to train your questioning skills.

  • Organise your reading somehow. A single BibTeX file can work surprisingly well.

Versioning

Have you ever worked on a project in which progress was reflected in file names? It may have started with report.pdf, changed to report-final.pdf, then to report-final2.pdf, and finally to report-submitted.pdf. Eventually, one is bound to wonder which document was the final version after all. Version control offers a way out of this mess.

I originally planned to mention version control as part of other chapters but realised that the topic is important enough to deserve its own chapter. If you are dealing with text files – be it writing, code, or configuration files – that change over time, consider putting them under version control. A project under version control makes it possible to attribute every single change to a person and a rationale; it documents the evolution of a project and ensures that nothing is ever lost.

Git has a steep learning curve. I transitioned to git after using Subversion, which lacks many of git’s concepts, and it took a lot of patience and practice to wrap my head around those concepts and become comfortable using the tool. Still, the efficiency, clarity, and flexibility that git adds to your workflow make it well worth the effort, and as with so many other tools, the Pareto principle comes to our aid: learning just 20% of git’s features lets you take advantage of 80% of its usefulness. You don’t need to master git; you only need a solid grasp of its basics.

The essentials of git usage are beyond the scope of this book. Several online resources can ease you into the basics, like Chacon and Straub’s freely available book Pro Git (Chacon and Straub 2014). While some understanding of the theory behind git is useful, one can only become proficient by using git. Use it on small programming projects, research papers, or configuration files – anything that is a text file can be versioned under git.

Useful git features

Git has a large number of powerful features. For example, a rebase operation allows for arbitrary rewriting of git’s history, including reordering, rewording, and changing past commits; and submodules make it possible to include in a repository another repository that the main one depends on. It’s perfectly possible to be a reasonably effective git user without knowing these powerful tools. This section instead introduces a small number of git features that are particularly useful in an academic setting.

See what changed between two commits

The run-up to a conference submission deadline typically ends in a last-minute scramble to get the research paper in order. On the day of the deadline, you may wake up to dozens of git commits from collaborators. Being a responsible colleague, you would like to find out what changed – at a glance, without skimming the entire paper looking for changes. Git can display the changes between two given commits:

git diff OLD NEW

Both OLD and NEW represent git commit IDs; you can find them in git’s log by typing git log. Simply provide the last commit that you have already looked at (OLD) and the most recent commit (NEW). Note that this command displays the changes made after OLD, up to and including NEW. If you want to see all changes starting at and including OLD, run:

git diff OLD^ NEW
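
Putting it together – with hypothetical commit IDs – the workflow looks like this:

git log --oneline           # find the relevant commit IDs
# 9fceb02 Tighten threat model wording
# a11bef0 Add evaluation section
git diff a11bef0 9fceb02    # show everything that changed in between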

Aliases

Git has aliases that allow you to save time by abbreviating frequently used commands.

For example, the following command creates an alias for git status which allows me to type the shorter git st:

git config --global alias.st status

Similarly, you can create a shortened alias for git diff as follows:

git config --global alias.di diff

Note that the aliases can refer to more complex invocations of git. For example, when I type git lo to inspect git’s log, the following alias is run:

git config --global alias.lo "log --oneline --decorate --graph --all"

Make git colorful

You can enable terminal colors for all your git repositories by running the following command:

git config --global color.ui auto

One can also control git’s colors in a more precise way.
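
For example, the following commands override the colors that git uses for removed and added lines in diffs. The configuration keys are built into git; the color choices are merely examples:

git config --global color.diff.old "red bold"
git config --global color.diff.new "green bold"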

Ignore irrelevant files

Depending on the kind of project that you are taming with git, you may have files – typically automatically generated – that change frequently but have no business being in version control. That includes files ending with .exe, .pdf, .pyc, or .tmp. Annoyingly, those files have the habit of cluttering the output of the frequently-used git status command. To instruct git to ignore those files, create a file called .gitignore in the root directory of the git repository, and add the following, which ignores all files ending with .exe, .pdf, or .pyc (the wildcard * matches arbitrary file names):

*.exe
*.pdf
*.pyc

If, instead, you only want to ignore a specific .exe file and all .pyc files, add the following to .gitignore:

specific-file.exe
*.pyc

For a more pleasant experience with LaTeX, I recommend using the following .gitignore file:

*.aux
*.bbl
*.blg
*.brf
*.log
*.out
*.pdf

Git blame

In every collaborative project, you will eventually stumble upon writing that is unclear or code that is broken. You would like to get clarification from whoever wrote the problematic fragment, but that’s not always easy if the text has been edited by multiple people over the weeks – and who has the time to go over git’s commit history, one commit at a time, to find out who made a very specific change?

The tool git blame solves this problem elegantly. Simply point it to a file under version control and, for each line, it reveals (i) the commit in which the line was last edited, (ii) the author who made the commit, and (iii) the time the commit was made:

git blame /path/to/file.tex

Below is an example from a research paper I co-authored:

8963593e (Philipp Winter  2016-08-17 17:35:02  8) Mixminion~\cite{Danezis2003a} eschew low latency in favor of
e8af40e4 (Tobias Pulls    2016-08-11 00:56:19  9) strong anonymity.
7f1ee0fd (Nick Feamster   2016-08-15 14:40:35 10) In contrast, Tor~\cite{dingledine2004a} trades off strong anonymity to
66bb1931 (Philipp Winter  2016-08-15 14:50:24 11) achieve low latency; Tor therefore
aecac2e6 (Nick Feamster   2016-08-09 11:19:24 12) enables latency-sensitive applications such as web browsing but is
aecac2e6 (Nick Feamster   2016-08-09 11:19:24 13) vulnerable to

The first column shows the commit ID as part of which the change was made; the second column shows the name of the person who made the change; the third column shows the time the change was made; and the last column shows the current content of the line you are inspecting.

Best practices

Every activity has its best practices. Some of them are so obvious that nobody bothers talking about them: one should warm up before intense physical exercise, and complex projects require a thoughtful plan. Applied to git, you may be experienced in forking repositories, creating branches, and fixing merge conflicts, but struggle with – or be unaware of – slightly more advanced topics, like figuring out who broke the build or adopting effective workflows. Below are a handful of simple best practices that minimise friction when using git and make you more effective:

  • Commit small, logically-connected changes: Try to commit your work in small, logically-connected chunks. If all you did was fix a bunch of typos in a document, commit those changes with the title “Fix typos.” (One can also overdo it: if you fix ten typos in a document, don’t create a commit for every single typo – that would be both unnecessary and distracting. Instead, fix all typos and create one commit, because those typos are a logically-connected chunk of work.)

    But don’t fix typos and fix a bug in the same commit – split those two into separate commits. The smaller your changes, the easier it is for your collaborators to reconstruct the work you did. It is common (and tempting) to fix several issues in one go and commit them all together. For example, while you’re in the process of fixing a bug, you may notice that there’s an unrelated unit test missing, and decide to add it. While not the end of the world, the unit test then ends up poorly documented in git’s history.
  • Only commit text files: Git supports binary files but does not handle them well because they cannot be meaningfully versioned. For example, two pdf files may differ, but the actual difference – mostly gibberish that is not human readable – is not useful and pollutes git’s log, distracting from more meaningful changes. Besides, binary files are typically large, which slows down the cloning of your repository. Avoid adding binaries like executables, pdf files, or tarballs to your git repository.

  • Use descriptive commit messages: It irks me more than it should when I see commit messages like “fix,” “bug,” or maybe even just “f.” Those messages tell you little to nothing about what happened in a given commit. Be as specific as possible in the first line: instead of “bug,” write “Fix bug caused by integer overflow.” Instead of “update,” write “Update dependency to latest version.” Instead of “added content,” write “Add paragraph with more related work.” Take a look at the commit guidelines of the Pro Git book for more details.

  • Split up your project into smaller files: Split your source code into several files (e.g., one for logging, one for data analysis, one for common functions) and split your LaTeX code into several files (e.g., one per chapter). This reduces the odds of running into merge conflicts, and keeps your project well-structured.

  • Tag significant commits: Some commits in your git repository will be significant: for example software releases (if you are versioning source code) or paper submissions (if you are versioning LaTeX code). Assign these commits a tag, so you can easily find them in your git history. Think of a tag like a commit with an arbitrary label. This label can be a version number (v1.2.3), milestone (conference submission), or anything else.

  • Learn how to fix merge conflicts: Yes, spend five minutes learning how to fix merge conflicts. Once you know how, you no longer have to minimise the odds of merge conflicts by coordinating who gets to edit a file, so that only one person edits at a time. This is a non-problem that version control solves, yet many “computer scientists” have yet to catch up because they have never truly understood git.
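
For reference, here is what a conflict looks like. Git embeds both versions in the affected file and marks the conflicting region; the sentences and branch name below are made up:

<<<<<<< HEAD
We analysed the data using a linear model.
=======
We analysed the data using a logistic model.
>>>>>>> alice-edits

Keep the version you want (or write a combined one), delete the three marker lines, and commit the result.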

Eventually, disaster will strike; when it does, the aptly named site ohshitgit.com offers solutions to commonly encountered git problems.

Practical examples

The above sections focused on isolated tips and tricks, and didn’t cover a comprehensive use case of how I use git in real-world projects. That’s the purpose of this section.

Research paper

All of my research papers have been under version control, starting with the first line I ever wrote. (Some of those papers’ LaTeX code is available online, to give you an idea of how other research teams use LaTeX. Note that I’m not always following my own advice because collaboration requires compromise.) You don’t have to rely on git’s (admittedly clunky) command line tool, or even GitHub (which offers free private repositories for academics), to put your paper under version control – many prefer the popular and user-friendly Overleaf, which provides at least some level of version control.

File naming

I recommend giving each section its own dedicated file. That means that the main file, main.tex, includes several other files, like 01_introduction.tex, by using the \input{01_introduction.tex} directive. (The numeric prefix in the file name guarantees that your files are ordered by section number in the file browser, and not by name.) I like to name the bibliography and appendix – typically the last parts of a paper – 98_bibliography.bib and 99_appendix.tex. Splitting up files makes for a more structured and pleasant writing experience.
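
To illustrate, a skeleton main.tex under this scheme might look as follows; the section file names are examples:

\documentclass{article}

\begin{document}

\input{01_introduction.tex}
\input{02_background.tex}
\input{03_method.tex}

\end{document}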

Commit tagging

In software development, git tags are used to assign version numbers to commits. In research papers, writing milestones can take the place of version numbers. These milestones can include paper submissions, uploaded technical reports, and drafts that you sent to your peers. The benefit of assigning tags to these milestones is that it’s straightforward to reconstruct what a paper looked like at a specific milestone. For example, if you sent your friend Alice a draft of your research paper in the hope of getting feedback, and you do the same thing one week later, Alice may appreciate a diff between the first and second draft. If you tagged the commits at which you sent Alice the drafts, creating the diff is straightforward.
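
Assuming that you tagged both drafts, creating the diff for Alice is a one-liner (the tag names are made up):

git diff alice-draft-1 alice-draft-2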

Below is an example of several tags that I assigned to some of the commits in a research paper. In chronological order, those tags tell the story of a paper’s life cycle, starting with publishing a technical report on arXiv (arxiv-submission-1, followed by an update called arxiv-submission-2); followed by an initial conference submission (wpes17-submission) that we withdrew; and another conference submission (fc18-submission) that got accepted (fc18-final). (As the tags suggest, we first submitted to the Workshop on Privacy in the Electronic Society, and then to the Financial Cryptography 2018 conference.)

* b09424f - (tag: fc18-final, origin/master) s/people/developers/ (1 year, 11 months ago) <Philipp Winter>
* 6b85bb1 - (tag: fc18-submission) discrete log stuff (2 years, 2 months ago) <laurar>
* da0eafd - (tag: wpes17-submission) Reformatting and rephrasing. (2 years, 3 months ago) <Philipp Winter>
* 15bf503 - (tag: arxiv-submission-2) Clarify that no weak keys are currently active on the Tor network. (2 years, 7 months ago) <George Kadianakis>
* fff2964 - (tag: arxiv-submission-1) Yorgos's suggested changes to conclusion (2 years, 8 months ago) <laurar>

Commit often

There is nothing wrong with a commit that changes a single character. The following example changes the formatting of a table, which I deemed worth a dedicated commit. Small commits make it significantly easier to understand how a file evolved, and are less likely to cause merge conflicts that are complicated to address.

commit 10083ec74381636bc825834e2ff9ce2621e17a77
Author: Philipp Winter <phw@nymity.ch>
Date:   Mon Dec 11 14:25:18 2017 -0500

    Center quantity to ease readability.

diff --git a/paper/results.tex b/paper/results.tex
index 01eb2ea..0edd0e1 100644
--- a/paper/results.tex
+++ b/paper/results.tex
@@ -217,7 +217,7 @@ private use, we are comfortable publishing them.
        attacked while the third column shows the duration of the attack.}
        \label{tab:targeted}
        \centering
-       \begin{tabular}{l r l}
+       \begin{tabular}{l c l}
        \toprule
        Onion service & Replicas & Attack duration \\
        \midrule

Software configuration

My email client, mutt, is configured via text files. These configuration files can become quite complex and change occasionally – for example when I add new contacts to my address book, join new mailing lists, or remove email accounts as part of transitioning to new jobs. While configuring mutt, I sometimes break its configuration, which may not be immediately obvious because I only occasionally use the feature I broke. But once I notice that something is amiss, git’s log is ready to be inspected, and its last commit is typically the culprit. This is possible because my entire configuration is a git repository and each change I make is accompanied by a git commit. Here are the last five commits I made to my configuration files:

* e1311c0 - (HEAD -> master) Update Sina's email address. (7 days ago) <Philipp Winter>
* 40daaee - Add alias for Matt. (3 weeks ago) <Philipp Winter>
* 7c58c19 - Add new mailbox. (3 weeks ago) <Philipp Winter>
* 7c30bb6 - Always send encrypted email to contact. (5 weeks ago) <Philipp Winter>
* 5a1a3f8 - Re-configure hotkeys. (5 weeks ago) <Philipp Winter>

Summary

  • Keep it simple. Don’t use any of git’s powerful features unless there is a real need.

  • Commit early and commit often. This makes the project’s evolution easy to follow and minimises the odds of merge conflicts.

  • Tag significant commits to make it easy to find and compare them later on.

Writing

It is very common for researchers to experience writing as the most frustrating part of their work. After all, people start careers in research because of the actual research and not the act of writing up results. I believe that part of the frustration with writing comes from poor strategy and organisation: writing is seen as a necessary evil that happens last minute; the idea of creating twelve pages worth of content triggers anxiety; and the obscure nature of LaTeX makes everything worse.

This chapter discusses ways to make the process more structured, effective, and – perhaps most importantly – less painful.

Start writing

What is the point of writing tips, you may ask, if you cannot get yourself to write in the first place? We have all been there, many times. Writing is a deeply creative process and therefore difficult to invoke on demand. There are, however, some hacks that help with getting you to write.

Write consistently

Have you ever accomplished something hard and meaningful, only to have people tell you that they wouldn’t have the motivation to do the same? That’s nonsense. Motivation doesn’t help you achieve goals because it comes and goes, and nobody is always motivated. You accomplished your goal because you’re disciplined. When motivation is nowhere to be seen, discipline is what keeps you going. Don’t conflate these two concepts.

And as you may expect, discipline and consistency are also the keys to getting writing done. Motivation alone won’t get you far. I’m writing these very words while feeling unmotivated to work on this book. It’s a sunny Saturday morning here in San Diego and I would rather be outside. However, each day of my todo list has “Write on book,” so that’s what I’m doing. If I always waited for motivation to strike, I would never get anything done. Sometimes I only edit two sentences and sometimes I spend an hour adding a new chunk of text. There are good and bad days, but you have to keep showing up and doing the work, even if it’s just a little. I don’t have much writing to show on any given day, but small and steady improvements keep adding up. You would be surprised by the amount of progress you can make by simply being consistent.

Consistent writing not only helps your progress, it also helps your writing! I strongly believe that it is impossible to churn out good writing in a short amount of time. Instead, good writing ages, like an expensive bottle of wine. I wrote my best research papers over multiple months, in small increments, editing my writing frequently. Each iteration made the writing a tiny bit better. On some days, all I did was add two or three sentences or rephrase the caption of an image. On other days, I felt more creative and worked my way through several paragraphs, maybe even sections. Don’t let anyone or anything fool you into thinking that you’re not a good writer. Just keep at it, one day at a time, and you will eventually have great writing to show. Trust me on that.

Capitalise on creativity

Writing is a deeply creative process but unfortunately one cannot force creativity. When it finally does show up, it is important to make the best of it. In the words of the insightful Naval Ravikant: “Inspiration is perishable – act on it immediately.” Open your text editor, brew yourself a delicious cup of coffee, and get to work! (The coffee is not optional. To engage in a task you don’t particularly enjoy, you need to make it more attractive. Applied to writing, this could mean getting your favourite non-alcoholic beverage, listening to relaxing music, or writing in the park, under the sun.) Ignore your surroundings as much as you can until you feel your creativity wane. I argue above that you cannot rely on motivation alone but if you do feel motivated, make the best of it.

I produce 80% of my work in 20% of my time (the Pareto principle all over again). That is not because I’m lazy and waste 80% of my time, but because that 20% is when I’m particularly focused and creative, able to produce work at a quality and quantity that is out of my reach on most days. Learn to identify these periods of peak creativity and don’t let them go to waste!

Living documents

I have met many researchers whose idea of writing is to wait until three days before a submission deadline and then engage in a manic, caffeine-fuelled sprint, churning out twelve pages. Needless to say, the output is always underwhelming. In addition, the prospect of having to write a full paper in a short amount of time is daunting and agonising, which leads to even more procrastination – a negative feedback loop.

I encourage you to instead space out your writing process over time, and let your papers evolve. Try to write a little whenever you feel like you have something to say. Capitalise on your creativity. I treat my research papers as living documents. One of the first things I do when starting a new research project is to create a paper.tex file. I use it to jot down bullet points about research ideas, how these ideas connect, and how they can ultimately be presented. These bullet points eventually turn into sentences, paragraphs, and then a finished paper. Once I start writing code or collecting measurement results, I try to distill key results and start thinking about a narrative.

I find it helpful to approach a research paper by focusing on its key messages. What are your 1–3 core insights? What evidence do you have that supports these insights? The rest of the paper is then spun around these key messages. Not only will your writing quality improve drastically, you will also look at the writing process in a more favorable way, which leads to more writing – a positive feedback loop.

Modular decomposition of a research paper

A finished research paper can evoke emotions of satisfaction or anxiety – depending on where you are in the writing process. Creating a finished product may seem overwhelming, but remind yourself that you won’t be tackling it all at once; similar to how you don’t climb a tall mountain all in one go, you’ll take breaks, rest, and make small, steady, consistent progress. It’s much easier to make that progress by breaking a research paper into separate modules, making them easier to tackle. We use the concept of “modular decomposition” in programming to break down programs into separate modules that interact with each other, and you can do the same with research papers. A research paper consists of several sections: typically an introduction, related work, presentation of your method, experimental results, and so on. Each of these sections consists of several subsections. For example, your experimental results may consist of subsections on experimental setup, data pruning, and visualisation. Drafting a subsection on data pruning is significantly less daunting than writing the entire section on experimental results.

Once you’ve broken your research paper into pieces, tackle these pieces in isolation. Ask your advisor or collaborators to help you decompose your paper into pieces.

All of this applies to any creative work, not just writing. I put it in the writing section because writing happens to be the activity that most people struggle with.

Write better

Many books have been written on writing in general, and academic writing in particular. Below are the rules that I consider the most important:

  • Delete unnecessary words. Many people often make their academic writing very fluffy, which makes it difficult and tedious for a reader to read. Notice how awful that sentence was? Let’s delete the unnecessary words: People make their writing fluffy, which makes it tedious to read.

    Go over your writing sentence by sentence and delete every word that isn’t necessary. Removing the fluff from your writing makes it substantially easier to read. Most college students (including yours truly) eventually pick up the annoying habit of adding unnecessary words to sound sophisticated and pad their writing assignments. It is time to unlearn this habit.

  • Use active instead of passive voice. Instead of “the data was analysed,” write “we analysed the data.” Instead of “it has been shown by Turing,” write “Turing showed that.” Excessive use of the passive voice is yet another habit that’s particularly widespread in academia. Don’t imitate the lifeless and boring writing of your peers.

  • Engage your audience. Never start your paper with something entirely obvious and uncreative like “The Internet has become the world’s biggest communication network.” You are more creative than that! Nobody expects pearls of wisdom at the very beginning of a paper, but actively try to make your paper as engaging to your reader as possible. Take a look at the introduction of a seminal paper in the field of cryptography (Diffie and Hellman 1976). Its first sentence – “We stand today on the brink of a revolution in cryptography.” – is a classic.

Note

Finally, keep in mind that good writing is often subjective, and not everyone will agree with my advice. I once got this feedback on one of my paper submissions:

First, authors abuse of questions in the paper. Writing a scientific paper is not writing a “thriller”. No suspense is required (this makes the reading pretty much annoying). We only have to write about facts.

Reviewer A

Reviewer A is the kind of person who starts their paper by pointing out how important the Internet has become. Don’t be like reviewer A. Bring some colour to your writing.

Ask for feedback

Good writing rarely happens in isolation. Even professional novelists have editors and actively seek feedback. In academia, advisors often assume the role of an editor (if you’re lucky), but friends and colleagues can also provide feedback. As a Ph.D. student, my friends and I would often share drafts with each other even though we worked in slightly different subfields. I would review a friend’s applied cryptography paper and, while I didn’t understand every single aspect of it, I learned a lot and provided an important perspective: that of a potential reviewer.

Feedback from someone who’s not intimately familiar with your research is so important because it’s not affected by the “curse of knowledge” – a cognitive bias that emerges when professionals have difficulty putting themselves in the shoes of a newcomer.

When asking your peers for feedback, keep in mind that someone can read your work for the first time only once. The first reading is particularly important because the reader approaches your writing with a fresh mind and an unbiased frame of reference. Don’t waste these opportunities. Instead of sending your very first draft to everyone you know, send it to one, or maybe two, people. Pay close attention to their feedback, address it, and then send the revised draft to the next person. After incorporating this “chained” feedback, your paper will end up much stronger than it would have, had you gotten feedback on a single revision of your paper.

Some of your colleagues may be inexperienced and not know exactly what they should pay attention to when reading your work. In that case, help them by letting them know what you want feedback on. Maybe that’s the narrative in the introduction, or the intelligibility of your method section, or the informative value of your diagrams in the results.

Write effectively

Writing effectively means writing in LaTeX. While LaTeX tries hard to produce good-looking output, most computer science papers are still poorly typeset. (This should come as no surprise: hardly anyone is trained in typesetting, and in computer science, unlike in other fields, we are responsible for typesetting our own papers.) But again, by embracing a handful of rules, you can greatly improve the visual clarity of your writing.

Use LaTeX comments

LaTeX interprets lines that begin with a % character as comments. I recommend using comments in the following scenarios:

  • Your paper’s results may contain several numbers, e.g., the number of measurements that you collected, the false positive rate of your classifier, or the execution time of your algorithm. In my experience, these numbers frequently change as you revise your paper because you collect more measurements, improve your classifier, or make your algorithm faster. I like to add source code comments that list the commands that I need to run to update these numbers; see the example after this list.

  • I like to use “todo” comments like

    % TODO: Compare our approach to Turing et al.

    as reminders of what needs to be done in parts of the paper. In some collaborative projects, we assign people-specific todo items like

    % TODO (Philipp): Update numbers in the table below.
  • Each word, paragraph, and section must have a purpose. If it has no purpose, delete it. I like to use comments to remind myself of the purpose of each paragraph. It helps me make sure that there’s a logical chain of argument that advances throughout the paper. For example, the introductory paragraphs of my paper may look like this:

    \section{Introduction}
    % State the problem.
    ...
    % How have others tried to solve the problem?
    ...
    % How do we intend to solve the problem?
    ...
    % How does our solution compare to others?
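
To illustrate the first scenario, a comment that documents how to update a number might look like this; the file name and command are made up:

% To update the false positive rate below, run:
%   python3 analyse.py --false-positives data/measurements.csv
Our classifier has a false positive rate of 2.4\%.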

Balancing references

The package balance balances the two columns on the last page of your references, making them visually more pleasing. Include the package by adding \usepackage{balance} to your list of packages and use it by adding \balance before you include your references. Without balance, the last page’s columns end at uneven heights; with it, both columns end at the same height.
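
In a typical two-column paper, the relevant bits are a one-line addition to the preamble and a \balance command on the last page, right before the references; a minimal sketch:

\usepackage{balance}

...

% On the last page, right before the references:
\balance
\printbibliography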

Better tables

Which table looks better? The one on the left or the one on the right? We both know it’s the one on the right! Notice how horizontal and vertical lines are almost absent, removing visual clutter. Both the top and the bottom of the table have more spacing, making the table less cramped, and the quantities in the two rightmost columns are right-aligned instead of centered, rendering the numbers easier to compare.

The booktabs package helps you build more pleasing tables. Definitely read its documentation because it does a great job of explaining how to properly typeset tables. For a quick reference, here’s the LaTeX code for the better table on the right.

\begin{table}
\begin{tabular}{lrr}
\toprule
 Age & \# & \% \\
\midrule
 18–25 & 186 & 35.9\\
 26–35 & 180 & 34.8 \\
 36–45 & 87 & 16.8 \\
 46–55 & 43 & 8.3 \\
 56–65 & 16 & 3.1 \\
 >65 & 3 & 0.6 \\
 n/a & 2 & 0.4 \\
\bottomrule
\end{tabular}
\end{table}

Backreferences

It is possible to add backreferences to your references. At the cost of a little extra space, backreferences list the pages that cite a given reference. Backreferences are not absolutely necessary, but they are a nice feature because they make it easier to find a given reference in the text. People often look for their own work in a paper’s reference section because they are curious about the context in which their work is cited. Backreferences make this easier.

An example of backreferences for three papers. Note the “Cited on” right after the URLs. The hyperref package makes these page numbers clickable, making it easy to jump directly to the respective references. The abbreviation “p.” is short for “page” and “pp.” is short for “pages.”

Note

There are several ways to implement backreferences. One option is to use the popular hyperref package’s pagebackref option as follows:

% For clickable links.
\usepackage[pagebackref=true]{hyperref}

% Add custom text right before backreferences in literature.
\renewcommand*{\backref}[1]{}
\renewcommand*{\backrefalt}[4]{
   \ifcase #1
      Not cited.
   \or
      (Cited on p.~#2)
   \else
      (Cited on pp.~#2)
   \fi}

Self-contained diagrams

The TikZ package lets you typeset diagrams directly in LaTeX. TikZ diagrams are nice because they are very small, they use LaTeX fonts (and therefore look less clunky), and they can be created with just a text editor. On my personal website I maintain several TikZ examples that can help you get started. Granted, it does take time to learn TikZ, but I recommend it if you value high quality typesetting.

Note how this TikZ diagram integrates well by sharing a font with the surrounding text. It is typeset directly in LaTeX, making it lightweight and visually appealing.

Note

The R package tikzDevice can export plots – including those created with the popular plotting library ggplot2 – as TikZ code rather than, say, pdf. You can then embed the TikZ code in your research paper. You can use this feature by adding the following line to your R script:

library(tikzDevice)
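
A minimal sketch of how this could look; the file name and plot are made up:

library(tikzDevice)
library(ggplot2)

# Write the plot to figure.tex as TikZ code.
tikz("figure.tex", width = 3.25, height = 2)
print(ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point())
dev.off()

In your paper, you can then embed the diagram with \input{figure.tex}.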

Common LaTeX mistakes

When reviewing research papers, I see the same kinds of LaTeX mistakes over and over again. Fortunately, they are easy to fix.

  • Use thousands separators to make large numbers easier to read.
    Bad: 1000000
    Good: 1,000,000 (or 1.000.000, depending on your language)

  • Use a ~ to prevent dangling references.
    Bad: Newton et al. [1]
    Good: Newton et al.~[1]

  • Citations are not nouns.
    Bad: as discussed in~[1]
    Good: as discussed by Newton~[1]

  • Use proper LaTeX quotation marks.
    Bad: "Foo"
    Good: ``Foo'' (note that quotation marks are language-dependent.)

  • Reference more specific parts of a paper if possible.
    Bad: See Newton et al.~\cite{Newton}
    Good: See Newton et al.~\cite[\S~5]{Newton}

Don’t miss the comprehensive typesetting guides of Eddie Kohler, Markus Kuhn, and D. J. Bernstein to learn more about effective and beautiful typesetting in LaTeX.

Also take a look at How to write in plain English.

Building LaTeX papers

LaTeX papers with references are cumbersome to compile – most people remember the process as “a few runs of pdflatex plus a few runs of bibtex.” Wouldn’t it be much simpler if all you needed to do was type make? Here’s how: I recommend using a compilation tool such as rubber plus a Makefile. Below is an example of a Makefile that I typically use for all of my research papers. The Makefile assumes that your root document is called paper.tex.

PAPER=paper.tex
FIGURES=$(wildcard figures/*.pdf)
DOCUMENTS=$(wildcard *.tex)

all: pdf

pdf: $(DOCUMENTS) $(FIGURES)
    GS_OPTIONS=-dPDFSETTINGS=/prepress rubber -f --pdf -Wrefs -Wmisc $(PAPER)

clean:
    rubber --clean $(PAPER)

The environment variable GS_OPTIONS ensures that all fonts that the paper uses are embedded, so the pdf looks the same on each machine, no matter what fonts are installed. This is a requirement of many conferences and generally a best practice.

When using this Makefile, the indented lines containing the two rubber commands must be prefixed by a tab character and not by spaces. Take a look at chapter TBA to learn more about creating Makefiles.

Makefiles are powerful and great for tasks that involve repeated processing of files. I use a Makefile to compile this book from the markdown format to HTML, epub, and pdf, and also to automatically publish new drafts. The Makefile’s target is index.html – the HTML file I want to create. The prerequisites are book.md, pandoc.css, references.bib, and metadata.xml – the source files that are necessary to produce the HTML file. Finally, the recipe is an invocation of the tool pandoc, which converts my markdown file to an HTML file.

all_input = book.md pandoc.css references.bib metadata.xml

html_output = index.html
epub_output = ebook.epub
all_output = $(html_output) $(epub_output)

publish_files = index.html pandoc.css img
publish_dir = ~/web/nymity.ch/book

pandoc_flags = --toc --standalone --css=pandoc.css --bibliography=references.bib --filter pandoc-citeproc

all: $(all_output)

$(html_output): $(all_input)
    pandoc $(pandoc_flags) book.md -o $(html_output)

$(epub_output): $(all_input)
    pandoc $(pandoc_flags) --epub-metadata=metadata.xml book.md -o $(epub_output)

.PHONY: clean
clean:
    -rm -f $(all_output)

.PHONY: publish
publish: $(html_output)
     @cp -r $(publish_files) $(publish_dir)
     ~/web/nymity.ch/deploy_website.sh

Whenever I add more content, I type make, which compiles the source files into an HTML file that I have open in my browser. If I type make and nothing has changed since the last build, I see:

$ make
make: 'index.html' is up to date.

A Makefile can also contain rules that are not about compiling input into output files. To share drafts of my book, I upload it to my personal web server. This involves copying the relevant files into a directory that contains my websites, and then invoking a script that syncs web content from my laptop to my web server. All of this happens simply by running make publish. If the book’s output formats don’t currently exist, make will first compile them (hence the prerequisite on $(html_output)). Then, an invocation of cp copies the book’s HTML files to another directory on my laptop and, finally, I invoke the script that uses rsync to sync all files to my web server.

If you are not a fan of command line tools, you can still benefit from LaTeX by using one of its online development systems. Overleaf has been popular among some of my collaborators.

A LaTeX template

Below is a LaTeX template that I use for research papers. When submitting a paper to a conference, you typically have to use the conference style. You can add that to the template, but you may also have to change or remove parts of the template, depending on how restrictive the conference style is.

Note that \input{introduction} is replaced with the contents of introduction.tex. I find it convenient to outsource sections to separate files because it makes the paper easier to manage. It also helps with version control if multiple people are working on the paper.

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[scaled=0.8]{beramono}
\usepackage[T1]{fontenc}

% For pretty tables.
\usepackage{booktabs}
% Also for pretty tables.
\usepackage{multirow}
% For using colours.
\usepackage{xcolor}
% For clickable links and back-references in the references.
\usepackage[pagebackref=true]{hyperref}
% For smart spacing in custom commands.
\usepackage{xspace}
\usepackage{amsmath}
% For embedded figures.
\usepackage{tikz}
\urlstyle{tt}

% Bibliography.
\usepackage[backend=biber,backref=true]{biblatex}
\addbibresource{literature.bib}
\renewcommand*{\bibfont}{\footnotesize}

% Add custom text right before backreferences in literature.
\renewcommand*{\backref}[1]{}
\renewcommand*{\backrefalt}[4]{
   \ifcase #1
      Not cited.
   \or
      (Cited on p.~#2)
   \else
      (Cited on pp.~#2)
   \fi}

\definecolor{darkblue}{rgb}{0,0,0.4}
\definecolor{lightgray}{rgb}{0.93,0.93,0.93}

\newcommand{\papertitle}{Your paper's title goes here}
\newcommand{\paperauthor}{Alice and Bob}

\title{\papertitle}
\author{\paperauthor}

\hypersetup{
    colorlinks=true,
    urlcolor=darkblue,
    linkcolor=darkblue,
    citecolor=darkblue,
    pdftitle={\papertitle},
    pdfauthor={\paperauthor},
    pdfkeywords={foo, bar},
}

\begin{document}

\maketitle

\input{introduction}

...

\printbibliography

\end{document}

Pre-submission paper checks

Conferences and journals almost always have specific requirements that paper submissions need to satisfy. It’s frustrating to have your paper rejected for unnecessary reasons like formatting violations, so it’s a good idea to spend five minutes checking the conference’s requirements before pressing the “submit” button.

  • Make sure that your paper is within the page limit. The page limit sometimes includes and sometimes excludes references or appendices, so read carefully.

  • LaTeX shows broken references as question marks. Do a Ctrl + F for the string [?] to find broken references.

  • Make sure that all fonts were properly embedded in your pdf. On Linux, I use the tool pdffonts which is part of the Debian package poppler-utils. I run it as pdffonts file.pdf and it displays a column called “emb,” which shows whether a given font is embedded. While using pdffonts to write this paragraph, I realised to my dismay that one of my old papers did not embed all of its fonts:

    $ pdffonts Winter2012a.pdf
    name                                 type              encoding         emb sub uni object ID
    ------------------------------------ ----------------- ---------------- --- --- --- ---------
    GJYVBN+NimbusRomNo9L-Medi            Type 1            Custom           yes yes no     100  0
    NLMFQI+NimbusRomNo9L-Regu            Type 1            Custom           yes yes no     101  0
    XNJNRQ+NimbusRomNo9L-ReguItal        Type 1            Custom           yes yes no     102  0
    ZZEWFV+CMSY10                        Type 1            Builtin          yes yes no     103  0
    UIPGCJ+CMTT8                         Type 1            Builtin          yes yes no     127  0
    Helvetica                            Type 1            Custom           no  no  no     174  0
    Helvetica                            Type 1            Custom           no  no  no     180  0
    HNYWOO+StandardSymL-Slant_167        Type 1            Builtin          yes yes no     203  0
    JHYTSG+CMR10                         Type 1            Builtin          yes yes no     204  0
    CUJHND+CMMI10                        Type 1            Builtin          yes yes no     205  0
    ZapfDingbats                         Type 1            ZapfDingbats     no  no  no     211  0
    Helvetica                            Type 1            Custom           no  no  no     212  0
    Helvetica                            Type 1            Custom           no  no  no     218  0
    XEQPPW+CMTT10                        Type 1            Builtin          yes yes no     242  0

Use git

LaTeX files are text files, which makes them prime candidates for version control. I recommend putting all of your LaTeX source files into a git repository. (It doesn’t matter if you prefer Subversion, CVS, or Mercurial over git – what matters is that you use some sort of version control. I like git because it has emerged as the most popular system, and with that comes great documentation and tooling. Also, most people you collaborate with will have at least some understanding of it.)

Having your paper under version control has several advantages:

  • No writing is ever lost. Whatever you remove during editing is part of git’s history and can always be recovered.

  • You can easily determine the differences between two versions of your paper, making it easy to produce a pdf that highlights them.

  • You can tell who changed what.

Use tags for milestones

A specific git commit can be assigned a “tag,” which is an arbitrary label. Git tags are often used for version numbers – when you publish a new version of your software, you assign the latest commit a tag like “0.2.4.” But you can use tags for other purposes. I like to tag important milestones of my writing, for example when I submit a paper to a conference, or to the arXiv, or when I publish the final camera-ready version. You can even assign a tag to remember when you sent your paper to your advisor for feedback.
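
Creating an annotated tag and sharing it with your collaborators takes two commands; the tag name is just an example:

git tag -a ndss17-submission -m "Paper as submitted to NDSS'17"
git push origin ndss17-submission

The log excerpt below shows what such milestone tags look like in one of my papers.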

* 5de077a - (tag: ndss17-camera-ready) added cs to my email (3 years, 7 months ago) <laurar>
...
* 2cd29b1 - (tag: arXiv-resubmission-1) fixed last paragraph of internet scale section based on corrected plots (3 years, 9 months ago) <laurar>
...
* fabf1e3 - (tag: arXiv-submission) Turn passive into active voice. (3 years, 10 months ago) <Philipp Winter>
...
* 2187ef7 - (tag: NDSS-submission) Minor style harmonization and spelling fixes. (3 years, 11 months ago) <Philipp Winter>

Learn who changed what

With multiple people working on the same project, you will occasionally notice mistakes in the writing that may require discussion. Instead of asking all of your collaborators who’s responsible for a given piece of writing, you can find out yourself by using git’s “blame” functionality. When you run git blame FILE, the output is an annotated version of the file that shows, for each line, when it was last changed, by whom, and as part of which commit.

Help git do its job

Remember to make one change per commit. Here are a few examples in the context of research papers:

  • Fix one or more typos. If somebody is proof-reading an entire paper, it’s fine to have a single commit that fixes many (or all) typos in the paper.

  • Add a reference. Many claims need to be supported by references. Such a commit may add a new reference to the BibTeX file and then reference it in the corresponding LaTeX file.

  • Rephrase a paragraph or section. You may not like the way a paragraph (or entire section) is phrased. The action of rephrasing this paragraph or section should go in one commit. If you want to rephrase several pages worth of writing, consider using multiple commits.

  • Add more writing. Adding a coherent argument, paragraph, or section should go into a single commit. Adding two independent paragraphs to two separate sections should go into two commits.

  • Delete text to meet a page limit. Papers must sometimes be trimmed to meet a page limit. Unless it severely cripples the paper, it’s fine to do this in a single commit.

Note that making small changes is not always possible or reasonable. As you are rewriting a paragraph, you may realise that the rewrite only makes sense if you also rewrite the paragraphs before and after. This is fine. The above recommendations are just that: recommendations.

I personally find it helpful if paragraphs of text are broken into several lines spanning a maximum of 80 characters, instead of a single line of text. This makes it easier to inspect commit messages and understand what change was made. Consider the following example:

@@ -1 +1 @@
-This is a paragraph that consists of a single, continuous line of text.  Such long lines can make it cumbersome to determine what has changed in a lengthy diff.  Instead, consider breaking a single long line into multiple lines that end at, say, 80 characters.
+This is a paragraph that consists of a single, continuous line of text.  Such long lines can make it cumbersome to determine what has changed in a lengthy diff!  Instead, consider breaking a single long line into multiple lines that end at, say, 80 characters.

Only a single character changed in this paragraph, which is formatted as one line. It’s difficult to see what changed because the line is so long.

@@ -1,4 +1,4 @@
 This is a paragraph that consists of a single, continuous line of text.  Such
 long lines can make it cumbersome to determine what has changed in a lengthy
-diff.  Instead, consider breaking a single long line into multiple lines that
+diff!  Instead, consider breaking a single long line into multiple lines that
 end at, say, 80 characters.

Here, the same paragraph (and the same change) is formatted as separate lines. It’s easier to see what character was changed in this commit.
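
If a repository contains long lines regardless – perhaps your collaborators don’t share your formatting habits – git can highlight changes at word rather than line granularity:

git diff --word-diff OLD NEW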

Programming

Analogous to freeware and shareware, there exists the term conferenceware – a mildly derogatory term that refers to the type of software that’s typically published as part of a research paper. Conferenceware is abandoned, outdated, poorly documented, and written in haste. It’s often frustrating to use someone else’s conferenceware (the pride and accomplishment that academics experience when somebody reads their paper quickly turns into shame and defensiveness when somebody studies their code) and, worst of all, badly written code jeopardises the correctness of the science. A simple bug can result in incorrect data and misleading conclusions.

This chapter begins by introducing general guidelines for programming that are helpful regardless of what kind of programming you do, followed by advice specifically for data analysis and systems building. Roughly speaking, academic programming falls into one of those two categories. Data analysis projects start with the collection of data sets (which may already involve some programming), followed by the analysis of the data, which typically involves code to parse, clean, and process the data set. Systems building projects invent new systems or improve existing ones. Examples are the creation of a new routing algorithm, the addition of new security technology to the Linux kernel, or the invention of a distributed system for file sharing.

In data analysis projects, one’s focus is to measure a phenomenon by collecting and analysing data. This requires a slightly different skill set than building systems, where the focus is to write complex prototypes and potentially integrate them into existing, even-more-complex software.

Best practices

This section distills a number of best practices that I consider essential in academic programming. The advice comes from having made all of those mistakes myself, and from reading other people’s code that was better than mine.

Avoid functions that do too many things at once

Imagine you are working on a parser that takes as input a file and returns structured data. There are several ways to write the code that accomplishes this, but one may be tempted to squeeze all of that functionality into a single function, as illustrated below:

def analyse_file(file_name):
    total = 0
    with open(file_name) as fd:
        for line in fd:
            line = line.strip()
            # Skip empty lines.
            if line == "":
                continue
            total += int(line)
    print(total)

analyse_file("filename")

The problem with the above code is that one function does everything: it opens the file, parses its content, and analyses the data. That may be fine for a quick prototype but if your code is going to evolve over time – and data analysis code has a habit of doing so – it’s best to break the code into several functions because that allows for faster and safer changes. For example, the way your code ingests data may change from reading a file to reading from the network, which requires a comprehensive change to the analyse_file function. Instead of cramming several vaguely related tasks into one function, split them up into several functions. In our example, one can intuitively split our monolithic function into three separate functions that read, parse, and analyse our data:

def read_file(file_name):
    # Ingest the raw data.
    with open(file_name) as fd:
        return fd.readlines()

def parse_file(raw_data):
    # Turn raw lines into a list of numbers.
    content = []
    for line in raw_data:
        line = line.strip()
        if line == "":
            continue
        content.append(int(line))
    return content

def process_file(parsed_data):
    # Analyse the parsed data.
    total = 0
    for elem in parsed_data:
        total += elem
    print(total)

process_file(parse_file(read_file("filename")))

Observe that the control flow now resembles an actual pipeline: we call read_file and pass its output as input to parse_file, and do the same with process_file. The modular code is significantly more reusable and safer to modify. For example, if the format of your data changes, you can jump straight to modifying parse_file because that’s where the parsing happens; you don’t have to go through the error-prone process of finding the relevant parsing code in a long function. (This may not be bad in our strawman example but real-world data analysis code is significantly more complex.) Similarly, if you want to change how your code ingests data – perhaps the data should come over the network instead of from a file – you can implement a new read_file function; the remaining code need not be touched.

Note that the above is standard programming advice, often referred to as functional decomposition, which is the process of breaking down a complex function into its smaller, simpler components. This is generally best practice but I find it particularly important in the setting of data analysis.

Document encountered issues

Whenever you work with code, you are bound to run into issues. It’s not a matter of if but when. You may encounter bugs, library conflicts, or code that only works on a specific architecture. Whenever you spend more than five minutes solving one of those problems, document it. It only takes a minute to add a few sentences to your personal work log. All that’s necessary is something along the lines of “tried to get library X to work but it didn’t work because of Y. I then tried Z and managed to get it to work.”

There’s a non-zero chance that you (or a colleague) will run into the same, or at least a reasonably similar, problem in the future, and it’s better to rely on documentation than on the accuracy (or lack thereof) of your memory.

Organise your directory structure

If your code has grown beyond a very simple prototype, avoid placing all files in the same root directory because that gets messy very quickly, and it will be difficult to find specific files. One way to organise files is by their purpose, e.g.:

  • bin/ for executable files. Your users should be able to run your tool by executing a file in this directory.
  • doc/ for documentation. This directory typically contains extensive documentation like technical specifications or automatically-generated source code documentation.
  • src/ for source code.
  • test/ for unit and integration tests.
  • README for an overview and usage instructions.

My Python tool exitmap uses this directory structure and shows how a Python project can be structured that way.

Make your code public

It is common in academic research to treat code and writing as secret until publication. Code only exists on the laptops of the researchers or in private git repositories – otherwise, somebody would take it and rush to publish a paper before you; or so the folklore goes. If somebody asks for the code, the best that one can typically expect is a copy of the code, together with a plea to not share it further. The fear of getting scooped incentivises researchers to be very careful about who they talk to about their work. This concern doesn’t come out of nowhere: the intense pressure to publish can bring out the worst in people, to the point of “stealing” others’ ideas or fabricating data.

Scooping is not the only issue. It is also intimidating to publish code – or anything, really. By making your brainchild available for everyone to inspect, you are exposing yourself and taking a risk. What if people find your code inefficient and judge you for it? The good news is twofold. First, it is very uncommon to attract negative attention for publishing free software. Second, you do get used to publishing work. As intimidating as it may seem in the beginning, it does get easier, to the point where it is routine. The work output of my last two jobs at Brave Software and The Tor Project happened almost entirely in the open: we coordinated over email, IRC, and bug trackers, and code was free by default. I have eventually grown used to this kind of philosophy but it used to be foreign and intimidating to me. Imposter syndrome is widespread, and few things are more intimidating than voicing an idea in public, surrounded by competent people who call bullshit when they see it.

Here’s another way to look at it: Free software is a community effort. You don’t get to complain about software that’s freely available. If you are unhappy with it, fix it. The software is free, after all. People have been unhappy with my research prototypes many times, and I am fortunate to have received numerous patches over the years. I was always flattered to learn that somebody cared enough about my software to write a patch for it. Maintaining a popular library or measurement tool can not only be fulfilling but also provide you with a very real advantage: people will turn to you for help (and as a result, you may end up co-authoring papers) or at the very least cite your work in their papers. Not a lot of people publish their code proactively, and doing so sets you apart from your peers.

You may be rationalising your secrecy by telling yourself that nobody would ever be interested in using or reading your code. You may think your code is too niche, too naive, too slow, or too clunky. Give yourself more credit than that. If nobody has done before what you are doing now, consider publishing your code: somebody may face the same problem in the future and will be very glad to have stumbled upon your work. Sure, it may be missing a feature, or be a bit too slow or buggy, but some code is typically better than no code.

Unless there are good reasons not to, consider making your code public from the very beginning. The longer you wait, the more reasons you will find to not publish your code. Don’t wait and take the risk. I promise, it is worth it.

Use libraries

When faced with the need to parse data, use a network protocol, or anything, really, I used to feel tempted to write the code myself, from scratch. While that was often fun and educational, I would later discover a library that did the exact same thing. To add insult to injury, the library was usually faster, more complete, and less buggy. I could have saved a substantial amount of time and headache by using that library. Today, before I sit down to implement anything, I spend a few minutes searching for code that already does what I need. Most code ends up on GitHub these days, which makes the site a natural first choice when looking for libraries.

You will find that libraries are no panacea and differ widely in their quality, ranging from intuitive, well-documented, and equipped with example usage to outdated, undocumented, and simply broken. Worse, it’s not always immediately clear what category a given library falls into. To quickly assess a library’s quality (or if you can choose between multiple libraries), I recommend the following heuristics:

  • Is the library still maintained, and actively developed? If so, it will be easier to get help, and the library is less likely to be outdated or broken. Take a look at the most recent git commit, or the latest issue that was filed in the bug tracker. If there has been no activity in years, the code is either of stellar quality or has been abandoned.

  • Is the API documentation comprehensive? Poor documentation will make it difficult to use the library, and is a telltale sign of poor code. Similarly, if the README file is riddled with spelling mistakes, the code is more likely to be riddled with programming mistakes.

  • Does the documentation provide usage examples? That helps with getting up to speed and is a sign of careful maintenance.

…for data analysis

Let us now look at tips specifically for measurement code, i.e., code whose purpose is to measure a phenomenon or a system. An example is Internet measurement, e.g., projects that set out to understand the complex nuances of, say, how a distributed system recovers after failure.

Use precise timestamps

Whenever you measure something, use the most granular timestamps possible – typically millisecond or nanosecond resolution. Timestamps that are only accurate to the second are often not granular enough to capture the phenomenon that you are measuring. On a related note, consider always using the UTC time zone when creating timestamps. You may end up comparing timestamps that were created by multiple systems, some of which don’t share the same time zone. That’s generally not a big problem because one can account for time zone differences, but I find it more convenient to always deal with the same time zone.
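
In Python, for example, you can create UTC timestamps with sub-second granularity as follows (a minimal sketch):

from datetime import datetime, timezone

# Current time in UTC.
now = datetime.now(timezone.utc)
# Render with millisecond granularity, e.g., 2021-12-06T19:07:02.123+00:00.
print(now.isoformat(timespec="milliseconds"))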

Automate your processing pipeline

We all use some degree of automation but few automate their entire data pipeline, from data collection, to processing, to finally embedding results in the research paper. In my experience, it is helpful and very convenient to be able to run the entire pipeline by invoking a single command, and to run specific steps of the pipeline in isolation, e.g., to plot the data:

A conceptual overview of a data processing pipeline. The script run-all.sh invokes all other shell scripts and ensures that the entire data processing pipeline completes. Still, it’s possible to invoke single scripts in isolation, e.g., to re-create new diagrams.

Measurement code is often a loose collection of Python programs – one to collect the actual data, one to plot the results – and adding a plot to the research paper is often a manual step. I typically combine those Python program invocations in a shell script, to limit the manual work to running a single script. Considering that I have multiple shell scripts that do similar things (e.g., operate on the same data set), it makes sense to create another shell script whose purpose is to serve as a configuration file. All it does is define variables that are used by the other shell scripts. That makes all configurable variables available in a central, easy-to-modify place, so you don’t have to remember which variable is where. Below is a small example of a bash-based configuration file, called config.sh:

#!/bin/bash

# Path to dataset; used by all analysis scripts.
data_set=/path/to/dataset.csv

Bash provides straightforward mechanisms to load the content of another shell script, e.g., by using the source keyword. Once a configuration file is sourced, the calling bash script can access its variables as if they were defined in the calling script:

#!/bin/bash

# Load variables from config file.
source /path/to/config.sh
echo "$data_set"

Simply source the configuration file from all your processing scripts and you will be able to configure them conveniently and centrally as your measurement scripts evolve over time.
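
One defensive refinement, sketched below, is to abort early if the configuration file is missing, so that a misconfigured script fails loudly instead of silently running with empty variables:

#!/bin/bash

# Abort early if the config file is missing.
config=/path/to/config.sh
if [ ! -f "$config" ]
then
    echo "Cannot find config file ${config}." >&2
    exit 1
fi
source "$config"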

Speaking of evolving over time: at some point, you may wonder how a chart changed over time. For example, were throughput numbers better before you refactored the concurrency logic of your network service? To answer questions like this, there is merit in keeping a simple archive of charts. I recommend encoding a chart’s birthday in its filename and having a filesystem link that always points to the latest chart, e.g.:

$ ls -l
syscall-timing-latest.pdf -> 2021-12-06_11:07_syscall-timing.pdf
2021-12-06_11:07_syscall-timing.pdf
2021-12-03_09:32_syscall-timing.pdf
2021-12-01_15:10_syscall-timing.pdf

Whenever your shell script creates a new chart, it can assign the chart the proper filename and create a new link to the latest chart, which your research paper can then embed. Below is sample code that accomplishes this:

#!/bin/bash

# Create experimental data by running our data gathering script.
python3 /path/to/data/gathering.py > data.csv
# Capture the exit code right away; the next command would overwrite $?.
exit_code=$?
if [ $exit_code -ne 0 ]
then
    echo "Data gathering script failed with exit code ${exit_code}." >&2
    exit $exit_code
fi

# Create the chart's desired filename.
date_time_prefix=$(date +"%F_%T")
chart_name="${date_time_prefix}_chart.pdf"

# Create a chart.
python3 /path/to/plot.py data.csv > "$chart_name"

# Create a link to the chart.
ln -sf "$chart_name" "latest_chart.pdf"

# Finally, rebuild the paper.
cd /path/to/paper || exit 1
make

Automating your data processing pipeline saves you time, prevents frustration, and reduces errors because little manual work remains. Once you have an automation pipeline that you are happy with, you can re-use and adapt it across projects. The marginal cost of doing so is negligible, making the pipeline a good time investment.

Make your processing pipeline verbose

No matter your area of research, you will likely be doing data analysis as part of a research project. Conceptually, data analysis is often done with a sieve-like processing model: the raw, unfiltered data goes into your code, which then removes broken data points, special cases, outliers, does some processing, classification, what have you, until only valid data points remain. At any point during this analysis pipeline, it is helpful to log the number of data points that your code is dealing with. This will help you realise early on, by simply glancing at your code’s output, if something went wrong. Otherwise, it is all too easy to end up in a situation where you present somebody with your data, only to be asked “what happened to the other 85% of your data?”

Below is command line output that illustrates the idea. The log shows at a glance that roughly 75% of the original elements end up surviving the filtering criteria.

[2019-10-16 11:02:43] Processing raw data with 54,329 elements.
[2019-10-16 11:02:44] Discarding 2,199 (4.0%) broken elements.
...
[2019-10-16 11:03:12] Writing 41,049 (75.6%) elements to disk.
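
Below is a minimal sketch of how a shell script could produce such a log. The CSV layout and the filtering criterion (keep records whose third field is "ok") are hypothetical – substitute your own:

#!/bin/bash

# Print a log message with a UTC timestamp prefix.
log() {
    echo "[$(date -u +'%F %T')] $*"
}

total=$(tail -n +2 data.csv | wc -l)
log "Processing raw data with ${total} elements."

# Keep only records whose third field equals "ok".
tail -n +2 data.csv | awk -F , '$3 == "ok"' > filtered.csv

valid=$(wc -l < filtered.csv)
percent=$(awk -v v="$valid" -v t="$total" 'BEGIN { printf "%.1f", 100 * v / t }')
log "Writing ${valid} (${percent}%) elements to disk."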

Collect data as raw as possible

Assume you are working on a network measurement project for which you need a large number of UDP headers. UDP’s simple 8-byte header is easily written to a database, so that’s what you do. Once the data is collected, you realize that not every UDP header is valid. The recipient rejected some of the UDP datagrams because the surrounding IP header was corrupt. Unfortunately, you are not able to figure out what headers were affected because you don’t have the full packet capture.

For this reason, it often pays off to store your data as raw as possible – storage permitting, of course. In this case, one should have stored the full packet capture in pcap format, which contains the surrounding IP header and even the link-layer frame. The same applies to other types of data sources. If Python code fails, log its full stack trace instead of just the last error message. You never know when you will need the extra data. There is, however, an important exception to this rule: privacy. If you are collecting data that pertains to people and their privacy, the opposite is called for: collect only as little data as you need to answer your research question.
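
For the UDP example above, one could have captured full packets with tcpdump – a sketch, with the interface name as a placeholder:

# Capture all UDP traffic on interface eth0 and write the full
# packets -- including link-layer and IP headers -- to a pcap file.
# -s 0 prevents packet truncation; capturing typically requires root.
tcpdump -i eth0 -s 0 -w capture.pcap udp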

Linux tools for data analysis

When first presented with a new data set, one typically seeks to understand the data on a high level. We don’t yet care about every single outlier; rather, we are interested in broader trends, which exploratory analysis is meant to uncover. Exploration is meant to be quick and easy to adapt. What’s more, data sets are often encoded in a structured, textual representation like CSV or JSON. A small set of Linux command line tools is all that’s necessary for quick exploration. To me, the holy trinity of quick and dirty analysis is cut, grep, and sort, which I will briefly introduce below.

cut

Assume you have a CSV file – or any kind of text file that has one data record per line, where each record consists of separate fields. Cut helps you select specific fields. Take the following example: a CSV-formatted data set that maps a timestamp to the number of requests observed at that timestamp.

time,requests
2019-10-27,582938
2019-10-28,582938
2019-10-29,519301
2019-10-30,502318
2019-10-31,510329

Given this file format, how can you select only the requests, and discard the time column? Here’s how:

$ cut -d , -f 2 file.csv
requests
582938
582938
519301
502318
510329

The argument -d , tells cut to use a comma to distinguish between columns and the argument -f 2 selects the second column, which contains “requests”.

Do you want to discard the first line, which contains the file header? Pipe the output of cut into tail, to only display lines starting with line number two:

$ cut -d , -f 2 file.csv | tail -n +2
582938
582938
519301
502318
510329

sort and uniq

Once you have combined command line tools such that they print numeric data to the console, you can sort that data by piping it into the sort tool. For example, to sort the numbers above, we append sort -n to the command pipeline. The -n flag tells the tool to sort the numbers numerically rather than lexicographically.

$ cut -d , -f 2 file.csv | tail -n +2 | sort -n
502318
510329
519301
582938
582938

We often want to know about duplicate elements in a data set. The tool uniq can help us with that. Note that uniq only collapses adjacent duplicate lines, which is why we sort the data first. To eliminate all duplicates, we pipe the sorted output into uniq:

$ cut -d , -f 2 file.csv | tail -n +2 | sort -n | uniq
502318
510329
519301
582938

Finally, when given the option -c, uniq can tell us how many duplicates it found:

$ cut -d , -f 2 file.csv | tail -n +2 | sort -n | uniq -c
      1 502318
      1 510329
      1 519301
      2 582938

The first column shows how often the value in the second column occurred. The last line reveals that the value 582938 showed up twice in the data.

grep

Instead of the small example above, we might face a considerably larger data set, spanning many months, and we may want to filter it for a specific month, e.g., August 2019. The tool grep makes this straightforward. The example below selects only lines that begin with the string “2019-08”:

$ grep '^2019-08' file.csv

grep accepts regular expressions as input and supports a large number of options, making it by far the most powerful tool in this collection. Think of grep as the Swiss Army knife of data filtering. We will now go through an example.

An example

Let’s take a look at a practical problem that requires a combination of the tools discussed above: what are the most frequently used words in this book? To answer this question, let’s begin by splitting the book’s lines into words using the tool tr, which replaces spaces with newlines:

cat book.md | tr ' ' '\n'
---
title:
Effective
tools
for
computer
systems
research
subtitle:
"DRAFT
...

The resulting list contains many lines that are markdown control characters and other fragments that aren’t English words. To keep only words, I use grep with the regular expression ^[a-zA-Z]\+$, which matches lines consisting solely of the letters a-z and A-Z.

cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$'
Effective
tools
for
computer
systems
research
Please
do
not
author
...

Now that we have a list of all words in this book, let’s sort those words to make them easier to count. This is as simple as piping the output of our existing tool chain into sort:

cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$' | sort
a
a
a
a
a
a
a
a
a
a
...

Finally, it’s time to count each word. The tool uniq does this for us. It expects as input a sorted list of strings, one per line, and, if given the argument -c, shows the number of times it encountered each string:

cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$' | sort | uniq -c
    166 a
      2 A
      2 able
     24 about
      1 above
      2 academic
      1 academics
      1 accept
      1 access
      1 accessible
...

Wouldn’t it be useful to have uniq’s output sorted? Again, this is as simple as piping our current output into sort yet again. This time, we pass the argument -n to sort, instructing it to sort numerically instead of lexicographically. We also use the argument -r to reverse the output, so the most common strings show up first:

cat book.md | tr ' ' '\n' | grep '^[a-zA-Z]\+$' | sort | uniq -c | sort -n -r
    201 to
    168 a
    159 the
    139 you
    128 of
    126 your
    102 and
     75 in
     72 I
     70 that
...

There are numerous other command line tools that help with simple data processing. A quick Web search for “how to do X in bash” typically suffices to discover how a given problem can be solved.

…for systems building

Let us move on to programming tips for systems building.

Consider adding tests

Imagine you are building a TCP proxy that modifies data in flight. As you work on the code that shoves data from one socket into the other, you realize that the proxy does not close its open connections properly, so you set out to fix that issue, and in the process, you restructure other parts of the code. Throughout all those steps, you run the risk of breaking the code that’s already in place, and perhaps without even noticing.

It may pay off to write a set of unit and integration tests for your code. Modern languages like Go make that reasonably easy, and the benefit is substantial: you can make complex changes to your code, run the tests and – assuming they pass – be reasonably sure that your changes did not break existing source code. You don’t have to test every single line of each function; focus on code that’s critical to the overall architecture and write a few unit tests that ensure that the code is working as expected.
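
What such tests look like depends on your language and project, but even a small shell-based smoke test provides a safety net. Below is a minimal sketch that runs a hypothetical tool on known input and compares the output against a reference file that was verified by hand and committed to the repository:

#!/bin/bash

# Run the tool on known input; the tool and file names are placeholders.
./mytool < testdata/input.txt > /tmp/output.txt

# Compare against the expected output; fail loudly on any difference.
if ! diff -u testdata/expected-output.txt /tmp/output.txt
then
    echo "Test failed: output differs from reference output." >&2
    exit 1
fi
echo "Test passed."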

Consider a design principle

Simple proof-of-concept projects consisting of a couple thousand lines of code can be built without paying much attention to architectural design, but if your code is meant to grow beyond that, you may want to think about adopting an architectural design principle like The Clean Architecture (Martin 2012). Such principles make it easier to keep complexity under control as software grows.

Tackle the riskiest component first

When designing and implementing a new system, it’s often not clear if it can be done at all. Your project may require combining several existing systems, which can lead to issues that are difficult to predict. Some components of your new system are bound to be riskier and more likely to fail than others. If possible, try to work on those first, so you can abort the project early if you run into insurmountable obstacles. It’s much better to realise after only five hours that your system cannot work than after days or weeks of programming.

Learn to navigate large code bases

Most systems researchers will sooner or later have to understand and modify large code bases written by others. That could be the Linux kernel, device drivers, or browsers. They all have in common that they cannot be understood in their entirety, which makes them difficult to change. There are, however, a number of strategies that can speed up the process of navigating a large third-party code base.

Before diving into the code, try to get it set up and running locally. Once the code is running, resist the temptation to immediately hunt for the part of the code that you think will hold your changes. It pays off to obtain a high-level understanding of the code before diving into the details. Crucially, one should obtain that understanding by going top to bottom instead of bottom to top, i.e., start with understanding the code at a very high level: class hierarchies, control flow, and interfaces. Instead of focusing on specific functions, focus on function call graphs; instead of focusing on class attributes, focus on the class hierarchy. Well-documented software occasionally comes with architectural diagrams that help with understanding how the code is organised. Such diagrams are an excellent place to start. If you can obtain a semi-comprehensive understanding of the code – or at least of the subset that concerns you – it will be much easier to see where your changes should land, and you are less likely to run into costly dead-ends.

Another aspect that can slow you down substantially is a sloppy development setup. Complex code bases can take a while to compile, test, and set up, which makes it cumbersome and error-prone to test small changes to the source code. Be sure to spend some time setting up a proper development environment and consider reaching out to the project’s development community to get advice on efficient development. And as always, automate commands that you type repeatedly. For example, instead of running ./configure, make, and make install separately, you can run a single command that aborts if any of those steps fails: ./configure && make && make install.

Finally, large code bases are too complex to keep in your head. Don’t be afraid to reach for pen and paper to understand a program’s call graph or its startup sequence. I take extensive notes on complex code that I am studying; these notes resemble a call graph, annotated with source code files and line numbers.

Summary

  • To make your code more reliable and speed up the development process, use high-quality libraries.

  • To make your code more robust, consider adding unit tests.

  • To make your code more structured, consider following a design principle.

  • To make your work more useful to peers (and gain recognition), publish your code.

Communicating

Regardless of what research you do, a substantial part of your job will be communication, mostly with your peers, but ideally, also with the general public. We communicate constantly, by writing papers, sending emails, talking to advisors, presenting our work, and complaining on Twitter. Being an outstanding researcher goes a long way, but to truly excel, we also have to master communication.

Effective communication creates numerous opportunities by 1) exposing your research to people who would otherwise not see it, 2) saving time, 3) “selling” your work, and by 4) earning the respect of your collaborators.

In this chapter, I will encourage you to create project pages, publish pre-prints, present effectively, engage in popular science writing, and use social media to your advantage. Regarding the more “intimate” communication with your peers, this chapter also discusses socialising, managing your collaborators, proper email etiquette, picking the right communication mode, and the reasons for communicating openly.

…with the world

Project pages

Data and source code of research papers are often only available upon request.

Note

Have you ever stumbled upon a promising research paper that mentions that you can get its source code by emailing the authors, only to find that their email addresses no longer work? Or that they don’t respond to your email? Or when they do get back to you, they can’t find their source code anymore? The main output of a research project is the resulting scientific paper and once it’s published, there is little incentive for authors to do more.

Early on in my Ph.D. life, I made it a habit of creating a project page for almost every research project I was involved in:

  1. a network traffic obfuscation protocol
  2. a scanner for Tor exit relays
  3. a Sybil detection tool
  4. a DNS measurement study
  5. a usability study on onion services
  6. an analysis of weak RSA keys in Tor relays
  7. a study on China’s Great Firewall (and another one)
  8. a measurement study using the RIPE Atlas network

The workload in research can be overwhelming and adding another part to a project may sound daunting. But creating a project page doesn’t take much time – maybe one afternoon, if you take your time. Once you have a template, you can re-use it for your next project, minimising the marginal cost of each new project page.

I recommend that project pages have at least the following sections:

  • Project summary: Start with a paragraph that summarises your project. Similar to an abstract, it should convey (i) what problem your project solves, (ii) how it solves the problem, and (iii) what the results are. Try to write the project summary for a broad audience; write it the way you would explain your research to someone in another department, or to someone in the grocery store. In other words: use simple language and avoid jargon.

  • Datasets: If your research uses a dataset, then your project page should link to the data. You may not want to host datasets yourself, especially large ones. Consider using the Internet Archive to archive your dataset; link to your Internet Archive page from your project page.

  • Code: Your code matters because it allows others to reproduce your work. We therefore have an obligation to publish our code. Code is never perfect, so don’t ever be embarrassed about your code’s quality. No reasonable person will judge you by the quality of your code. As with datasets, there is no need to host code yourself: feel free to link to a GitHub or GitLab repository.

  • Papers: Papers are the main outcome of a research project, so we should all make our research papers and other write-ups available on our project pages. Be sure to make your paper openly accessible instead of linking to a paywalled portal. Research papers behind a paywall are an injustice and prevent less wealthy scientists from engaging in the scientific discourse. If you are worried about legal consequences of publishing a paper outside a paywall imposed by the publisher of record: don’t be. I have yet to hear of a single case of a scientist getting into trouble for making their own work available.

  • Contact information: Consider providing contact information to make it easy for fellow researchers to reach out to you. Try to use email addresses that will still work five years from now – even if this means using your personal address instead of your university email address.

I recommend keeping your project pages under your control, so you can edit them whenever you need to. It’s difficult to update the page if it’s hosted at university.edu/project/ and you are no longer employed by your former university. At some point I decided to host all my project pages on my own web server, nymity.ch, which gives me full control. But this control comes at a price: responsibility. If you host your own web server, it is now your responsibility to keep it alive, and to refresh your domain names and HTTPS certificates. If you want the same control with less responsibility, I recommend hosting your pages on a service like GitHub Pages.

It is increasingly common to buy fancy domains for project pages, often ending in the desirable “.io” top level domain. There is nothing wrong with that, but if you let these domains expire, your project page will disappear. Are you still going to pay that yearly $15 fee for myproject.io ten years from now? If not, then don’t go that route.

To get you started with project pages, feel free to use the following template that gives you a simple, fast, and decent-looking project page in little time.

<!doctype html>

<html lang="en">
<head>
  <title>TODO: Page title</title>
  <meta charset="utf-8">
  <meta name="description" content="TODO: Web page description">
  <meta name="author" content="TODO: Your name">
  <style>
  .toc {
    justify-content: space-between;
    display: flex;
  }
  body {
    width: 60%;
    font-family: sans-serif;
  }
  </style>
</head>

<body>
  <h1>Your project's title</h1>
  <div class="toc">
    <a href="#overview">Overview</a>
    <a href="#writing">Writing</a>
    <a href="#code">Code</a>
    <a href="#data">Data</a>
    <a href="#contact">Contact</a>
  </div>

  <hr/>

  <h2><a id="overview">Overview</a></h2>
  <p>This is the project overview</p>

  <h2><a id="writing">Writing</a></h2>
  <p>An overview of what writing you published.</p>

  <h2><a id="code">Code</a></h2>
  <p>Links to your source code</p>

  <h2><a id="data">Data</a></h2>
  <p>Links to datasets</p>

  <h2><a id="contact">Contact</a></h2>
  <p>Contact information</p>

  <hr/>

  <p><i>Last update: YYYY-MM-DD</i></p>
</body>
</html>

You can think of project pages as documentation of a finished piece of work, but I prefer to think of them as living documents that evolve as a research project progresses. The earlier you can share information about your work, the better. Research papers are often preceded by workshop papers, posters, abstracts, or presentations. All of these are worth making available early on, on a project page. In fact, a project page can serve as documentation for yourself, to keep track of your project’s output. I am not suggesting that you create project pages for purely altruistic reasons; you get something out of it too:

  • You learn about your audience by looking at your web server logs. I used to regularly check the visitor log of my project pages. It was interesting to see which universities and departments people looking at my work came from. In fact, it was gratifying to realise that anyone at all was interested in reading my work.

  • You expose your research to a broader audience. Research papers follow a style of writing and presentation that can be alienating to a general audience. Project pages mitigate this problem. Somebody who would not read your paper may read your project page – and perhaps then decide to take a look at the paper too.

  • It signals to potential employers that you go the extra mile and care about the presentation of your work, even when you don’t have to.

Publish preprints

For fear of getting scooped, researchers typically keep projects confidential until publication of a peer-reviewed paper. But getting a paper through peer review can take many months, if not years, because it is common for a paper to be submitted multiple times for review. Throughout all this time, your work could have been useful to others.

In a fast-moving field like computer science, this antiquated publication model causes frustrating and unnecessary delays. It does not have to be this way. While we can’t get around publishing peer-reviewed papers – they are academia’s currency, after all – we can publish a technical report before the final, peer-reviewed version of a paper is out. If you are still not convinced: Correa et al. (2020) provide (not yet peer-reviewed) evidence that openly accessible papers are cited more than closed-access papers.

Originally created for the publication of physics preprints, the arXiv has become computer science’s most popular preprint publication platform too. You “publish” your work on the arXiv by uploading your research paper’s LaTeX code (be sure to first remove all cuss words from the comments). After a moderator reviews your submission, your article will appear on the arXiv – typically within one or two days.

Conveniently, the arXiv provides a notification system that informs subscribers about new reports in their area of interest. This means that a non-trivial number of people who subscribe to the field “computer networks” will get a notification after the publication of your new report in computer networks.

A frequent concern about the arXiv is that many conferences don’t allow paper submissions that have previously been published in a peer-reviewed venue. Fortunately, the arXiv is not peer-reviewed, so a report on arXiv typically does not count as published. In my field of computer security, all top-tier conferences accept papers that previously appeared on arXiv. Regardless, in case of doubt, ask a conference’s program chairs to clarify their policy regarding previously published (but not yet peer-reviewed) technical reports.

“But Philipp,” you may ask, “why go to the extra trouble of uploading my report to the arXiv?” It’s all about exposure. Once your report is published, many of your peers will come across it: through Google Scholar, which crawls the Internet for research papers; via the arXiv’s in-house notification system; or through other aggregators. Early exposure can result in citations, potential collaboration, or at least people hearing of your work.

Presenting

A good conference presentation opens doors. Science journalists may approach you to write a popular science article about your work (or, in the time-tested academic tradition of unpaid labour, ask you to do it for them), people from industry may wonder how to deploy your research, and other academics may suggest projects to collaborate on. A great presentation can elevate your research from obscure insignificance to something that people talk about. Even if your research is not spectacular, a great presentation sets you apart from other presenters. Take presentations seriously.

Most conference talks I attend are missed opportunities. The average academic talk is difficult to follow, poorly structured, and dispassionate. Entire books have been written on effective presenting and I won’t try to compete with them here. Instead, I’ll distill my advice into a few key points:

  • Rehearse your talks. Some people believe the myth that great presenters are born rather than made. This is wrong. My best talks were the result of numerous (up to a dozen) rehearsals. That’s why they were my best talks. With practice comes confidence. You will know what to say, so you’ll have fewer “ehms,” poor transitions, and awkward pauses, because you won’t have to make sense of your own slides on the fly. Consider recording yourself to learn how to use your voice more effectively, improve your body language, and become mindful of (and eliminate) fillers like “ehm,” “you know,” and “like.”

  • Capture your audience’s attention. Don’t dive right into the research. Try to start with a lighthearted joke, an interesting anecdote, or anything that gets people engaged. I once presented a paper on Sybil attacks. Curiously, my name was listed twice on the conference’s list of accepted papers, so I used that fact to start my presentation with a joke that got a few laughs.

  • Focus on what matters. It is very common for presenters to ramble on about irrelevant details. Keep in mind that what your audience can take away from your presentation is very limited. Ask yourself: what are the two or three most important points that I want my audience to remember? Spin your presentation around these points.

  • Have a narrative. Every sentence you say should be directly connected to the previous sentence. If you jump from one topic to another without proper transition, you will gradually lose your audience. Even with a proper narrative it can be difficult to follow a talk. Recapitulate occasionally, e.g., by saying “now that we’ve looked at X and Y, it’s time to talk about Z.”

If you would like to learn more, take a look at Patrick Winston’s excellent lecture on “How To Speak”.

A good presentation uses slides sparingly but effectively. Here are my suggestions for optimal slide use:

  • Minimise the number of words on slides and avoid clutter. Your audience is going to read what’s on your slides and while they are reading, they cannot pay attention to you. Your slides are supporting material and are not supposed to keep your audience busy reading.

  • Use slide numbers, which will allow people to reference specific slides during the Q&A.

  • Make sure that the font (including in diagrams) is big enough that even people in the last rows can clearly read it. Most presenters get this one wrong.

  • When presenting charts, guide the audience. Explain the axes, discuss how to read the chart, and highlight important insights.

  • Optimise your slides for the 16:9 widescreen format, which is now supported by all modern projectors. To be safe, consider exporting a second slide set for the (outdated and increasingly rare) 4:3 aspect ratio.

  • Use a sans-serif font (e.g. Arial). Avoid serif fonts (e.g. Times New Roman) because they are optimised for reading large amounts of text. Needless to say, this is not critical advice but I find it useful nonetheless.

Science Twitter

Twitter has a (sometimes deserved) reputation for being a time sink fueled by conflict and outrage, but in my experience, the platform is all about who you follow. If you follow the right people, you will learn a lot. Over the years, I’ve compiled a set of Twitter accounts that share sharp insights and thoughtful commentary. One can find high-quality discourse on Twitter that resembles dinner conversations at conferences. The best thing about Twitter is that you don’t have to pay $800 in registration fees to participate in these conversations.

In deciding who to follow, I use a simple heuristic: I follow somebody for a few days or weeks, and if I don’t learn much from them, I unfollow them. Twitter is as much a marketing platform as it is a discussion platform, and some people push the marketing aspect a little too far for my taste.

While you’re at it, use the opportunity to follow people outside your field. And by that, I don’t mean someone in programming language design if you’re in computer vision. Follow people in psychology, economics, or biology. You can learn a lot by observing what problems scientists in other fields struggle with, and how cultures and methods differ.

Twitter can be a great option for staying in the loop on various topics:

  • Conferences and workshops have reputations. (For the non-computer scientist: original research in computer science is typically submitted to conferences instead of journals.) When I was new to research, I did not know that. Eventually, I got a feeling for this (sometimes informal) “ranking” and learned which conferences carry the most prestige. Listening to researchers talk about conferences will help you get a sense of where you should submit your work.

  • Professors occasionally talk about faculty hiring processes, what they look for in Ph.D. or grant applications, and their opinions on the peer review process. While this knowledge does not necessarily generalise to all of academia, it is still helpful.

  • Some researchers openly talk about their paper rejections, which serves as an important reminder that the people you admire deal with rejection just as much as (if not more than) you do. This helps calibrate your perspective.

  • By engaging in discussions, you will eventually build a following, allowing you to promote your own work more effectively.

By no means do you need to use Twitter to be successful in your field, but controlled use can be an advantage. However, avoid Twitter fights – they make you look like a combative fool to bystanders. Also, make an effort to post interesting and insightful content; don’t just advertise your latest paper.

Teach the public

Communicating about your work does not have to end with your peers. We have a responsibility to make our work accessible to the broader public. To that end, I have published two articles in The Conversation. (I was not paid to write these articles and have no financial interest in the site; I mention The Conversation only because I have some experience with it.) I was originally contacted by an editor who encouraged me to explain my research by publishing an article there. The site does not pay its authors, but I still found it an interesting endeavour because I had never worked with an editor before. For both articles, I created a first draft and my editor provided plenty of suggestions and advice. After three or four more iterations, the article was ready to be published.

There are many other outlets that encourage scientists to explain their research to the public. Your research group or department’s blog platform is another great opportunity to practice these skills.

…with collaborators

Socialise

Academic conferences are where one forms new connections and collaborations. Conferences can be intimidating and uncomfortable, particularly for sufferers of impostor syndrome. You find yourself surrounded by accomplished and smart people, and believe that your research pales in comparison to theirs. I know the feeling.

If you find that approaching a new person at a conference is scary, I recommend finding someone who can introduce you. That might be your advisor or a mutual friend. Also, keep in mind that there is nothing wrong with approaching someone during a coffee break. Introduce yourself and start with a question or compliment about the person’s research. Most people are flattered to hear that someone enjoyed (or even just read) their work. If you feel very nervous or anxious, you may spend too much time in your head, focused on your stressful feelings. Try instead to make a conscious effort to focus on the other person. Pay close attention to what they are saying and ask follow-up questions.

While most of the networking at a conference happens in the “hallway track” during breaks, there are more networking opportunities in the evening. People often head out for dinner and drinks, which creates a less formal environment that makes it easier to strike up conversations and meet new collaborators. Consider tagging along with a group so you don’t miss out on this opportunity.

Manage collaborators

Eventually, you are going to lead a research project. This involves coordinating collaborators, organising meetings, and keeping everyone in the loop on the project’s progress. Things inevitably get messy when people with different personalities, cultures, and communication styles work together. The following tips help make the process smoother:

  • Your advisor is a collaborator too and needs “management.” Advisors differ significantly in their style and range from entirely hands-off to micromanagers. To learn more, take a look at Nick Feamster’s excellent blog post on the matter.

  • If you want a collaborator to work on something, ask specific questions and provide clear instructions. Don’t expect them to realise how busy you are and offer help – they are likely too busy to notice.

  • Keep people up-to-date on the project’s progress. Some people like to use email for this; others schedule regular calls to discuss progress. Consider using email for short updates and have an occasional call when there’s more to discuss.

  • Don’t be afraid to express frustration, but do so respectfully and with the intent to improve the collaboration rather than assign blame. For example, if half of your team always misses meetings, strike up a conversation on how to collaborate in ways that work for everyone.

  • Even more important than expressing frustration is the expression of gratitude. Let your collaborators know when they did a good job! We all love to feel appreciated.

  • Conflict among collaborators is a common occurrence. If you find it difficult to resolve a conflict yourself, consider involving your advisor as mediator.

  • Whenever there is something to discuss, involve all collaborators unless you have a good reason not to. Your collaborators will feel respected for being kept in the loop. (More on that below.)

  • As if all of the above were not difficult enough, the average team consists of researchers from several cultures that have different customs regarding communication. Give people the benefit of the doubt and try to be clear and respectful in your communication.

Email etiquette

Pick descriptive email subjects that make it clear what your email is about. I occasionally prefix an email subject with “FYI:” or “Action needed:” to let the recipient know that an email can be ignored or that it requires action.

  • Bad: “Project update”
    Good: “FYI: Paper got accepted”

  • Bad: “Repository”
    Good: “Action needed: Commit missing code to repository”

  • Bad: “Need help with code”
    Good: “Please commit missing code to repository”

Try to avoid top-posting when dealing with long and complicated emails because it makes it difficult to follow an email discussion:

I have strong opinions about your email.  You are wrong about X, Y, and Z.

On Tue, Jan 05, 2021 at 07:35:54PM +0000, John Doe wrote:
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
> incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
> nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
> fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
> culpa qui officia deserunt mollit anim id est laborum.

Instead, try to quote and respond to specific parts of the original email:

On Tue, Jan 05, 2021 at 07:35:54PM +0000, John Doe wrote:
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
> incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
> nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

This I agree with.

> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
> fugiat nulla pariatur.

I believe we should do X instead, because of Y.

> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
> deserunt mollit anim id est laborum.

Well said.

Use To: and Cc: wisely. Put everyone whose attention you require into the To: field and the remaining collaborators into the Cc: field. Create an email alias to make it easy to reach all of your collaborators. For more email writing tips, take a look at Philip Guo’s excellent article.

Pick the right communication mode

Find the right balance between the least invasive and the most convenient communication method. It may be convenient for you to call your collaborator each time you need something, but they may experience this as distracting and invasive.

To discuss complicated research designs, you typically need a synchronous meeting: either a phone call or an in-person meeting. For topics that require less back-and-forth, asynchronous communications methods like email are a better fit. If you need something right now, an instant message or phone call may be the most appropriate.

Also, keep in mind that everyone’s communication preferences differ. Some people enjoy video calls while others prefer texting. Collaboration often requires compromising; try to find communication methods that work for everyone.

Regarding specific communication tools, Slack (or its free software alternative Mattermost) is useful because it allows collaborators to self-select which communications they want to participate in.

Communicate openly

Imagine a small research project consisting of three collaborators: Alice, Bob, and Eve. There are four possible communication channels – assuming nobody talks to themselves:

  1. Alice ↔︎ Bob
  2. Alice ↔︎ Eve
  3. Bob ↔︎ Eve
  4. Alice ↔︎ Bob ↔︎ Eve

Four collaborators have eleven possible communication channels, while five collaborators have a whopping twenty-six possible communication channels! (Summing the binomial coefficient over all group sizes of two or more gives the number of possible channels among a group of collaborators; see the formula below.)
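
For n collaborators, the count written as a formula (every subset of two or more collaborators forms one possible channel):

\sum_{k=2}^{n} \binom{n}{k} = 2^n - n - 1

For n = 5, this yields 2^5 - 5 - 1 = 26.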

A project with five collaborators is by no means unusual – in fact, the top four academic security conferences now average five authors per paper (Balzarotti 2020).

The good news is that you don’t have to ponder which one of the twenty-six communication channels to opt for before writing an email. Unless you have a good reason not to, err on the side of inclusion when communicating. That is, include everyone in your email CC list by default. If any one of your collaborators feels overwhelmed by the communication, they can request to be omitted from future correspondence, or they can simply ignore your emails. Typically, it should be your collaborator’s decision what to participate in – not yours. In my experience, collaborators appreciate being kept in the loop – even if they rarely respond to email threads.

As a young Ph.D. student, I mistakenly believed that I was doing my collaborators a favour by not including them unless I really needed their help. After all, isn’t everyone busy and don’t they have better things to do? This is a fallacy. Collaborators exist to help each other and they generally like to know what’s going on. Give them the opportunity! Besides, leaving people out of communication can quickly lead to a culture of distrust. Junior collaborators, especially, will wonder if there are ulterior motives for them being left out.

However, not everything needs to be discussed with all of your collaborators. Do you need your advisor’s signature on a document? Your collaborators won’t care. The same is true if one of your collaborators is unable to log into a machine that you use for experiments. When it comes to the actual research, however, you need to have a good reason to not include someone.

I know first-hand that it’s often tempting to initiate one-on-one communication. For example, you may feel insecure about an idea and want to run it by someone before you share it further. Try to avoid this. The more you communicate in the open, the better for you and the project, and your collaborators will respect you for it.

Balzarotti, Davide. 2020. “System Security Circus 2019.” January 2020. https://s3.eurecom.fr/~balzarot/notes/top4_2019/.

Chacon, Scott, and Ben Straub. 2014. Pro Git. Apress. https://git-scm.com/book/en/v2.

Clear, James. 2018. Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones. Avery.

Correa, Juan C., Henry Laverde-Rojas, Fernando Marmolejo-Ramos, and Julian Tejada. 2020. “The Sci-Hub Effect: Sci-Hub Downloads Lead to More Article Citations.” 2020. https://arxiv.org/pdf/2006.14979.pdf.

Diffie, Whitfield, and Martin E. Hellman. 1976. “New Directions in Cryptography.” Transactions on Information Theory 22 (6). https://ee.stanford.edu/~hellman/publications/24.pdf.

Keshav, Srinivasan. 2007. “How to Read a Paper.” SIGCOMM Computer Communication Review 37 (3). http://ccr.sigcomm.org/online/files/p83-keshavA.pdf.

Martin, Robert C. 2012. “The Clean Architecture.” 2012. https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html.

Newport, Cal. 2016. Deep Work: Rules for Focused Success in a Distracted World. Grand Central Publishing.

Pinker, Steven. 2015. The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century. Penguin Books.

Pollan, Michael. 2009. In Defense of Food: An Eater’s Manifesto. Penguin Books.

Walker, Matthew. 2018. Why We Sleep. Scribner.