Getting to grips with a new codebase can be very difficult. Every software developer has
to dive into unfamiliar code on a regular basis, but to my knowledge there are
no good guides on how to approach the task. My jobs over the past decade have
involved writing code, but I have spent more of my time reviewing code across
dozens of active projects, so learning how to get oriented quickly in an unfamiliar
codebase has been crucial.
Many
discussions on this topic focus on how to navigate code in a particular editor.
In this post I want to focus on the general techniques rather than editor
specifics (though I’ll get to my current preferred setup at the end).
Survey the directory structure
Start at
the root directory of the project. Most normal projects will have 10 to 20
files and directories in the root. Go through these one by one and make a
1-line note of the purpose and contents of each.
Checklist: at the end of this step you should be able to answer these questions.
- Which files, if any, provide documentation?
- Which files drive the build and deployment system? (Projects and languages that don’t strictly have a build system usually have a deployment system; I’ll just refer to this as the build system.)
- Where is the source code? If the source files are split into subdirectories, what does that structure represent (libraries, components, executables)?
- Where is the test code? (It is usually either commingled with the main source or in a separate subdirectory.)
- What external dependencies are required to build the project?
- What are the build targets (usually executables, libraries, tests, documentation)?
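To make the survey concrete, here is a minimal Python sketch that prints one note line per root entry for you to fill in by hand. The `survey` function, the `BUILD_FILES` table, and the documentation filename prefixes are assumptions made up for this example; the list of build files is illustrative and far from complete.

```python
from pathlib import Path

# Well-known build/packaging file names. Illustrative and incomplete (assumption).
BUILD_FILES = {
    "Makefile": "make",
    "CMakeLists.txt": "CMake",
    "package.json": "npm",
    "pyproject.toml": "Python packaging",
    "Cargo.toml": "Cargo",
    "pom.xml": "Maven",
    "build.gradle": "Gradle",
    "go.mod": "Go modules",
}

# File name prefixes that usually indicate documentation (assumption).
DOC_PREFIXES = ("readme", "contributing", "changelog")


def survey(root):
    """Return one note line per entry in the project root, ready to annotate."""
    lines = []
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name.lower()):
        if entry.name == ".git":
            continue  # version control metadata, not part of the project layout
        if entry.is_dir():
            count = sum(1 for p in entry.rglob("*") if p.is_file())
            hint = f"directory, {count} files"
        elif entry.name in BUILD_FILES:
            hint = f"build file ({BUILD_FILES[entry.name]})"
        elif entry.name.lower().startswith(DOC_PREFIXES):
            hint = "documentation"
        else:
            hint = "file"
        lines.append(f"{entry.name:<24} [{hint}] purpose: ?")
    return lines
```

Calling `print("\n".join(survey(".")))` from the project root gives a skeleton for the 1-line notes; the hints are only guesses, and the "purpose" column is still yours to write.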
Understand and Run the Build and Tests
Even though
I’m often reviewing code that I’m never going to modify, I still like to start
by verifying that I can successfully build the project outputs and run the main
executable and tests (if those things exist). This step helps identify any
weird dependencies the project has, and means that when you’re finally ready to
edit code you don’t have to break flow to figure out the build system.
If the
project has a test suite, figure out how to run it. Test suites vary a lot
across languages and projects, and in some cases can be really finicky to get
running, but it’s time well spent.
Identify the interfaces, inputs and outputs
Every
program is just a way to transform input data into output data. If you “dive”
into the middle of a large codebase and try to figure things out from the
inside out, you will fail, or at least waste a lot more time than you should. Always
start from the outside and work your way in.
Identify
what the inputs are, and what the outputs are. Make notes describing them –
force yourself to articulate this knowledge.
Projects
that implement an “official” API ought to be easier to comprehend, and often
they are, but don’t fall into the trap of assuming that all the inputs and
outputs are captured by the API. Many APIs provide a partial account of the
I/O, and in fact you need to understand the backend database interface and the
dataflows into the DB in order to really identify all the relevant inputs and
outputs.
Make sure that you identify all the inputs and outputs, including log file outputs
and configuration inputs. Many projects have logging outputs that give you a
very useful and comprehensive picture of what the program does.
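One rough way to start the list of inputs and outputs is to search the source for calls that usually mark an I/O boundary. The sketch below assumes a Python project; the `find_io_touchpoints` function and the `PATTERNS` table are hypothetical names for this example, and the regular expressions are deliberately crude heuristics that will miss things and produce false positives.

```python
import re
from collections import defaultdict
from pathlib import Path

# Heuristic patterns for common I/O boundaries in Python code (assumption).
PATTERNS = {
    "file I/O": re.compile(r"\bopen\("),
    "command line": re.compile(r"\b(argparse|sys\.argv)\b"),
    "environment/config": re.compile(r"\b(os\.environ|configparser)\b"),
    "logging": re.compile(r"\blogging\.\w+"),
    "database": re.compile(r"\b(sqlite3|psycopg2|sqlalchemy)\b"),
    "network": re.compile(r"\b(socket|requests|urllib)\b"),
}


def find_io_touchpoints(root, suffix=".py"):
    """Return {category: [(relative_path, line_number), ...]} for matching lines."""
    hits = defaultdict(list)
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for category, pattern in PATTERNS.items():
                if pattern.search(line):
                    hits[category].append((str(path.relative_to(root)), lineno))
    return dict(hits)
```

The result is a list of places to go and read, not an answer in itself; each hit still has to be checked and written up in your notes.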
Structured Examination of Code
Don’t just “browse” the code. Write down specific questions that you want to investigate, like “How
are messages filtered and decrypted?” Stay focused on the point you are
investigating and try not to be distracted by interesting-looking code.
Make notes describing the answers to these questions, including a function call graph and
any important data manipulations.
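If the code you are reading happens to be Python, a first draft of the call graph can be generated rather than drawn by hand. This is a minimal sketch using the standard `ast` module; `call_graph` is a name invented for the example. It records only the bare names of callees, it does not resolve which object a method belongs to, and calls made inside a nested function are also attributed to the enclosing function.

```python
import ast
from collections import defaultdict


def call_graph(source):
    """Map each function or method name to the sorted names it calls.

    Functions that make no calls are omitted from the result.
    """
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):
                        graph[node.name].add(func.id)    # plain call: f(x)
                    elif isinstance(func, ast.Attribute):
                        graph[node.name].add(func.attr)  # method call: obj.f(x)
    return {name: sorted(callees) for name, callees in graph.items()}
```

Treat the output as a starting point for your notes: prune it down to the calls that matter for the question you are investigating.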
When you open a file, page down through it all the way to the bottom, spending
about 5 seconds skimming each screen of code. I don’t have a good
explanation, but I find this really helps me to get oriented and get a feel for
the size and shape of the code. You obviously can’t absorb much of the detail
by doing this, but it answers a lot of high-level questions, such as whether the
code is repetitive boilerplate, a bunch of simple functions, or a small
number of really complicated functions.
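A script is no substitute for skimming, but it can put rough numbers on the same "size and shape" questions. The sketch below assumes Python source and Python 3.8 or later (for `end_lineno`); the `shape` function is a hypothetical helper for this example.

```python
import ast


def shape(source):
    """Summarise a Python file: total lines, function count, three longest functions."""
    tree = ast.parse(source)
    funcs = [
        (node.end_lineno - node.lineno + 1, node.name)  # (length in lines, name)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    funcs.sort(reverse=True)  # longest first
    return {
        "lines": len(source.splitlines()),
        "functions": len(funcs),
        "longest": funcs[:3],
    }
```

Many short functions suggest simple helpers, while a handful of very long ones points at the complicated code you will need to budget time for.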
Understand the branching structure
Thankfully
most modern projects use good distributed version control systems with sane
branching policies. You can usually figure out the branching policy quite
quickly just by looking at the history, but always check the project
documentation for specific information on this.
Spend 20 minutes reading the most recent commit messages and diffs
I time-box this activity because for large, long-running projects you could spend an
indefinite length of time reading the changes. 20 minutes doesn’t sound like
much, but it’s more than enough to get a feel for which parts of the codebase
are under active development, which developers are working on those areas, and
whether the development is driven by issues or by new features.
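Reading the messages and diffs is the point of this step, but a quick tally can tell you where to look first. The following sketch assumes the project uses git and that `git` is on your path; `recent_log` and `summarise` are names made up for the example, and the `@` prefix on author lines is just a marker chosen here to tell them apart from the `--numstat` rows.

```python
import subprocess
from collections import Counter


def recent_log(repo, count=50):
    """Return raw `git log --numstat` text for the last `count` commits."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "-n", str(count), "--numstat", "--format=@%an"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarise(log_text):
    """Count lines changed per file and commits per author."""
    churn, authors = Counter(), Counter()
    for line in log_text.splitlines():
        if line.startswith("@"):
            authors[line[1:]] += 1  # one marker line per commit
            continue
        parts = line.split("\t")
        # numstat rows are "added<TAB>deleted<TAB>path"; binary files show "-"
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            churn[parts[2]] += int(parts[0]) + int(parts[1])
    return churn, authors
```

For example, `churn, authors = summarise(recent_log("."))` followed by `churn.most_common(10)` lists the ten most heavily changed files in the last 50 commits, which is a reasonable place to start reading diffs.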
Making Notes
You have to
make notes as you go, otherwise you will flounder and waste an inordinate amount
of time. If you need to dip in and out of codebases with weeks or months in
between visits, your notes will be invaluable to you the next time through.
I start
taking notes in Workflowy. If the notes grow a lot, I switch them to a git
repository I’ve called “codenotes” just for this purpose. It has a subdirectory
for every project, with cloning instructions, so I know how to get started next
time around, along with my notes. If you’re spending a lot of time on one large
project, consider writing a readme for developers and adding it to the project’s
own wiki or source control.
My Personal Setup
I use Vim
to read code. I turned off syntax highlighting a long time ago and am convinced
that it’s far easier to quickly read and comprehend code without it. Actually I
use the nofrils color scheme that has no syntax highlighting but does make
comments a very slightly different color to the code.
I
occasionally use folds (two keystrokes will hide all the code except the
top-level class and function declarations), but they are not crucial. I use the
NERDTree plugin to browse the directory structure, but again I don’t think it’s
crucial.
I have set
up a few keyboard shortcuts that make it quicker to load files and switch
between files.
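Buffers:
" List the open buffers, then leave ":buffer " on the command line for a number or name
nnoremap <Leader>b :ls<CR>:buffer<Space>
Files:
" Open the command-line window with "edit **/*" typed, ready for a file name
nnoremap <Leader>e q:iedit **/*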
I’ve used tags on and off over the years. If you work with languages for which tag
support is mature, then they’re good, but several of the languages I need to
work with (JavaScript and others) are still working out tags support, and the
time needed to set up the finicky toolchain isn’t worth it in my view. There isn’t
enough of a difference between tags and grep for me to spend time on
tags that don’t just work out of the box.
I also occasionally use Atom, Visual Studio, VS Code, Neovim, and a few other editors
and IDEs, and find them all to be perfectly acceptable; I’m just more productive
in Vim.