Backtesting Basics: Getting Started with Zipline

This post describes how to use the Python package zipline to backtest an equities trading strategy. Backtesting is the practice of evaluating a trading model or heuristic against past data. Although past performance is no guarantee of future performance, practitioners generally feel more confident in strategies that would have performed well had they been executed in the past (given the information available at the time). Another excellent reason to backtest is to catch conspicuous mistakes (e.g., coding or math errors) that would all but guarantee poor performance in the future.

This post covers the basics of using zipline in the context of trading equities. A future post will discuss using zipline to backtest a prediction markets trading strategy.

Overview

These instructions assume you are using zipline version 1.3.0.

Using zipline involves two distinct steps:

  1. Loading data
  2. Running a strategy which uses the data

Loading data from CSV

zipline can load arbitrary price data from a CSV file, provided the data is formatted in a specific way. Fortunately, the required format is relatively common. It is an OHLCV (open, high, low, close, volume) layout with eight columns: date, open, high, low, close, volume, dividend, and split. Do not worry about the dividend and split columns; in this example they are uniformly 0.0 and 1.0, respectively. The remaining columns are self-explanatory. Here is an example line from the file AAPL.csv.gz: 2012-01-03,58.485714,58.92857,58.42857,58.747143,75555200,0.0,1.0. The symbol associated with the price data is taken from the filename. In this case, the symbol is AAPL.
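As a rough illustration of the row format, the sketch below parses the example line with Python's standard csv module. The column order (dividend before split) is assumed from the header used in zipline's sample csvdir files, and the helper names parse_row and symbol_from_filename are hypothetical, not part of zipline.

```python
import csv
import io

# Assumed header order, matching zipline's sample csvdir files.
COLUMNS = ["date", "open", "high", "low", "close", "volume", "dividend", "split"]


def parse_row(line):
    """Parse one CSV data line into a dict keyed by column name."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(values)))
    row = dict(zip(COLUMNS, values))
    # Every column except the date is numeric.
    for name in COLUMNS[1:]:
        row[name] = float(row[name])
    return row


def symbol_from_filename(filename):
    """Return the text before the first dot, e.g. 'AAPL' for 'AAPL.csv.gz'."""
    return filename.split(".", 1)[0]


row = parse_row("2012-01-03,58.485714,58.92857,58.42857,58.747143,75555200,0.0,1.0")
print(symbol_from_filename("AAPL.csv.gz"), row["date"], row["close"])
```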

To make the data available to zipline for backtesting, we load (“ingest”) it by placing the file in a subdirectory named daily inside a data directory and running the following command:

CSVDIR=/path/to/data python3 -m zipline ingest -b csvdir

where /path/to/data is replaced with the appropriate directory. If /path/to/data is our directory, then AAPL.csv.gz is located at /path/to/data/daily/AAPL.csv.gz. (AAPL.csv.gz can be found in the zipline repository.) Note the required daily subdirectory. If everything is set up correctly, the command prints a message which begins with Loading custom pricing data: ....
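The directory layout can be sketched in a few lines of Python. This is only an illustration: the helper write_daily_csv is hypothetical, a temporary directory stands in for /path/to/data, and the header row is assumed to follow zipline's sample csvdir files.

```python
import gzip
import tempfile
from pathlib import Path

# Assumed header, matching zipline's sample csvdir files.
HEADER = "date,open,high,low,close,volume,dividend,split"
EXAMPLE_ROW = "2012-01-03,58.485714,58.92857,58.42857,58.747143,75555200,0.0,1.0"


def write_daily_csv(data_dir, symbol, lines):
    """Write gzip-compressed CSV lines to <data_dir>/daily/<symbol>.csv.gz."""
    daily_dir = Path(data_dir) / "daily"
    daily_dir.mkdir(parents=True, exist_ok=True)
    path = daily_dir / (symbol + ".csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("\n".join(lines) + "\n")
    return path


# A temporary directory stands in for /path/to/data.
data_dir = tempfile.mkdtemp()
csv_path = write_daily_csv(data_dir, "AAPL", [HEADER, EXAMPLE_ROW])
print(csv_path.relative_to(data_dir))
```

Setting CSVDIR to a directory laid out this way is what the ingest command above expects.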

Testing a Trading Strategy

Now that we have loaded historical price data we can test trading strategies.

Let’s test the following “strategy”: buy 10 shares of stock on Mondays, sell 10 shares on Wednesdays. (Who knows, maybe we’ve discovered that people are pessimistic on Mondays and tend to systematically undervalue equities.)

To implement this strategy we need to cast it into the terms used by the zipline API. Fortunately the strategy is simple, so this is not difficult. zipline requires us to write a function, handle_data, which it calls once per trading day and from which we can buy and sell shares using the order function. The only tricky part is querying zipline for the current date and time, from which we calculate the day of the week. To retrieve the current datetime inside handle_data, we inspect the current_dt attribute of the data argument (an instance of BarData) that is passed to our function. With daily data, as here, this datetime marks the end of the trading session for that day. Since AAPL is traded in the United States and trading closes at 16:00 New York time, an example data.current_dt would be 2014-06-16 20:00:00+00:00 (16:00 New York daylight time expressed in UTC).
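To make the date handling concrete, here is a small sketch using only the standard library. It assumes that the pandas timestamp zipline passes behaves like a standard datetime (it also offers a weekday() method), and it uses a fixed UTC-4 offset that is only valid for dates on New York daylight time.

```python
from datetime import datetime, timedelta, timezone

# The example timestamp from the text, expressed in UTC.
current_dt = datetime(2014, 6, 16, 20, 0, tzinfo=timezone.utc)

# Fixed UTC-4 offset: New York daylight time, valid for this June date only.
new_york_edt = timezone(timedelta(hours=-4))
local_close = current_dt.astimezone(new_york_edt)

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# weekday() counts from Monday (0) to Sunday (6).
print(local_close.strftime("%H:%M"), WEEKDAY_NAMES[current_dt.weekday()])
```

Because the UTC timestamp of the close falls on the same calendar day as the New York session, calling weekday() directly on the UTC value gives the right answer for this purpose.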

Let’s make a first attempt at writing a strategy. We will start with an empty initialize function (which zipline requires we define) and then craft our handle_data function. Here it is:
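A minimal sketch of such a strategy, reconstructed from the description above rather than copied from an original listing. The helper shares_for_weekday and the guarded import are illustrative choices; order and symbol are functions from zipline.api.

```python
# strategy1.py
try:
    from zipline.api import order, symbol
except ImportError:
    # Without zipline installed only shares_for_weekday can be used.
    order = symbol = None


def shares_for_weekday(weekday):
    """Map a weekday number (Monday is 0) to a signed share count."""
    if weekday == 0:  # Monday: buy
        return 10
    if weekday == 2:  # Wednesday: sell
        return -10
    return 0


def initialize(context):
    # Required by zipline; this strategy needs no setup.
    pass


def handle_data(context, data):
    # data.current_dt is the close of the current session, in UTC.
    amount = shares_for_weekday(data.current_dt.weekday())
    if amount != 0:
        # A negative amount is a sell order.
        order(symbol("AAPL"), amount)
```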

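If we save this code in a file named strategy1.py, we can backtest it over any period for which we have data. For example, we can test this minimal strategy over the first full trading week in January 2012 (which ends on Friday, January 13) with a command such as:

zipline run -f strategy1.py -b csvdir --start 2012-1-6 --end 2012-1-13 -o strategy1_out.pickle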

The output of a zipline run is a pandas data frame, saved here as strategy1_out.pickle, which can be read with the pandas.read_pickle function. The following lines of code open the data frame and print:

  • The opening value of our portfolio (i.e., the cash we start with)
  • The closing price of AAPL on Monday (at which we buy 10 shares)
  • The closing price of AAPL on Wednesday (at which we sell 10 shares)
  • The closing value of our portfolio
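A minimal sketch of this step, under several assumptions: closing prices are read from the ingested CSV file rather than from the backtest output, the Monday and Wednesday of the test week are taken to be 2012-01-09 and 2012-01-11, and the helper names format_summary and summarize_backtest are hypothetical. The portfolio_value column is part of zipline's performance output.

```python
def format_summary(opening_value, monday_close, wednesday_close, closing_value):
    """Build the four report lines from already-extracted numbers."""
    return [
        "Opening `portfolio_value`: {}".format(round(opening_value, 4)),
        "AAPL on Monday: {}".format(round(monday_close, 3)),
        "AAPL on Wednesday: {}".format(round(wednesday_close, 3)),
        "Closing `portfolio_value` {}".format(round(closing_value, 4)),
    ]


def summarize_backtest(pickle_path, csv_path,
                       monday="2012-01-09", wednesday="2012-01-11"):
    """Read the backtest output and the price file, then build the report."""
    import pandas as pd  # imported lazily so format_summary works without pandas

    perf = pd.read_pickle(pickle_path)
    # Assumes the CSV has a header row with 'date' and 'close' columns.
    prices = pd.read_csv(csv_path, index_col="date", parse_dates=True)
    return format_summary(
        perf["portfolio_value"].iloc[0],
        prices.loc[pd.Timestamp(monday), "close"],
        prices.loc[pd.Timestamp(wednesday), "close"],
        perf["portfolio_value"].iloc[-1],
    )


# Example call (both paths are placeholders):
# print("\n".join(summarize_backtest("strategy1_out.pickle",
#                                    "/path/to/data/daily/AAPL.csv.gz")))
```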

which outputs:

Opening `portfolio_value`: 10000000.0
AAPL on Monday: 60.247
AAPL on Wednesday: 60.364
Closing `portfolio_value` 10000000.5469

So in this particular week we earned a profit of 0.55 having made use of 602.47 (\(10 \times 60.247\)) (ignoring fees and slippage). Scaled to a year without compounding (\(0.55 / 602.47 \times 52\)), this is a return of about 4.7%, which is better than the risk-free rate in early 2012 (around 1.9%). Of course, we only backtested the algorithm for one week. (Our strategy is also very risky.) To run this strategy for an entire year we would change the end date: zipline run -f strategy1.py -b csvdir --start 2012-1-6 --end 2012-12-28 -o strategy1_out.pickle.
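The return arithmetic can be checked with a few lines of Python. This is a sketch under one stated assumption: the weekly return is scaled to a year by simple multiplication by 52, without compounding. The function name simple_annualized_return is illustrative.

```python
WEEKS_PER_YEAR = 52


def simple_annualized_return(profit, capital, periods_per_year=WEEKS_PER_YEAR):
    """Scale a one-period return to a year without compounding."""
    return profit / capital * periods_per_year


# Capital tied up in the position: 10 shares at Monday's close.
capital_used = 10 * 60.247
# Profit over the week, from the opening and closing portfolio values.
weekly_profit = 10000000.5469 - 10000000.0

annualized = simple_annualized_return(weekly_profit, capital_used)
print("capital used: {:.2f}".format(capital_used))
print("annualized return: {:.1%}".format(annualized))
```

Compounding the weekly return instead of multiplying it would give a slightly higher figure. Either way the estimate rests on a single week of data.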

Using zipline in a prediction market setting

To use zipline in a prediction market setting one needs correctly formatted data. Obtaining current price information, including the bid-ask spread, is typically not difficult. PredictIt, for example, makes this data available in a machine-readable format (e.g., https://www.predictit.org/api/Market/3633/Contracts).


This post is part of a series. The most recent post in the series is “Expected Shortfall in a Prediction Market Setting.”