Introduction
Let’s start by talking about what Survivorship Bias (SB) is.
SB is a phenomenon where data analysis or research is skewed because it only considers the individuals or things that have “survived” or succeeded in some way while neglecting those that failed or did not survive.
This can lead to inaccurate conclusions because it ignores important information from the non-surviving or less successful cases. In essence, SB occurs when you focus only on the winners and disregard the losers, resulting in a biased and incomplete perspective of a situation or dataset.
The most famous and clearest example of SB is the story of the bullet holes in WWII planes.
Location of the bullet holes on the surviving planes
During WWII, when military experts saw bullet holes in certain parts of returning planes, like the wings and fuselage, they thought about adding more armour to those spots. Their idea was to make the planes safer in the future by strengthening the areas where they found damage.
However, they didn’t think about the planes that didn’t make it back, which was a mistake. Those missing planes were the ones with holes in the areas that needed extra protection, as getting hit there usually meant the plane wouldn’t return.
On the other hand, the planes that did come back were the ones that got hit in less critical areas.
So, the desire to cover the holes in the areas they observed was an attempt to improve survivability, but it was based on incomplete data due to survivor bias.
Survivor bias in financial models
This kind of bias can appear anywhere, including in financial data.
In fact, my colleagues and I at Analytical Platform ran into this problem during the training and backtesting of our machine learning model.
It can cause problems such as:
- Overestimation of Returns: Survivorship bias can make historical returns appear better than they actually are, leading to unrealistic expectations of future performance.
- Inaccurate Risk Assessment: It may underestimate risk as poorly performing assets are excluded, potentially leading to the adoption of riskier strategies.
- False Model Validation: Survivorship bias can validate strategies that wouldn’t have been profitable with a complete dataset, resulting in ineffective strategies in real trading.
- Biased Strategy Development: Strategies may become tailored to surviving assets’ characteristics, which may not apply to the broader market.
- Limited Generalizability: Strategies developed with survivorship-biased data may not work well with new data, making adaptation to changing market conditions challenging.
In data such as stock performance, this bias arises because the data typically includes only the successful cases, like profitable investments or thriving companies, while neglecting instances of failure or underperformance.
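A small simulation can make the overestimation of returns concrete. The sketch below is only an illustration under stated assumptions: the returns are random draws, not real market data, and the function name `simulate_survivorship`, the delisting threshold and all parameter values are hypothetical. An asset counts as a "survivor" if its simulated price never fell below the threshold.

```python
import numpy as np

def simulate_survivorship(n_assets=1000, n_years=10, mean=0.05, vol=0.25,
                          delist_below=0.5, seed=0):
    """Compare the average annualised return of all assets with survivors only.

    All inputs are hypothetical; nothing here is calibrated to a real market.
    """
    rng = np.random.default_rng(seed)
    # Random yearly simple returns, clipped so a price can never reach zero.
    returns = np.clip(rng.normal(mean, vol, size=(n_assets, n_years)), -0.99, None)
    # Price paths that all start at 1.0.
    prices = np.cumprod(1.0 + returns, axis=1)
    # An asset "survives" if its price never dropped below the threshold.
    survived = prices.min(axis=1) >= delist_below
    # Annualised return of each asset over the whole period.
    annualised = prices[:, -1] ** (1.0 / n_years) - 1.0
    return annualised.mean(), annualised[survived].mean(), survived.mean()

all_mean, survivor_mean, survivor_share = simulate_survivorship()
print(f"all assets:     {all_mean:.2%}")
print(f"survivors only: {survivor_mean:.2%}")
print(f"share of assets that survived: {survivor_share:.0%}")
```

Because the survivors are selected on having avoided large losses, their average return comes out higher than the average over the full universe, which is the same effect a dataset without delisted stocks would show.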
Let’s look more closely at some examples:
- Stock Indices: Consider an index like the S&P 500, which tracks the performance of 500 large-cap U.S. companies. Over time, companies that decline in value or go bankrupt are removed from the index, while new successful ones are added. If you analyze the performance of this index without considering the companies that have been removed due to poor performance, you might get an overly positive view of the market’s historical performance. This can lead investors to underestimate the risks associated with investing in stocks.
- Mutual Funds: Many investors analyze the historical performance of mutual funds to make investment decisions. Survivor bias can occur if you only look at the performance of funds that are currently active. Funds that performed poorly and were closed or merged with other funds are often excluded from such analyses. This can give the impression that investing in mutual funds is more profitable than it really is, because you’re not accounting for the underperforming funds that didn’t survive. This bias can skew the reported performance by up to 5% [1].
- IPO Analysis: When assessing the performance of Initial Public Offerings (IPOs), if you only study companies that successfully went public and became profitable, you might conclude that IPOs are generally lucrative. However, this ignores companies that attempted IPOs but failed or struggled post-IPO. Understanding the performance of both successful and unsuccessful IPOs is crucial for making informed investment decisions.
We can see the effect of SB in an index like the TSEJ, where failing stocks are constantly removed from the index (green line). If we accounted for those stocks, the performance would look different (blue line), which shows how their removal alters the perception of market performance. The same can be seen for other markets like the FTSE 100 and the EuroStoxx.
Returns of survivorship-biased vs. survivorship-bias-free datasets (including delisted stocks) for the Tokyo Stock Exchange
Returns of survivorship-biased vs. survivorship-bias-free datasets (including delisted stocks) for the FTSE and EuroStoxx
How to solve it
To tackle survivor bias in the context of stocks, a practical solution is to build a dataset that doesn’t favour survivors, but instead includes stocks that have experienced various fates, such as:
- Delisting: This involves stocks that were taken off the public stock market, often due to poor performance or other reasons. By including delisted stocks, we get a more balanced view of the stock market, acknowledging both successful and unsuccessful ventures.
- Acquisition by Other Companies: Some stocks cease to exist on their own when they are bought or merged with other companies. Including these stocks in the dataset helps us understand how corporate deals impact stock performance.
- Removal from Index: Stocks that were once part of major indices like the S&P 500 but were later removed provide valuable insights. Their inclusion in the dataset shows how stocks can rise in importance and then decline over time.
By considering these different stock outcomes, we can analyze a more diverse range of scenarios and results, helping researchers and investors make more informed decisions in the ever-changing world of stock trading.
Free survivorship-bias-free price dataset
An example of how survivorship-bias-free datasets are created can be seen in the blog of Teddy Koker.
In summary:
First, we need to gather a complete historical list of all the companies that were part of the S&P 500 during our target period. Once we have this list, we can create a dataset by including the historical prices of these companies while they were in the S&P 500.
Unfortunately, getting historical data for S&P 500 constituents can be tricky. In most cases, you have to buy this data from specialized providers.
The article mentioned above suggests a partial solution using the iShares Core S&P 500 ETF (IVV), which tracks the S&P 500 and reveals its composition every month. However, please note that this data only goes back to 2006.
After identifying the historical S&P 500 constituents, the next challenge is finding their pricing data. Companies in the S&P 500 change over time due to name changes, acquisitions, and sometimes even bankruptcy. This complicates things when you rely on free sources like Yahoo Finance, which often lacks information about delisted stocks. Moreover, records of ticker name changes are often not well-documented, making the data collection process even more complex.
To tackle this, we can use the WIKI Prices Dataset to find pricing data for each S&P 500 constituent after we’ve established the index’s composition. Once we have all this data together, we can effectively test our strategy.
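The step of keeping prices only while a company was in the index can be sketched with pandas. This is a minimal illustration, not the code from the article mentioned above: the column names, the tickers `AAA` and `BBB`, and the helper `filter_to_members` are all hypothetical, and the membership table is assumed to hold one start/end interval per ticker.

```python
import pandas as pd

def filter_to_members(prices: pd.DataFrame, membership: pd.DataFrame) -> pd.DataFrame:
    """Keep only price rows that fall inside a ticker's index-membership interval.

    prices:     columns ["date", "ticker", "close"]
    membership: columns ["ticker", "start", "end"]; "end" is NaT while the
                ticker is still in the index.
    """
    merged = prices.merge(membership, on="ticker", how="inner")
    # Open-ended memberships are treated as lasting until the latest price date.
    end = merged["end"].fillna(prices["date"].max())
    in_index = (merged["date"] >= merged["start"]) & (merged["date"] <= end)
    return (merged.loc[in_index, ["date", "ticker", "close"]]
            .sort_values(["date", "ticker"])
            .reset_index(drop=True))

# Hypothetical data: BBB leaves the index at the end of February 2010.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-29", "2010-02-26", "2010-03-31"] * 2),
    "ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "close": [10.0, 11.0, 12.0, 5.0, 4.0, 3.0],
})
membership = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "start": pd.to_datetime(["2009-01-01", "2009-01-01"]),
    "end": pd.to_datetime([None, "2010-02-28"]),
})
universe = filter_to_members(prices, membership)
print(universe)
```

Note that BBB's declining prices for January and February stay in the dataset. Only the rows after its removal from the index are dropped, so the losses it produced while it was a constituent are not hidden.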
You can directly download the bias-free dataset of stock prices from here.
What about fundamentals? We need fundamentals!
You may ask now:
What is fundamental data in the first place?
Fundamental company data comprises vital financial information, including revenue, profit, balance sheet, cash flow, dividends, debt levels and much more. It plays a crucial role in assessing a company’s financial health and investment potential.
When it comes to training a model using a quantitative approach, we require both price and fundamental data. Obtaining this data over an extended period of time can be quite challenging.
Fortunately, there are providers who offer this information for a monthly fee. Whether it’s worth the cost depends on whether we are using it for professional purposes.
Some paid providers are Norgate (from 1990), SimFin, Sharadar, AlgoSeek (only from 2007), and EOD (from 2000).
If we manage to obtain some of these datasets, we will be able to train our Machine Learning model.
How we solved the problem
My colleagues and I accessed the historical composition of the S&P 500 index through an EOD subscription.
As mentioned above, the constituents of this index change over time, with updates typically made every quarter.
Consequently, we conducted our model training and backtesting exclusively on the stocks that were part of the index during the specific time period in question, which resolved the SB problem.
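The idea of restricting each backtest date to the stocks that were in the index at that time can be sketched as a point-in-time lookup. This is not our production code: the class `PointInTimeUniverse`, the snapshot dates and the tickers are hypothetical, and the constituent history is assumed to be available as dated snapshots of the index membership.

```python
from bisect import bisect_right
from datetime import date

class PointInTimeUniverse:
    """Look up index constituents as of a given date from dated snapshots."""

    def __init__(self, snapshots: dict[date, set[str]]):
        self._dates = sorted(snapshots)
        self._members = [frozenset(snapshots[d]) for d in self._dates]

    def as_of(self, day: date) -> frozenset[str]:
        # Find the latest snapshot taken on or before `day`.
        i = bisect_right(self._dates, day)
        if i == 0:
            raise ValueError(f"no constituent snapshot on or before {day}")
        return self._members[i - 1]

# Hypothetical snapshots: BBB is removed and DDD is added in the second one.
snapshots = {
    date(2020, 3, 31): {"AAA", "BBB", "CCC"},
    date(2020, 6, 30): {"AAA", "CCC", "DDD"},
}
universe = PointInTimeUniverse(snapshots)

for rebalance_day in [date(2020, 4, 15), date(2020, 7, 15)]:
    # Only stocks that were constituents on this day may be traded.
    tradable = sorted(universe.as_of(rebalance_day))
    print(rebalance_day, tradable)
```

Using the latest snapshot on or before each date means the model never sees a stock before it joined the index and still sees stocks that were removed later, for as long as they were constituents.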
Conclusions
Survivorship bias is a big issue in finance, where data often only shows success stories and ignores failures. This can lead to misguided investment decisions.
For example, stock indices like the S&P 500 remove struggling companies, making the index look better than it really is. The same goes for mutual funds.
To solve this, we need to collect data on both successful and unsuccessful cases. This means getting historical data for all companies in an index, not just the ones still around today. It’s a tough job, but it’s necessary for accurate financial analysis.
It’s possible to get bias-free price data here.
Fundamentals data, however, is not always available for free, so we listed a few paid providers.
By addressing SB, that is, by using in our models only the stocks that were actually in the index during each specific period of time, we can build better financial models and avoid costly mistakes in investing. It’s all about seeing the full picture, warts and all.
You can read how we use all this information and much more to beat the market here.
References
[1] Welcome to the Dark Side, Gaurav S. Amin, Harry M. Kat, The Journal of Alternative Investments, Summer 2003