The methods used today to understand private companies — especially fast, high-growth startup companies — are broken.
Hopelessly, insanely broken and antiquated.
Given the important role that small- and medium-sized enterprises (SMEs), and startups in particular, play in the global economy and in driving innovation, it’s unfortunate and surprising how little the methods for understanding them have evolved.
There is a better way.
Yesterday, The New York Times offered a glimpse into this better way with its release of a list of 50 Future Unicorns (private companies worth over $1B).
It was generated algorithmically using CB Insights’ Mosaic software. Mosaic is an evidence-based, statistically-driven approach to objectively assess private company health. It was built using machine-learning and data science.
In simpler terms, think of Mosaic as “Moneyball for Startups” or the modern business equivalent to the FICO score.
Today, we are excited to unveil Mosaic and also announce $1.15 million in funding from the National Science Foundation (NSF) – an early believer in our vision to build a more nuanced understanding of historically opaque private companies.
Why build Mosaic?
The germ of the idea that became Mosaic started while Jon and I were at American Express. It was there, as a lender to private companies, that we saw the challenges of understanding private companies and the limitations of existing purported solutions.
After American Express, we became the founders of a startup, CB Insights.
Once we found ourselves on the other side of the table, the challenges and obstacles that private, high-growth companies face became very real to us. It was as founders that we learned we couldn’t get better pricing or credit terms on certain services because the broken, legacy credit bureaus had deemed us risky. A couple of years ago, one payroll provider told us that the credit bureaus suggested we were “near bankruptcy” – even though our revenue and bank balance clearly showed otherwise.
Because we hadn’t taken credit and the legacy providers didn’t know how to handle a company with little to no credit history (known in the industry as a “thin file company”), we got worse rates, terms, and attention from those that should want to do business with us.
And we’re not unique here. Hundreds of thousands of private SMEs and startups see similar challenges, and the costs go beyond mere annoyance.
For private companies, these challenges skim away profits, decrease opportunity and hinder productivity.
- Vendors offer inferior pricing, terms, credit, or service
- Lenders (banks, credit card companies) extend them less credit, or none at all, and at worse terms
- Investors or acquirers are unable to discover and ultimately compete for them
For the vendors, lenders, investors, and acquirers, the costs are real too in the form of opportunities lost.
Mosaic solves this.
How does Mosaic work?
The Moneyball approach, or that of using statistical analysis to make predictions, is not new in other domains. The Oakland A’s baseball team is famous for using statistics (or what is known in baseball as “sabermetrics”) to build a competitive team despite a limited payroll. Nate Silver and his now famous 538 blog rose to prominence using statistics to forecast the results of the presidential election more accurately than political analysts and pundits.
But this statistical approach hadn’t been applied to the private markets. Until now.
Specifically, Mosaic ingests massive amounts of structured, semi-structured, and unstructured information left behind by companies and extracts, processes, and statistically analyzes this trail of digital breadcrumbs or data exhaust to develop predictive intelligence regarding the health of a private company.
The result of this statistical analysis is the Mosaic Score. Currently, Mosaic uses dozens of signals, including the quantity and quality of job postings, web traffic, social media chatter, executive turnover, customer signings, mobile app data, and news sentiment, among others. New signals are being integrated constantly.
A company’s Mosaic score is driven by 3 separate factors or underlying models as detailed below. When visiting a company profile on CB Insights, customers with access to Mosaic will see all of these scores (see screenshot above), which are on a scale of 0 to 1000, with 1000 being the top possible score.
The 3 sub-scores or 3Ms are Momentum, Market and Money.
- Momentum – Measures the individual performance of a company relative to itself and peers using signals such as product news, hiring activity, partner/customer signings, online sentiment, social media chatter, and mobile & web traffic/downloads among other factors.
- Market – Quantifies the health of the industry in which a company participates based on funding, deals, hiring activity, industry sentiment, investor/acquirer quality, and exit activity.
- Money – Assesses the financial strength and financial viability of a company based on projected burn rate, financing history, and investor quality.
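To make the blending of the 3Ms concrete, here is a minimal sketch of how three 0-to-1000 sub-scores could roll up into one overall score. The weights here are made-up placeholders for illustration; CB Insights’ actual blend is proprietary and not disclosed in this post.

```python
# Illustrative only: the weights below are hypothetical placeholders,
# not CB Insights' actual (proprietary) sub-score blend.
def mosaic_score(momentum, market, money, weights=(0.4, 0.3, 0.3)):
    """Blend the three 0-1000 sub-scores into one overall 0-1000 score."""
    subs = (momentum, market, money)
    if not all(0 <= s <= 1000 for s in subs):
        raise ValueError("sub-scores must be on the 0-1000 scale")
    # weighted average keeps the result on the same 0-1000 scale
    return round(sum(w * s for w, s in zip(weights, subs)))
```

For example, with these placeholder weights a company scoring 800 on Momentum, 700 on Market, and 600 on Money would land at an overall 710.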
Each of the sub-scores has its own distinct distribution, reflecting the real life state of private, venture-backed companies. The Market and Money scores skew left, as financing hit record levels in recent years and investors flocked to hot industries. Momentum, on the other hand, skews right, reflecting the reality of venture capital that more VC-backed companies fail than succeed. When these scores are aggregated, the majority of companies fall in the middle, resulting in something that resembles a normal distribution.
Beyond seeing Mosaic scores on a company’s profile, you will see them when comparing companies against one another as illustrated below in this comparison of Y Combinator unicorns. All of these have relatively high Mosaic scores but among the four, Dropbox is the laggard.
You can also now search for Mosaic scores in Search. This is particularly useful when the number of results is large, and you want a quick way to zero in on the strongest companies within an industry, geography, etc.
Who is using Mosaic?
Mosaic is currently used by a diverse array of customers in sales, venture capital/M&A, corporate strategy/innovation, and risk management to make better decisions about private companies and to understand where industries and business models are going.
We are doing a lot of work to integrate Mosaic into 3rd-party platforms. Ultimately, our goal is that a company’s Mosaic score will underlie, sometimes invisibly, a lot of the decisions that investors, vendors, creditors, and acquirers make about private companies.
We also recently launched CB Insights For Sales for B2B sales teams. We see numerous other verticals that would benefit from Mosaic and will be releasing those over time.
If you’d like to see CB Insights Mosaic, sign up for a free trial.
Below is a bit more on the data science and machine learning challenges we’ve encountered in building Mosaic. It is high-level, but if you’re into this sort of stuff, read on and feel free to ask questions in the comments or via email. If the problem of assessing private company health using public data is one that interests you, we are hiring.
Challenges of building Mosaic
Assessing private, high-growth startup companies using public data is, to put it mildly, a very thorny problem. We’ve encountered many interesting challenges already and anticipate many others. We’ve summarized some of them below but will be covering these challenges and the development of Mosaic in more detail on our engineering blog.
Data Collection, Aggregation, And Translation
Before we could even begin developing the Mosaic models, we had to collect information. The vast majority of the financing and exit data we have is collected algorithmically through a technology and process we call The Cruncher. This system of machine-learning software crawls millions of unstructured, semi-structured, and structured information sources every day, extracting the key pieces of structured data that we care about.
In addition to the Cruncher, we’ve built various news classification and data extraction technologies over time, including classification of HR and partnership/customer news as well as assessing the sentiment of news and social media. Additionally, we are looking at non-obvious signals like press release or tweet frequency where relevant to take the pulse of a company.
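As a toy illustration of a frequency signal like the one above, a company’s posting “pulse” can be computed from nothing more than timestamps. This is a simplified sketch, not CB Insights’ actual implementation; the four-week window is an arbitrary choice.

```python
from datetime import date

def weekly_pulse(post_dates, weeks=4):
    """Average posts per week over a trailing window ending at the most
    recent post - a crude 'pulse' from press-release or tweet dates.

    Toy sketch only: window length and the metric itself are assumptions,
    not CB Insights' real signal definition.
    """
    if not post_dates:
        return 0.0
    cutoff = max(post_dates).toordinal() - 7 * weeks
    recent = [d for d in post_dates if d.toordinal() > cutoff]
    return len(recent) / weeks
```

A rising pulse relative to a company’s own history (or its peers) is the kind of raw input that downstream models can weigh.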
Once we have the data, a perhaps even more difficult task is translating disparate data types into a form that can be programmatically and algorithmically analyzed.
Once we’ve identified the digital breadcrumbs we want and gathered them, we need to figure out how they all fit together.
The first part of that task is assessing how much of this data is actually a determining factor in where that company is today, let alone where it is going. There’s math for that: autocorrelation, the measure of how a signal’s value at one time affects its value at a later point. Using the vast pool of historical data we’ve collected, we are able to determine statistically how far back we need to look at a company’s data before its effect on current and future performance fades.
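The idea can be sketched in a few lines of Python, assuming a simple univariate time series and an arbitrary 0.2 correlation threshold (the production system’s thresholds and models are, of course, more involved):

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a signal with itself at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def lookback_window(series, threshold=0.2, max_lag=None):
    """Largest lag at which the signal still meaningfully predicts itself:
    the last lag whose autocorrelation stays above the threshold.
    The 0.2 threshold is an illustrative assumption."""
    max_lag = max_lag or len(series) // 2
    window = 0
    for lag in range(1, max_lag + 1):
        if abs(autocorrelation(series, lag)) < threshold:
            break
        window = lag
    return window
```

Run over a company’s historical signal, the window tells us how many periods of past data are worth feeding into the current score.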
How do these features play together?
At this point, we have a pool of data that we know to be relevant to the current and future performance of each company. The simpleton’s approach here would be to sum the features’ scores together to create a final score representing an unweighted aggregation of our inputs. But not all inputs are created equal:
Is a job posting for a head of HR as important as a posting for a marketing analyst?
Are Facebook likes equally as valuable as a press mention?
Is glowing news coverage as valuable as a negative press mention is damaging?
To answer these questions, and determine the feature weights for our inputs, we used a regression model to predict the likelihood that a company would raise and/or exit in the near future. Measuring how each of these inputs contributed to predictions’ accuracy provided us with their relative importance in forecasting a company’s future success.
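The weighting step can be sketched as follows. Everything here is synthetic and hypothetical: the three named signals, the simulated raise/exit outcome, and the use of a plain linear-probability regression as a stand-in for whatever model CB Insights actually runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Three hypothetical, standardized signals per company:
# hiring activity, web traffic, news sentiment (all synthetic).
X = rng.normal(size=(n, 3))
# Simulated outcome: did the company raise or exit? In this toy setup
# it is driven mostly by the first two signals, plus noise.
y = (1.5 * X[:, 0] + 0.8 * X[:, 1] + 0.1 * X[:, 2]
     + rng.normal(scale=0.5, size=n)) > 0

# A linear-probability regression stands in for the real (undisclosed) model;
# normalized absolute coefficients give relative feature importance.
coef, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
feature_weights = np.abs(coef) / np.abs(coef).sum()
```

In this simulation the recovered weights rank hiring activity above news sentiment, mirroring the relative strength with which each signal drove the simulated outcome.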
What about missing data?
As much data as we’ve collected, many companies out there are missing data on certain features: they may not have social media accounts, may not be hiring at the moment, or may not have received press coverage.
Where appropriate, we’ve used an array of methods, including clustering and nearest neighbor imputation, to fill in those gaps. In order to account for situations where imputation was inappropriate, we’ve implemented a responsive weighting system to dynamically calculate coefficients when encountering incomplete data.
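A bare-bones version of nearest neighbor imputation looks like this. It is a sketch under simplifying assumptions (numeric features, `None` for missing values, Euclidean distance over whatever features two rows share), not the production imputation pipeline.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance over the features present in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sqrt(sum((x - y) ** 2 for x, y in shared))

def knn_impute(rows, k=2):
    """Fill each missing entry with the mean of that feature among the
    k most similar rows that do have it. Toy sketch only."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                # candidate donors: other rows that have this feature,
                # ordered from most to least similar
                donors = sorted(
                    (r for r in rows if r is not row and r[j] is not None),
                    key=lambda r: distance(row, r),
                )[:k]
                if donors:
                    filled[i][j] = sum(r[j] for r in donors) / len(donors)
    return filled
```

A company missing, say, a web-traffic value inherits the average of its two most similar peers rather than a misleading zero.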
Is it really fair to compare all these companies?
One of the challenges in attempting to synthesize an area as broad as tech is the extreme feature diversity, both in terms of company metadata and feature values. Can we compare a Computer Hardware company to a Mobile Gaming company?
We could, but it likely wouldn’t yield any meaningful insights.
A similar challenge arises from the varying stages of companies we are evaluating. We shouldn’t compare an early stage company with < $1M in funding to Uber; it just wouldn’t be fair (or logical).
Both of these issues can be addressed with clustering and a nearest neighbors analysis akin to that used by Nate Silver’s PECOTA, which identifies similar players by both personal attributes and performance. By creating peer groups of companies with similar attributes and at similar growth stages, we can score a company’s performance in a meaningful way, benchmarking its health and growth relative to those of a typical company in its peer group.
Continuing with the baseball analogy, the scores are similar to the WAR (Wins Above Replacement) metric, indicating the degree to which a company outperformed or underperformed the competition.
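The peer-group benchmarking idea can be sketched as a z-score within a peer group. The grouping keys (`stage`, `industry`) and the metric name are illustrative assumptions; the real system clusters on many more attributes.

```python
from statistics import mean, stdev

def peer_benchmark(companies, company, key):
    """Score one company's metric relative to its peer group, WAR-style:
    standard deviations above or below the typical peer.

    Sketch only: peer groups here are just (stage, industry) matches,
    a stand-in for the richer clustering described above.
    """
    peers = [c for c in companies
             if c["stage"] == company["stage"]
             and c["industry"] == company["industry"]]
    values = [c[key] for c in peers]
    if len(values) < 2 or stdev(values) == 0:
        return 0.0  # no meaningful peer group to benchmark against
    return (company[key] - mean(values)) / stdev(values)
```

A score of +1.0 means the company sits one standard deviation above the typical company in its peer group on that metric; a negative score means it is lagging its peers.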
Again, if any questions, leave them in the comments or drop us a line.