Company Mosaic

Our Company Mosaic algorithm, funded by the National Science Foundation, gives you predictive intelligence into the health of private companies.


Mosaic lets you see the best, most relevant companies exhibiting the signals of health you care about.



Like Google PageRank, the Business Social Graph illustrates the quantity and quality of a company’s network—a critical predictor of company health.

Increasing access to opportunities for the best private companies


While working at American Express, the founders of CB Insights experienced firsthand the challenges of understanding private companies. Traditional scoring methodologies were time-lagged and highly incomplete, especially for emerging companies with limited credit histories.

As a result, business development, marketing and even risk management professionals were hampered in their efforts.

In 2012, CB Insights approached the National Science Foundation (NSF) with a proposal to use machine learning and advanced language processing techniques to build a system that would address these challenges. The NSF provided several grants to CB Insights to develop this technology.

The technology we began developing with the NSF has become Mosaic.

What is Mosaic?

Mosaic is a quantitative framework to measure the overall health and growth potential of private companies using non-traditional signals (covered below).

There are three distinct and independent factors that are synthesized into one final Mosaic score. These models are the three M’s:


Measures the individual performance of a company relative to itself and peers using signals from social media, news sentiment, mobile and web traffic/usage, hiring, customer and partner signings


Quantifies the health (or lack thereof) of the industry in which a company participates based on funding, deals, hiring activity, industry sentiment, and exit activity among other factors


Assesses the financial strength and financial viability of a company based on financing history, burn rate and investor quality

The individual model scores from the three M’s, as well as the overall Mosaic score, are used by clients in a variety of ways to help them understand what’s next.

How do our customers use Mosaic?

Marketing & business development 

Identification, prioritization and nurturing of opportunities


See fast-growing markets and industries before others to inform strategic decisions

Product development 

Pinpoint fast-growing private companies to understand their business models, products and technology

Account management

Identify existing customers with momentum that may be suitable targets for upsell

Risk management

Evaluate the health of current and prospective customers to assess credit risk and reduce accounts receivable risk


Identify and analyze vendors in areas of interest and track vendor health on an ongoing basis

Backed by the National Science Foundation


While our full proposal to the National Science Foundation is confidential, some excerpts are included below to give insight into our goals, approach and challenges in building Mosaic.

First, it is worth noting that assessing the health of private companies is a very hard problem to solve and one we’ve been tackling for years. It is also a very important one. Private companies, especially those in high growth sectors like tech, energy and life sciences, are the lifeblood of our economy and massive drivers of employment, innovation and commerce. Increasing their access to opportunities is a good thing for society.

The difficulty and importance of this problem mean that any algorithm that purports to assess the health of private companies must be rigorous, as doing this poorly can have deleterious impacts on companies and the people who work at them. Superficial tracking of buzz and vanity metrics may be interesting and good for publicity, but such shallow methods ultimately do more harm than good to private companies and undermine the discussion about them, as they introduce more noise than signal.

Excerpts from our National Science Foundation Proposal

Our National Science Foundation support has spanned multiple phases. The following excerpts are taken from different proposals. These cover elements of our approach, our challenges and our technology. Note: If the problems we are solving are interesting to you, we are hiring.

Project Objectives

In Phase I, our primary technical objectives were to research and develop information collection, extraction, categorization and sentiment analysis capabilities that would enable us to deliver initial algorithmic Mosaic health assessments for a test group of private companies. Specifically, the detailed objectives of our Phase I plan were to:

  1. Identify data and information inputs with potential signal value for Mosaic
  2. Determine strength and sentiment indicators that can offer signal value and which can be leveraged to determine private company health
  3. Extract and store relevant information from inputs that provide signals of private company health
  4. Run statistical analyses on Mosaic signals and models and compare them to available historical data to back-test models
  5. Algorithmically combine signals to develop Mosaic health assessment

Why Mosaic is Different

A key differentiator and value proposition of Mosaic is the broad set of inputs it processes to deliver its insights into private company health. To enable this, the main “artery” of the Mosaic system is the Input Aggregation Module (IAM). IAM identifies sources of information—structured, unstructured, and semi-structured—for companies and extracts relevant information from these diverse sources. Mosaic looks at a diverse array of signals that measure everything from HR to industry health to investor quality and a host of other measures which offer predictive insights into the health of a private company.

To build the Input Aggregation Module, we had four primary research goals in Phase I: 

  1. Identify relevant sources of information
  2. Understand characteristics of information sources to determine Phase I focus
  3. Retrieve and extract information inputs from selected sources
  4. Conduct entity matching and calculate input relevance

How Mosaic Helps

Our proposal seeks to continue our development of Mosaic, innovative software aimed at investors, acquirers, partners and vendors of private companies that will provide them with actionable, real-time intelligence into the health of these companies.

Mosaic scans and parses millions of structured, semi-structured and unstructured information sources searching for signals of a private company’s health, and then algorithmically processes, categorizes and assesses the sentiment and strength of these disparate signals to offer a comprehensive, coherent and real-time view of a private company’s health. 

Using Mosaic, our customers are able to look at private companies in a fundamentally different, smarter, more scalable and data-driven way that empowers them to make critical decisions about private companies efficiently and intelligently. More specifically, they are able to identify the right companies and are armed with intelligence that improves the rigor with which they make those decisions.

Early Results

Our Phase I technical efforts have shown that signals relating to private companies are not only available but can also be extracted, translated, synthesized and algorithmically analyzed to offer predictive insights into a private company’s health.

Why It’s Hard

Here are just a few of the many reasons:

Input Heterogeneity

For example, one page of particular interest on a company’s website was the page detailing a company’s management team. We collected a training set of over 1,000 company management team website pages and then manually analyzed and labeled them for the purpose of identifying recurring patterns. The heterogeneity among this sample of websites was staggering.

Post-processing, Error Handling

We evaluated information extraction tools such as _______ and _______ but found through our work during Phase I that while these may help with parsing and extracting data from pages, they generally do not assist with post-processing, error handling, resolving incomplete data, or ensuring that URL and layout changes are reflected.

Entity Matching

To ensure that entities identified in our input sources from Step 3 referred to the appropriate real-world entity (company), entity matching (EM) was a significant priority. Our primary goals for entity matching were three-fold and focused on being effective (achieving high-quality results), efficient (EM should occur quickly), and generalized (the EM method could be applied to different tasks and information types).

Following entity matching, the relevance of that match to the company in question was determined—what we refer to as Input Relevance. Input Relevance is needed in the context of Mosaic to ensure that an input that references a company is contextually focused on the company and not only superficially relevant. Input relevance ensures that only relevant signals are used in the determination of a company’s Mosaic health.

To perform Entity Matching and Input Relevance for Mosaic, we’ve developed an algorithm to compute the similarity between an entity extracted from an information input and the real-world entity in question. The algorithm we’ve developed focuses on attribute and context matching.
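As a rough illustration of attribute and context matching, the sketch below scores how closely an extracted entity resembles a known company. It is not the Mosaic algorithm: the function names, the use of Python's `difflib` string similarity, the context-term overlap measure and the 0.6 attribute weight are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def attribute_similarity(extracted, known):
    """Average string similarity over the attributes both records share."""
    shared = [key for key in extracted if key in known]
    if not shared:
        return 0.0
    ratios = [
        SequenceMatcher(
            None, str(extracted[key]).lower(), str(known[key]).lower()
        ).ratio()
        for key in shared
    ]
    return sum(ratios) / len(ratios)


def context_similarity(text, context_terms):
    """Fraction of the known company's context terms that appear in the text."""
    if not context_terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & set(context_terms)) / len(context_terms)


def match_score(extracted, known, text, context_terms, attr_weight=0.6):
    """Blend attribute and context similarity (attr_weight is an assumed value)."""
    attr = attribute_similarity(extracted, known)
    ctx = context_similarity(text, context_terms)
    return attr_weight * attr + (1.0 - attr_weight) * ctx


# Hypothetical records, for illustration only.
known_company = {"name": "Acme Robotics Inc", "city": "Boston"}
known_context = {"robotics", "warehouse", "automation"}

candidate = {"name": "Acme Robotics", "city": "Boston"}
article = "Acme Robotics expands its warehouse automation line."

print(round(match_score(candidate, known_company, article, known_context), 3))
```

In a setup like this, a match would be accepted only above some threshold, and the context component is what separates a company from an unrelated entity with a similar name.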

Life Event Categorization

The inputs provided by the Input Aggregation Module (IAM) are then categorized based on their context into Mosaic Life Events using the next module, the Life Event Categorizer (LEC). Companies experience different types of events that offer clues or signals into the company’s levels of stress or strength. These Life Events fall into distinct categories and need to be handled differently based on their significance to private company health.

To enable Life Event Categorization, we first used a collection of machine learning algorithms designed for data mining tasks ranging from data preprocessing, clustering, classification and regression to visualization and feature selection. We used _____ for feature selection for each Life Event category.

We found that __________ yielded the best results for Life Event categorization, so following feature selection, we created a standalone program in ______ that assigned two weights to each feature as detailed below. Specifically, we looked at a feature’s prevalence within a Life Event category relative to all features in that category (W1). We also looked at a feature’s prevalence within a category relative to that feature’s prevalence across all categories (W2).
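One plausible reading of the two weights, sketched from per-category feature counts: W1 is a feature's count in a category divided by the total count of all features in that category, and W2 is the same count divided by the feature's total count across all categories. The function name, the count-based notion of prevalence and the sample categories are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Tuple


def feature_weights(
    category_counts: Dict[str, Counter],
) -> Dict[str, Dict[str, Tuple[float, float]]]:
    """Return {category: {feature: (W1, W2)}} from raw feature counts."""
    # Total occurrences of each feature across every category (denominator of W2).
    totals_by_feature = Counter()
    for counts in category_counts.values():
        totals_by_feature.update(counts)

    weights = {}
    for category, counts in category_counts.items():
        # Total occurrences of all features in this category (denominator of W1).
        category_total = sum(counts.values())
        weights[category] = {
            feature: (n / category_total, n / totals_by_feature[feature])
            for feature, n in counts.items()
            if n > 0
        }
    return weights


# Hypothetical Life Event categories and counts, for illustration only.
counts = {
    "funding": Counter({"raises": 8, "series": 2}),
    "layoffs": Counter({"cuts": 6, "raises": 2, "series": 2}),
}

for category, features in feature_weights(counts).items():
    print(category, features)
```

Under this reading, W1 rewards features that dominate a category, while W2 rewards features that are distinctive to it, so a word that is common everywhere earns a low W2 in every category.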

For example, our Life Event categorization module scores each article’s title and content for each of the Life Event categories yielding a total of ____ scores for each news article. The scores are calculated for both the title and content of each article as detailed below.

m = multiplier, generally equal to 1. For certain title features, the multiplier increases the weight of features that are more important given their high correlation with particular categories.
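The exact scoring formula is not reproduced in this excerpt. As a hedged sketch only, the example below assumes each matched feature contributes m × W1 × W2 to a category's score, computed separately for the title and the content, with the multiplier applied only to title features. The function names, the product form and the sample weights are all assumptions.

```python
import re


def score_text(text, weights, multipliers=None):
    """Sum m * W1 * W2 over every token that is a known feature (assumed form)."""
    multipliers = multipliers or {}
    score = 0.0
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in weights:
            w1, w2 = weights[token]
            m = multipliers.get(token, 1.0)  # m defaults to 1
            score += m * w1 * w2
    return score


def score_article(title, content, category_weights, title_multipliers=None):
    """Return a title score and a content score for each category."""
    title_multipliers = title_multipliers or {}
    return {
        category: {
            "title": score_text(title, weights, title_multipliers.get(category)),
            "content": score_text(content, weights),
        }
        for category, weights in category_weights.items()
    }


# Hypothetical (W1, W2) weights and multipliers, for illustration only.
category_weights = {
    "funding": {"raises": (0.8, 0.8), "series": (0.2, 0.5)},
    "layoffs": {"cuts": (0.6, 1.0)},
}
title_multipliers = {"funding": {"raises": 2.0}}

scores = score_article(
    "Acme raises Series B",
    "The company raises funds after cuts.",
    category_weights,
    title_multipliers,
)
for category, parts in scores.items():
    print(category, parts)
```

Each article ends up with two scores per category, so the category with the highest combined score would be the natural label in a scheme like this.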