One of the questions we get frequently when talking to prospective customers is:
Where does CB Insights data come from?
Today, we wanted to offer some additional insight into the technology and processes we use to gather our private company financing and exit data. Much of this post gets into the technical aspects of our data collection process but we begin with a non-technical overview.
The post is structured as follows:
- How we collect our data
- Why this data collection process versus others
- What are our data sources?
- Walking through the technology and process
How we collect our data
While we’ll be walking through our technology in this post, it’s worth noting that not all of our data is gathered algorithmically. Our data is collected in two ways as we detail below.
70% via algorithms
We’ve built machine learning software which parses unstructured and semi-structured data sources and programmatically extracts the key pieces of structured data we care about. Things like company & investor/acquirer name, amount of funding, valuation, date, stage, board of directors, patents, etc.
30% via direct submission from investors
This is data directly submitted to us by investors, whether they be venture capital firms, angel investors, family offices, growth/private equity firms, etc. Interestingly, the proportion of directly submitted data from investors has increased over time: it stood at around 10% a year ago and has climbed markedly since then.
While causality is hard to determine here, we believe the greater amount of direct submission from investors is being driven primarily by two factors:
- They want exposure of their portfolios to our other clients – Most of the Fortune 100 and many investors have become clients. As a result, investors know that corporate development, sourcing and strategy groups as well as potential investment syndicate partners are using us, so there is “enlightened self-interest” in ensuring their data is up-to-date.
- We’ve become a bit more prominent – As general awareness of CB Insights has grown in the market by virtue of our newsletter and media attention, investors have taken greater care to send us data.
In most instances, our algorithmic approach has already found this directly submitted data, but getting it straight from investors is obviously a good confirmatory source.
This directly submitted data comes in through the CB Insights Editor – the easy way to get in front of our analysts to manage your brand with investors, acquirers, customers and the larger startup ecosystem.
Why this data collection process versus others?
As a technology company, scalability is very important to us.
As a data company, speed & comprehensiveness of the data are equally important.
Optimizing for these factors is the reason we’ve taken this approach to data collection. Today, the availability and ubiquity of information on the web, and the progress of machine learning technology in making sense of that information, make this the optimal process.
As an example of scalability, the explosion of micro venture capital funds and accelerators is something that we’ve been uniquely able to capture because our process is as simple as pointing our technology at new websites and extracting the data.
Of course, there are aspects of the investment ecosystem that remain opaque, e.g., angel investment. This is why we augment our data with directly submitted data via partnerships which give us access to hard-to-find proprietary data.
It is worth noting that none of our financing or exit data comes from licensing data from 3rd parties. This data is fundamental to CB Insights, and so depending on other data providers for a core part of our product is (1) bad business, because we’d be building in someone else’s sandbox, and (2) honestly, there are no credible data sources out there for this type of data that we’d be comfortable licensing from.
The Algorithmic Approach – What are the data sources & approach?
This write-up will focus on the 70% of our data gathered algorithmically.
The core of this data is extracted from crawlers that analyze 150,000+ sources on a daily basis. In the last 8 months, we have crawled and analyzed nearly 16 million unique articles and information sources. The list of sites we index grows regularly but includes the following mostly unstructured data sources:
- Regulatory filings – Form Ds, for example
- Investor websites (press & portfolio pages)
- Company websites (press pages & blogs)
- Acquirer websites (press pages or investor relations sections)
- Press releases
- Social media (Twitter primarily)
- A select group of local, national and international news and trade publications
Sifting through all this data to extract meaningful information and bring you the most comprehensive funding data in the industry is quite a challenge. In order to expedite this process and cut back the number of hours needed to get this information into the CB Insights platform, we have developed The Cruncher.
The Cruncher is the linchpin of how we source and input private company financings and acquisitions faster and more comprehensively than anyone else in the market. It works by identifying relevant sources, classifying them into different categories, parsing them, running them through a named entity recognizer and then extracting relevant funding information.
All the above-mentioned substeps contain challenging problems in the fields of Natural Language Processing and Machine Learning (we are hiring, btw). Thus, while attaining the highest accuracy (as measured by capturing and correctly identifying all the entities involved) is our aim, these processes by themselves do not guarantee that all information is correctly identified. The Cruncher, therefore, provides a unified platform and ultimately an interface for our analysts to very quickly curate the data to ensure it’s correct before customers see it. You can think of our algorithmic approach as 90% machine and 10% human. We recognize machines are not infallible, and so building a solution that allows 1 analyst of ours to do what 25 data entry folks can do was a fundamental goal.
We will now dive into the technology and process that underlies The Cruncher. Each unstructured information source we obtain essentially goes through a sequence of steps which transform it and extract structured information from it.
In this post, we’ll focus mostly on our technology to parse articles, as regulatory filings and investor portfolio pages have unique nuances of their own which we’ll cover in subsequent briefs.
In general, our process looks like this:
Life Event Classifier
Our Life Event Classifier deals with categorizing articles into a set of predefined classes such as Human Resources, Financings, M&A, Partnerships/Customer etc. If you think of an organization as an organism of sorts, these are all life events which may be suggestive of company health (or lack thereof).
This is a critical step in The Cruncher, as different kinds of entities and relationships between those entities need to be extracted depending on the particular Life Event. In other words, the entities and patterns of a partnership article are very different from those of an HR article, which in turn differ from those of a financing article. Further, the relation extraction process itself is helped when we are restricted to certain types of articles: knowing that the articles belong to a fixed set of Life Events helps eliminate false positives.
The filtered list of articles goes through a multi-label news classifier which classifies articles into the appropriate Life Events and labels them. This, of course, is rarely straightforward. The same news article could be talking about a company raising financing and hiring a new CFO, so it could be assigned multiple labels. We take the traditional approach to this multi-label classification problem and reduce it to a set of binary and multi-class classification problems. In the case of HR classification, a binary classifier works well. In the case of fundings, we approach it as classifying into 3 classes – financing, M&A and other. We have previously written about how we deal with Human Resources news classification.
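To make the reduction concrete, here is a toy sketch of the idea in Python: one binary check for the HR label plus one three-class check for financing vs. M&A vs. other, with their outputs combined into a multi-label result. The keyword rules below are invented stand-ins for the trained statistical models described above.

```python
# Keyword sets standing in for trained classifiers (illustration only).
HR_KEYWORDS = {"hires", "appoints", "joins", "cfo", "ceo", "resigns"}
FINANCING_KEYWORDS = {"raises", "raised", "funding", "series", "round"}
MA_KEYWORDS = {"acquires", "acquired", "merger", "buys"}

def is_hr_event(article: str) -> bool:
    # Binary classifier: does the article describe an HR life event?
    words = set(article.lower().split())
    return bool(words & HR_KEYWORDS)

def funding_class(article: str) -> str:
    # Three-class classifier: financing, M&A, or other.
    words = set(article.lower().split())
    if words & FINANCING_KEYWORDS:
        return "financing"
    if words & MA_KEYWORDS:
        return "m&a"
    return "other"

def life_event_labels(article: str) -> list[str]:
    # Combine the independent classifiers into a multi-label result.
    labels = []
    if is_hr_event(article):
        labels.append("hr")
    event = funding_class(article)
    if event != "other":
        labels.append(event)
    return labels

# A single article can legitimately carry multiple labels:
print(life_event_labels("Acme raises $25M Series B and appoints a new CFO"))
# → ['hr', 'financing']
```

The point of the structure is that each sub-problem can be modeled and improved independently, which is what makes the multi-label reduction tractable.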
The Cruncher is designed so that analysts are able to provide feedback to the classifier, not only to correct misclassified labels but also to help identify important features. The goal, ultimately, is to ensure the models keep improving as they are used.
Depending on the Life Event that is determined, the news article will be fed into the appropriate next part of the process. For financing and exit life events, the articles are fed to the Entity Recognition Engine.
The Entity Recognition Engine
Named Entity Recognition (NER) is the task of analyzing various text elements and categorizing them as organizations, people, numbers, etc. Our Entity Recognition Engine handles this and identifies entities like organization names, people, dates, amounts, rounds and places in news articles. In order for the Entity Recognition Engine to work effectively, articles must go through a series of processing steps.
The articles first undergo tokenization. Tokenization is the task of separating a sentence into separate tokens (words, punctuation marks, numbers). The tokenized text then feeds into a Part-of-Speech (POS) tagger. A POS tagger, as the name suggests, assigns parts of speech (noun, adjective, verb etc) to each word.
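These two steps can be illustrated with a small sketch. The regex tokenizer and the rule-based tagger below are deliberately simplified for demonstration; production tokenizers handle many more edge cases, and real POS taggers are statistical models trained on annotated corpora.

```python
import re

def tokenize(sentence: str) -> list[str]:
    # Split into number/amount, word, and punctuation tokens (simplified).
    return re.findall(r"\$?\d+(?:[,.]\d+)*|\w+|[^\w\s]", sentence)

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Toy rule-based tagger for illustration only.
    tags = []
    for tok in tokens:
        if tok[0] == "$" or tok[0].isdigit():
            tags.append((tok, "CD"))       # cardinal number / amount
        elif not tok[0].isalnum():
            tags.append((tok, "PUNCT"))    # punctuation
        elif tok.istitle():
            tags.append((tok, "NNP"))      # proper noun (very rough)
        elif tok.endswith("ed"):
            tags.append((tok, "VBD"))      # past-tense verb (very rough)
        else:
            tags.append((tok, "NN"))       # default: noun
    return tags

tokens = tokenize("Acme Corp raised $10 million on May 5, 2014.")
print(tokens)
# → ['Acme', 'Corp', 'raised', '$10', 'million', 'on', 'May', '5', ',', '2014', '.']
print(pos_tag(tokens))
```

Downstream components (the parser and the NER engine) consume exactly this kind of token/tag sequence.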
The POS-tagged text is fed into a parser which operates at the sentence level. Parsing a sentence deals with computationally analyzing and identifying the constituents of a sentence and how they are related to each other. The constituents here include elements like a noun phrase or a verb phrase, which themselves could be part of a bigger constituent. These relations are represented as a parse tree.
The Entity Recognition Engine then uses features from all previous steps and tries to identify a number of different types of entities.
Parsing and named entity recognition are computationally the most expensive parts of our pipeline and are therefore only run on articles that we are confident we can derive value from. We also supplement the Entity Recognition Engine with a simple coreference resolution system specifically for organization names.
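For intuition, a drastically simplified, pattern-based entity spotter is sketched below. The production engine described above is a statistical NER system built on tokenization, POS and parse features; these hand-written regexes merely show what kinds of spans it pulls out of a funding article.

```python
import re

# Illustrative patterns only; not the actual Entity Recognition Engine.
PATTERNS = {
    "AMOUNT": re.compile(r"\$\d+(?:\.\d+)?\s*(?:million|billion|M|B)\b"),
    "ROUND":  re.compile(r"\bSeries\s+[A-F]\b|\bseed\b", re.IGNORECASE),
    "DATE":   re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
        r"\s+\d{1,2},\s+\d{4}\b"
    ),
}

def spot_entities(text: str) -> list[tuple[str, str]]:
    # Return (label, surface form) pairs for every pattern match.
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

article = "On May 5, 2014 Acme Corp announced a $10 million Series B round."
print(spot_entities(article))
# → [('AMOUNT', '$10 million'), ('ROUND', 'Series B'), ('DATE', 'May 5, 2014')]
```

Organization and people names are the hard part and are exactly why a trained NER model (rather than regexes) is needed in practice.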
Since we obtain articles from news sources in different countries, there are often cleanup and normalization issues to deal with. This includes normalizing dates to a standard format and also converting currency amounts to US dollars. These normalized entities can then be used to extract relations.
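A minimal sketch of this normalization step, assuming a fixed table of exchange rates (a real system would pull live rates from a rate service; the rates and date formats below are illustrative):

```python
from datetime import datetime

# Assumed sample rates for illustration only.
USD_PER_UNIT = {"EUR": 1.30, "GBP": 1.60, "USD": 1.00}

# A few source-date formats we might encounter (not exhaustive).
DATE_FORMATS = ["%d %B %Y", "%B %d, %Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    # Try each known format and emit ISO 8601 (YYYY-MM-DD).
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def to_usd(amount: float, currency: str) -> float:
    # Convert a currency amount to US dollars.
    return round(amount * USD_PER_UNIT[currency], 2)

print(normalize_date("5 May 2014"))   # → 2014-05-05
print(to_usd(10_000_000, "EUR"))      # → 13000000.0
```

Once dates and amounts are in a canonical form, entities from different sources become directly comparable, which the relation extraction and clustering steps depend on.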
The Relationship Extractor
At this stage in the process, we can finally extract the relevant bits of financing and M&A information, and this happens in our Relationship Extractor. The Extractor focuses on disambiguation between entities: looking at the organization names mentioned, it determines which are being funded and which are the investors. We take into account features like the position of the entities, their frequency in the article, proximity to the title, our own prior historical data on the entity, etc. Based on this, we get a set of entities and the relations between them. At this point, the system has funding information and associated confidence measures, which are then stored.
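The disambiguation idea can be sketched with a toy scoring function. The features (title match, position of first mention, mention frequency) come from the description above, but the weights are invented for illustration, and the production extractor also incorporates prior historical data:

```python
def score_as_funded(entity: str, title: str, body: str) -> float:
    # Score how likely this organization is the funded company (toy weights).
    score = 0.0
    if entity in title:
        score += 2.0                                  # funded companies tend to headline
    first = body.find(entity)
    if first != -1:
        score += 1.0 - first / max(len(body), 1)      # earlier mention scores higher
    score += 0.5 * body.count(entity)                 # frequency of mention
    return score

title = "Acme Corp raises $10M from Example Ventures"
body = ("Acme Corp announced today that it raised $10 million. "
        "The round was led by Example Ventures. "
        "Acme Corp will use the funds to expand.")

orgs = ["Acme Corp", "Example Ventures"]
ranked = sorted(orgs, key=lambda o: score_as_funded(o, title, body), reverse=True)
print(ranked[0])   # → Acme Corp
```

The highest-scoring organization is treated as the funded company and the remainder as candidate investors; each relation carries its score forward as a confidence measure.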
The Clustering Engine
Our next module, The Clustering Engine, is where we cluster, or group, similar articles. This works as a signal for how reliable the information is. In essence, the greater the number of such articles and the more reputable the sources, the higher the likelihood that the funding or exit information is accurate. Clustering is also important because showing the same news event repeatedly to an analyst in The Cruncher admin interface is a waste of time. Scalability is always top of mind.
One approach to clustering could be at the news article level, based on the text. For our use case, however, we have found that this approach does not always work. We instead cluster similar fundings based on the entities extracted, as it is more accurate. The features we consider include the date of the article, the entities involved, amounts, etc. If enough of these match or are similar across two or more articles, we group them into one cluster.
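The entity-based grouping can be sketched as follows. The specific fields compared and the "at least two matching fields" threshold are illustrative assumptions, not the production rules:

```python
def same_funding(a: dict, b: dict) -> bool:
    # Two extracted fundings match if enough of their fields agree.
    matches = sum(a.get(k) is not None and a.get(k) == b.get(k)
                  for k in ("company", "amount", "round"))
    return matches >= 2   # illustrative threshold

def cluster(articles: list[dict]) -> list[list[dict]]:
    # Greedy single-pass grouping: join the first matching cluster, else start one.
    clusters = []
    for art in articles:
        for group in clusters:
            if same_funding(group[0], art):
                group.append(art)
                break
        else:
            clusters.append([art])
    return clusters

articles = [
    {"company": "Acme", "amount": 10_000_000, "round": "Series B"},
    {"company": "Acme", "amount": 10_000_000, "round": None},      # sparser source
    {"company": "BigCo", "amount": 50_000_000, "round": "Series C"},
]
print(len(cluster(articles)))   # → 2
```

Note that the second article lacks a round name but still joins the first cluster, which is exactly why entity-level matching beats raw text similarity for sparse or partial sources.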
Another little problem we need to deal with is identifying whether we have already seen a funding before, and only presenting it in The Cruncher’s admin interface if we haven’t. This can happen, say, if a news article talks about a new product released by a company but also references the funding the company had raised previously. Based on the recency of the funding raised and the investors involved, we determine if it is a duplicate. If not, all this information is then presented in The Cruncher’s admin interface, which analysts use to quickly curate the data. (screenshot below)
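A minimal sketch of that duplicate check, using the recency and investor signals mentioned above (the 90-day window is an assumed parameter for demonstration):

```python
from datetime import date

def is_duplicate(new: dict, stored_rounds: list[dict], window_days: int = 90) -> bool:
    # A new funding is a duplicate if a stored round for the same company
    # has a nearby date and the same set of investors.
    for old in stored_rounds:
        if (old["company"] == new["company"]
                and abs((old["date"] - new["date"]).days) <= window_days
                and set(old["investors"]) == set(new["investors"])):
            return True
    return False

stored = [{"company": "Acme", "date": date(2014, 5, 5),
           "investors": ["Example Ventures"]}]
new = {"company": "Acme", "date": date(2014, 5, 20),
       "investors": ["Example Ventures"]}
print(is_duplicate(new, stored))   # → True
```

Only fundings that fail this check reach the admin queue, which keeps analyst time focused on genuinely new transactions.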
The Cruncher Admin
The Cruncher Admin is designed in a way that makes it easy for analysts to quickly identify and approve the entities extracted. If an entity is incorrectly classified or missed, it can be easily rectified. Apart from listing snippets, dates and sources, it also specifies whether the funding information is for a new entity or one we already have in our database. Based on this, analysts can choose which articles are the most important to look at first and save the rest for later.
The interface also provides a mechanism to leave feedback. This feedback is important because it is relayed back to our models and serves to improve them over time. While some of it is automatically incorporated into the models, the rest can be used to change aspects of a model itself if it is not behaving well on certain types of articles or on articles from particular sources. This also helps identify potentially problematic news sources.
Once our analysts choose to process a funding, they are directed to the transaction page. Entities identified from a typical funding article and how they are used in The Cruncher are shown below.
This page contains basic information about a company along with a table of funding information, where each row contains the details of a round of funding the company has received (or exit data). If an article talks about a company receiving a new financing, the relevant entities are extracted and listed for our analyst to approve or edit. In the screenshot above, for example, the amount, investors and board members listed in the article have been extracted. It also lists all the articles in which it found similar information. Once the information is approved by the analyst, it is saved in the database.
Making data available on CB Insights
The machine learning process that underlies The Cruncher and which we detailed above helps ensure we get financing and exit transactions to our customers faster than anyone else (typically within hours of the data source being found).
At this point, the data has passed through all programmatic tests for quality as well as been “blessed” by someone on our data team, and so is ready for prime time – meaning it can be pushed to production and made available to customers.
It will be visible in the feeds on your MyCBI page (as shown in the screenshot below) as well as queryable via CB Insights search. In addition, it will be available to API customers who pull our data programmatically.
If you have questions or comments on the technology, processes or our practices as they relate to The Cruncher, feel free to ask in the comments. If solving these types of problems is of interest to you, we’re hiring machine learning engineers and a whole host of folks in marketing, engineering, biz dev and research.