On the CB Insights platform, you can view all the VC- and PE-backed companies that received PPP loans on our latest Paycheck Protection Program profile.
Businesses are feeling the economic impact of Covid-19.
As part of larger relief initiatives under the CARES Act, the U.S. Small Business Administration (SBA) created the Paycheck Protection Program (PPP) to provide loans to small businesses.
In July 2020, the SBA released data on 661,218 businesses that received at least $150,000 as part of this government initiative.
Less than 36 hours after its release, our technology and data teams uploaded PPP loan data for over 9,600 VC- and PE-backed companies onto the CB Insights platform. Our goal was to provide our clients with a complete view of a company’s financing history, especially as this type of funding could be critical for investor-backed companies, many of which see this loan as an important financing milestone.
Below, we dive into the process behind making this data available on our Paycheck Protection Program profile, and how our team worked to ingest the data, match companies, and structure the resulting information.
Step 1: Entity matching
The SBA data didn’t include any standard company identifier (CUSIP, DUNS number, etc.) that we could use to link records directly to our company database. Instead, our company matching API, which uses an algorithm for matching external data with data on the CBI platform, used the name and address information in the SBA data to automatically link records to companies on our platform.
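To illustrate the general idea of matching on name and address, here is a minimal sketch. It is not the CB Insights matching API: the suffix list, the 70/30 weighting, the 0.85 threshold, and the record fields (`name`, `address`, `id`) are all assumptions chosen for the example, and it uses simple string similarity from Python's standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "ltd"}

def normalize(text, drop_suffixes=False):
    """Lowercase, replace punctuation with spaces, optionally drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    if drop_suffixes:
        tokens = [t for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def similarity(a, b, drop_suffixes=False):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(
        None, normalize(a, drop_suffixes), normalize(b, drop_suffixes)
    ).ratio()

def match_company(sba_record, candidates, threshold=0.85):
    """Return the best-scoring candidate if it clears the threshold, else None.

    The 0.7 / 0.3 name/address weights and the threshold are illustrative only.
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        score = (
            0.7 * similarity(sba_record["name"], candidate["name"], drop_suffixes=True)
            + 0.3 * similarity(sba_record["address"], candidate["address"])
        )
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

A stricter threshold trades missed matches for fewer false links, which matters when the result is published on a company profile.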
The 9,659 companies can now be seen on our Paycheck Protection Program profile, and as new PPP loan data becomes available, the number of investor-backed companies linked to this PPP profile will grow.
Step 2: Adding PPP loan rounds to company profiles
Next, we needed to bulk insert 9,659 loans onto these company profiles. We used a script that took the information for the CBI-linked companies from the SBA data file and uploaded it directly to our platform. The SBA data file included information like loan amount, date, and investor name for the company, all of which can now be seen on each individual company profile as well as the new Paycheck Protection Program profile.
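A bulk-insert script of this kind might look roughly like the sketch below. This is an illustration, not the actual script: the column names (`business_name`, `loan_amount`, `date_approved`), the `matches` lookup, the record layout, and the batch size are hypothetical, and the upload call itself is left out.

```python
import csv
import io

def build_loan_rounds(sba_csv, matches):
    """Turn matched SBA rows into loan-round records.

    `matches` maps an SBA business name to a platform company ID
    (assumed to come from the entity-matching step).
    """
    rounds = []
    for row in csv.DictReader(sba_csv):
        company_id = matches.get(row["business_name"])
        if company_id is None:
            continue  # no linked company, so skip this row
        rounds.append({
            "company_id": company_id,
            "round_type": "PPP Loan",
            "amount": row["loan_amount"],
            "date": row["date_approved"],
            # Assumption: every loan is attached to the single PPP investor profile.
            "investor": "Paycheck Protection Program",
        })
    return rounds

def batches(items, size=500):
    """Yield fixed-size slices so the upload can be sent in chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage with a tiny in-memory file standing in for the SBA data.
sample = io.StringIO(
    "business_name,loan_amount,date_approved\n"
    "ACME ROBOTICS INC,350000,2020-04-15\n"
    "UNMATCHED CO,200000,2020-04-20\n"
)
loan_rounds = build_loan_rounds(sample, {"ACME ROBOTICS INC": 101})
```

Uploading in batches keeps any single request small and makes it easier to retry a failed chunk without redoing the whole insert.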
Step 3: Making the data searchable in Elasticsearch
At our core, CB Insights is a vertical search engine, so it was important to make this data queryable. We wanted to be able to answer questions like:
- How many companies were in particular sectors, e.g. pharma vs. tech?
- Were PPP loan recipients concentrated in particular geographies?
- How much had PPP loan recipients previously raised in investor (VC, private equity, etc.) funding?
- Had any PPP loan recipients raised money shortly before or after their PPP loan was issued?
- Were there trends in which investors had more or fewer portfolio companies availing themselves of PPP loans?
This required making the data available very quickly so that our clients could run searches and filter the data.
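Questions like these map naturally onto filtered, aggregated Elasticsearch queries. The sketch below builds one such request body using the standard query DSL (`bool`/`filter`, `term`, `range`, and a `terms` aggregation). The field names (`investors.name`, `state`, `total_funding_usd`, `sector`) are hypothetical and do not reflect the actual CB Insights index mapping.

```python
import json

def ppp_recipients_by_sector(state=None, min_prior_funding=None):
    """Build a query body that counts PPP recipients per sector.

    All field names are illustrative assumptions about the index.
    """
    filters = [{"term": {"investors.name": "Paycheck Protection Program"}}]
    if state:
        # Restrict to one geography.
        filters.append({"term": {"state": state}})
    if min_prior_funding is not None:
        # Restrict to companies with at least this much prior investor funding.
        filters.append({"range": {"total_funding_usd": {"gte": min_prior_funding}}})
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"bool": {"filter": filters}},
        "aggs": {"by_sector": {"terms": {"field": "sector", "size": 20}}},
    }

print(json.dumps(ppp_recipients_by_sector(state="CA", min_prior_funding=1_000_000), indent=2))
```

Using `filter` clauses rather than `must` skips relevance scoring, which is usually what you want for yes/no criteria like these.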
Elasticsearch is the search engine that powers search on the CBI platform. To make the data searchable, we had to ensure that the new Paycheck Protection Program investor profile and all 9,659 of its loans were loaded and indexed in our Elasticsearch cluster. This is normally done via our standard company-investor indexer job, which kicks off when it detects a new company or investor in the system.
Our indexer runs continuously, so in order to avoid the large investor profile for the Paycheck Protection Program blocking other documents from being processed, we opted to run this on a separate dedicated Elasticsearch instance. Since the large size of the document slowed down the Elasticsearch query, we also opted to slim down the metadata for the Elasticsearch document for this investor specifically (more details on that below).
Once these modifications were made and the document was indexed in Elasticsearch, all the 9,659 loans from the Paycheck Protection Program that matched companies in the CBI platform were available in our advanced search feature.
Challenge: A massive investor profile
Generally, investor profiles have investments numbering in the hundreds. In fact, our largest investor profile prior to the Paycheck Protection Program was the U.S. Department of Defense, which has 3,116 investments, so the PPP profile, with 9,600+ loans, was more than 3x the size of any other investor profile.
This caused a problem with Elasticsearch because our current document structure is optimized for a lower number of investments. The result was long indexing times for this document, which blocked other documents in the queue. In addition, search queries that included investors as parameters took longer due to the size of this one document.
To resolve this issue, the team optimized the document’s metadata in two ways: tuning it to work well with searches that use the Paycheck Protection Program parameters, and removing metadata required only by searches that are not relevant to this use case.
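One way to picture this kind of slimming is a function that, for unusually large investor documents, keeps only the per-investment fields the relevant searches need. The sketch below is an assumption about the general approach, not the team's actual change: the threshold, the retained field names, and the document layout (`investments` as a list of dicts) are hypothetical.

```python
# Hypothetical cutoff above which an investor document is considered "large".
LARGE_PROFILE_THRESHOLD = 5000

# Hypothetical fields that PPP-related searches would still need.
SLIM_FIELDS = ("company_id", "amount", "date", "round_type")

def slim_investor_document(doc, threshold=LARGE_PROFILE_THRESHOLD, keep=SLIM_FIELDS):
    """Return a copy of `doc` with trimmed investment metadata if it is large.

    Documents at or below the threshold are returned unchanged.
    """
    investments = doc.get("investments", [])
    if len(investments) <= threshold:
        return doc
    slim = dict(doc)  # shallow copy so the original document is not mutated
    slim["investments"] = [
        {field: inv[field] for field in keep if field in inv}
        for inv in investments
    ]
    return slim
```

Applying the trim only above a size threshold leaves ordinary investor profiles untouched, so searches that rely on the full metadata continue to work for them.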
We were able to resolve this issue within the first 36 hours of PPP loan data being released and ultimately make the data available and queryable on the CBI platform. These types of loan programs are not common occurrences, but given the size of this economic stimulus and the number of companies that were recipients, it was important to make this data available to our clients right away.
Most importantly, our ingestion of PPP loan data allowed us to take a large and opaque dump of data about a government loan program and shine a light on it — making it more intelligible and transparent.
If you’re an engineer fascinated by data and its power, we are hiring in multiple parts of our engineering organization.
If you’re interested in the best data on private companies and their investors (VCs, private equity, angel investors, etc.), sign up for a trial of CB Insights here.