How Google's BERT helped me make more progress in 1 day than I'd previously made in 30.
What challenge was this project tackling?
I had previously spent 30 days trying to improve the performance of Company News Sentiment Prediction, but the result was less than satisfactory.
The challenge of this particular classification problem is that of “complex learning task, small training set, large feature space.” The algorithm needed to learn 30+ topics with just 2000+ labeled data in a 100,000+ dimension feature space.
What was your hack day solution?
During hack day, I played with a deep language model, BERT, leveraging transfer learning to boost performance in our news sentiment learning task.
How did you come up with the idea?
I had come across BERT earlier this year. It’s a deep language model open-sourced by researchers at Google, and it achieved state-of-the-art results on 11 NLP benchmarks with minimal task-specific fine-tuning. I felt inspired and wanted to try it out.
What was your starting point and the outcome?
Company News Sentiment is the first machine learning project I undertook at CB Insights, and one of many that had to wrestle with limited training data on its path to glory.
The goal of this project was simple: determine whether a piece of news about a company was positive, negative, or neutral. (The sentiment assigned should only be with regard to the tagged company, not other entities or overall.)
An example of news sentiment used in company analysis (source)
As a proxy for public perception, news sentiment has long been part of our efforts to track the health status and growth potential of various companies. When I joined CB Insights in August 2018, I was given 30 days to either significantly improve the existing model or exhaustively explore all possible approaches.
I took a look at the baseline precision of 0.62 and recall of 0.69, and felt pretty confident. As it turns out, I was pretty wrong.
Starting with the baseline model, I marched forward with a series of “observation-hypothesis-experiment” cycles.
I tried numerous features:
- count vectors/tf-idf from title and/or content
- aggregated word embeddings
- doc2vec embeddings
- dictionary-based linguistic features
- lda latent topics
- regular expression-based topic extraction
- GBM engineered features
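To make one of the feature families above concrete, here is a minimal sketch of tf-idf vectors built from headline text. The corpus, tokenizer, and smoothing scheme are my own illustrative choices, not the project's actual pipeline.

```python
# Minimal tf-idf sketch over toy headlines (hypothetical data).
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} map per document, with smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (tf[t] / len(tokens)) * idf[t] for t in tf})
    return vectors

docs = [
    "acme announces new partnership",
    "acme faces lawsuit over data breach",
]
vecs = tfidf_vectors(docs)
```

Note how "acme", appearing in both headlines, gets a lower weight than the more discriminative "partnership" — exactly the behavior that makes tf-idf a sensible baseline feature.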
I assembled an array of algorithms:
- single model with weighted stacked features (logistic, naive bayes, gradient boosting classifier, etc.)
- various ensembling schemes
- 2-stage transfer learning
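One common ensembling scheme from the family above is weighted soft voting over per-model class probabilities. This is a generic sketch with made-up model outputs, not the exact ensemble used in the project.

```python
# Weighted soft voting: average class-probability vectors with per-model weights.
def weighted_vote(prob_lists, weights):
    """Return (predicted class index, averaged probability vector)."""
    total_w = sum(weights)
    n_classes = len(prob_lists[0])
    avg = [
        sum(w * probs[c] for probs, w in zip(prob_lists, weights)) / total_w
        for c in range(n_classes)
    ]
    return avg.index(max(avg)), avg

# Three hypothetical models scoring (negative, neutral, positive):
model_probs = [
    [0.1, 0.3, 0.6],   # e.g. logistic regression
    [0.2, 0.5, 0.3],   # e.g. naive bayes
    [0.1, 0.2, 0.7],   # e.g. gradient boosting classifier
]
label, avg = weighted_vote(model_probs, weights=[2.0, 1.0, 2.0])
```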
But at the end of the day, precision remained stagnant at 0.74, while recall slipped to 0.65, far below what I had hoped for.
Our R&D team at CB Insights is named Delphi (Illustration: Tendor)
The challenge that persisted throughout the course of experimentation is that of “complex learning task, small training set, large feature space.”
All we had as a starting point was text — but that’s ~100,000 dimensions. With a total of 2058 labeled data points (more than 50% of them belonging to the neutral class, which we care less about), even the simplest algorithm overfits, and applying regularization immediately increases the bias.
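The overfitting claim is easy to see with back-of-envelope arithmetic: even plain multinomial logistic regression over a bag-of-words space this size has vastly more parameters than we had labels. The numbers below are the approximate figures from the text.

```python
# Parameter count vs. label count for a plain multinomial logistic regression.
n_features = 100_000          # approximate bag-of-words dimensionality
n_classes = 3                 # positive / neutral / negative
n_labeled = 2058              # total labeled examples

n_params = n_classes * (n_features + 1)   # one weight vector + bias per class
ratio = n_params / n_labeled              # parameters per labeled example
```

With roughly 145 parameters per labeled example, even a "simple" linear model has more than enough capacity to memorize the training set.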
On the other hand, our seemingly simple learning objective also masks the messy real tasks underneath. Because company sentiment is based on a list of 30+ individual topics (positive topics include “partnership,” “growth,” and “milestone”; negative topics include “layoff” and “scandal”; neutral topics include “company PR” and “merger/acquisition,” etc.), the algorithm in fact has many nuances to learn. For example, a lawsuit can be a plus or a minus, depending on whether the company wins or loses, and a passed regulation can either benefit or harm a company.
We needed ways to reduce feature dimensionality through smarter information extraction from text, rather than context-insensitive condensation. Beyond that, finding ways to gather more data at lower cost would almost certainly help.
The best attempt during those 30 days used a Gradient Boosting Machine (GBM) for feature engineering; we also found a way to augment the training set through noisier topic search, employing a 2-stage design so the added bias could be corrected.
Winner model pipeline from my 30-day experiments
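To give a feel for the 2-stage idea, here is a heavily simplified sketch: stage 1 extracts noisy topic signals (a keyword scorer stands in for the real GBM stage), and stage 2 learns per-topic sentiment weights from clean labels only, correcting the bias introduced by the noisy augmented data. All names and the toy data are hypothetical.

```python
# Toy 2-stage design: noisy topic scoring, then bias correction on clean labels.

def stage1_topic_scores(text, topic_keywords):
    """Score each topic by keyword hits (a stand-in for the GBM stage)."""
    tokens = text.lower().split()
    return {t: sum(tokens.count(k) for k in kws)
            for t, kws in topic_keywords.items()}

def fit_stage2(clean_examples, topic_keywords):
    """Learn a per-topic sentiment weight using only the clean labels."""
    label_value = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
    weights = {t: 0.0 for t in topic_keywords}
    counts = {t: 0 for t in topic_keywords}
    for text, label in clean_examples:
        for t, s in stage1_topic_scores(text, topic_keywords).items():
            if s > 0:
                weights[t] += label_value[label]
                counts[t] += 1
    return {t: weights[t] / counts[t] if counts[t] else 0.0 for t in weights}

def predict(text, topic_keywords, stage2_weights):
    scores = stage1_topic_scores(text, topic_keywords)
    total = sum(scores[t] * stage2_weights[t] for t in scores)
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

topic_keywords = {"partnership": ["partnership"], "lawsuit": ["lawsuit"]}
clean = [("big partnership deal", "positive"), ("hit by lawsuit", "negative")]
w2 = fit_stage2(clean, topic_keywords)
pred = predict("new partnership announced", topic_keywords, w2)
```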
Still, overall performance was less than satisfactory, and I was running out of time.
Sensing my frustration, my teammate suggested neural nets. “Sure, you just throw a deep neural net at it and we’re all good, right?” I joked desperately, thinking 2058 data points would quickly lose their way in the twisting roads of a deep network. Deep supervised learning is well known for its big appetite for labeled data, often more than we can accommodate for a specific project — that’s one of the reasons we hadn’t deployed any deep networks into production.
Hello from deep language models
Fast forward 4 months. I was taking my daily dosage of Chinese ML digest when something really caught my attention: “2018 marks the beginning of a new era for NLP,” I read. “With OpenAI’s publication of GPT and Google’s release of BERT, deep transfer learning has finally come to NLP.” I read on with great interest and spent more time that night looking into BERT.
I was excited, neurons firing. If Word2Vec stands as the first successful neural net aided attempt to represent language quantitatively (a kindergarten sweetheart that many NLP practitioners were once in love with), BERT has come a long way.
A good language model needs to capture at least two aspects of language:
- Complex characteristics of word use (e.g. syntax and semantics)
- How these uses vary across linguistic contexts (i.e. to model polysemy)
From Word2Vec to ELMo, language models went from no-context to context-sensitive, with increased architectural complexity.
To take a closer look: Word2Vec, as described in its 2013 paper, produces a static embedding for every token in its vocabulary and employs a single-projection-layer neural net with a training complexity of ~18K per example (calculated for a CBOW model with N (window size) = 10, D (projection layer nodes) = 600, V (vocab size) = 1 million). ELMo, as described in its 2018 paper, represents each token as a function of the entire input sentence, read left to right and right to left with L=2 biLSTM layers — a total of ~93M parameters. Both models produce embeddings that can be used in downstream supervised learning tasks, and ELMo advanced the state of the art on 7 major NLP benchmarks.
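The ~18K figure above can be reproduced directly from the CBOW training-complexity formula in the Word2Vec paper, Q = N × D + D × log₂(V):

```python
# CBOW training complexity per example (Mikolov et al., 2013):
#   Q = N * D + D * log2(V)
import math

N = 10          # window size
D = 600         # projection layer nodes
V = 1_000_000   # vocabulary size

Q = N * D + D * math.log2(V)
# Q comes out just under 18,000 — the "~18K" quoted in the text.
```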
Better performance comes with a computational cost. As researchers strove for both high performance and simplicity, the Transformer was born.
The Transformer is a neural net architecture that discards the recurrent layers found in ELMo and friends and embraces a pure attention mechanism. OpenAI’s GPT is a unidirectional Transformer; BERT is a deeply bidirectional Transformer. Both models can either produce embeddings as input features for downstream tasks, or transfer their learned knowledge directly through fine-tuning, without requiring a task-specific architecture. With 340 million parameters and cleverly designed pre-training tasks, BERT swept 11 NLP benchmark records with minimal task-specific fine-tuning.
Differences in pre-training model architectures (Source: Devlin et al., 2018)
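The attention mechanism at the heart of these models boils down to a short computation: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Here is a minimal NumPy sketch with random toy tensors (shapes and values are illustrative only):

```python
# Scaled dot-product attention (Vaswani et al., 2017), single head, no masking.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # query-key similarity, scaled
    # Numerically stable row-wise softmax:
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each output position is a mixture of all value vectors, with mixing weights decided by query-key similarity — no recurrence needed, which is what makes the architecture both parallelizable and context-sensitive.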
This made me think about my poor Company News Sentiment project. Was there a possibility that the latest advancement in NLP could help me out? I wanted to find out.
Standing on the shoulders of ‘Transformers’
Two weeks later, it was Hack Day, and I didn’t know what to expect. After all, I had limited experience fine tuning a deep learning model. But as Hack Day began, I retreated to my desk den. Armed with my “Mozart for brain power” playlist, I gathered my 2058 labeled data points and the BERT Git repo and got to work.
The first hour was spent setting up the local environment and running examples. I loaded the pre-trained BERT uncased large model into memory. The Git repo handily comes with an IMDB movie review prediction tutorial whose structure is very similar to my company news sentiment problem, which gave me a great jump start. After configuring the test harness and adjusting the output layer dimensions, training began.
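For readers who want to try the same thing: a fine-tuning run follows the BERT repo's `run_classifier.py` pattern, roughly as below. The paths, data directory, and `SENTIMENT` task name are hypothetical — you would register your own `DataProcessor` for your labels; the other flags mirror the repo's classification examples.

```shell
# Hypothetical invocation of the BERT repo's run_classifier.py.
# BERT_LARGE_DIR points at the downloaded uncased large checkpoint;
# SENTIMENT would be a custom DataProcessor registered in run_classifier.py.
export BERT_LARGE_DIR=/path/to/uncased_L-24_H-1024_A-16

python run_classifier.py \
  --task_name=SENTIMENT \
  --do_train=true \
  --do_eval=true \
  --data_dir=./data/news_sentiment \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./output/sentiment
```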
Forty minutes quickly passed. The loss kept going down, but I was getting impatient. I had only a little more than 2000 data points running for 3 epochs, and my old model, using extra augmented data with 2-stage GBM and logistic regression, finished in under 8 minutes. I needed to switch gears.
I logged on to Google Colab and configured a Tesla K80 GPU accelerator. The training started again. While waiting, a thousand thoughts ran through my mind:
- Is 2058 data points too little for BERT?
- There are definitely things I can do if overfitting happens: increase drop-out, apply early stopping, freeze more parameters, reduce learning rate, throw in the augmented data…
- BERT was pre-trained on the 800M words BooksCorpus + the 2500M words English Wikipedia, while our training data is news. Are the training corpora similar enough?
- BERT’s 2 pre-training tasks are unsupervised and have nothing to do with company news sentiment; will the previously learned knowledge be applicable?
- We know in ELMo, the higher-level LSTM states capture context-dependent aspects of word meanings, while lower-level states model aspects of syntax. Does BERT share similar attributes? And if so, what’s the implication in my application?
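One of the overfitting mitigations in that list — early stopping — can be sketched concretely: halt training once validation loss stops improving for a set number of epochs. The loss curve below is hypothetical.

```python
# Early stopping on a validation-loss curve.
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch index at which to stop, or None to keep training."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch      # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch                         # no improvement for `patience` epochs
    return None

# Hypothetical curve: improves, then starts overfitting.
losses = [0.90, 0.72, 0.65, 0.66, 0.70, 0.78]
stop = early_stop_epoch(losses, patience=2)
```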
Just when my head was buzzing, training finished. This time, it had taken only 7 minutes. “GPU is great,” I said to myself. And when I checked performance, my heart skipped a beat. Without any custom feature engineering or hyperparameter tuning, the model scored a precision of 0.83 and a recall of 0.75 on the held-out test set.
I was suspicious. I double-checked the test harness, the code, and the intermediate output. And to rule out the possibility that I had hit a statistical fluke with a biased holdout test set, I ran repeated holdout testing and checked performance metrics across runs. The average precision was 0.813 and the average recall 0.766, with standard deviations of 0.026 and 0.023, respectively. Again, this was without any hyperparameter tuning.
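Repeated holdout testing itself is simple to implement: re-split the data several times, score each split, and report the mean and standard deviation so one lucky split can't inflate the metric. The scorer below is a trivial stand-in; a real one would train on `train` and evaluate on `test`.

```python
# Repeated holdout: mean and std of a metric across several random splits.
import random
import statistics

def repeated_holdout(examples, score_fn, n_runs=5, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(score_fn(train, test))
    return statistics.mean(scores), statistics.stdev(scores)

# Stand-in scorer over toy binary labels (hypothetical data).
def dummy_score(train, test):
    return sum(test) / len(test)   # e.g. fraction of positives in the split

data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20
mean, std = repeated_holdout(data, dummy_score)
```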
The single experiment I ran on a single Hack Day had turned out to be better than all the other approaches I’d tried over 30 days.
The best results I achieved that day were an average precision of 0.863 and an average recall of 0.784, with the 2058 labeled data points plus the 3447 noisier augmented data points shuffled and fed into BERT. Training finished in a lightning-fast 1 minute using a TPU accelerator. Single-handedly, BERT boosted precision by 12 percentage points and recall by another 13.
Standing on the shoulders of Transformers was both thrilling and terrifying. On the one hand, it felt great to see the performance boost achieved within such a short time. At the same time, it left me wondering what I’d done in all those 30 days, and determined to get to the bottom of why it worked.
The fact that an unsupervised pre-trained model can, with minimal fine-tuning, perform this well on a very specific task carries special implications. I think this has to do with the innate structure present in language and BERT’s ability to capture and preserve a good amount of that knowledge. Though transfer learning first took off in computer vision, it has equally great potential in NLP, if not more, as there’s a virtually unlimited amount of training text out there for models to observe.
Hack Day was certainly not the end. My little experiment with company news sentiment and BERT has inspired me with a long list of problems to research and think about.
- Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013). Link: https://arxiv.org/pdf/1301.3781.pdf
- Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). Link: https://arxiv.org/pdf/1802.05365.pdf
- Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017. Link: https://arxiv.org/pdf/1706.03762.pdf
- Radford, Alec, et al. “Improving language understanding by generative pre-training.” (2018). Link:https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). Link: https://arxiv.org/pdf/1810.04805.pdf
- Google Colab, a powerful Jupyter notebook environment, still free.