A detailed review of CB Insights' HR news classification algorithms using supervised machine learning.
One of the critical inputs in our efforts to assess private company health is the tracking of their human resources activities. This includes programmatic monitoring of:
- hiring activity as evidenced by job postings (you can find these on a company’s profile under Performance Metrics)
- key hires and departures
This post focuses on the second item above: key hires and departures. The first two sections of our discussion are non-technical. After that, it is a technical review of our HR news classification work, broken down into the following sections.
Problem Introduction (Technical)
- Formulation of the classification problem
Executive moves are a signal of company momentum
This news is important for a number of reasons:
- Hiring of certain key positions can be an indication of company plans & strategy. If a company hires a CFO with public market experience, that is an indication of its IPO ambitions and its general business momentum.
- Conversely, the departures of key executives may suggest problems at the firm. When the VP of Sales role turns over regularly at a company, this often indicates sales performance issues, which is a clear signal of stress.
- Hiring of certain positions may suggest coming growth. When a head of HR is brought on, that may suggest the company is about to start a flurry of hiring activity.
These are just a few illustrative examples of how this data can be useful to those who invest, acquire, sell to or buy from private companies.
Of course, this is a hard problem, as the information about these moves is unstructured, often coming in the form of non-standardized press releases and news/media articles. We need to understand the “structure” of these unstructured sources so that we can turn this messy information into data that can ultimately drive the models we are building to rate private companies.
We are using supervised machine learning to help tackle this problem and deliver insights into HR health.
Supervised Learning – The Basics
Let’s take email as an example. If you have many emails labeled as spam or non-spam, you can train a classifier: a function that learns a boundary in the observation space separating the two kinds of documents. Based on the structure it learned from the training set, the classifier can then decide whether emails it has never seen are spam. This problem domain and method fall into a broad category of machine learning called supervised learning. In this category, we train classifiers on examples labeled with their class (spam vs. non-spam), then use them to predict the class of previously unseen samples.
If we want to classify a news article as HR or non-HR, we could approach the problem in a similar way.
Human resources classification is a binary classification problem: we need to discriminate the human resources events we have for companies from all other news. We start with a binary formulation for two reasons.
First, it is simpler, both in terms of gathering labels and of training the classifier.
Second, some articles have mixed membership across categories; for example, an article may contain both HR and financial news. It is easier to label each article as HR or not than to do multi-class labeling, which requires more effort and time.
We previously wrote about how much we like Scikit-Learn for machine learning problems. As most of our NLP stack and some of our data processing depend on Python, it was a natural choice for us. Scikit-Learn lets us use most modern machine learning methods seamlessly thanks to its tight integration with NumPy, SciPy and pandas. The library is actively developed and well maintained, and even though it is still under development, we did not face serious problems in production. It aims to be a full-featured toolbox for the most common machine learning tasks and is not limited to classifiers: it has components for feature selection, feature extraction, cross-validation and various classification metrics, to name a few. We will cover these methods more broadly later in this post.
To solve a given problem efficiently and effectively, it is good practice to formulate it first. This not only makes the problem easier to solve but also gives a clear idea of which steps are necessary and how they relate to each other.
For a classification problem, this can be done in terms of the following non-orthogonal components:
- Preprocessing and Feature Extraction (Document Representation)
- Feature Selection
- Evaluation and comparison of different classifiers
- Choose the best classifier for a given measure (classification accuracy, F_beta score, precision or recall)
In this step, we first preprocess the text. Preprocessing includes lowercasing, lemmatization and removing the most common English words. We also mask some titles and names in the news. Second, we produce a feature vector representation of each document. We chose a bag-of-words representation with some additional features to turn our text into vectors. Since our feature vectors are quite long, we apply dimensionality reduction to them in the next section in order to visualize them.
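As a rough illustration of the preprocessing and bag-of-words steps described above, here is a minimal sketch using only the Python standard library. It is not our production code: the stop-word list, the `TITLE_PATTERN` masking rule and the example sentences are all made up for illustration, and lemmatization and name masking are left out to keep it short. In Scikit-Learn, `CountVectorizer` covers the lowercasing, stop-word removal and counting parts.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "the", "as", "of", "to", "and", "in", "its", "has", "is"}

# Hypothetical masking rule: collapse a few executive titles into one token.
TITLE_PATTERN = re.compile(r"\b(ceo|cfo|cto|coo)\b")


def preprocess(text):
    """Lowercase, mask titles, tokenize on letters, and drop stop words."""
    text = text.lower()
    text = TITLE_PATTERN.sub("TITLE", text)
    tokens = re.findall(r"[A-Za-z]+", text)
    return [token for token in tokens if token not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    tokenized = [preprocess(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Made-up example headlines.
docs = [
    "Acme has named Jane Doe as its new CFO",
    "Acme raises a Series B round",
]
vocabulary, vectors = bag_of_words(docs)
```

Each row of `vectors` has one entry per vocabulary term, which is why the vectors get long quickly on real news text.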
The figure below shows 20 samples from our test set. The first 10 belong to the HR class and the remaining ones to the non-HR class. Document representation is arguably one of the most important steps, as most classifiers perform well on an efficient and compact representation. If the feature space is well separated and we capture the most distinctive features of the documents, even a very simple classifier (for example, a perceptron) can find a boundary that separates the classes effectively.
Generally, the boundary learned by a classifier makes fewer mistakes when the instances of different classes are far apart in the training set. A compact and well-separated feature representation therefore helps the classifier do fairly well.
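To make the point about simple classifiers concrete, here is a small sketch of a perceptron trained on well-separated points. The two-dimensional points are invented for illustration and are not our document vectors; the function names and default parameters are likewise our own choices for this example.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Learn weights and a bias so that sign(w.x + b) matches labels in {-1, +1}."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # every training sample is on the correct side
            break
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy, well-separated points: +1 stands for HR, -1 for non-HR.
samples = [[2.0, 2.5], [2.5, 3.0], [3.0, 2.0], [-2.0, -2.5], [-3.0, -2.0], [-2.5, -3.0]]
labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

On data this cleanly separated, the perceptron needs a single update to find a boundary; with overlapping classes it would keep making mistakes until the epoch limit.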
As you can see from the figure above, we are able to represent the documents well, as the HR articles are grouped together. Note that because the figure relies on a dimensionality reduction technique, it is not necessarily an exact picture of the vectors we actually use. We experimented with various dimensionality reduction methods for the classification itself, but they did not improve the fbeta score (F-measure). Therefore, we only did feature selection on top of the vector representation, which we discuss briefly in the next section.
Feature selection is another important step in classification, and it is even more important when the input is text, because the number of features tends to be fairly high (with either tf-idf or bag of words). Not every word carries information that is useful for the classification task. We prefer the words that help distinguish between classes over the ones that do not contribute to the classification.
Our aim in this section was not to make the classification accuracy as high as possible. For a given number of features, we chose the features that maximize the fbeta score with beta = 0.707. With beta = 0.707, precision is weighted approximately twice as much as recall. This has two main motivations. First, precision is more important than recall for our application: we may not capture all of the HR news, but we want to make sure that the articles we do capture are about HR. Second, classification accuracy on unequal classes is generally biased towards the class with more samples (in our case, the non-HR class). Therefore, the fbeta score (and precision and recall, for that matter) gives a better estimate of how well the classifier is doing than raw classification accuracy.
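The sketch below illustrates why accuracy can mislead on unequal classes while the fbeta score does not. The label counts are invented for illustration (10 HR articles among 100) and are not our results; the helper names are ours. Scikit-Learn offers the same metric as `sklearn.metrics.fbeta_score`.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fbeta(precision, recall, beta=0.707):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denominator if denominator else 0.0


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up imbalanced labels: 10 HR (1) articles followed by 90 non-HR (0).
y_true = [1] * 10 + [0] * 90

# A useless classifier that never predicts HR still looks accurate.
never_hr = [0] * 100

# A classifier that finds 6 of the 10 HR articles with 1 false alarm.
cautious = [1] * 6 + [0] * 4 + [1] * 1 + [0] * 89

never_hr_accuracy = accuracy(y_true, never_hr)              # 90 of 100 correct
never_hr_fbeta = fbeta(*precision_recall(y_true, never_hr))
cautious_accuracy = accuracy(y_true, cautious)              # 95 of 100 correct
cautious_fbeta = fbeta(*precision_recall(y_true, cautious))
```

The two accuracies are close to each other, while the fbeta scores are far apart: the classifier that never predicts HR scores zero.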
There are various ways to do feature selection. One important thing to note is that you should do feature selection with the ultimate aim (the task) in mind. You may want to choose the most informative features (for example, by mutual information score), but that may not help with the classification problem. Since Scikit-Learn is flexible about defining measures (we defined our own), you can do feature selection based on your own score.
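Here is a minimal sketch of score-based feature selection with a pluggable scoring function. It uses mutual information as the default score only because that is the example mentioned above; it is not the custom measure we defined, and the term columns are made up. Swapping in a task-specific score only requires passing a different `score` function.

```python
import math
from collections import Counter


def mutual_information(feature_present, labels):
    """Mutual information (in bits) between a binary feature and the class label."""
    n = len(labels)
    joint = Counter(zip(feature_present, labels))
    feature_counts = Counter(feature_present)
    label_counts = Counter(labels)
    total = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_feature = feature_counts[f] / n
        p_label = label_counts[y] / n
        total += p_joint * math.log2(p_joint / (p_feature * p_label))
    return total


def select_top_k(columns, labels, k, score=mutual_information):
    """Rank feature names by score(column, labels) and keep the k best."""
    ranked = sorted(columns, key=lambda name: score(columns[name], labels), reverse=True)
    return ranked[:k]


# Made-up presence (1) / absence (0) of three terms over six articles.
labels = [1, 1, 1, 0, 0, 0]  # 1 = HR, 0 = non-HR
columns = {
    "appoints": [1, 1, 1, 0, 0, 0],  # lines up perfectly with the label
    "company": [1, 1, 1, 1, 1, 1],   # appears everywhere, so it tells us nothing
    "funding": [0, 1, 0, 1, 1, 0],   # only weakly related to the label
}
selected = select_top_k(columns, labels, k=2)
```

Here "company" is dropped because a term that appears in every article cannot help separate the classes.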
Once we had the features in vector form and had done feature selection, we were ready to train our classifiers. Generally speaking, there is no silver-bullet classifier for every task. Therefore, we tried as many classifiers as possible and experimented to see which one performs best. As we will see from their performance in section 4, their behavior varies with the number of features, because the models and learning functions differ drastically from one classifier to another.
We used stratified 10-fold cross-validation to optimize the number of features chosen in feature selection. We wanted stratified folds because the number of HR articles is small compared to the whole dataset. Stratification ensures that every fold has a similar number of HR articles, which results in a better estimate of classification accuracy.
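The idea behind stratification can be sketched in a few lines of plain Python: split the indices by class, shuffle within each class, and deal them out to the folds in turn. This is an illustrative stand-in, not what we ran; Scikit-Learn's `StratifiedKFold` does this for you. The label counts below are made up.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs that preserve class ratios per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets a near-equal share.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)

    for i in range(n_folds):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Made-up imbalanced labels: 20 HR (1) articles among 200.
labels = [1] * 20 + [0] * 180
splits = list(stratified_k_fold(labels, n_folds=10))
hr_per_fold = [sum(labels[i] for i in test) for _, test in splits]
```

With a plain random split, some folds could end up with very few HR articles, which would make the per-fold scores for the HR class noisy.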
We also need to look at how the number of features affects classification accuracy. A larger number of features may lead to the curse of dimensionality, which makes it hard to learn boundaries that separate the classes optimally. Classifiers may also prefer smaller or larger numbers of features depending on their learning functions. As a rule of thumb, you should prefer a smaller number of features, as this helps avoid overfitting (the model generalizes better) and sidesteps the curse of dimensionality in the first place. You need to consider accuracy, the curse of dimensionality and overfitting together when choosing the optimal number of features.
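One way to encode the "prefer fewer features" rule of thumb is to take the smallest feature count whose cross-validated score is close to the best one. The sketch below assumes you already have one score per candidate feature count; the `tolerance` parameter is our own illustrative choice, and the scores are hypothetical numbers, not results from our experiments.

```python
def choose_n_features(scores_by_k, tolerance=0.01):
    """Return the smallest feature count whose score is within tolerance of the best."""
    best = max(scores_by_k.values())
    candidates = [k for k, score in scores_by_k.items() if score >= best - tolerance]
    return min(candidates)


# Hypothetical cross-validated scores keyed by number of selected features.
scores_by_k = {50: 0.71, 100: 0.78, 150: 0.815, 200: 0.82, 400: 0.80}
chosen = choose_n_features(scores_by_k)
```

In this made-up example the best score is at 200 features, but 150 features score almost as well, so the smaller set is chosen.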
To choose the best number of features, the best features and the best classifier, we need a measure to compare the different cases and scenarios. This measure could be any of a number of things: precision, recall, Receiver Operating Characteristic (ROC) score, Area Under Curve (AUC) score, f1 or fbeta score, and so on. You need to determine which score makes the most sense to optimize for your application. Take a search engine, say Google: if it optimized only for recall and not for precision, the first page might contain many irrelevant results, which makes for a bad user experience. High recall alone does not make much sense here, as the user has only limited time to browse the retrieved results. However, recall needs to be good too: even if the results shown are relevant, if the search engine cannot pick up the most relevant results available on the web, the user still may not find what s/he is looking for. For this type of scenario, one may weight precision over recall instead of optimizing only one measure. There is a measure for this called fbeta, where 1/beta² corresponds to the weight of precision relative to recall. We chose beta to be 0.707, as precision is more important than recall for us. We also looked at the classification accuracies of both classes. However, as our classes are not evenly distributed, the classification accuracies generally differ little between classifiers, whereas the fbeta scores vary widely across different numbers of features and different classifiers.
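For reference, this is the standard definition of the fbeta score, written with P for precision and R for recall. The second form shows it as a weighted harmonic mean, which makes the relative weights explicit.

```latex
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
        = \left( \frac{1}{1 + \beta^2}\cdot\frac{1}{P}
               + \frac{\beta^2}{1 + \beta^2}\cdot\frac{1}{R} \right)^{-1}
```

With beta = 0.707 we get beta² ≈ 0.5, so the weights are about 2/3 on precision and 1/3 on recall: precision counts roughly twice as much as recall, a ratio of 1/beta².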
The first figure shows the fbeta score of the HR class for beta = 0.707; this is the metric by which we choose the best classifier. The second figure shows the classification accuracies for both classes. The plots are useful not just for determining the best classifier for a given metric; they also give some context on how the number of features affects the classification accuracy and fbeta score of each classifier. Broadly speaking, most of the classifiers reach their maximum fbeta score at around 150 features and then either decline or stay level. Therefore, one could pick a range around that value and search it with a smaller step size to find the optimal number of features for a given classifier.
Apart from the classical cross-validation scheme, we also apply our classifier to a test set that it has not seen before. This is crucial for a classification problem, because the dataset we train the classifier on is not the dataset it will be used on, and what we really care about is how well it performs on unseen data. A test measure on unseen data therefore gives a good idea of how the classifier will perform in production.
Some machine learning methods introduce randomness (dropout in neural networks, random forests) precisely to remedy overfitting: we do not want the classifier to learn too much from the training set, and we want it to keep some flexibility so that it performs better on the test set. Cross-validation and test performance are two very good checks that you are not overfitting to the training set.
A student analogy: we want to know whether a student has actually studied all of the chapters she is responsible for in the exam. The purpose of the exam is not to test the questions she has already practiced but to see how well she does on questions she has not seen before. If the student studies only particular chapters, she may do well on the material she studied hard (the training set) but very poorly on the questions she did not see (the test set). (So much for the analogy.) If we skip this step, the performance we obtain on the training set could be deceptively optimistic. We need to evaluate our classifier on the test set as well and see whether it performs as well as it does on the training set.
- Python is awesome. Between text preprocessing, data massaging and the libraries that are available, it is a great language for any kind of scientific computing.
- Use virtual environments to protect your production setup from system-wide updates and upgrades. This also prevents backwards-incompatibility problems when a library gets updated.
- Use a requirements.txt file to record the external libraries so that you can reproduce the virtual environment when deploying to another machine. Containers are popular nowadays; capturing an image of the machine would be an even better option.
- Scikit-Learn did not have a good way to train a classifier on partial data when we implemented our news classifier. Now, starting from 0.14, some of the classifiers support partial_fit. This not only removes the burden of loading all of the input into memory but also provides a good way to retrain the classifier when you gather additional data.
- If most of the computation (preprocessing and preparing data) can be done with NumPy and pandas, you will avoid many of the performance problems Python is known for. Both libraries do their heavy lifting in compiled low-level code (largely C) rather than in pure Python, so the computation is quite efficient.
- IPython notebooks are very good, especially when you are just exploring the dataset or producing your “initial” results. You can move the code into proper modules later for production. Since you can embed images, markdown text and videos, they are also very good for preserving your “state”. You can pick up a notebook months later, read it, and remember where you were and what you were doing. Magic!
- Prefer flatter file layouts. Python is not the best language when you import modules from deeply nested directories. In particular, relative path configuration that differs between the production and development machines can be a problem to sort out.
- Scikit-Learn is awesome. If you are doing a very common task, search Scikit-Learn first; odds are it is already implemented for you. Cross-validation for splitting the dataset, even stratified cross-validation, various metrics and even some of the plots are all part of the library.
- Since most classifiers share the same API in Scikit-Learn, you can easily swap in and experiment with different classifiers. Do that!
- Confidence scores: if you want to be even more restrictive about which results to display in the application, choose a probabilistic classifier, so that it outputs a probability rather than only a class. You can then further subselect the classifier's results. In particular, if you have large datasets and cannot show all of the results, you may want to show only the ones you are “confident” about.
- Think of classification as a complete pipeline rather than just the classifier itself. The way you preprocess text, the type of feature selection you do and all the other steps contribute to the classification accuracy.
- Experiment! Experimentation is quite important: since you cannot inspect all of your input text, you cannot guess the best methods for preprocessing, feature selection and classification in advance. Experiment and compare different approaches and methods to get a better score on your metrics.
- Fail! We failed many times before getting reasonable accuracy, and even more before getting high accuracy.
Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better. Samuel Beckett
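To illustrate the confidence-score tip from the list above, here is a minimal sketch of filtering results by predicted probability. The article ids, probabilities and the 0.9 threshold are hypothetical; in Scikit-Learn, probabilistic classifiers typically expose such probabilities through `predict_proba`.

```python
def confident_hr_articles(scored_articles, threshold=0.9):
    """Keep articles whose predicted HR probability meets the threshold, best first."""
    kept = [(article_id, p) for article_id, p in scored_articles if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical (article_id, probability of HR) pairs from a probabilistic classifier.
scored_articles = [("a1", 0.97), ("a2", 0.55), ("a3", 0.91), ("a4", 0.12)]
shown = confident_hr_articles(scored_articles, threshold=0.9)
```

Raising the threshold trades recall for precision, which matches the preference for precision described earlier in this post.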
Once we get similar results on the test set and the training set (hopefully both scores are high), our classifier goes into production, ready to classify news as Human Resources or not.
Even though we do our best to optimize our classifier for HR, some misclassifications are inevitable, due to noise in our training set or to articles that have partial HR components but should not necessarily be classified as HR articles. (Did you notice that our classification accuracy is not 100%?) If you report those, we can learn from the mistakes and provide a much better user experience and product to you.