Are there any tools for feature engineering?
I would like to be able to easily smooth, visualize, fill gaps, etc. Something similar to MS Excel, but that has R as the underlying language instead of VB.
Very interesting question (+1). While I am not aware of any software tools that currently offer comprehensive functionality for feature engineering, there is definitely a wide range of options in that regard. Currently, as far as I know, feature engineering is still largely a laborious and manual process (e.g., see this blog post). On the subject of feature engineering itself, this excellent article by Jason Brownlee provides a rather comprehensive overview of the topic.
Ben Lorica, Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media Inc., has written a very nice article describing the state-of-the-art (as of June 2014) approaches, methods, tools and startups in the area of automating (or, as he put it, streamlining) feature engineering.
I took a brief look at some of the startups that Ben referenced, and a product by Skytree indeed looks quite impressive, especially in regard to the subject of this question. Having said that, some of their claims sound really suspicious to me (e.g., "Skytree speeds up machine learning methods by up to 150x compared to open source options"). Continuing with commercial data science and machine learning offerings, I have to mention solutions by Microsoft, in particular their Azure Machine Learning Studio. This Web-based product is quite powerful and elegant and offers some feature engineering functionality (FEF). For an example of some simple FEF, see this nice video.
Returning to the question, I think that the simplest approach one can apply for automating feature engineering is to use corresponding IDEs. Since you (me, too) are interested in the R language as a data science backend, I would suggest checking out, in addition to RStudio, another similar open source IDE called RKWard. One of the advantages of RKWard over RStudio is that it supports writing plugins for the IDE, thus enabling data scientists to automate feature engineering and streamline their R-based data analysis.
Finally, on the other side of the spectrum of feature engineering solutions we can find some research projects. The two most notable seem to be Stanford University's Columbus project, described in detail in the corresponding research paper, and Brainwash, described in this paper.
Source: as linked
"Feature engineering" is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if you have a date field as a predictor and there are larger differences in response for the weekends versus the weekdays, then encoding the date as a weekend/weekday indicator makes it easier to achieve good results.
However, this depends on a lot of things.
First, it is model-dependent. For example, trees might have trouble with a classification data set if the class boundary is a diagonal line since their class boundaries are made using orthogonal slices of the data (oblique trees excepted).
Second, the process of predictor encoding benefits the most from subject-specific knowledge of the problem. In my example above, you need to know the patterns of your data to improve the format of the predictor. Feature engineering is very different in image processing, information retrieval, RNA expression profiling, etc. You need to know something about the problem and your particular data set to do it well.
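The weekend/weekday example above can be sketched in a few lines of base R. This is only an illustration: the data frame `visits`, its columns `date` and `response`, and the values in it are made up, not taken from any real data set.

```r
# Hypothetical data: the names `visits`, `date`, and `response` are
# assumptions made for illustration only.
visits <- data.frame(
  date     = as.Date(c("2014-06-06", "2014-06-07", "2014-06-08", "2014-06-09")),
  response = c(10, 25, 27, 11)
)

# as.POSIXlt()$wday runs from 0 (Sunday) to 6 (Saturday) and, unlike
# weekdays(), does not depend on the locale of the R session.
day_index <- as.POSIXlt(visits$date)$wday
visits$is_weekend <- day_index %in% c(0, 6)

# Mean response for each level of the engineered predictor
weekend_means <- tapply(visits$response, visits$is_weekend, mean)
print(visits)
print(weekend_means)
```

The raw date carries the same information, but most models would have to work much harder to recover the weekend effect from it than from the single logical column.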
Here is some training set data where two predictors are used to model a two-class system (I'll unblind the data at the end):
There is also a corresponding test set that we will use below.
There are some observations that we can make:
The data are highly correlated (correlation = 0.85)
Each predictor appears to be fairly right-skewed
They appear to be informative in the sense that you might be able to draw a diagonal line to differentiate the classes
Depending on which model we choose to use, the between-predictor correlation might bother us. Also, we should look to see if the individual predictors are important. To measure this, we'll use the area under the ROC curve on the predictor data directly.
Here are univariate box-plots of each predictor (on the log scale):
There is some mild differentiation between the classes but a significant amount of overlap in the boxes. The areas under the ROC curves for predictors A and B are 0.61 and 0.59, respectively. Not so fantastic.
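For readers who want to see what "the area under the ROC curve on the predictor data directly" means in practice, here is a minimal base R sketch. It uses the rank-based (Mann-Whitney) form of the AUC; the helper name `auc_rank` and the toy values are assumptions for illustration, and this is not necessarily how the numbers above were computed.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# has a higher score than a randomly chosen negative case.
auc_rank <- function(score, is_positive) {
  r <- rank(score)                      # average ranks are used for ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy values, made up for illustration only
score <- c(0.1, 0.4, 0.35, 0.8)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)
auc <- auc_rank(score, is_positive)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A value near 0.5 means the raw predictor barely orders the classes at all. A value well below 0.5 means it separates them in the opposite direction, so some tools report `max(auc, 1 - auc)` instead.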
What can we do? Principal component analysis (PCA) is a pre-processing method that does a rotation of the predictor data in a manner that creates new synthetic predictors (i.e. the principal components or PC's). This is conducted in a way where the first component accounts for the majority of the (linear) variation or information in the predictor data. The second component does the same for any information in the data that remains after extracting the first component and so on. For these data, there are two possible components (since there are only two predictors). Using PCA in this manner is typically called feature extraction.
Let's compute the components:
> library(caret)
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One
> pca_pp <- preProcess(example_train[, 1:2],
+                      method = c("center", "scale", "pca"))
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("center", "scale", "pca"))

Created from 1009 samples and 2 variables
Pre-processing: centered, scaled, principal component signal extraction

PCA needed 2 components to capture 95 percent of the variance
> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371
Note that we computed all the necessary information from the training set and applied these calculations to the test set. What do the test set data look like?
These are the test set predictors simply rotated.
PCA is unsupervised, meaning that the outcome classes are not considered when the calculations are done. Here, the area under the ROC curve is 0.5 for the first component and 0.81 for the second component. These results jibe with the plot above; the first component has a random mixture of the classes while the second seems to separate the classes well. Box plots of the two components reflect the same thing:
There is much more separation in the second component.
This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Secondly, it is the last component that is most important to the classes but the least important to the predictors. It is often said that PCA doesn't guarantee that any of the components will be predictive and this is true. Here, we get lucky and it does produce something good.
However, imagine that there are hundreds of predictors. We may only need to use the first X components to capture the majority of the information in the predictors and, in doing so, discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors; a similar strategy would probably discard the most effective predictor.
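The risk described above can be reproduced on simulated data. The sketch below does not use the data from this post: it builds two correlated, right-skewed predictors from a made-up shared `size` term and a class-dependent `shape` term (both names and all constants are assumptions), then compares how much variance each component explains with how well it separates the classes. For brevity it works on a single data set with `prcomp` rather than a training/test split.

```r
set.seed(42)

# Simulated stand-in for the two predictors (NOT the data used in this post)
n <- 500
class <- rep(c("One", "Two"), each = n / 2)
size <- rlnorm(n, meanlog = 7, sdlog = 0.5)
shape <- ifelse(class == "One", 1.2, 1.0) * rlnorm(n, meanlog = 0, sdlog = 0.05)
predictors <- data.frame(A = size * shape, B = size / shape)

# Center, scale, and rotate the predictors
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Rank-based AUC, folded so that 0.5 means "no signal" in either direction
auc_rank <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  auc <- (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)
}
auc_pc <- apply(pca$x, 2, auc_rank, is_positive = class == "One")

print(round(var_explained, 3))  # share of predictor variance per component
print(round(auc_pc, 3))         # class separation per component
```

In this simulation the first component carries most of the predictor variance while the second carries the class signal, so a rule that keeps components purely by variance explained would throw away the useful one.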
How does the idea of feature engineering come into play here? Given these two predictors and seeing the first scatterplot shown above, one of the first things that occurs to me is "there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes". The second thing that occurs to me is "take the ratio". What does that data look like?
The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good a job as an unbiased empirical algorithm.
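The "take the ratio" idea is easy to try out. The following base R sketch uses simulated predictors rather than the data from this post (the `size` and `shape` terms and all constants are assumptions chosen so that the classes differ mainly in how A relates to B) and compares the AUC of each raw predictor with the AUC of their ratio.

```r
set.seed(42)

# Simulated stand-in for predictors A and B (NOT the data used in this post)
n <- 500
class <- rep(c("One", "Two"), each = n / 2)
size <- rlnorm(n, meanlog = 7, sdlog = 0.5)
shape <- ifelse(class == "One", 1.2, 1.0) * rlnorm(n, meanlog = 0, sdlog = 0.05)
A <- size * shape
B <- size / shape

# Rank-based AUC, folded so that 0.5 means "no signal" in either direction
auc_rank <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  auc <- (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)
}

is_one <- class == "One"
auc_values <- c(
  A     = auc_rank(A, is_one),
  B     = auc_rank(B, is_one),
  ratio = auc_rank(A / B, is_one)   # the engineered feature
)
print(round(auc_values, 3))
```

Dividing one predictor by the other cancels the shared size effect that dominates both of them, which is why the ratio can separate the classes even when neither raw predictor does.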
These data are from the cell segmentation experiment of Hill et al, and predictor A is the "surface of a sphere created from by rotating the equivalent circle about its diameter" (labeled as EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.
However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose features are just as arcane as EqSphereAreaCh1. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.
Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell's nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.
Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of "known stuff".
Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many(0) data applications, feature engineering and feature selection are just as important as (if not more important than) the choice of algorithm:
Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)
The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:
Feature Engineering or the Creation of New Features
A simple example to keep in mind is text mining. One starts with raw text (documents) and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features(1) are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation(2) of new features (feature engineering) and identifying the most essential ones (feature selection).
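As a rough illustration of the text mining example, the base R sketch below turns two made-up documents into word-frequency features. The documents, the tokenization rule, and the helper names `tokenize` and `count_words` are assumptions for illustration; real pipelines usually add steps such as stop-word removal and stemming.

```r
# Two made-up documents
docs <- c(
  doc1 = "feature engineering creates new features",
  doc2 = "feature selection keeps the most essential features"
)

# Lowercase, split on anything that is not a letter, drop empty tokens
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}
tokens <- lapply(docs, tokenize)

# Vocabulary: every distinct word seen in any document
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document
count_words <- function(tok) {
  counts <- as.integer(table(factor(tok, levels = vocab)))
  setNames(counts, vocab)
}

# Document-term matrix: one row per document, one column per word
dtm <- t(vapply(tokens, count_words, integer(length(vocab))))
print(dtm)
```

Each row of `dtm` is the vector representation of one document, which is the form that classification and clustering algorithms can work with.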
Feature Selection techniques
Why bother selecting features? Why not use all available features? Part of the answer could be that you need a solution that is simple, interpretable, and fast. This favors features that have good statistical performance and that are easy to explain to non-technical users. But there could be legal(3) reasons for excluding certain features as well (e.g., the use of credit scores is discriminatory in certain situations).
Domain experts can manually pick out features, and more recently I wrote about a service that uses crowdsourcing techniques. It’s not hard to find examples of problems where domain expertise is insufficient, and this approach isn’t particularly practical when underlying data sets are massive.
There are variable ranking procedures that use metrics like correlation, information criteria, etc. They scale to large data sets but can easily lead to strange recommendations (e.g., use “butter production in Bangladesh” to predict the S&P 500).
Techniques that take a vast feature space and reduce it to a lower-dimensional one (clustering, principal component analysis, matrix factorization).
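A variable ranking procedure of the kind listed above can be as simple as sorting candidate features by the absolute value of their correlation with the response. The sketch below uses the `mtcars` data set that ships with R, and treating `mpg` as the response is an arbitrary choice for illustration.

```r
# mtcars ships with base R; using mpg as the response is an assumption
# made for illustration only.
candidates <- setdiff(names(mtcars), "mpg")

# Absolute Pearson correlation between each candidate and the response
abs_cor <- sapply(mtcars[candidates], function(x) abs(cor(x, mtcars$mpg)))

# Strongest candidates first
ranking <- sort(abs_cor, decreasing = TRUE)
print(round(ranking, 2))
```

The ranking looks at one feature at a time and knows nothing about the problem domain, which is exactly how a spurious but highly correlated feature can end up at the top of the list.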
Expect more tools to streamline Feature Discovery
In practice, feature selection and feature engineering are iterative processes where humans leverage automation(4) to wade through candidate features. Statistical software packages have long had (stepwise) procedures for feature selection. New startups are providing similar tools: Skytree’s new user interface lets business users automate feature selection.
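For readers who have not used a stepwise procedure, here is a minimal sketch with the `step` function from R's built-in stats package, run on the bundled `mtcars` data. The choice of data set, response, and backward direction are assumptions for illustration, and this says nothing about how the commercial tools mentioned here work.

```r
# Start from a linear model that uses every available predictor
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination: repeatedly drop the term whose removal most
# improves AIC, and stop when no removal helps. trace = 0 silences
# the step-by-step log.
selected <- step(full_model, direction = "backward", trace = 0)

print(formula(selected))
kept <- setdiff(names(coef(selected)), "(Intercept)")
print(kept)
```

The procedure is fully automatic, but a human still has to decide whether the retained features make sense, which is the iterative loop described above.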
I’m definitely noticing much more interest from researchers and startups. A group out of Stanford(5) just released a paper on a new R language extension and execution framework designed for feature selection. Their R extension enables data analysts to incorporate feature selection using high-level constructs that form a domain specific language. Some startups, like ContextRelevant and SparkBeyond(6), are working to provide users with tools that simplify feature engineering and selection. In some instances this includes incorporating features derived from external data sources. Users of SparkBeyond are able to incorporate the company’s knowledge databases (Wikipedia, OpenStreetMap, GitHub, etc.) to enrich their own data sources.
While many startups that build analytic tools begin by focusing on algorithms, many products will soon begin highlighting how they handle feature selection and discovery. There are many reasons why there will be more emphasis on features: interpretability (this includes finding actionable features that drive model performance), big data (companies have many more data sources to draw upon), and an appreciation of data pipelines (algorithms are just one component).
Building tools that automate feature discovery is an important topic in artificial engineering research. For more on recent trends in AI, check out our new series, Intelligence Matters.
(0) The quote from Alon Halevy, Peter Norvig, and Fernando Pereira is associated with big data. But features are just as important in small data problems. Read through the Kaggle blog and you quickly realize that winning entries spend a lot of their time on feature engineering.
(1) In the process documents usually get converted into structures that algorithms can handle (vectors).
(2) One can, for example, create composite (e.g., linear combination) features out of existing ones.
(3) From Materialization Optimizations for Feature Selection Workloads: “Using credit score as a feature is considered a discriminatory practice by the insurance commissions in both California and Massachusetts.”
(4) Stepwise procedures in statistical regression are a familiar example.
(5) The Stanford research team designed their feature selection tool after talking to data analysts at several companies. The goal of their project was to increase analyst productivity.
(6) Full disclosure: I’m an advisor to SparkBeyond.