Reading: 10 ML projects

5 machine learning projects you should definitely have a look at, in no particular order (but numbered like they are in order, because I like numbering things):

1. Deepy

Deepy is an extensible deep learning framework based on Theano. It provides a clean, high-level interface for components such as LSTMs, Batch Normalization, and Auto Encoders. Deepy clearly aims for simplicity, and its documentation and examples aim for the same. It also has a sister project, which uses Deepy to implement Deep Recurrent Attentive Writer (DRAW) generative models.

To illustrate Deepy's simplicity and cleanliness, here's a multi-layer model with dropout, from the project's Github:

    # A multi-layer model with dropout for MNIST task.
    from deepy import *

    model = NeuralClassifier(input_dim=28*28)
    model.stack(Dense(256, 'relu'),
                Dropout(0.2),
                Dense(256, 'relu'),
                Dropout(0.2),
                Dense(10, 'linear'),
                Softmax())

    trainer = MomentumTrainer(model)

    annealer = LearningRateAnnealer(trainer)

    mnist = MiniBatches(MnistDataset(), batch_size=20)

    trainer.run(mnist, controllers=[annealer])

You may have even heard of Deepy already; its Github repo has 305 stars and has been forked 51 times, as of this writing. The project is a decent exemplar of high-level deep learning APIs and wrappers that are becoming widespread (or seem to be). Deepy is authored by Raphael Shu.

2. MLxtend

Sebastian Raschka has put together MLxtend, something he is quick to point out is a work in progress, but is also something which attempts to tick a number of different boxes. MLxtend is a collection of useful tools and extensions for machine learning tasks.

Sebastian shared the following with me, regarding the project, how it came to be, and its goals:

Essentially, it's just a collection of useful tools and reference implementations related to ML and data science in general. Why did I come up with it? There are a couple of reasons:

1. Implementations of algorithms that I couldn't find anywhere else (e.g., the Sequential Feature Selection algorithms, the Majority Voting Classifier, the Stacking estimators, plotting decision regions, ...)
2. Implementations for teaching purposes (logistic regression, softmax regression, multi-layer perceptron, PCA, kernel PCA...); these impl. focus on code readability rather than pure efficiency
3. Wrappers for convenience: tensorflow softmax regression and multi-layer perceptrons, column-wise standardization for pandas data frames

This is essentially a library of commonly-used general machine learning functions that Sebastian has written and frequently uses. Additionally, Sebastian really likes to code, and thought that if he were to offer this "zoo" of different things (as he refers to it) up to others, he might keep the code "tidier" than usual.

Many of the implemented functions share similarities with scikit-learn's API, but future additional functionality will not necessarily be restricted by this. The big takeaway here: Sebastian promises that there is much more to come... so stay tuned. There's a good chance that any feature or novel algorithm that Sebastian plays with will end up being packaged in MLxtend.
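
To make one of the listed ideas concrete, here is a minimal sketch of majority voting in plain Python. It is not MLxtend's implementation or API; the function names and the three lists of classifier outputs are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among one sample's predictions.

    Ties go to the label that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]


def vote_all(per_classifier_predictions):
    """Combine the predictions of several classifiers, sample by sample."""
    # zip(*...) regroups the lists so that each tuple holds every
    # classifier's prediction for a single sample.
    return [majority_vote(sample) for sample in zip(*per_classifier_predictions)]


# Hypothetical outputs of three classifiers on the same four samples.
clf_a = [0, 1, 1, 0]
clf_b = [0, 1, 0, 0]
clf_c = [1, 1, 0, 0]

combined = vote_all([clf_a, clf_b, clf_c])
print(combined)  # [0, 1, 0, 0]
```

Each classifier gets one equal vote here; weighting the votes would be a natural next step, but that is beyond this sketch.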

3. datacleaner

datacleaner is the work of researcher Randal Olson, who is also responsible for the fantastic TPOT machine learning pipeline project. Olson bills datacleaner as a "Python tool that automatically cleans data sets and readies them for analysis." He is quick to declare that it is not magic, but also points out what it can do:

What datacleaner will do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.

datacleaner is a work in progress, but is currently capable of handling the following regular (and time-consuming) data cleaning operations: optionally drops rows with missing values; replaces missing values with either mode or median, on a column by column basis; encodes non-numerical variables with numerical equivalents. Randal tells us that he is looking for contributors, especially from those with more ideas on what data cleaning operations datacleaner could perform in an automated fashion.
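
As a rough illustration of those operations, here is a hand-rolled sketch using only the Python standard library. It does not use datacleaner or pandas, and it makes assumptions of its own: missing values are represented as None, numerical columns get the median, and everything else gets the mode. The clean_columns function and the sample data are hypothetical.

```python
from statistics import median, mode


def clean_columns(columns):
    """Fill missing values (None) and encode non-numerical columns as integers.

    `columns` maps a column name to a list of values.
    """
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        # Assumption: median for numerical columns, mode for everything else.
        fill = median(present) if numeric else mode(present)
        filled = [fill if v is None else v for v in values]
        if not numeric:
            # Map each distinct category to an integer code, in sorted order.
            codes = {category: i for i, category in enumerate(sorted(set(filled)))}
            filled = [codes[v] for v in filled]
        cleaned[name] = filled
    return cleaned


# Hypothetical data with one missing value per column.
raw = {
    "age": [22, None, 35, 41],
    "city": ["Oslo", "Lima", None, "Lima"],
}
cleaned = clean_columns(raw)
print(cleaned)  # {'age': [22, 35, 35, 41], 'city': [1, 0, 0, 0]}
```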

Randal has an attention to detail that anyone who reads his blog or his Github repos already knows, and the concise documentation for this project is no exception. I have been using datacleaner recently, and so far it delivers on its promises.

4. auto-sklearn

auto-sklearn is automated machine learning for the Scikit-learn environment.

auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading this paper published at NIPS 2015.

Its documentation is quite thorough, and the repo includes a few concise examples. I have, admittedly, not used it yet, but many others have; it has collected nearly 400 stars on Github. Given my propensity for Scikit-learn, I imagine I will try this out in the near future.

auto-sklearn is developed mainly by the Machine Learning for Automated Algorithm Design group at the University of Freiburg.

5. Deep Mining

Deep Mining is a machine learning pipeline auto-tuner, coming to us from Sebastien Dubois of the CSAIL lab at MIT. From the repo:

This software will test iteratively, and smartly, some hyperparameter sets in order to find as quickly as possible the best ones to achieve the best classification accuracy that a pipeline can offer.

Deep Mining does not seem to be a well-known project, given its relatively modest number of repo stars; however, given that it comes out of CSAIL, and has development activity within the past month, it may be worth benchmarking this against other similar automated pipeline tools. It comes with a few examples, and its usage seems to be straightforward.

More on the methods used:

The folder GCP-HPO contains all the code implementing the Gaussian Copula Process (GCP) and a hyperparameter optimization (HPO) technique based on it. Gaussian Copula Process can be seen as an improved version of the Gaussian Process, that does not assume a Gaussian prior for the marginal distributions but lies on a more complex prior. This new technique is proved to outperform GP-based hyperparameter optimization, which is already far better than the randomized search.

A paper on the GCP approach is forthcoming.
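
For context, the randomized search that the quote uses as its baseline can be sketched in a few lines. This is not Deep Mining's code and it does not implement GCP; the toy_accuracy function is a hypothetical stand-in for a pipeline's validation accuracy, and the search space is invented.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample hyperparameter sets uniformly at random and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_accuracy(params):
    """Hypothetical objective; peaks at 1.0 for learning_rate=0.1, dropout=0.3."""
    return 1.0 - (params["learning_rate"] - 0.1) ** 2 - (params["dropout"] - 0.3) ** 2


space = {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.9)}
best_params, best_score = random_search(toy_accuracy, space)
print(best_params, best_score)
```

Random search spends every trial independently of what earlier trials found. Model-based approaches such as GP or GCP instead use past results to decide where to sample next, which is where the claimed gains would come from.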

Here they are: 5 more machine learning projects you should consider having a look at. They are presented in no particular order, but are numbered for convenience, and because numbering things is where it's at.

1. Rusty Machine

Rusty Machine is machine learning in Rust. Rust, itself, is only about 6 years old, with development sponsored by Mozilla. For those unfamiliar with Rust, it is a systems language with similarities to C and C++, self-described as:

Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.

Rusty Machine is actively developed, and currently supports a selection of learning techniques, including Linear Regression, Logistic Regression, K-Means Clustering, Neural Networks, Support Vector Machines, and more. The project is relatively new, and at this point leaves functionality such as cross-validation and data handling to the user. The project also has solid documentation.

Supporting data structures, such as vectors and matrices, come built-in. Perhaps familiarly, Rusty Machine provides a train and a predict function for each of its supported models, as a common interface to models. If you are a Rust user looking for a general purpose machine learning library, download Rusty Machine and give it a try.

2. scikit-image

scikit-image is image processing in Python for SciPy. Is scikit-image, itself, machine learning? Well, remember that this is a list of machine learning projects (nothing actually says they must perform machine learning), and recall that the previous post included support projects as well, such as data processing and preparation tools. scikit-image falls into this category. The project includes a number of image processing algorithms, such as point detection, filters, feature selection, and morphology.

This post from y-hat is a nice overview of image processing with scikit-image. The post also recognizes the importance of image processing in relation to machine learning:

Emphasizing important traits and diluting noisy ones is the backbone of good feature design. In the context of machine vision, this means that image preprocessing plays a huge role. Before extracting features from an image, it's extremely useful to be able to augment it so that aspects which are important to the machine learning task stand out.

Here's a quick example of using scikit-image to filter an image:

    from skimage import data, io, filters

    image = data.coins()  # or any NumPy array!
    edges = filters.sobel(image)
    io.imshow(edges)
    io.show()

I would suggest the project documentation and the y-hat post as good starting points if interested in using scikit-image for image processing tasks.

3. NLP Compromise

NLP Compromise is written in Javascript, and does Natural Language Processing in the browser. It has a fully-documented API, is actively developed, and has an in-progress wiki promising some additional useful information as well.

NLP Compromise is very easy to both install and use. Here's a short set of examples:

    let nlp = require('nlp_compromise'); // or nlp = window.nlp_compromise

    nlp.noun('dinosaur').pluralize();
    // 'dinosaurs'

    nlp.verb('speak').conjugate();
    // { past: 'spoke',
    //   infinitive: 'speak',
    //   gerund: 'speaking',
    //   actor: 'speaker',
    //   present: 'speaks',
    //   future: 'will speak',
    //   perfect: 'have spoken',
    //   pluperfect: 'had spoken',
    //   future_perfect: 'will have spoken'
    // }

    nlp.statement('She sells seashells').negate().text()
    // "She doesn't sell seashells"

    nlp.sentence('I fed the dog').replace('the [Noun]', 'the cat').text()
    // 'I fed the cat'

    nlp.text('Tony Hawk did a kickflip').people();
    // [ Person { text: 'Tony Hawk' ..} ]

    nlp.noun('vacuum').article();
    // 'a'

    nlp.person('Tony Hawk').pronoun();
    // 'he'

The project repository has gathered a high number of stars on Github (nearly 6,000), and its adoption by a handful of downstream projects is also reassuring. NLP in the browser probably can't get any easier, or more lightweight.

4. Datatest

Now this is interesting. Datatest is test driven data wrangling, in Python.

From the project's documentation:

Datatest extends the standard library’s unittest package to provide testing tools for asserting data correctness.

Datatest has detailed documentation, and perhaps the best way to get an idea of what it is and how to use it is to check out an example from the documentation:

    import datatest


    def setUpModule():
        global subjectData
        subjectData = datatest.CsvSource('users.csv')


    class TestUserData(datatest.DataTestCase):
        def test_columns(self):
            self.assertDataColumns(required={'user_id', 'active'})

        def test_user_id(self):
            def must_be_digit(x):  # <- Helper function.
                return str(x).isdigit()
            self.assertDataSet('user_id', required=must_be_digit)

        def test_active(self):
            self.assertDataSet('active', required={'Y', 'N'})


    if __name__ == '__main__':
        datatest.main()

You can check out the entire list of available assert methods here.
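
To see what Datatest builds on, here is a rough equivalent of the example above written with only the standard library's unittest. It is a sketch and not Datatest's API: the USERS_CSV string is a hypothetical stand-in for users.csv, and load_rows and run_checks are helper names invented here.

```python
import csv
import io
import unittest

# Hypothetical stand-in for the contents of users.csv.
USERS_CSV = "user_id,active\n101,Y\n102,N\n103,Y\n"


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


class TestUserData(unittest.TestCase):
    def setUp(self):
        self.rows = load_rows(USERS_CSV)

    def test_columns(self):
        # The data must contain exactly these columns.
        self.assertEqual(set(self.rows[0]), {"user_id", "active"})

    def test_user_id(self):
        # Every user_id must consist of digits only.
        for row in self.rows:
            self.assertTrue(row["user_id"].isdigit(), row)

    def test_active(self):
        # On sets, <= means "is a subset of".
        self.assertLessEqual({row["active"] for row in self.rows}, {"Y", "N"})


def run_checks():
    """Run the data tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUserData)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_checks()
```

The plain unittest version works, but the column and value checks have to be written by hand each time, which is the repetition Datatest's data-specific assert methods are meant to remove.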

Datatest is a different way of looking at data wrangling and preparation. Given that so much of your time may be spent on this task, however, perhaps a new approach is worth checking out.

5. GoLearn

Adding to our collection of non-Python machine learning libraries and/or frameworks in the post, GoLearn is a general purpose machine learning library for Go.

Here is what GoLearn has to say about itself:

GoLearn is a 'batteries included' machine learning library for Go. Simplicity, paired with customisability, is the goal. We are in active development, and would love comments from users out in the wild.

There is good news both for Python users who may be thinking of branching out and for Go users looking to make the shift to machine learning: GoLearn implements the familiar Scikit-learn Fit/Predict interface, enabling fast estimator testing and swapping. This makes for a smooth transition, and lets dedicated Go users take advantage of all the Scikit-learn tutorial material out there without anyone having to recreate that foundational, practical instruction for Go.
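
Since the Fit/Predict convention comes from the Python world, here is a small Python sketch of why it makes estimators easy to swap. It uses neither GoLearn nor Scikit-learn; the two toy estimators, the accuracy helper, and the data are all invented for illustration.

```python
from collections import Counter


class MajorityClassifier:
    """Always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighbour:
    """Predicts the label of the closest training point (1-NN)."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def _closest(self, x):
        # Index of the training point with the smallest squared distance to x.
        def sq_dist(i):
            return sum((p - q) ** 2 for p, q in zip(x, self.X_[i]))
        return min(range(len(self.X_)), key=sq_dist)

    def predict(self, X):
        return [self.y_[self._closest(x)] for x in X]


def accuracy(estimator, X_train, y_train, X_test, y_test):
    """Works with any object that offers fit(X, y) and predict(X)."""
    predictions = estimator.fit(X_train, y_train).predict(X_test)
    return sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)


# Hypothetical two-feature data set.
X_train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8], [1.1, 0.9]]
y_train = ["a", "a", "b", "b", "b"]
X_test = [[0.0, 0.1], [1.0, 0.9]]
y_test = ["a", "b"]

for estimator in (MajorityClassifier(), NearestNeighbour()):
    print(type(estimator).__name__, accuracy(estimator, X_train, y_train, X_test, y_test))
```

The accuracy helper never needs to know which model it was given, so trying a different estimator is a one-line change.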

GoLearn is a mature enough project that it provides cross-validation and train/test splitting helper functions, which, if you recall, the relative newcomer Rusty Machine had not yet implemented. Looking to undertake some machine learning in Go, or looking for an excuse to try out the Go language? GoLearn might just be what you're after.
