Python

7 Steps to Mastering Machine Learning With Python: Where to begin? How to proceed? Go from zero

Source: http://www.kdnuggets.com/2015/11/seven-steps-machine-learning-python.html

1. You need Python installed. 

Install Anaconda. It is an industrial-strength Python distribution for Linux, OS X, and Windows, complete with the packages required for machine learning, including numpy, scikit-learn, and matplotlib. It also includes IPython Notebook, an interactive environment used in many of our tutorials. I would suggest Python 2.7, for no other reason than that it is still the dominant installed version. 

1.a  Open Anaconda Navigator, launch Spyder, and open (or create) a .py file. 
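To confirm the installation worked (a minimal sketch; the exact version numbers will differ on your machine), check that the core packages import cleanly and print their versions:

# Sanity check: the core scientific packages shipped with Anaconda
import numpy
import pandas
import matplotlib
import sklearn

print(numpy.__version__)
print(pandas.__version__)
print(matplotlib.__version__)
print(sklearn.__version__)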

2. Read lecture notes and free books (a reading list appears later in this document). 

3. Scientific Python Packages Overview 

These are the so-called scientific Python libraries we put to use when performing elementary machine learning tasks (there is clearly some subjectivity in this selection): 

▪  numpy - mainly useful for its N-dimensional array objects 

▪  pandas - Python data analysis library, including structures such as dataframes 

▪  matplotlib - 2D plotting library producing publication quality figures 

▪  scikit-learn - the machine learning algorithms used for data analysis and data mining tasks 

A good approach to learning these is to cover this material: 

▪  Scipy Lecture Notes by Gaël Varoquaux, Emmanuelle Gouillart, and Olav Vahtras 

This pandas tutorial is good, and to the point: 

▪  10 Minutes to Pandas 

Also worth knowing is Seaborn, a data visualization library based on matplotlib. 

These packages form the core of a wide array of machine learning tasks in Python. 

3.a. Introduction to NumPy: it's a must-learn.

NumPy is a blazing fast maths library for Python with a heavy emphasis on arrays. It allows you to do vector and matrix maths within Python, and because a lot of the underlying functions are actually written in C, it runs at speeds far beyond what ordinary Python code can reach. 
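As a rough illustration (a minimal sketch; exact timings will vary by machine), summing a million numbers with a Python loop versus NumPy's vectorised sum shows the difference:

# Rough speed comparison: pure Python vs vectorised NumPy
import time
import numpy as np

data = list(range(1000000))
arr = np.arange(1000000)

start = time.time()
total = sum(data)   # pure-Python loop over a list
print("python sum:", time.time() - start)

start = time.time()
total = arr.sum()   # vectorised loop implemented in C inside NumPy
print("numpy sum:", time.time() - start)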

Array Basics

 

Creation

NumPy revolves around these things called arrays (strictly speaking, ndarrays, but we don't need to worry about that). With these arrays we can do all sorts of useful things like vector and matrix maths at lightning speed. Get your linear algebra on! (Just kidding, we won't be doing any heavy maths.)

# 1D Array
import numpy as np

a = np.array([0, 1, 2, 3, 4])
b = np.array((0, 1, 2, 3, 4))
c = np.arange(5)
d = np.linspace(0, 2*np.pi, 5)

print(a) # >>>[0 1 2 3 4]
print(b) # >>>[0 1 2 3 4]
print(c) # >>>[0 1 2 3 4]
print(d) # >>>[ 0.          1.57079633  3.14159265  4.71238898  6.28318531]
print(a[3]) # >>>3

The above code shows four different ways of creating an array. Note that when NumPy prints an array containing numbers of different lengths, it automatically pads them out. This is useful for viewing matrices.

# MD Array
a = np.array([[11, 12, 13, 14, 15],
              [16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25],
              [26, 27, 28, 29, 30],
              [31, 32, 33, 34, 35]])

print(a[2, 4]) # >>>25

To create a 2D array we pass the array() function a list of lists (or a sequence of sequences). If we wanted a 3D array we would pass it a list of lists of lists, a 4D array would be a list of lists of lists of lists and so on.
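For instance, a small sketch of the 3D case, built from a list of lists of lists:

# 3D Array: two stacked 2x2 "layers"
b = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]])
print(b.shape) # >>>(2, 2, 2)
print(b[1, 0, 1]) # >>>6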

Multidimensional Array Slicing

Slicing a multidimensional array is a bit more complicated than a 1D one and it's something that you will do a lot while using NumPy.

# MD slicing
print(a[0, 1:4]) # >>>[12 13 14]
print(a[1:4, 0]) # >>>[16 21 26]
print(a[::2, ::2]) # >>>[[11 13 15]
                   #     [21 23 25]
                   #     [31 33 35]]
print(a[:, 1]) # >>>[12 17 22 27 32]

(The original article includes a diagram illustrating what these example slices select.)

Array Properties

When working with NumPy you might want to know certain things about your arrays. Luckily there are lots of handy methods included within the package to give you the information that you need.

# Array properties
a = np.array([[11, 12, 13, 14, 15],
              [16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25],
              [26, 27, 28, 29, 30],
              [31, 32, 33, 34, 35]])

print(type(a)) # >>><class 'numpy.ndarray'>
print(a.dtype) # >>>int64
print(a.size) # >>>25
print(a.shape) # >>>(5, 5)
print(a.itemsize) # >>>8
print(a.ndim) # >>>2
print(a.nbytes) # >>>200

The shape of an array is how many rows and columns it has, the above array has 5 rows and 5 columns so its shape is (5, 5).

The 'itemsize' property is how many bytes each item takes up. The data type of this array is int64; an int64 holds 64 bits, there are 8 bits in a byte, and 64 divided by 8 gives 8 bytes per item.

The 'ndim' property is how many dimensions the array has. This one has 2. A vector for example however, has just 1.

The 'nbytes' property is how many bytes are used up by all the data in the array. You should note that this does not count the overhead of an array and so the actual space that the array takes up will be a little bit larger.

Working With Arrays

 

Basic Operators

# Basic Operators
a = np.arange(25)
a = a.reshape((5, 5))

b = np.array([10, 62, 1, 14, 2, 56, 79, 2, 1, 45,
              4, 92, 5, 55, 63, 43, 35, 6, 53, 24,
              56, 3, 56, 44, 78])
b = b.reshape((5, 5))

print(a + b)
print(a - b)
print(a * b)
print(a / b)
print(a ** 2)
print(a < b)
print(a > b)

print(a.dot(b))

With the exception of dot(), all of these operators work element-wise on the arrays: corresponding elements are paired up, the arithmetic is done on each pair, and an array of the results is returned. For example, (a, b, c) + (d, e, f) gives (a+d, b+e, c+f). Note that when using comparison operators such as < and >, an array of booleans is returned, which has a very useful application that we will go through later.

The dot() function works out the dot product of two arrays. For 1D arrays this returns a scalar (a value with just magnitude and no direction); for the 2D arrays above it performs matrix multiplication and returns another 5x5 array.
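A small sketch of that difference (the arrays here are illustrative):

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(x.dot(y)) # >>>32 (1*4 + 2*5 + 3*6, a scalar)

m = np.array([[1, 2], [3, 4]])
n = np.array([[5, 6], [7, 8]])
print(m.dot(n)) # >>>[[19 22]
                #     [43 50]]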

Array Specific Operators

There are also some useful operators provided by NumPy for processing an array.

# sum, min, max, cumsum
a = np.arange(10)

print(a.sum()) # >>>45
print(a.min()) # >>>0
print(a.max()) # >>>9
print(a.cumsum()) # >>>[ 0  1  3  6 10 15 21 28 36 45]

The sum(), min() and max() functions are pretty obvious in what they do: they add up all the elements, and find the minimum and maximum elements, respectively.

The cumsum() function, however, is a little less obvious. Like sum() it adds the elements together, but it keeps a running total: it adds the first element to the second and stores the result, adds that result to the third element and stores it, and so on for every element. It returns these running totals as an array, giving you the cumulative sum of the array.
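In other words, cumsum() is equivalent to keeping a running total by hand (a minimal sketch):

a = np.arange(10)
running = []
total = 0
for x in a:
    total += x            # add the current element to the running total
    running.append(total)
print(running) # >>>[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
print(a.cumsum()) # >>>[ 0  1  3  6 10 15 21 28 36 45]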

Advanced Indexing

Fancy Indexing

Fancy indexing is a useful way of picking out the specific array elements that you want to work with.

# Fancy indexing
a = np.arange(0, 100, 10)
indices = [1, 5, -1]
b = a[indices]
print(a) # >>>[ 0 10 20 30 40 50 60 70 80 90]
print(b) # >>>[10 50 90]

As you can see in the above example, we index the array with a sequence of the specific indices that we want to retrieve. This in turn returns an array of the elements we indexed.

Boolean masking

Boolean masking is a fantastic feature that allows us to retrieve elements in an array based on a condition that we specify.

# Boolean masking
import matplotlib.pyplot as plt

a = np.linspace(0, 2 * np.pi, 50)
b = np.sin(a)
plt.plot(a, b)
mask = b >= 0
plt.plot(a[mask], b[mask], 'bo')
mask = (b >= 0) & (a <= np.pi / 2)
plt.plot(a[mask], b[mask], 'go')
plt.show()

The above example shows how to do boolean masking. All you have to do is index the array with a condition involving the array, and it will give you back the values for which that condition is true.

The example produces a plot (shown in the original article) of the sine curve with the selected points highlighted:

We use the conditions to select different points on the plot. The blue points (which in the diagram also include the green points, since the green points are drawn over the blue ones) show all the points whose value is at least 0. The green points show all the points whose value is at least 0 and whose x-coordinate is at most pi/2.

Incomplete Indexing

Incomplete indexing is a convenient way of taking an index or slice from the first dimension of a multidimensional array. For example, if you had the array a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]), then a[1] would give you everything at index 1 in the first dimension, which here is the whole second row [6 7 8 9 10].
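A small sketch of that behaviour on a 2D array:

a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])
print(a[1]) # >>>[ 6  7  8  9 10] (the whole second row)
print(a[1, 3]) # >>>9 (full indexing of a single element, for comparison)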

# Incomplete Indexing
a = np.arange(0, 100, 10)
b = a[:5]
c = a[a >= 50]
print(b) # >>>[ 0 10 20 30 40]
print(c) # >>>[50 60 70 80 90]

Where

The where() function is another useful way of retrieving elements of an array conditionally. Simply pass it a condition and it will return the indices where that condition is true.

# Where
a = np.arange(0, 100, 10)
b = np.where(a < 50)
c = np.where(a >= 50)[0]
print(b) # >>>(array([0, 1, 2, 3, 4]),)
print(c) # >>>[5 6 7 8 9]
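Because where() returns indices rather than values, you can pass its result straight back into the array to get the matching elements (a small sketch; boolean masking gives the same result more directly):

a = np.arange(0, 100, 10)
print(a[np.where(a < 50)]) # >>>[ 0 10 20 30 40]
print(a[a < 50]) # >>>[ 0 10 20 30 40]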

4. Machine learning with scikit-learn, Python's de facto standard machine learning library. 

 

The scikit-learn flow chart. 
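Before working through the tutorials, it helps to see the shape of a typical scikit-learn session. The following is only a minimal sketch (the k-nearest-neighbours classifier, the iris dataset, and the parameters are illustrative choices, and it assumes a reasonably recent scikit-learn); every estimator follows the same fit/predict pattern:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a classifier, predict on held-out data, and score the result
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))

Swapping in a different algorithm usually means changing only the import and the constructor line; the rest of the workflow stays the same.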

5. In-depth explorations of various common, and useful, algorithms. 

▪  k-means Clustering by Jake VanderPlas 

Next, we move back toward classification, and take a look at one of the most historically popular classification methods: 

▪  Decision Trees via The Grimm Scientist 

From classification, we look at continuous numeric prediction: 

▪  Linear Regression by Jake VanderPlas 

We can then leverage regression for classification problems, via logistic regression: 

▪  Logistic Regression by Kevin Markham 
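As a taste of what the logistic regression tutorial covers, here is a minimal sketch on synthetic data (the data, labels, and parameters are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: label is 1 when the two features sum to a positive number
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2.0, 1.0], [-2.0, -1.0]])) # typically [1 0]
print(clf.predict_proba([[2.0, 1.0]])) # class probabilities for one point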

6. Advanced

▪  Support Vector Machines by Jake VanderPlas 

Next, random forests, an ensemble classifier, are examined via a Kaggle Titanic Competition walk-through: 

▪  Kaggle Titanic Competition (with Random Forests) by Donne Martin 
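A minimal sketch of a random forest in scikit-learn, using a built-in dataset rather than the Titanic data (the dataset and parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# An ensemble of 100 decision trees, each trained on a bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean()) # mean cross-validated accuracy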

Dimensionality reduction is a method for reducing the number of variables being considered in a problem. Principal Component Analysis is a particular form of unsupervised dimensionality reduction: 

▪  Dimensionality Reduction by Jake VanderPlas 
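A minimal sketch of PCA with scikit-learn (the iris dataset here is illustrative): the 4-dimensional feature space is projected down to 2 principal components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape) # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_) # share of the variance captured by each component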

Finally, a free online book introducing neural networks: 

▪  Neural Networks and Deep Learning by Michael Nielsen 

7. Deep Learning in Python 

 

The learning is deep. 

Deep learning is everywhere! Deep learning builds on neural network research going back several decades, but recent advances dating to the past several years have dramatically increased the perceived power of, and general interest in, deep neural networks. 

Theano is a Python deep learning library. 

▪  Theano Deep Learning Tutorial by Colin Raffel 

Caffe is another deep learning framework. 

▪  Dreaming Deep with Caffe via Google's GitHub 

Free online books and courses, depending on your prior programming background: 

▪ No knowledge of programming: Python The Hard Way by Zed A. Shaw 

▪  Experience in programming but not with Python: Google Developers Python Course (highly recommended for visual learners) 

▪  An Introduction to Python for Scientific Computing (from UCSB Engineering) by M. Scott Shell (a great scientific Python intro ~60 pages) 

▪  A 30-minute crash course in Python: Learn X in Y Minutes (X = Python) 

Machine learning

Foundational Machine Learning Skills 

There is a lot of variation in what people consider a "data scientist."

Is it necessary to intimately understand kernel methods in order to efficiently create and gain insight from a support vector machine model? Of course not.

Like almost anything in life, the required depth of theoretical understanding is relative to the practical application. 

If you want more than that, it will require a greater time investment, in a more academic setting or via intense self-study at the very least.

You don't need to possess a PhD-level understanding of the theoretical aspects of machine learning in order to practice, in the same manner that not all programmers require a theoretical computer science education in order to be effective coders. 

▪  Unofficial Andrew Ng course notes 

▪  Tom Mitchell Machine Learning Lectures