Learning programming SQL

Doing Statistics with SQL

http://www.kdnuggets.com/ This post covers how to perform some basic in-database statistical analysis using SQL.

By Jean-Nicholas Hould, JeanNicholasHould.com

I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it.

For this post, I am using an open data set. The million songs data sets is a«freely-available collection of audio features and metadata for a million contemporary popular music tracks».

If you want to follow along, here are the steps:

Why Descriptive Statistics?

 

When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized, how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (mean, median, mode) and measure the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.

Central Tendency

 

Mean

The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.

SELECT CAST(AVG(songs.year) as int) as avg_year FROM songs  -- | avg_year | -- |----------| -- | 934      |

Median

The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.

SELECT songs.year as median_year FROM songs  ORDER BY songs.year  LIMIT 1  OFFSET (SELECT COUNT(*) FROM songs) / 2  -- | median_year | -- |-------------| -- | 0           |

Mode

The mode is the value that appears most often in a set of data.

SELECT      songs.year,     COUNT(*) as count FROM songs GROUP BY songs.year ORDER BY COUNT(*) DESC LIMIT 1  -- | year | count | -- |------|-------| -- | 0    | 5320  |

Variability

 

Min/Max Values

Minimum/Maximum value in a data set.

SELECT      MIN(songs.year) as min_year,     MAX(songs.year) as max_year FROM     songs  -- | min_year | max_year | -- |----------|----------| -- | 0        | 2010     |

Distribution of songs per year

Count of songs released in each years

SELECT      songs.year,     COUNT(*) songs_count FROM songs GROUP BY songs.year ORDER BY songs.year ASC  -- | year | song_count | -- |------|------------| -- | 0    | 5320       | -- | 1926 | 2          | -- | 1927 | 3          | -- | ...  | ...        | -- | 2009 | 250        | -- | 2010 | 64         |

Next Steps

 

The SQL queries in this post are fairly simple. They are not the result of technical gymmstastics. They are just simple measurements helping us understand the data set and that’s what’s great about them.

In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to 0. This means we are either looking at a data set of very old songs or that we are dealing with missing values. The latter is more realistic. If we filter out songs from year 0, our data makes more sense. The songs are ranging from 1926 to 2010 and the median is the year 2001.

With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper.

Bio: Jean-Nicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com.

7 Steps to Mastering SQL for Data Science

Follow these 7 steps to go from SQL data science newbie to seasoned practitioner quickly. No nonsense, just the necessities.

By Matthew Mayo, KDnuggets.

Structured Query Language (forever and always referred to henceforth as SQL, and usually pronounced "sequel") is the language for managing and querying data in relational database management systems (RDBMS).

So intertwined are the terms SQL and RDBMS that they are often conflated, sometimes by the uninitiated, but often simply out of convenience, and the term SQL is used adversarially to distignuish relational systems from non-relational database systems, which are often categorized by the umbrella term "NoSQL." Yet, SQL skills aren't wasted on non-RDBMS systems; the top data processing frameworks all have some implementation of SQL that sits atop their architecture, including Apache Big Data behemoths Spark andHadoop.

Together with R and Python, SQL forms the triumvirate of data science programming languages (liberal usage of the term "programming languages"); in fact, the most recent KDnuggets data science and analyticssoftware poll results indicate that these are also the 3 most-used tools by respondents, when all software is considered.

Clearly, SQL is important in data science. As such, this post aims to take a reader from SQL newbie to competent practitioner in a short time, using freely-available online resources. Lots of such resources exist on the internet, but mapping out a path from start to finish, using items which complement each other, is not always as straightforward as it may seem. Hopefully this post can be of assistance in this manner.

Step 1: Relational Database Basics

First off, since SQL is used to manage and query data in relational databases, having an understanding of what, exactly, a relational database is would be a good start.

Jennifer Widom is a renowned database researcher and professor at Stanford. She also put together one of the very first MOOCs on Coursera. Videos from this course are a good place to start:

A good introductory read on the relational database model is presented by UMass Boston's Dan Simovici here:

Extra credit: For a better theoretical understanding of the foundations upon which relational database systems are built, have a look at Widom's videos on relational algebra:

Step 2: SQL Overview

Now that we're all familiar with the relational paradigm, we shift focus to SQL.

First, get a high-level overview, again from Widom:

Read the first few pages of Simovici's SQL notes, or until it stops making intuitive sense:

For a great extended introductory overview of basic SQL commands, watch the following video from Khan Academy:

Once finished with the video, the following is a good treatment of SQL commands, with explanations, and can also serve as a useful reference moving forward, so keep it handy:

Finally, in preparation of the of next step, get an SQL environment up and running. You may not want to enter every SQL statement you encounter, but having an SQL interpreter up and running just makes sense. I suggest installing SQLite locally; it is a simple, but capable, SQL installation:

Step 3: Selecting, Inserting, Updating

While SQL can be used to perform a number of varied data management tasks, querying databases with SELECT, inserting records with INSERT, and updating existing records with UPDATE are some of the most heavily-used commands, and are good places to get started in practical SQL. Read and go through the following preliminary exercises:

Select

Insert

Update

This tutorial covers SQL basics and beginner to intermediate queries, and is a solid review of what we have seen to this point, and then some. It is presented from a MySQL point of view, but nothing in here is exclusive to that platform's SQL implementation.