Doing Statistics with SQLhttp://www.kdnuggets.com/ This post covers how to perform some basic indatabase statistical analysis using SQL. By JeanNicholas Hould, JeanNicholasHould.com I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it. For this post, I am using an open data set. The million songs data sets is a«freelyavailable collection of audio features and metadata for a million contemporary popular music tracks». If you want to follow along, here are the steps:
Why Descriptive Statistics? Central Tendency The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.
Median The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.
Mode The mode is the value that appears most often in a set of data.
Variability Minimum/Maximum value in a data set.
Distribution of songs per year Count of songs released in each years
Next Steps In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper. Bio: JeanNicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com. 7 Steps to Mastering SQL for Data ScienceFollow these 7 steps to go from SQL data science newbie to seasoned practitioner quickly. No nonsense, just the necessities. By Matthew Mayo, KDnuggets. Structured Query Language (forever and always referred to henceforth as SQL, and usually pronounced "sequel") is the language for managing and querying data in relational database management systems (RDBMS). So intertwined are the terms SQL and RDBMS that they are often conflated, sometimes by the uninitiated, but often simply out of convenience, and the term SQL is used adversarially to distignuish relational systems from nonrelational database systems, which are often categorized by the umbrella term "NoSQL." Yet, SQL skills aren't wasted on nonRDBMS systems; the top data processing frameworks all have some implementation of SQL that sits atop their architecture, including Apache Big Data behemoths Spark andHadoop. Together with R and Python, SQL forms the triumvirate of data science programming languages (liberal usage of the term "programming languages"); in fact, the most recent KDnuggets data science and analyticssoftware poll results indicate that these are also the 3 mostused tools by respondents, when all software is considered. Clearly, SQL is important in data science. As such, this post aims to take a reader from SQL newbie to competent practitioner in a short time, using freelyavailable online resources. Lots of such resources exist on the internet, but mapping out a path from start to finish, using items which complement each other, is not always as straightforward as it may seem. Hopefully this post can be of assistance in this manner. Step 1: Relational Database Basics First off, since SQL is used to manage and query data in relational databases, having an understanding of what, exactly, a relational database is would be a good start. Jennifer Widom is a renowned database researcher and professor at Stanford. She also put together one of the very first MOOCs on Coursera. Videos from this course are a good place to start: A good introductory read on the relational database model is presented by UMass Boston's Dan Simovici here: Extra credit: For a better theoretical understanding of the foundations upon which relational database systems are built, have a look at Widom's videos on relational algebra: Step 2: SQL Overview Now that we're all familiar with the relational paradigm, we shift focus to SQL. First, get a highlevel overview, again from Widom: Read the first few pages of Simovici's SQL notes, or until it stops making intuitive sense: For a great extended introductory overview of basic SQL commands, watch the following video from Khan Academy: Once finished with the video, the following is a good treatment of SQL commands, with explanations, and can also serve as a useful reference moving forward, so keep it handy: Finally, in preparation of the of next step, get an SQL environment up and running. You may not want to enter every SQL statement you encounter, but having an SQL interpreter up and running just makes sense. I suggest installing SQLite locally; it is a simple, but capable, SQL installation: Step 3: Selecting, Inserting, Updating While SQL can be used to perform a number of varied data management tasks, querying databases with SELECT, inserting records with INSERT, and updating existing records with UPDATE are some of the most heavilyused commands, and are good places to get started in practical SQL. Read and go through the following preliminary exercises: Select Insert Update This tutorial covers SQL basics and beginner to intermediate queries, and is a solid review of what we have seen to this point, and then some. It is presented from a MySQL point of view, but nothing in here is exclusive to that platform's SQL implementation.

Learning Hub >