## Basic Statistics for Data Science

By Ani Rud | Statistics

Statistics for analytics

In the previous posts we discussed the 4 most crucial aspects of analytics: starting with one or more business questions, preparing the data, analyzing it (using visualization and models), and communicating the results. In practice these steps are not carried out in isolation; they often affect one another. For example, while analyzing the data we may find that it is inadequate or incomplete, and we may need to revisit the data.

In short, analytics is the iterative, methodical exploration of an organization’s data to answer business questions and facilitate data-driven decision making.

Going by the above definition, the analytics process has 3 prerequisites: having the questions that need to be answered, having the data, and having the tools to analyze it.

Statistics offers us survey sampling techniques for collecting the proper data, and design of experiments to help us analyze it. In this post I’ll discuss some of the ideas and tools that statistics offers and that are widely used in analytics to answer business questions.

If you are new to statistics, standard terms such as random processes or stochastic variables may be confusing. So I’ll not stick strictly to that terminology here. Instead, I’ll try to link the terms to examples that we are familiar with.

**Random variables and Probability theory**

Let’s say an imaginary village has 100 families, and we know that each family may have between one and three kids under the age of ten. In this example, picking a family at random from the village is an experiment, or random trial. The number of kids in the family we pick is a random variable. The possible values for “number of kids under 10” are 1, 2 and 3, and together they constitute the sample space. Any subset of the sample space is called an event. Here we can define an event E as “the family selected has fewer than 3 kids”. If the family we pick at random has 1 or 2 kids, we say that event E has occurred.

Remember that every value in the sample space, and every event, has a probability associated with it. We’ll now see how to calculate that probability.

Let’s say we’ve discovered that, in the village, 50 families have 2 kids under the age of 10, 30 families have 1 kid and 20 families have 3 kids. So we can define the probabilities as:

P(kids under 10 = 2) = 50/100 = 0.5

P(kids under 10 = 1) = 30/100 = 0.3

P(kids under 10 = 3) = 20/100 = 0.2
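The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names `family_counts` and `probabilities` are my own, and `Fraction` is used simply to keep the divisions exact.

```python
from fractions import Fraction

# Family counts from the village example: kids under 10 -> number of families
family_counts = {1: 30, 2: 50, 3: 20}

total_families = sum(family_counts.values())  # 100 families in the village

# Probability of each value = families with that value / total families
probabilities = {
    kids: Fraction(count, total_families)
    for kids, count in family_counts.items()
}

for kids, p in sorted(probabilities.items()):
    print(f"P(kids under 10 = {kids}) = {float(p)}")
```

Each probability is just a relative frequency, so changing the counts in `family_counts` changes the probabilities accordingly.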

The probability function follows 2 rules:

The probability of every possible event is greater than or equal to 0 and less than or equal to 1.

The probabilities of all possible outcomes sum to 1.

In our case, if we denote the probability that the number of kids under age 10 equals K as P(K), we have:

P(1) + P(2) + P(3) = 0.3 + 0.5 + 0.2 = 1
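As a rough check, the two rules can be verified in code for the village example. The sketch below also computes the probability of the event E defined earlier (the selected family has fewer than 3 kids) by adding up the probabilities of the outcomes it contains. The function names `is_valid_pmf` and `event_probability` are hypothetical helpers written for this illustration, not part of any library.

```python
from fractions import Fraction

# Probabilities from the village example, kept as exact fractions
pmf = {1: Fraction(30, 100), 2: Fraction(50, 100), 3: Fraction(20, 100)}


def is_valid_pmf(pmf):
    """Check the two rules: each probability lies in [0, 1], and they sum to 1."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    sums_to_one = sum(pmf.values()) == 1
    return in_range and sums_to_one


def event_probability(pmf, event):
    """Probability of an event, given as a set of values from the sample space."""
    return sum(p for value, p in pmf.items() if value in event)


event_e = {1, 2}  # event E: the family has fewer than 3 kids

print(is_valid_pmf(pmf))                        # True
print(float(event_probability(pmf, event_e)))   # 0.8
```

Because the outcomes 1 and 2 cannot both happen for the same family, P(E) is simply P(1) + P(2) = 0.3 + 0.5 = 0.8.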

In predictive analytics tutorials we will come across many such variables whose values are known for some experiments (represented by rows of a table) and unknown for others. Our job is to estimate the probabilities from the given data and use them to predict the unknown values. We often use models to make these predictions, and understanding how these models calculate the probabilities will help us use them more effectively.
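To make this idea concrete, here is a deliberately naive sketch: estimate the probabilities from the rows where the value is known, then predict the most probable value for the rows where it is unknown. The list `observed_kids` is made-up illustration data (not the village counts above), and real predictive models are considerably more sophisticated than this baseline.

```python
from collections import Counter

# Hypothetical table column: one entry per family (row);
# None marks a row whose value is unknown. The values are invented.
observed_kids = [2, 1, 2, 3, 2, None, 1, 2, None, 3]

# Step 1: estimate probabilities from the rows with known values
known = [k for k in observed_kids if k is not None]
counts = Counter(known)
estimated = {value: count / len(known) for value, count in counts.items()}

# Step 2: predict the most probable value for every unknown row
most_likely = max(estimated, key=estimated.get)
filled = [most_likely if k is None else k for k in observed_kids]

print(estimated)
print(most_likely)
print(filled)
```

With this made-up data the value 2 is the most frequent among the known rows, so both unknown rows are predicted as 2.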