Aug 16

Understanding Databases for Data Scientists

By ganpati | Getting Started

The majority of data stored by businesses lives in relational databases. These databases are exceptionally good at storing complicated business data sets and at allowing efficient information retrieval, so a strong understanding of relational databases is essential to being an effective data scientist.

To start with, you should have a grasp of filters, joins, aggregations, etc., for querying a database. Here is a post that you can follow: Filters, Joins, Aggregations, and All That: A Guide to Querying in SQL
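As a minimal sketch of these three operations, here is a toy example using Python's built-in sqlite3 module; the table names and data are invented purely for illustration:

```python
import sqlite3

# In-memory database with two hypothetical tables: customers and orders.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Pune")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0), (4, 3, 300.0)])

# A filter (WHERE), a join (JOIN ... ON), and an aggregation (GROUP BY + SUM)
# combined in one query: total order amount per city, for orders above 50.
cur.execute("""
    SELECT c.city, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.amount > 50
    GROUP BY c.city
    ORDER BY c.city
""")
rows = cur.fetchall()
print(rows)  # [('Delhi', 75.0), ('Pune', 650.0)]
```

The same WHERE/JOIN/GROUP BY skeleton carries over to production databases; only the connection library changes.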

In addition to that, an understanding of high-performance parallel databases designed to deal with large data sets (like Teradata and HP Vertica) would be helpful.

For very large data sets, Hadoop, the Hadoop Distributed File System (HDFS), and MapReduce are typically used for storage and analysis. Apache Hive is an implementation of SQL on top of MapReduce, which brings the power of SQL to Hadoop. Apache Pig and Scalding are similar alternatives.
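To make the MapReduce idea concrete, here is a toy word count in plain Python that mimics the map, shuffle, and reduce phases; the documents are invented, and a real job would run these phases distributed across a cluster over HDFS:

```python
from collections import defaultdict
from itertools import chain

# Toy corpus standing in for files stored on HDFS (hypothetical data).
documents = ["big data on hadoop", "sql on hadoop", "hadoop stores big data"]

# Map phase: each document independently emits (word, 1) pairs.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all emitted pairs by key (the word).
groups = defaultdict(list)
for key, value in chain.from_iterable(mapper(d) for d in documents):
    groups[key].append(value)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["hadoop"])  # 3
```

Hive lets you express this same computation as a SQL GROUP BY instead of writing mappers and reducers by hand.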

Bill Howe’s Coursera class Introduction to Data Science has a good discussion of SQL query optimizers in his lectures on “Relational Databases, Relational Algebra”.


Aug 12

Basic Statistics for Data Science

By Ani Rud | Statistics

Basic rules of probability

Expected value

Sample and population quantities

Signal and noise

Probability densities and mass functions

Variability and distribution

Statistical meaning of overfitting


Statistics for analytics

In the previous posts we discussed the four most crucial aspects of analytics: starting with a business question (or questions), readying the data, analyzing it (using visualization and models), and communicating the results. In practice these steps are not carried out in isolation; they often affect each other. For example, while analyzing the data we may find that it is inadequate or incomplete, and hence we may need to revisit the data-preparation step.

In short, analytics is the iterative, methodical exploration of an organization’s data to answer business questions and facilitate data-driven decision making.

Going by the above definition, there are three prerequisites to the analytics process: having the questions that need to be answered, having the data, and having the tools to analyze it.

Statistics offers us techniques for survey sampling to collect the proper data, and it helps us with the design of experiments to analyze that data. In this post I’ll discuss some of the ideas and tools that statistics offers and that are widely used in analytics to answer business questions.
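As a quick illustration of survey sampling, a simple random sample can be drawn with Python's standard library; the population here is a hypothetical list of 100 family IDs, foreshadowing the village example below:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100 family IDs; draw a simple random
# sample of 10 families without replacement.
population = list(range(1, 101))
sample = random.sample(population, 10)
print(sample)
```

Each family has the same chance of being selected, which is what lets us generalize from the sample to the population.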

If you are new to the field of statistics you may be confused by standard terminology like random processes or stochastic variables. So I’ll not go strictly by that terminology here; instead I’ll try to link those terms to examples that we are familiar with.

Random variables and Probability theory

Let’s say that in an imaginary village there are 100 families, and we know that each family has between one and three kids under the age of ten. In this example, picking any family in the village is an experiment, or random trial. The number of kids in the family we pick is a random variable. The possible values for “number of kids under 10” are 1, 2, and 3, and they constitute the sample space. Any subset of the sample space is called an event. Here we can define an event E as “the selected family has fewer than 3 kids.” If the family we pick at random has 1 or 2 kids, we say the event E has occurred.

Remember that every value in the sample space, and every event, has a probability associated with it. We’ll now see how to calculate these probabilities.

Let’s say we’ve found that in the village, 50 families have 2 kids under the age of 10, 30 families have 1 kid, and 20 families have 3 kids. So we can define the probabilities as:

P(kids under 10 = 2) = 50/100 = 0.5

P(kids under 10 = 1) = 30/100 = 0.3

P(kids under 10 = 3) = 20/100 = 0.2

The probability function follows two rules:

The probability of every possible event is greater than or equal to zero and less than or equal to 1.

The probabilities of all possible outcomes sum to 1.

In our case, if we denote the probability that the number of kids under age 10 equals K as P(K), we have:

P(1) + P(2) + P(3) = 1
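The village numbers above can be checked against both rules in a few lines of Python:

```python
# Counts from the village example: families by number of kids under 10.
counts = {1: 30, 2: 50, 3: 20}
total = sum(counts.values())  # 100 families

# Empirical probability of each value of the random variable.
probs = {k: n / total for k, n in counts.items()}
print(probs)  # {1: 0.3, 2: 0.5, 3: 0.2}

# Rule 1: every probability lies in [0, 1].
assert all(0 <= p <= 1 for p in probs.values())

# Rule 2: the probabilities of all outcomes sum to 1.
assert abs(sum(probs.values()) - 1.0) < 1e-9

# Probability of the event E, "fewer than 3 kids": P(1) + P(2).
p_event = probs[1] + probs[2]
print(p_event)  # 0.8
```

Note that these are empirical probabilities computed from the full population of 100 families; with only a sample, they would be estimates.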

In predictive analytics tutorials we will come across many such variables whose values are known for some experiments (the rows of a table) and unknown for others. Our job is to estimate the probabilities from the given data and use them to predict the values for the unknown experiments. We often use models to make these predictions, but by understanding how these models calculate probabilities we will be able to use them more effectively.
