All Posts by Ani Rud


Jan 15

7 Professions That Leverage Big Data

By Ani Rud | Analytics Career

• Statistician – A statistician is someone who works with theoretical or applied statistics. The profession exists in both the private and public sectors. It is common to combine statistical knowledge with expertise in other subjects. Some of the responsibilities of statisticians include:

    • Apply statistical theories and methods to solve practical problems of various industries.
    • Determine methods for finding and collecting data.
    • Design surveys or experiments to collect data.
    • Analyze and interpret data and create reports based on the analysis.
• Actuary – Actuaries are among the highest-paid professionals in the insurance industry. An actuary is a business professional who analyzes the financial consequences of risk. Actuaries use mathematics, statistics, and financial theory to study uncertain future events, especially those of concern to insurance and pension programs.

    They quantify the financial impact of risk, price insurance products based on statistical analysis, and help establish adequate reserves so companies remain solvent.

    On a day-to-day basis, they often use Excel or other software to analyze big data, or work with senior-level staff to establish a risk-averse business strategy.

• Bioinformatician – Bioinformatics is a field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, it combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data.

    Bioinformaticians use computational tools to gather and analyze data in fields such as population biology, genetics, and pharmaceutical development. It is an interdisciplinary approach that uses data collection and modeling to analyze biological data. Bioinformaticians create mathematical models, develop dynamic simulations, and perform pattern analyses of biological systems. They are also known as biostatisticians, biometricians, and computational biologists. Related careers include biochemists, biophysicists, and medical scientists.

    Bioinformaticians work within different medical science and health fields, including biology, genetics, proteomics, and pharmaceuticals. Some professionals come from a biomedical research background while others specialize in computational tools.

• Financial Engineering – Financial engineering draws on tools from applied mathematics, computer science, statistics, and economic theory. In the broadest definition, anyone who uses technical tools in finance could be called a financial engineer, for example a computer programmer in a bank or a statistician in a government economic bureau.

    • They are generally responsible for building financial tools: model design, data preparation, model development, implementation, documentation, and ongoing monitoring and refinement of existing models.
    • The financial engineer uses different computer programming languages, mathematics, and statistical methods to design financial models.
    • These engineers generally design software applications that can predict price movements, forecast losses, perform competitor analysis, incorporate macroeconomic factors into a portfolio, and help understand customers' pricing sensitivity.
    • They also conduct advanced statistical analyses to identify suspicious activity patterns, industry risks, or complex, critical issues concerning trends associated with money laundering or other financial crimes, and assist with simulation and analysis for model development or ad-hoc analysis.
• Social Statistics – Social statistics is the use of statistical measurement systems to study human behavior in a social environment. This can be accomplished by polling a group of people, by evaluating a subset of data obtained about a group of people, or by observation and statistical analysis of a set of data that relates to people and their behaviors. See more on Wikipedia.

• Operations Research – A discipline that deals with the application of advanced analytical methods to help make better decisions. Employing techniques from other mathematical sciences, such as mathematical modeling, statistical analysis, and mathematical optimization, operations research arrives at optimal or near-optimal solutions to complex decision-making problems.

    Because of its emphasis on human-technology interaction and because of its focus on practical applications, operations research has overlap with other disciplines, notably industrial engineering and operations management, and draws on psychology and organization science. Operations research is often concerned with determining the maximum (of profit, performance, or yield) or minimum (of loss, risk, or cost) of some real-world objective. Originating in military efforts before World War II, its techniques have grown to concern problems in a variety of industries.

• Epidemiologist – Epidemiologists are public health professionals who investigate patterns and causes of disease and injury in humans. They seek to reduce the risk and occurrence of negative health outcomes through research, community education, and health policy.

    Field epidemiologists are scientists who study the spread of infectious diseases with the goals of containing the current outbreak and preventing future recurrences. They conduct and interpret analyses, develop appropriate statistical models, manipulate complex databases, and track and evaluate patterns of care and outcomes.

    Jan 01

    Start Solving Kaggle Problem With R: One Hour Tutorial

    By Ani Rud | Tutorials

Once I came across the question, “How can I learn R in a day?” Though it sounds like an impossible task, you can surely gain a basic understanding of the tool in a very short time. Interestingly, R has an easy learning curve at the beginning, but once you proceed to advanced topics the curve gets steeper and steeper, partly because of the statistical knowledge required.

This post is for those who want to gain an initial understanding of R in a short span of time with some hands-on tutorials. I’ll start with a Kaggle problem. Kaggle is one of the many places to find interesting problems in data science. As you have decided to learn R, you probably already know that it is a free and powerful statistical programming language. For this tutorial I’ll use the R console. You may install R here if you have not already done so.

Once you have R installed, you can download the ‘test’ and ‘train’ data files for the competition Titanic: Machine Learning from Disaster, and you are all set to begin.

As I’m a strong believer in the role of questions in learning and thinking, I’ve tried to follow a question-and-answer format in this post. I’m sure many more questions will arise in your mind, and I’ll be happy to answer them in the comments section. We’ll start with 3 basic questions which I believe you should ask at the beginning of each Kaggle problem:

    1. What is being predicted?
    2. What data do we have?
    3. What’s the prime motive or behavior in play here?

    The answers:

We are predicting the fate of all the passengers who were aboard the ocean liner ‘Titanic’ on its fatal maiden voyage. In predictive analytics terminology, we are predicting the variable ‘Survived’, which is also called the target variable.

The train data consists of details of the passengers, each row holding several pieces of information about a passenger, such as class (which denotes socio-economic status, not the class of the journey), sex, age, and, most importantly, whether the passenger survived. In the test data the value of the variable Survived is not given, and we have to predict it.

The prime motive or behavior in play is that, since there is a shortage of lifeboats, a pecking order will decide who gets access to them first. Our instinct says women and children will be the first ones to be saved, followed by the elderly. Somewhere in between, the influential people in society will cut in (like Hockley in the movie). Let’s find out with our analysis.

    We’ll begin the analysis by asking a few questions about the data-

    How to read the data in R?

    # store the path to current directory in a variable cur_dir

    cur_dir <- "location of the folder where you stored the train and test files"

    # set the variable cur_dir as your working directory using setwd command

    setwd(cur_dir)

# read the train data and store it in the train object using the read.csv command

# stringsAsFactors is a logical argument (TRUE by default in older versions of R)

# by making it FALSE we indicate strings shouldn’t be treated as factor variables.

    train <- read.csv("train.csv", stringsAsFactors=FALSE)

We’ll discuss the stringsAsFactors argument later. For now, just note that it was TRUE by default in the R versions this post was written for (the default changed to FALSE in R 4.0), and we set it explicitly to FALSE to avoid converting all strings to factors.

    Let’s try finding answers to the next set of questions:

    1. How to summarize the data to understand it properly?
    2. What are the values in the Target Variable?
    3. What are the most important predictors influencing the target variable?
4. Based on the behavior of the data identified earlier, what predictors can be found?
5. What are the characteristics of the predictor variables: can they predict the target variable independently, or should they be combined with other predictors?

    In order to summarize the data we will use the str function and the output will be as below:

    str(train)

[Image: output of str(train)]

Let’s start with the data types here. The ‘int’ type indicates integer variables, which can only hold whole-number values. ‘num’ indicates numeric variables, which can take decimal values. Factors are like categories. By default, all text is imported as factors, but if we specify stringsAsFactors=FALSE, as we did in the read.csv call, text is imported as ‘chr’ (a character string) rather than as a factor.
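If you want to check these types yourself, here’s a quick aside (a small sketch; the exact labels below assume the file was read with stringsAsFactors=FALSE, as we did above):

# check how R stores a few of the columns
class(train$Survived)   # "integer" - whole numbers only
class(train$Fare)       # "numeric" - can hold decimals
class(train$Name)       # "character" - would be "factor" with the default stringsAsFactors=TRUE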

The str function also gives us the number of observations in the data as well as the number of variables, and it shows the first few values of each column, as you can see here.
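As a small aside, a few related base R commands give you the same information in other forms:

# number of rows and columns in the data frame
dim(train)

# first six rows of the data
head(train)

# per-column summary statistics
summary(train)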

To find out what values the target variable can take, we need to check the unique values in the ‘Survived’ column.

We can access the Survived column using the dollar sign, as train$Survived (note that R is case-sensitive, so the capital S matters).

train$Survived will give you a vector of all the values present in the column. To find the unique values present in the column we use the unique function:

unique(train$Survived)

    will give us the output as:
    [1] 0 1

This means it is a binary classification problem, where the target variable can take only two values: 0 and 1.

The unique function gives us the distinct values in a column, but what if we want to know how many values of each type are present? For that we need the table function.

    table(train$Survived)

gives us the output below:

  0   1
549 342

    Now, we know the number of 0’s and 1’s in the data.

The next question is to find the predictors. From the list of variables available in the summary, which do you think is a likely predictor? Intuitively, we can say that Age and Sex should be likely predictors, as women and children are given preference in rescue operations during a tragedy. But how shall we check this assumption? We can do that by just adding another column to our table function.

table(train$Survived, train$Sex)

    female male
  0     81  468
  1    233  109

Now we have the number of males and females who survived the tragedy. It will be better if we can see this as a proportion. That can be done using the prop.table function, which converts the counts into proportions:

    prop.table(table(train$Survived, train$Sex))
        female       male
  0 0.09090909 0.52525253
  1 0.26150393 0.12233446

We found the proportions, but these are proportions of the complete table. We would rather know what proportion of males and what proportion of females survived, and for that we need column-wise proportions. To get column-wise proportions, we pass 2 as a second argument to the function:

    prop.table(table(train$Survived, train$Sex),2)

       female      male
  0 0.2579618 0.8110919
  1 0.7420382 0.1889081

Now we have a more meaningful representation: 74.2% of the females survived the disaster compared to 18.89% of the males. So our initial assumption was correct. Let’s now check the Age variable. If you use the unique function on the column you will find there are 89 values for age. It will be wise to reduce the number of values so as to get an easier comparison.
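You can verify that count yourself; note that unique also counts NA (missing ages) as one of the values:

# number of distinct values in the Age column, including NA
length(unique(train$Age))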

Let’s consider the following age groups: 0-12, 13-18, 19-35 and 35+. This classification is not based on any strict reasoning; it’s just for an initial analysis. Later on we’ll find better age brackets with proper techniques. We’ll now use the bracket operator, which helps us select the values in a column that meet a certain condition, and we’ll create a new variable Age_grp to store the age group values:

Here’s how we create the Age_grp variable:

    train$Age_grp[train$Age<13] <- "grp1"

    train$Age_grp[train$Age>12 & train$Age<19] <- "grp2"

    train$Age_grp[train$Age>18 & train$Age<36] <- "grp3"

    train$Age_grp[train$Age>35] <- "grp4"

Now, if we use prop.table with our Age_grp values, we’ll get new insights about the data:
    prop.table(table(train$Survived, train$Age_grp),2)

         grp1      grp2      grp3      grp4
  0 0.4202899 0.5714286 0.6173184 0.6175115
  1 0.5797101 0.4285714 0.3826816 0.3824885

As you can see, 57.97% of young children (grp1) and 42.86% of teenagers (grp2) survived, compared to about 38% of the adults. So, apparently, some effort went into saving children.
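One caveat, as a small aside: the Age column contains missing values, and rows with a missing Age are never matched by the bracket conditions above, so their Age_grp stays NA and they are left out of the proportion table. You can check how many such rows there are:

# passengers with no recorded age end up with NA in Age_grp
sum(is.na(train$Age))

# include the NA group in the counts to see how many rows were left out
table(train$Age_grp, useNA="ifany")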

    Now, before we proceed to the other questions, based on our analysis so far we will create a submission file and upload it to Kaggle to see our score. Here’s how we’ll make our prediction for the submission:

We predict that all females survived and that all males below the age of 12 survived.

We’ll first read the test file, add a Survived column filled with our predictions, and write the file out again.

test <- read.csv("test.csv")

# first we create the Survived column and set all its values to 0

    test$Survived <-0

# then we set the value to 1 for the passengers we predict survived

test$Survived[test$Sex=="female"] <- 1

test$Survived[test$Sex=="male" & test$Age<12] <- 1

# now we’ll write the data stored in the test object back to test.csv in the current directory

write.csv(test, file="test.csv")

As a last step before submission we have to create the submit object and write it to a submission file as below:

submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)

write.csv(submit, file = "first_submission.csv", row.names = FALSE)

If we now go to the Make a Submission link for the Titanic tutorial, we’ll see a submission area like below:

[Image: the Kaggle “Make a Submission” page]

We just have to click on the highlighted button, select the first_submission.csv file from our current directory, and upload the file. Then click on the submit button as indicated and you’re done.

[Image: the submission result showing the score]

You’ve just scored 0.77033 and left behind almost a thousand competitors by predicting the obvious. Congratulations! In the next tutorials, as we fine-tune our models, we will see a huge jump in our ranking.

Now, coming to the next question: what other predictors can we single out based on the behavior of the data? In data mining you’ll often find that algorithms mine the data to find the behavior, but in data science you can use your judgement to identify certain predictors very easily, and this saves a lot of effort when there is a large number of variables that could be predictors.

Now, based on our understanding of society, we know that people with substantial social influence have a better shot at accessing a lifeboat. How can we identify such people? They probably purchased tickets at a higher price and belonged to a higher class. Let’s test this assumption.

    In order to test this assumption, I’ll introduce an amazing function called aggregate. Let’s find out how it can be used to analyze the Pclass and Fare variables.

    aggregate(Survived~Pclass, data=train, mean)
  Pclass  Survived
1      1 0.6296296
2      2 0.4728261
3      3 0.2423625

In the above call I’ve used the mean function to find the fraction of people who survived. Since the Survived variable only has values 0 and 1, the mean adds up all the 1’s and divides by the total number of values (the count of 0’s and 1’s), which is exactly the fraction of people who survived. You can get a detailed understanding of the aggregate function here.
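To convince yourself that the mean of a 0/1 variable is just a proportion, here’s a tiny sanity check (the second line reuses the counts we saw earlier with the table function):

# two 1's out of four values gives a mean of 0.5
mean(c(0, 1, 1, 0))

# overall survival rate in train, the same as 342 / (549 + 342)
mean(train$Survived)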

It’s apparent that people in class 3 had a much lower survival rate, 24.2%, compared to the 62.96% survival rate in class 1. Let’s now check the Fare variable. But there must be many distinct fares, which might produce a long list of aggregate values. Let’s find out:

    length(unique(train$Fare))
[1] 248

So we have 248 values. Let’s classify them into more meaningful categories of high and low fares. To do that, we’ll plot a histogram to see how the values are distributed.

    hist(train$Fare)
[Image: histogram of train$Fare]

We can see that most of the passengers paid a fare below 50. Only a small fraction paid above 500, and some paid between 200 and 500. So we can heuristically assign the following fare categories:

    0 to 50, 50 to 100, 100 to 150, 150 to 500 and 500+

    We’ll introduce another variable called Fare_type and apply the aggregate function on it:

    train$Fare_type[train$Fare<50]<-"low"

    train$Fare_type[train$Fare>50 & train$Fare<=100]<-"med1"

    train$Fare_type[train$Fare>100 & train$Fare<=150]<-"med2"

    train$Fare_type[train$Fare>150 & train$Fare<=500]<-"high"

    train$Fare_type[train$Fare>500]<-"vhigh"

    # now applying the aggregate function on Fare_type

    aggregate(Survived~Fare_type, data=train,mean)

  Fare_type  Survived
1      high 0.6538462
2       low 0.3191781
3      med1 0.6542056
4      med2 0.7916667
5     vhigh 1.0000000

As you can see, all the people who paid very high fares survived, compared to only 31.9% of those who paid low fares.

Using the methods above, you can try all the variables and see for yourself how you can improve the accuracy of your prediction.
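As a starting hint for that exploration (a sketch that goes slightly beyond what we did above), the aggregate formula can take more than one grouping variable, which also speaks to the earlier question of whether predictors should be combined:

# survival rate broken down by passenger class and sex together
aggregate(Survived ~ Pclass + Sex, data=train, mean)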

In the next post we’ll discuss the various models that can be applied to this problem for better predictions.

I’d seriously suggest you work through these steps on your own as we move forward with this tutorial, to get first-hand experience of the programming and modeling. Have fun with R and Kaggle.

    Dec 09

    First Step in Data Analytics: Defining the Business Objective

    By Ani Rud | Data Analysis

Each industry has its own set of problems, and CXOs look for specific solutions in analytics. That’s why analytics is becoming more popular every day, and every company is trying to hire the best talent from the market. The reason there is a scarcity of analytics talent is that, until recently, there wasn’t enough training available in this field to provide all the necessary skills. But before telling you about the necessary skills, I’ll fast-forward a little and tell you about the problems that you need to solve as an analytics professional. Mind you, sometimes the problem statements aren’t that simple, and it can take a few days to a few weeks to define them clearly. But to start with, I’ll demonstrate a few simpler ones.

The steps in the analytics problem-solving process are listed here. Some analytics projects may use only a subset of these steps, but I’ve explained the whole journey so that you know where you’re going with this. And with experience, you’ll know when to skip a step!
The checklist of steps in analytics is:

    1. Defining the goals
    2. Explore the data
    3. Analyze the data
    4. Test alternatives
    5. Find the best solution- Optimize
    6. Implement the decision
    7. Monitor results from the decision and define new goals if necessary

    The first step in the business analytics process

If you are preparing for an interview, this is a very important step to understand in order to make a good first impression, because the first thing you’re asked about a business case is to define the goal. In some cases you’ll be given the goal, and even then you have to analyze it to arrive at a problem statement that analytics can address. Remember to take a pause before stating the goal, because it will make or break the whole solution you create for the case. Without a goal you can’t give a proper solution.

In this post we will primarily focus on the first step, which involves understanding the problem. As Einstein famously said, “If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”

Just think for a while: what would the business like to improve, or what problem does it want solved?

If the goal is too big to analyze in one go, break it into smaller parts. Here’s an example of how.

Not long ago, a friend of mine wanted to try his luck in the restaurant business. Coming from an engineering background, he knew very little about the industry; all he knew came from his own visits to more than a hundred restaurants in the city and the opinions of his friends. But what he did next was quite amazing, and that’s why I’ve decided to use his business as a case study to explain analytics in a few of the early posts.

    His goal was simple- to open a successful restaurant in Bangalore.

But wait. What does ‘successful’ mean here? The first step is to define the business objectives clearly. According to him, the success he was looking for in business terms is listed below (aside from other parameters like financial freedom, happiness, job satisfaction, etc.):
- Select the best location for the restaurant
- Create a menu with the most popular items
- A lot of word-of-mouth publicity for his restaurant
- A large fan following on social media
- A handsome revenue from restaurant operations
- Repeat customers
- Substantial revenue from online ordering

Now that he knows what he really wants to achieve in his business, he wanted to apply some analytics techniques he had used in his previous job. For that he has to define a set of analytics goals. But analytics goals are different from the qualitative goals stated above: analytics goals are measurable. In order to apply analytics we have to break the goals above down into measurable ones. Here’s how we do it:

• Go through each business objective and, if necessary, rephrase it using terms that are quantifiable.
• Once you have a list of measurable items, think about which ones can be measured directly.
• The ones that are complex to measure should be broken down into simpler measurable goals.
• Prepare a final list of measurable analytics goals.

If you look at the goals carefully, you’ll find there is always something to be measured and some action specific to the measurement. So let’s first find out what’s measurable in each of the above goals.

- Best location – The best location for a restaurant can be any area that a lot of people visit during the day or night. We can break it down further: what age group are we looking for, what income group are we targeting, what type of restaurants are already present in the vicinity, can we reach the various residential areas easily for online ordering, and so on. All these questions ultimately answer one question: which location will maximize visitors to our restaurant?
Based on the above, let’s say we will identify the best location as one that will attract a footfall of 2000 customers per month to our restaurant.

- Create a suitable menu – A suitable menu consists of items that most visitors will like. It shouldn’t be so long that it burdens the restaurant’s inventory, nor so short that visitors can’t find a dish they want. The restaurant should also establish a positioning around the kind of dishes it offers best, whether fast food, Chinese, ethnic Indian dishes, European cuisine, etc. At the same time it should offer some of the dishes commonly ordered by customers across all restaurants, like rice, paratha, etc. Once all of this is taken into account, the menu has to be revisited over time to learn customers’ likes and dislikes, and additions and alterations can be made on the basis of feedback.

For the time being, though, it’s best to find suitable food, based on demography, target customers, availability, expertise, and so on, that will attract the maximum number of customers. A measurable goal in this regard can be the answers to the customer feedback questions “Do you like the dishes offered in our menu?” and “Would you like more items to be added to the list?” The second question should be asked only for the initial two months, after which the restaurant can establish its positioning and run analytics on items ordered and on answers to the first question to modify the menu; otherwise it will find it too difficult to include items catering to everyone’s taste. An effective analytics goal in this case is 90% positive feedback on the menu after the initial two-month experimentation phase is over.

- Word-of-mouth publicity – First we have to analyze the number of customers who have ‘liked’ the pages of similar restaurants in the city. Based on that we can set a specific number, say 10,000 likes, that we want to achieve within a year. The number of customers who come through a referral is also a good measure of word of mouth; let’s say we keep a target of 2000 referred customers for the first year.

- A large fan following on social media – This again can be measured from customer review forms and social media pages. If customers answer the question “How did you come to know about us?” with social media or the internet, it means the social media presence is significant. Here we will keep the same target of 10,000 likes, but we will revise it based on the customer feedback we receive.

- A handsome revenue from restaurant operations – We can keep the revenue target at 1 crore for the first year, based on a 30% discount compared with similar restaurants in the area.

- Repeat customers – We want 20% of customers to visit again within a period of 6 months. This can be tracked using the customer feedback form: “Have you visited within the past 6 months? If not, will you visit us again within the next 6 months?”

- Substantial revenue from online ordering – We want revenue of at least 30 lacs from online ordering, again based on similar restaurants in the area.

So our final analytical goals will look something like this:

• A location that will attract 3000 customers every month on average
• 90% positive feedback on the menu items
• 1 crore of total sales
• 30 lacs of online ordering over internet and phone
• 10,000 likes on social media pages
• 20% repeat customers in any six-month period
• 2000 customers visiting through referrals

This concludes the identification of goals for solving an analytics problem. In many real-world scenarios the goals are much more complicated and the assumptions are much harder to incorporate. We will see how to deal with such situations in more advanced discussions. In the next post we will see how we can gather data from various sources to explore and analyze, so that proper decisions can be taken to achieve the above goals.