## How to Become A Data Scientist

By ganpati | Getting Started

Over the past few years, the role of the predictive modeler has broadened and attracted so much attention that it has been called the sexiest job of the century. The job of a Data Scientist, as the role is now called, is rewarding in terms of both salary and recognition.

Data Scientists are now required to possess **not just statistical and coding knowledge but also an understanding of the industry and its products**. This post gives a bird’s-eye view of the topics to cover, along with related resources for building expertise through hands-on tutorials. A separate post is dedicated to the thought process and traits of a data scientist. If you’re self-studying data science, you should cover all the topics listed below, plus other relevant topics from the free ebooks available for data science.

**Also read What is Business Analytics**

You may also like the infographic on Data Scientist’s resources

# Data Science Prerequisites

## Mathematics and Statistics:

### Probability

- Random variables and expectations
- Probability mass functions
- Probability density functions
- Expected values
- Expected values for PDFs
- Independent and dependent variables
- Conditional Probability
- Bayes’ rule
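As a small illustration of conditional probability and Bayes’ rule, here is a minimal Python sketch. The function name `bayes_posterior` and the numbers are made up for illustration: a hypothetical event with a 1% base rate and a test that flags it 95% of the time, with a 5% false positive rate.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(H | E) given P(H), P(E | H) and P(E | not H)."""
    # Law of total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    # Bayes' rule: P(H|E) = P(E|H) P(H) / P(E)
    return likelihood * prior / evidence


# Hypothetical inputs, chosen only to show the effect of a low base rate.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low because the event is rare to begin with, which is exactly the kind of result intuition tends to get wrong.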

### Basic Statistics

You may go through the video tutorials available from SAS for the following topics in basic statistics:

- Distribution Analysis
- One-Way Frequency Analysis
- Table Analysis
- Correlation Analysis
- Sample Tests (One-Sample, Two-sample, Paired t Tests)
- ANOVA (One-Way, N-way, Nonparametric One-Way)
- Linear Regression (Simple, Multiple)
- Analysis of Covariance
- Binary Logistic Regression
- Generalized Linear Models
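To make one of the topics above concrete, correlation analysis can be sketched in a few lines of plain Python. This is only an illustration: the helper `pearson_correlation` and the study-hours data are hypothetical, and in practice you would likely use a statistics package instead.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
r = pearson_correlation(hours, scores)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship; it says nothing about causation.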

The topics below are the most important for analyzing data statistically. A detailed discussion of these topics is available in the book Think Stats (free PDF) and in the MOOC Design and Interpretation of Clinical Trials on Coursera.

### Statistical Approach

In our day-to-day life we come across a lot of **anecdotal evidence**, which is based on data that is unpublished and usually personal. In a professional setup, however, we need more persuasive evidence, and anecdotal evidence usually fails because of a small number of observations, selection bias, confirmation bias and inaccuracy. A statistical approach helps us overcome these shortcomings and make convincing arguments to different stakeholders in the business.

### Descriptive Statistics

Means and averages, Variance, distributions and various visualization techniques.
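A quick sketch of these summaries using Python’s standard `statistics` module. The sample values are made up for illustration.

```python
import statistics

values = [12, 15, 11, 19, 15, 22, 15, 18]  # hypothetical sample

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(values),        # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `statistics.variance` is the sample variance; `statistics.pvariance` gives the population version.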

### Experimental Design and Associated Methods

The simplest example of a statistical experiment is a coin toss. There is more than one possible outcome, each possible outcome (i.e., heads or tails) can be listed in advance, and there is an element of chance because the outcome is uncertain.
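The coin toss is easy to simulate. The sketch below assumes a fair coin and uses a fixed seed so the run is repeatable; the function name `simulate_coin_tosses` is just an illustrative choice.

```python
import random


def simulate_coin_tosses(n_tosses, seed=42):
    """Toss a fair coin n_tosses times and return the share of heads."""
    rng = random.Random(seed)  # seeded generator so results are reproducible
    outcomes = [rng.choice(["heads", "tails"]) for _ in range(n_tosses)]
    return outcomes.count("heads") / n_tosses


# The share of heads tends to settle near 0.5 as the number of tosses grows.
for n in (10, 100, 10_000):
    print(n, simulate_coin_tosses(n))
```

With only a handful of tosses the observed share can be far from 0.5, which is one reason small samples make weak evidence.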

Some practical areas of significance are marketing campaigns and clinical trials. Results from randomized clinical trials are usually considered the highest level of evidence for determining whether a treatment is effective.

Marketers, for their part, use experimental design to increase the number of variables tested in a single campaign (product offers, messages, incentives, mail formats and so on) and to test multiple offers in the market simultaneously. They learn exactly which variables entice consumers to act. As a result, response rates rise dramatically, the effectiveness of future campaigns improves and the overall return on spending increases.

Deriving insights from data requires an understanding of proper experimental design. You’ll need to be able to interpret when causal effects are significant, and be able to work with more complicated setups involving sequential, overlapping or multiple experiments.

Associated analysis methods, ranging from t-tests to ANOVA, are useful for making inferences from these experiments.

Many questions can’t be answered by looking at existing data alone, especially questions such as “How does this new feature impact latency, conversion or CTR?” Answering them calls for an experiment.
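A question about conversion, for example, is typically settled with an A/B test. The sketch below uses a two-proportion z-test, which is one common choice among several and relies on a normal approximation that assumes reasonably large samples. The function name and the visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up experiment: control (A) vs. new feature (B), 5000 visitors each
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone, provided the visitors were randomly assigned to the two groups.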

### Quasi-experiments or observational studies

These are more difficult than experiments to analyze, and require a very strong understanding of assumptions and causal models in order to draw convincing conclusions from them. Not everything can be randomized, so you’ll often have to face these observational studies or quasi-experiments.

### Time Series

Most of the BI metrics we’ll discuss throughout the tutorials come as time series. With time series analysis you can use past data to forecast the future.
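As a minimal sketch of forecasting from past data, the snippet below predicts the next value as the average of the most recent observations. A moving average is only a simple baseline, not a full time series model, and the daily sign-up numbers are hypothetical.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


daily_signups = [120, 132, 128, 141, 150, 147, 158]  # made-up metric
forecast = moving_average_forecast(daily_signups, window=3)
print(round(forecast, 1))  # 151.7
```

A larger window smooths out noise but reacts more slowly to a genuine change in trend.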

### Linear modeling

Linear modeling is important and common in regression and classification exercises. It gives you a good framework for combining information from covariates to predict an outcome or class.
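The simplest case, one covariate and one outcome, can be fitted with the closed-form least-squares formulas. The sketch below is illustrative only: the function name and the ad-spend data are made up, and real projects would normally use a statistics library.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: ad spend vs. revenue (arbitrary units)
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear_regression(ad_spend, revenue)
predicted = intercept + slope * 6.0  # predict the outcome for an unseen covariate value
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```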

**Some other statistical topics of interest are**: Survey analysis, Causal inference, Bayesian data analysis, Nonparametric methods etc.

### Statistical Methods used in Machine Learning

For classification problems discriminant analysis seems to be used a lot by data scientists.

Techniques such as clustering, trees, etc. also help with regression, classification, and more complicated modeling questions such as understanding text, pictures, and audio.

## Tools (one or more of the tools below):

### Learning Python programming:

Step-by-step guide to learning Python

### Learning R programming:

Why is R a language of choice for data scientists?

Here is a step by step guide with 12 tutorials that help you learn Data Science with R

Here is a trove of learning materials from IDRE, UCLA

### Learning SAS:

Visit the SAS website for tutorials

## Database Tools:

As a data scientist you must know how data is stored, queried, managed, handled, cleaned, modelled and reported, though you may not need to create data pipelines or manage data warehouses.

As a student you’ll most often work with text files or CSVs to access and store data. But once you join the industry, you’ll be required to access databases to fetch your data for analysis. **MySQL, MongoDB and Cassandra** are a few of the names you’ll encounter. It’s important that you know at least the basics of querying data and other concepts of relational databases (data models etc.), which will make it easier to work in a real-life setup.
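To get a feel for basic querying, here is a small sketch using SQLite through Python’s built-in `sqlite3` module. SQLite stands in for the databases named above only because it needs no server; the `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 45.0)],
)

# Aggregate query: number of orders and total spend per customer
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

The same `SELECT ... GROUP BY` pattern carries over to MySQL and other relational databases with little or no change.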

## Mastering Machine Learning Algorithms

Once you have gained a basic understanding of the statistical topics and tools you should focus on various techniques used in machine learning. There are 3 broad types of learning: Supervised Learning, Unsupervised Learning and Semi-Supervised Learning. Below is a list of various types of algorithms that you should learn for Data Science applications:

- Decision Tree Algorithms (Conditional Decision Trees, CART etc.)
- Bayesian Algorithms (Naive Bayes etc.)
- Clustering Algorithms (k-means, Hierarchical Clustering etc.)
- Artificial Neural Network Algorithms (Perceptron etc.)
- Deep Learning Algorithms (Deep Belief Networks etc.)
- Dimensionality Reduction Algorithms (PCA, PCR etc.)
- Regression Algorithms (Linear Regression, Logistic Regression etc.)
- Instance-based Algorithms (KNN etc.)
- Regularization Algorithms (Ridge Regression etc.)
- Ensemble Algorithms (Random Forest, GBM etc.)
- Other Algorithms
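To give a flavor of one entry from the list, here is a minimal sketch of an instance-based algorithm, k-nearest neighbours (KNN), in plain Python. The function name, the points and the labels are made up for illustration; a library implementation would be the usual choice in practice.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query point
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Two well-separated, hypothetical groups of 2-D points
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
print(prediction)  # low
```

This is supervised learning: the labels are known for the training points and predicted for new ones.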

Here’s an excellent blog post that takes you on a tour of machine learning algorithms.

## Data Cleaning, Wrangling, Visualization

Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated.

Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
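A tiny sketch of what cleaning can look like in code: trimming and normalizing text, dropping records with missing or improperly formatted fields, and removing duplicates. The records and the cleaning rules are hypothetical, chosen only to show each kind of problem once.

```python
raw_records = [
    {"name": " Alice ", "email": "ALICE@EXAMPLE.COM", "age": "34"},
    {"name": "Bob", "email": "bob@example.com", "age": ""},            # incomplete
    {"name": "alice", "email": "alice@example.com ", "age": "34"},     # duplicate
    {"name": "Carol", "email": "carol@example.com", "age": "forty"},   # bad format
]


def clean(records):
    """Normalize fields, drop unusable rows and de-duplicate by email."""
    cleaned, seen_emails = [], set()
    for rec in records:
        email = rec["email"].strip().lower()
        age_text = rec["age"].strip()
        if not age_text.isdigit():
            continue  # missing or non-numeric age
        if email in seen_emails:
            continue  # same person already kept
        seen_emails.add(email)
        cleaned.append(
            {"name": rec["name"].strip().title(), "email": email, "age": int(age_text)}
        )
    return cleaned


cleaned = clean(raw_records)
print(cleaned)
```

Whether to drop or repair a bad record is a judgment call that depends on the analysis; this sketch simply drops them.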

Any advanced tool for data analysis will help you in cleaning, wrangling and visualization.

ggplot2 is a powerful tool in R that comes in handy for Data Scientists. Here’s a cheatsheet on ggplot2.

## Reporting and Presentation

Once you have completed your analysis and reached your conclusions, it’s time to convince your stakeholders of your ideas. Reporting and presentation techniques will help you reach out to different stakeholders with your ideas.

## Industry Metrics

Engagement / retention rate, churn, conversion, cancellation, Competitive products / duplicates matching, how to measure them, spam detection, behavioral traits.

Cost functions: Log-loss, other entropy-based, DCG/NDCG, etc.
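Two of these cost functions can be sketched in plain Python. The code below uses the binary form of log-loss and the common `rel / log2(rank + 1)` form of DCG; other variants exist (for example, a gain of `2**rel - 1`), so treat this as one convention rather than the definitive one. All inputs are made up.

```python
from math import log, log2


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log-loss (cross-entropy) averaged over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(round(log_loss([1, 0, 1], [0.9, 0.2, 0.6]), 4))
print(round(ndcg([3, 1, 2]), 4))
```

Lower log-loss is better, while NDCG ranges from 0 to 1 with 1 meaning the ranking is already in the ideal order.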

## Other Useful Topics

### A Grasp of End-to-End Development

The things that Data Scientists build (graphs, charts, reports, analyses and presentations) are for the consumption of product managers and other decision makers across the company. These ideas need to be integrated into other systems that are either already functional or will be introduced in the future. So, in order to make practical suggestions to the decision makers, it’s important to understand the end-to-end processes in your company and how things work.

### Topics related to Computer Science

- Theory of computation / Analysis of algorithms
- Data structures and algorithms
- Software engineering
- Parallel programming / Massive computation (for processing huge datasets)
- Network analysis

### Develop Your Intuition

Data Science is a huge subject area, and even if you master a lot of tools and techniques, you’ll still need a good deal of intuition and rational guesswork to complement the scientific knowledge. That’s because in a practical scenario you’ll most often work with inadequate data, inefficient systems that may not allow you to process data above a certain limit, strict timelines and so on. Working on some real time puzzles and guesstimation problems will help you develop your own thought process for tackling such situations.

### Familiarity with Big Data Tools and Frameworks

When working with data that runs into petabytes, you may need distributed processing, as such volumes can’t be processed by a single machine. The volume, velocity and variety of the data require a different approach from the one you would apply to ordinary data. Familiarity with Big Data tools and frameworks such as Hadoop, MapReduce and Apache Spark may prove to be a plus, depending on which company you’re aiming to work for.
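The MapReduce idea can be illustrated on a single machine with the classic word count. This is only a toy sketch of the map, shuffle and reduce phases in plain Python; it does not use Hadoop or Spark, and the function names and documents are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data needs big tools", "data tools scale"]
# In a real cluster each document (or chunk) would be mapped on a different machine.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each map call depends only on its own document, the work can be spread across many machines, which is what makes the pattern suitable for very large datasets.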

### Economics

A few nice topics to cover are: Behavioral economics, Game theory, Auction design

## Free Online Courses

I’ve listed here some highly recommended courses that are available online for free. If you are looking for a thorough analysis of different online courses please visit the post on analysis of 12 online data science courses.

### Data Science

**CS109 Data Science @ Harvard**

Estimated effort: 120 to 200 hours

Tool used: Python, d3

Link to course

**Introduction to Data Science @ COURSERA**

Estimated effort: 100 to 140 hours

Certificate for $49

Tool used: Python, R, SQL

Link to course

### Data Analysis Courses Online

**Data Analysis: Take It to the MAX() @ EDX**

Estimated effort: 30 to 50 hours

Certificate for $50

Tool used: MS-Excel, python

Link to course

**The Analytics Edge @ EDX**

Data Analysis & Statistics

Estimated effort: 120 to 180 hours

Certificate for $100

Tool used: R

Link to course

### Machine Learning

**Learning From Data @ CALTECH**

Estimated effort: 120 to 140 hours

Tool used: Any tool

Link to course

**Machine Learning (by Andrew NG) @ COURSERA**

Machine Learning

Estimated effort: 80 to 130 hours

Certificate for $49

Tool used: Octave

Link to course

## Hands-on with Kaggle

Kaggle is the best platform for learning by doing. Whether you are a beginner, a learner, a professional or a maestro, Kaggle has something in store for you. Besides letting you work on real-life problems, it also helps you get noticed by prospective employers. But as with every other platform, the key to using Kaggle to your best advantage is forming the right team and choosing the right problems. We have started a community to help you get into the right team based on your expertise and interest. Just fill in this simple form and we’ll get in touch with you.

If you are just thinking of trying your hand at some of the Kaggle problems, the best advice is to start by asking some basic questions and exploring the data before moving on to advanced analysis.
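Some of those basic questions (how many rows are there, which columns have missing values, how balanced is the target) can be answered with a few lines of Python. The sketch below uses the standard `csv` module on a tiny, made-up dataset embedded as a string; the column names are hypothetical and not taken from any particular competition.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded competition file
sample_csv = """passenger_id,age,fare,survived
1,22,7.25,0
2,38,71.28,1
3,,7.92,1
4,35,53.10,1
5,,8.05,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

n_rows = len(rows)
# Count empty cells per column to spot missing data early
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
# Distribution of the target column
target_counts = Counter(r["survived"] for r in rows)

print("rows:", n_rows)
print("missing values:", missing)
print("target distribution:", dict(target_counts))
```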