
Dec 01

## How to Become A Data Scientist

Over the past few years the role of the predictive modeler has broadened and attracted so much attention that it has been called the sexiest job of the century. The job of a data scientist, as the role is now known, is rewarding in terms of both salary and recognition.

Data scientists are now required to possess not just statistical and coding knowledge but also an understanding of the industry and its products. This post gives a bird’s-eye view of the topics to cover, along with related resources for building expertise through hands-on tutorials. A separate post is dedicated to the thought process and traits of a data scientist. If you’re self-studying data science, you should cover all the topics listed below, plus other relevant topics from the free ebooks available for data science.

You may also like the infographic on Data Scientist’s resources

# Data Science Prerequisites

## Mathematics and Statistics:

### Probability

• Random variables and expectations
• Probability mass functions
• Probability density functions
• Expected values
• Expected values for PDFs
• Independent and dependent variables
• Conditional Probability
• Bayes’ rule
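As a small illustration of conditional probability and Bayes’ rule, here is a minimal sketch that computes the probability of a condition given a positive test. The function name and the numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are made up for illustration.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Law of total probability: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {posterior:.3f}")
```

With these assumed inputs the posterior is only about 0.16, because the condition is rare and false positives outnumber true positives.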

### Basic Statistics

You may go through the video tutorials available from SAS for the following topics in basic statistics:

• Distribution Analysis
• One-Way Frequency Analysis
• Table Analysis
• Correlation Analysis
• Sample Tests (One-Sample, Two-sample, Paired t Tests)
• ANOVA (One-Way, N-way, Nonparametric One-Way)
• Linear Regression (Simple, Multiple)
• Analysis of Covariance
• Binary Logistic Regression
• Generalized Linear Models

The topics below are the most important for analyzing data statistically. A detailed discussion of these topics is available in the book Think Stats (free PDF) and in the MOOC Design and Interpretation of Clinical Trials on Coursera.

### Statistical Approach

In day-to-day life we come across a lot of anecdotal evidence, which is based on data that is unpublished and usually personal. In a professional setup we need evidence that is more persuasive, and anecdotal evidence usually fails because of a small number of observations, selection bias, confirmation bias and inaccuracy. A statistical approach helps us overcome these shortcomings and make convincing arguments to different stakeholders in the business.

### Descriptive Statistics

Means and averages, Variance, distributions and various visualization techniques.
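The summary measures above can be computed with Python’s standard `statistics` module. The sketch below uses a made-up sample of daily order counts; the variable names are illustrative only.

```python
import statistics

# Hypothetical sample: daily order counts for one week.
orders = [12, 15, 11, 19, 15, 22, 18]

summary = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "variance": statistics.variance(orders),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(orders),        # square root of the sample variance
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```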

### Experimental Design and Associated Methods

The simplest example of a statistical experiment is a coin toss. There is more than one possible outcome, each possible outcome (i.e., heads or tails) can be listed in advance, and there is an element of chance because the outcome is uncertain.
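A coin toss is easy to simulate. The sketch below assumes a fair coin and uses a seeded random generator so that runs are repeatable; the function name `toss_coins` is hypothetical.

```python
import random


def toss_coins(n_tosses, seed=0):
    """Simulate n_tosses fair coin flips and return the share of heads."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


for n in (10, 100, 10_000):
    print(f"{n:>6} tosses: share of heads = {toss_coins(n):.3f}")
```

With few tosses the share of heads can sit far from 0.5; it tends to settle near 0.5 as the number of tosses grows.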

Some practical areas of significance are marketing campaigns and clinical trials. Results from randomized clinical trials are usually considered the highest level of evidence for determining whether a treatment is effective.

Marketers, on the other hand, use experimental design to increase the number of variables tested in a single campaign (product offers, messages, incentives, mail formats and so on) and to test multiple offers in the market simultaneously. They learn which variables entice consumers to act. As a result, response rates can rise, the effectiveness of future campaigns improves and overall return on spending increases.

Deriving insights from data requires an understanding of proper experimental design. You’ll need to be able to judge when causal effects are significant, and to work with more complicated setups involving sequential, overlapping or multiple experiments.

Associated analysis methods, from t-tests to ANOVAs, are useful for making inferences from these experiments.
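To make the t-test concrete, here is a minimal sketch of Welch’s two-sample t statistic using only the standard library. The sample values are invented, and the function stops at the statistic and degrees of freedom; turning them into a p-value needs the t distribution, which in practice you would take from a statistics package.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, df


# Hypothetical order values under two variants of a campaign.
control = [20.1, 22.4, 19.8, 21.0, 20.5, 23.1]
variant = [22.9, 24.0, 23.5, 21.8, 25.2, 24.4]
t_stat, df = welch_t(variant, control)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```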

Many questions can’t be answered with existing data alone and call for an experiment, especially questions like “How does this new feature impact latency, conversion or CTR?”
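One common way to analyze a conversion experiment is a two-proportion z-test. The sketch below is an illustration under assumptions: the counts are invented, the function name is hypothetical, and the normal approximation is only reasonable for fairly large samples.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control group vs. group that saw the new feature.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```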

### Quasi-experiments or observational studies

These are more difficult than experiments to analyze, and require a very strong understanding of assumptions and causal models in order to draw convincing conclusions from them. Not everything can be randomized, so you’ll often have to face these observational studies or quasi-experiments.

### Time Series

Most of the BI metrics we’ll discuss throughout the tutorials come as time series. With time series analysis you can use past data to forecast the future.
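As one simple example of forecasting from past data, the sketch below applies simple exponential smoothing to produce a one-step-ahead forecast. The series and the smoothing factor `alpha` are assumptions chosen for illustration; real metrics with trend or seasonality usually need richer models.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing."""
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]
    for value in series[1:]:
        # New level is a weighted blend of the latest observation and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical weekly active users (in thousands).
weekly_users = [120, 125, 123, 130, 128, 135]
forecast = exponential_smoothing_forecast(weekly_users, alpha=0.5)
print(f"forecast for next week: {forecast:.1f}")
```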

### Linear modeling

Linear modeling is important and common in regression and classification exercises. It gives you a good framework for combining information from covariates to predict an outcome or class.
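For a single covariate, the least squares fit has a closed form. The sketch below implements it directly; the data (ad spend versus sales) and the function name are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: ad spend (x) vs. sales (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_simple_linear_regression(spend, sales)
print(f"sales ~ {intercept:.2f} + {slope:.2f} * spend")
```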

Some other statistical topics of interest are: Survey analysis, Causal inference, Bayesian data analysis, Nonparametric methods etc.

### Statistical Methods used in Machine Learning

For classification problems, discriminant analysis is widely used by data scientists.

Techniques such as clustering, trees, etc. also help with regression, classification, and more complicated modeling questions such as understanding text, pictures, and audio.

## Tools (one or more of below tools):

• R
• Python
• SAS
• Matlab
• SPSS and many others

### Learning Python programming:

Step-by-step guide to learning Python

### Learning R programming:

Why is R a language of choice for data scientists?

Here is a step by step guide with 12 tutorials that help you learn Data Science with R

Here is a trove of learning materials from IDRE, UCLA

### Learning SAS:

Visit SAS website for tutorials

## Database Tools:

As a data scientist you must know how data is stored, queried, managed, handled, cleaned, modelled and reported, though you may not need to create data pipelines or manage data warehouses yourself.

As a student you’ll most often work with text files or CSVs to access and store data. But once you join the industry you’ll be required to fetch your data from databases before you can analyze it. MySQL, MongoDB and Cassandra are a few of the names you’ll encounter. It’s important that you know at least the basics of querying data, along with other concepts of relational databases (data models etc.), so that you can work comfortably in a real-life setup.
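To get a feel for querying a relational database, here is a small sketch using SQLite through Python’s built-in `sqlite3` module. SQLite is swapped in here only because it needs no server; the `orders` table and its rows are hypothetical, but the same kind of `SELECT ... GROUP BY` query carries over to MySQL and similar systems.

```python
import sqlite3

# In-memory database with a hypothetical `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 45.0)],
)

# Total spend per customer, largest first.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```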

## Mastering Machine Learning Algorithms

Once you have gained a basic understanding of the statistical topics and tools you should focus on various techniques used in machine learning. There are 3 broad types of learning: Supervised Learning, Unsupervised Learning and Semi-Supervised Learning. Below is a list of various types of algorithms that you should learn for Data Science applications:

• Decision Tree Algorithms (Conditional Decision Trees, CART etc.)
• Bayesian Algorithms (Naive Bayes etc.)
• Clustering Algorithms (k-means, Hierarchical Clustering etc.)
• Artificial Neural Network Algorithms (Perceptron etc.)
• Deep Learning Algorithms (Deep Belief Networks etc.)
• Dimensionality Reduction Algorithms (PCA, PCR etc.)
• Regression Algorithms (Linear Regression, Logistic Regression etc.)
• Instance-based Algorithms (KNN etc.)
• Regularization Algorithms (Ridge Regression etc.)
• Ensemble Algorithms (Random Forest, GBM etc.)
• Other Algorithms

Here’s an excellent blog post that takes you on a tour of machine learning algorithms.
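To give a flavor of one entry in the list, here is a minimal sketch of an instance-based method, k-nearest neighbors, written with the standard library only. The points, labels and function name are invented; a real project would normally use a tested library implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D points belonging to two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(points, labels, query=(1.1, 1.0)))
print(knn_predict(points, labels, query=(4.0, 4.0)))
```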

## Data Cleaning, Wrangling, Visualization

Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated.

Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

Any advanced tool for data analysis will help you in cleaning, wrangling and visualization.
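As a small illustration of cleaning, the sketch below trims whitespace, normalizes case, drops incomplete rows and removes duplicates from a list of records. The field names (`name`, `email`) and the choice to treat the email as the duplicate key are assumptions made for the example.

```python
def clean_records(records):
    """Trim whitespace, normalize case, drop incomplete rows and duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        name = (record.get("name") or "").strip().title()
        email = (record.get("email") or "").strip().lower()
        if not name or not email:
            continue  # incomplete row
        if email in seen:
            continue  # duplicate, keyed on the normalized email
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


raw = [
    {"name": " alice smith ", "email": "Alice@Example.com"},
    {"name": "Alice Smith", "email": "alice@example.com "},  # duplicate
    {"name": "Bob", "email": None},  # incomplete
    {"name": "carol jones", "email": "carol@example.com"},
]
print(clean_records(raw))
```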

ggplot2 is a powerful tool in R that comes in handy for data scientists. Here’s a cheatsheet on ggplot2.

## Reporting and Presentation

Once you have completed your analysis and come up with conclusions it’s time to convince your stakeholders about your ideas. Reporting and presentation techniques will help you reach out to different stakeholders with your ideas.

## Industry Metrics

Engagement / retention rate, churn, conversion, cancellation, Competitive products / duplicates matching, how to measure them, spam detection, behavioral traits.
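Definitions of these metrics vary between companies. The sketch below uses one common set of definitions for retention, churn and conversion over a single period; the function names and the numbers are illustrative assumptions.

```python
def retention_rate(customers_start, customers_end, new_customers):
    """Share of the starting customers who are still active at period end."""
    return (customers_end - new_customers) / customers_start


def churn_rate(customers_start, customers_end, new_customers):
    """Share of the starting customers lost during the period."""
    return 1.0 - retention_rate(customers_start, customers_end, new_customers)


def conversion_rate(conversions, visitors):
    """Share of visitors who completed the target action."""
    return conversions / visitors


# Hypothetical month: 1000 customers at start, 1050 at end, 150 of them new.
print(f"retention:  {retention_rate(1000, 1050, 150):.1%}")
print(f"churn:      {churn_rate(1000, 1050, 150):.1%}")
print(f"conversion: {conversion_rate(45, 1500):.1%}")
```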

Cost functions: Log-loss, other entropy-based, DCG/NDCG, etc.
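Two of the cost functions named above can be written in a few lines. This sketch implements binary log-loss and NDCG with the `rel / log2(rank + 1)` form of DCG; other DCG variants exist, so treat this form as an assumption rather than the only definition.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_true in {0, 1}, y_prob = P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(f"log-loss: {log_loss([1, 0, 1], [0.9, 0.2, 0.6]):.3f}")
print(f"NDCG:     {ndcg([2, 3, 0, 1]):.3f}")
```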

## Other Useful Topics

### A Grasp of End-to-End Development

The things that data scientists build (graphs, charts, reports, analyses and presentations) are for the consumption of product managers and other decision makers across the company. These ideas need to be integrated into other systems that are either already functional or will be introduced in the future. So, in order to make practical suggestions to the decision makers, it’s important to understand the end-to-end processes in your company and how things work.

### Topics related to Computer Science

• Theory of computation / Analysis of algorithms
• Data structures and algorithms
• Software engineering
• Parallel programming / Massive computation (for processing huge datasets)
• Network analysis

### Develop Your Intuition

Data science is a huge subject area, and even if you master a lot of tools and techniques, you’ll still need a good deal of intuition and rational guesswork to complement the scientific knowledge. That’s because in a practical scenario you’ll most often work with inadequate data, inefficient systems that may not let you process data above a certain limit, strict timelines and so on. Working on real-life puzzles and guesstimation problems will help you develop your own thought process for tackling such situations.

### Familiarity with Big Data Tools and Frameworks

When working with data that runs into petabytes, you may need distributed processing, as such volumes can’t be processed by a single machine. The volume, velocity and variety of the data will require a different approach from the one you would apply to ordinary data. Familiarity with Big Data tools and frameworks such as Hadoop, MapReduce and Apache Spark may prove to be a plus, depending on which company you’re aiming to work for.
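The MapReduce idea can be sketched on a single machine. The example below is a toy word count in plain Python, not Hadoop or Spark code: the three function names are hypothetical, and a real framework would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit (word, 1) pairs for one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {
        key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()
    }


documents = ["big data big tools", "data tools and data frameworks"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```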

### Economics

A few nice topics to cover are behavioral economics, game theory and auction design.

## Free Online Courses

I’ve listed here some highly recommended courses that are available online for free. If you are looking for a thorough analysis of different online courses please visit the post on analysis of 12 online data science courses.

### Data Science

CS109 Data Science @ Harvard
Estimated effort: 120 to 200 hours
Tools used: Python, d3

Introduction to Data Science @ Coursera
Estimated effort: 100 to 140 hours
Certificate for \$49
Tools used: Python, R, SQL

### Data Analysis Courses Online

Data Analysis: Take It to the MAX() @ edX
Estimated effort: 30 to 50 hours
Certificate for \$50
Tools used: MS Excel, Python

The Analytics Edge @ edX
Data Analysis & Statistics
Estimated effort: 120 to 180 hours
Certificate for \$100
Tool used: R

### Machine Learning

Learning From Data @ Caltech
Estimated effort: 120 to 140 hours
Tool used: Any tool