Over the past few years the role of the predictive modeler has broadened and attracted so much attention that it has been called the sexiest job of the century. The job of a Data Scientist, as we now call it, is rewarding both in terms of salary and recognition.
Data Scientists are now required to possess not just statistical and coding knowledge but also an understanding of the industry and its products. This post focuses on a bird’s-eye view of topics, along with resources for building expertise in each through hands-on tutorials. There is a separate post dedicated to the thought process and traits of a data scientist. If you’re self-studying for data science you should cover all the topics listed below, plus other relevant topics from the free ebooks available for data science.
You may also like the infographic on Data Scientist’s resources
You may go through the video tutorials available on SAS for the below topics on basic statistics:
The topics below are the most important for analyzing data statistically. A detailed discussion of these topics is available in the book Think Stats (free PDF) and the MOOC Design and Interpretation of Clinical Trials on Coursera.
In our day-to-day lives we come across a lot of anecdotal evidence, based on data that is unpublished and usually personal. But in a professional setting we want evidence that is more persuasive, and anecdotal evidence usually fails because of the small number of observations, selection bias, confirmation bias and inaccuracy. A statistical approach helps us overcome these shortcomings and make convincing arguments in front of the different stakeholders in the business.
Means and averages, variance, distributions, and various visualization techniques.
The simplest example of a statistical experiment is a coin toss. There is more than one possible outcome, and each possible outcome (i.e., heads or tails) can be listed in advance. There is also an element of chance, due to the uncertainty of the outcome.
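To make the idea concrete, here is a minimal sketch of a coin-toss experiment in Python (the seed and toss count are arbitrary choices for illustration):

```python
import random
from collections import Counter

def coin_toss_experiment(n_tosses, seed=42):
    """Simulate n_tosses of a fair coin and tally the outcomes."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    outcomes = [rng.choice(["heads", "tails"]) for _ in range(n_tosses)]
    return Counter(outcomes)

counts = coin_toss_experiment(10_000)
# With a fair coin, each side should land close to 50% of the time.
print(counts["heads"] / 10_000)
```

Running many tosses shows the empirical frequency converging towards the theoretical probability of 0.5, which is exactly the kind of reasoning that underpins larger statistical experiments.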
Some practical areas of significance are marketing campaigns and clinical trials. Results from randomized clinical trials are usually considered the highest level of evidence for determining whether a treatment is effective.
On the other hand, marketers use this technique to increase the number of variables tested in a single campaign (product offers, messages, incentives, mail formats and so on) and to test multiple offers in the market simultaneously. Marketers learn exactly which variables entice consumers to act. As a result, response rates rise dramatically, the effectiveness of future campaigns improves and overall return on spending increases.
Deriving insights from data requires an understanding of proper experimental design. You’ll need to be able to interpret when causal effects are significant, and to work with more complicated setups involving sequential, overlapping or multiple experiments.
Associated analysis methods, ranging from t-tests to ANOVAs, are useful for drawing inferences from these experiments.
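As a minimal sketch of how a two-sample t statistic is computed (the groups here are made-up measurements, and this uses the pooled equal-variance form only for illustration):

```python
import statistics

def two_sample_t(a, b):
    """Pooled two-sample t statistic (equal-variance form)."""
    na, nb = len(a), len(b)
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)    # pooled variance
    return (ma - mb) / (sp2 * (1 / na + 1 / nb)) ** 0.5

control   = [10.1, 9.8, 10.3, 10.0, 9.9]
treatment = [11.0, 10.7, 11.2, 10.9, 11.1]
t = two_sample_t(treatment, control)
```

A large |t| relative to the relevant t distribution suggests the difference between the groups is unlikely to be due to chance alone; in practice you would use a statistics library to get the p-value rather than hand-rolling this.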
There are many questions that can’t be answered with data alone, especially questions like: how does this new feature impact latency, conversion or CTR?
These are more difficult than experiments to analyze, and require a very strong understanding of assumptions and causal models in order to draw convincing conclusions from them. Not everything can be randomized, so you’ll often have to face these observational studies or quasi-experiments.
Most of the BI metrics we’ll discuss throughout the tutorials come as time series. With time series analysis you can use past data to forecast the future.
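One of the simplest forecasting ideas is a trailing moving average; here is a sketch using invented daily signup counts:

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical daily signups for the past six days
daily_signups = [120, 132, 128, 141, 150, 147]
forecast = moving_average_forecast(daily_signups)  # mean of the last 3 days
```

Real time series work adds trend, seasonality and autocorrelation models on top of this, but the moving average already illustrates the core idea of using the recent past to predict the near future.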
This is important and common in regression and classification exercises; it gives you a good framework for combining information from covariates to predict an outcome or class.
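The workhorse here is ordinary least squares; as a sketch, the closed-form solution for a simple one-covariate regression can be computed by hand (the data points are invented, roughly following y = 1 + 2x):

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = a + b*x, via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # noisy observations of roughly y = 1 + 2x
intercept, slope = fit_simple_regression(xs, ys)
```

With multiple covariates the same idea generalizes to matrix form, which is what statistical packages in R or Python solve for you.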
Some other statistical topics of interest are: Survey analysis, Causal inference, Bayesian data analysis, Nonparametric methods etc.
For classification problems discriminant analysis seems to be used a lot by data scientists.
Techniques such as clustering, trees, etc. also help with regression, classification, and more complicated modeling questions such as understanding text, pictures, and audio.
Visit SAS website for tutorials
As a data scientist you must know how data is stored, queried, managed, handled, cleaned, modelled and reported, though you may not need to create data pipelines or manage data warehouses yourself.
As a student you’ll most often work with text files or CSVs to access and store data. But once you join the industry you’ll be required to access databases to fetch your data and do your analysis; MySQL, MongoDB and Cassandra are a few of the names. It’s important that you know at least the basics of querying data and other concepts of relational databases (data models etc.), which will make it easier to work in a real-life setup.
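The querying basics carry over between engines; as a self-contained sketch, here is a typical aggregation query run against an in-memory SQLite database (the `orders` schema and rows are invented for illustration):

```python
import sqlite3

# In-memory database standing in for a production warehouse (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 20.0), ("alice", 15.0)])

# Aggregate spend per customer, highest spender first
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()
```

`SELECT`, `GROUP BY` and `ORDER BY` like this are the bread and butter of pulling analysis-ready data out of a relational database, whatever the vendor.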
Once you have gained a basic understanding of the statistical topics and tools you should focus on various techniques used in machine learning. There are 3 broad types of learning: Supervised Learning, Unsupervised Learning and Semi-Supervised Learning. Below is a list of various types of algorithms that you should learn for Data Science applications:
Here’s an excellent blog post that takes you on a tour of machine learning algorithms.
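As a tiny taste of supervised learning, here is a sketch of a 1-nearest-neighbor classifier, one of the simplest algorithms on such tours (the training points and labels are invented):

```python
def nearest_neighbor_predict(train, label_of, point):
    """1-nearest-neighbor: return the label of the closest training point."""
    def dist(p, q):
        # Euclidean distance between two points
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    closest = min(train, key=lambda p: dist(p, point))
    return label_of[closest]

# Two clearly separated clusters with known labels
train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
labels = {(1.0, 1.0): "A", (1.2, 0.8): "A", (5.0, 5.0): "B", (5.2, 4.9): "B"}
prediction = nearest_neighbor_predict(train, labels, (0.9, 1.1))
```

Despite its simplicity, this captures the essence of supervised learning: labelled examples in, a predicted label for a new observation out.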
Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated.
Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.
Any advanced tool for data analysis will help you in cleaning, wrangling and visualization.
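Before reaching for an advanced tool, it helps to see how little code basic scrubbing takes; here is a sketch over invented (name, email) records covering the cases from the definitions above: trimming, normalising, dropping incomplete or malformed rows, and de-duplicating:

```python
def scrub(records):
    """Trim whitespace, normalise case, drop bad rows, de-duplicate by email."""
    seen, clean = set(), []
    for name, email in records:
        name, email = name.strip().title(), email.strip().lower()
        if not name or "@" not in email:   # incomplete or malformed row
            continue
        if email in seen:                  # duplicate after normalisation
            continue
        seen.add(email)
        clean.append((name, email))
    return clean

raw = [(" alice ", "Alice@Example.com"),
       ("alice", "alice@example.com"),    # duplicate once normalised
       ("", "bob@example.com"),           # missing name
       ("carol", "not-an-email")]         # malformed email
```

Real-world munging involves far messier rules, which is why semi-automated tools earn their keep, but the shape of the work is exactly this.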
ggplot2 is a powerful tool in R that comes handy for Data Scientists. Here’s a cheatsheet on ggplot2.
Once you have completed your analysis and drawn your conclusions, it’s time to convince your stakeholders of your ideas. Reporting and presentation techniques will help you reach different stakeholders with your ideas.
Engagement/retention rate, churn, conversion, cancellation, competitive products/duplicates matching, how to measure them, spam detection, behavioral traits.
Cost functions: Log-loss, other entropy-based, DCG/NDCG, etc.
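Log-loss is worth seeing once in code; here is a sketch of binary log-loss (cross-entropy) with invented labels and predictions, comparing confident correct predictions against hedged ones:

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy; predictions are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = log_loss([1, 0, 1], [0.9, 0.1, 0.8])  # correct and confident
hedged    = log_loss([1, 0, 1], [0.6, 0.4, 0.6])  # correct but uncertain
```

The key property: log-loss rewards well-calibrated confidence, so the hedged predictions score worse than the confident ones even though both would be "right" after thresholding, and a confident wrong prediction is punished very heavily.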
The things that Data Scientists build, the graphs, charts, reports, analyses and presentations, are consumed by product managers and other decision makers across the company. These ideas need to be integrated into other systems that are either already functional or will be introduced in the future. So, in order to make practical suggestions to the decision makers, it’s important to understand the end-to-end processes in your company and how things work.
Data Science is a huge subject area, and even if you master a lot of tools and techniques you’ll still need a good deal of intuition and educated guesses to complement the scientific knowledge. That’s because in a practical scenario you’ll most often work with inadequate data, inefficient systems that can’t process data beyond a certain limit, strict timelines and so on. Working on real-life puzzles and guesstimation problems will help you develop your own thought process for tackling such situations.
While working with data that runs into petabytes, you may need to use distributed processing, as such volumes can’t be handled by a single machine. The volume, velocity and variety of the data will require a different approach than what you would apply to ordinary data. Familiarity with some of the Big Data tools and frameworks like Hadoop, MapReduce and Apache Spark may prove to be a plus, depending on which company you’re aiming to work for.
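The MapReduce programming model itself is easy to grasp in miniature; here is a sketch of the classic word count, with the map and reduce phases written as plain Python functions over invented input lines:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per word (pairs must arrive sorted by key)."""
    return {word: sum(c for _, c in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data big tools", "data everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
# The sort stands in for the framework's shuffle step between map and reduce
counts = reduce_phase(sorted(mapped))
```

A real framework like Hadoop distributes the map tasks across machines and handles the shuffle for you, but the mapper and reducer you write have exactly this shape.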
A few nice topics to cover are: Behavioral economics, Game theory, Auction design
I’ve listed here some highly recommended courses that are available online for free. If you are looking for a thorough analysis of different online courses please visit the post on analysis of 12 online data science courses.
CS109 Data Science @ Harvard
Estimated efforts: 120 to 200 hours
Tool used: Python, d3
Link to course
Introduction to Data Science @ COURSERA
Estimated efforts: 100 to 140 hours
Certificate for $49
Tool used: Python, R, SQL
Link to course
Data Analysis: Take It to the MAX() @ EDX
Estimated efforts: 30 to 50 hours
Certificate for $50
Tool used: MS Excel, Python
Link to course
The Analytics Edge @ EDX
Data Analysis & Statistics
Estimated efforts: 120 to 180 hours
Certificate for $100
Tool used: R
Link to course
Learning From Data @ CALTECH
Estimated efforts: 120 to 140 hours
Tool used: Any tool
Link to course
Machine Learning (by Andrew NG) @ COURSERA
Estimated efforts: 80 to 130 hours
Certificate for $49
Tool used: Octave
Link to course
Kaggle is the best platform for learning by doing. Whether you are a beginner, a learner, a professional or a maestro, Kaggle has something in store for you. Besides working on real-life problems, it also helps you get noticed by prospective employers. But like every other platform, the key to using Kaggle to your best advantage is forming the right team and choosing the right problems. We have started a community to help you get into the right team based on your expertise and interests. Just fill in this simple form and we’ll get in touch with you.
If you are just thinking of trying your hand at some of the Kaggle problems, the best advice is to start by asking some basic questions and exploring the data before moving on to advanced analysis.