Resources

As more high-quality material on big data and machine learning is published, finding the latest and best of it is becoming increasingly challenging. And because this is a highly competitive field, staying up to date matters: what you were told two years ago may not matter at all today.

I know many professionals who spend an hour or two on Google or Reddit to discover the best newly published material. Fortunately, that isn’t necessary if you have an indexed, properly curated, and regularly updated source that gets ongoing feedback from its readers.

In this regularly updated post, I’ll give you resources you can use to learn about big data and machine learning and stay on top of the job market, including some free tools that will be useful for getting the job done. Enjoy!

1000+ Most Popular Resources on Big Data/ML/Data Science and Visualization Across the Web

About Data Science

Doing Data Science at Twitter – Describes how machine learning has come to play an increasingly prominent role across many core Twitter products that were previously not ML-driven, and how the data science landscape at Twitter has changed in the recent past.

Data Science Salary Survey 2015 – The 2015 edition of the survey explores patterns in tools, tasks, and compensation through the lens of clustering and linear models. The research is based on data collected through an online 32-question survey, which also gathered demographic information.

Some Real World Machine Learning Examples – The post covers real-world applications of machine learning, ranging from computational biology and drug discovery/design to web search, recommendation engines, and finance.

Programming for Data Science

25 Java Machine Learning Tools & Libraries – Lists 25 Java machine learning tools and libraries, such as Weka, Meka, ADAMS, Mallet, and Encog.

R vs Python – In the battle of “best” data science tools, Python and R both have their pros and cons. Choosing one over the other depends on the use case, the cost of learning, and the other common tools required. Here’s an analysis.

How to Learn R – R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials, including documentation, online courses, books, and more.

A two-hour introduction to data analysis in R – If you’re looking for a non-diamonds or non-nycflights13 introduction to R / ggplot2 / dplyr, feel free to use materials from this workshop.

Intro to Python for Data Science – Unlike other Python tutorials, this course focuses on Python specifically for data science. In this Intro to Python class, you will learn about powerful ways to store and manipulate data as well as cool data science tools to start your own analyses.

Introduction to machine learning in Python with scikit-learn (video series) – Scikit-learn is a widely used Python library for machine learning. Here’s a series of nine video tutorials, totaling four hours, produced in partnership with Kaggle.

Intro to Julia – Julia aims to address the “two language problem” that is all too common in technical computing. Visit this post for a fresh approach to numerical computing and data science using Julia.

Cheat sheets on various data science tools – Here’s a good starting point. You can find many additional references here (Python, Excel, Spark, R, Deep Learning, AI, SQL, NoSQL, Graph Databases, Visualization, etc.).

Top 10 R Packages to be a Kaggle Champion – R has ranked among the top programming choices for data scientists in major surveys, so knowing the important R packages can be a vital advantage in Kaggle competitions. Here’s a list of 10 R packages that played a key role in getting a top 10 ranking in more than 15 Kaggle competitions.

Integrating Python and R into a Data Analysis Pipeline – The first in a series of blog posts that outline the basic strategy for integrating Python and R, run through the different steps involved in this process, and give a real example of how and why you would want to do this.
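
As a rough illustration of what such an integration can look like, here is a minimal sketch that calls an R script from Python through the Rscript command-line tool. This is one common approach and is assumed here for illustration; it is not necessarily the exact strategy the series follows. The function names and the script name max.R are hypothetical.

```python
import shutil
import subprocess


def build_rscript_command(script_path, *args):
    """Build the argument list for running an R script from the command line."""
    # --vanilla asks R not to load or save any workspace or profile files.
    return ["Rscript", "--vanilla", script_path, *[str(a) for a in args]]


def run_r_script(script_path, *args):
    """Run the R script and return whatever it printed to standard output."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript was not found on PATH; is R installed?")
    completed = subprocess.run(
        build_rscript_command(script_path, *args),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        check=True,           # raise CalledProcessError if R exits with an error
    )
    return completed.stdout


# Hypothetical usage, assuming a script max.R that prints the largest argument:
#   output = run_r_script("max.R", 3, 7, 5)
```

Passing data as command-line arguments and reading standard output keeps the two languages loosely coupled; for larger data sets, exchanging files (for example CSV) between the steps is a common alternative.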

Machine Learning

Steps in Machine Learning

Data Exploration with Python – Here is a cheat sheet to help you with the code and steps for performing exploratory data analysis in Python. There is also a PDF version of the sheet so that you can easily copy and paste the code.

Data Exploration with SAS – Exploring data sets and developing a deep understanding of the data is one of the most important skills every data scientist should possess. By some estimates, these activities can take up as much as 80% of project time in some cases. This guide uses NumPy, Matplotlib, Seaborn and Pandas to perform data exploration.

Data Exploration with R – A detailed tutorial on data exploration using R.
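
To give a feel for the kind of first-pass exploration these guides cover, here is a minimal Python sketch that counts missing values and computes basic statistics per column. It deliberately uses only the standard library so it runs anywhere; the linked guides rely on libraries such as Pandas instead. The sample data and the summarize function are made up for illustration.

```python
import csv
import io
import statistics

# Hypothetical sample data; in practice this would be read from a file.
SAMPLE_CSV = """age,income,city
34,52000,Austin
29,,Boston
41,67000,Austin
29,48000,
"""


def summarize(csv_text):
    """Return per-column missing counts and basic descriptive statistics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        info = {"missing": len(values) - len(present)}
        summary[column] = info
        if not present:
            continue  # nothing to describe in an entirely empty column
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            # Non-numeric column: report the number of distinct values instead.
            info["distinct"] = len(set(present))
        else:
            info["mean"] = statistics.mean(numbers)
            info["median"] = statistics.median(numbers)
            info["min"] = min(numbers)
            info["max"] = max(numbers)
    return summary


for column, info in summarize(SAMPLE_CSV).items():
    print(column, info)
```

Checking missing values and value ranges before any modeling is usually the cheapest way to catch data problems early.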

Types of algorithms – This post gives you two ways to think about and categorize the algorithms you may come across in the field. The first is a grouping of algorithms by learning style. The second is a grouping of algorithms by similarity in form or function (like grouping similar animals together).
Arriving at an Algorithm
List of Statistical Data Mining Tutorials by Andrew Moore
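
To make the idea of grouping algorithms by learning style concrete, here is a small sketch contrasting two well-known styles: supervised learning, which uses labeled examples, and unsupervised learning, which does not. The toy data and both function names are hypothetical, and the algorithms are simplified to one dimension for readability.

```python
# Toy 1-D data: hours studied (hypothetical values chosen for illustration).
hours = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]


def nearest_neighbor_predict(train_x, train_y, query):
    """Supervised: use known labels to predict the label of a new point."""
    best_index = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best_index]


def two_means(points, iterations=10):
    """Unsupervised: split points into two groups without using any labels.

    A simplified 1-D k-means with k=2; assumes at least two distinct values.
    """
    low, high = min(points), max(points)  # initial cluster centres
    for _ in range(iterations):
        # Assign every point to the closer of the two centres.
        group_low = [p for p in points if abs(p - low) <= abs(p - high)]
        group_high = [p for p in points if abs(p - low) > abs(p - high)]
        # Move each centre to the mean of its assigned points.
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high


print(nearest_neighbor_predict(hours, labels, 7.0))
print(two_means(hours))
```

The first function cannot work without the labels list, while the second never sees it; that difference in what the algorithm learns from is what the learning-style grouping captures.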

Model Selection

Performance Estimation: Generalization Performance Vs. Model Selection
Predictive model selection – quick tricks
Frequentism and Bayesianism V: Model Selection
Survival of Fitness: How Model Selection Happens In The Natural Order of Data Science
Machine learning for model selection in population genomics

Feature Engineering

The Data Science Machine, or ‘How To Engineer Feature Engineering’
Feature Engineering versus Feature Extraction: Game On!
Feature Engineering for Fraud Detection Models

Boosting, Bagging and Stacking

How to Build an Ensemble Of Machine Learning Algorithms in R (ready to use boosting, bagging and stacking)
Quick Introduction to Boosting Algorithms in Machine Learning
Learn Gradient Boosting Algorithm for better predictions (with codes in R)
What are the similarities and differences between these 3 methods: bagging, boosting, stacking?
Model ensembling for Kaggle

Evaluating Machine Learning Models

How to Evaluate Machine Learning Models: Classification Metrics

Overfitting

Overfitting or generalized? Comparison of ML classifiers – a series of articles
Data Science 101: Preventing Overfitting in Neural Networks
Decision Trees – Handling Overfitting using Forests

Dealing With Unstructured Data

Unlocking The Value Of Unstructured Data
5 Easy Steps to Structure Highly Unstructured Big Data, via Automated Indexation
The Applications of Machine Learning Through Unstructured Text Data

Recommender System

How to Build a Recommender System
Building the Next New York Times Recommendation Engine
Collaborative filtering recommendation engine implementation in python
Basic recommendation engine using R
Building a Real-Time Geospatial-Aware Recommendation Engine
How to build your own recommendation engine using machine learning on Google Compute Engine
Apache Mahout: The Recommender System for Big Data
Recommender System with Mahout and Elasticsearch
The Netflix Recommender System: Algorithms, Business Value, and Innovation
Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin

Text Mining

Overview of Text Mining
10 Common NLP Terms Explained for the Text Mining Novice
Hacks to perform faster Text Mining in R
Text Mining Analysis: some theory and practice in R
Text Mining Shakespeare with MATLAB
Use case: Text analytics vs survey analysis

Deep Learning

Microsoft Neural Net Shows Deep Learning Can Get Way Deeper
Deep learning – Convolutional neural networks and feature extraction with Python
Google’s New AI System Could Be ‘Machine Learning’ Breakthrough – The article presents TensorFlow as the first serious implementation of a ‘deep learning’ framework, backed by a very experienced and capable team at Google, and serves as an introduction to it.

Other Topics

Applied Spatial Data Science
Distributed Data-structures

Popular Books

10 Big Data Books To Boost Your Career – InformationWeek
Quick Reviews: 3 Books on Visualizing Data
15 Must Read Books for Entrepreneurs in Data Science
Data at work: a data visualization book for Excel users
60+ Free Books on Big Data, Data Science, Data Mining, Machine Learning, Python, R, and more
16 Free Data Science Books
Free Must Read Books on Statistics & Mathematics for Data Science
15 Books every Data Scientist Should Read

Lessons from Kaggle

Anthony Goldbloom gives you the secret to winning Kaggle competitions
Learning from the OTTO Group Kaggle competition
Doing Data Science: A Kaggle Walkthrough – Cleaning Data
Understanding Text Mining Using Kaggle

Visualization

7 tools for data visualization in R, Python and Julia

Databases

SQL vs. NoSQL – What You Need to Know

Statistics

Becoming a Full-Stack Statistician in 6 Easy Steps
Advice for applying Machine Learning by Andrew Ng