Category Archives for "Big Data Trends"

Sep 15

Toolkit for Data Scientists and Big Data Analysts

By ganpati | Big Data Trends

Data Scientist Toolkit

Source: Bigstockphoto

Many of the tools that we are going to discuss here are far from mature, and the landscape shifts very fast. There’s a lot of overlap in functionality, so people use whatever they’re fastest with to get the job done. After all, data science is a field where you pick your tools in the smartest possible way.

Things are not as standardized as the answer to, say, “What tools does a systems programmer use?”

I haven’t been actively doing data work for about 8 months so I might already be slightly out of the loop, but this is what I remember using myself off the top of my head.

Data Analysis

The R Project for Statistical Computing is still the most popular. I agree with William that it’s best used through RStudio.
Pandas is a Python library for data analysis, a good choice if you already know Python and don’t want to learn a new language.
The Julia Language is an upcoming alternative to R. I spent a little bit of time learning it and would like to keep track of where the project goes.

Data Warehousing

MySQL: It can comfortably handle datasets that are a few GBs. Don’t prematurely go to Hive. MySQL is optimized to death and is super good at latency for running ad-hoc queries.
CSV Files: You’d be surprised how far you can get with using these as your primary storage.
Hive/Shark/Redshift: For when your big data is actually big. Hive can do giant joins while Redshift is better for latency but more limited in its joins.
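
To make the point about CSV files and ad-hoc queries concrete, here is a minimal sketch that loads a small CSV into a SQL table and runs one aggregate query. It uses Python’s built-in sqlite3 module purely as a stand-in for MySQL so that it runs without a server; the sample data, the table name events and the helper load_csv_into_sqlite are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; in practice this would be a CSV file on disk.
CSV_TEXT = """user_id,country,purchases
1,US,3
2,IN,7
3,US,2
4,DE,1
"""


def load_csv_into_sqlite(csv_text):
    """Load CSV text into an in-memory SQLite table and return the connection."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (user_id INTEGER, country TEXT, purchases INTEGER)"
    )
    # Named placeholders are filled from each row dictionary.
    conn.executemany(
        "INSERT INTO events VALUES (:user_id, :country, :purchases)", rows
    )
    return conn


conn = load_csv_into_sqlite(CSV_TEXT)

# Ad-hoc aggregate: total purchases per country, largest first.
totals = conn.execute(
    "SELECT country, SUM(purchases) AS total "
    "FROM events GROUP BY country ORDER BY total DESC"
).fetchall()
print(totals)
```

The same query text would work largely unchanged against a MySQL table, which is part of why a plain relational database goes a long way before a distributed warehouse is needed.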

Data Visualization

D3.js for pretty visualizations to put on the web
Matplotlib for ad-hoc Python plotting
ggplot2 for R.

Machine Learning

I’ve mostly trained models in R and used it directly or over Hive.
I’ve played around with scikit-learn in the past and it seems to be maturing.
I’ve used Weka in the past for trying out standard algorithms quickly on new datasets, but it’s hard to productionize anything with it.

Social Network Analysis

It’s been a while since I’ve done hands-on SNA, so things might have changed recently.
NetworkX is a pretty good Python library for SNA but isn’t distributed
Stanford Network Analysis Project
Apache Giraph is an open-source implementation of the Google Pregel paper.
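
For readers new to SNA, the sketch below shows the kind of single-machine computation these libraries offer: degree centrality and shortest-path length on a small undirected graph. It is written in plain Python rather than with NetworkX so that it has no dependencies; the edge list and function names are hypothetical and are not any library’s API.

```python
from collections import defaultdict, deque

# Hypothetical friendship edges (undirected).
EDGES = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("bob", "cat"),
    ("cat", "dan"),
    ("dan", "eve"),
]


def build_graph(edges):
    """Build an adjacency-set representation of an undirected graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


graph = build_graph(EDGES)
centrality = degree_centrality(graph)
print(max(centrality, key=centrality.get))       # best-connected node
print(shortest_path_length(graph, "ann", "eve"))  # hops between two people
```

Everything here lives in one process’s memory, which is the limitation the “isn’t distributed” remark points at; systems in the Pregel family exist for graphs that do not fit on one machine.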

Data Analytics

SAS: provides advanced analytics, data management and business intelligence solutions.
SPSS: IBM SPSS predictive analytics software offers advanced techniques in an easy-to-use package to help you find new opportunities, improve efficiency and minimize risk.
Statistica: provides a comprehensive array of data analysis, data management, data visualization and data mining procedures. Its techniques include a wide selection of predictive modeling, clustering, classification and exploratory techniques in one software platform.
KXEN
WinCross
Knowledge Studio
@RISK: Risk analysis software using Monte Carlo simulation for Excel
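
Monte Carlo simulation, mentioned in the last item above, is easy to sketch outside a spreadsheet. The example below is only an illustration of the general technique and not how any of the listed products works: the two cost items, their triangular distributions and the function name are assumptions chosen for the example.

```python
import random


def simulate_project_cost(trials=10_000, seed=42):
    """Estimate the mean and 90th-percentile total cost by random sampling."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    totals = []
    for _ in range(trials):
        # random.triangular takes (low, high, mode); the figures are hypothetical.
        labor = rng.triangular(80, 150, 100)
        materials = rng.triangular(40, 90, 50)
        totals.append(labor + materials)
    totals.sort()
    mean = sum(totals) / trials
    p90 = totals[int(0.9 * trials) - 1]  # value that 90% of trials stay at or below
    return mean, p90


mean, p90 = simulate_project_cost()
print(f"mean cost: {mean:.1f}, 90th percentile: {p90:.1f}")
```

The gap between the mean and the 90th percentile is the kind of number a risk analysis is after: it says how much contingency covers most outcomes rather than just the average one.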

A note about the author: Abhinav Sharma is a Product Designer at Quora with a B.S. in Computer Science from Carnegie Mellon University. Here’s a link to his LinkedIn profile.

This answer was originally published on Quora by Abhinav Sharma.

We’ve added a few points as additional comments.

Sep 11

Are We Seeing A Data Science War: Kaggle v Watson?

By ganpati | Big Data Trends

Not enough professionals to meet the growing demand

Source: Bigstockphoto

Forbes
By Bernard Marr

As far back as 2012, Gartner estimated that there would be a shortage of 100,000 data scientists by 2020. In a 2016 data science report, CrowdFlower found that 83 percent of respondents felt there weren’t enough data scientists — up from 79 percent the year before.

Part of the problem is that data science education isn’t necessarily turning out applicants who are ready to jump into a data scientist position (though they may be ready for a more entry-level position). In fact, only six universities reviewed by U.S. News and World Report offer data science programs to undergraduates (the remaining 23 programs are only available to graduate students).


Aug 01

Top Big Data Analytics Blogs By Traffic

By ganpati | Big Data Trends

We’ve listed some of the most popular data science blogs for machine learning, business intelligence, data visualization and general data science based on estimated monthly traffic volume. Data about their traffic comes from various publicly available sources.

This is a work in progress. If there’s a blog that you think is missing, leave it in the comments and I’ll add it to the list.

Datacamp

Estimated Traffic: 970K-1.4M

Kdnuggets

Estimated Traffic: 800K-1.3M

Analyticsvidhya

On development of analytical skills, analytic industry best practices, and more
Estimated Traffic: 970K-1M

Datasciencecentral

Estimated Traffic: 540K-720K

Machinelearningmastery

Machine Learning Mastery by Jason Brownlee, on programming & machine learning.
Estimated Traffic: 350K-420K

Dataquest Blog

Estimated Traffic: 350K-400K

Mastersindatascience

Estimated Traffic: 210K-220K

Predictiveanalyticstoday

Estimated Traffic: 190K-220K

Bigdata-madesimple

Estimated Traffic: 170K-200K

Datatau

A list of interesting articles submitted by readers
Estimated Traffic: 140K-165K

Dataconomy

Estimated Traffic: 150K-170K

Datanami

Estimated Traffic: 130K-180K

Listendata

Estimated Traffic: 120K-130K

Datafloq

Estimated Traffic: 120K-200K

Listendata

Estimated Traffic: 100K-130K

Several other popular blogs and sites have been excluded from this report because traffic estimates for their data mining, data science and machine learning sections are not available. Blogs that focus primarily on web analytics or digital marketing have also been excluded. Please feel free to post other popular big data analytics blogs that draw heavy traffic in the comments section, and we will be happy to include them.

Note: The statistics contained here have been collected from multiple publicly available sources and are for information purposes only. They are believed to be reliable; however, Analyticscosm does not warrant their completeness, timeliness or accuracy.

Suggested Articles:

Role of Data Scientists in Sports Analytics
Role of Data Scientists in Data Driven Companies
A Simple Step by Step Guide to WEKA
Data Science and Data Analyst Internships