Many of the tools discussed here are far from mature, and the landscape shifts very fast. There’s a lot of overlap in functionality, so people use whatever they’re fastest with to get the job done. After all, data science is a field where you pick your tools pragmatically.
Things are not as standardized as the answer to, say, “What tools does a systems programmer use?”
I haven’t been actively doing data work for about 8 months so I might already be slightly out of the loop, but this is what I remember using myself off the top of my head.
The R Project for Statistical Computing is still the most popular choice. I agree with William that it’s best used through RStudio.
pandas is a Python data analysis library, a good option if you don’t want to learn a new language.
The Julia Language is an upcoming alternative to R. I spent a little bit of time learning it and would like to keep track of where the project goes.
MySQL: It can comfortably handle datasets of a few GB. Don’t move to Hive prematurely. MySQL is heavily optimized and gives very low latency on ad-hoc queries.
CSV Files: You’d be surprised how far you can get with using these as your primary storage.
Hive/Shark/Redshift: For when your big data is actually big. Hive can do giant joins while Redshift is better for latency but more limited in its joins.
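To make the point about CSV files concrete, here is a minimal sketch of using a CSV file as the primary store and running a simple aggregation over it with nothing but Python's standard library. The file name, column names, and sample rows are all made up for illustration; they are not from any real dataset.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_rows(path, rows, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def total_by(path, key, value):
    """Sum the `value` column grouped by the `key` column, streaming row by row."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[key]] += float(row[value])
    return dict(totals)


# Hypothetical sample data: page visits per country.
ROWS = [
    {"country": "US", "visits": 120},
    {"country": "IN", "visits": 80},
    {"country": "US", "visits": 30},
]

csv_path = os.path.join(tempfile.mkdtemp(), "visits.csv")
write_rows(csv_path, ROWS, ["country", "visits"])
totals_by_country = total_by(csv_path, "country", "visits")
```

Because the reader streams one row at a time, this style of group-and-sum keeps working on files much larger than memory, which is part of why plain CSV goes further than you might expect.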
I’ve mostly trained models in R and used it directly or over Hive.
I’ve played around with scikit-learn in the past and it seems to be maturing.
I’ve used Weka in the past for quickly trying out standard algorithms on new datasets, but it’s hard to productionize anything with it.
Social Network Analysis
It’s been a while since I’ve done hands-on SNA, so things might have changed recently.
NetworkX is a pretty good Python library for SNA, but it isn’t distributed.
Stanford Network Analysis Project
Apache Giraph is an open-source implementation of the Google Pregel paper.
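For readers who haven't seen the Pregel model that Giraph implements, here is a toy, single-machine sketch of its vertex-centric idea: computation proceeds in supersteps, vertices exchange messages along edges, and a vertex goes quiet when it has nothing new to say. This is plain Python, not the Giraph API, and the graph and values are hypothetical. The example propagates the maximum value through the graph.

```python
from collections import defaultdict


def pregel_max(graph, values):
    """Propagate the maximum value through a graph, one superstep at a time.

    graph:  dict mapping each vertex to a list of its out-neighbours.
    values: dict mapping each vertex to its initial numeric value.
    Returns the final value held by every vertex.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value along its out-edges.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # The vertex learned a larger value: adopt it and tell neighbours.
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
            # Otherwise the vertex "votes to halt" by sending nothing.
        inbox = outbox  # messages are delivered at the start of the next superstep

    return values


# Hypothetical 4-vertex chain a - b - c - d, with each edge listed both ways.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
VALUES = {"a": 3, "b": 6, "c": 2, "d": 1}

final_values = pregel_max(GRAPH, VALUES)
```

In a real Pregel-style system the vertices are partitioned across many machines and the per-vertex logic runs in parallel within each superstep; the loop above only imitates that structure sequentially.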
SAS: provides advanced analytics, data management, and business intelligence solutions.
SPSS: IBM SPSS predictive analytics software offers advanced techniques in an easy-to-use package to help you find new opportunities, improve efficiency, and minimize risk.
Statistica: provides a comprehensive array of data analysis, data management, data visualization, and data mining procedures. Its techniques include a wide selection of predictive modeling, clustering, classification, and exploratory techniques in one software platform.
@RISK: risk analysis software using Monte Carlo simulation for Excel.
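As a rough idea of what Monte Carlo risk analysis means in practice, here is a minimal sketch in plain Python. It has nothing to do with how any of the commercial packages above are implemented. It assumes a hypothetical project budget made of three cost items, each described by a low, most likely, and high estimate, and models each one with a triangular distribution.

```python
import random


def simulate_total_cost(items, trials=10_000, seed=0):
    """Draw `trials` random totals, where each item is (low, most_likely, high)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = []
    for _ in range(trials):
        # random.triangular takes its arguments in the order (low, high, mode).
        total = sum(rng.triangular(low, high, mode) for low, mode, high in items)
        totals.append(total)
    totals.sort()
    return totals


def percentile(sorted_values, p):
    """Return the value at fraction `p` (0 to 1) of an ascending sorted list."""
    index = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[index]


# Hypothetical cost items: (low, most likely, high), in arbitrary units.
ITEMS = [(10, 12, 20), (5, 6, 9), (2, 3, 8)]

simulated_costs = simulate_total_cost(ITEMS)
median_cost = percentile(simulated_costs, 0.5)
p90_cost = percentile(simulated_costs, 0.9)
```

The useful output is the spread rather than a single number: comparing the median with the 90th percentile shows how much budget headroom you would need to be fairly confident of covering the cost.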
A note about the author: Abhinav Sharma is a Product Designer at Quora with a B.S. in Computer Science from Carnegie Mellon University. Here’s a link to his LinkedIn profile.
This answer was originally published on Quora by Abhinav Sharma.
We’ve added a few points as additional comments.