# Category Archives for "News Updates"

Aug 14

## A Simplified Understanding of Naïve Bayes Algorithm

Commonly used in machine learning, Naive Bayes is a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every feature being classified is independent of every other feature. So, for example, a vegetable may be considered to be a chili if it is red, tastes hot, and is about 4″ in length. A Naive Bayes classifier considers each of these "features" (red, hot, 4″ in length) to contribute independently to the probability that the vegetable is a chili, regardless of any correlations between the features. Features, however, aren't always independent, which is often seen as a shortcoming of the Naive Bayes algorithm, and this is why it's labeled "naive".

## The Bayes’ theorem

This theorem forms the core of the whole concept of naive Bayes classification. In order to understand how naive Bayes classifiers work, we have to briefly recapitulate the concept of Bayes' rule. The probability model formulated by Thomas Bayes (1701–1761) is quite simple yet powerful; it can be written down as follows:
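
In symbols, writing $\omega_i$ for class $i$ and $\mathbf{x}$ for the observed feature vector, Bayes' rule is:

```latex
P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{P(\mathbf{x})}
```

That is, the posterior probability of a class is its likelihood times its prior, normalized by the evidence.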

In the context of a classification problem, it can be interpreted as: “What is the probability that a particular object belongs to class i given its observed feature values?”

To put it simply, Naive Bayes works well when the class boundaries are linear.

For a more detailed understanding, you may check out this post on Naive Bayes.

## Why is the Naive Bayes algorithm fast?

Naive Bayes is fast because all it needs are the prior and conditional probabilities, which can be 'learnt', or rather determined, with trivial operations: counting and dividing. These values do not change and can be stored and reused.
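
As a rough sketch (with a made-up toy dataset), that "counting and dividing" training step might look like this in Python:

```python
from collections import Counter, defaultdict

# Hypothetical training data: each row is (features, label).
# Counting class frequencies gives the priors; counting
# (feature, value, class) triples gives the conditionals.
data = [
    ({"color": "red", "taste": "hot"}, "chili"),
    ({"color": "red", "taste": "hot"}, "chili"),
    ({"color": "red", "taste": "mild"}, "tomato"),
    ({"color": "green", "taste": "mild"}, "cucumber"),
]

class_counts = Counter(label for _, label in data)
cond_counts = defaultdict(Counter)  # (feature, value) -> per-class counts
for features, label in data:
    for feat, value in features.items():
        cond_counts[(feat, value)][label] += 1

n = len(data)
priors = {c: class_counts[c] / n for c in class_counts}
# P(color=red | chili) = count(color=red, chili) / count(chili)
p_red_given_chili = cond_counts[("color", "red")]["chili"] / class_counts["chili"]

print(priors["chili"])        # 0.5
print(p_red_given_chili)      # 1.0
```

One pass over the data fills the count tables; every probability the classifier will ever need is then a single lookup and division.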

## Difference between Naïve Bayes and Logistic regression

Reference: Brendan O'Connor's answer on Quora

The difference between Naïve Bayes and Logistic regression is based on how you fit the weights from training data.

In NB, you set each feature’s weight independently, based on how much it correlates with the label. (Weights come out to be the features’ log-likelihood ratios for the different classes.)

In logistic regression, by contrast, you set all the weights together such that the linear decision function tends to be high for positive classes and low for negative classes. (Linear SVMs work the same, except for a technical tweak of what "tends to be high/low" means.)

The difference between NB and LogReg happens when features are correlated. Say you have two features which are useful predictors — they correlate with the labels — but they themselves are repetitive, having extra correlation with each other as well. NB will give both of them strong weights, so their influence is double-counted. But logistic regression will compensate by weighting them lower.

This is a way to view the probabilistic assumptions of the models; namely, Naive Bayes makes a conditional independence assumption, which is violated when you have correlated/repetitive features.

One nice thing about NB is that training has no optimization step. You just calculate a count table for each feature and you’re done with it — it’s single pass and trivially parallelizable every which way.

One nice thing about LR is that you can be sloppy with feature engineering. You can throw in multiple variations of a feature without hurting the overall model (provided you’re regularizing appropriately); but in NB this can be problematic.
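
The double-counting effect described above can be seen in a toy example (made-up data; the weight here is a smoothed version of the per-feature log-likelihood ratio mentioned earlier). A feature and its exact copy each receive the same NB weight, so together they exert twice the influence:

```python
import math

# Toy binary dataset: feature f2 is an exact copy of f1.
# (Hypothetical data chosen so f1 correlates with the label y.)
X = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]

def llr(feature_idx, smoothing=1.0):
    """NB weight: log P(feature=1 | y=1) / P(feature=1 | y=0), smoothed."""
    pos = sum(1 for xi, yi in zip(X, y) if yi == 1 and xi[feature_idx] == 1)
    neg = sum(1 for xi, yi in zip(X, y) if yi == 0 and xi[feature_idx] == 1)
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    p1 = (pos + smoothing) / (n_pos + 2 * smoothing)
    p0 = (neg + smoothing) / (n_neg + 2 * smoothing)
    return math.log(p1 / p0)

w1, w2 = llr(0), llr(1)
# NB sets each weight independently, so the duplicate gets the
# same weight and the pair's combined influence is w1 + w2 = 2 * w1.
print(w1, w2, w1 + w2)
```

Logistic regression, fitting all weights jointly, would instead spread the weight across the two copies so the pair's combined influence stays roughly the same as a single copy's.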

Aug 09

## Companies Hiring the Most Data Scientists

The number of job postings for data scientists is growing at a fast pace, and it's still outpaced by the number of people searching for data scientists.

But there is one problem with job post data, and that is that "data scientist" is a loosey-goosey term. Generally speaking, practitioners are in demand for applying statistical tools for analysis, predictive modeling and programming. Oh, and having a certain artistic flair to guide how results are visualized is a definite plus. But ask a dozen hiring managers and you may get a dozen different views. We've tried to cater to all those views to come up with this article.

The companies highlighted in the table below employ the most data scientists; together they account for close to 8% of all data scientists in the world, as per our analysis of LinkedIn data. Keep in mind that this is not a statistical analysis and is based on user-generated content, which in many cases isn't very accurate. But at the same time, it gives you a fair idea of the top players in the industry and will help you plan your career moves.

| Company | Industry |
| --- | --- |
| Microsoft Corporation | Technology |
| IBM | Technology and Consulting |
| Facebook | Online Social Networking |
| Google | Internet-related Services and Products |
| GSK | Pharmaceutical |
| Apple | Consumer Electronics, Computer Software and Online Services |
| Capital One | Bank Holding Company |
| Booz Allen Hamilton | Management Consulting |
| Novartis | Pharmaceutical |
| HP | Information Technology |
| LinkedIn | Business-oriented Social Networking |
| Amazon | Electronic Commerce and Cloud Computing |
| Nielsen N.V. | Information and Measurement |
| Oracle | Technology |
| Accenture | Consulting |
| Intel | Technology |
| Roche | Pharmaceuticals and Diagnostics |
| SAS | Analytics Software |
| Uber | Online Taxi Dispatch |
| Stanford University | Education and Research |
| Twitter | Online Social Networking |
| SAP | Enterprise Software |
| Pfizer | Pharmaceutical |
| Cognizant | Information Technology, Consulting |
| Capgemini | Information Technology, Consulting |
| Quintiles | Pharmaceutical |
| Teradata | Technology |
| Airbnb | Online Lodging Provider |
| AT&T | Telecommunications |
| PayPal | Online Payments System |
| Netflix | Online Streaming |
| Cisco | Technology |
| Deloitte | Professional Services |
| EMC | Technology |
| American Express | Financial Services |
| Yahoo | Internet, Search Engine |
| eBay | e-commerce |
| Ford Motor Company | Automaker |
| Groupon | e-commerce Marketplace |
| Dell | Technology |
| Citi | Investment Banking and Financial Services |
| Intuit | Technology (Software) |
| Walmart | Retail |
| Bank of America | Banking and Financial Services |
| Adobe | Computer Software |
| Merck | Pharmaceutical |
| Civis Analytics | Consulting |
| Karvy Analytics | Consulting |
| Salesforce | Cloud Computing |
| Allstate | Personal Lines Insurer |
| PWC | Professional Services |
| Tata Consultancy Services | Information Technology |

You might have noticed that most of the companies listed here are technology companies with a B2B business model, or internet companies. One may wonder why traditional companies with huge data and large workforces don't have as many data scientists. There can be many reasons for companies to hire fewer data scientists, and that doesn't mean they are missing out on the big data analytics action:

1. They outsource much of their data management to information technology companies or consulting companies.

2. Many of them don't use the data scientist designation, as the work performed involves more business knowledge and less of tools and algorithms.

You may also refer to our analysis comparing the number of data scientists with the number of business analytics professionals in various companies, which explains why some companies famous for their analytics-oriented business models have fewer data scientists on their payroll:

Business analytics professionals are more numerous in conventional businesses like banks, retail, etc. These companies place more emphasis on having people who understand their business well and can apply analytical models to improve it when needed; they'll probably prefer outsourcing their hardcore data analytics jobs to an outside vendor. Typical designations in these companies are Research Analyst, Business Analytics Consultant, Business Analytics Manager, Business Analytics Lead, etc.

Please feel free to suggest other companies that are hiring lots of data scientists by commenting on this post.

Aug 07

## Data Analytics vs Data Science: Master the Jargon

First things first: doing stuff with data, whatever you want to call it, is going to require some investment. Fortunately the entry price has come right down, and you can do pretty much all of this at home with a reasonably priced machine and online access to a host of free or purchased resources. Commercial organizations have realized that there is huge value hiding in their data and are employing the techniques discussed here to realize that value. Ultimately, what all of this work produces is insights: things that you may not have known otherwise. Insights are the items of information that cause a change in behavior.

Let's begin with a real-world example: a farm that is growing strawberries.

What would a farmer need to consider when growing strawberries? The farmer will be selecting the types of plants, fertilizers and pesticides, and also looking at machinery, transportation, storage and labor. Weather, water supply and pestilence are also likely concerns. Ultimately the farmer is also watching the market price, so supply and demand and the timing of the harvest (which will determine the dates to prepare the soil, to plant, to thin out the crop, to nurture and to harvest) are concerns as well.

So the objective of all the data work is to create insights that will help the farmer make a set of decisions that will optimize their commercial growing operation.

Let’s think about the data available to the farmer, here’s a simplified breakdown:

1. Historic weather patterns

2. Plant breeding data and productivity for each strain

3. Fertilizer specifications

4. Pesticide specifications

5. Soil productivity data

6. Pest cycle data

7. Machinery cost, reliability, fault and cost data

8. Water supply data

9. Historic supply and demand data

10. Market spot price and futures data

Now to explain the definitions in context (with some made-up insights, so if you’re a strawberry farmer, this might not be the best set of examples):

## Big Data

Using all of the data available to provide new insights into a problem. Traditionally the farmer may have made decisions based on only a few of the available data points, for example selecting the breeds of strawberries that had the highest yield for their soil and water table. The Big Data approach may show that the market price slightly earlier in the season is a lot higher, and that local weather patterns are such that a new breed variation of strawberry would do well. So the insight would be that switching to a new breed would allow the farmer to take advantage of higher prices earlier in the season, when the cost of labor, storage and transportation would also be slightly lower. There's another thing you might hear in the Big Data marketing hype: Volume, Velocity, Variety, Veracity. There is a huge amount of data here (volume); a lot of it is being generated each minute, such as weather readings, stock prices and machine sensor data (velocity); new kinds of data are liable to appear at any time, e.g. a new source of social media data that is a great predictor of consumer demand (variety); and the data is of varying accuracy (veracity).

## Data Analysis

Analysis is really a heuristic activity: scanning through all the data, the analyst gains some insight. Looking at a single data set, say the one on machine reliability, I might be able to say that certain machines are expensive to purchase but have fewer general operational faults, leading to less downtime and lower maintenance costs, while other, cheaper machines are more costly in the long run. The farmer might not have enough working capital to afford the expensive machine and would have to decide whether to purchase the cheaper machine, incurring the additional maintenance costs and risking the downtime, or to borrow money, with the interest payments, to afford the expensive machine.
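
To make that trade-off concrete, here is a minimal sketch with entirely made-up cost figures:

```python
# Hypothetical figures for the two machines discussed above.
def total_cost(purchase, annual_maintenance, downtime_cost_per_year, years):
    """Simple total cost of ownership over the machine's working life."""
    return purchase + years * (annual_maintenance + downtime_cost_per_year)

expensive = total_cost(purchase=50_000, annual_maintenance=1_000,
                       downtime_cost_per_year=500, years=10)
cheap = total_cost(purchase=20_000, annual_maintenance=4_000,
                   downtime_cost_per_year=3_000, years=10)

print(expensive)  # 65000
print(cheap)      # 90000: cheaper up front, costlier in the long run
```

On these (invented) numbers the cheaper machine costs more over ten years, which is exactly the kind of insight the analyst would surface from the reliability data.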

## Data Analytics

Analytics is about applying a mechanical or algorithmic process to derive insights, for example running through various data sets looking for meaningful correlations between them. Looking at the weather data and pest data, we see that there is a high correlation with a certain type of fungus when the humidity level reaches a certain point. The weather projections for the next few months (during planting season) predict a low humidity level and therefore a lowered risk of that fungus. For the farmer this might mean being able to plant a certain type of strawberry with a higher yield and a higher market price, and not needing to purchase a certain fungicide.
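
A minimal sketch of this kind of correlation hunt, using made-up humidity and fungus readings and a hand-rolled Pearson correlation:

```python
# Hypothetical weekly readings: humidity (%) and observed fungus cases.
humidity = [40, 45, 55, 60, 70, 75, 80, 85]
fungus   = [ 0,  1,  1,  2,  4,  5,  7,  8]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(humidity, fungus)
print(round(r, 2))  # strongly positive on this made-up data
```

A coefficient near 1 flags the humidity–fungus pair as worth a closer look; in practice an analytics pipeline would scan many such pairs automatically.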

## Data Mining

This term was most widely used in the late '90s and early '00s, when a business would consolidate all of its data into an Enterprise Data Warehouse. All of that data was brought together to discover previously unknown trends, anomalies and correlations, such as the famed 'beer and diapers' correlation (Diapers, Beer, and data science in retail). Going back to the strawberries, assuming that our farmer was a large conglomerate like Cargill, all of the data above would be sitting in the warehouse ready for analysis, so questions such as these could be answered with relative ease: What is the best time to harvest strawberries to get the highest market price? Given certain soil conditions and rainfall patterns at a location, what are the highest-yielding strawberry breeds that we should grow?

## Data Science

A combination of mathematics, statistics, programming, the context of the problem being solved, ingenious ways of capturing data that may not be captured right now, the ability to look at things 'differently' (like this: Why UPS Trucks Don't Turn Left), and of course the significant and necessary activity of cleansing, preparing and aligning the data. So in the strawberry industry we're going to build some models that tell us the optimal time to sell, which gives us the time to harvest, which gives us a combination of breeds to plant at various times to maximize overall yield. We might be short of consumer demand data, so maybe we figure out that when strawberry recipes are published online or on television, demand goes up, and Tweets and Instagram or Facebook likes provide an indicator of demand. Then we need to align the demand data with the market price to give us the final insights, and maybe create a way to drive up demand by promoting certain social media activity.

## Machine Learning

This is one of the tools used by data scientists: a model is created that mathematically describes a certain process and its outcomes; the model then provides recommendations, monitors the results once those recommendations are implemented, and uses the results to improve itself. When Google provides a set of results for the search term "strawberry", people might click on the first three entries and ignore the fourth one; over time, that fourth entry will not appear as high in the results, because the machine is learning what users respond to. Applied to the farm: when the system creates recommendations for which breeds of strawberry to plant, and collects the results on the yields for each berry under various soil and weather conditions, machine learning will allow it to build a model that makes a better set of recommendations for the next growing season.
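
The search-ranking feedback loop described above can be sketched as follows (an illustrative toy, not how Google actually ranks results):

```python
# Illustrative feedback loop: results are ranked by an estimated
# click-through rate (CTR) that is updated from user behavior.
results = {r: [0, 0] for r in "ABCD"}  # result -> [clicks, shows]

def record(shown, clicked):
    """Update counts after one search: which results were shown/clicked."""
    for r in shown:
        results[r][1] += 1
        if r in clicked:
            results[r][0] += 1

def ranked():
    # Rank by smoothed CTR: (clicks + 1) / (shows + 2).
    return sorted(results,
                  key=lambda r: (results[r][0] + 1) / (results[r][1] + 2),
                  reverse=True)

# Users repeatedly click the first three results and ignore "D" ...
for _ in range(20):
    record(shown=["A", "B", "C", "D"], clicked={"A", "B", "C"})

# ... so "D" drops to the bottom of the ranking over time.
print(ranked())  # ['A', 'B', 'C', 'D']
```

The same counting-and-updating pattern carries over to the farm: each season's yield results update the model that recommends next season's breeds.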

I am adding this next one because there seem to be some popular misconceptions as to what it means. My belief is that 'predictive' is much overused and hyped.

## Predictive Analytics

Creating a quantitative model that allows an outcome to be predicted based on as much historical information as can be gathered. In this input data there will be multiple variables to consider, some of which may be significant and others less significant in determining the outcome. The predictive model determines which signals in the data can be used to make an accurate prediction. The models become useful if there are certain variables that can be changed to increase the chances of a desired outcome.

So what might our strawberry farmer want to predict? Let's go back to the commercial strawberry grower who is selling product to grocery retailers and food manufacturers: the supply deals are in the tens and hundreds of thousands of dollars, and there is a large salesforce. How can they predict whether a deal is likely to close or not? To begin with, they could look at the history of that company and the quantities and frequencies of produce purchased over time, with the most recent purchases being the stronger indicators. They could then look at the salesperson's history of selling that product to those types of companies. Those are the obvious indicators. Less obvious ones would be which competing growers are also bidding for the contract (perhaps certain competitors always win because they always undercut), how many visits the rep has paid to the prospective client over the year, and how many emails and phone calls. How many product complaints has the prospective client made regarding product quality? Have all our deliveries been the correct quantity, delivered on time? All of these variables may contribute to the next deal being closed.

If there is enough historical data, we can build a model that will predict whether a deal will close or not. We can use a sample of the historic data, set aside, to test whether the model works. If we are confident, then we can use it to predict the next deal.
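
As a minimal sketch (synthetic deal data and a toy scoring rule standing in for a fitted model), holding out part of the history to test the predictor might look like this:

```python
import random

random.seed(0)

# Hypothetical historic deals: (recent_purchases, rep_visits, complaints) -> closed.
# The "true" outcome is a noisy function of the variables, as real deals would be.
def make_deal():
    recent = random.randint(0, 10)     # recent purchase count
    visits = random.randint(0, 12)     # rep visits this year
    complaints = random.randint(0, 5)  # product complaints
    closed = recent + visits - 3 * complaints + random.gauss(0, 2) > 5
    return recent, visits, complaints, closed

deals = [make_deal() for _ in range(200)]
train, holdout = deals[:150], deals[150:]  # set some history aside for testing

def predict(recent, visits, complaints):
    # Toy scoring rule standing in for a model fitted on `train`.
    return recent + visits - 3 * complaints > 5

accuracy = sum(predict(r, v, c) == closed
               for r, v, c, closed in holdout) / len(holdout)
print(f"holdout accuracy: {accuracy:.2f}")
```

Because the held-out deals were never used to build the scoring rule, the holdout accuracy is an honest estimate of how the model would do on the next deal.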

A note about the author: Gam Dias is a data strategist and the founder of Mo-Data.com. He has a 10-year track record of success in building enterprise data strategies, analytic products and data-driven transformations for global multi-billion-dollar organizations. Here's a link to his LinkedIn profile.

This answer was originally published in Quora by Gam Dias.