Category Archives for "Big Data Trends"

Oct 08

Open TLC Data Reveals the Taxi Industry’s Contraction, Uber’s Growth

By ganpati | Big Data Trends


By Todd W. Schneider

The New York City Taxi & Limousine Commission publishes summary reports that include aggregate statistics about taxi, Uber, and Lyft usage. These are in addition to the trip-level data that I wrote about previously; although the summary reports contain much less detail, they’re updated more frequently, which provides a more current glimpse into the state of the cutthroat NYC taxi market.

Taxi data is currently available through July 31, 2016; Uber/Lyft data is currently available through July 30, 2016.

This post provides a thorough analysis of Trips Per Day in NYC: Taxi vs. Uber vs. Lyft.

Some important insights from the analysis: yellow taxis provided 60,000 fewer trips per day in January 2016 than one year earlier, while Uber provided 70,000 more trips per day over the same period.

The summary reports also include the total number of vehicles dispatched by each service: as of January 2016 there are just over 13,000 yellow taxis in New York, a number that is strictly regulated by the taxi medallion system. Uber has grown from 10,000 vehicles dispatched per week at the beginning of 2015 to over 25,000 in January 2016, while Lyft accounts for another 5,000.
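The year-over-year comparisons above are simple differences of average daily trip counts. As a minimal sketch of that arithmetic (the absolute trip levels below are illustrative placeholders chosen so the differences match the figures quoted above; the dictionary layout is hypothetical, not the TLC's actual report schema):

```python
# Sketch: year-over-year change in average daily trips, TLC-summary style.
# Levels are illustrative; only the differences reflect the figures above.

trips_per_day = {
    # service -> {month: average trips per day}
    "yellow": {"2015-01": 470_000, "2016-01": 410_000},
    "uber":   {"2015-01": 90_000,  "2016-01": 160_000},
}

def yoy_change(service, prev="2015-01", curr="2016-01"):
    """Return the absolute change in average daily trips between two months."""
    months = trips_per_day[service]
    return months[curr] - months[prev]

print(yoy_change("yellow"))  # -60000: yellow cabs lost ~60k trips/day
print(yoy_change("uber"))    # +70000: Uber gained ~70k trips/day
```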


Sep 24

R with Power BI: Import, Transform, Visualize and Share

By ganpati | Big Data Trends

By David Smith

Power BI, Microsoft’s data visualization and reporting platform, has made great strides in the past year integrating the R language. This Computerworld article describes the recent advances with Power BI and R. In short, you can:

  • Import data into Power BI by using an R script
  • Cleanse and transform other data sources coming into Power BI using R functions
  • Create custom charts in a Power BI dashboard using the R language, like these maps
  • Share R scripts with others for use with Power BI in the R Scripts Showcase
  • Create dashboards with Power BI Desktop and R on your local machine, and share them with others using Power BI Online


Sep 17

4 Inefficiencies Affecting Data Scientists

By ganpati | Big Data Trends

Data scientists reportedly spend 80% of their time on data preparation. But that lengthy process is not a sign of inefficiency. That is their JOB!

If you are not good at data preparation, you are NOT a good data scientist. Period.

The validity of any analysis rests almost completely on the preparation; the algorithm you end up using is close to irrelevant. Complaining about data preparation is like being a farmer who wants to do nothing but harvest, and would rather have somebody else deal with the pesky watering, fertilizing, and weeding.

Raw Data Collection

That being said, data preparation can be made harder by the raw data collection process. Designing a system that collects data in a form that is useful and easily digestible for data science is a high art; providing data scientists full transparency into how exactly data flows into the system is another. It involves processes such as sampling, data annotation, and matching. It does not include things like replacing missing values or excessive normalization. Creating an effective data environment for data science needs to involve data scientists and cannot be entirely owned by engineering: data science is often NOT able to spec such system requirements in sufficient detail to allow a clean handover.
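To make the collection-vs-preparation distinction concrete, here is a hedged sketch of the kind of downstream cleanup data scientists inherit when decisions like missing-value handling are left out of collection-time design. All field names, values, and rules below are hypothetical, invented purely for illustration:

```python
# Hypothetical sketch: downstream preparation of raw collected events.
# Field names and cleaning rules are illustrative, not from any real system.

raw_events = [
    {"user_id": "u1", "spend": "12.50", "country": "US"},
    {"user_id": "u1", "spend": None,    "country": "US"},   # missing value
    {"user_id": "u2", "spend": "3,40",  "country": "de"},   # locale-dependent format
]

def clean(event):
    """Normalize one raw event into an analysis-ready record."""
    spend = event["spend"]
    if spend is None:
        spend = 0.0  # imputation is a modeling choice, best made by data science,
    else:            # not fixed silently at collection time by engineering
        spend = float(spend.replace(",", "."))
    return {
        "user_id": event["user_id"],
        "spend": spend,
        "country": event["country"].upper(),
    }

records = [clean(e) for e in raw_events]
print(records[2])  # {'user_id': 'u2', 'spend': 3.4, 'country': 'DE'}
```

The point of the sketch: the `clean` step embeds analytical judgment (what to impute, how to parse locale formats), which is why the data environment cannot be specced and owned by engineering alone.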

Irrelevant Problems

But in the bigger picture, there are more important things to consider. By far the biggest issue I see is data scientists solving irrelevant problems. This is a huge waste of time and energy. The reason is typically that whoever has the problem lacks the data science understanding to even express the issue, and data scientists end up solving whatever they understood the problem might be, ultimately creating a solution that is not really helpful (and often far too complicated). A typical category is the "under-defined" task: "Find actionable insights in this dataset!" Well, most data scientists do not know which actions can be taken. They also do not know which insights are trivial versus interesting. So there is really no point sending them on a wild goose chase.

Wild Goose Chase

"Solving the wrong problem" is pervasive in part because data science is not sufficiently involved in the decision process (thanks to Meta for asking me to clarify). Now, not EVERY data scientist can or should be expected to shape the problem as well as the solution (back to the unicorn problem), but at least one data scientist on the team should. The bigger issue, however, is not a lack of ability or willingness on the data science side (although there are plenty who just like to solve a cute problem, no matter how relevant) but rather a corporate culture in which analytics, IT, etc. are considered "execution" functions: management decides what is needed, and everybody else goes and does it.

Lack of Skepticism

On an individual level, for a given (worthwhile) problem, I would blame lack of data understanding, data intuition, and finally skepticism as the most limiting factors for efficiency. What makes these factors contribute to inefficiency is NOT that it takes longer to get to an answer (in fact, lacking all three typically leads to results much more quickly) but rather how long it takes to get to an (almost) right answer.

Inspired by a Quora post by Claudia Perlich, Chief Scientist at Dstillery and Adjunct Professor at NYU.