Secret Sauce Behind 9 Kaggle Winning Ideas

By ganpati | Getting Started

Aug 13

Restaurant Revenue Prediction

Competition brief

TFI has provided a dataset with 137 restaurants in the training set, and a test set of 100000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: Demographic data, Real estate data, and Commercial data. The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis.

Read more

Feature engineering:
i) A square root transformation was applied to the target variable “revenue” and to the obfuscated P variables whose maximum value was >= 10, to bring them onto a similar scale.

ii) Uncommon city levels were randomly reassigned to the common city levels in both the training and test sets, which diversified the geographic information contained in the city variable and in some of the obfuscated P variables.

iii) A missing value indicator was created for several P variables (P14 to P18, P24 to P27, and P30 to P37) to help differentiate synthetic from real test data.
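
As a rough illustration of steps i) and iii), here is a minimal Python sketch (the winning solution itself used R). The helper names, the toy values, the single `p_missing` flag, and the assumption that a missing value is coded as 0 are all hypothetical and not taken from the solution.

```python
import math

def sqrt_transform_large_columns(rows, columns, threshold=10):
    """Apply sqrt to each listed column whose maximum is >= threshold."""
    out = [dict(r) for r in rows]
    for col in columns:
        if max(r[col] for r in rows) >= threshold:
            for r in out:
                r[col] = math.sqrt(r[col])
    return out

def add_missing_indicator(rows, columns, missing_value=0):
    """Add a 0/1 flag: 1 if every listed column equals missing_value.

    Treating 0 as "missing" is an assumption made for this sketch.
    """
    out = [dict(r) for r in rows]
    for r in out:
        r["p_missing"] = int(all(r[c] == missing_value for c in columns))
    return out

# Toy rows with made-up values.
rows = [
    {"P1": 4, "P2": 25, "P14": 0, "P15": 0, "revenue": 4000000.0},
    {"P1": 2, "P2": 9, "P14": 3, "P15": 1, "revenue": 2250000.0},
]

# P1 never reaches 10, so only P2 is transformed.
rows = sqrt_transform_large_columns(rows, ["P1", "P2"])
# The target is transformed as well.
for r in rows:
    r["revenue"] = math.sqrt(r["revenue"])
rows = add_missing_indicator(rows, ["P14", "P15"])
```

If the model is trained on the square root of revenue, its predictions have to be squared before submission to return to the original scale.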

Models Used: Gradient boosting

Tools and packages: R, with the mice package for imputation, lubridate for extracting date-related features, and caret for modelling.

Link to solution

Read winner’s interview

Crowdflower Search Results Relevance

To evaluate search relevancy, CrowdFlower has had their crowd evaluate searches from a handful of eCommerce websites. A total of 261 search terms were generated, and CrowdFlower put together a list of products and their corresponding search terms. Each rater in the crowd was asked to give a product search term a score of 1, 2, 3, or 4, with 4 indicating the item completely satisfies the search query, and 1 indicating the item doesn’t match the search term.

Read more

Link to solution

Otto Group Product Classification Challenge

Each row in the data corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further.

Read more

Feature engineering: Predictions from about 33 first-level models were used as meta features for the second level, along with 8 engineered features. Refer to the solution for details.
Models Used: Three models were trained on 33 meta features + 7 features from the first level: XGBoost, a neural network (NN), and AdaBoost with ExtraTrees.
Tools and packages: According to the winner’s blog, other tools were also tried, including Vowpal Wabbit (many configurations), R glm, glmnet, and scikit-learn SVC, SVR, Ridge, and SGD, but none of these improved performance on the second level. Preprocessing such as PCA, ICA, and FFT also brought no improvement.
Link to solution
Read winner’s interview
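
To make the idea of using first-level predictions as meta features concrete, here is a minimal sketch of out-of-fold prediction in plain Python. It assumes the meta features are produced out of fold, which is the usual way to build a stacked ensemble but is not spelled out above. The function names, the fold scheme, and the toy "mean label" model are hypothetical stand-ins for real first-level models.

```python
def out_of_fold_predictions(X, y, fit, predict, n_folds=3):
    """Return one out-of-fold prediction per row.

    Each row is predicted by a model trained on the other folds, so the
    second-level model never sees a prediction that used the row's own label.
    """
    n = len(X)
    oof = [None] * n
    for k in range(n_folds):
        holdout = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in holdout:
            oof[i] = predict(model, X[i])
    return oof

# Hypothetical first-level "model": predicts the mean label of its training fold.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

X = [[1, 0], [0, 2], [3, 1], [0, 0], [2, 2], [1, 3]]
y = [1, 0, 1, 1, 0, 0]

meta = out_of_fold_predictions(X, y, fit_mean, predict_mean)

# Second-level input: the original features with the meta feature appended.
X_level2 = [x + [m] for x, m in zip(X, meta)]
```

With many first-level models, each one contributes its own column of out-of-fold predictions, and the second-level models are trained on the resulting table.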

Walmart Recruiting II: Sales in Stormy Weather

You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, umbrellas, etc.). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are nearby and share a weather station).

The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.
Read more
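
The weather-event definition and the ±3 day window can be sketched in a few lines of Python. The function names and the toy weather readings are hypothetical, and the sketch reads the definition as "more than one inch of rain or more than two inches of snow".

```python
from datetime import date, timedelta

def is_weather_event(rain_inches, snow_inches):
    """More than an inch of rain or more than two inches of snow."""
    return rain_inches > 1.0 or snow_inches > 2.0

def scoring_dates(daily_weather, window_days=3):
    """Return the set of dates within +/- window_days of any weather event.

    daily_weather maps a date to a (rain_inches, snow_inches) tuple.
    """
    dates = set()
    for day, (rain, snow) in daily_weather.items():
        if is_weather_event(rain, snow):
            for offset in range(-window_days, window_days + 1):
                dates.add(day + timedelta(days=offset))
    return dates

# Made-up readings for one weather station.
weather = {
    date(2013, 3, 10): (0.2, 0.0),  # light rain: no event
    date(2013, 3, 15): (1.4, 0.0),  # more than an inch of rain: event
    date(2013, 3, 25): (0.0, 2.0),  # exactly two inches of snow: no event
}
window = scoring_dates(weather)
```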

Feature engineering:
– Weekday is the most important feature.
– Some store/item combinations show monthly periodicity.
– Sales fluctuate a lot around Black Friday.
– Weather features are almost entirely ineffective.
– In the data, people go shopping as usual however much it rains.

Models and Tools Used: Linear regression using Vowpal Wabbit with many features.
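
The calendar effects listed above lend themselves to simple indicator features for a linear model. The sketch below is only an illustration: the feature names and the 3-day Black Friday window are hypothetical choices, not the winner's actual feature set.

```python
from datetime import date, timedelta

def black_friday(year):
    """Day after the fourth Thursday of November."""
    first = date(year, 11, 1)
    # Days until the first Thursday (Monday = 0, so Thursday = 3).
    first_thursday = first + timedelta(days=(3 - first.weekday()) % 7)
    return first_thursday + timedelta(days=22)

def calendar_features(day, black_friday_window=3):
    """Sparse one-hot style features for a single date."""
    feats = {
        f"weekday_{day.weekday()}": 1,  # 0 = Monday ... 6 = Sunday
        f"month_{day.month}": 1,
    }
    # Flag days close to Black Friday; the window width is an assumption.
    if abs((day - black_friday(day.year)).days) <= black_friday_window:
        feats["near_black_friday"] = 1
    return feats

feats = calendar_features(date(2013, 11, 29))
```

In practice these indicators would be crossed with store and item identifiers, since the periodicity differs between store/item combinations.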
Link to solution

How Much Did It Rain?

In this competition, you are given polarimetric radar values and derived quantities at a location over the period of one hour. You will need to produce a probabilistic distribution of the hourly rain gauge total, i.e., produce P(y ≤ Y), where y is the rain accumulation and Y lies between 0 and 69 mm (both inclusive) in increments of 1 mm. For every row in the dataset, submission files should contain 71 columns: Id and 70 numbers.
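
To show what the 70 numbers per row look like, here is a small Python sketch. It reads the requested distribution as the cumulative probability P(y ≤ Y) at each threshold Y = 0, 1, …, 69 mm, which is an assumption. The function names and the sample values are hypothetical.

```python
def step_cdf(gauge_mm, thresholds=range(70)):
    """CDF of a single known gauge reading: 0 below it, 1 from it upward."""
    return [1.0 if gauge_mm <= Y else 0.0 for Y in thresholds]

def empirical_cdf(samples_mm, thresholds=range(70)):
    """Fraction of sampled rain totals at or below each threshold."""
    n = len(samples_mm)
    return [sum(s <= Y for s in samples_mm) / n for Y in thresholds]

# Hypothetical predicted rain totals (mm) for one row, e.g. from several models.
samples = [0.0, 0.0, 1.5, 3.0]
cdf = empirical_cdf(samples)

# One submission row: the Id followed by the 70 cumulative probabilities.
row = [1] + cdf
```

Any valid row is non-decreasing from left to right, since a cumulative probability cannot fall as the threshold grows.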

Understanding the data

The training data consists of NEXRAD and MADIS data collected the first 8 days of Apr to Nov 2013 over midwestern corn-growing states. Time and location information have been censored, and the data have been shuffled so that they are not ordered by time or place. The test data consists of data from the same radars and gauges over the same months but in 2014.
Link to competition
Feature engineering and models: FM and FTRL
Link to solution

ECML/PKDD 15: Taxi Trip Time Prediction (II)

The competition provides an accurate dataset describing a complete year (from 01/07/2013 to 30/06/2014) of the trajectories of all 442 taxis running in the city of Porto, Portugal (i.e. one CSV file named “train.csv”). These taxis operate through a taxi dispatch central, using mobile data terminals installed in the vehicles. We categorize each ride into three categories: A) taxi central based, B) stand-based or C) non-taxi central based. For the first, we provide an anonymized id, when such information is available from the telephone call. The last two categories refer to services that were requested directly from the taxi drivers at B) a taxi stand or on C) a random street.

Each data sample corresponds to one completed trip. It contains a total of 9 (nine) features.

The Hunt for Prohibited Content

Data for this competition consists mainly of Russian text. All files are encoded in UTF-8 and are in tab-separated format (.tsv). To help you transform Russian text into a set of features, there is introductory code that recommends which Python modules to use.

Also note that uncompressed training and test data together take ~4GB of space.

Training and Test data sets consist of individual ads that have either been blocked for illicit content or that have never been blocked. All ads that participate in this competition have already been closed.

External data is allowed in this competition with approval.

Read more

Models used: A series of SGD on pieces of the text data (one for title, one for description, one for title+description, one for attrs)

Feature engineering: Feature engineering provided little to no value. A number of features were derived from the text (e.g. counts of “!” and “?”, mixed words, length of text, …) and none seemed to add much. For the attributes feature, anything fancier than running it through TF-IDF added nothing.
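
For reference, hand-crafted text features of the kind mentioned above could look like the following Python sketch. The feature names are hypothetical, and "mixed words" is interpreted here as words that combine Cyrillic and Latin letters, which is an assumption about what the author meant.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")

def text_features(text):
    """Simple count-based features for one ad text."""
    words = text.split()
    # A word is "mixed" if it contains both Cyrillic and Latin letters.
    mixed = sum(1 for w in words if CYRILLIC.search(w) and LATIN.search(w))
    return {
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        "n_mixed_words": mixed,
        "length": len(text),
    }

# "pr\u043educt" hides a Cyrillic "о" inside an otherwise Latin word.
feats = text_features("\u041a\u0443\u043f\u0438! pr\u043educt ok?")
```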

Solution thread

Liberty Mutual Group – Fire Peril Loss Cost

This data represents almost a million insurance records and the task is to predict a transformed ratio of loss to total insured value (called “target” within the data set). The provided features contain policy characteristics, information on crime rate, geodemographics, and weather.

The train and test sets are split randomly. For each id in the test set, you must predict the target using the provided features.

Read more

Link to solution

Galaxy Zoo – The Galaxy Challenge

The first column in each solution is labeled GalaxyID; this is a randomly-generated ID that only allows you to match the probability distributions with the images. The next 37 columns are all floating point numbers between 0 and 1 inclusive. These represent the morphology (or shape) of the galaxy in 37 different categories as identified by crowdsourced volunteer classifications as part of the Galaxy Zoo 2 project. These morphologies are related to probabilities for each category; a high number (close to 1) indicates that many users identified this morphology category for the galaxy with a high level of confidence. Low numbers for a category (close to 0) indicate the feature is likely not present.

Link to solution
