This is a step-by-step guide for beginners to install R and its important packages, and to use the available datasets to start their analytics journey with R programming.
Base R is available for download under an open source license and can be installed on Unix-like, Windows and Mac operating systems. Detailed guidelines for download, installation, usage and help are available on the CRAN website.
CRAN stands for the “Comprehensive R Archive Network”.
If you have any questions regarding current versions, system compatibility, installation guidelines etc., the answers are available in the CRAN FAQ section.
You should also install RStudio along with base R. RStudio is an IDE (Integrated Development Environment) for R. It makes your programming easier and faster once you are familiar with its interface.
RStudio is also available for free download and the guidelines are available here.
I use both R and RStudio for R programming, depending on the problem at hand. Since RStudio offers better ways to handle plots, install and maintain packages and navigate code, I use it whenever I have to do a lot of visualizations on the data.
The true power of R lies in its packages. Packages are collections of R functions, data, and compiled code in a well-defined reusable format. The directory where packages are stored is called the library.
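To make the idea of a library concrete, here is a minimal sketch that asks R where its library directories are and which packages are already installed there. The variable names lib_dirs and installed are my own; the output will differ from machine to machine.

```r
# Directories R searches for installed packages (the "library" locations).
lib_dirs <- .libPaths()
print(lib_dirs)

# Names of all packages currently installed in those directories.
installed <- rownames(installed.packages())
print(head(installed))
```

On a fresh installation the list is short and contains mostly the packages that ship with R itself.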
It’s difficult to find a task that some package doesn’t already handle. There are thousands of packages out there, written by people who needed the functionality and published their work. You can easily add these packages within R with just a couple of commands.
Here are some powerful R packages that I found very useful: ggplot2, party, Hmisc, car, MASS, plyr, rattle, rpart.plot, RColorBrewer, xgboost, DMwR (for knnImputation), stringr, gbm, recommenderlab, randomForest
Installing them is a one-time exercise. The command to be executed is: install.packages(‘
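As a minimal sketch of that one-time step, the snippet below checks which packages are missing and installs only those. The helpers missing_packages and install_missing are hypothetical names I made up for illustration, and the three packages in wanted are just a sample from the list above. Nothing is downloaded until you call install_missing yourself.

```r
# Packages to check; this short list is only an example taken from the post.
wanted <- c("stringr", "plyr", "randomForest")

# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a hypothetical helper, not part of R.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Install only what is missing. Wrapped in a function so that nothing is
# downloaded until you call it yourself.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Usage (needs an internet connection and a CRAN mirror):
# install_missing(wanted)
```

Checking first avoids reinstalling packages you already have every time the script runs.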
Once all the important packages are installed, you just have to load the appropriate package in your program, based on what you require, using the library() or require() function, which you’ll see in the tutorials going forward.
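Here is a small sketch of the difference between the two loading functions. I use the tools package because it ships with R, so the example should run without installing anything; "notARealPackage123" is a made-up name used only to show what happens when a package is missing.

```r
# library() attaches a package and stops with an error if it is missing.
library(tools)
ext <- file_ext("income.csv")   # file_ext() comes from the tools package
print(ext)

# require() returns TRUE or FALSE instead of stopping, so it can be tested.
# "notARealPackage123" is a made-up name used to show the failure case.
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
print(ok)
```

In scripts, library() is usually the safer choice because a missing package fails immediately instead of causing a confusing error later.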
Be careful about capital letters while using functions or packages, as R is a case-sensitive language.
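A quick sketch of what case sensitivity means in practice. The variable income is just an illustration, and MEAN is deliberately a function that does not exist in base R.

```r
income <- 50000

# `income` exists, but `Income` is a different (undefined) name.
lower_found <- exists("income")
upper_found <- exists("Income")
print(c(income = lower_found, Income = upper_found))

# The same applies to functions: mean() exists in base R, MEAN() does not.
avg <- mean(c(10, 20, 30))
msg <- tryCatch(MEAN(c(10, 20, 30)), error = function(e) conditionMessage(e))
print(avg)
print(msg)
```

The same rule applies to package names, so library(MASS) and library(mass) are not interchangeable.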
Once you have installed R and loaded the packages, the only thing that remains to start the analysis is data. R itself provides some 90 datasets that are very useful for practice and learning. To see the list, use the data() function.
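A minimal sketch of browsing the built-in datasets and loading one of them. I picked mtcars only as an example; any dataset from the list works the same way.

```r
# List the datasets that ship with R's "datasets" package.
ds <- data(package = "datasets")$results
print(nrow(ds))                          # how many entries are listed
print(head(ds[, c("Item", "Title")]))    # name and short description

# Load one dataset and take a first look at it.
data(mtcars)
str(mtcars)
summary(mtcars$mpg)
```

Calling data() with no arguments in an interactive session opens the same list for every package that is currently loaded.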
Apart from that, there are many other sources of data that can be explored for learning.
Though exploring data can in itself lead to useful insights, in a practical scenario we’ll first come up with a problem that needs to be solved using the data. In the next part of this post I discuss the Kaggle platform, which offers some interesting problems along with the datasets. Solving those problems will give you a fair idea of how analytics is used in businesses for decision making.
A wonderful entry point to the field of data science is the Kaggle platform. It offers a wide array of competitions to choose from, depending on your area of interest. Whether you are a trader, a scientist, a programmer with some domain experience, a student, a manager, an analyst, a researcher or just an enthusiast, chances are Kaggle has something interesting in store for you. Another interesting part of the platform is that you can submit entries to past competitions just to check where you would have been on the leaderboard. The leaderboard itself is an interesting concept that I’ve explained in a separate post. In short, it is the list of participants ranked in order of their performance.
For beginners, I’ve chosen an interesting competition that tries to predict the income of a person with the help of census data. The actual competition link is here (https://inclass.kaggle.com/c/competition-1-mipt-fivt-ml-spring-2015).
A person’s income always intrigues us. We try guessing the income of celebrities, bosses, colleagues and, more importantly, our potential partners from their occupation, education, houses, family etc. So I thought it would be interesting to work on this income prediction problem with the help of a powerful tool, R, and understand which factors can effectively predict income.
In Kaggle problems (as in most practical problem-solving scenarios) it’s important to divide the data into two sets: training data and test data. We take the dataset we have, consisting of past observations, and split it into these two parts. The training data is what we build our models on.
Once the model is built, we have to test it on a different dataset to know how it performs on data it has not seen. We test the model on the test dataset because the actual outcomes for those observations are already known, so it’s easy to compare the results predicted by the model with the actual results. Let’s take an example to make this point clear.
Let’s say we are analyzing people’s propensity to respond to a marketing campaign aimed at a particular age group. We have data on 10,000 people who have, in the past, either responded or not responded to similar campaigns. We also have the demographic and other details of all these 10,000 people.
We now have to build our model based on this data and test the model’s performance. But where can we test the model? Can we test it on the same data on which the model was built? If we do that, our model might show 100% accuracy and we will never know whether it’s overfitted. Overfitting means overemphasizing patterns in the training data that don’t generalize to other samples of the actual data.
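To make the training/test idea concrete, here is a rough sketch in base R. I don’t have real campaign data to share, so the 10,000 respondents below are simulated: the columns age, income and responded, the 70/30 split and the choice of logistic regression are all my own assumptions for illustration, not a recommended recipe.

```r
set.seed(42)  # make the random numbers reproducible

# Simulated stand-in for the 10,000 past respondents (not real data).
n <- 10000
respondents <- data.frame(
  age    = sample(18:70, n, replace = TRUE),
  income = round(rnorm(n, mean = 50, sd = 15), 1)
)
p <- plogis(-3 + 0.04 * respondents$age + 0.01 * respondents$income)
respondents$responded <- rbinom(n, 1, p)

# Hold out 30% of the rows; the model never sees them during fitting.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- respondents[train_idx, ]
test  <- respondents[-train_idx, ]

# Build the model on the training data only.
model <- glm(responded ~ age + income, data = train, family = binomial)

# Share of correct predictions, using a 0.5 probability cut-off.
accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$responded)
}

train_acc <- accuracy(model, train)
test_acc  <- accuracy(model, test)
print(c(train = train_acc, test = test_acc))
```

The number to trust is the test accuracy. If the training accuracy is much higher than the test accuracy, that gap is a sign of overfitting.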