Aug 16

Understanding Databases for Data Scientists

By ganpati | Getting Started

Businesses store the majority of their data in relational databases. These databases are exceptionally good at storing complicated business data sets and at retrieving information efficiently, so a strong understanding of them is essential to being an effective data scientist.

To start with, you should have a grasp of filters, joins, and aggregations for querying a database. Here is a post that you can follow: Filters, Joins, Aggregations, and All That: A Guide to Querying in SQL
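To make those three ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all made up for illustration; they do not come from the linked post.

```python
import sqlite3

# In-memory database with two hypothetical tables (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha', 'East'), (2, 'Ben', 'West'), (3, 'Chen', 'East');
INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 200.0), (4, 3, 50.0);
""")

query = """
SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id   -- join: match each order to its customer
WHERE c.region = 'East'                    -- filter: keep only one region
GROUP BY c.name                            -- aggregation: one output row per customer
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```

The WHERE clause is the filter, the JOIN combines the two tables, and GROUP BY with COUNT and SUM is the aggregation. The same query shape carries over to most relational databases, although the connection code will differ.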

Beyond that, an understanding of high-performance parallel databases designed to deal with large data sets (such as Teradata and HP Vertica) is helpful.

Very large data sets are typically stored and analyzed with Hadoop, which combines the Hadoop Distributed File System (HDFS) for storage with MapReduce for processing. Apache Hive implements a SQL-like query language (HiveQL) on top of MapReduce, which brings the power of SQL to Hadoop. Apache Pig and Scalding are alternatives that also sit on top of MapReduce, using a scripting language (Pig Latin) and a Scala API respectively rather than SQL.
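If MapReduce is new to you, the following single-machine sketch in plain Python may help. It does not use Hadoop at all; it only imitates the map, shuffle, and reduce steps on a toy word count, and the function names and sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

# Toy input; on a real cluster these records would be blocks of files in HDFS.
records = ["big data big ideas", "data beats ideas"]

mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Conceptually, this is the work that a query such as SELECT word, COUNT(*) ... GROUP BY word asks for, which is why a SQL layer like Hive can express it in one statement instead of hand-written map and reduce functions.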

Bill Howe’s Coursera class Introduction to Data Science has a good discussion of SQL query optimizers in its lectures on “Relational Databases, Relational Algebra”.
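You can watch an optimizer make a decision yourself. The sketch below is not taken from the course; it assumes SQLite (through Python's built-in sqlite3 module) and a hypothetical orders table, and asks the database for its plan before and after an index exists. The exact plan wording varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only to give the optimizer something to plan against.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

def plan(sql):
    """Return SQLite's plan for a query as one string (the last column holds the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

sql = "SELECT * FROM orders WHERE customer_id = 42"

before = plan(sql)  # no usable index yet, so the whole table must be scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(sql)   # the optimizer can now look rows up through the index

print("before:", before)
print("after: ", after)
```

The query text never changes; only the available access paths do, and the optimizer picks a different plan accordingly. Other databases expose the same idea through their own EXPLAIN commands.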