Programming Languages Commonly Used for Data Science

The world is packed with programming languages, each of them proclaiming their particular forte and telling you why you need to learn them. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.

Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. It’s akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:

  • Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, MatPlotLib, pandas, and Scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes it easy to use multi-processing on large datasets — reducing the time required to analyze them. The data science community has also stepped up with specialized IDEs, such as Anaconda, that implement the Jupyter Notebook concept, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation actually shows how to create the required extensions. Most Python users rely on the language to see patterns, such as allowing a robot to see a group of pixels as an object. It also sees use for all sorts of scientific tasks.
  • R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
  • SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management — the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
  • Java (general purpose): Some data scientists perform other kinds of programming that require a general purpose, widely adapted and popular, language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
  • Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM) it does have some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language — think huge dataset support. Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.

There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].

Continuing Education

I don’t know if you’ve noticed, but I’m continually asking questions in my blog posts. In fact, you can find questions in a few of my books and more than a few readers have commented when I ask them questions as part of my correspondence with them. I often get the feeling that people think I should know everything simply because I write books of various sorts. In fact, I had to write a post not long ago entitled No, I Don’t Know Everything to address the issue. Experts become experts by asking questions and finding the answers. They remain experts by asking yet more questions and finding yet more answers. Often, these answers come from the strangest sources, which means that true experts look in every nook and cranny for answers that could easily elude someone else. Good authors snoop more than even the typical expert—yes, we’re just plain nosy. So, here I am today asking still more questions.

This year my continuing education has involved working with the latest version of the Entity Framework. The results of some of my efforts can be found in Microsoft ADO.NET Entity Framework Step by Step. You can also find some of my thoughts in the Entity Framework Development Step-by-Step category. I’ve been using some of my new found knowledge to build some applications for personal use. They may eventually appear as part of a book or on this blog (or I might simply choose to keep them to myself).

However, my main technical focus has been on browser-based application technology. I think the use of browser-based application technology will make it possible for the next revolution in computing to occur. It certainly makes it easier for a developer to create applications that run anywhere and on any device. You can find some of what I have learned in two new books HTML5 Programming with JavaScript for Dummies and CSS3 for Dummies. Of course, there are blog categories for these two books as well: HTML5 Programming with JavaScript for Dummies and Developing with CSS3 for Dummies. A current learning focus is on the SCAlable LAnguage (SCALA), which is a functional language (akin to F# and many other languages of the same type) based on Java.

Anyone who knows me very well realizes that my life doesn’t center on technology. I have a great number of other interests. When it comes to being outdoors, I’ve explored a number of new techniques this year as I planted some new trees. In fact, I’ll eventually share a technique I learned for removing small tree stumps. I needed a method for removing stumps of older fruit trees in order to plant new trees in the same location.

I’ve also shared a number of building projects with you, including the shelving in our larder and a special type of dolly you can use for moving chicken tractors safely. Self-sufficiency often involves building your own tools. In some cases, a suitable tool doesn’t exist, but more often the problem is one of cost. Buying a tool from the store or having someone else build it for you might be too expensive.

The point I’m trying to make is that life should be a continual learning process. There isn’t any way that you can learn everything there is to learn. Even the most active mind picks and chooses from the vast array of available learning materials. No matter what your interests might be, I encourage you to continue learning—to continue building your arsenal of knowledge. Let me know your thoughts on the learning process at [email protected].