Programming Languages Commonly Used for Data Science

The world is packed with programming languages, each of them proclaiming their particular forte and telling you why you need to learn them. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.

Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. It’s akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:

  • Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, MatPlotLib, pandas, and Scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes it easy to use multi-processing on large datasets — reducing the time required to analyze them. The data science community has also stepped up with specialized IDEs, such as Anaconda, that implement the Jupyter Notebook concept, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation actually shows how to create the required extensions. Most Python users rely on the language to see patterns, such as allowing a robot to see a group of pixels as an object. It also sees use for all sorts of scientific tasks.
  • R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
  • SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management — the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
  • Java (general purpose): Some data scientists perform other kinds of programming that require a general purpose, widely adapted and popular, language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
  • Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM) it does have some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language — think huge dataset support. Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.

There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].

Choosing a First Language to Learn

My first programming experience (during the time of the dinosaurs) involved using a light panel to enter machine code into a rudimentary computer with 3 KB (yes, that’s KB) of RAM. The output was also in light form and I needed to decode the lights to determine if my code worked right. I worked with various systems in various ways over the next several years. By the time I got to college, the first language I learned there was BASIC (Beginner’s All-purpose Symbolic Instruction Code), then PC assembler, followed by Pascal. In fact, I’ve just stopped counting the number of languages I’ve learned over the years because each language has a place in my programmer’s toolbox. Of course, the question is what language you should learn first. I get asked that question quite often because there are a huge number of languages available today and no one wants to invest time in a language that’s going nowhere.

Part of the answer to the question of what to learn first is what you intend to do with the language. Each language has features that make it better at performing specific tasks. Programming languages can be split into those that are designed for a special purpose and those that are designed for a general purpose. A special purpose language, such as Structured Query Language (SQL), could be a good choice if you intend to move into database work immediately. However, for most people, a general purpose language works better because you can use it for a wider variety of tasks without bending yourself into a pretzel shape to do it.

A good place to start if you want to choose a language that’s popular enough to help you get a job afterward is the TIOBE index. It shows a listing of which languages are most popular today. As I’m writing this, Python is the most popular language on the list, but that could change tomorrow. Generally, any of the top ten languages on the list are good choices.

Of course, you want a programming language that is easy to learn. C/C++, C#, and Java are all complex languages with great flexibility. Furthermore, C/C++ and C# can help you work at a low level with the computer hardware. These languages have a steep learning curve and may not provide the best choices for a starting point. That said, if you have a line on a job that uses any of these languages, you could do worse than start here, just be prepared to burn the midnight oil learning.

The language I suggest people learn as a starting point is Python. In fact, Beginning Programming with Python For Dummies, 3rd Edition makes a point of showing you just how easy things can be. You don’t even need to invest in any special software, the book shows you how to use Google Colab so that you could conceivably learn how to program on your smartphone or TV. Others must agree with me because Python has turned into the language that the education industry turns to most often for budding programmers.

There are a lot of programming languages available today. You need to research the choice by taking into account what your personal needs are and what sort of job you want to get afterwards. You might find that something like JavaScript or Ruby will provide benefits that you can’t get with Python. Which language do you think will work best for you? Let me know your reasons at [email protected].

Continuing Education

I don’t know if you’ve noticed, but I’m continually asking questions in my blog posts. In fact, you can find questions in a few of my books and more than a few readers have commented when I ask them questions as part of my correspondence with them. I often get the feeling that people think I should know everything simply because I write books of various sorts. In fact, I had to write a post not long ago entitled No, I Don’t Know Everything to address the issue. Experts become experts by asking questions and finding the answers. They remain experts by asking yet more questions and finding yet more answers. Often, these answers come from the strangest sources, which means that true experts look in every nook and cranny for answers that could easily elude someone else. Good authors snoop more than even the typical expert—yes, we’re just plain nosy. So, here I am today asking still more questions.

This year my continuing education has involved working with the latest version of the Entity Framework. The results of some of my efforts can be found in Microsoft ADO.NET Entity Framework Step by Step. You can also find some of my thoughts in the Entity Framework Development Step-by-Step category. I’ve been using some of my new found knowledge to build some applications for personal use. They may eventually appear as part of a book or on this blog (or I might simply choose to keep them to myself).

However, my main technical focus has been on browser-based application technology. I think the use of browser-based application technology will make it possible for the next revolution in computing to occur. It certainly makes it easier for a developer to create applications that run anywhere and on any device. You can find some of what I have learned in two new books HTML5 Programming with JavaScript for Dummies and CSS3 for Dummies. Of course, there are blog categories for these two books as well: HTML5 Programming with JavaScript for Dummies and Developing with CSS3 for Dummies. A current learning focus is on the SCAlable LAnguage (SCALA), which is a functional language (akin to F# and many other languages of the same type) based on Java.

Anyone who knows me very well realizes that my life doesn’t center on technology. I have a great number of other interests. When it comes to being outdoors, I’ve explored a number of new techniques this year as I planted some new trees. In fact, I’ll eventually share a technique I learned for removing small tree stumps. I needed a method for removing stumps of older fruit trees in order to plant new trees in the same location.

I’ve also shared a number of building projects with you, including the shelving in our larder and a special type of dolly you can use for moving chicken tractors safely. Self-sufficiency often involves building your own tools. In some cases, a suitable tool doesn’t exist, but more often the problem is one of cost. Buying a tool from the store or having someone else build it for you might be too expensive.

The point I’m trying to make is that life should be a continual learning process. There isn’t any way that you can learn everything there is to learn. Even the most active mind picks and chooses from the vast array of available learning materials. No matter what your interests might be, I encourage you to continue learning—to continue building your arsenal of knowledge. Let me know your thoughts on the learning process at [email protected].