Programming Languages Commonly Used for Data Science

The world is packed with programming languages, each of them proclaiming their particular forte and telling you why you need to learn them. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.

Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. It’s akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:

  • Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, MatPlotLib, pandas, and Scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes it easy to use multi-processing on large datasets — reducing the time required to analyze them. The data science community has also stepped up with specialized IDEs, such as Anaconda, that implement the Jupyter Notebook concept, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation actually shows how to create the required extensions. Most Python users rely on the language to see patterns, such as allowing a robot to see a group of pixels as an object. It also sees use for all sorts of scientific tasks.
  • R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
  • SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management — the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
  • Java (general purpose): Some data scientists perform other kinds of programming that require a general purpose, widely adapted and popular, language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
  • Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM) it does have some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language — think huge dataset support. Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.

There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].

Spaces in Paths

This is an update of the previous post originally created on April 20, 2016.

A number of readers have recently written me about an error they see when attempting to compile or execute an application or script in books such as, C++ All-In-One for Dummies, 4th EditionBeginning Programming with Python For Dummies, 2nd Edition, Python for Data Science for Dummies, and Machine Learning for Dummies, 2nd Edition. Development environments often handle spaces differently because they’re designed to perform tasks such as compiling applications and running scripts. When you see an error message that tells you that a file or path isn’t found, you need to start looking at the path and determine whether it contains any spaces. The best option is to create a directory to hold your source code and to place that directory off the root directory of your drive if at all possible. Keeping the path small and simple is your best way to avoid potential problems compiling code or running scripts.

The problem for many readers is that the error message is buried inside a whole bunch of nonsensical looking text. The output from your compiler or interpreter can contain all sorts of useful debugging information, such as a complete listing of calls that the compiler, interpreter, or application made. However, unless you know how to read this information, which is often arcane at best, it looks like gobbledygook. Simply keep scanning through the output until you see something that humans can read and understand. More often than not, you see an error message that helps you understand what went wrong, such as not being able to find a file or path. Please let me know if you ever have problems making the code examples in my books work, but also be sure to save yourself some time and effort by reading those error messages. Let me know if you have any thoughts or concerns about spaces in directory paths at [email protected].