The world is packed with programming languages, each of them proclaiming their particular forte and telling you why you need to learn them. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.
Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. It’s akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:
- Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, MatPlotLib, pandas, and Scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes it easy to use multi-processing on large datasets — reducing the time required to analyze them. The data science community has also stepped up with specialized IDEs, such as Anaconda, that implement the Jupyter Notebook concept, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation actually shows how to create the required extensions. Most Python users rely on the language to see patterns, such as allowing a robot to see a group of pixels as an object. It also sees use for all sorts of scientific tasks.
- R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
- SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management — the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
- Java (general purpose): Some data scientists perform other kinds of programming that require a general purpose, widely adapted and popular, language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
- Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM) it does have some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language — think huge dataset support. Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.
There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].