Programming Languages Commonly Used for Data Science

The world is packed with programming languages, each of them proclaiming its particular forte and telling you why you need to learn it. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.

Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. Choosing the wrong one is akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:

  • Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, Matplotlib, pandas, and scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes it easy to use multi-processing on large datasets, which reduces the time required to analyze them. The data science community has also stepped up with specialized distributions, such as Anaconda, that bundle the Jupyter Notebook environment, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation actually shows how to create the required extensions. Many Python users rely on the language to see patterns, such as allowing a robot to see a group of pixels as an object. It also sees use for all sorts of scientific tasks.
  • R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
  • SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management because the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and a DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
  • Java (general purpose): Some data scientists perform other kinds of programming that require a general purpose, widely adopted, and popular language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
  • Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM) it does have some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language — think huge dataset support. Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.
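
To make the SQL and Python entries in the list a little more concrete, here is a minimal sketch that performs the same aggregation twice: once inside the database with SQL and once in Python after pulling every row. It relies only on the sqlite3 and statistics modules from the Python standard library. The sales table, its columns, and the sample rows are hypothetical values made up for the illustration, and a tiny in-memory database says nothing about the speed you would see on real data.

```python
import sqlite3
import statistics

# Hypothetical sample data: (region, sale amount).
rows = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 200.0),
    ("south", 100.0),
    ("south", 60.0),
]

# Build a throwaway in-memory database to hold the sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Approach 1: let the database do the aggregation natively with SQL.
sql_means = dict(
    conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region")
)

# Approach 2: pull every row into Python and aggregate there.
by_region = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    by_region.setdefault(region, []).append(amount)
py_means = {
    region: statistics.mean(values) for region, values in by_region.items()
}

conn.close()

print(sql_means)
print(py_means)
```

Both approaches produce the same averages. The difference is where the work happens: the SQL version moves only one row per region out of the database, which is the kind of native access the SQL entry describes.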

There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].

Shards of Glass

Oh how sweet,
What a treat!
Shards of shining glass.

Nice and shiny,
Grains so tiny.
Shards of sparkling glass.

Oh so thirsty,
Feel so hungry.
Shards of wayward glass.

Destroys your health,
Consumes your wealth.
Shards of broken glass.

Sores appearing,
Disease nearing.
Shards of toxic glass.

Eyesight failing,
Time of wailing.
Shards of dark’ning glass.

Time’s a flying,
Soon you’re dying.
Shards of deadly glass.

Inspired by Barbara McPherson.
Dedicated to all who suffer from diabetes.
Copyright 2011 John Paul Mueller

Ongoing Education

This is an update of a post that originally appeared on April 7, 2011.

In looking at a lot of my previous posts, I try to see the truth value in them. Did they hold up to the test of time? When it comes to ongoing education, it doesn’t matter how old you get, you still need to learn new things. I encountered a friend at the library the other day. Long retired, he’s still feeding his brain with new things, and doing so makes him a very interesting person to talk to.

There are only 24 hours in each day, making it impossible for any one human to know the sum of human knowledge or even a small portion of it. The 24 hour limit means we must actually choose carefully and depend on others to have knowledge we don’t possess. Anything that isn’t growing is dead and a form of growth is the increase of both knowledge and wisdom.

Of course, I do obtain daily increases in knowledge. The art of writing technical books is embracing a strategy of learning all the time. I read voraciously, subscribe to word builders, and conduct experiments to see just how things really work (as contrasted to the theoretical discussions in the books and magazines I read). The very act of writing involves learning something new as I discover new ways to express myself in writing and convey information to readers. I’ve picked up books I wrote early in my career and am often appalled at what I considered good writing at the time, but it was good writing given my experience, even though it would be unacceptable today.

Learning is part of every activity in my life and I relish every learning event. So it was on this particular weekend twelve years ago that my wife and I packed our lunch and went to Get Ready…Get Set…Garden! We looked forward to it every year. This year we took classes on hostas (for fun) and horseradish (as part of our self-sufficiency).

Before we went to class though, we just had to spend a little time looking at some of the displays. A personal favorite of mine was the gourds:

A display of interesting things made with gourds.

I’ve always wanted to make some bird houses, but never quite have the time. They actually had a class on the topic this year and I was tempted to take it, but that would have meant missing out on the horseradish class, which I considered more important. Here’s Rebecca and me standing in front of one of the displays:

John and Rebecca standing in front of one of the displays.

Well, onto the classes. I found out that there are over 2,000 varieties of hostas, which I found amazing. They originated in Japan, Korea, and China.  It takes five years on average to grow a hosta to full size, but it can take anywhere from three years to ten years depending on the variety. I found out our place for growing them is perfect, but our watering technique probably isn’t, so we’ll spend a little more time watering them this summer. Our presenter went on to discuss techniques for dealing with slugs and quite a few other pests. Most important for me is that I saw some detailed pictures of 50 of the more popular varieties that are easy to get in this area. I’ll be digging out some of the old hostas in my garden and planting new as time allows.

The horseradish session was extremely helpful. I learned an entirely new way to grow horseradish that involves laying the plant on its side for the first six weeks, digging it up partially, removing the suckers, and then reburying it. The result is to get a far bigger root that’s a lot easier to grind into food. I can’t wait to try it out. Of course, our instructor had us sample a number of horseradish dishes while we talked. I’m not sure my breath was all that pleasant when we were finished, but I enjoyed the tasty treats immensely.

So, what are your educational experiences like? Do you grow every day? Let me know at [email protected].

Effects of the Mistruths of Data on Model Output

A number of the books Luca and I have written or I have written on my own, including Artificial Intelligence for Dummies, 2nd Edition, Machine Learning for Dummies, 2nd Edition, Python for Data Science for Dummies, and Machine Learning Security Principles, talk about the five mistruths of data: commission, omission, bias, perspective, and frame of reference. Of these five mistruths, the one that receives the most attention is bias, but they’re all important because they all affect how any data science model you build will perform. Because the data used to create the model isn’t free of mistruths, the model can’t perform as expected in many situations. Consequently, the assertions in the article, LGBTQ+ bias in GPT-3, don’t surprise me at all. The data used to create the model is flawed, so the output is flawed as well.

I chose the article in question as a reference because the author takes the time to point out a problem in ever generating a perfect model, the constant change in human perspective. Words that were considered toxic in the past are no longer considered toxic today, but new words have taken their place. Even if a model were to somehow escape bias today, it would be biased tomorrow due to the mistruth of perspective.

So, why have I been using the term mistruth instead of the term lie? A lie is information that is passed off as true in order to avoid responsibility, to harm others in some way, or to knowingly pass off information that is untrue for personal gain. However, humans use mistruths all of the time to reduce the potential for arguments, to save someone’s ego, or simply because the information that the person has is inaccurate. A mistruth doesn’t have the intent of deceiving another for personal gain, but it’s still not true. So, when someone asks, “Do these pants make me look fat?” and another person states, tactfully, “They make you look voluptuous,” the statement could be true or a mistruth, but it’s made to keep an argument at bay and make the other person feel good about themselves. However, machine learning algorithms have no concept of this interplay and the model created using such statements will be biased.

Anthropomorphizing machine learning models doesn’t change the fact that they’re essentially statistically-based mathematical models of data with no understanding of anything built into them. So, true or not, the model sees all data as being the same—the only solution to the problem of bias is to clean the data. However, the human expectation after getting emotionally attached to their AI is that the AI will somehow just know when something is hurtful. Articles like, What is the Proper Pronoun for GPT-4?, point to the problem of why someone would ask the question at all. The interactions with AI have taken on a harmful aspect because humans are seeing them as sentient when they most certainly aren’t. The issue will come to a head when people try to use the bad advice obtained from their AI as a defense in court. Personally, I see the continued use of “it” as essential to remind people that no matter how human a model may seem, it’s still just a model.
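
The idea that a model is just statistics over whatever data it receives can be shown with a toy example. The following is a minimal sketch, not a real toxicity model: it scores text by how often each known word appeared in training text that was flagged. The training sentences, the labels, and the toxicity_score() function are all hypothetical and invented for this illustration. Two of the labels are deliberately wrong to stand in for a biased annotator.

```python
from collections import Counter

# Hypothetical toy training data: (text, label), where 1 means "flagged".
training = [
    ("the blue team played well", 0),
    ("the blue team won again", 0),
    ("the green team played well", 1),  # mislabeled by a biased annotator
    ("the green team won again", 1),    # mislabeled by a biased annotator
]

# Count how often each word was seen and how often it was flagged.
seen = Counter()
flagged = Counter()
for text, label in training:
    for word in set(text.split()):
        seen[word] += 1
        flagged[word] += label


def toxicity_score(text):
    """Average, over known words, of the share of flagged training texts."""
    words = [word for word in text.split() if word in seen]
    if not words:
        return 0.0
    return sum(flagged[word] / seen[word] for word in words) / len(words)


# Two sentences that differ only in the team name.
blue_score = toxicity_score("the blue team practiced")
green_score = toxicity_score("the green team practiced")

print(f"blue:  {blue_score:.2f}")
print(f"green: {green_score:.2f}")
```

The green sentence scores higher than the blue one even though the two differ by a single, harmless word. The arithmetic has no way to know that the labels were wrong, so the only fix is to correct the data before training.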

It’s important to understand that I see AI as an amazing tool that will only get more amazing as data scientists, developers, and others add to its potential by doing things like adding more memory. Although I don’t agree that any machine learning model can be someone’s friend, articles like How to Use ChatGPT in Daily Life? do make a strong argument for using machine learning models as tools to enhance human capability. However, it is a person who thought up these uses, not the machine learning model. Machine learning models will remain limited because they can’t understand the data they manipulate. In fact, articles like Why ChatGPT Won’t Replace Coders Just Yet point out just how limited machine learning models remain. However, I do think that machine learning models will get expanded by humans to perform new tasks as described in articles like How To Build Your Own Custom ChatGPT With Custom Knowledge Base. Of course, if that custom knowledge base is biased in any way, then the output from the new model will also be biased.

It’s important to know that AI is moving forward and that it will be extended to do even more in assisting humans to realize their full potential. However, it’s also important to know that the five mistruths will continue to be a problem because machine learning models are unable to understand the data used to train them and to provide knowledge bases for their output. Realistic expectations will help improve AI as a tool that augments human capabilities and helps us achieve amazing things in the future. Let me know your thoughts on machine learning bias at [email protected].

Debugging a CodeBlocks Application with Command Line Arguments

This is an update of a post that originally appeared on November 1, 2011.

Most application environments provide a means of setting command line arguments and CodeBlocks is no exception. The example shown in Listing 6-12 on page 167 of C++ All-In-One for Dummies, 4th Edition requires that you set command line arguments in order to see anything but the barest output from the debugger. This post discusses the requirements for setting command line arguments for debugging purposes.

Let’s begin with the example without any configuration. Every application has one command line argument—the path and application executable name. To see this argument, change the line that currently reads for (int index=1; index < argc; index++) to read for (int index=0; index < argc; index++) instead (setting index=1 causes the program not to show the first argument). So, now when you run the example shown in Listing 6-12 you’ll see the path and executable name as a minimum, as shown here.

The first argument passed to an application is the application executable name and path.

If you run this example, you may see a different path, but the command line executable should be the same. The point is that you see at least one argument as output. However, most people will want to test their applications using more than one argument. In order to do this, you must pass command line arguments to the application. Start by changing the code back to its original form where index=1. The following steps tell how to add command line arguments.

  1. Choose Project | Set Program’s Arguments. You’ll see the Select Target dialog box shown here.
    Change the command line arguments for the debug or release versions.
  2. Select Debug as the target, as shown in the figure.
  3. Type the arguments you want to use, such as Hello World I Love You!, in the Program Arguments field and click OK. The IDE is now set to provide command line arguments to the application when you’re using the specified target, which is Debug in this case.

When you run the application after adding the command line arguments, you should see them in the output like this:

The output shows addition of the command line arguments.

Testing for command line arguments in a CodeBlocks application consists of telling the IDE what to pass in the Select Target dialog box. Let me know if you have any questions about this process at [email protected].

Compiling Python

None of my Python books, including Algorithms for Dummies, 2nd Edition, Beginning Programming with Python For Dummies, 3rd Edition, Machine Learning for Dummies, 2nd Edition, Machine Learning Security Principles, and Python for Data Science for Dummies, show how to compile a Python program. This is because the interpreted nature of Python makes it easier to work with scripts for these reasons:

  • The interpreter provides instant results to make learning faster.
  • It’s easier and faster to fix errors.
  • The use of notebooks, as is found in all of the books, makes creating output easier.
  • The use of literate programming techniques helps create an environment where acquired knowledge is more likely to remain acquired.
  • Using literate programming techniques also makes it possible to document the code in a manner that’s more like reading a textbook than looking at source code.
  • The use of scripts promotes experimentation, which leads to new ideas and techniques.

These are all great reasons to use scripts in books. In fact, I’m sure that many people will have other reasons to use scripts. The one thing you should note is that Python does automatically compile some files to do things like reduce loading time. Anytime you see a .pyc file, the file has been compiled by Python to bytecode through various means, including importing the script. It’s also possible to pre-compile a script by running the py_compile module with the Python interpreter’s -m command line switch (for example, python -m py_compile MyScript.py, where MyScript.py stands in for your own script). The resulting output appears in the __pycache__ folder with a .pyc extension. You can further modify the compilation process by using the -O and -OO command line switches, which strip assert statements (and, with -OO, docstrings as well) to make the bytecode a little smaller. The problem with these outputs is that they’re only mildly obfuscated, so if your intent is to hide your code from prying eyes, this isn’t the best option.
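
You can also trigger the same pre-compilation from within Python by calling the py_compile module directly. The following is a minimal sketch that writes a throwaway script to a temporary folder, compiles it, and reports where the bytecode ended up. The hello.py file name and its one-line contents are hypothetical. The exact .pyc file name depends on the Python version you run, so the code prints it rather than assuming it.

```python
import pathlib
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small, hypothetical script to compile.
    script = pathlib.Path(tmp) / "hello.py"
    script.write_text("print('Hello from bytecode')\n")

    # Compile the script to bytecode; doraise=True turns a compilation
    # problem into an exception instead of a message on stderr.
    pyc_path = pathlib.Path(py_compile.compile(str(script), doraise=True))

    # Record details about the output before the folder is cleaned up.
    pyc_name = pyc_path.name
    pyc_folder = pyc_path.parent.name
    pyc_suffix = pyc_path.suffix

print(f"Bytecode file: {pyc_name}")
print(f"Stored in folder: {pyc_folder}")
```

On a default installation the folder is __pycache__ and the file name includes a tag for the interpreter version, which is why one script can have several .pyc files sitting side by side.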

Another built-in compilation option is to use the compile() function, which performs a compilation directly in your code. It turns source code held in a string into a code object that you can then run with exec() or eval(). The purpose of using this function is to speed up code that is used often within your application, because the source is parsed once rather than every time it runs. For example, you might use it to compile code that is executed within a loop. Obviously, you get no obfuscation advantage using this approach, but you do get a speed advantage. If you don’t want to go through the bother of using the compile() function, you could always use a third party product like Numba, a just-in-time compiler that reduces the task to one of adding a decorator to your code.
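
Here is a minimal sketch of the compile() approach. It compiles a one-line snippet once and then executes the resulting code object several times with different inputs. The snippet itself and the names n and total are hypothetical and chosen only for the illustration. Note that this pattern helps only when the source starts out as a string, because code that lives in a module is already compiled when the module loads.

```python
# Hypothetical source code held in a string.
source = "total = sum(i * i for i in range(n))"

# Parse and compile the string once, outside the loop.
code_obj = compile(source, "<snippet>", "exec")

results = []
for n in (10, 100):
    # Each run gets a fresh namespace that supplies the input value.
    namespace = {"n": n}
    exec(code_obj, namespace)
    results.append(namespace["total"])

print(results)
```

Passing the string itself to exec() inside the loop would give the same results, but Python would have to parse and compile it again on every pass.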

With the exception of Numba, which generates machine code for decorated functions at run time, none of the solutions discussed so far do anything more than turn your Python script into bytecode, which is still interpreted (albeit, it loads faster than a human language script because the compile step is already done). There is also an option for turning your Python code into actual machine code through various intermediate steps. A Python compiler usually turns your Python script into an intermediate language, which is then compiled into actual machine code that is native to the host platform. However, some products billed as Python compilers simply run your script online, so you need to know in advance whether you’ll end up with an executable file. An executable file can offer these advantages:

  • The source code is fully obfuscated, protecting your development investment.
  • The code runs significantly faster than any other means of interacting with it.
  • Instead of a host of script files, you usually end up with just a few executable files, perhaps even just one.
  • Because it’s harder to modify, an executable file can be more secure and reliable than using scripts.

If your goal is to exclusively create an executable output, then a product like auto-py-to-exe might be your best option. This way you get to use your interpreter of choice to develop the application, then use another product to turn the result into an .exe file. (Be aware that auto-py-to-exe is a graphical front end for PyInstaller, which packages your bytecode together with an interpreter rather than translating it to machine code, so expect convenience of distribution more than a speed gain.) The idea is to get the best of both worlds. The point of all this is that you don’t strictly have to interact with Python code in one way, using an interpreter. You have a great many options at your disposal. Let me know your thoughts about working with compiled Python code at [email protected].

True Peace

Peace, lend a hand.
Peace isn't the absence of war,
It isn't the result of war.
Peace begins within,
but it ends without.

Peace, lend a hand.
You cannot win peace alone,
everyone labor and groan.
Envy, greed, and lust,
these make peace a bust.

Peace, lend a hand.
All your neighbors you must see,
as your friends with spirits free.
Your lives intertwined,
spirit, body, mind.

Peace, lend a hand.
All peace with faith must begin,
a peace within is too thin.
Look around to see,
what your life could be.

Copyright John Paul Mueller, 2011

Errors in Writing

This is an update of a post that originally appeared on March 18, 2011.

I get upwards of 65 e-mails about my books on most days. Some of the conversations I have with readers are amazing and many readers have continued to write me for years. It’s gratifying to know that my books are helping people—it’s the reason I continue writing. Although I make a living from writing, I could easily make more money doing just about anything else. The thought that I might help someone do something special is why I stay in this business. When I actually hear about some bit of information that has really helped someone, it makes my day. I just can’t get the smile off my face afterward.

Of course, I’m constantly striving to improve my writing and I do everything I can to help the editors that work with me do a better job too. Good editors are the author’s friend and keep the author from looking like an idiot to the reading public. In fact, it’s the search for better ways to accomplish tasks that led me to create the beta reader program so many years ago. Essentially, a beta reader is someone who reads my books as I write them and provides feedback. The extra pair of eyes can make a big difference. Beta readers receive my thanks in the book’s Acknowledgments. Sometimes I provide other perks, such as a free copy of the book, depending on the level of beta reader input. (If you’d like to be a beta reader, please contact me at [email protected] for additional details.)

A typical book has five beta readers, but sometimes there are more or fewer of them. They provide all sorts of input that ranges from finding grammatical, spelling, and technical errors, to providing advice on how to approach a particular topic for readers from other nations or those with disabilities. Some of my beta readers are critical thinkers and play devil’s advocate; others are great at pointing out inconsistencies, especially in my artwork. So, there is no typical beta reader; they have a very wide range of experiences and provide me with a wide range of insights.

You’d think that with all the pairs of eyes looking at my books, they’d come out error free. After all, it isn’t just me looking at the book, but several editors and the beta readers as well. Unfortunate as it might seem, my books still come out with an error or two in them. The more technical the topic, the greater the opportunity for errors to creep in. Naturally, the errors are amazingly easy for just about everyone else to pick up! (I must admit to asking myself how I could have missed something so utterly obvious.) When there is an error found in the book, I’ll provide the information to the publisher so it’s fixed in the next printing. The error will also appear on the book’s errata page on the publisher’s site. If the error is significant enough, I’ll blog about it as well. In short, I want you to have a good reading experience so I’ll do everything I can to hunt the errors down and correct them.

However, not every seeming error is actually an error. There are times where an apparent error is simply a difference of opinion or possibly a configuration difference between my system and the reader’s system. I’ll still try to figure these errors out, but I can’t always guarantee that I’ll fix things in your favor. After all, another reader has probably found still other results or has yet another opinion on how I should present material in the book.

The long and short of things is that despite my best efforts, you’ll probably encounter an error or two in my books and I apologize for them in advance. We’ll also continue to have differences of opinion and that’s usually the source for new ideas and new ways of viewing things. I’m honest enough to admit that I do need your help in creating better books, so I’ll always listen to you and think about what you have to say. I hope that you’ll continue to read my books and do amazing things with the information you find therein. The results of your researches are truly the reason I remain in this business and I realize that we’re in this together. Thanks for your continued support!

Resetting Your Code::Blocks Configuration

This is an update of a post that originally appeared on April 12, 2013.

Quite a few people have written to me about issues they have with C++ All-In-One for Dummies, 4th Edition that involve getting Code::Blocks up and running. The posts in the C++ All-in-One for Dummies, 4th Edition archive normally provide everything needed to get the compiler up and running. However, there are rare times when no matter how much you try, you simply can’t get the compiler to work.

One technique I haven’t really covered until now is to reset the Code::Blocks configuration. The problem with this approach is that it resets all of your settings, not just those that could be in error. This is the reason that I’ve taken a more measured approach to helping readers through problems until now. My concern is that resetting everything will actually cause more problems and end up confusing some readers, so you really do want to try those other posts first. That said, there are situations where resetting Code::Blocks is the only course of action that will work.

To reset your settings, open your copy of Code::Blocks. Choose Settings | Compiler. You see the Compiler and Debugger Settings dialog box similar to the one shown here.

A view of the Global Compiler Settings dialog box.

Click Reset Defaults. This action will reset all of the defaults so that they match the initial installation configuration unless you have created a default of your own. Make absolutely certain that the Selected Compiler field shows GNU GCC Compiler as shown in the figure and then click OK. Close and then reopen Code::Blocks before you test your configuration.

Let me know if you have any questions about this procedure at [email protected]. It’s always my goal to make my books as useful to you as possible.

Machine Learning Security Principles Now Available as an Audiobook

I love to provide people with multiple ways to learn. One of the more popular methods of learning today is the audiobook. You can listen and learn while you do something else, like drive to work. Machine Learning Security Principles is now available in audiobook form and I’m really quite excited about it because this is the first time that one of my books has appeared in this format. You can get this book in audiobook format on the O’Reilly site at https://www.oreilly.com/videos/machine-learning-security/9781805124788/.

After listening to the book myself, I have to say that the audio is quite clear and it does add a new way for me to learn as well. If you do try this audiobook, please let me know how it works for you. I’ll share any input you provide with the publisher as well so that we can work together to provide you with the best possible book materials in a format that works best for you. Please let me know your thoughts at [email protected].