Programming Languages Commonly Used for Data Science

The world is packed with programming languages, each of them proclaiming their particular forte and telling you why you need to learn them. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.

Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. It’s akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:

  • Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, Matplotlib, pandas, and scikit-learn, to make data science tasks significantly easier. Python is also a concise language that makes it easy to use multiprocessing on large datasets, reducing the time required to analyze them. The data science community has also stepped up with specialized tools, such as the Anaconda distribution, that implement the Jupyter Notebook concept, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran; the Python documentation actually shows how to create the required extensions. Data scientists commonly rely on Python to see patterns, such as allowing a robot to see a group of pixels as an object, and the language sees use for all sorts of other scientific tasks. (The first sketch after this list gives a taste of what these libraries offer.)
  • R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
  • SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management; the data is the business. Large organizations use some sort of relational database, normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language and have a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than to perform detailed analysis of it. However, a data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs. (The second sketch after this list shows SQL’s data-focused style.)
  • Java (general purpose): Some data scientists perform other kinds of programming that require a general-purpose, widely adopted, and popular language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
  • Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM), it has some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language; think huge dataset support. Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science-specific libraries.
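To give the Python bullet some substance, here’s a minimal sketch of the kind of task that libraries like pandas make nearly effortless (the column names and values are invented for illustration, not drawn from any real dataset):

import pandas as pd

# A tiny, made-up dataset: monthly sales for two regions.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [120, 135, 98, 110],
})

# One line of pandas replaces the loop-and-accumulate code you'd
# otherwise write in a general-purpose language.
print(df.groupby("region")["sales"].mean())

Likewise, for the SQL bullet, Python’s standard library embeds a small SQL engine, so you can see SQL’s data-focused style without installing a DBMS (again, the table and values are invented):

import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 120), ("east", 135), ("west", 98)])

# SQL describes what data you want, not how to loop over it.
for row in con.execute(
        "SELECT region, AVG(amount) FROM sales GROUP BY region"):
    print(row)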

There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].

Effects of the Mistruths of Data on Model Output

A number of the books Luca and I have written or I have written on my own, including Artificial Intelligence for Dummies, 2nd Edition, Machine Learning for Dummies, 2nd Edition, Python for Data Science for Dummies, and Machine Learning Security Principles, talk about the five mistruths of data: commission, omission, bias, perspective, and frame of reference. Of these five mistruths, the one that receives the most attention is bias, but they’re all important because they all affect how any data science model you build will perform. Because the data used to create the model isn’t free of mistruths, the model can’t perform as expected in many situations. Consequently, the assertions in the article, LGBTQ+ bias in GPT-3, don’t surprise me at all. The data used to create the model is flawed, so the output is flawed as well.

I chose the article in question as a reference because the author takes the time to point out a problem with ever generating a perfect model: the constant change in human perspective. Words that were considered toxic in the past are no longer considered toxic today, but new words have taken their place. Even if a model were to somehow escape bias today, it would be biased tomorrow due to the mistruth of perspective.

So, why have I been using the term mistruth instead of the term lie? A lie is untrue information knowingly passed off as true in order to avoid responsibility, to harm others in some way, or to achieve personal gain. However, humans use mistruths all of the time to reduce the potential for arguments, to save someone’s ego, or simply because the information that the person has is inaccurate. A mistruth doesn’t have the intent of deceiving another for personal gain, but it’s still not true. So, when someone asks, “Do these pants make me look fat?” and another person tactfully states, “They make you look voluptuous,” the statement could be true or a mistruth, but it’s made to keep an argument at bay and to make the other person feel good. However, machine learning algorithms have no concept of this interplay, and a model created using such statements will be biased.

Anthropomorphizing machine learning models doesn’t change the fact that they’re essentially statistically based mathematical models of data with no understanding of anything built into them. True or not, the model sees all data as being the same; the only solution to the problem of bias is to clean the data. However, the human expectation after getting emotionally attached to an AI is that the AI will somehow just know when something is hurtful. Articles like What is the Proper Pronoun for GPT-4? point to the problem; why would someone ask the question at all? Interactions with AI have taken on a harmful aspect because humans are seeing these models as sentient when they most certainly aren’t. The issue will come to a head when people try to use the bad advice obtained from their AI as a defense in court. Personally, I see the continued use of “it” as essential to remind people that no matter how human a model may seem, it’s still just a model.

It’s important to understand that I see AI as an amazing tool that will only get more amazing as data scientists, developers, and others add to its potential by doing things like adding more memory. Although I don’t agree that any machine learning model can be someone’s friend, articles like How to Use ChatGPT in Daily Life? do make a strong argument for using machine learning models as tools to enhance human capability. However, it is a person who thought up these uses, not the machine learning model. Machine learning models will remain limited because they can’t understand the data they manipulate. In fact, articles like Why ChatGPT Won’t Replace Coders Just Yet point out just how limited machine learning models remain. However, I do think that machine learning models will be extended by humans to perform new tasks, as described in articles like How To Build Your Own Custom ChatGPT With Custom Knowledge Base. Of course, if that custom knowledge base is biased in any way, then the output from the new model will also be biased.

It’s important to know that AI is moving forward and that it will be extended to do even more in assisting humans to realize their full potential. However, it’s also important to know that the five mistruths will continue to be a problem because machine learning models are unable to understand the data used to train them and to provide knowledge bases for their output. Realistic expectations will help improve AI as a tool that augments human capabilities and helps us achieve amazing things in the future. Let me know your thoughts on machine learning bias at [email protected].

Review of The Kaggle Book

[Image: The Kaggle Book cover] The Kaggle Book tells you everything needed about competing on Kaggle.

The Kaggle Book by Konrad Banachewicz and Luca Massaron is a book about competing on Kaggle. The introductory chapters tell you all about Kaggle and the competitions it sponsors. The bulk of the book provides details on how to compete better against a variety of competitors. The book ends with some insights into how competing on Kaggle can help with other areas of your life. If that were all the book offered, it would still be well worth reading cover-to-cover, as I did, but there’s more to it than that.

My main reasons for reading the book were to find out more about how data is created and vetted on Kaggle, and to gain some more insights into how to write better data science applications. In the end, the bulk of this book is a rather intense treatment of data science with a strong Kaggle twist. If you’re looking for datasets and want to understand techniques for using them effectively, then this is the book to get, because the authors are both experts in the field. The book’s emphasis is on data management, verification, validation, and checks for model goodness. It’s the model goodness part that is hard to find in any other book (at least, the ones I’ve read so far).

The book also contains interviews with people who have participated in Kaggle competitions. I read a number of these interviews and found that they didn’t help me personally, given my goals in reading the book. However, I have no doubt that they’d help someone who actually intends to enter a Kaggle competition, which sounds like a great deal of work. Before I read the book, I had no idea of just how much goes into these competitions and what the competitors have to do to have a chance of winning. What I found most important is that the authors stress the need to get something more out of a competition than simply winning; winning is just a potential outcome of a much longer process of learning, skill building, and team building.

You really need this book if you’re into data science at all because it helps you gain new insights into working through data science problems and ensuring that you’re getting a good result. I know that my own personal skills will improve as I apply the techniques described in the book, which really do apply to every kind of data science development, not just to Kaggle competitions.

Technology and Child Safety

This is an update of a post that originally appeared on January 20, 2016.

A little over seven years ago, I wrote about an article in ComputerWorld, Children mine cobalt used in smartphones, other electronics, that had me thinking yet again about how people in rich countries tend to ignore the needs of those in poor countries. I sincerely hoped at the time that things would be different, better, in seven years. Well, they’re worse! We’ve increased our use of cobalt dramatically in order to create supposedly green cars. The picture at the beginning of the ComputerWorld article says it all, but the details will have you wondering whether a smartphone or an electric car really is worth some child’s life. That’s right: any smartphone or electric car you buy may be killing someone, and in a truly horrid manner. Children as young as 7 years old mine the cobalt needed for the batteries (and other components) in the smartphones and electric cars that people seem to feel are so necessary for life. (They aren’t, you know; food, water, clothing, shelter, sleep, air, and reproduction are necessities. Everything else is a luxury.)

The problem doesn’t stop when someone gets rid of a smartphone, electric car, or other piece of technology. Other children end up dismantling the devices sent for recycling. That’s right: a rich country’s efforts to keep electronics out of its landfills are also killing children, because countries like India put children to work taking the devices apart in unsafe conditions. Recycled waste goes from rich countries to poor countries because the poor countries need the money for necessities, like food. Often, these children are incapable of working by the time they reach 35 or 40 due to health issues induced by their forced labor. In short, the quality of their lives is made horribly low so that people in rich countries can enjoy something that truly isn’t necessary for life. To make matters worse, the vendors of these products build in obsolescence (making the products unrepairable) so they can sell more products and make more money, increasing the devastation visited on children.

I’ve written other blog posts about the issues of technology pollution. However, the emphasis of those previous articles was on the pollution itself. Taking personal responsibility for the pollution you create is important, but we really need to do more. Robotic (autonomous) mining is one way to keep children out of the mines, and projects such as UX-1 show that it’s entirely possible to use robots in place of people today. The weird thing is that autonomous mining would save up to 80% of today’s mining costs, so you have to wonder why manufacturers aren’t rushing to employ this solution.

In addition, off-world mining would keep the pollution in space rather than on planet Earth. Of course, off-world mining also requires a heavy investment in robots, but it promises to provide a huge financial payback in addition to keeping Earth a bit cleaner. The point is that there are alternatives that we’re not using. Robotics presents an opportunity to make things right with technology, and I’m excited to be part of that answer in writing books such as Machine Learning Security Principles, Artificial Intelligence for Dummies, 2nd Edition, Algorithms for Dummies, 2nd Edition, Python for Data Science for Dummies, and Machine Learning for Dummies, 2nd Edition.

Unfortunately, companies like Apple, Samsung, and many others simply thumb their noses at the laws that are in place to protect the children in these countries because they know you’ll buy their products. Yes, they make official statements, but read their statements in that first article and you’ll quickly figure out that they’re excuses, and poorly made excuses at that. They don’t have to care because no one is holding them to account. People in rich countries don’t care because their own backyards aren’t sullied and their own children remain safe. It’s not that I have a problem with technology; quite the contrary. I have a problem with the manner in which technology is currently being made and supported. We need to do better. So, the next time you think about buying electronics, consider the real price of that product. Let me know what you think about polluting other countries to keep your country clean at [email protected].

Python for Data Science for Dummies Errata on Page 221

The downloadable source for Python for Data Science for Dummies contains a problem that doesn’t actually appear in the book. If you look at page 221, the code block in the middle of the page contains a line saying import numpy as np. This line is essential because the code won’t run without it. The downloadable source for Chapter 12 is missing this line, so the example doesn’t run. This P4DS4D; 12; Stretching Pythons Capabilities link provides you with a .ZIP file that contains the replacement source code. Simply remove the P4DS4D; 12; Stretching Pythons Capabilities.ipynb file from the archive and use it in place of your existing file.
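If you’d rather patch your existing copy by hand, the fix is simply to add the import at the top of the affected code cell. Here’s a minimal illustration (the np.array() call below is my own placeholder, not the book’s code):

# Add this line at the top of the cell; without it, every reference
# to np raises a NameError.
import numpy as np

x = np.array([1, 2, 3])  # placeholder; any np.* call in the cell now works
print(x.mean())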

Luca and I always want you to have a great experience with our book, so keep those emails coming. Please let me know if you have any questions about source code file update at [email protected]. I’m sorry about any errors that appear in the downloadable source and appreciate the readers who have pointed them out.

 

Python for Data Science for Dummies Errata on Page 145

Python for Data Science for Dummies contains two errors on page 145. The first error appears in the second paragraph on that page. You can safely disregard the sentence that reads, “The use_idf controls the use of inverse-document-frequency reweighting, which is turned off in this case.” The code doesn’t contain a reference to the use_idf parameter. However, you can read about it on the Scikit-Learn site. This parameter defaults to being turned on, which is how it’s used for the example.

The second error is also in the second paragraph. The discussion references the tf_transformer.transform() method call. The actual method call is tfidf.transform(), which does appear in the sample code. The discussion about how the method works is correct, just the name of the object is wrong.
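To see how the default behaves, here’s a minimal sketch using scikit-learn (the corpus is invented for illustration; this isn’t the book’s example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus
counts = CountVectorizer().fit_transform(corpus)

# use_idf defaults to True, so inverse-document-frequency reweighting
# is on even though the parameter never appears in the code.
tfidf = TfidfTransformer()  # equivalent to TfidfTransformer(use_idf=True)
weights = tfidf.fit_transform(counts)
print(weights.shape)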

Please let me know if you have any questions about either of these changes at [email protected]. I’m sorry about any errors that appear in the book and appreciate the readers who have pointed them out.

 

Python for Data Science for Dummies Errata on Page 124

Python for Data Science for Dummies contains an error in the example that appears on the top half of page 124. In the first of the two grey boxes, the code computes the results of four print statements. The bottom-most print statement, print x[1:2, 1:2], is supposed to compute a result based on rows 1 and 2 of columns 1 and 2, and the bottom grey box seems to confirm that interpretation by showing the result as [[[14 15 16] [17 18 19]] [[24 25 26] [27 28 29]]]. However, the statement as printed actually produces [[[14 15 16]]], which doesn’t agree with the result shown in the text.

The good news is that the downloadable source contains the correct code. The error appears only in the book. The last print statement in the book is wrong: because a Python slice excludes its ending index, getting rows and columns 1 and 2 requires 1:3, not 1:2. Here is the correct code (with output) for this example:

import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

print x[1,1]
print x[:,1,1]
print x[1,:,1]
print
print x[1:3, 1:3]
[14 15 16]
[ 5 15 25]
[12 15 18]

[[[14 15 16]
  [17 18 19]]

 [[24 25 26]
  [27 28 29]]]
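If you’re working in Python 3, where print is a function, the same example looks like this (my conversion, not code from the book; the output is identical to the above):

import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

print(x[1, 1])      # middle row of the middle block: [14 15 16]
print(x[:, 1, 1])   # center element of each block: [ 5 15 25]
print(x[1, :, 1])   # middle column of the middle block: [12 15 18]
print()
print(x[1:3, 1:3])  # blocks 1-2, rows 1-2 of each: shape (2, 2, 3)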

Please let me know if you have any questions about this example at [email protected]. I’m sorry about the error that appears in the book and appreciate the readers who have pointed it out.

 

Missing XMLData2.xml File

A number of readers have written to report that XMLData2.xml is missing from the downloadable source for Python for Data Science for Dummies. You encounter this file in Chapter 6, on page 108. The publisher has already added the file to the downloadable source, but you might be missing the file from your copy. If so, you can download it by clicking XMLData2.zip. I’m truly sorry about any problems that the missing file might have caused. Please be sure to let me know about your book-specific questions at [email protected].

 

Tip Error in Python for Data Science for Dummies

There is a small error on page 318 of Python for Data Science for Dummies. You can find it near the middle of the page in the Tip text. The current text on the second line of that paragraph says, “k as a number near the squared number of available observations.” However, the text should really read, “k as a number near the square root of the number of available observations.” The word root is missing, which obviously changes the mathematical meaning of the text. Please accept our apologies for the typo. Let me know if you find any other errors of a technical nature in the book at [email protected] and I’ll be sure to provide a blog post about it here. Thank you for your support!
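In other words, the rule of thumb is k ≈ √n. Here’s a quick sketch of the corrected calculation (my illustration, not code from the book; the dataset size is hypothetical):

import math

n_observations = 1000  # hypothetical number of available observations
k = round(math.sqrt(n_observations))  # k near the square root of n
print(k)  # 32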

 

Missing File from Python for Data Science for Dummies Downloadable Source

A reader recently contacted me regarding a missing file from the downloadable source for Python for Data Science for Dummies. The file in question is the P4DS4D; 01; Quick Overview.ipynb file you need for the first chapter. Simply click here to download P4DS4D; 01; Quick Overview.ipynb. I’m also asking the publisher to add the missing file to the downloadable source found on the Dummies site at http://www.dummies.com/store/product/Python-for-Data-Science-For-Dummies.productCd-1118844181,descCd-DOWNLOAD.html. If you encounter any other problems with the book, please be sure to let me know at [email protected]. Thank you for your patience!