Machine Learning Security Principles – John's Random Thoughts and Discussions

Warning Messages in Jupyter Notebook Example Code

You’re working with the downloadable source code from a book like Algorithms for Dummies, 2nd Edition, Beginning Programming with Python For Dummies, 3rd Edition, Machine Learning for Dummies, 2nd Edition, Python for Data Science for Dummies, or Machine Learning Security Principles and see a warning message like this:

C:\Users\John\anaconda3\lib\site-packages\sklearn\feature_selection\_sequential.py:206: FutureWarning: Leaving `n_features_to_select` to None is deprecated in 1.0 and will become 'auto' in 1.3. To keep the same behaviour as with None (i.e. select half of the features) and avoid this warning, you should manually set `n_features_to_select='auto'` and set tol=None when creating an instance.
  warnings.warn(

Well, that looks pretty confusing, and if you’re just learning to work with Python, it may give you the idea that you’ve done something seriously wrong. There are a few things to note here. First, this is a warning message. In fact, it’s a FutureWarning message, which means the change mentioned in the warning hasn’t actually taken effect yet.

Second, if you’re using the version of Jupyter Notebook and Python mentioned in the book, it’s unlikely that the effects described in the message will become a problem anytime soon, so you can usually ignore them. (This is one reason that I always ask which version of Jupyter Notebook and Python you’re using because a newer version can definitely cause error messages to appear.) Of course, if this warning ever does turn into an error, Luca and I definitely want to hear about it at [email protected].

Third, the message does state a potential fix for the problem. If the fix is simple enough, you can always try to make the required change to see if it works. However, this is a do-it-at-your-own-risk sort of modification. The point is that the warning isn’t keeping you from using the downloadable source today, so ignoring it is probably the best action to take.

If you really don’t want to see these warnings, you can always add two lines of code to the first cell of the downloadable source. The warning isn’t actually going away; you just won’t see it:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
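
If hiding every FutureWarning for the whole notebook feels too broad, a narrower option is to suppress the warning only around the call that triggers it. The following is a minimal sketch that uses only Python’s standard warnings module; the noisy_library_call() function is a hypothetical stand-in for whatever library call produces the message in your notebook:

```python
import warnings


def noisy_library_call():
    # Hypothetical stand-in for a library function that emits a FutureWarning.
    warnings.warn("this default will change in a later release", FutureWarning)
    return 42


# Inside the with block, FutureWarning messages are hidden.
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    result = noisy_library_call()

print(result)
```

When the with block ends, the previous warning filters are restored, so warnings raised elsewhere in the notebook still appear as usual.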

So, what causes these warning messages in the first place? Is the book’s source code faulty? There is nothing wrong with the book’s source code. What you’re seeing is the result of a library upgrade. Python uses a huge number of libraries and a change in any one of them can create a warning message of the sort you’ve seen. Luca and I work hard to ensure that the source code you get with the book is functional (and warning free) on all of the supported platforms at the time of writing, but it would be impossible for us to constantly update the book’s code to keep up with these library changes.

An Interesting Review of ChatGPT and Other AIs

I’ve written in the past about limitations of AI from a number of perspectives. In Effects of the Mistruths of Data on Model Output I discuss how the data fed to a model must necessarily affect its output in a number of ways, including bias and other unwanted effects. Considering the Four Levels of Intelligence Management tells why it’s not possible for an AI to approach human intelligence today. In Fooling Facial Recognition Software I provide a detailed discussion of why it’s so easy to fool certain types of AI-powered applications. And you learn about why some types of occupations are reasonably safe from AI in Automation and the Future of Human Employment. However, I haven’t really done a detailed investigation of AIs like ChatGPT that seem almost human-like in their understanding, but fall remarkably short in many simple areas. On Artifice and Intelligence is one of the more detailed analyses I’ve found to date on the subject and what it reveals will surprise you. There are a lot of simple problems that ChatGPT and other Large Language Models (LLMs) can’t solve.

What I found interesting in the article is that the author, Shlomi Sher, was able to show that one area where an AI should be strong, math, actually isn’t all that strong at all. He talks about Euclid’s proofs. The artifice is that ChatGPT 4 and other AIs can tell you all about prime numbers and even provide seemingly creative output about them. However, depending on how you ask some basic questions, ChatGPT 4 either gets the answer wrong or right, when the answer is pretty much apparent to any human who knows what prime numbers are. What I liked most about the article is that the author takes time to explain why humans can understand the problem, but the AI can’t.

If it seems as if I have a continuing desire to dissuade others from anthropomorphizing AIs, I most certainly do. When it comes to AI, it’s all about the math and nothing more. However, that doesn’t mean that AIs lack functionality and ability as tools to augment human endeavors. It’s likely that the use of AIs will continue to increase over time. In addition, I think that as we better understand precisely how AIs work, we’ll also come to realize that they’re amazing tools, but most definitely not humans in the making. Let me know your thoughts on ChatGPT at [email protected].

IDE Screenshot Usage in Books

There are cases where it’s very tough to figure out the correct presentation of material in a book, which is made more difficult by some readers preferring one presentation and other readers another. It comes down to how people learn in many cases. Visual learners prefer screenshots; abstract learners prefer text. Of course, there are all sorts of learners between these two extremes. So, what seems like a simple question can become quite complex.

The question at hand is whether to present screenshots of an IDE in a book with the associated example code and its output. The problem is that vendors now assume that developers have very large displays and so have made use of all of that extra screen real estate. In addition, book publishers don’t want books where a single image consumes an entire page. The result is that it’s very hard to get a screenshot where the text is completely readable. It can be done, but the text will generally still be smaller than the print in the book. Older readers complain that they need a magnifying glass to see the text at all.

However, there are benefits to using screenshots. The most important benefit is that, even if the text isn’t completely readable, visual learners can see what their IDE should look like as they follow the progress of procedures in the book. This feedback lets the visual learner know that they are doing things correctly and are getting the correct result. Another benefit is that an example tends to stay in one piece. The graphical output of an example doesn’t end up several pages away from the source code that produces it. Sometimes, textual output is wider than the page will allow using the normal font size. In that case, there are two options. The first is to print the output in the book at the normal font size, but in a truncated form, which means that it’s no longer complete. The second is to use a screenshot, which can show the complete textual output, but at a smaller font size. For beginner readers, the second form, while not optimal, is preferred because truncated output raises questions in the reader’s mind.

So, how do you feel about IDE screenshots in books? Are they more helpful or more confusing? Part of the reason for posts like this is to get your opinion and discover more about you as a reader. Obviously, a book author wants to use the communication techniques that work best overall for everyone, book space often not allowing for the investigation of every presentation alternative. Let me know your thoughts at [email protected].

Jupyter Notebook vs JupyterLite

There seems to be some confusion for readers of Algorithms for Dummies, 2nd Edition, Beginning Programming with Python For Dummies, 3rd Edition, Machine Learning for Dummies, 2nd Edition, Python for Data Science for Dummies, and Machine Learning Security Principles lately due to the similarity of names of two Integrated Development Environments (IDEs) available now. Even though I’m sure that JupyterLite is a very good product, even the website states, “Not all the usual features available in JupyterLab and the Classic Notebook will work with JupyterLite, but many already do!” This lack of support becomes a problem when you try to run the downloadable source using JupyterLite. In addition, Luca and I haven’t tested the downloadable source with this product, so we can’t even tell you what will and won’t work.

The two supported IDEs for our books are Google Colab (recommended for those of you who want to use a mobile device) and Jupyter Notebook (recommended for those of you who have a desktop system). It’s actually preferred that you get Jupyter Notebook as part of the Anaconda toolset because Anaconda makes it very easy for you to perform some advanced setup tasks found in some of our books. For example, you gain access to the Anaconda prompt and the associated Conda utility that definitely makes it easier for you to manage some of the machine learning packages found in our books. Using either Google Colab or Jupyter Notebook makes it much easier for Luca and me to help you with your book-specific questions.

Please let me know if you have any questions or concerns about how to set up your programming environment for our books at [email protected]. Remember to use the version of the products listed in the book for optimal results in working with the downloadable source. In addition, always remember to use the downloadable source to enhance your learning experience.

Programming Languages Commonly Used for Data Science

The world is packed with programming languages, each of them proclaiming their particular forte and telling you why you need to learn them. A good developer does learn multiple languages, each of which becomes a tool for a certain kind of development, but even the most enthusiastic developer won’t learn every programming language out there. It’s important to make good choices.

Data Science is a particular kind of development task that works well with certain kinds of programming languages. Choosing the correct tool makes your life easier. Choosing the wrong one is akin to using a hammer to drive a screw rather than a screwdriver. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Data scientists usually use only a few languages because they make working with data easier. With this in mind, here are the top languages for data science work in order of preference:

Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, Matplotlib, pandas, and scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes it easy to use multi-processing on large datasets, which reduces the time required to analyze them. The data science community has also stepped up with specialized tools, such as the Anaconda distribution, that support the Jupyter Notebook concept, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation actually shows how to create the required extensions. Most Python users rely on the language to see patterns, such as allowing a robot to see a group of pixels as an object. It also sees use for all sorts of scientific tasks.
R (special purpose statistical): In many respects, Python and R share the same sorts of functionality but implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, R doesn’t appear to mix with other languages with the ease that Python provides.
SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management — the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, there is often a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
Java (general purpose): Some data scientists perform other kinds of programming that require a general-purpose, widely adopted, and popular language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science, but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries.
Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM), it does have some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis. In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language (think huge dataset support). Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.
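
To illustrate the SQL entry in the list above, here is a minimal sketch of letting the database do the summarizing rather than pulling every row into Python first. It assumes Python’s built-in sqlite3 module and an in-memory database; the sales table, its columns, and the sample rows are made up for the example:

```python
import sqlite3

# Create a throwaway in-memory database (nothing is written to disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("North", 80.0), ("South", 200.0)],
)

# The DBMS performs the grouping and averaging; Python only receives
# the summarized rows.
query = """
    SELECT region, COUNT(*), AVG(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
"""
summary = conn.execute(query).fetchall()
conn.close()

for region, count, average in summary:
    print(region, count, average)
```

The same pattern applies to a larger DBMS, although the connection code and the available analysis functions will differ from product to product.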

There are likely other languages that data scientists use, but this list gives you a good idea of what to look for in any programming language you choose for data science tasks. What it comes down to is choosing languages that help you perform analysis, work with huge datasets, and allow you to perform some level of general programming tasks. Let me know your thoughts about data science programming languages at [email protected].

Effects of the Mistruths of Data on Model Output

A number of the books Luca and I have written or I have written on my own, including Artificial Intelligence for Dummies, 2nd Edition, Machine Learning for Dummies, 2nd Edition, Python for Data Science for Dummies, and Machine Learning Security Principles, talk about the five mistruths of data: commission, omission, bias, perspective, and frame of reference. Of these five mistruths, the one that receives the most attention is bias, but they’re all important because they all affect how any data science model you build will perform. Because the data used to create the model isn’t free of mistruths, the model can’t perform as expected in many situations. Consequently, the assertions in the article, LGBTQ+ bias in GPT-3, don’t surprise me at all. The data used to create the model is flawed, so the output is flawed as well.

I chose the article in question as a reference because the author takes the time to point out a problem in ever generating a perfect model, the constant change in human perspective. Words that were considered toxic in the past are no longer considered toxic today, but new words have taken their place. Even if a model were to somehow escape bias today, it would be biased tomorrow due to the mistruth of perspective.

So, why have I been using the term mistruth instead of the term lie? A lie is information that is passed off as true in order to avoid responsibility, to harm others in some way, or to knowingly pass off information that is untrue for personal gain. However, humans use mistruths all of the time to reduce the potential for arguments, to save someone’s ego, or simply because the information that the person has is inaccurate. A mistruth doesn’t have the intent of deceiving another for personal gain, but it’s still not true. So, when someone asks, “Do these pants make me look fat?” and another person states, tactfully, that “They make you look voluptuous,” the statement could be true or a mistruth, but it is made to keep an argument at bay and make the other person feel good about themselves. However, machine learning algorithms have no concept of this interplay and the model created using such statements will be biased.

Anthropomorphizing machine learning models doesn’t change the fact that they’re essentially statistically-based mathematical models of data with no understanding of anything built into them. So, true or not, the model sees all data as being the same—the only solution to the problem of bias is to clean the data. However, the human expectation after getting emotionally attached to their AI is that the AI will somehow just know when something is hurtful. Articles like, What is the Proper Pronoun for GPT-4?, point to the problem of why someone would ask the question at all. The interactions with AI have taken on a harmful aspect because humans are seeing them as sentient when they most certainly aren’t. The issue will come to a head when people try to use the bad advice obtained from their AI as a defense in court. Personally, I see the continued use of “it” as essential to remind people that no matter how human a model may seem, it’s still just a model.

It’s important to understand that I see AI as an amazing tool that will only get more amazing as data scientists, developers, and others add to its potential by doing things like adding more memory. Although I don’t agree that any machine learning model can be someone’s friend, articles like How to Use ChatGPT in Daily Life? do make a strong argument for using machine learning models as tools to enhance human capability. However, it is a person who thought up these uses, not the machine learning model. Machine learning models will remain limited because they can’t understand the data they manipulate. In fact, articles like Why ChatGPT Won’t Replace Coders Just Yet point out just how limited machine learning models remain. However, I do think that machine learning models will get expanded by humans to perform new tasks as described in articles like How To Build Your Own Custom ChatGPT With Custom Knowledge Base. Of course, if that custom knowledge base is biased in any way, then the output from the new model will also be biased.

It’s important to know that AI is moving forward and that it will be extended to do even more in assisting humans to realize their full potential. However, it’s also important to know that the five mistruths will continue to be a problem because machine learning models are unable to understand the data used to train them and to provide knowledge bases for their output. Realistic expectations will help improve AI as a tool that augments human capabilities and helps us achieve amazing things in the future. Let me know your thoughts on machine learning bias at [email protected].

Compiling Python

None of my Python books, including Algorithms for Dummies, 2nd Edition, Beginning Programming with Python For Dummies, 3rd Edition, Machine Learning for Dummies, 2nd Edition, Machine Learning Security Principles, and Python for Data Science for Dummies, show how to compile a Python program. This is because the interpreted nature of Python makes it easier to work with scripts for these reasons:

The interpreter provides instant results to make learning faster.
It’s easier and faster to fix errors.
The use of notebooks, as is found in all of the books, makes creating output easier.
The use of literate programming techniques helps create an environment where acquired knowledge is more likely to remain acquired.
Using literate programming techniques also makes it possible to document the code in a manner that’s more like reading a textbook than looking at source code.
The use of scripts promotes experimentation, which leads to new ideas and techniques.

These are all great reasons to use scripts in books. In fact, I’m sure that many people will have other reasons to use scripts. The one thing you should note is that Python does automatically compile some files to do things like reduce loading time. Anytime you see a .pyc file, the file has been compiled by Python to bytecode through various means, including importing the script. It’s also possible to pre-compile a script by running the py_compile or compileall module with the python interpreter’s -m command line switch. The resulting output appears in the __pycache__ folder with a .pyc extension. You can further modify the compilation process by using the -O and -OO command line switches (note the uppercase letters), which apply basic optimizations, such as removing assert statements (and, with -OO, docstrings as well). The problem with these outputs is that they’re only mildly obfuscated, so if your intent is to hide your code from prying eyes, this isn’t the best option.
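
As a minimal sketch of pre-compiling a script from within Python, the following example uses the standard py_compile module. The hello.py file name and its contents are made up for the example, and the script is written to a temporary folder so that nothing in your own project is touched:

```python
import os
import py_compile
import tempfile

# Write a tiny script to a temporary folder so the example is self-contained.
work_dir = tempfile.mkdtemp()
script_path = os.path.join(work_dir, "hello.py")
with open(script_path, "w", encoding="utf-8") as script:
    script.write('print("Hello from a compiled script")\n')

# Compile the script to bytecode. With no cfile argument, the output
# normally goes to the __pycache__ folder next to the script.
pyc_path = py_compile.compile(script_path, doraise=True)

print(pyc_path)
```

The command line equivalent is python -m py_compile hello.py, run from the folder that contains the script.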

Another built-in compilation option is to use the compile() function, which performs a compilation directly in your code. The purpose of using this function is to speed up code that is used often within your application. For example, you might use it to compile code that appears within a loop. Obviously, you get no obfuscation advantage using this approach, but you do get a speed advantage. If you don’t want to go through the bother of using the compile() function, you could always use a third party product like Numba, which reduces the task to one of adding a decorator to your code (Numba goes a step further than bytecode by compiling the decorated function to machine code at runtime).
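
Here is a minimal sketch of the compile() approach, in which an expression is compiled once and then evaluated repeatedly inside a loop. The expression and the variable names are purely illustrative:

```python
# Compile an expression once, then evaluate it many times in a loop.
expression = "x * x + 1"  # illustrative expression
code_object = compile(expression, "<string>", "eval")

results = []
for x in range(5):
    # eval() reuses the existing code object instead of parsing the
    # expression text again on every pass.
    results.append(eval(code_object, {"x": x}))

print(results)
```

The savings come from skipping the repeated parsing step, so the benefit is most noticeable when the same source text would otherwise be passed to eval() or exec() many times.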

None of the built-in solutions discussed so far do anything more than turn your Python script into bytecode, which is still interpreted (albeit without the need to parse a human language script each time it loads). There is also an option for turning your Python code into actual machine code through various intermediate steps. A Python compiler usually turns your Python script into an intermediate language, which is then compiled into actual machine code that is native to the host platform. However, some products billed as Python compilers simply run your script online, so you need to know in advance whether you’ll end up with an executable file. An executable file can offer these advantages:

The source code is fully obfuscated, protecting your development investment.
The code runs significantly faster than any other means of interacting with it.
Instead of a host of script files, you usually end up with just a few executable files, perhaps even just one.
Because it’s harder to modify, an executable file can be more secure and reliable than using scripts.

If your goal is to exclusively create an executable output, then a product like auto-py-to-exe might be your best option. This way you get to use your interpreter of choice to develop the application, then use another product to turn the result into an .exe file. The idea is to get the best of both worlds. The point of all this is that you don’t strictly have to interact with Python code in one way, using an interpreter. You have a great many options at your disposal. Let me know your thoughts about working with compiled Python code at [email protected].

Machine Learning Security Principles Now Available as an Audiobook

I love to provide people with multiple ways to learn. One of the more popular methods of learning today is the audiobook. You can listen and learn while you do something else, like drive to work. Machine Learning Security Principles is now available in audiobook form and I’m really quite excited about it because this is the first time that one of my books has appeared in this format. You can get this book in audiobook format on the O’Reilly site at https://www.oreilly.com/videos/machine-learning-security/9781805124788/.

After listening to the book myself, I have to say that the audio is quite clear and it does add a new way for me to learn as well. If you do try this audiobook, please let me know how it works for you. I’ll share any input you provide with the publisher as well so that we can work together to provide you with the best possible book materials in a format that works best for you. Please let me know your thoughts at [email protected].

Machine Learning Security and Event Sourcing for Databases

In times past, an application would make an update to a database, essentially overwriting the old data with new data. There wasn’t an actual time element to the update. The data would simply change. This approach to database management worked fine as long as the database was on a local system or even a network owned by an organization. However, as technology has progressed to use functionality like machine learning to perform analysis and microservices to make applications more scalable and reliable, the need for some method of reconstructing events has become more important.

To guarantee atomicity, consistency, isolation, and durability (ACID) in database transactions, products that rely on SQL Server use a transaction log to ensure data integrity. In the event of an error or outage, it’s possible to use the transaction log to rebuild pending database operations or roll them back as needed. It’s possible to recreate the data in the database, but the final result is still a static state. Transaction logs are a good start, but not all database management systems (DBMS) support them. In addition, transaction logs focus solely on the data and its management.

In a machine learning security environment, of the type described in Machine Learning Security Principles, this isn’t enough to perform analysis of sufficient depth to locate hacker activity patterns in many cases. The transaction logs would need to be somehow combined with other logs, such as those that track RESTful interaction with the associated application. The complexity of combining the various data sources would prove daunting to most professionals because of the need to perform data translations between logs. In addition, the process would prove time consuming enough that the result of any analysis wouldn’t be available in a timely manner (in time to stop the hacker).

Event sourcing, of the type that many professionals now advocate for microservice architectures, offers a better solution that is less prone to problems when it comes to security. In this case, instead of just tracking the data changes, the logs would reflect application state. By following the progression of past events, it’s possible to derive the current application state and its associated data. As mentioned in my book, hackers tend to follow patterns in application interaction and usage that fall outside the usual user patterns because the hacker is looking for a way into the application in order to carry out various tasks that likely have nothing to do with ordinary usage.
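
The idea of deriving the current state from past events can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the book or from any event sourcing product: the Event class, the action names, and the replay() function are all hypothetical, and a real system would persist the journal rather than keep it in a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # One journal entry: who did what, and with which value.
    user: str
    action: str
    amount: int = 0


def replay(journal):
    """Derive the current state by applying every event in order."""
    state = {"balance": 0, "failed_logins": 0}
    for event in journal:
        if event.action == "deposit":
            state["balance"] += event.amount
        elif event.action == "withdraw":
            state["balance"] -= event.amount
        elif event.action == "login_failed":
            state["failed_logins"] += 1
    return state


# The journal is append-only: events are added, never updated in place.
journal = []
journal.append(Event("sam", "deposit", 100))
journal.append(Event("sam", "login_failed"))
journal.append(Event("sam", "withdraw", 30))

current_state = replay(journal)
state_after_first_event = replay(journal[:1])
print(current_state)
print(state_after_first_event)
```

Because the full sequence of events is kept, the same journal that yields the current state can also be replayed to any earlier point or scanned for unusual sequences of actions, which a plain overwrite of the old data would have erased.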

A critical difference between event sourcing and other transaction logging solutions is that event sourcing relies on its own journal, rather than using the DBMS transaction log, making it possible to provide additional security for this data and reducing the potential for hacker changes to cover up nefarious acts. There are most definitely tradeoffs between techniques such as Change Data Capture (CDC) and event sourcing that need to be considered, but from a security perspective, event sourcing is superior. As with anything, there are pros and cons to using event sourcing, the most important of which from a security perspective is that event sourcing is both harder to implement and more complex. Many developers cite the need to maintain two transaction logs as a major reason to avoid event sourcing. These issues mean that it’s important to test the solution fully before delivering it as a production system.

If you’re looking to create a machine learning-based security monitoring solution for your application that doesn’t require combining data from multiple sources to obtain a good security picture, then event sourcing is a good solution to your problem. It allows you to obtain a complete picture of the entire domain’s activities that helps you locate and understand hacker activity. Most importantly, because the data resides in a single dedicated log that’s easy to secure, the actual analysis process is less complex and you can produce a result in a timely manner. The tradeoff is that you’ll spend more time putting such a system together. Let me know your thoughts about event sourcing as part of a security solution at [email protected].

Giving Hackers an Exciting Target

Hackers will attack anyone, any organization, or anything that seems to offer the promise of something in exchange for the time spent: money, resources, revenge…the list goes on. However, for many hackers the kicker for choosing somewhere to hit is some level of challenge, some sort of excitement. After all, why attack a boring site when there is one out there literally begging you to attack it? Such is the case with GrapheneOS, which bills itself as:

The private and secure mobile operating system with Android app compatibility.
GrapheneOS Website

According to Multiple DDoS Attacks at GrapheneOS — What’s Going On Behind the Scenes?, GrapheneOS has recently endured multiple attacks. I verified the story on Twitter from a post by GrapheneOS. Such an attack can happen to anyone at any time. Keeping a low profile seems prudent, but not always possible (as is the case here). One of the things I stressed when writing Machine Learning Security Principles is that anything an organization can do to make attacks harder and less attractive will only reduce the security burden of the organization in the long run. Keeping a low profile tends to make an attack less attractive.

The reason that I was attracted to this particular DDoS attack is that GrapheneOS is using Synapse, an AI-based product. The article, Synapse Technology Corporation: Using AI to Take a Good Look at Airport Security, tells you a bit more about the history of this product. In looking at the Synapse website, you can see that they have some interesting customers, including the military and government. Oddly enough, I’m not seeing any other reports of major problems with Synapse. The problem must be with the GrapheneOS security setup.

The bottom line is that if a hacker decides to break into your organization, it’ll happen at some point no matter how good your security systems are, which means that it’s essential to combine security with monitoring and analysis of attack vectors. Keeping a low profile is essential too because hackers, like most of the rest of us, love a good challenge. Reviewing attacks like the ones targeted at GrapheneOS can help you improve your own security setup. Let me know your thoughts on AI-based security at [email protected].
