Review of The Kaggle Book

A picture of The Kaggle Book cover.
The Kaggle Book tells you everything needed about competing on Kaggle.

The Kaggle Book by Konrad Banachewicz and Luca Massaron is a book about competing on Kaggle. The introductory chapters tell you all about Kaggle and the competitions it sponsors. The bulk of the book provides details on how to compete better against a variety of adversaries. The book ends with some insights into how competing in Kaggle can help with other areas of your life. If the book ended here, it would still be well worth reading cover-to-cover as I did, but it doesn’t end here.

My main reasons for reading the book were to find out more about how data is created and vetted on Kaggle, and to obtain some more insights on how to write better data science applications. In the end, the bulk of this book is a rather intense treatment of data science with a strong Kaggle twist. If you’re looking for datasets and to understand techniques for using them effectively, then this is the book you want to get because the authors are both experts in the field. The biases in the book are toward data management, verification, validation, and checks for model goodness. It’s the model goodness part that is hard to find in any other book (at least, the ones I’ve read so far).

The book does contain interviews from other people who have participated in Kaggle competitions. I did read a number of these interviews and found that they didn’t help me personally because of my goals in reading the book. However, I have no doubt that they’d help someone who was actually intending to enter a Kaggle competition, which sounds like a great deal of work. Before I read the book, I had no idea of just how much goes into these competitions and what the competitors have to do to have a chance of winning. What I found most important is that the authors stress the need to get something more out of a competition than simply winning—that winning is just a potential outcome of a much longer process of learning, skill building, and team building.

You really need this book if you are into data science at all because it helps you gain new insights into working through data science problems and ensuring that you’re getting a good result. I know that my own personal skills will be improved as I apply the techniques described in the book, which really do apply to every kind of data science development and not just to Kaggle competitions.

Technology and Child Safety

This is an update of a post that originally appeared on January 20, 2016.

I wrote a little over seven years ago that I had read an article in ComputerWorld, Children mine cobalt used in smartphones, other electronics, that had me thinking yet again about how people in rich countries tend to ignore the needs of those in poor countries. I had sincerely hoped at the time that things would be different, better, in seven years. Well, they’re worse! We’ve increased our use of cobalt dramatically in order to create supposedly green cars. The picture at the beginning of the ComputerWorld article says it all, but the details will have you wondering whether a smartphone or an electric car really is worth some child’s life. That’s right, any smartphone or electric car you buy may be killing someone and in a truly horrid manner. Children as young as 7 years old are mining the cobalt needed for the batteries (and other components) in the smartphones and electric cars that people seem to feel are so necessary for life (they aren’t you know; food, water, clothing, shelter, sleep, air, and reproduction are necessities, everything else is a luxury).

The problem doesn’t stop when someone gets rid of the smartphone, electric car, or other technology. Other children end up dismantling the devices sent for recycling. That’s right, a rich country’s efforts to keep electronics out of its landfills are also killing children because countries like India put these children to work taking them apart in unsafe conditions. Recycled wastes go from rich countries to poor countries because the poor countries need the money for necessities, like food. Often, these children are incapable of working by the time they reach 35 or 40 due to health issues induced by their forced labor. In short, the quality of their lives is made horribly low so that it’s possible for people in rich countries to enjoy something that truly isn’t necessary for life. To make matters worse, the vendors of these products build in obsolescence (making them unrepairable) so they can sell more products and make more money, increasing the devastation visited on children.

I’ve written other blog posts about the issues of technology pollution. However, the emphasis of these previous articles has been on the pollution itself. Taking personal responsibility for the pollution you create is important, but we really need to do more. Robotic (autonomous) mining is one way to keep children out of the mines and projects such as UX-1 show that it’s entirely possible to use robots in place of people today. The weird thing is that autonomous mining would save up to 80% of the mining costs of today, so you have to wonder why manufacturers aren’t rushing to employ this solution.

In addition, off world mining would keep the pollution in space, rather than on planet earth. Of course, off world mining also requires a heavy investment in robots, but it promises to provide a huge financial payback in addition to keeping earth a bit cleaner. The point is that there are alternatives that we’re not using. Robotics presents an opportunity to make things right with technology and I’m excited to be part of that answer in writing books such as Machine Learning Security Principles; Artificial Intelligence for Dummies, 2nd Edition; Algorithms for Dummies, 2nd Edition; Python for Data Science for Dummies; and Machine Learning for Dummies, 2nd Edition.

Unfortunately, companies like Apple, Samsung, and many others simply thumb their noses at laws that are in place to protect the children in these countries because they know you’ll buy their products. Yes, they make official statements, but read their statements in that first article and you’ll quickly figure out that they’re excuses and poorly made excuses at that. They don’t have to care because no one is holding them to account. People in rich countries don’t care because their own backyards aren’t sullied and their own children remain safe. It’s not that I have a problem with technology; quite the contrary. I have a problem with the manner in which technology is currently being made and supported. We need to do better. So, the next time you think about buying electronics, consider the real price for that product. Let me know what you think about polluting other countries to keep your country clean at [email protected].

Python for Data Science for Dummies Errata on Page 221

The downloadable source for Python for Data Science for Dummies contains a problem that doesn’t actually appear in the book. If you look at page 221, the code block in the middle of the page contains a line saying import numpy as np. This line is essential because the code won’t run without it. The downloadable source for Chapter 12 is missing this line so the example doesn’t run. This P4DS4D; 12; Stretching Pythons Capabilities link provides you with a .ZIP file that contains the replacement source code. Simply remove the P4DS4D; 12; Stretching Pythons Capabilities.ipynb file from the archive and use it in place of your existing file.

Luca and I always want you to have a great experience with our book, so keep those emails coming. Please let me know if you have any questions about the source code file update at [email protected]. I’m sorry about any errors that appear in the downloadable source and appreciate the readers who have pointed them out.

 

Python for Data Science for Dummies Errata on Page 145

Python for Data Science for Dummies contains two errors on page 145. The first error appears in the second paragraph on that page. You can safely disregard the sentence that reads, “The use_idf controls the use of inverse-document-frequency reweighting, which is turned off in this case.” The code doesn’t contain a reference to the use_idf parameter. However, you can read about it on the Scikit-Learn site. This parameter defaults to being turned on, which is how it’s used for the example.

The second error is also in the second paragraph. The discussion references the tf_transformer.transform() method call. The actual method call is tfidf.transform(), which does appear in the sample code. The discussion about how the method works is correct; only the name of the object is wrong.
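To make both corrections concrete, here is a minimal sketch rather than the book's actual listing. It assumes a small made-up corpus and uses scikit-learn's CountVectorizer to produce the counts. The tfidf name follows the sample code mentioned above; the docs, counts, and weighted names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical three-document corpus, used only for illustration.
docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Turn the raw text into a sparse matrix of token counts.
counts = CountVectorizer().fit_transform(docs)

# No use_idf argument is passed, so the default (use_idf=True) applies.
tfidf = TfidfTransformer().fit(counts)
print(tfidf.use_idf)

# The call is made on the tfidf object, not on tf_transformer.
weighted = tfidf.transform(counts)
print(weighted.shape)
```

Because use_idf is left at its default, the fitted object also has an idf_ attribute that holds the learned inverse-document-frequency weights.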

Please let me know if you have any questions about either of these changes at [email protected]. I’m sorry about any errors that appear in the book and appreciate the readers who have pointed them out.

 

Python for Data Science for Dummies Errata on Page 124

Python for Data Science for Dummies contains an error in the example that appears on the top half of page 124. In the first of the two grey boxes, the code computes the results of four print statements. The bottom-most print statement, print x[1:2, 1:2], is supposed to compute a result based on rows 1 and 2 of columns 1 and 2, and the bottom grey box seems to confirm that interpretation by showing the result as [[[14 15 16] [17 18 19]] [[24 25 26] [27 28 29]]]. However, the answer provided for this example in the downloadable source code is [[[14 15 16]]], which doesn’t agree with that in the text.

The good news is that the downloadable source contains the correct code. The error appears only in the book. The last print statement in the book is wrong. Here is the correct code (with output) for this example:

import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

print x[1,1]
print x[:,1,1]
print x[1,:,1]
print
print x[1:3, 1:3]

[14 15 16]
[ 5 15 25]
[12 15 18]

[[[14 15 16]
  [17 18 19]]

 [[24 25 26]
  [27 28 29]]]
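The difference between the two slices comes down to NumPy's exclusive stop index, and the following sketch shows it side by side. It uses Python 3 print() calls, which is an assumption on my part because the book's listing uses the Python 2 print statement. The book_slice and fixed_slice names are hypothetical.

```python
import numpy as np

# Same 3 x 3 x 3 array as in the errata example.
x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

# A slice's stop index is exclusive, so 1:2 selects only index 1.
book_slice = x[1:2, 1:2]

# 1:3 selects indexes 1 and 2 on each of the first two axes.
fixed_slice = x[1:3, 1:3]

print(book_slice.shape)   # (1, 1, 3)
print(fixed_slice.shape)  # (2, 2, 3)
print(fixed_slice)
```

The shapes make the problem easy to spot: the 1:2 slice returns a single row from a single block, while the 1:3 slice returns the two-by-two selection that the text describes.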

Please let me know if you have any questions about this example at [email protected]. I’m sorry about the error that appears in the book and appreciate the readers who have pointed it out.

 

Missing XMLData2.xml File

A number of readers have written to report that XMLData2.xml is missing from the downloadable source for Python for Data Science for Dummies. You encounter this file in Chapter 6, on page 108. The publisher has already added the file to the downloadable source, but you might be missing the file from your copy. If so, you can download it by clicking XMLData2.zip. I’m truly sorry about any problems that the missing file might have caused. Please be sure to let me know about your book-specific questions at [email protected].

 

Tip Error in Python for Data Science for Dummies

There is a small error on page 318 of Python for Data Science for Dummies. You can find it near the middle of the page in the Tip text. The current text on the second line of that paragraph says, “k as a number near the squared number of available observations.” However, the text should really read, “k as a number near the squared root number of available observations.” The word root is missing, which obviously changes the mathematical meaning of the text. Please accept our apologies for the typo. Let me know if you find any other errors of a technical nature in the book at [email protected] and I’ll be sure to provide a blog post about it here. Thank you for your support!
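To see how much the missing word matters, here is a small sketch that compares the two readings of the tip. The suggest_k function name and the sample observation counts are hypothetical, and the square root is only a starting point for k, not a rule the book states in this form.

```python
import math


def suggest_k(n_observations):
    """Return a starting value for k near the square root of the sample size."""
    # Guard against returning 0 for very small samples.
    return max(1, round(math.sqrt(n_observations)))


for n in (100, 150, 10000):
    # The misprinted tip would imply n squared, which exceeds the sample size.
    print(n, suggest_k(n), n ** 2)
```

With 100 observations, the corrected tip suggests a k of about 10, while the misprinted version would suggest 10,000, which is far more than the number of observations available.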

 

Missing File from Python for Data Science for Dummies Downloadable Source

A reader recently contacted me regarding a missing file from the downloadable source for Python for Data Science for Dummies. This is the P4DS4D; 01; Quick Overview.ipynb you need for the first chapter. Simply click here to download P4DS4D; 01; Quick Overview.ipynb. I’m also asking the publisher to add the missing file to the downloadable source found on the Dummies site at http://www.dummies.com/store/product/Python-for-Data-Science-For-Dummies.productCd-1118844181,descCd-DOWNLOAD.html. If you encounter any other problems with the book, please be sure to let me know at [email protected]. Thank you for your patience!

 

Finding and Employing Data Science Tools

Python for Data Science for Dummies introduces you to a number of common libraries used for data science experimentation and discovery. Most of these libraries also figure prominently as part of a data scientist’s toolbox because they provide common functionality needed for every application. Learning them is a great idea for anyone interested in expanding their knowledge of data science and how it can be applied to the field of Artificial Intelligence (AI). You can learn more about some of the basic principles, such as applying, developing, leveraging, and creating data science projects. However, these libraries are only the tip of the data science toolbox. Because data science is such a new technology, you can find all sorts of tools to perform a wide range of tasks, but there is little standardization, and some of these tools are hard to categorize so that you know where they fit within your toolbox. That’s why I was excited to see The data science ecosystem, the first of a three-part series of articles that describe some of the tools available for use in data science projects. If you are interested in finding out more about data science, you might want to check out this data science bootcamp for more information. You can also find the other two parts of the article at:

The problem for people who want to explore data science and machine learning today might not be the lack of tools, but the lack of creativity in using them. In order to explore data science, it’s important to understand that the tools only work when you prepare the data properly, employ the correct algorithm, and define reasonable goals. Those who are looking for suitable tools and help when starting to experiment with data science or machine learning might collaborate with other data scientists using this open-source dvc data science platform, or a similar one that can integrate many other data science tools. No matter how hard you try, data science and machine learning can’t provide you with the correct numeric sequences for the next five lottery wins. However, data science can help you locate potential sources of fraud in an organization. The article, Machine learning and the strategic snake oil reserve, sums up what may be the biggest problem with data science today: people expect miracles without putting in the required work. Fortunately, there are new tools on the horizon to make languages, such as Python, and products, such as Hadoop, easier for even the less creative mind to use (see Python and Hadoop project puts data scientists first).

Even with a great imagination, the tools available today may not do the job you want as well as they should because the underlying hardware isn’t capable of performing the required tasks. The process is further hampered by a misuse of the skills that data scientists provide (see You’re hiring the wrong data scientists for details). As a result, you need a large number of specialized tools in order to perform tasks that shouldn’t require them. However, that’s the reason why you need to know about the availability of these tools so that you can produce useful results on today’s hardware with a minimum of fuss. Asking the question, “How would Alan Turing fix A.I.?” helps you understand the complexities of the data science and machine learning environments.

Data science, machine learning, data scientists with even greater skills, and better hardware will keep the momentum going well into the future. As the Internet of Things (IoT) continues to move forward and the problem of what to do with all that data becomes even larger, data science will take on a larger role in everyone’s daily life. Count on reading more articles like, Google a step closer to developing machines with human-like intelligence, that describe the proliferation of new hardware and new tools to make the full potential of data science and machine learning a reality. In the meantime, getting the tools you need and exploring the ways in which you can creatively use data science to solve problems is the best way to go for now. Let me know your thoughts on the future of data science at [email protected].

Missing Python for Data Science for Dummies Companion Files

For all those long-suffering readers who have been missing the companion files for Python for Data Science for Dummies, they’re finally available at http://www.dummies.com/store/product/Python-for-Data-Science-For-Dummies.productCd-1118844181,descCd-DOWNLOAD.html. All you need to do is click the Click to Download link on the page. I’m truly sorry you needed to wait so long. Thank you to everyone who noticed the missing files and also the incorrect link in the book, which now appears in the book errata. Please let me know if you have any problems locating the files or downloading them at [email protected].