Technical – Page 10 – John's Random Thoughts and Discussions

Using Jupyter with Anaconda (Updated)

A few readers have recently written to me regarding the use of Jupyter with the downloadable source for Python for Data Science for Dummies. The version of Anaconda recommended for the book, 2.1.0, doesn’t rely on Jupyter, which is why the book doesn’t mention Jupyter. The book relies on IPython Notebook, which is what you should use to obtain the best reading experience. You can obtain the proper version from the Continuum archive. However, if you choose to download the current version of Anaconda, then using Jupyter becomes a possibility, although many of the procedures found in the book will require tweaking and the screenshots won’t match precisely.

In order to use Jupyter, you must still import the downloaded files into your repository. The source code comes in an archive file that you extract to a location on your hard drive. The archive contains a list of .ipynb (IPython Notebook) files containing the source code for this book (see the Introduction for details on downloading the source code). The following steps tell how to import these files into your repository:

Click Upload at the top of the page. What you see depends on your browser. In most cases, you see some type of File Upload dialog box that provides access to the files on your hard drive.
Navigate to the directory containing the files you want to import into Notebook.
Highlight one or more files to import and click the Open (or other, similar) button to begin the upload process. You see the file added to an upload list, as shown here. The file isn’t part of the repository yet—you’ve simply selected it for upload.

Upload Source Files to the Repository
Click Upload. Notebook places the file in the repository so that you can begin using it.
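
Before you upload anything, it can help to confirm which notebook files the archive actually contains. The following is only a minimal sketch, not part of the book: it assumes the archive is a .zip file, it targets Python 3 rather than the book's Python 2, and the helper name extract_notebooks is made up for illustration.

```python
import zipfile
from pathlib import Path


def extract_notebooks(archive_path, target_dir):
    """Extract a .zip archive and return the .ipynb files it contained.

    Hypothetical helper for illustration; assumes a zip archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # rglob also finds notebooks stored in subfolders of the archive.
    return sorted(target.rglob("*.ipynb"))
```

Calling extract_notebooks with the path of your downloaded archive and an empty folder of your choice returns the notebook paths, which are the files you then select in the File Upload dialog box.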

It’s important to both Luca and me that you have the best possible learning experience with our book. For most people, this means using the right version of Anaconda. Using the latest version shouldn’t cause problems, but we’d like to know if it does. Please feel free to contact me at [email protected] with your book-specific questions.

Update

It has come to our attention since this post was first published that using the latest version of Anaconda with Python for Data Science for Dummies is problematic. Some of the examples won’t work without rewriting because the Pandas Categorical class has changed. This is the only change we’ve confirmed so far, but there are no doubt other changes. In order to get the proper results from the examples in the book, you must use the correct version of Anaconda, version 2.1.0.
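
The post doesn't say which Categorical calls stopped working. As one possible illustration only, pandas 0.15 renamed the levels attribute to categories and the labels attribute to codes, so code written against the older names needs rewriting. The sketch below assumes a current pandas release and Python 3, and the sample values are invented for the example.

```python
import pandas as pd

# Newer-style construction: the allowed values are passed as `categories`.
colors = pd.Categorical(["red", "blue", "red"],
                        categories=["blue", "green", "red"])

# Older releases exposed these two attributes as `levels` and `labels`.
print(list(colors.categories))  # the allowed category values
print(colors.codes.tolist())    # position of each value within `categories`
```

If an example from the book fails on a newer Anaconda, checking for the older attribute names is a reasonable first step, but installing version 2.1.0 remains the dependable fix.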

Please do keep those questions coming. It’s because a reader took time to write that Luca and I became aware of this problem. We truly do want you to have a great learning experience, so these questions are important!

Python for Data Science for Dummies Errata on Page 124

Python for Data Science for Dummies contains an error in the example that appears on the top half of page 124. In the first of the two grey boxes, the code computes the results of four print statements. The bottom-most print statement, print x[1:2, 1:2], is supposed to compute a result based on rows 1 and 2 of columns 1 and 2, and the bottom grey box seems to confirm that interpretation by showing the result as [[[14 15 16] [17 18 19]] [[24 25 26] [27 28 29]]]. However, the answer provided for this example in the downloadable source code is [[[14 15 16]]], which doesn’t agree with that in the text.

The good news is that the downloadable source contains the correct code. The error appears only in the book. The last print statement in the book is wrong: a slice excludes its stop index, so 1:2 selects only index 1, and selecting indexes 1 and 2 requires 1:3. Here is the correct code (with output) for this example:

import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

print x[1,1]
print x[:,1,1]
print x[1,:,1]
print
print x[1:3, 1:3]

[14 15 16]
[ 5 15 25]
[12 15 18]

[[[14 15 16]
  [17 18 19]]

 [[24 25 26]
  [27 28 29]]]
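
To see the difference between the slice printed in the book and the corrected slice side by side, you can compare their shapes. This is only a sketch: it uses Python 3 print() calls instead of the book's Python 2 print statements, and the names book_slice and fixed_slice are made up for illustration.

```python
import numpy as np

x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[11, 12, 13], [14, 15, 16], [17, 18, 19]],
              [[21, 22, 23], [24, 25, 26], [27, 28, 29]]])

# The stop index of a slice is exclusive, so 1:2 selects only index 1.
book_slice = x[1:2, 1:2]
# 1:3 selects indexes 1 and 2 on each of the first two axes.
fixed_slice = x[1:3, 1:3]

print(book_slice.shape)   # (1, 1, 3)
print(fixed_slice.shape)  # (2, 2, 3)
print(book_slice)
print(fixed_slice)
```

The first slice holds the single row [14 15 16], while the second holds the four rows shown in the output above.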

Please let me know if you have any questions about this example at [email protected]. I’m sorry about the error that appears in the book and appreciate the readers who have pointed it out.

Getting the Fastest Question Response

I always want to be sure that you get fast, courteous responses to your book-specific questions. Even though I don’t check my e-mail every day, I do check it most days of the week, so that’s the fastest way to contact me regarding issues that you have with my books. Of course, you can make the response even faster by doing a few simple things when sending your e-mail:

Be sure to include the name of the book and the book edition in the message subject line.
Tell me which page, figure, or listing number to look at in the book.
Document the steps you took.
Provide me with the exact error message you’re seeing.
Tell me about your platform (operating system, the version of any software you’re using, and so on).

If you provide these basic pieces of information, I can usually answer your questions much faster, often without asking for additional information. E-mail communication can be difficult at times because it lacks that in-person body language element and you can’t show me what you’re seeing on your machine. Remote diagnostics are harder than you might think.
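
If you aren't sure how to describe your platform, a few lines of Python can gather the details for you. The following is a minimal sketch under a few assumptions: it targets Python 3, the function name environment_report is invented for this example, and the default package names are simply libraries the book relies on.

```python
import platform
import sys


def environment_report(package_names=("numpy", "pandas")):
    """Return text describing the OS, Python, and package versions."""
    lines = [
        "OS: " + platform.platform(),
        "Python: " + sys.version.split()[0],
    ]
    for name in package_names:
        try:
            module = __import__(name)
            version = str(getattr(module, "__version__", "unknown"))
        except ImportError:
            version = "not installed"
        lines.append(name + ": " + version)
    return "\n".join(lines)


print(environment_report())
```

You can paste the printed text into your message along with the exact error message you're seeing.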

It’s also important that you understand that I focus on book-specific questions. I’ve discussed this issue before in Sending Comments on My Books and Sending Comments and Asking Questions. The bottom line is that I want you to be happy with your book experience, but I also don’t have time to provide free consulting. Please let me know if you have any questions or concerns about contacting me at [email protected].

Missing XMLData2.xml File

A number of readers have written to report that XMLData2.xml is missing from the downloadable source for Python for Data Science for Dummies. You encounter this file in Chapter 6, on page 108. The publisher has already added the file to the downloadable source, but you might be missing the file from your copy. If so, you can download it by clicking XMLData2.zip. I’m truly sorry about any problems that the missing file might have caused. Please be sure to let me know about your book-specific questions at [email protected].

The Internet – The Home of Old Data Made New

I have to admit to making this error myself. I’ll perform a search online and fail to fully check the freshness date of the information I obtain. Of course, there are several levels of freshness date to consider. The first level is the information source. This is the easiest level of data to check. You simply look at the date of the material when you get to the page. Unfortunately, some authors don’t date their work, so you can’t always rely on a posting date. The next best alternative is to ask the search engine to list only those entries that come from a certain time frame. In most cases, you can verify that the information appearing in an article or other posting is current enough for your needs.

Unfortunately, just verifying the posting date may not be good enough. The second level of check is the version of the products discussed as part of the post. For example, you might come to my blog and find a post on CodeBlocks. Unless you read the article carefully, you might think that I’m discussing the latest version of CodeBlocks. However, I have a number of books that rely on CodeBlocks, so I might actually be discussing an older version of CodeBlocks that I used in a specific book. Reading carefully and ensuring you understand version issues is the best way to verify this second level of information.

A third level of freshness checking is the information sources used by the author. This is where things get tricky because the author could truly think that the information source used for an article is the most current available, yet it’s outdated before the author even uses it. Some technologies change so fast that using a resource even a few months old is deadly. These resources become outdated so quickly that they can blindside even a professional author, much less someone who writes on the side. Verifying this level of information requires that you depend on at least three information sources (I recommend finding as many as you can). Gently nudging an article author and mentioning that the information sources might contain outdated material is often helpful when done in a constructive manner.

Freshness checking can occur at even deeper levels. The point is that you can’t be sure that a resource that keeps information literally forever contains the latest information on any given topic. In addition, even when that information is available, it’s up to you to find it. I do try to provide the latest information available when I can. However, when the topic is a question on an older book, I need to address the question in the context of that book and will provide you with some sort of version information so you know what to expect. If you ever question the freshness of the information I provide, please feel free to contact me at [email protected].

Tip Error in Python for Data Science for Dummies

There is a small error on page 318 of Python for Data Science for Dummies. You can find it near the middle of the page in the Tip text. The current text on the second line of that paragraph says, “k as a number near the squared number of available observations.” However, the text should really read, “k as a number near the squared root number of available observations.” The word root is missing, which obviously changes the mathematical meaning of the text. Please accept our apologies for the typo. Let me know if you find any other errors of a technical nature in the book at [email protected] and I’ll be sure to provide a blog post about it here. Thank you for your support!
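
To show how much the missing word matters, the sketch below compares the two readings for a sample data set. It is only an illustration: the function name suggested_k and the figure of 400 observations are made up, and the rounding rule is an assumption, since the Tip only says that k should be near the square root.

```python
import math


def suggested_k(n_observations):
    """Return a whole number near the square root of the observation count."""
    if n_observations < 1:
        raise ValueError("need at least one observation")
    return max(1, round(math.sqrt(n_observations)))


n = 400                # hypothetical number of observations
print(suggested_k(n))  # 20, the reading the Tip intends
print(n ** 2)          # 160000, the reading the typo implies
```

With 400 observations, the corrected text points to a k of about 20, while the misprinted text would suggest a value far larger than the data set itself.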

Reviews, Darned Reviews, and Statistics

A friend recently pointed me toward an article entitled, “Users who post ‘fake’ Amazon reviews could end up in court.” I’ve known for a long time that some authors do pay to get positive reviews for their books posted. In fact, some authors stoop to paying for negative reviews of competing works as well. Even though the actual technique used for cheating on reviews has changed, falsifying reviews is an age old problem. As the Romans might have said, caveat lector (let the reader beware). If there is a way to cheat at something, someone will most certainly find it and use it to gain a competitive advantage. Amazon and other online stores are quite probably fighting a losing battle, much as RIAA has in trying to get people to actually purchase their music (see Odd Fallout of Digital Millennium Copyright Act (DMCA) for a discussion of the ramifications of IP theft). The point is that some of those reviews you’ve been reading are written by people who are paid to provide either a glowing review of the owner’s product or lambaste a competitor’s product.

Of course, it’s important to understand the reasoning behind the publication of false reviews. The obvious reason is to gain endorsements that will likely result in better sales. However, that reason is actually too simple. At the bottom of everything is the use of statistics for all sorts of purposes today, including the ordering of items on sales sites. In many cases, the art of selling comes down to being the first seller on the list and having a price low enough that it’s not worth looking at the competitors. Consequently, sales often hinge on getting good statistics, rather than producing a good product. False reviews help achieve that goal.

I’ve spent a good deal of time emphasizing the true role of reviews in making a purchase. A review, any review you read, even mine, is someone’s opinion. When someone’s opinion tends to match your own, then reading the review could help you make a good buying decision. Likewise, if you know that someone’s opinion tends to run counter to your own, then a product they didn’t like may be just what you want. Reviews are useful decision making tools when viewed in the proper light. It’s important not to let a review blind you to what the reviewer is saying or to the benefits and costs of obtaining particular products.

Ferreting out false reviews can be hard, but it’s possible to weed out many of them. Reviews that seem too good or too dire to be true probably are fakes. Few products get everything right. Likewise, even fewer products get everything wrong. Someone produces a product in the hope of making sales, so creating one that is so horrid as to be completely useless is rare (it does happen, though, and there are legal measures in place to deal with these incidents).

Looking for details in the review, as well as information that is likely false is also important. Some people will write a review without ever having actually used the product. You can’t review a product that you haven’t tried. When you read a review here, you can be sure that I’ve tried out every feature (unless otherwise noted). Of course, I’m also not running a test lab, so my opinion is based on my product usage—you might use the product in a different manner or in a different environment (always read the review thoroughly).

As you look for potential products to buy online, remember to take those reviews with a grain of salt. Look for reviews that are obviously false and ignore them. Make up your own mind based on experiences you’ve had with the vendor in the past or with similar products. Reviews don’t reduce your need to remain diligent in making smart purchases. Remember those Romans of old, caveat lector!

Missing File from Python for Data Science for Dummies Downloadable Source

A reader recently contacted me regarding a missing file from the downloadable source for Python for Data Science for Dummies. This is the P4DS4D; 01; Quick Overview.ipynb you need for the first chapter. Simply click here to download P4DS4D; 01; Quick Overview.ipynb. I’m also asking the publisher to add the missing file to the downloadable source found on the Dummies site at http://www.dummies.com/store/product/Python-for-Data-Science-For-Dummies.productCd-1118844181,descCd-DOWNLOAD.html. If you encounter any other problems with the book, please be sure to let me know at [email protected]. Thank you for your patience!

Finding and Employing Data Science Tools

Python for Data Science for Dummies introduces you to a number of common libraries used for data science experimentation and discovery. Most of these libraries also figure prominently as part of a data scientist’s toolbox because they provide common functionality needed for every application. Learning them is a great idea for anyone interested in expanding their knowledge of data science and how it applies to the field of Artificial Intelligence (AI), including basic principles such as developing and applying data science projects. However, these libraries are only the tip of the data science toolbox. Because data science is such a new technology, you can find all sorts of tools to perform a wide range of tasks, but there is little standardization, and some of these tools are hard to categorize so that you know where they fit within your toolbox. That’s why I was excited to see The data science ecosystem, the first of a three-part series of articles that describe some of the tools available for use in data science projects. If you are interested in finding out more about data science, you might want to check out this data science bootcamp for more information. You can also find the other two parts of the article at:

The problem for people who want to explore data science and machine learning today might not be the lack of tools, but the lack of creativity in using them. In order to explore data science, it’s important to understand that the tools only work when you prepare the data properly, employ the correct algorithm, and define reasonable goals. Those looking for suitable tools and help as they start experimenting with data science or machine learning might collaborate with other data scientists using this open-source dvc data science platform, or a similar one that can integrate many other data science tools. No matter how hard you try, data science and machine learning can’t provide you with the correct numeric sequences for the next five lottery wins. However, data science can help you locate potential sources of fraud in an organization. The article, Machine learning and the strategic snake oil reserve, sums up what may be the biggest problem with data science today: people expect miracles without putting in the required work. Fortunately, there are new tools on the horizon to make languages, such as Python, and products, such as Hadoop, easier for even the less creative mind to use (see Python and Hadoop project puts data scientists first).

Even with a great imagination, the tools available today may not do the job you want as well as they should because the underlying hardware isn’t capable of performing the required tasks. The process is further hampered by a misuse of the skills that data scientists provide (see You’re hiring the wrong data scientists for details). As a result, you need a large number of specialized tools in order to perform tasks that shouldn’t require them. However, that’s the reason why you need to know about the availability of these tools so that you can produce useful results on today’s hardware with a minimum of fuss. Asking the question, “How would Alan Turing fix A.I.?” helps you understand the complexities of the data science and machine learning environments.

Data science, machine learning, data scientists with even greater skills, and better hardware will keep the momentum going well into the future. As the Internet of Things (IoT) continues to move forward and the problem of what to do with all that data becomes even larger, data science will take on a larger role in everyone’s daily life. Count on reading more articles like, Google a step closer to developing machines with human-like intelligence, that describe the proliferation of new hardware and new tools to make the full potential of data science and machine learning a reality. In the meantime, getting the tools you need and exploring the ways in which you can creatively use data science to solve problems is the best way to go for now. Let me know your thoughts on the future of data science at [email protected].

Missing Python for Data Science for Dummies Companion Files

For all those long-suffering readers who have been missing the companion files for Python for Data Science for Dummies, they’re finally available at http://www.dummies.com/store/product/Python-for-Data-Science-For-Dummies.productCd-1118844181,descCd-DOWNLOAD.html. All you need to do is click the Click to Download link on the page. I’m truly sorry you needed to wait so long. Thank you to everyone who noticed the missing files and also the incorrect link in the book, which now appears in the book errata. Please let me know if you have any problems locating the files or downloading them at [email protected].