UNISYS 4TH TRAINING VIDEO DATA SCIENCE FUNDAMENTALS

 

Data Science Fundamentals

Code readability and simplicity are the primary design goals of the Python language. Add a few key APIs and it becomes a powerful data analysis tool. Examine basic data science fundamentals and how to apply them to Python.

Introduction to Jupyter/IPython

[Topic Title: Introduction to Jupyter/IPython. Your host for this session is Wesley Miller. The ANACONDA NAVIGATOR is open on the Home section. A series of apps, such as Spyder and Glueviz, display.] In this video, we're going to run the Jupyter Notebook and familiarize ourselves with the basics of its user interface. So as with other Python data science tools and packages, there are several different ways in which you may obtain and install Jupyter Notebook. But in most cases, it is recommended that you download and install a Python data science platform like Anaconda, which already includes the Jupyter Notebook application, among other Python data science packages and libraries. The Jupyter Notebook is a web-based interactive computing environment in which you may run and store code along with any code output and Markdown notes in a modifiable document called a notebook. As such, when you run the Jupyter Notebook application, this starts the notebook server and opens a browser window that you may use to carry out interactions like creating new notebook documents, editing existing ones, and so on. So let's now launch the Jupyter Notebook from here in the Anaconda Navigator. [The presenter clicks LAUNCH on the app tile.] And we see that this opens a new terminal window in which we can see a log of all the activity that's taken place as the application starts up. [A terminal window is displayed.]

And then upon completion of application launch, a browser window is automatically opened that displays the Notebook dashboard. [The presenter switches to the browser window.] All right, so here in the dashboard we have these four tabs, Files, Running, Clusters and Conda. The Files tab displays all files in the notebook's default directory, which in this case is the home directory. And we're just going to navigate to the devel directory and then to _pythonNotebooks, as this is where we're going to create and store our notebook for this session. Now, if we temporarily switch to the Running tab, you see that it simply displays the notebook documents and terminals that are currently running. [In this case, nothing is running yet.] And if we click on Clusters, this displays any running clusters. [No clusters are running at this stage.] And clicking on Conda displays all available Conda environments, along with their installed packages and a list of packages available to install into each environment. In this case, we only have one Conda environment running, named root. [There are 595 available packages and 183 installed packages in environment "root".]

Now, we're going to click on the Files tab and we're going to create a new notebook in our Conda environment by clicking on the New button. [A drop-down menu opens. Options include Text File, Folder, and Terminal.] And then, under Notebooks we're going to click Python with Conda root in square brackets. [He clicks Python [conda root] in the Notebooks section of the menu. The other option is Python [default].] And this opens up a new notebook document here in the web browser with a new cell that's in edit mode. And at the top of the page we have this notebook header that contains the name of the notebook, as well as the time at which a checkpoint was last saved for this notebook. [The new notebook is Untitled. The Last Checkpoint was a few seconds ago.] If we click on the File menu, here we have options for performing notebook file based operations like creating new notebooks, opening existing ones, copying, renaming etc. [He refers to options such as New Notebook, Make a Copy, and Rename.] So let's now rename our notebook to basic, by clicking on Rename, [The Rename Notebook window opens. He types basic in the Enter a new notebook text field.] now click OK. And then if we go back to the notebook dashboard, [He clicks the browser tab.] we'll see that our new notebook has been renamed accordingly. It has the .ipynb extension which is short for IPython Notebook. So, let's now go back to the notebook document. And we're going to look at more of the menu bar options available. If we click on Edit, here we have options for performing cell edit operations like copying, cutting, pasting, deleting, splitting, merging and moving cells. If we click on View, we have options for changing our notebook page appearance.

So we can toggle the header on and off. [He refers to the Toggle Header option.] We may also toggle the toolbar on and off. [He refers to the Toggle Toolbar option.] And we can make changes to the cell toolbar [He selects Cell Toolbar. A flyout menu opens, with options that include Edit Metadata, Raw Cell Format, and Slideshow.] and we may also toggle presentation mode on and off. [He refers to the Toggle Presentation option.] And if we click on the Insert menu item, we may insert new cells above or below the active cell. [Options in the Insert menu are Insert Cell Above and Insert Cell Below.] If we click on Cell, we may run multiple cells or change a cell's type and also change how output gets displayed. [Options include Run Cells, Run All, and Cell Type.] If we click on Kernel, we have options for performing kernel-based operations. [Options include Interrupt, Restart, and Restart & Clear Output.] If we click on Widgets, we may perform widget-based operations. [Options include Save notebook with snapshots and Embed widgets.] And if we click on Help, we have a very handy help facility available here to help us better leverage the powerful features of the Notebook application. [Options include User Interface Tour, Keyboard Shortcuts, and Notebook Help.]

All right, so beneath our notebook menu bar, we have the toolbar that we can use to conveniently interact with our notebook. We can save and checkpoint, insert a new cell below the active cell, cut selected cells, copy selected cells, paste cells below the active cell, move a cell up or down, and run our cell. We can also interrupt the kernel and restart the kernel, [He refers to icons on the toolbar.] and we have cell type selection options here. [He clicks the cell type drop-down menu. Options include Code, Markdown, and Raw NBConvert.] We may display the Command palette, we may show the cell toolbar selector location, we may publish our notebook to the Anaconda cloud, we may edit our presentation, and we may show our presentation here. [He refers to additional icons on the toolbar.] All right, so with that, let's now get into adding and running some code in our notebook. So first I'm going to import the NumPy package and the matplotlib.pyplot library. And note that as I hit Enter, the code did not actually execute automatically. And this is because code in cells will not get executed until the cell is actually run. So, I'll finish typing here. [The complete code is import numpy as np import matplotlib.pyplot as plt.]

Then we're going to run our cell by going to the toolbar here. And we're going to click on run. [He clicks the run cell, select below icon.] And observe that this runs the code in the cell, and opens a new cell below for editing. So let's now use the NumPy package to create a new array of seven random floating point numbers, [The code is np.random.rand(7).] and run cell. And we see that this now generates some output. [The output produces an array of seven numbers. Each is a 0 followed by a decimal point and up to eight additional numbers.] Now, whenever a Python object gets returned by an expression, Python's display mechanism gets triggered, which results in an output prompt being displayed which, in this case, is Out[2]. So let's now create a new array containing 13 floating point numbers evenly spaced between 0 and 5. So for this, we use the NumPy linspace function. [The code is np.linspace(0,5,13).] And then we're going to run the cell. [He clicks Run on the toolbar. The output produces an array of 13 numbers with values that range from 0 to 5. The numbers in between are fractions such as 2.91666667.] And then we'll create an automatic sequence of integers from 0 up to, but not including, 6. [The code is np.arange(6).] And then run the cell. [He clicks Run on the toolbar. The output 4 is array([0, 1, 2, 3, 4, 5]).] Okay, so with the display mechanism triggered, we're able to access the last output value by using the underscore character. [He types an underscore in the cell.] So, I'll run this. [He clicks Run.] And there you can see that we're displaying the array that we got from the last output. [Output 5 is the same as output 4.] And we may also select a specific output value by using the underscore character along with the corresponding number of the output, all right? [He types _2 in the cell.] So we see that running the cell displays the value for output 2 above.

We can even include these output values in subsequent operations. [The code is np.sin(_3).] We run the cell. [He clicks Run.] So we took the array obtained in output 3 and performed a sine operation on each of its elements. [The output produces an array of 13 numbers, starting with 0 and eventually becoming negative numbers. For example, the last result is -0.95892427.] So now let's enter and run an expression that will plot our original linspace array against this new array above and then show the plot. [The code is plt.plot(_3,_7) plt.show().] We run cell. [He clicks the Run cell option on the toolbar.] And here we see that this displays an inline figure in our notebook. [A graph is plotted with values 0 to 5 along the x-axis and values from -1.0 to 1.0 along the y-axis. The highest point is between 1 and 2 and the lowest point between 4 and 5.] And now if we only wanted to display our inputs here in the notebook page, we can toggle our output off by going to Cell - All Output - Toggle, and see that this hides all of our output. And this makes it a bit easier to see our sequence of inputs at a glance. And of course, we can reshow our outputs by following the same steps again. All right, so we can also generate and view a slide presentation for this notebook by clicking the Edit Presentation button in the toolbar. [The presentation toolbar opens. Options include Slides, Themes, and Help.] And then we'll click Slides, [The Slides pane opens. A note states that there are currently no slides to display.] then we'll click Basic, 9 Slides, [He clicks the 9 Slides option under the Basic section.] and this automatically generates slides for our notebook. And as I hover over each of these slides, you can see a preview of each slide's content. Now by default, each input cell and its corresponding output are placed on a new slide.
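For reference, here is a consolidated sketch of the code cells run in this demo. Each statement below is meant to be run in its own notebook cell, in the same order as in the video, since the output history references (_, _2, _3, _7) depend on the cell execution order.

import numpy as np
import matplotlib.pyplot as plt

np.random.rand(7)     # Out[2]: seven random floats between 0 and 1
np.linspace(0,5,13)   # Out[3]: 13 evenly spaced values from 0 to 5
np.arange(6)          # Out[4]: array([0, 1, 2, 3, 4, 5])
_                     # Out[5]: repeats the most recent output (Out[4])
_2                    # displays the value of Out[2]
np.sin(_3)            # Out[7]: element-wise sine of the linspace array
plt.plot(_3,_7)       # plot the linspace values against their sines
plt.show()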

So now we go up and click Present in the presentation toolbar along the right side of the page, and then click Slides to hide the preview slide at the bottom of the page, and we're presented with our first slide which shows our first input. And then moving the mouse cursor to the bottom right of the page displays our slide presentation controls, in which we may jump back to the first slide, go back to the previous slide, go to the next slide, or go back to the notebook view. Now if we keep clicking Next, we see that each slide contains an input and its corresponding output. And we click through to the end of the slides here. Finally, we can go back to the notebook view by clicking Notebook. And then, we may save and checkpoint by clicking on the Save and Checkpoint button. [He clicks the icon on the main toolbar.] And then we can shut down our notebook kernel from the dashboard. So we're going to go back to the dashboard, [He selects the tab in the browser.] click on the Running tab, click Shutdown by our notebook name. And then we're going to go back to the terminal. [He switches windows and returns to the terminal window.] And we're going to shut down the notebook server by hitting Ctrl+C and then selecting Y to confirm shutdown. All right, so that's it. In this video, we have run the Jupyter Notebook and have familiarized ourselves with the basics of its user interface.

Working with Jupyter/IPython

[Topic Title: Working with Jupyter/IPython. Your host for this session is Wesley Miller. ANACONDA NAVIGATOR is open on the Home section. Apps such as Jupyter Notebook and Spyder display.] In this video, we're going to look at capturing Python code output in Jupyter Notebook. So we're going to start by launching the Jupyter Notebook application from here in the ANACONDA NAVIGATOR. [The presenter clicks Launch on the app tile. The dashboard opens on the Files tab. Other tabs include Running and Clusters.] And then here in the Notebook dashboard we're going to create a new Notebook in our existing conda environment. [The presenter clicks the devel folder in the list.] First, I'm going to switch to the _pythonNotebooks directory. And then I'm going to click on New - Python [conda root]. [He clicks the New button and selects Python [conda root] from the drop-down menu. An Untitled notebook opens in a new browser tab.] And then I'm just going to rename this Notebook as nb_Capture, [He clicks the File - Rename. The Rename Notebook window opens. He types the name in the Enter a new notebook name field.] and then say OK. Now IPython has a cell magic command named capture that can be used to capture the standard output and standard error streams of a cell. And we may use this cell magic command to either store the streams in a variable in our current namespace or we may use it to discard the streams, which is what it does by default.

All right, so let's use the capture cell magic command to capture standard output and standard error streams into a variable. [He types the following code in the cell: from __future__ import print_function import sys.] Okay, so let's run. [He clicks the run cell, select below icon on the toolbar.] And then we're going to use our cell magic command, give it a name, and capture a standard output stream as well as a standard error stream. [The code is %%capture myCapture print('This is standard output') print('This is standard error', file=sys.stderr).] Then we'll run this cell, [He clicks the run cell, select below icon on the toolbar.] and then we'll type the name of our capture object and execute. [The code is: myCapture. Code ends. He clicks the run cell, select below icon on the toolbar.] And this shows that our stream capture is now an object in our current namespace. [The output is <IPython.utils.capture.CapturedIO at 0x104949390>.] If we call our capture object as a function and execute, [The code is myCapture(). Code ends. He clicks the run cell, select below icon on the toolbar.] this prints both streams that were captured, both the standard output and the standard error stream. [The output is This is standard output and This is standard error.] And we can also print our captured streams individually by obtaining the value of either the stdout or the stderr property of our capture object. [The code is myCapture.stdout. Code ends.] Execute [He clicks the run cell, select below icon on the toolbar. The output is 'This is standard output\n'. He adds code to the next cell. The code is myCapture.stderr.] and execute. [The output is 'This is standard error\n'.] All right, now we may also use our cell magic capture command to capture all other output. And to demonstrate this, we're going to use it to capture and subsequently execute a verbose plot operation. Now first, we're going to import the NumPy package. And we're also going to import the matplotlib.pyplot library. [The code is import numpy as np import matplotlib.pyplot as plt.] Then we'll execute, [He clicks the run cell, select below icon on the toolbar.] and then we're going to use our cell magic command.
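For reference, here is a consolidated sketch of the stream-capture cells shown above. Each group of lines belongs in its own notebook cell, and the %%capture magic must appear on the first line of its cell.

from __future__ import print_function
import sys

# --- next cell: capture both streams into an object named myCapture ---
%%capture myCapture
print('This is standard output')
print('This is standard error', file=sys.stderr)

# --- later cells: inspect and replay the captured streams ---
myCapture            # <IPython.utils.capture.CapturedIO ...>
myCapture()          # replays both captured streams
myCapture.stdout     # 'This is standard output\n'
myCapture.stderr     # 'This is standard error\n'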

So we have our cell magic command, and we're going to name the capture object myVerboseplot. [The code line is %%capture myVerboseplot.] And we have this first print statement informing us that the code is about to build the x-axis data points. [The code line is print("Building x-axis data points...").] Then we're using the NumPy linspace function to create our x-axis array, which is going to be 300 points spaced evenly between 0 and 5. [The code line is x = np.linspace(0,5,300).] Then we have this next print statement informing us that the y-axis data points are about to be constructed. [The code line is print("Building y-axis data points...").] Then we use the NumPy cosine function to take the cosine of each element in our x-axis array and store those as our y-axis data points. [The code line is y = np.cos(x).] Then we have a print statement informing us that the plot is about to be constructed and the axes labeled. [The code line is print("Performing plot and labeling axes...").] Then we use our matplotlib.pyplot library to plot x and y. [The code line is plt.plot(x,y).] And then we're going to label the x-axis and label the y-axis. [The code is plt.xlabel('x') plt.ylabel('cos(x)').] Then we have another print statement informing us that the plot is about to be displayed. [The code line is print("Displaying plot now: ").] Then we use pyplot to call the show method, okay? [The code line is plt.show(). Code ends. The complete code is provided for reference in the Capture myVerboseplot section at the end of the transcript.] So with this, we'll then run this cell. And then we'll call our capture object as a function, [The code is myVerboseplot().] and run the cell. [He clicks the run cell, select below icon on the toolbar.]

And if we scroll down, you can see the output as expected: each of our print statements to standard out and, finally, our plot being displayed as an inline figure here on our Notebook page. [The output displays the messages Building x-axis data points..., Building y-axis data points..., Performing plot and labeling axes..., and Displaying plot now:. A graph displays with the x-axis labeled x and running from 0 to 5 and the y-axis labeled cos(x) with values from -1.0 to 1.0.] All right, so with that, we'll now just shut down the Notebook kernel from the dashboard. [He clicks the dashboard browser tab.] So I'm going to go to Running, [He clicks the Running tab.] click Shutdown. [He clicks the option in the Notebooks section.] And then we're going to close the dashboard and close the Notebook, [He closes both browser tabs.] and we're going to terminate the Notebook server here in the terminal by hitting Ctrl+C and selecting y and Enter. And that's it. So in this video, we have looked at capturing Python code output in Jupyter Notebook.

Capture myVerboseplot

%%capture myVerboseplot
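# Note: numpy and matplotlib.pyplot are assumed to have been imported in an
# earlier cell as np and plt, as shown in the demo.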
print("Building x-axis data points...")
x = np.linspace(0,5,300)
print("Building y-axis data points...")
y = np.cos(x)
print("Performing plot and labeling axes...")
plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('cos(x)')
print("Displaying plot now: ")
plt.show()

Introduction to NumPy

[Topic Title: Introduction to NumPy. Your host for this session is Wesley Miller. The ANACONDA NAVIGATOR is open on the Home section. Apps such as Jupyter Notebook and Spyder display in the main view.] In this video, we're going to look at basic access and usage of the NumPy package in a Python development environment. Okay, so we're going to start here in the ANACONDA NAVIGATOR. And I'll mention that if you require the use of the most powerful and popular Python data science tools, it is highly recommended that you download and install a data science platform, such as Anaconda for example, that includes and automatically manages your Python environment and any necessary Python data science packages. Anaconda is open source, and it ships with all the popular requisite libraries and packages that enable you to quickly and seamlessly get under way with scientific computation and data analysis in a Python development environment. NumPy is one such package, as it is the foundation upon which the other higher-level Python data science tools are built. Now, Anaconda ships with the Spyder interactive development environment, or IDE, in which you may write and execute Python code and scripts.

And it includes many handy features that assist you in writing your code and carrying out any required data analysis. We access the Spyder IDE via the ANACONDA NAVIGATOR here by clicking Launch. [The presenter clicks the Launch button on the spyder tile.] And now here within the Spyder IDE, we may use either the regular Python console or the interactive Python console to access and work with NumPy. Now for this video, we're going to use an IPython console. So let's open a new IPython console window, Consoles - Open an IPython console. [Other options on the Consoles menu are Open a Python console and Connect to an existing kernel. The console opens and displays some help options along with a brief note on what they do. For example, type a question mark for an introduction and overview of IPython's features.] And here in the IPython console window, if we want to confirm that we indeed have the NumPy package already installed and ready for use, we may attempt to obtain the entire help documentation on NumPy by typing and entering the following. [The presenter types the following code: help('numpy').] So we're using Python's help to get more information on the NumPy package, we'll hit Enter.

And if we scroll up through this output, you'll see that we're presented with a lot of information regarding the use of NumPy in the Python environment. However, this output is not very convenient for quick reference, and you can see it's actually even clipped; we can't scroll to the top here in the console window. But we can query the Python help for more specific information, like, for example, a particular NumPy method. And one of the most commonly used methods is the built-in array method that is used to create new NumPy arrays. So to obtain this help information, you simply type and enter the following. [The code is help('numpy.array').] So we enter numpy.array as a string and we hit Enter. And here at the top of this output, we're presented with a synopsis and usage examples of the NumPy array method. [The output also includes parameters and return information.]

All right, so before we wrap up this video, let's just create a simple NumPy array to demonstrate usage. So first, we need to import the NumPy package into an object that we can subsequently use to refer to NumPy in our current workspace. So to do this, we type and execute an import statement. [The code is import numpy as mynp.] And with this, we can now use NumPy functionality in our environment by calling methods on this object named mynp. So, for example, to create a new NumPy array we simply call the array method on this object and pass in the list of numbers that we want to be included as values in the array. So I'm going to store this new NumPy array in an object named array1, calling the array method on our NumPy object and passing in a list of integers. [He enters the code as follows: array1 = mynp.array([3, 6, 9, 12]). Code ends.] And then if I type the name of the array again and hit Enter to get the array contents printed out on screen, [He types array1.] here you can see that the array has indeed been created. [The output is array([ 3, 6, 9, 12]).] Now we may also obtain the data type of the elements in this new array by using the dtype attribute. [He types array1.dtype. The output is dtype('int64').] And here we see that this array contains 64-bit integer elements. Okay, so that's it. In this video, we have looked at basic access and usage of the NumPy package in a Python development environment.
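For reference, here is a minimal sketch of the commands from this demo. The reported dtype of int64 assumes a platform where NumPy defaults to 64-bit integers.

import numpy as mynp

array1 = mynp.array([3, 6, 9, 12])   # create a NumPy array from a Python list
array1                               # array([ 3,  6,  9, 12])
array1.dtype                         # dtype('int64') on most 64-bit platforms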

Working with NumPy Arrays

[Topic Title: Working with NumPy Arrays. Your host for this session is Wesley Miller. An IPython console interface displays.] In this video, we're going to look at different ways of creating NumPy arrays. So we're going to be working here in an IPython console, and as usual, when working with NumPy, we first import the NumPy package into an object that we can reference in our current workspace. [The code is import numpy as np.] Now, the most standard way to create a NumPy array is to use the array method and pass in the elements that you want to be added to the array as a list or a list of lists. So we can create a basic one dimensional array and store it in this object named a [The code is a = np.array([3,6,9]).] and we'll print out a. [The presenter types an a. The output is array([3, 6, 9]).] And we can also create a two dimensional array [The code is b = np.array([[3,6,9],[2,4,6]]).] and we'll print out this array. [He types a b. The output is array([[3, 6, 9], [2, 4, 6]]).] So note that when creating the two dimensional array, I open and close the entire array with parentheses and square brackets, and then include each row of the array in its own set of square brackets separated by commas, thus indicating that we are actually creating an array of arrays.

But we can also create arrays that contain placeholder content like all zeros or all ones. To create a two dimensional array of zeros, we call the NumPy zeros function and pass in an integer tuple indicating the number of rows and columns we want this array to have. [The code is c = np.zeros((5,3)).] All right, so we'll print out this array. [He types a c. The output is an array of five rows, each with the values [0., 0., 0.].] And likewise, we may use the NumPy ones function to create an array of ones. [The code is d = np.ones((3,2)).] Okay, and we'll print this array. [He types a d. The output is an array of three rows, each with the values [ 1., 1.].] So there you can see that we have our array with three rows and two columns, as expected. Now we may also specify the data type of the elements that we want our placeholder arrays to have. If we take a look at the dtype attribute for each of the arrays that we've just created, arrays c and d, we see that these numbers are actually 64-bit floating point by default. [He types c.dtype.name and d.dtype.name. Both return an output of 'float64'.] Now if we wanted to specify a different data type, then we specify an additional parameter value for dtype when creating an array. [The code is f = np.ones((3,4), dtype=np.int64).] Okay, so let's print this [He types an f. The output is an array of three rows, each with the values [1, 1, 1, 1].] and then obtain the type. Okay, so we see that our new array f contains 64-bit integer elements. [He types f.dtype.name. The output is 'int64'.] And also note that I use a NumPy-specific type for the dtype attribute value.

But we can also use the NumPy empty function to create an array whose initial content is arbitrary (uninitialized memory), in this case values very close to zero. [The code is g = np.empty((3,3)).] We'll print the values, all right? [He types a g. The output produces an array of three rows. For example, the values in the first row are [ 0.00000000e+000, 0.00000000e+000, 2.29059370e-314].] Now we may also create arrays using NumPy's autoranging function referred to as arange. If for example you wanted to create a one dimensional array with numbers between zero and ten in increments of two, we would type and enter the following. [He enters the code as follows: h = np.arange(0,10,2).] And we'll print that out. [He types an h. The output is array([0, 2, 4, 6, 8]).] Okay, so as you can see the first parameter indicates the start of the range, [the 0] the second parameter is the end of the range, which is excluded, [the 10] and the third parameter in the arange call is the increment. [the 2] Now we may also specify a floating point increment with this function, but it's actually better to use the NumPy linspace function for this purpose. This function takes in the start of the range, the end of the range, and then the total number of elements that you want your array to have. [The code is k = np.linspace(0,5,13).] And we'll print out that array k. [He types a k. The output produces an array of values from 0 to 5, with fractional values such as 2.91666667 in between.] All right, so we may also create arrays containing complex numbers by specifying a dtype attribute value of complex in the NumPy array function call. [The code is cplxMatrix = np.array([[2,4],[3,6],[4,8]], dtype=complex).] And we'll print that matrix out. [He types cplxMatrix. The output is array([[ 2.+0.j, 4.+0.j],[ 3.+0.j, 6.+0.j],[ 4.+0.j, 8.+0.j]]).] Okay, so here we have our array containing complex numbers.

Now we may also use the arange function chained with the reshape function, so as to also specify array dimensions for an automatic sequence. [The code is m = np.arange(10).reshape(5,2).] And let's print out that array m. [He types an m. The output is array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]).] Okay, so with this, you see that we now have this array containing a range of integers from zero up to, but not including, ten, shaped into five rows and two columns. Note, however, that the product of the integer pair we specify in the reshape function must equal the total number of elements in the range, or else you will get an error and the array creation will fail. So that's it, in this video we've looked at different ways of creating NumPy arrays.
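For reference, here is a consolidated sketch of the array-creation calls from this demo.

import numpy as np

a = np.array([3,6,9])                        # 1-D array from a list
b = np.array([[3,6,9],[2,4,6]])              # 2-D array from a list of lists
c = np.zeros((5,3))                          # 5x3 array of zeros (float64 by default)
d = np.ones((3,2))                           # 3x2 array of ones
f = np.ones((3,4), dtype=np.int64)           # element type overridden to int64
g = np.empty((3,3))                          # uninitialized (arbitrary) contents
h = np.arange(0,10,2)                        # array([0, 2, 4, 6, 8])
k = np.linspace(0,5,13)                      # 13 evenly spaced values from 0 to 5
cplxMatrix = np.array([[2,4],[3,6],[4,8]], dtype=complex)
m = np.arange(10).reshape(5,2)               # product of (5, 2) must equal 10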

Introduction to Pandas

[Topic Title: Introduction to Pandas. Your host for this session is Wesley Miller.] In this video, we're going to look at basic and common usage of the Python pandas library for data science operations. Okay, so pandas is short for Panel Data System, and it is a Python data analysis library that provides data structures that are efficient, expressive, fast, and flexible. The three kinds of data structures that pandas provides are Series, DataFrames, and Panels. Pandas is actually built on top of NumPy, making it easier to handle data. As such, it provides a solid foundation on which to perform real-world data analysis in a Python environment. The pandas library allows for intelligent and automatic label-based data alignment that allows messy data to be more easily manipulated into orderly data sets. And pandas is most notably compatible with [ordered and unordered] time series data, matrix data, [a type of data which contains arbitrary row and column labels and may be homogeneously or heterogeneously typed] and tabular data, [which contains heterogeneously typed columns] among other types of [observational and statistical] data sets. The pandas library provides a means of carrying out the entire data analysis process flow in Python without having to switch to a language like R.
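As an illustrative sketch (not shown in the video), here is how the two most commonly used pandas structures can be created and used. Note that the Panel type has been removed from recent pandas releases, so only Series and DataFrame are shown, and the example data here is hypothetical.

import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])              # label-based access -> 20

# A DataFrame is a two-dimensional labeled table of columns
df = pd.DataFrame({'city': ['Oslo', 'Lima'],
                   'temp': [4.5, 22.1]})
print(df['temp'].mean())   # column-wise operations -> 13.3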

The pandas data structures provide support for fast joins and merges of data sets, and the library can be used to read from various sources, such as local files or relational database tables, and in various formats, such as CSV, Excel, and JSON. Pandas can also be used to write data out to various sources and formats, the same as those just mentioned. The pandas library provides support for various data operations, such as the handling of missing data; high-performance merging and joining of data sets; flexible reshaping and pivoting of data sets; data aggregation with the power of the GroupBy engine; the insertion and deletion of columns from data structures for size mutability; integrated indexing and fancy indexing; robust I/O tools for reading and writing data; and time series functionality that includes generating date ranges, frequency conversion, date shifting and lagging, moving window statistics and linear regressions, and the joining of time series without data loss. Now, there are several different ways in which you may install the pandas library, but the most highly recommended way is to download and install a Python data science platform like Anaconda. It not only contains most of the popular data science tools and packages, but also acts as a package manager, so that whenever any specific package needs to be updated, Anaconda will also take care of updating any dependencies as necessary.

This type of functionality significantly improves the efficiency of obtaining Python data science packages, and also significantly decreases the chance of you accidentally breaking anything in your Python development and analysis environment whenever you need to install or update these packages. [The Anaconda distribution installer is available at http://pandas.pydata.org/pandas-docs/stable/install.html.] Now, you can also install pandas with Miniconda in the event that you want more control over the specific packages that you want to install, or if you have a slow or intermittent Internet connection and don't want to download the entire Anaconda distribution. With Miniconda, you can create a minimal, self-contained Python installation, and then use the conda command to create a new conda environment, activate it, and then install pandas and any additional packages. You may also install pandas from the Python Package Index via the Python pip package management system. And pip just stands for preferred installer program.

And starting with Python 3.4, it's included by default with the Python binary installers. You may also use your Linux distribution's package manager to install pandas, or you may install from source. Now, you may also run a set of unit tests on your machine to verify that pandas is working and that you have all the necessary dependencies installed by using nose, which extends the Python unit testing framework. You can run the unit tests with nose either at a Python console prompt, or as a root user in a terminal shell directly inside your pandas git clone. [The code is $ nosetests pandas.] Pandas has several required dependencies that include setuptools, NumPy [version 1.7.1 or higher], pytz [for time zone support], and python-dateutil [version 1.5 or higher], as well as a couple of recommended dependencies, namely numexpr and bottleneck. Numexpr is used for accelerating certain numerical operations, while bottleneck is used for accelerating certain types of NaN (not a number) evaluations. There are also many optional dependencies you might want to have installed, such as SciPy, Cython, and matplotlib, to name a few.

All right, so that's it. In this video we have looked at basic and common usage of the Python pandas library for data science operations.

Data Manipulation with Pandas

[Topic Title: Data Manipulation with Pandas. Your host for this session is Wesley Miller. An IPython console interface displays.] In this video, we're going to perform some basic merge operations with Pandas data frames. Okay, so the Pandas library provides a method named merge that can be used to perform SQL-like joins between data frames. So, let's take a quick look at the synopsis of this method by using Python's help facility, right? [The presenter types help('pandas.merge').] So, here we see that this method can be called to merge data frames by performing joins on columns or indexes. And as you can see here in the synopsis, the only two mandatory parameters that we must pass to this method are the two data frames that we wish to merge, as denoted by left and right. The other parameters have default values, but we may of course change these in a method call depending on our intentions for specific merges. For this demo however, we're only going to focus on the left, right, how, and on parameters. [Other parameters include right_on, left_index, and sort.] The how parameter is used to indicate the kind of join you want to perform, while the on parameter is used to indicate the field name or names that you want to join on.

All right, so let's scroll down here [He scrolls to a new input field below the help information.] and we're going to create two new data frames as per an example shown in the Pandas documentation. [He first types import pandas as pd to activate pandas. Then he types the following code df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']}).] So here we have df1 and we also have df2. [The code is df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}).] All right, so we'll just print out both data frames for convenience of reference. [He types df1. The output produces a table with lines 0 to 3, column A with values A0 to A3, column B with values B0 to B3, and column key with values K0 to K3. He types df2. The output produces a table with lines 0 to 3, column C with values C0 to C3, column D with values D0 to D3, and column key with values K0 to K3.] And now we're going to perform a basic merge on the key column, keeping in mind that the column name that we indicate for the on parameter must exist in both data frames. [The code is pd.merge(df1, df2, on='key'). Code ends. The output displays a table with rows 0 to 3 and columns A, B, C, and D, along with a single key column with values from K0 to K3.] All right? So since we did not specify a value for the how parameter, the default join method was carried out, which is an inner join, and thus the join was carried out using the intersection of keys from both data frames. However, since both data frames have the same key sequence, as you can see (K0, K1, K2, K3 in both), there was no data loss, and there was also no introduction of any empty values, right? So, with this done, let's now create a third data frame in which we're going to modify the key sequence in data frame df2. And we're going to call this new data frame df3. [The code is df3 = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}).] All right, and one more time, we'll print out df1, then df3. [He types df1. The output produces a table with lines 0 to 3, column A with values A0 to A3, column B with values B0 to B3, and a key column with values K0 to K3. He types df3. The output produces a table with rows from 0 to 3, column C with values from C0 to C3, column D with values from D0 to D3, and a key column with values of K0, K1, K4, and K3.] And now we're going to perform a merge between df1 and df3 on the column labeled key, specifying df1 as the left data frame and df3 as the right data frame. And this time we'll also specify the how parameter, but we're going to use a value of left. [The code is pd.merge(df1, df3, on='key', how='left'). Code ends. The output generates a table with rows 0 to 3 and columns A, B, key, C, and D.]

All right, so this results in a data frame merge that used the keys from the left data frame only. Our resulting data frame has all of the data from our left data frame, from df1, but something strange happened in columns C and D, which came from the data frame on the right. The key sequence matches for the first two rows and the last row of each data frame. So in df1 we have K0, K1, and then K3 in the last row, and in df3 we have K0, K1, and then K3 in the last row. However, in the third row of df1, we have a key, K2, but there's no K2 here in df3. And similarly, in the third row of df3, we have key K4 that has no match in df1. Now since the merge operation took place with respect to the keys in df1, we end up with a merge that retains our values for the K2 key from df1 but throws away the values for the K4 key in df3. So notice that there's no C2 D2 in the merged data frame. [Instead values of NaN display in the K2 merged line for the C and D columns.] And since there was no K2 key in df3, the data values for columns C and D in the K2 row end up empty here in the merge, as indicated by NaN, or not a number.
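For reference, here is a consolidated sketch of the single-key merges shown above.

import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(df1, df2, on='key')               # inner join; key sequences match, so no data loss
pd.merge(df1, df3, on='key', how='left')   # keeps K2 from df1; its C and D values are NaN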

All right, so let's now look at a slightly more advanced case where there are multiple join keys. So first we're going to create two new data frames, and they're each going to have two key columns. Right, so we have dfleft [The code is dfleft = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1'], 'key2': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']}).] and we have dfright. [The code is dfright = pd.DataFrame({'key1': ['K0', 'K0', 'K2', 'K1'], 'key2': ['K0', 'K1', 'K1', 'K2'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}).] And I'll print out both dataframes for convenience of reference. [He types dfleft. The output displays a table with rows 0 to 3 and columns A, B, key1, and key2. The key1 column has the values K0, K0, K1, and K1. The key2 column has the values K0, K1, K2, and K3. He types dfright. The output displays a table with rows 0 to 3 and columns C, D, key1, and key2. The key1 column has the values K0, K0, K2, and K1. The key2 column has the values K0, K1, K1, and K2.] And then we'll perform a merge on both keys, key1 and key2. [The code is pd.merge(dfleft, dfright, on=['key1', 'key2']). Code ends. The output produces a table with rows 0 to 2, and columns A, B, key1, key2, C, and D.] So with this we now have a merged dataframe such that an inner join was performed using these two keys. Now as you can see, this kind of join between these two data frames resulted in the loss of some data, as our resulting data frame only has three rows of data while the original data frames each had four rows prior to the merge operation. So, what happened here is that the join only happens where there is a match of key1, key2 sequences between the two data frames. So, notice that here in dfleft, we have key1, key2 sequences K0 K0, K0 K1, K1 K2 and K1 K3. While in dfright, we have key1, key2 sequences K0 K0, K0 K1, K2 K1 and K1 K2. So, what's happening here in dfright is that we have K0 K0 matching the key1, key2 sequence in the first row of dfleft. And then here in the second row, we have K0 K1 matching the second row of the key1, key2 sequence in dfleft. However, in the third row we have the sequence K2, K1 for which there is no match in the dfleft data frame. And similarly, in the dfleft data frame, in the last row, we have the key1, key2 sequence K1 K3.

And there's no K1 K3 sequence here in dfright. So, what's going to happen is that for each of the rows for which we don't have a matching key1, key2 sequence, we're actually going to lose that data after this type of merge operation. As you can see, [He refers to the merged table.] we have records only for the key sequence matches. So for K0 K0, we have A0 B0 from dfleft and C0 D0 from dfright. For the K0 K1 sequence, we have A1 B1 from dfleft and we have C1 D1 from dfright. And then for K1 K2, which is in the third row of dfleft and the fourth row of dfright, notice that we have the data values from the third row of dfleft, so A2 B2 for K1 K2, and the values from the fourth row of dfright, C3 D3. All right, so that's it. In this video, we have performed some basic merge operations with Pandas data frames.
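For reference, here is a consolidated sketch of the two-key merge shown above.

import pandas as pd

dfleft = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1'],
                       'key2': ['K0', 'K1', 'K2', 'K3'],
                       'A': ['A0', 'A1', 'A2', 'A3'],
                       'B': ['B0', 'B1', 'B2', 'B3']})
dfright = pd.DataFrame({'key1': ['K0', 'K0', 'K2', 'K1'],
                        'key2': ['K0', 'K1', 'K1', 'K2'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']})

# Inner join on both keys: only (K0, K0), (K0, K1), and (K1, K2) appear in both
# frames, so the unmatched rows (K1, K3) and (K2, K1) are dropped.
pd.merge(dfleft, dfright, on=['key1', 'key2'])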

Data Visualization with Matplotlib, Ggplot2

[Topic Title: Data Visualization with Matplotlib. Your host for this session is Wesley Miller. An IPython console interface displays.] In this video, we're going to use the Python matplotlib library to create and display a heat map. So we're going to start off by importing NumPy as well as pyplot from matplotlib. [The presenter types import numpy as np, and then types import matplotlib.pyplot as plt.] And then, we're going to generate some test data by using the randn function to generate normally distributed random numbers, for both our x coordinates [The code is x = np.random.randn(12000).] and y coordinates. [The code is y = np.random.randn(12000).] And then we're going to use the NumPy histogram2d function to compute the two dimensional histogram of our x and y data samples. [The code is heatmap, xedges, yedges = np.histogram2d(x, y, bins=50).] All right, so in calling this function, we pass in our x and y data points and we also create 50 bins for each set of data points. We bin the data values so that we can plot this relatively huge number of data points on the heat map. Also, this function returns three ndarray objects. [In the code, the presenter refers to heatmap, xedges, and yedges.] The first array is the 2D histogram of data samples x and y, with x histogrammed along the first dimension and y histogrammed along the second dimension. The second array contains the bin edges along the first dimension, and the third array contains the bin edges along the second dimension.

Then we're going to create an extent array containing scalar values used to position our heat map when it gets displayed. So let's put this command in here. [The code is extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]].] So more specifically the extent specifies the location in data coordinates of the lower left and upper right corners of the plot. So we'll execute this. So here we're specifying that the left limit be the first element in the bin edges array for the x samples. [He refers to xedges[0].] And that the right limit be the last element in the bin edges array for the x samples. [He refers to xedges[-1].] And then we specify the bottom limit to be the first element in the bin edges array for the y samples. [He refers to yedges[0].] And then the top limit to be the last element in the bin edges array for the y samples. [He refers to yedges[-1].]

And then finally, we're going to display our heat map as an inline figure here in the IPython console by calling the imshow function and passing in the two dimensional histogram of the heat map data, as well as the extent parameter. And we're also going to add a white grid to our heat map, all right? [The code is plt.imshow(heatmap, extent=extent) plt.grid(b='on', color='#ffffff').] So here we get our heat map finally being displayed with our grid lines. [The heat map displays with values from -3 to 4 on the x-axis, and -3 to 3 on the y-axis.] And since the numbers in each dimension are normally distributed, we would expect that the concentration of numbers will be more toward the mean or center of the data, which is what we're seeing here in the plot, as indicated by the red toward the center and the blue toward the outer area. All right, so that's it. In this video, we have used the Python matplotlib library to create and display a heat map.
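For reference, here is a consolidated sketch of the heat map demo. Note that recent matplotlib versions expect plt.grid(visible=True, ...) rather than the b='on' argument shown in the video, and the exact colors depend on the default colormap.

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(12000)                  # normally distributed x samples
y = np.random.randn(12000)                  # normally distributed y samples

# 2-D histogram: bin counts plus the bin edges along each dimension
heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)

# extent = [left, right, bottom, top] in data coordinates
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.imshow(heatmap, extent=extent)          # draw the binned counts as an image
plt.grid(b='on', color='#ffffff')           # white grid lines over the image
plt.show()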

Introduction to Anaconda

[Topic Title: Introduction to Anaconda. Your host for this session is Wesley Miller. The Anaconda download page is open in a web browser.] In this video, we're going to install the Anaconda data science platform for Python 3.5 on a Mac OS X machine. All right, so we're going to start here on the Anaconda download page, and there you can see the URL in the address bar for the browser. [The URL is https://www.continuum.io/downloads#osx.] And we're going to scroll down on this page [The presenter scrolls down to the Anaconda 4.2.0 For OSX download link.] and we're going to click on the link so as to download the 64-Bit GRAPHICAL INSTALLER for Python 3.5. So once you click on that link, you can see that we're prompted to download this .pkg file, and this is for Mac OS 10. [A message displays asking whether the Anaconda3-4.2.0-MacOSX-x86_64.pkg file should be saved.] And we also have this dialog popping up in the background that gives us the option of entering an e-mail address, so as to obtain an e-mail from Team Anaconda containing a list of links to some useful Anaconda resources, including the Anaconda cheat sheet. [He refers to the Thank You For Downloading Anaconda message.] Now I've already downloaded this package file, so I'm not going to download it again. So I'll just close that dialog. [He clicks Cancel on the download message and X on the dialog box.]

I'm going to step over to Finder and show you the location of that package file. [A Finder window displays.] So it's here in my devel folder on the local machine. [He refers to the folder in the navigation pane.] And I've also created this folder, Anaconda_Python_3.5, into which we're going to install Anaconda. So I'm going to double-click on this downloaded package file. [The Installer opens to display the Install Anaconda3 wizard on the Introduction page.] And here you see now we're presented with the wizard, and here we're just going to follow the instructions to complete the install. So we're going to click Continue. [The Read Me page opens and contains important information about Anaconda.] And then we'll just do a quick read. Click Continue, [The Licence page opens and provides the software license agreement.] we'll click Continue again. [A message displays stating that in order to continue with the installation, the terms of the software licence must be agreed to.] And we're going to Agree to the license. [He clicks Agree and the Destination Select page opens.] And I'm going to Install on a specific disk. [The page asks How do you want to install this software? Install for me only is selected. He clicks the Install on a specific disk option.] We're going to select Macintosh HD. I'm going to choose the folder that we had created, [He clicks Choose Folder. A Finder window opens.] so I'm going to click into devel. And there's our folder, Anaconda_Python_3.5. We'll click Choose, [He returns to the Destination Select page.] and then we'll click Continue, [The Installation Type page opens with a summary of the type of installation that will be performed and how much disc space will be used.] and then we'll click Install. [An Installer message displays, requesting a Username and Password to allow the installation to proceed.] And then we, of course, have to enter our administrator password, and then click Install Software. And this process takes about...minutes. After this, the install will be complete, and we'll then be able to verify the files that were installed by this installation program.

Okay, so the installation process has now been completed. [The Summary page displays confirming the installation.] We'll now just click Close. And we'll just briefly go back to the download page. [He returns to the Anaconda download page in the browser.] Let me scroll down. You can see that we have some links where you can learn how to manage packages. You can also get access to some documentation for Anaconda, in addition to obtaining The Conda Cheat Sheet. We can also take a look at the Anaconda Cheat Sheet, [He clicks a tab in the browser to display the Cheat Sheet.] which is a .pdf file. And here you can also see some of the data science packages that also come with Anaconda. So we have NumPy, SciPy, MatPlotLib, Pandas, Seaborn, Bokeh, SciKit-Learn, NLTK, Notebook, and R essentials. So we go over to Finder and open the folder into which we installed Anaconda. There's the anaconda folder; we'll expand that. [He expands the anaconda folder.] And here are the files that were installed as part of the Anaconda platform.

And here we also have the Anaconda-Navigator application, which is a desktop application consisting of a graphical user interface that allows you to launch applications and manage packages, channels, and the environment without the need for a command line. So I'm going to open Anaconda-Navigator. [ANACONDA NAVIGATOR opens on the Home section. Other sections include Environments, Learning, and Community. Applications on root display in tiles in the main view.] And here in NAVIGATOR, you can see that, you know, we have a few applications that we could run. We could run Jupyter notebook, qtconsole, spyder IDE, and we can also install glueviz. Now spyder, which is the scientific Python Development Environment, is a free interactive development environment that is included with Anaconda that gives us the ability to edit, interactively test and debug Python applications, among other features. So we can Launch spyder. [He clicks the Launch button on the app tile.] And here, we're now presented with our Spyder main window. [Spyder3 opens. It features a toolbar with a number of options and a code pane, a Source pane, and a console pane.] All right, so that's it. In this video, we have installed the Anaconda data science platform for Python 3.5 on a Mac OS 10 machine.

Python and Scikit-learn

[Topic Title: Python and SciKit-learn. Your host for this session is Wesley Miller. An IPython console interface displays.] In this video, we're going to use the SciKit-learn library to perform data normalization. All right, so normalization is the process of scaling individual data samples so as to have unit norm. And it's a common operation for text classification or clustering. So let's take a brief look at normalizing a simple NumPy array. We're going to start by importing the preprocessing package from SciKit-learn. And we will also import NumPy. [The code is as follows: from sklearn import preprocessing import numpy as np.] And then, we're going to create a three by three NumPy array to serve as our dataset. [The code is A = np.array([[3.0, 1.0, 4.0], [1.0, 2.0, 3.0], [4.0, 2.0, 2.0]]).] And then we're going to use the Normalizer utility class to create a new normalization estimator. [The code is mynormalizer = preprocessing.Normalizer().fit(A).] Okay, so this creates a normalization estimator that can subsequently be used on new test data sets, similar to the training set. So let's now take a look at our new estimator. [The presenter runs mynormalizer. The output is Normalizer(copy=True, norm='l2').]

And we see that we have a new instance of the Normalizer class with copy=True, meaning that a copy of the original data is made by the transform operation rather than the operation being performed in place. And norm='l2' means that the l2 norm is used to normalize each non-zero sample. Let's now normalize our original data array by passing it through the transformer. [He types mynormalizer.transform(A). The output produces an array of three rows of three fractions each, with eight numbers after the decimal point.] And with this, we now obtain the normalized form of the original data. So let's now create a test dataset named A_test. [The code is A_test = np.array([[2.0, 1.0, 4.0], [3.0, 4.0, 2.0], [2.0, 3.0, 1.0]]).] And then we're going to pass this test set through our estimator. [He types mynormalizer.transform(A_test).] And with this, we now have the normalized form of our test dataset. [The output produces an array of three rows of three fractions each, with eight numbers after the decimal point.] The Normalizer class actually accepts both dense array-like matrices and sparse matrices from the scipy.sparse package as input. For sparse input, however, the data gets converted to Compressed Sparse Rows (CSR) representation before being fed to Cython routines downstream. So that's it. In this video, we have used the SciKit-learn library to perform data normalization.
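For reference, here is a consolidated sketch of the normalization demo.

from sklearn import preprocessing
import numpy as np

A = np.array([[3.0, 1.0, 4.0],
              [1.0, 2.0, 3.0],
              [4.0, 2.0, 2.0]])

# Fit a Normalizer (l2 norm by default); Normalizer is stateless, so fit()
# only validates the data, and each sample is scaled independently.
mynormalizer = preprocessing.Normalizer().fit(A)

mynormalizer.transform(A)        # each row rescaled to unit l2 norm

A_test = np.array([[2.0, 1.0, 4.0],
                   [3.0, 4.0, 2.0],
                   [2.0, 3.0, 1.0]])
mynormalizer.transform(A_test)   # same per-row scaling applied to new data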

Image Processing in Python

[Topic Title: Supervised Learning with SciKit-learn. Your host for this session is Wesley Miller. An IPython console interface displays.] In this video, we're going to perform supervised learning by using the SciKit-learn library to perform optical recognition of handwritten digits. Okay, so the overall objective here is to import the digits dataset from SciKit-learn and use support vector classification to estimate the digit being represented in a test image. So let's start by importing pyplot from matplotlib, as well as the SciKit-learn data sets. And after this we'll also load the digits dataset. [The code is import matplotlib.pyplot as plt, from sklearn import datasets, digits = datasets.load_digits().] All right, and I will briefly display the description of the digits dataset. [The presenter types digits. An extensive range of output data displays.] Let's scroll back up a bit here. All right, so this dataset contains eight by eight matrix representations of bitmap images of handwritten digits, along with ten target classes, each target class representing one of the ten digits from zero to nine. In the case of supervised learning, the learning targets are stored in the target member of the dataset. So here in the description we see a preview printout of the digits data array, which contains the features that are used to classify the digit image samples. And we also see a preview printout of the digit images array. And we also have the target array, which contains the actual target digit of the corresponding digit image in the data. And then we have the target_names array that contains the range of digits from the target array. And as you can see here, as I just mentioned, there are ten different target names, or classes, each representing one of the ten digits from zero to nine.

Now we can also get the shape of the digits data array, [He types digits.data.shape. The output is (1797, 64).] and we see that there are 1,797 records with 64 columns. And more specifically, we can get the shape of the digits images array. [He types digits.images.shape. The output is (1797, 8, 8).] We see that it contains 1,797 records of 8 by 8 matrices, each 8 by 8 matrix representing the pixels of a bitmap image of a given digit. Now to make this a little clearer, let's get the shape of the image data array associated with the second-to-last image in the images array. [He types digits.images[-2].shape. The output is (8, 8).] Right, so here we see that an individual image is an 8 by 8 matrix of pixel data. If we want to look at the actual pixel integer values, we can display the entire contents of our image's 8 by 8 matrix by converting it to a list here in the IPython console, right? [He types digits.images[-2].reshape(8,8).tolist(). The output produces eight lines with eight values in each line. For example, the first line is [0.0, 0.0, 2.0, 10.0, 7.0, 0.0, 0.0, 0.0].] So here we can see the individual pixel data values, each number being an integer value in the range of 0 to 16. Well, at this point, things might still seem a bit obscure to you, so let's now bring this into even clearer focus by plotting the actual image. So we're going to graphically display our second-to-last image from the images array as a grayscale plot using the pyplot imshow function. [The code is myimg = digits.images[-2], plt.imshow(myimg, cmap=plt.cm.gray_r, interpolation='nearest'). The output is <matplotlib.image.AxesImage at 0x11551f828> and an image of the character 9 in an 8x8 grid.] All right, so we can see that this image represents the number 9, although with some distortion. All right, so now that we've seen the image that we're going to be testing our model with, let's now set about creating an estimator. Now, in order to be able to apply an estimator, our images need to be flattened first. And it just so happens that our image data has already been flattened into an array of samples by features, as you saw when we obtained the shape of the digits data array. And now I'm just going to store the length of our digits images array, which is the same as that of the digits data array, into a variable named numSamples. [The code is numSamples = len(digits.images).] And for convenience, I'm going to store the digits.data array into a variable just named data. [The code is data = digits.data.]
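
Gathering the exploration steps so far into one runnable sketch (the shapes should match those shown above):

import matplotlib.pyplot as plt
from sklearn import datasets

# Load the handwritten digits dataset.
digits = datasets.load_digits()

# 1,797 flattened samples of 64 features, and the same samples as 8x8 matrices.
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)

# Plot the second-to-last image as a grayscale bitmap.
myimg = digits.images[-2]
plt.imshow(myimg, cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()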

Then I'm going to import the support vector machine module from SciKit-learn. [The code is as follows: from sklearn import svm.] And with this, I can now use the support vector classification constructor to create an estimator that can be used to predict the digit being displayed in a given image. [The code is as follows: myEstimator = svm.SVC(gamma=0.001, C=100).] All right, so with this, we're now going to provide the first half of all digit data as the training set for the estimator. [The code is as follows: myEstimator.fit(data[:int(numSamples/2)], digits.target[:int(numSamples/2)]).] And as you can see here, [He refers to the code int(numSamples/2).] I had to be sure to cast the index values to int, so as not to risk raising a ValueError in future versions of Python. In the current version, though, you would just get a warning about future deprecation of using non-integer indices. All right, so now for the moment of truth, let's execute this. [The output is SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False).] We're going to use our now trained estimator to predict the digit of the second-to-last image in the digits dataset, for which we had plotted the image earlier, this image of the number 9. [He refers to the image that displays in the grid, reflecting the number 9.] So we make a call to the predict method on our estimator, and we pass in the bitmap matrix representing the image data. [The code is myEstimator.predict(data[-2:-1]).] And here in the output we see that our estimator has correctly predicted that our input bitmap image was indeed the digit 9, all right? [The output is array([9]).] So that's it. In this video, we have performed supervised learning by using the SciKit-learn library to perform optical recognition of handwritten digits.
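
And here is a minimal end-to-end sketch of the training and prediction steps from this video; it should reproduce the prediction of the digit 9:

from sklearn import datasets, svm

digits = datasets.load_digits()
numSamples = len(digits.images)
data = digits.data

# Support vector classifier with the parameters used in the demo.
myEstimator = svm.SVC(gamma=0.001, C=100)

# Train on the first half of the samples and their target digits.
myEstimator.fit(data[:int(numSamples / 2)], digits.target[:int(numSamples / 2)])

# Predict the digit for the second-to-last sample; expected output: array([9])
print(myEstimator.predict(data[-2:-1]))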

Geospatial Analysis Using ArcGIS

[Topic Title: Geospatial Analysis using ArcGIS. Your host for this session is Wesley Miller.] In this video, we're going to install and use the ArcGIS Python API in a Python app. So we're going to start here on the ArcGIS developers website, on the ArcGIS API for Python page. [The page is open on the Home tab. Other tabs include Sample Notebooks and Forum.] You can see the full URL there in the address bar. [The URL is https://developers.arcgis.com/python/.] And then, we're going to click on the Install the API button. [The Guide tabbed page opens.] And this opens a page with some documentation, providing you with information on several different ways in which you may start using the ArcGIS API for Python development. So you may either try the API live in a sandbox, or you may install the API locally on your machine using the conda package manager. So using conda, you can either install the API as a package for use within the Python Anaconda platform, or you may install it using the Python package manager in ArcGIS Pro. And lastly, you may also install the ArcGIS API for Python as a Docker image. [These installation options are available on the Guide tabbed page.] In this video, we're going to use conda to install the API as a package for use with Anaconda. So here on this Mac OS X Sierra machine, we're now going to switch over to Terminal and set about installing the API with conda. [The presenter switches to the Terminal interface. The devel folder is open and the prompt is lombardi:devel wesley$.] So, the command is simple. It's just conda install, use the -c switch, and then esri arcgis. [The full command is conda install -c esri arcgis.] Okay, so with this, you can see that I have already used this command to download and install ArcGIS. [A message displays stating that all the requested packages have already been installed along with the location of the installed components.] So with this, we will now switch over to the ANACONDA NAVIGATOR and launch the Jupyter Notebook application. [He clicks the launch button on the Jupyter app tile in the main view. Jupyter opens on the Files tab. Other tabs include Running and Clusters.] All right, so with Jupyter Notebook running here in our browser, we're going to navigate to the location of our notebook files, which is the devel/_pythonNotebooks directory. And we're going to create a new Python notebook by clicking on the drop-down for New and then selecting Python [conda root]. [An untitled notebook opens.] And then, here in our notebook page, we're going to rename our new notebook as arcgis, [He clicks the File menu. Options on the menu include New Notebook, Open, and Make a Copy. He clicks Rename. The Rename Notebook window opens.] or you can call it anything you like. [He types arcgis in the Enter a new notebook name text field.] Then click OK.

And then, here in our first cell, we're going to enter a few lines of code to test our ArcGIS installation. So first, we're going to import the GIS library from our ArcGIS package, [The code line is from arcgis.gis import GIS.] and then we're going to connect to ArcGIS Online as an anonymous user by creating a new GIS instance. [The code line is my_gis = GIS().] And then, we're going to call the map function on our GIS instance to call up a world map. [The code line is: my_gis.map().] And then, we're going to run the cell by clicking on the Play button in the toolbar. [He clicks the run cell, select below icon.] Okay, and with this we now see a world map graphic being displayed, with the zoom controls to the top-left of the graphic. So we can take a look, scrolling over and zooming in on the east coast of Canada. Okay, now in order to get quickly up and running with some really meaningful geospatial analysis, it is recommended that you download the ArcGIS API for Python sample notebooks archive file. And to get to the web page for that, here from this current page, [He switches back to the ArcGIS API for Python page in the browser.] we can simply click the Sample Notebooks tab. And then here on this page, you may click on the blue button to download the archive as a zip file.

And then, you can unzip it to a local directory on your machine. [He refers to the Download as an archive button in the Download and run the sample notebooks section. There is also a Clone the GitHub repository button.] The directory contains some sample notebooks and accompanying data files that you may copy to your Jupyter Notebook directory and then access with Jupyter running, so that you can get a more hands-on guide for performing content search, map rendering, registering big data file shares, and carrying out various forms of geospatial analysis, among other tasks. So if we go back to Jupyter Notebook running here, I can open this, which is one of the notebooks I had moved over from the extracted contents of that archive file. [He refers to the Analyze New York City taxi data.ipynb notebook in the _pythonNotebooks folder.] We can click on it. And here you see we have some textual information telling us how to make use of this sample notebook. And we have some sample commands in our cells, as well as some sample output. And like I said, it's just good for getting you quickly up and running with doing some meaningful geospatial analysis using the ArcGIS API for Python. All right, so that's it. In this video, we have installed and used the ArcGIS Python API in a Python app.
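
To recap the connectivity test from this video, the notebook cell amounts to the following sketch, assuming the arcgis package has been installed with conda install -c esri arcgis:

# Run this in a Jupyter notebook cell.
from arcgis.gis import GIS

# Connect to ArcGIS Online as an anonymous user.
my_gis = GIS()

# Render an interactive world map widget in the notebook.
my_gis.map()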

Text Analytics with NLTK

[Topic Title: Text Analysis with NLTK. Your host for this session is Wesley Miller. The NLTK 3.0 documentation page displays in the browser.] In this video we're going to use the Natural Language Toolkit with Python to perform word and sentence tokenization. Okay, so we're here on the Natural Language Toolkit, or NLTK, homepage. [The URL is www.nltk.org.] And the NLTK is a free open source platform that can be used to develop Python applications capable of working with human language data. It provides an extensive collection of text processing interfaces and libraries for carrying out classification, tokenization, and parsing, among other text processing functionality. You may use this homepage to get more detailed information on NLTK documentation, as well as programming fundamentals. Now we're going to switch over to the Command Prompt where, having already installed Python on our machine, we're going to use pip to install NLTK. [The presenter opens a Command Prompt window.]

All right, so we're going to use pip. And we just say pip install nltk and hit Enter. And as I have already installed NLTK on this machine, you see that we just get a couple of messages informing us that this requirement has already been satisfied. [The messages note the locations of the already installed site-packages.] All right, so now we're going to open IDLE, Python's integrated development environment console, where we're going to do our Python work. So go to the Start screen, type idle, and click the result. [He types in the Search bar. He selects IDLE (Python GUI) from the suggestion list. The Python 2.7.13 Shell opens in a new window.] All right, so here in the IDLE console we're going to import the NLTK package. [At the prompt, he types import nltk.] And then we're going to use the NLTK download graphical user interface to install NLTK components. [He types nltk.download().] You don't have to do this, this is just optional. Let's see if our window came up here. Sometimes it pops up behind windows that you have open. [He minimizes the browser and Command Prompt windows.] There it is. All right, so here's the NLTK downloader graphical user interface. [The NLTK Downloader displays in a new window. The Collections tab is selected.] Here you can optionally download Collections, Corpora, and Models. Or you can select All Packages, [He clicks through the other tabs on the interface.] and you can select what you want and click Download. [He refers to the Download button. A Refresh button also displays.] Now, you see this word here, Corpora? [He refers to the Corpora tab.] Corpora is the plural of the word corpus, which is essentially just a large collection of structured text.

And we can use these corpora as test datasets for various NLTK functions, but at this time we're not going to do that. We're actually going to create our own test body of text, and we're going to perform word and sentence tokenization. A token is just the technical name for a sequence of characters that we want to treat as a group. So we're going to go back to the IDLE console here. Actually, let's close the downloader graphical user interface first. Then we're going to import the NLTK sentence and word tokenizers. [At the prompt in the Python 2.7.13 Shell, he types from nltk.tokenize import sent_tokenize, word_tokenize.] And having done this, we're going to then create a body of text containing several sentences, and we're going to store it in a local variable named testText. [He pastes in testText = "This is some text to test. Testing sentence tokenization. Testing word tokenization. A corpus is a large collection of structured texts. It is very interesting to see the different ways in which we can analyze bodies of texts for classification, tokenization, and parsing, among other things."] All right, we'll hit Enter. And now we can perform sentence tokenization on our test body of text by passing our variable as a parameter to a call to the sent_tokenize method, which we can in turn pass as a parameter to print so that we can view our method call results on screen. [The code is print(sent_tokenize(testText)).] All right, so with that you can see that we get a Python list being returned, with each list element being a sentence from our test body of text. [The output is ['This is some text to test.', 'Testing sentence tokenization.', 'Testing word tokenization.', 'A corpus is a large collection of structured texts.', 'It is very interesting to see the different ways in which we can analyze bodies of texts for classification, tokenization, and parsing, among other things.'].]

Now to perform word tokenization, we just repeat what we just did, but this time we use the word_tokenize method in place of the sent_tokenize method. [The code is print(word_tokenize(testText)).] All right, and as expected, this now returns a list of each individual word in our test body of text. [For example, the output for the first sentence is 'This', 'is', 'some', 'text', 'to', 'test', '.',.] Now, we can also obtain a count of the number of times a specific word appears in the test text body. [The code is word_tokenize(testText).count('tokenize'). Code ends. The output is 2.] All right, so with this, we can see that the word tokenization occurs two times in our body of text. Now, you may also notice that we have other common words included, such as of, is, and, to, and in. You can actually import the stopwords package and use it to loop through the words in our test body and remove such common words, so that you may obtain a more meaningful count of the occurrence of certain words in the test body. We could also use NLTK to create a word cloud, which is just a collage of words in which the size of each word indicates the frequency with which that word occurs in a given body of text. All right, so that's it. In this video we have used the Natural Language Toolkit with Python to perform word and sentence tokenization.
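
The tokenization steps from this video, plus the optional stop-word filtering mentioned above, can be sketched as follows; the sample text here is shortened from the one used in the demo, and the download calls are only needed the first time:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer models and the stop word list.
nltk.download('punkt')
nltk.download('stopwords')

testText = ("This is some text to test. Testing sentence tokenization. "
            "Testing word tokenization.")

print(sent_tokenize(testText))  # list of sentences
print(word_tokenize(testText))  # list of individual word tokens

# Count how many times a specific token appears.
print(word_tokenize(testText).count('tokenization'))

# Filter out common stop words for a more meaningful word count.
stops = set(stopwords.words('english'))
meaningful = [w for w in word_tokenize(testText) if w.lower() not in stops]
print(meaningful)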

Social Network Analysis using Networkx

[Topic Title: Social Network Analysis using Networkx. Your host for this session is Wesley Miller. Jupyter is open on the Files tab. The devel/_pythonNotebooks folder is open.] In this video, we're going to analyze an ego network using Python and Networkx. All right, so we're going to start by running this notebook named networkx-lab. [The presenter selects the notebook from the file list.] And we're just going to clear outputs for now, [He opens the Cell menu and selects All Output - Clear.] and let's talk about Networkx. So Networkx is a Python package used for creating, manipulating, and analyzing the structure, dynamics, and functions of complex networks. So let's cover a little vocabulary before we move on. A network is essentially a collection of nodes that are interconnected in a specific way. The link or connection between each pair of nodes is referred to as an edge. And the degree of a given node is the number of connections that node has to other nodes in the network. All right, so with this, let's now take a look at some code in action here in our Jupyter notebook. So in the first cell, here we're importing Networkx and matplotlib, [He refers to the code lines import networkx as nx and import matplotlib.pyplot as plt. The complete code is provided for reference in the First Cell Code section at the end of the transcript.] and then we create a Networkx graph object. [He refers to the code g = nx.Graph().] And then we add two edges to the graph by using the add_edge method twice. [He refers to the code lines g.add_edge(2,5) and g.add_edge(4,1).] And as you can see, we're connecting nodes 2 and 5, [g.add_edge(2,5)] as well as nodes 4 and 1. [g.add_edge(4,1)] And we could explicitly create these nodes, but we don't have to, since the add_edge method will automatically create them for us. And then we print info about the graph by calling the info method from Networkx and passing in our graph object. [He refers to the code print(nx.info(g)).]

So let's run this cell. [He clicks the run icon on the toolbar. The output is Name:, Type: Graph, Number of nodes: 4, Number of edges: 2, Average degree: 1.0000.] And there, we didn't give it a name, but the Type: is Graph. Number of nodes: 4, as you can see there. [He refers to the output.] Number of edges: 2. And the Average degree: is 1. And this makes sense, as each node is only connected to one other node. All right, so let's now take a look at the code in the second cell here. [The complete code is provided for reference in the Second Cell Code section at the end of the transcript.] And in this cell, we're plotting and analyzing an ego graph. Before we do this, however, let's visit the concept of degree distribution. So we've established that the degree of a node is the number of connections that node has to other nodes in the network. Well, the degree distribution is the probability distribution of degrees for all nodes over the entire network. A hub refers to a particularly highly connected node or, alternatively, a node with an unusually high degree. In an ego network, or egocentric network, the largest hub is identified as being the ego of the network. Now, the following code demonstrates an example of using the Networkx ego_graph function to return the main egonet of the largest hub in what is referred to as a Barabasi-Albert network. [He refers to the code in the second cell.]

A Barabasi-Albert, or BA, model is an algorithm that can be used to generate scale-free networks. A scale-free network is one whose degree distribution follows a power law. Put more plainly, it basically suggests that nodes with a higher degree have a higher probability of gaining connections from nodes that are newly introduced into the network. So think of a typical social network where you have someone with 1,000 friend connections and someone with only 10. The BA model proposes that the person with 1,000 connections is more likely to gain new connections than the person with only 10. For this reason, the BA model lends itself well as a graph theory tool for analyzing social networks. All right, so in the first three lines here, we're just doing some imports. [He refers to from operator import itemgetter, import networkx as nx, and import matplotlib.pyplot as plt.] We're importing itemgetter from operator, and we're importing Networkx and matplotlib. And then we're going to create our BA model graph. So n=1000, that's the number of nodes in the graph. And m=2, that's the number of edges to attach from a new node to existing nodes. And then we say G=nx.generators, and this is just creating the barabasi_albert_graph using n and m as arguments. [He refers to the code n=1000, m=2, and G=nx.generators.barabasi_albert_graph(n,m).]
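
To make the power-law point concrete: in a scale-free network, the fraction P(k) of nodes that have degree k falls off roughly as P(k) ∝ k^(-γ), and for networks generated by the BA model the exponent γ is approximately 3.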

And then we're going to retrieve a dictionary of key-value pairs, with each key representing a node and its corresponding value representing that node's degree. [He refers to the code node_and_degree=G.degree().] So once we get this dictionary, we're going to sort it and retrieve the hub with the highest degree. We're going to call that largest_hub. [He refers to the code (largest_hub,degree)=sorted(node_and_degree.items(),key=itemgetter(1))[-1].] And this is also going to be our main hub. And then we're going to create the ego graph of the main hub using the ego_graph method from Networkx. [He refers to the code hub_ego=nx.ego_graph(G, largest_hub).] And this will create the ego graph with neighboring nodes centered at the main hub. And then we're going to use the spring_layout method from Networkx to return a dictionary of positions that are keyed by node. [He refers to the code pos=nx.spring_layout(hub_ego).] And then we use nx.draw to draw the neighboring nodes, without labels, as small red nodes. [He refers to the code line nx.draw(hub_ego,pos,node_color='r',node_size=50,with_labels=False).] And then after that, we use nx.draw_networkx_nodes to draw the main hub, which is our ego node, and it's going to be large and green. [He refers to the code line nx.draw_networkx_nodes(hub_ego,pos,nodelist=[largest_hub], node_size=300,node_color='g').] And then we're going to display the plot figure on screen. [He refers to the code plt.show().] And last, it will print information about the ego graph, specifically the connections around the ego. [He refers to the code print(nx.info(hub_ego)). Code ends.] So let's run this. [He clicks the run icon on the toolbar.] And here you can see we have our ego graph. There is our main hub, large and green. [He refers to the center of the graph.] And here are the neighboring nodes around the main hub. [All the rest of the nodes radiate out from the central node. Some nodes are connected to other nodes.] And here we have the information on the graph. [The name of the graph is barabasi_albert_graph(1000,2).] So Type: of course is Graph. Number of nodes: 81. Number of edges: 99. And Average degree: 2.4444. All right, so that's it. In this video, we have analyzed an ego network using Python and Networkx.

First Cell Code

import networkx as nx
import matplotlib.pyplot as plt
g = nx.Graph()
g.add_edge(2,5)
g.add_edge(4,1)
print(nx.info(g))

Second Cell Code

from operator import itemgetter
import networkx as nx
import matplotlib.pyplot as plt
# Create a BA model graph
n=1000
m=2
G=nx.generators.barabasi_albert_graph(n,m)
# Locate node with the largest degree
node_and_degree=G.degree()
(largest_hub,degree)=sorted(node_and_degree.items(),key=itemgetter(1))[-1]
# Create ego graph of the main hub
hub_ego=nx.ego_graph(G, largest_hub)
# Draw the ego graph
pos=nx.spring_layout(hub_ego)
nx.draw(hub_ego,pos,node_color='r',node_size=50,with_labels=False)
# Draw ego node as large and green
nx.draw_networkx_nodes(hub_ego,pos,nodelist=[largest_hub], node_size=300,node_color='g')
plt.show()

print(nx.info(hub_ego))

The BeautifulSoup Parser for Python

[Topic Title: The BeautifulSoup Parser for Python. Your host for this session is Wesley Miller. The GitHub homepage is open in a browser.] In this video, we're going to perform web scraping using BeautifulSoup 4 in Python 2.7. Okay, so web scraping is essentially the practice of pulling data from HTML site markup. To do this, you need to first obtain an endpoint, that is, the URL of the site that you want to scrape. And then, you'll need a good parser in order to parse the Document Object Model tree as well as some means of searching and navigating the tree. This is where BeautifulSoup comes in handy as it is a Python library that is in fact used for the very purpose of pulling data from HTML and XML files. BeautifulSoup actually supports the HTML parser that's included in Python's standard library. And it also supports a number of third-party Python parsers. All right, so let's start by using pip to install BeautifulSoup 4. We'll open Command Prompt. [The presenter opens the Command Prompt window from the taskbar.] Now, as of the time of this recording, BeautifulSoup 3 is no longer supported, so you'll want to ensure that you install BeautifulSoup 4. [He types pip install beautifulsoup4.] All right, so as you can see, I've already installed it, and so I get this message telling me that Requirement has already been satisfied. [The message also notes where the site-packages are located.] Now, we'll also need to install the requests Python HTTP library so that we can retrieve web content in human-readable format. So again, I'm going to use pip. [He types pip install requests.] And I've already installed this package as well. [A message displays stating that the requirement is already met and noting the location of the site-packages.]

So now we're going to switch over to the Python IDLE, Python's integrated development environment. [He opens the Python 2.7.13 Shell from the taskbar.] And we're going to see about parsing HTML from a website. All right, so first we're going to import the BeautifulSoup class from the BeautifulSoup 4, or bs4, package. [At the prompt, he types from bs4 import BeautifulSoup.] And then we're going to import the requests library. [He types import requests.] And then we will use the requests library to fetch web content and store it as a local file. So to do this, we will call requests.get and pass in the endpoint URL of the site that we want to scrape. And in this case, we're going to scrape the GitHub homepage. Let's take a quick look at that page. So here we are, that's the URL of the homepage. [He switches back to the browser. The URL is https://github.com.] And there we can scroll and take a look at the content. And we can also right-click and select View Page Source. [The source code for the page opens in a new browser tab.] And here is the actual markup for the page. You can go through this on your own time. And now let's go back to Python IDLE. And we're going to use requests.get and store the result in this variable r. [The code is r = requests.get('https://github.com').]

And we can print our response object's URL to confirm, right? [He types print(r.url). The output is https://github.com/.] Now, whenever we make a request, the requests library makes a guess as to the encoding of the response based on the HTTP headers. So if we type r.content, all right, so there's our content. [The content of the page displays as code. It includes items such as the text, various tags, and CSS information.] We're going to scroll back to the top of this output, okay? And here is the charset, or character set. See, we have UTF-8. [He refers to the code <meta charset="utf-8">.] So with this, we can then check and, if necessary, set our response's encoding property to match that value, to ensure that our response's text property gets used with the correct encoding. So let's go to the end. [He scrolls to the end of the code output.] Okay, and I'm going to type r.encoding. All right, and with this we see that our response's encoding is also UTF-8, so we're set. [The output is 'utf-8'.] Now we need to save our HTTP stream to a local file. So first, let's take a look in the directory into which we're going to save our file. [He opens a File Explorer window from the taskbar.] All right, so there's the path to the directory. [He refers to the navigation bar.] My home directory \Desktop\pg_pyth\assets\111930. And we can see that the directory is currently empty, so let's now switch back to the Python IDLE console. And we'll write our HTTP stream to a new file. All right, so we're going to call it testfile. And for the mode, we're going to specify wb. [The code is with open('C:/Users/mille_000/Desktop/pg_pyth/assets/111930/testfile', 'wb') as fd:.] So this means we're opening the file for writing only, in binary mode. Actually, let's maximize this window. [He maximizes the Python 2.7.13 Shell window.]

All right, so we are writing our captured HTTP stream out to our file in evenly sized chunks of 128 bytes. [The code is for chunk in r.iter.content(chunk_size=128):.] And then having done this, we will be able to open our file and verify that the HTTP stream had indeed been successfully written to it. [The code is fd.write(chunk).] So let's execute. This is actually an underscore, okay? [He corrects the second line of code to for chunk in r.iter_content(chunk_size=128):.] And let's now go back to our directory. There's testfile, now created. [He returns to the File Explorer window where the testfile file is listed.] And we'll just open it with Notepad++. [He right-clicks the file and selects Edit with Notepad++ from the menu.] And there you can see that our file contains all of the HTML from our page. [He scrolls through the HTML in the file.] So our write was successful. All right, so let's now go back to Python IDLE. [He opens the interface from the taskbar.] And we're going to create a handle to our new file in read mode, and we're going to call it tfile. [The code is tfile = open('C:/Users/mille_000/Desktop/pg_pyth/assets/111930/testfile', 'r').] All right, and then we're going to pass our new file handle to the BeautifulSoup class constructor, specifying the Python standard HTML parser as our parser. [The code is soup = BeautifulSoup(tfile, 'html.parser').] All right, and now we can extract all of the text from the page, free of any HTML tags, into a variable named ghText by calling the get_text method. [The code is ghText = soup.get_text().] We can then print this. [The code is print(ghText).] And if we scroll back up, you can see that we now have the text from our markup, free of any HTML tags. [Only the text from the web site displays, each sentence or piece of text on a separate line.] So, now that we have our soup, we can also navigate it using HTML tag names like head and title. [He types soup.head.] So as you can see, this returns the entire head section of our HTML markup. [The page content in the <head> tags is returned.] And then typing and entering soup.title returns the page's title block. [Only the text within the <title> tags is returned.] Now, we can also retrieve a list of all of the links that exist on the page by calling the find_all method on our soup object and passing in an a, for an anchor tag, as the parameter. [He types soup.find_all('a').]
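
Putting the fetch-and-save steps together, a minimal sketch looks like this; the file is written to the current working directory here rather than the full path used in the demo:

import requests

# Fetch the page we want to scrape.
r = requests.get('https://github.com')
print(r.url)       # https://github.com/
print(r.encoding)  # e.g. 'utf-8'

# Write the HTTP stream to a local file in 128-byte chunks, in binary mode.
with open('testfile', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)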

Okay, so this isn't quite reader friendly. [He refers to the large amount of output returned within the <a> tag.] But this actually is still a Python list of anchor elements, on which we may use indices to select individual links or a range of links. So you can see we have our first anchor tag here, [He refers to the first set of information in <a> tags in the output.] then we have a comma, then we have a new anchor tag, and so on. And there the list closes. [He refers to the closing tag at the end of the output.] Now, we can also use the find_all method with regular expressions to conduct more detailed searches. So first, let's import the regular expression module. [He types import re.] And then we'll look for all hyperlinks in our HTML document that contain the word documentation. [The code is soup.find_all(href=re.compile("documentation")).] All right, and we see that the result comes back as a list containing the sole anchor element that contains the word documentation. There's a lot more that you can do with BeautifulSoup to scrape web data, but the content we've covered here today will at least provide you with a good start. So that's it. In this video, we have performed web scraping using BeautifulSoup 4 in Python 2.7.
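
And the parsing and searching steps can be sketched as follows, again assuming the page markup was saved to testfile in the current working directory:

import re
from bs4 import BeautifulSoup

# Parse the saved markup with Python's built-in HTML parser.
with open('testfile', 'r') as tfile:
    soup = BeautifulSoup(tfile, 'html.parser')

print(soup.title)             # the page's <title> block
print(soup.get_text()[:200])  # page text free of HTML tags (first 200 characters)

# All anchor elements on the page.
links = soup.find_all('a')
print(len(links))

# Anchors whose href attribute contains the word "documentation".
print(soup.find_all(href=re.compile("documentation")))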

Working with PySpark for Big Data

[Topic Title: Working with PySpark for Big Data. Your host for this session is Wesley Miller. Terminal is open.] In this video, we're going to set up and deploy a PySpark application to a standalone cluster. So we're here on a Mac OS X Sierra machine on which I already have Apache Spark, Maven, and the JDK installed and added to my system path environment variable. And with all of this already done, all we need to do to run PySpark as a Spark standalone job is to use a script that imports PySpark, initializes a SparkContext, and performs some Spark operations on a Spark cluster in standalone mode. So here in Terminal, we're first going to change to the location of the Spark installation directory on this machine. [The presenter types cd ~/Spark/spark-1.2.0. The prompt changes from lombardi:devel wesley$ to lombardi:spark-1.2.0 wesley$.] And then we're going to start a standalone master server by using the start-master shell script in the sbin subdirectory. [He types ./sbin/start-master.sh. The master server is deployed.] And then to confirm that our master server is up and running, I'm going to head over to the browser [He opens Firefox.] and check out the default URL, which is at localhost on port 8080. [He refers to the URL localhost:8080/. He runs the URL. The Spark Master page opens.] And here you see that the master URL is at the top of the page, [He refers to the url: spark://lombardi.local:7077.] as well as the number of workers currently running on the server, the number of cores and the amount of memory being used, the number of applications and drivers that are running, [In all cases, the value is zero.] and the status of the master server, which we confirm here as being ALIVE, or active. Now, we can also view more detailed information on the workers and applications that are running on the server. But we'll take a look at that shortly. All right, so let's now go back to Terminal.

And we're going to start a worker and connect it to the master. So let's open a new tab [He clicks the + icon in the Terminal to open a new tab.] and let's name our tabs here. This is going to be worker_1. [He renames the tab in the Inspector.] We'll name this one master. [He switches to the first tab and renames it in the Inspector.] And we're going to create a second worker. So let's create another tab and name it worker_2. Then I'm going to create another tab where we're actually going to do the submit. So let's call that one spark-submit. Okay, so let's start our first worker. [He clicks the worker_1 tab.] So again, we're going to change to the directory of our Spark installation. [He types cd ~/Spark/spark-1.2.0.] And we'll start the worker. [The code is ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://lombardi.local:7077.] So we're specifying the URL of the master server. And that starts one worker. We'll do the same here for the next worker. [He clicks the worker_2 tab.] So change to the installation directory first. Then we run that code. [The code is ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://lombardi.local:7077.] And then we can verify that our workers are up and running. If we go over here to the browser and do a refresh, here we see we have both workers running. [Two worker IDs and addresses display in the Workers section of the Spark Master page.] They are both ALIVE, [He refers to the State column.] each with 4 cores available. [He refers to the Cores column.] And then, let's go back to Terminal.

And in the spark-submit tab, first let's take a look at the Spark script that we're going to use. So let's first confirm that it exists in this current directory. [He types ls -at | grep "my-pyspark-app.py".] Okay, so there it is. [The output is -rw-r--r--@ 1 wesley staff 313 29 Mar 00:19 my-pyspark-app.py.] So let's just use a text editor to open that file. [He types mate my-pyspark-app.py. Code ends. The code displays in a text editor. The complete code is included for reference in the My PySpark section at the end of the transcript.] Okay, so here in the first two lines, we're importing SparkConf and SparkContext from pyspark. [He refers to the code lines from pyspark import SparkConf and from pyspark import SparkContext.] And in line 4, we then create a new SparkContext here in our script and configure it with the URL of the master server, followed by the name that we want to give to our Spark application. So I'm going to call it My PySpark App. [He refers to the code sc = SparkContext("spark://lombardi.local:7077", "My PySpark App".] And then we list the absolute paths to any code dependencies in the pyFiles option of the SparkContext constructor. So in this case we have just the one file, which is this current script that we're in, my-pyspark-app.py. [He refers to the code pyFiles=['/Users/wesley/devel/my-pyspark-app.py']).] All right, so next in lines 6 through 8, I've defined this Python function named exp2, in which we import numpy and then use the numpy exp2 function to return x along with the result of raising 2 to the power of x. [He refers to the code lines def exp2(x):, import numpy as np, and return (x, np.exp2(x)).] So that is 2 raised to the power of 0, 2 raised to the power of 1, 2 raised to the power of 2, and so on. And then in line 10, we create an rdd, or resilient distributed dataset, that contains a parallelized collection, so as to run our calculation on the Spark cluster. [He refers to the code line rdd = sc.parallelize(range(100000)).map(exp2).take(13).] So here we're using our numpy function to compute 2 raised to the power of each element in an array of integers from 0 to 99,999, and then we're retrieving only the first 13 results. And then finally, in line 11, we print the results of the calculation to standard output. [He refers to print(rdd).]

All right, so with this, let's now switch back to Terminal. And we're going to launch our Python application to our standalone Spark cluster. So here in the spark-submit tab, we're first going to change to our Spark installation directory. [He types cd ~/Spark/spark-1.2.0.] And then we're going to use pyspark to run our script, [The code is ./bin/pyspark ~/devel/my-pyspark-app.py.] and then we'll hit Enter. [A note warns that running Python applications through ./bin/pyspark is deprecated as of Spark 1.0. It suggests using ./bin/spark-submit <python file> instead.] And we'll go to the browser and do a refresh. We see that the app is still running. [App information displays in the Running Applications section of the Spark Master page. The State is RUNNING.] Let's do a refresh again, and it's still running. [The State remains as RUNNING.] Let's refresh one more time. Okay, so the application has completed and we see the State is FINISHED, and it took 12 seconds. And if we go back to Terminal, here we see our output. [The output produces 13 pairs of values.] So we can see these pairs, [For example, the first pair is (0, 1.0).] where 2 raised to the power of the first value gives the second value. So 2 raised to 0 is 1, 2 raised to 1 is 2, 2 raised to 2 is 4, 2 raised to 3 is 8, 2 raised to 4 is 16, and so on, until we get to the end, where 2 raised to 12 is 4,096. All right, so that's it. In this video, we have set up and deployed a PySpark application to a standalone cluster.

My PySpark

from pyspark import SparkConf
from pyspark import SparkContext

sc = SparkContext("spark://lombardi.local:7077", "My PySpark App", pyFiles=['/Users/wesley/devel/my-pyspark-app.py'])

def exp2(x):
   import numpy as np
   return (x, np.exp2(x))

rdd = sc.parallelize(range(100000)).map(exp2).take(13)
print(rdd)
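
Note that the deprecation warning shown during the run suggests using spark-submit to launch Python applications. Based on that suggestion, an equivalent launch from the Spark installation directory would be ./bin/spark-submit ~/devel/my-pyspark-app.py, since the script already names the master URL in its SparkContext.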

Exercise: Working with Pandas

Exercise

[Exercise Title: Working with Pandas. Your host for this session is Wesley Miller.] Now that you've covered the essentials of working with Python data containers and data manipulation operations, you will use the pandas library to create and merge two DataFrames. More specifically, you will create two DataFrames, each containing three columns and three rows. Then, you will merge the two DataFrames by performing an inner join operation. At this point, you may pause the video and attempt the programming solution. Then once you're finished, you may resume the video to view the suggested solution.

Solution

Okay, so here on a Windows 8.1 machine on which I have Python 2.7 installed, I've already used pip to install the pandas package. [The code is pip install pandas.] And as you can see here in the Command Prompt window, all dependent packages were automatically downloaded and installed as well. [A message notes the packages that were downloaded and installed, and a successful installation notification displays.] So let's now open up the Python IDLE console and create our pandas DataFrames. [He opens the console from the taskbar.] So first, we're going to import pandas. [The code is import pandas as pd.] And then we're going to create our first DataFrame containing three columns and three rows, ensuring that one of the columns can be used as a key for performing joins with other DataFrames. [The code is df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'W': ['W0', 'W1', 'W2'], 'Y': ['Y0', 'Y1', 'Y2']}).] All right, and then we're going to create our second DataFrame which will also contain three columns and three rows. And we will also ensure that one of the columns in this DataFrame can be used as a key for performing joins with other DataFrames.

So just to save time, I'm going to reuse the last input here. I'm going to change the name of the DataFrame to df2, use the same key column, and change the labels and values for the other columns. So we're going to have X0, X1, X2, and then Z0, Z1, and Z2, [The code is df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'X': ['X0', 'X1', 'X2'], 'Z': ['Z0', 'Z1', 'Z2']}).] and we'll hit Enter. All right, so then we're going to print out our first DataFrame to confirm that it has the structure that we want. [He types df1. The output produces a table with three rows and W, Y, and key columns, each with values from 0 to 2.] Okay, and it does. Then we're going to print out our second DataFrame to also confirm that it has the structure and elements in the right place. [He types df2. The output produces a table with three rows and columns X, Z, and key, each with values from 0 to 2.] Okay, indeed it does. And finally, we will merge the two DataFrames using an inner join operation on the key column in both DataFrames. [He types pd.merge(df1, df2, on='key'). The output produces a table with three rows and W, Y, key, X, and Z columns, each with values from 0 to 2.] And with this, you see that we have successfully merged our two DataFrames. So that's it, this completes the suggested solution to the exercise.
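
For reference, the solution steps can be gathered into one runnable listing:

import pandas as pd

# First DataFrame: three rows, with 'key' usable as a join column.
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'W': ['W0', 'W1', 'W2'],
                    'Y': ['Y0', 'Y1', 'Y2']})

# Second DataFrame: the same keys with different value columns.
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'X': ['X0', 'X1', 'X2'],
                    'Z': ['Z0', 'Z1', 'Z2']})

# Merge the two DataFrames with an inner join on the shared 'key' column.
print(pd.merge(df1, df2, on='key', how='inner'))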
