From f6fabad990cdaf6404beebc24143ffd2fd5d3038 Mon Sep 17 00:00:00 2001
From: Shivaram Venkataraman
Date: Mon, 3 Mar 2014 17:59:34 -0800
Subject: [PATCH] Add lab4

---
 lab4/joins.ipynb | 440 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 440 insertions(+)
 create mode 100644 lab4/joins.ipynb

diff --git a/lab4/joins.ipynb b/lab4/joins.ipynb
new file mode 100644
index 0000000..1f05805
--- /dev/null
+++ b/lab4/joins.ipynb
@@ -0,0 +1,440 @@
+{
+ "metadata": {
+  "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Joining DataFrames in Pandas\n",
+      "\n",
+      "In previous labs, we've explored the power of tables as a data management abstraction, in particular with the Pandas DataFrame object.\n",
+      "Tables let us select rows and columns of interest, group data, and measure aggregates.\n",
+      "\n",
+      "But what happens when we have more than one table?\n",
+      "Traditional relational databases usually contain many tables.\n",
+      "Moreover, when integrating multiple data sets, we need tools to combine them.\n",
+      "\n",
+      "In this lab, we will use Pandas' take on the database **join** operation to see how tables can be linked together.\n",
+      "Specifically, we're going to perform a \"fuzzy join\" based on string edit-distance as another approach to finding duplicate records.\n",
+      "\n",
+      "Remember to fill out the response form at http://goo.gl/ZgfzAN at the end!"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Setup\n",
+      "\n",
+      "### Data\n",
+      "\n",
+      "Today we'll be using a small data set of restaurants.\n",
+      "Download the data from [here](https://raw.github.com/amplab/datascience-sp14/master/lab4/data/restaurants.csv).\n",
+      "Put the data file, \"restaurants.csv\", in the same directory as this notebook.\n",
+      "\n",
+      "### Edit Distance\n",
+      "\n",
+      "We're going to be using a string-similarity Python library to compute \"edit distance\".\n",
+      "Install it on your VM by running the following:\n",
+      "\n",
+      "`sudo apt-get install python-levenshtein`\n",
+      "\n",
+      "**NOTE**: You may also need to run `sudo apt-get update` first.\n",
+      "\n",
+      "To test that it works, the following import should run without errors:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import Levenshtein as L"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Joins\n",
+      "\n",
+      "A **join** is a way to connect rows in two different data tables based on some criteria.\n",
+      "Suppose the university has a database for student records with two tables in it: *Students* and *Grades*.\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import pandas as pd\n",
+      "\n",
+      "Students = pd.DataFrame({'student_id': [1, 2], 'name': ['Alice', 'Bob']})\n",
+      "Students"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "Grades = pd.DataFrame({'student_id': [1, 1, 2, 2], 'class_id': [1, 2, 1, 3], 'grade': ['A', 'C', 'B', 'B']})\n",
+      "Grades"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Let's say we want to know all of Bob's grades.\n",
+      "Then, we can look up Bob's student ID in the Students table, and with the ID, look up his grades in the Grades table.\n",
+      "Joins naturally express this process: when two tables share a common type of column (student ID in this case), we can join the tables together to get a complete view.\n",
+      "\n",
+      "In Pandas, we can use the **merge** function to perform a join.\n",
+      "Pass the two tables to join as the first two arguments, and set the \"on\" parameter to the name of the column to join on."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "pd.merge(Students, Grades, on='student_id')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
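+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "An aside before the exercise below: **merge** can also join on more than one column at once.\n",
+      "Pass a list of column names to \"on\", and a pair of rows is joined only if it matches on *all* of those columns.\n",
+      "Here is a minimal sketch; the two tables (and their columns) are made up purely for illustration and are not part of the lab data:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Hypothetical tables that share two columns: student_id and term\n",
+      "Enrollments = pd.DataFrame({'student_id': [1, 1, 2], 'term': ['fall', 'spring', 'fall'], 'class_id': [1, 2, 3]})\n",
+      "Fees = pd.DataFrame({'student_id': [1, 1, 2], 'term': ['fall', 'spring', 'spring'], 'paid': ['yes', 'no', 'yes']})\n",
+      "\n",
+      "# Rows are joined only when BOTH student_id and term match\n",
+      "pd.merge(Enrollments, Fees, on=['student_id', 'term'])"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },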
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "1. Use **merge** to join Grades with the Classes table below, and find out what class Alice got an A in."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "Classes = pd.DataFrame({'class_id': [1, 2, 3], 'title': ['Math', 'English', 'Spanish']})"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Joining the Restaurant Data\n",
+      "\n",
+      "Now let's load the restaurant data that we will be analyzing:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "resto = pd.read_csv('restaurants.csv')\n",
+      "resto.info()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "resto[:10]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The restaurant data has four columns.\n",
+      "**id** is a unique ID field (unique for each row), **name** is the name of the restaurant, and **city** is where it is located.\n",
+      "The fourth column, **cluster**, is a \"gold standard\" column.\n",
+      "If two records have the same **cluster**, that means they are both about the same restaurant.\n",
+      "\n",
+      "The type of join we made above between Students and Grades, where we link records with equal values in a common column, is called an *equijoin*.\n",
+      "Equijoins may join on more than one column, too (the values in every join column have to match).\n",
+      "\n",
+      "Let's use an equijoin to find pairs of duplicate restaurant records.\n",
+      "We join the data to itself, on the **cluster** column.\n",
+      "\n",
+      "> Note: a join between a table and itself is called a *self-join*.\n",
+      "\n",
+      "The result (\"clusters\" below) has a lot of extra records in it.\n",
+      "For example, since we're joining a table to itself, every record matches itself.\n",
+      "We can filter on IDs to get rid of these extra join results.\n",
+      "Note that when Pandas joins two tables that have columns with the same name, it appends \"_x\" and \"_y\" to the names to distinguish them."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "clusters = pd.merge(resto, resto, on='cluster')\n",
+      "clusters = clusters[clusters.id_x != clusters.id_y]\n",
+      "clusters[:10]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
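+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A small aside: the \"_x\" and \"_y\" names are just the default. **merge** also takes a \"suffixes\" argument that controls what gets appended to overlapping column names. The cell below is only for illustration; the rest of the lab sticks with the default suffixes."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Same self-join as above, but with custom suffixes on the overlapping columns\n",
+      "pd.merge(resto, resto, on='cluster', suffixes=('_left', '_right')).columns"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },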
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "1. There are still extra records in *clusters*, above. If records *A* and *B* match each other, then we will get both (*A*, *B*) and (*B*, *A*) in the output.\n",
+      "Filter *clusters* so that we only keep one instance of each matching pair (HINT: use the IDs again).\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Fuzzy Joins\n",
+      "\n",
+      "Sometimes an equijoin isn't good enough.\n",
+      "\n",
+      "Say you want to match up records that are *almost* equal in a column.\n",
+      "Or where a *function* of a column is equal.\n",
+      "Or maybe you don't care about equality: maybe \"less than\" or \"greater than or equal to\" is what you want.\n",
+      "These cases call for a more general join than the equijoin.\n",
+      "\n",
+      "We are going to make one of these joins between the restaurant data and itself.\n",
+      "Specifically, we want to match up pairs of records whose restaurant names are *almost* the same.\n",
+      "We call this a **fuzzy join**.\n",
+      "\n",
+      "To do a fuzzy join in Pandas, we need to go about it in a few steps:\n",
+      "\n",
+      "1. Join every record in the first table with every record in the second table. This is called the **Cartesian product** of the tables, and it's simply a list of all possible pairs of records.\n",
+      "2. Add a column to the Cartesian product that measures how \"similar\" each pair of records is. This is our **join criterion**.\n",
+      "3. Filter the Cartesian product based on when the join criterion is \"similar enough.\"\n",
+      "\n",
+      "> SQL Aside: In SQL, these more general joins are supported in about the same way as equijoins are.\n",
+      "> Essentially, you write a boolean expression over columns from the tables being joined, and whenever that expression is true, you join the records together.\n",
+      "> This is very similar to writing an **if** statement in Python or Java.\n",
+      "\n",
+      "Let's do an example to get the hang of it."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "#### 1. Join every record in the first table with every record in the second table.\n",
+      "\n",
+      "We use a \"dummy\" column to compute the Cartesian product of the data with itself.\n",
+      "**dummy** takes the same value for every record, so we can do an equijoin and get back all pairs."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "resto['dummy'] = 0\n",
+      "prod = pd.merge(resto, resto, on='dummy')\n",
+      "\n",
+      "# Clean up\n",
+      "del prod['dummy']\n",
+      "del resto['dummy']\n",
+      "\n",
+      "# Show that prod is the size of \"resto\" squared:\n",
+      "print len(prod), len(resto)**2"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "prod[:10]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "* Like we did with *clusters*, remove \"extra\" record pairs, e.g., ones with the same ID."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "#### 2. Add a column to the Cartesian product that measures how \"similar\" each pair of records is.\n",
+      "\n",
+      "In the homework assignment, we used a string similarity metric called *cosine similarity*, which measured how many \"tokens\" two strings shared in common.\n",
+      "Now, we're going to use an alternative measure of string similarity called *edit-distance*.\n",
+      "[Edit-distance](http://en.wikipedia.org/wiki/Edit_distance) counts the number of simple changes (single-character insertions, deletions, and substitutions) you have to make to one string to turn it into another.\n",
+      "\n",
+      "Import the edit distance library:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import Levenshtein as L\n",
+      "\n",
+      "L.distance('Hello, World!', 'Hallo, World!')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Next, we add a computed column, named **distance**, that measures the edit distance between the names of two restaurants:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# This takes a minute or two to run\n",
+      "prod['distance'] = prod.apply(lambda r: L.distance(r['name_x'], r['name_y']), axis=1)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "#### 3. Filter the Cartesian product based on when the join criterion is \"similar enough.\"\n",
+      "\n",
+      "Now we complete the join by filtering out pairs of records whose names aren't similar enough for our liking.\n",
+      "As in the first homework assignment, we can only figure out how similar is \"similar enough\" by trying out some different options.\n",
+      "Let's try maximum edit-distance thresholds from 1 to 10 and compute precision and recall at each one."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%matplotlib inline\n",
+      "import pylab\n",
+      "\n",
+      "def accuracy(max_distance):\n",
+      "    similar = prod[prod.distance < max_distance]\n",
+      "    correct = float(sum(similar.cluster_x == similar.cluster_y))\n",
+      "    precision = correct / len(similar)\n",
+      "    recall = correct / len(clusters)\n",
+      "    return (precision, recall)\n",
+      "\n",
+      "thresholds = range(1, 11)\n",
+      "p = []\n",
+      "r = []\n",
+      "\n",
+      "for t in thresholds:\n",
+      "    acc = accuracy(t)\n",
+      "    p.append(acc[0])\n",
+      "    r.append(acc[1])\n",
+      "\n",
+      "pylab.plot(thresholds, p)\n",
+      "pylab.plot(thresholds, r)\n",
+      "pylab.legend(['precision', 'recall'], loc=2)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
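+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The plot shows the overall shape of the tradeoff, but it can also help to look at the raw numbers. The cell below is just a convenience view built from the *thresholds*, *p*, and *r* lists computed above:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Precision and recall at each threshold, as a table\n",
+      "pd.DataFrame({'max_distance': thresholds, 'precision': p, 'recall': r})"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },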
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "1. Another common way to visualize the tradeoff between precision and recall is to plot them directly against each other.\n",
+      "Create a scatterplot with precision on one axis and recall on the other.\n",
+      "Where are \"good\" points on the plot, and where are \"bad\" ones?\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "2. The Python Levenshtein library provides another metric of string similarity called \"ratio\" (use `L.ratio(s1, s2)`).\n",
+      "`ratio` gives a similarity score between 0 and 1, with higher meaning more similar.\n",
+      "Add a column to \"prod\" with the `ratio` similarities of the **name** columns, and redo the precision/recall tradeoff analysis with the new metric.\n",
+      "(Note: you will have to alter the `accuracy` function and the threshold range.)\n",
+      "On this data, does `Levenshtein.ratio` do better than `Levenshtein.distance`?"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Finally, remember to fill out the response form at http://goo.gl/ZgfzAN !"
+     ]
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file