From f6fabad990cdaf6404beebc24143ffd2fd5d3038 Mon Sep 17 00:00:00 2001
From: Shivaram Venkataraman
Date: Mon, 3 Mar 2014 17:59:34 -0800
Subject: [PATCH] Add lab4

---
 lab4/joins.ipynb | 440 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 440 insertions(+)
 create mode 100644 lab4/joins.ipynb

diff --git a/lab4/joins.ipynb b/lab4/joins.ipynb
new file mode 100644
index 0000000..1f05805
--- /dev/null
+++ b/lab4/joins.ipynb
@@ -0,0 +1,440 @@
+{
+ "metadata": {
+  "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Joining DataFrames in Pandas\n",
+      "\n",
+      "In previous labs, we've explored the power of tables as a data management abstraction, in particular with the Pandas DataFrame object.\n",
+      "Tables let us select rows and columns of interest, group data, and measure aggregates.\n",
+      "\n",
+      "But what happens when we have more than one table?\n",
+      "Traditional relational databases usually contain many tables.\n",
+      "Moreover, when integrating multiple data sets, we need tools to combine them.\n",
+      "\n",
+      "In this lab, we will use Pandas' take on the database **join** operation to see how tables can be linked together.\n",
+      "Specifically, we're going to perform a \"fuzzy join\" based on string edit-distance as another approach to finding duplicate records.\n",
+      "\n",
+      "Remember to fill out the response form at http://goo.gl/ZgfzAN at the end!"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Setup\n",
+      "\n",
+      "### Data\n",
+      "\n",
+      "Today we'll be using a small data set of restaurants.\n",
+      "Download the data from [here](https://raw.github.com/amplab/datascience-sp14/master/lab4/data/restaurants.csv).\n",
+      "Put the data file, \"restaurants.csv\", in the same directory as this notebook.\n",
+      "\n",
+      "### Edit Distance\n",
+      "\n",
+      "We're going to be using a string-similarity Python library to compute \"edit distance\".\n",
+      "Install it on your VM by running the following:\n",
+      "\n",
+      "`sudo apt-get install python-levenshtein`\n",
+      "\n",
+      "**NOTE**: You may also need to run `sudo apt-get update` first.\n",
+      "\n",
+      "To test that it works, the following import should run without errors:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import Levenshtein as L"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Joins\n",
+      "\n",
+      "A **join** is a way to connect rows in two different data tables based on some criteria.\n",
+      "Suppose the university has a database for student records with two tables in it: *Students* and *Grades*.\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import pandas as pd\n",
+      "\n",
+      "Students = pd.DataFrame({'student_id': [1, 2], 'name': ['Alice', 'Bob']})\n",
+      "Students"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "Grades = pd.DataFrame({'student_id': [1, 1, 2, 2], 'class_id': [1, 2, 1, 3], 'grade': ['A', 'C', 'B', 'B']})\n",
+      "Grades"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Let's say we want to know all of Bob's grades.\n",
+      "Then, we can look up Bob's student ID in the Students table, and with the ID, look up his grades in the Grades table.\n",
+      "Joins naturally express this process: when two tables share a common type of column (student ID in this case), we can join the tables together to get a complete view.\n",
+      "\n",
+      "In Pandas, we can use the **merge** function to perform a join.\n",
+      "Pass the two tables to join as the first two arguments, and set the \"on\" parameter to the name of the column to join on."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "pd.merge(Students, Grades, on='student_id')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
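+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "An aside before the exercise below: **merge** can also join on more than one column at once.\n",
+      "Pass a list of column names to \"on\", and a pair of rows is joined only if it matches on *all* of those columns.\n",
+      "Here is a minimal sketch; the two tables (and their columns) are made up purely for illustration and are not part of the lab data:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Hypothetical tables that share two columns: student_id and term\n",
+      "Enrollments = pd.DataFrame({'student_id': [1, 1, 2], 'term': ['fall', 'spring', 'fall'], 'class_id': [1, 2, 3]})\n",
+      "Fees = pd.DataFrame({'student_id': [1, 1, 2], 'term': ['fall', 'spring', 'spring'], 'paid': ['yes', 'no', 'yes']})\n",
+      "\n",
+      "# Rows are joined only when BOTH student_id and term match\n",
+      "pd.merge(Enrollments, Fees, on=['student_id', 'term'])"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },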
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "1. Use **merge** to join Grades with the Classes table below, and find out what class Alice got an A in."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "Classes = pd.DataFrame({'class_id': [1, 2, 3], 'title': ['Math', 'English', 'Spanish']})"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Joining the Restaurant Data\n",
+      "\n",
+      "Now let's load the restaurant data that we will be analyzing:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "resto = pd.read_csv('restaurants.csv')\n",
+      "resto.info()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "resto[:10]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The restaurant data has four columns.\n",
+      "**id** is a unique ID field (unique for each row), **name** is the name of the restaurant, and **city** is where it is located.\n",
+      "The fourth column, **cluster**, is a \"gold standard\" column.\n",
+      "If two records have the same **cluster**, that means they are both about the same restaurant.\n",
+      "\n",
+      "The type of join we made above between Students and Grades, where we link records with equal values in a common column, is called an *equijoin*.\n",
+      "Equijoins may join on more than one column, too (the values in every join column have to match).\n",
+      "\n",
+      "Let's use an equijoin to find pairs of duplicate restaurant records.\n",
+      "We join the data to itself, on the **cluster** column.\n",
+      "\n",
+      "> Note: a join between a table and itself is called a *self-join*.\n",
+      "\n",
+      "The result (\"clusters\" below) has a lot of extra records in it.\n",
+      "For example, since we're joining a table to itself, every record matches itself.\n",
+      "We can filter on IDs to get rid of these extra join results.\n",
+      "Note that when Pandas joins two tables that have columns with the same name, it appends \"_x\" and \"_y\" to the names to distinguish them."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "clusters = pd.merge(resto, resto, on='cluster')\n",
+      "clusters = clusters[clusters.id_x != clusters.id_y]\n",
+      "clusters[:10]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
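+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A small aside: the \"_x\" and \"_y\" names are just the default. **merge** also takes a \"suffixes\" argument that controls what gets appended to overlapping column names. The cell below is only for illustration; the rest of the lab sticks with the default suffixes."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Same self-join as above, but with custom suffixes on the overlapping columns\n",
+      "pd.merge(resto, resto, on='cluster', suffixes=('_left', '_right')).columns"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },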
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "1. There are still extra records in *clusters*, above. If records *A* and *B* match each other, then we will get both (*A*, *B*) and (*B*, *A*) in the output.\n",
+      "Filter *clusters* so that we only keep one instance of each matching pair (HINT: use the IDs again).\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Fuzzy Joins\n",
+      "\n",
+      "Sometimes an equijoin isn't good enough.\n",
+      "\n",
+      "Say you want to match up records that are *almost* equal in a column.\n",
+      "Or where a *function* of a column is equal.\n",
+      "Or maybe you don't care about equality: maybe \"less than\" or \"greater than or equal to\" is what you want.\n",
+      "These cases call for a more general join than the equijoin.\n",
+      "\n",
+      "We are going to make one of these joins between the restaurant data and itself.\n",
+      "Specifically, we want to match up pairs of records whose restaurant names are *almost* the same.\n",
+      "We call this a **fuzzy join**.\n",
+      "\n",
+      "To do a fuzzy join in Pandas, we need to go about it in a few steps:\n",
+      "\n",
+      "1. Join every record in the first table with every record in the second table. This is called the **Cartesian product** of the tables, and it's simply a list of all possible pairs of records.\n",
+      "2. Add a column to the Cartesian product that measures how \"similar\" each pair of records is. This is our **join criterion**.\n",
+      "3. Filter the Cartesian product based on when the join criterion is \"similar enough.\"\n",
+      "\n",
+      "> SQL Aside: In SQL, these more general joins are supported in about the same way as equijoins are.\n",
+      "> Essentially, you write a boolean expression over columns from the tables being joined, and whenever that expression is true, you join the records together.\n",
+      "> This is very similar to writing an **if** statement in Python or Java.\n",
+      "\n",
+      "Let's do an example to get the hang of it."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "#### 1. Join every record in the first table with every record in the second table.\n",
+      "\n",
+      "We use a \"dummy\" column to compute the Cartesian product of the data with itself.\n",
+      "**dummy** takes the same value for every record, so we can do an equijoin and get back all pairs."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "resto['dummy'] = 0\n",
+      "prod = pd.merge(resto, resto, on='dummy')\n",
+      "\n",
+      "# Clean up\n",
+      "del prod['dummy']\n",
+      "del resto['dummy']\n",
+      "\n",
+      "# Show that prod is the size of \"resto\" squared:\n",
+      "print len(prod), len(resto)**2"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "prod[:10]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "* Like we did with *clusters*, remove \"extra\" record pairs, e.g., ones with the same ID."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "#### 2. Add a column to the Cartesian product that measures how \"similar\" each pair of records is.\n",
+      "\n",
+      "In the homework assignment, we used a string similarity metric called *cosine similarity*, which measured how many \"tokens\" two strings shared in common.\n",
+      "Now, we're going to use an alternative measure of string similarity called *edit-distance*.\n",
+      "[Edit-distance](http://en.wikipedia.org/wiki/Edit_distance) counts the number of simple changes (single-character insertions, deletions, and substitutions) you have to make to one string to turn it into another.\n",
+      "\n",
+      "Import the edit distance library:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import Levenshtein as L\n",
+      "\n",
+      "L.distance('Hello, World!', 'Hallo, World!')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Next, we add a computed column, named **distance**, that measures the edit distance between the names of two restaurants:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# This takes a minute or two to run\n",
+      "prod['distance'] = prod.apply(lambda r: L.distance(r['name_x'], r['name_y']), axis=1)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "#### 3. Filter the Cartesian product based on when the join criterion is \"similar enough.\"\n",
+      "\n",
+      "Now we complete the join by filtering out pairs of records whose names aren't similar enough for our liking.\n",
+      "As in the first homework assignment, we can only figure out how similar is \"similar enough\" by trying out some different options.\n",
+      "Let's try maximum edit-distance thresholds from 1 to 10 and compute precision and recall at each one."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%matplotlib inline\n",
+      "import pylab\n",
+      "\n",
+      "def accuracy(max_distance):\n",
+      "    similar = prod[prod.distance < max_distance]\n",
+      "    correct = float(sum(similar.cluster_x == similar.cluster_y))\n",
+      "    precision = correct / len(similar)\n",
+      "    recall = correct / len(clusters)\n",
+      "    return (precision, recall)\n",
+      "\n",
+      "thresholds = range(1, 11)\n",
+      "p = []\n",
+      "r = []\n",
+      "\n",
+      "for t in thresholds:\n",
+      "    acc = accuracy(t)\n",
+      "    p.append(acc[0])\n",
+      "    r.append(acc[1])\n",
+      "\n",
+      "pylab.plot(thresholds, p)\n",
+      "pylab.plot(thresholds, r)\n",
+      "pylab.legend(['precision', 'recall'], loc=2)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
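+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The plot shows the overall shape of the tradeoff, but it can also help to look at the raw numbers. The cell below is just a convenience view built from the *thresholds*, *p*, and *r* lists computed above:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Precision and recall at each threshold, as a table\n",
+      "pd.DataFrame({'max_distance': thresholds, 'precision': p, 'recall': r})"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },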
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### DIY\n",
+      "\n",
+      "1. Another common way to visualize the tradeoff between precision and recall is to plot them directly against each other.\n",
+      "Create a scatterplot with precision on one axis and recall on the other.\n",
+      "Where are \"good\" points on the plot, and where are \"bad\" ones?\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "2. The Python Levenshtein library provides another metric of string similarity called \"ratio\" (use `L.ratio(s1, s2)`).\n",
+      "`ratio` gives a similarity score between 0 and 1, with higher meaning more similar.\n",
+      "Add a column to \"prod\" with the `ratio` similarities of the **name** columns, and redo the precision/recall tradeoff analysis with the new metric.\n",
+      "(Note: you will have to alter the `accuracy` function and the threshold range.)\n",
+      "On this data, does `Levenshtein.ratio` do better than `Levenshtein.distance`?"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Finally, remember to fill out the response form at http://goo.gl/ZgfzAN !"
+     ]
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file