I blogged about data mining US Earmarks here, here, here and here. I started wondering: is there a relationship between Senators and the sections of the bills they voted for?
Building on the initial PowerShell code (you can download it at the bottom), I used a simple similarity score and produced the graph below. This is a circular layout of the data. The target nodes are placed in the order they are passed to the AddEdge method; no weighting is done based on the calculated similarity.
I compared the votes of each of the following Senators to those of the rest of the Senators, looking for similarities: Clinton, Stevens, Kyl, Obama and Schumer. I then chose the top 10 most similar for each and graphed the connections.
Raw Data
Source Target Tanimoto
Clinton Levin 0.7
Clinton Isakson 0.7
Clinton Pryor 0.7
Clinton Leahy 0.71
Clinton Chambliss 0.71
Clinton Nelson 0.72
Clinton Reid 0.73
Clinton Vitter 0.75
Clinton Stabenow 0.81
Clinton Schumer 0.96
Stevens Boxer 0.67
Stevens Akaka 0.67
Stevens Menendez 0.67
Stevens Coleman 0.67
Stevens Klobuchar 0.68
Stevens Leahy 0.68
Stevens Chambliss 0.68
Stevens Schumer 0.69
Stevens Inouye 0.76
Stevens Reid 0.81
Kyl Dorgan 0.41
Kyl Klobuchar 0.41
Kyl Bingaman 0.42
Kyl Wyden 0.43
Kyl Smith 0.45
Kyl Coleman 0.45
Kyl Murkowski 0.47
Kyl Ensign 0.47
Kyl Roberts 0.5
Kyl Thune 0.53
Obama Wyden 0.68
Obama Reed 0.7
Obama Brown 0.7
Obama Lugar 0.71
Obama Snowe 0.71
Obama Collins 0.71
Obama Roberts 0.72
Obama Bayh 0.75
Obama Martinez 0.75
Obama Whitehouse 0.75
Schumer Shelby 0.71
Schumer Vitter 0.72
Schumer Durbin 0.73
Schumer Levin 0.73
Schumer Leahy 0.74
Schumer Chambliss 0.75
Schumer Reid 0.76
Schumer Nelson 0.76
Schumer Stabenow 0.79
Schumer Clinton 0.96
Six Degrees of Separation
The network graph is based on the Tanimoto coefficient, an extension of cosine similarity.
Cosine similarity is a measure of similarity between two vectors of n dimensions, found from the cosine of the angle between them. It is often used to compare documents in text mining.
My interpretation: take two lists and find their intersection. Add the count of the first list to the count of the second, then subtract the count of the intersection. Finally, divide the intersection count by that number.
The earmark data lists the bill sections each Senator voted for, so a Tanimoto coefficient can be calculated for, say, what Clinton and Schumer voted on.
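The interpretation above can be sketched in a few lines of PowerShell. This is a minimal illustration, not the function shipped in Do-Analysis.ps1: the name Get-TanimotoCoefficient and the section labels are made up, and it assumes neither list contains repeated entries.

```powershell
# Hypothetical helper for illustration; not the code from Do-Analysis.ps1.
# Assumes each list holds no duplicate entries.
function Get-TanimotoCoefficient {
    param(
        [string[]]$First,
        [string[]]$Second
    )

    # Intersection: items in the first list that also appear in the second.
    # Note that -contains compares strings case-insensitively.
    $shared = @($First | Where-Object { $Second -contains $_ }).Count

    # Count of the first list plus count of the second, minus the intersection.
    $total = @($First).Count + @($Second).Count - $shared

    # Two empty lists have nothing in common; avoid dividing by zero.
    if ($total -eq 0) { return 0.0 }

    # Divide the intersection count by that total.
    $shared / $total
}

# Two lists of three sections sharing two of them: 2 / (3 + 3 - 2) = 0.5
Get-TanimotoCoefficient 'Sec101','Sec102','Sec103' 'Sec102','Sec103','Sec104'
```

Identical lists score 1 and lists with nothing in common score 0, which is why Clinton and Schumer at 0.96 stand out in the raw data.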
PowerShell Code
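Line 1 dot-sources Do-Analysis.ps1, which contains several functions. Line 2 transforms the data from nested hash tables into a hash table keyed by each Senator's last name, with an array of strings as the value; this is what the coefficient is calculated from. Lines 3 to 5 compare each of the five Senators against the rest, keep the last 10 results for each, and pass them to Show-NetMap.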
1: . .\Do-Analysis.ps1
2: $set = Do-Transform
3: list Clinton Stevens Kyl Obama Schumer |
4: % { Do-Compare $set $_ | select -last 10 } |
5: Show-NetMap C
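For readers without the download, here is a rough sketch of what a compare step like Do-Compare might do. Everything in it is an assumption: Compare-ToOthers is a hypothetical stand-in, the toy data uses placeholder names rather than real earmark records, and sorting in ascending order is only inferred from the use of select -last 10 above.

```powershell
# Hypothetical stand-in for Do-Compare; the real one lives in Do-Analysis.ps1.
function Compare-ToOthers {
    param(
        [hashtable]$Set,    # key: name, value: array of bill-section strings
        [string]$Source     # the name to compare against everyone else
    )

    $mine = @($Set[$Source])

    $rows = foreach ($name in $Set.Keys) {
        if ($name -eq $Source) { continue }

        $theirs = @($Set[$name])
        $shared = @($mine | Where-Object { $theirs -contains $_ }).Count
        $total  = $mine.Count + $theirs.Count - $shared
        $score  = if ($total -eq 0) { 0.0 } else { $shared / $total }

        # One row per pair, shaped like the raw data table.
        [pscustomobject]@{
            Source   = $Source
            Target   = $name
            Tanimoto = [math]::Round($score, 2)
        }
    }

    # Ascending order, so the most similar rows come last.
    $rows | Sort-Object Tanimoto
}

# Placeholder data, not real earmark records.
$toy = @{
    A = 'S1','S2','S3'
    B = 'S1','S2','S3','S4'
    C = 'S3','S5'
    D = 'S9'
}

# Keep the two most similar to A; B (0.75) is the last row.
Compare-ToOthers $toy 'A' | Select-Object -Last 2
```

Emitting plain objects with Source, Target and Tanimoto properties keeps the result easy to inspect at the command line before handing it to a graphing function.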
Next Steps
This is a spike test to see if it makes sense to continue. Using the PowerShell command line enables quick data analysis. Running the above code without Show-NetMap displays the dataset, including the similarity rating.
Drilling down from the graph into the actual data is next. This should be straightforward: hook up the double-click events of the NetMap control to PowerShell code.
Also of interest is the Tanimoto coefficient function included in Do-Analysis.ps1. It can be used on any list of strings in any application. Here is a version I posted using C#.
Downloads