The following is an excerpt from a term paper I am working on.

Every ten years, each of the 50 US states redraws the boundaries of its congressional districts. The lines are redrawn so that all of the districts contain roughly the same number of people. Each congressional district elects one person to the US House of Representatives. As one can imagine, the districts carry a fair amount of political power, which shifts every time the boundaries are redrawn. It is commonly believed that whichever party controls the redistricting process can misuse this ability and rig elections in its favor. This practice is referred to as “gerrymandering.” We can see how this might work in this illustration:

Gerrymandering in Action

I want to see if we can detect gerrymandering by looking at the results of many US elections. The Federal Election Commission (FEC) has a website that provides historic election results for congressional districts. Unfortunately, the spreadsheets they provide only go back as far as the year 2000; everything before that point is a PDF. This gives us eight different elections to look at.

I went through and scraped as much data as possible from these spreadsheets. They had inconsistent formatting, but I was able to get 97% of the elections into a JSON archive easily.

The most interesting thing I found was the trend that appears when the results are plotted on a graph. Below we see Republican votes as a percentage of total votes for the state of North Carolina, over time. North Carolina had its districts redrawn after widespread reports of gerrymandering in the 2010 election cycle. We see that the districts cluster around a 60% concentration of Republicans. This is what is most advantageous for the Republicans: maintaining a majority in most districts, and losing big in three.

Gerrymandering in Action
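As a rough sketch of the quantity plotted above, the Republican share for a district is simply Republican votes divided by total votes. The record layout (`district`, `rep`, `total`) and the numbers below are assumptions made up for illustration; they are not taken from the FEC spreadsheets or from my JSON archive.

```python
def republican_share(rep_votes, total_votes):
    """Return Republican votes as a fraction of total votes.

    Returns None when there are no votes to divide by.
    """
    if total_votes <= 0:
        return None
    return rep_votes / total_votes


# Hypothetical district records; the vote counts are invented for illustration.
districts = [
    {"district": 1, "rep": 60000, "total": 100000},
    {"district": 2, "rep": 25000, "total": 100000},
    {"district": 3, "rep": 0, "total": 0},
]

# Map each district number to its Republican vote share.
shares = {d["district"]: republican_share(d["rep"], d["total"]) for d in districts}
```

A packed district would show up here as a share far below the cluster, while the remaining districts sit comfortably above 50%.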

What I learned

I have tried to make useful things for the open source community before, but always seem to fall short. I will be releasing the repository on GitHub soon, but am not sure how useful it will be to others. When scraping the Excel files, about 5% of the data was not scraped properly. I did not need it for my analysis. It seems the issue was in the order in which the political races were listed. If the first race was not the winning race, then my scraper did not return the proper output.

When reading in the votes for each race, there seemed to be four cases. For example, the number 56789 might appear:

  • As a string (e.g. “56,789”)
  • As a list (e.g. “[56,789]”)
  • As a floating point (e.g. 56789.0)
  • As a string saying, “Unopposed”
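The four cases above could be normalized with something like the following. This is a minimal sketch, not the scraper I actually used: the function name `parse_votes` is hypothetical, and it assumes the "list" case arrives as a bracketed string such as "[56,789]" rather than as a real list object.

```python
def parse_votes(raw):
    """Normalize one raw vote cell to an int, or to the string "Unopposed"."""
    # Floating point case (and plain ints): 56789.0 -> 56789
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        return int(raw)
    if isinstance(raw, str):
        text = raw.strip()
        # Unopposed races carry no vote count; keep the label as-is.
        if text.lower() == "unopposed":
            return "Unopposed"
        # String and bracketed cases: drop brackets and thousands separators.
        digits = text.strip("[]").replace(",", "").strip()
        if digits.isdigit():
            return int(digits)
    # Anything else is surfaced instead of being silently guessed at.
    raise ValueError(f"unrecognized vote value: {raw!r}")


# The four cases from the list above.
parsed = [parse_votes(v) for v in ["56,789", "[56,789]", 56789.0, "Unopposed"]]
```

Raising on unrecognized input is a deliberate choice in this sketch: it makes badly formatted cells visible instead of letting them slip into the archive.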

For my own convenience, I saved unopposed elections with a flag of 0 and gave missing elections a flag of -1. I now see this was a bad decision for anyone else who wants to use my data. I should have saved the flags as strings and left the “Unopposed” entries alone. This would have been more semantic. It would also have prevented people from doing bad math by treating the -1’s and 0’s as literal values.
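To illustrate the difference, here is a small sketch of what string flags might look like. The label "Missing" and the helper `total_counted_votes` are hypothetical names chosen for this example, and the vote counts are invented.

```python
# String flags cannot be mistaken for vote counts.
UNOPPOSED = "Unopposed"
MISSING = "Missing"  # hypothetical label for races with no data


def total_counted_votes(values):
    """Sum only the real vote counts, skipping string flags."""
    return sum(v for v in values if isinstance(v, int))


# With numeric flags, a naive sum quietly absorbs the 0 and the -1.
numeric_flags = [56789, 0, -1, 1211]
naive_total = sum(numeric_flags)

# With string flags, a naive sum() would raise a TypeError, so the
# flags have to be handled on purpose.
string_flags = [56789, UNOPPOSED, MISSING, 1211]
safe_total = total_counted_votes(string_flags)
```

Here the naive total comes out one vote short because of the -1 flag, which is exactly the kind of silent error the string flags would have prevented.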