Sweden has a long tradition of keeping statistics on different KPIs by municipality through many institutions. Let’s see which KPIs correlate with different voting patterns. And while we’re at it, demonstrate some Pandas method chaining – and do a Bayesian analysis of the Pearson product-moment correlation coefficient, often denoted Pearson’s r.

Note that the conclusions might mainly be of interest to Swedes, and some Swedish will probably slip in. Quick word list: kommun = municipality, brott = crime, gravid = pregnant.

TL;DR: The Sweden Democrats (SD) show high correlations with mothers smoking while pregnant, unemployment, and motorcycles. The results are at the bottom of the post if you’re not interested in the process.

And as usual, the Jupyter Notebook (formerly IPython Notebook) is here in a repo together with the data.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

from matplotlib import rcParams
rcParams['font.family'] = 'monospace'
rcParams['font.sans-serif'] = ['Courier']

Starting with the data from the Swedish election 2014, we can read it directly from a URL with the very nice Pandas method read_html.

Encapsulating the DataFrame votes with parentheses makes method chaining possible. Each of these methods returns a new DataFrame, which makes every preprocessing step just another link in the chain.
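To see the pattern in isolation first, here is a minimal, hypothetical sketch (the DataFrame and column names are made up for illustration) contrasting intermediate variables with a chain:

import pandas as pd

df = pd.DataFrame({'A': ['1,5', '2,0'], 'B': ['x', 'y']})

# Step by step, with a throwaway variable per step
tmp = df.rename(columns={'B': 'label'})
tmp = tmp.assign(A=lambda x: x['A'].str.replace(',', '.').astype(float))
clean = tmp.set_index('label')

# The same thing as one chained expression; the outer parentheses
# let the chain span multiple lines without backslashes
clean = (df
    .rename(columns={'B': 'label'})
    .assign(A=lambda x: x['A'].str.replace(',', '.').astype(float))
    .set_index('label')
)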

Percent votes in Swedish election 2014

votes = (pd.read_html('http://www.val.se/val/val2014/slutresultat/K/rike/index.html')[1]
           .rename(columns={u'Område': 'Kommun'})
           [['Kommun', 'SD', 'M', 'C', 'FP', 'KD', 'S', 'V', 'MP', 'FI', u'ÖVR', 'BLANK']]
           .set_index('Kommun')
           .apply(lambda x: x.str.replace(",", ".")
                             .str.replace("%", "").astype(float))
)
partier = votes.columns
votes.head(3)
SD M C FP KD S V MP FI ÖVR BLANK
Kommun
Ale 10.64 21.08 4.95 3.28 3.50 33.59 6.25 5.70 0.18 10.83 1.39
Alingsås 7.48 21.58 6.17 10.30 6.44 29.19 6.85 10.81 0.30 0.89 1.21
Alvesta 14.44 19.70 14.89 1.74 3.12 28.77 4.08 3.49 0.20 9.56 1.98

Now let’s read some migration patterns straight off the web, i.e. which municipalities had a positive net growth in terms of people living there.

Net inhabitant growth

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17248'
col = u'Befolkningsökning, 2012-2015, procent, treårsbasis'
befolkningsokning = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
      .set_index('Kommun')
)
befolkningsokning.head(3)
Befolkningsökning, 2012-2015, procent, treårsbasis
Kommun
Ale 3.7
Alingsås 3.3
Alvesta 2.9

Ranking of how good a municipality is for companies

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17257'
col = u'Företagsklimat, ranking 1-290, 2015, ordningstal'
foretagsklimat = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].astype(int)})
      .set_index('Kommun')
)
foretagsklimat.head(3)
Företagsklimat, ranking 1-290, 2015, ordningstal
Kommun
Ale 43
Alingsås 185
Alvesta 221

Percent of people with a higher education

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17251'
col = u'Andel högskoleutbildade, 2015, procent'
hogskoleutbildning = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
      .set_index('Kommun')
)
hogskoleutbildning.head(3)
Andel högskoleutbildade, 2015, procent
Kommun
Ale 18.9
Alingsås 24.3
Alvesta 15.0

Percent early retirees

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17256'
col = u'Andel förtidspensionärer, 2015, procent'
fortidspensionarer = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
      .set_index('Kommun')
)
fortidspensionarer.head(3)
Andel förtidspensionärer, 2015, procent
Kommun
Ale 5.5
Alingsås 6.5
Alvesta 6.0

Median income

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17249'
col = u'Medianinkomst, 2014, kronor'
medianinkomst = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].str.replace(u"\xa0", u"").astype(int)})
      .set_index('Kommun')
)
medianinkomst.head(3)
Medianinkomst, 2014, kronor
Kommun
Ale 273684
Alingsås 259211
Alvesta 241077

Unemployment rate

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17255'
col = u'Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent'
arbetsloshet = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
      .set_index('Kommun')
)
arbetsloshet.head(3)
Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent
Kommun
Ale 4.9
Alingsås 5.5
Alvesta 9.3

Number of crimes per 100k inhabitants

brott = (pd.read_csv("brott_per_1000.txt", sep=";", encoding="latin-1")
        .rename(columns={"Region": "Kommun", "/100 000 inv": "Brott per 100k inv"})
        .groupby("Kommun")
        .sum()
        .reset_index()
        [['Kommun', 'Brott per 100k inv']]
        .assign(Kommun= lambda x: x['Kommun'].str.replace(" kommun", ""))
        .set_index('Kommun'))
brott.head(3)
Brott per 100k inv
Kommun
Ale 8231
Alingsås 10511
Alvesta 8300

Violent crimes

valdsbrott = (pd.read_csv("vbrott_per_1000.txt", sep=";", encoding="latin-1")
        .rename(columns={"Region": "Kommun", "/100 000 inv": u"Våldsbrott per 100k inv"})
        .groupby("Kommun")
        .sum()
        .reset_index()
        [['Kommun', u'Våldsbrott per 100k inv']]
        .assign(Kommun= lambda x: x['Kommun'].str.replace(" kommun", ""))
        .set_index('Kommun'))
valdsbrott.head(3)
Våldsbrott per 100k inv
Kommun
Ale 823
Alingsås 870
Alvesta 967

Drug crimes

knarkbrott = (pd.read_csv("knarkbrott_per_1000.txt", sep=";", encoding="latin-1")
        .rename(columns={"Region": "Kommun", "/100 000 inv": u"Narkotikabrott per 100k inv"})
        .groupby("Kommun")
        .sum()
        .reset_index()
        [['Kommun', u'Narkotikabrott per 100k inv']]
        .assign(Kommun= lambda x: x['Kommun'].str.replace(" kommun", ""))
        .set_index('Kommun'))
knarkbrott.head(3)
Narkotikabrott per 100k inv
Kommun
Ale 402
Alingsås 1487
Alvesta 327

Mean age

url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17247'
col = u'Medelålder, 2015, år'
medianalder = (
    pd.read_html(url, thousands=' ')[0]
      .assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
      .set_index('Kommun')
)
medianalder.head(3)
Medelålder, 2015, år
Kommun
Ale 39.9
Alingsås 42.0
Alvesta 42.1

Percent born in a country other than Sweden

utrikesfodda = (pd.read_excel("BE0101-Utrikes-fodda-kom-fland.xlsx", header=5)
        .dropna(subset=[u'Kommun'])
        .assign(procent= lambda x: x[u'I procent'].astype(float))
        .rename(columns={'procent': u'Procent utrikesfödda'})
        [['Kommun', u'Procent utrikesfödda']]
        .set_index('Kommun')
)
utrikesfodda.head(3)
Procent utrikesfödda
Kommun
Botkyrka 40.1
Danderyd 15.3
Ekerö 11.0

Percent of mothers smoking while pregnant

rokande_grav = (
    pd.read_excel("Lev03.xlsx", header=3)
      .reset_index()
      .rename(columns={'index': 'Kommun', '2014': u'Rökande gravida, procent'})
      .assign(Kommun=lambda x: x['Kommun'].str.split())
      .assign(Kommun=lambda x: x['Kommun'].str[1:])
      .assign(Kommun=lambda x: x['Kommun'].str.join(" "))
      .dropna()
      .set_index('Kommun')
      .replace("..", 0)
      .assign(**{u'Rökande gravida, procent': lambda x: x[u'Rökande gravida, procent'].astype(float)})
)
rokande_grav.head(3)
Rökande gravida, procent
Kommun
Upplands Väsby 3.6
Vallentuna 2.3
Österåker 2.9

Rapes per 10k inhabitants

valdtakter = (
    pd.read_excel("Sex03.xlsx", header=3)
      .reset_index()
      .rename(columns={'index': 'Kommun', '2015': u'Våldtäkter per 10k inv'})
      .assign(Kommun=lambda x: x['Kommun'].str.split())
      .assign(Kommun=lambda x: x['Kommun'].str[1:])
      .assign(Kommun=lambda x: x['Kommun'].str.join(" "))
      .dropna()
      .set_index('Kommun')
      .assign(**{u'Våldtäkter per 10k inv': lambda x: x[u'Våldtäkter per 10k inv'].astype(float)})
)
valdtakter.head(3)
Våldtäkter per 10k inv
Kommun
Upplands Väsby 5.0
Vallentuna 1.9
Österåker 2.2

Asylum seekers per 1k inhabitants

asylsokande = (pd.read_excel("asylsdochm.xlsx", header=1)
        .dropna(subset=[u'Kommun'])
        .rename(columns={u'Antal asylsökande per': u'Antal asylsökande per 1000 invånare'})
        [['Kommun', u'Antal asylsökande per 1000 invånare']]
        .set_index('Kommun')
        .assign(**{u'Antal asylsökande per 1000 invånare': lambda x: x[u'Antal asylsökande per 1000 invånare'].astype(float)})

)
asylsokande.head(3)
Antal asylsökande per 1000 invånare
Kommun
Ljusnarsberg 138.26
Norberg 113.94
Hultsfred 78.33

Types of vehicles

fordon = (pd.read_excel("Fordon_lan_och_kommuner_2015.xlsx", header=5, sheetname=2)
        .dropna(subset=[u'Kommun'])
        .set_index('Kommun')
        .drop(['Kommun-', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], axis=1)
        .apply(lambda x: x.astype(float))
        .apply(lambda x: x/sum(x), axis=1) # Convert counts to each vehicle type's share per municipality
        .dropna()
        .rename(index=lambda x: x.strip())
        .rename(columns=lambda x: x.strip())
        .rename(columns=lambda x: x + ", 2015, procent")

)
fordon.head(3)
Personbilar, 2015, procent Lastbilar, 2015, procent Bussar, 2015, procent Motorcyklar, 2015, procent Mopeder, 2015, procent Traktorer, 2015, procent Snöskotrar, 2015, procent Terränghjulingar, 2015, procent Terrängskotrar1), 2015, procent Släpvagnar, 2015, procent
Kommun
Upplands Väsby 0.757944 0.069553 0.000088 0.042546 0.009148 0.007748 0.006128 0.005077 0.000175 0.101593
Vallentuna 0.672324 0.066290 0.000000 0.051673 0.009153 0.029813 0.008131 0.021327 0.000489 0.140801
Österåker 0.692083 0.069446 0.000252 0.047521 0.013716 0.015516 0.009396 0.010332 0.000252 0.141484

Since we now have every KPI as a DataFrame, let’s put them together with concat. Using join='inner' keeps only the municipalities that appear in every table.

concat = pd.concat([
        votes,
        utrikesfodda,
        medianalder, 
        brott, 
        medianinkomst, 
        arbetsloshet, 
        fortidspensionarer, 
        hogskoleutbildning, 
        befolkningsokning,
        foretagsklimat, 
        valdsbrott,
        knarkbrott,
        asylsokande,
        fordon,
        valdtakter,
        rokande_grav,
    ], 
    axis=1,
    join='inner')
concat.head(5)
SD M C FP KD S V MP FI ÖVR ... Bussar, 2015, procent Motorcyklar, 2015, procent Mopeder, 2015, procent Traktorer, 2015, procent Snöskotrar, 2015, procent Terränghjulingar, 2015, procent Terrängskotrar1), 2015, procent Släpvagnar, 2015, procent Våldtäkter per 10k inv Rökande gravida, procent
Kommun
Ale 10.64 21.08 4.95 3.28 3.50 33.59 6.25 5.70 0.18 10.83 ... 0.000141 0.054257 0.016427 0.039285 0.002159 0.009058 0.000188 0.150005 2.1 4.4
Alingsås 7.48 21.58 6.17 10.30 6.44 29.19 6.85 10.81 0.30 0.89 ... 0.000068 0.052143 0.009086 0.053940 0.001729 0.014883 0.000203 0.154902 4.8 3.8
Alvesta 14.44 19.70 14.89 1.74 3.12 28.77 4.08 3.49 0.20 9.56 ... 0.002950 0.042747 0.014082 0.082378 0.000946 0.015418 0.000000 0.182122 2.0 4.8
Aneby 9.38 11.19 20.46 4.06 14.82 31.30 2.19 5.75 0.05 0.80 ... 0.003179 0.042438 0.011197 0.119712 0.000968 0.017556 0.000000 0.199198 1.5 0.0
Arboga 9.75 20.54 5.53 7.57 4.08 39.04 5.77 5.62 0.14 1.96 ... 0.001810 0.043954 0.010256 0.060502 0.006722 0.009997 0.000172 0.170818 2.2 13.2

5 rows × 35 columns

Pandas gives us the method corr, which by default uses Pearson’s r to show correlations between columns. Let’s start off by using this simple frequentist approach.

corr = concat.corr()

# Keep only parties as columns
corr = corr.loc[:, corr.columns.isin(partier)] 
# Keep only KPIs as rows
corr = corr[~corr.index.isin(partier)]

And use Seaborn’s heatmap to give us a nice visual representation of the correlation table.

fig1 = plt.figure(figsize=(8, 8))
ax = sns.heatmap(
    corr, 
    annot=True, fmt='.1f', linewidths=.5
)

ax.text(-1., 1.141, 
        u'Korrelation med andel riksdagsröster',
        verticalalignment='bottom', horizontalalignment='left',
        transform=ax.transAxes,
        color='#000000', fontsize=14)

ax.text(-1., -0.181, 
        u'Källa: BRÅ/SCB/Val.se/Ekonomifakta.se – Korrelationen är på kommunnivå och Pearson-korr har använts',
        verticalalignment='bottom', horizontalalignment='left',
        transform=ax.transAxes,
        color='#666666', fontsize=8)

plt.savefig('SD_korr.pdf', bbox_inches='tight')

First off, Pearson’s r ranges from −1 to 1: zero means no correlation, and −1 or 1 a perfect negative or positive linear correlation. Depending on the textbook, anything above roughly 0.4 is a correlation to take seriously. But let’s get back to that in a moment.
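For reference, r is just the covariance of the two variables divided by the product of their standard deviations. A quick sanity check of the definition against NumPy, on made-up toy data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 3.8, 5.4])

# Pearson's r from its definition: cov(x, y) / (std(x) * std(y))
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# NumPy's built-in version should agree to floating point precision
r_numpy = np.corrcoef(x, y)[0, 1]
assert abs(r_manual - r_numpy) < 1e-12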

Some interesting correlations, though:

  • The right-wing populist party, the Sweden Democrats (SD in the figure)
    • Correlates with municipalities having large proportions of motorcycles,
    • and with women smoking while pregnant.
    • A high unemployment rate also shows some correlation.
  • The right-wing Moderaterna (M)
    • High percentage of mopeds
    • High percentage of cars
    • High education
    • High income
  • The Greens, Miljöpartiet (MP)
    • Pretty much the same as Moderaterna, only not as high an income.
  • Centerpartiet (C), historically a party for farmers
    • High correlation with tractors

So let’s look in detail at what a correlation of 0.5 actually is. Is it a fully linear relationship between smoking while pregnant and voting for xenophobic parties? Let’s plot the municipalities with these KPIs on the axes.

def corrplot_SD(kpi):
    (concat.replace(0, np.nan).dropna()
           .plot(kind='scatter', 
                 x=kpi, 
                 y=u'SD'))

corrplot_SD(u'Rökande gravida, procent')

[Scatter plot: Rökande gravida, procent (x) vs SD vote share (y)]

If you ask me, that looks like a correlation. But that’s also what I want to see, since the Sweden Democrats are pretty much as far away from my personal political opinions as you can get; I’d take pretty much any chance to discredit them. So let’s go further down the rabbit hole, but first plot the other KPIs.

corrplot_SD(u'Motorcyklar, 2015, procent')

[Scatter plot: Motorcyklar, 2015, procent (x) vs SD vote share (y)]

corrplot_SD(u'Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent')

[Scatter plot: Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent (x) vs SD vote share (y)]

We are now going to calculate the r coefficient in a Bayesian fashion with Markov chain Monte Carlo (MCMC). That will give us a distribution over r instead of a single number, which allows us to give an upper and lower bound for the correlation. This closely follows the method described by Philipp Singer here.

You should read his blog post if you want more details. And more generally, Monte Carlo simulation is a really appealing way to solve problems. Whenever I get the chance, I recommend Bayesian Methods for Hackers. The topics covered there should really be in any data scientist’s tool belt.
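Concretely, the generative model fitted in _model below is a bivariate normal with wide priors on the means and standard deviations and a flat prior on the correlation:

$$\mu_i \sim \mathcal{N}(0,\; 10^6), \qquad \sigma_i \sim \mathrm{Uniform}(0, 1000), \qquad \rho \sim \mathrm{Uniform}(-1, 1)$$

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \qquad (x_j, y_j) \sim \mathcal{N}(\mu, \Sigma)$$

Note that PyMC 2’s Normal is parameterized by precision, hence the 0.000001 for a variance of 10^6. The posterior over ρ is the Bayesian counterpart of Pearson’s r.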

import pymc  # PyMC 2.x
from pymc import Normal, Uniform, MvNormal
from numpy.linalg import inv
import numpy as np

def _model(data):
    # Vague priors: mu is Normal with precision 1e-6 (i.e. variance 1e6),
    # sigma is Uniform(0, 1000), and the correlation rho is Uniform(-1, 1)
    mu = Normal('mu', 0, 0.000001, size=2)
    sigma = Uniform('sigma', 0, 1000, size=2)
    rho = Uniform('r', -1, 1)
    
    # The precision (inverse covariance) matrix, built from sigma and rho
    @pymc.deterministic
    def precision(sigma=sigma, rho=rho):
        ss1 = float(sigma[0] * sigma[0])
        ss2 = float(sigma[1] * sigma[1])
        rss = float(rho * sigma[0] * sigma[1])
        return inv(np.mat([[ss1, rss], 
                           [rss, ss2]]))
    
    # The observed data, modeled as draws from a bivariate normal
    mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)
    
    return locals()
    
def analyze(kpi):
    freq_corr = corr.loc[kpi, 'SD']
    print "Standard Pearson's r correlation gave {}".format(freq_corr)
    data = np.array([concat['SD'].values, concat[kpi].values])
    model = pymc.MCMC(_model(data))
    model.sample(10000, 5000)  # 10000 iterations, first 5000 discarded as burn-in
    
    print "\nMCMC gives a lower and upper bound for this of {} and {}".format(
        *model.stats()['r']['95% HPD interval']
    )
analyze(u'Rökande gravida, procent')
Standard Pearson's r correlation gave 0.483367702644
 [-----------------100%-----------------] 10000 of 10000 complete in 25.1 sec
MCMC gives a lower and upper bound for this of 0.384673816867 and 0.569554905111

This tells us that the correlation is statistically sound. No p-values, just an interval that the true r lies within with 95 % probability. Even at the lower bound of 0.38, it’s a correlation worth mentioning. But of course, it does not tell us whether correlation implies causation.
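And since MCMC gives us the whole posterior, not just an interval, we can inspect it directly. A minimal sketch using the PyMC 2 API above, assuming a hypothetical variant of analyze that ends with return model (the original returns nothing):

model = analyze(u'Rökande gravida, procent')  # hypothetical: analyze must `return model`

# Every post-burn-in sample of the correlation coefficient
r_samples = model.trace('r')[:]

plt.hist(r_samples, bins=50)
plt.xlabel("Pearson's r")
plt.title(u'Posterior of r: SD vs Rökande gravida')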