Investigation of voting patterns demographic correlations in Sweden
Sweden have a long tradition of keeping statistics of different KPI:s by municipality through many institutions. Let’s see what KPI:s correlate with different voting patterns. And while we’re at it, demonstrate some Pandas method chaining – and do a Bayesian analysis of the Pearson product-moment correlation coefficient often denoted as Persons r
.
Note that the conclusions might mainly be of intrest of Swedes, and there will probably slip in some Swedish. Quick wordlist: kommun = municipality, brott = crime, gravid = pregnant
.
TLDR; Sweden Democrats (SD) shows high correlation with mothers smoking while pregnant, unemployment and motorcycles. The results are a the bottom of the post if you’re not interested in the process.
And as usual, the Jupyter Notebook (former Ipython notebook) is here in a repo together with the data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import rcParams
rcParams['font.family'] = 'monospace'
rcParams['font.sans-serif'] = ['Courier']
Starting with the data from the Swedish election 2014, we can read it directly from an url with the very nice Pandas method read_html
.
Encapsulating the DataFrame votes
with parentheseses makes method chaining possible. Every method in fact returns a DataFrame which makes every preprocessing step just another step in the chain.
Percent votes in Swedish election 2014
votes = (pd.read_html('http://www.val.se/val/val2014/slutresultat/K/rike/index.html')[1]
.rename(columns={u'Område': 'Kommun'})
[['Kommun', 'SD', 'M', 'C', 'FP', 'KD', 'S', 'V', 'MP', 'FI', u'ÖVR', 'BLANK']]
.set_index('Kommun')
.apply(lambda x: x.str.replace(",", ".")
.str.replace("%", "").astype(float))
)
partier = votes.columns
votes.head(3)
SD | M | C | FP | KD | S | V | MP | FI | ÖVR | BLANK | |
---|---|---|---|---|---|---|---|---|---|---|---|
Kommun | |||||||||||
Ale | 10.64 | 21.08 | 4.95 | 3.28 | 3.50 | 33.59 | 6.25 | 5.70 | 0.18 | 10.83 | 1.39 |
Alingsås | 7.48 | 21.58 | 6.17 | 10.30 | 6.44 | 29.19 | 6.85 | 10.81 | 0.30 | 0.89 | 1.21 |
Alvesta | 14.44 | 19.70 | 14.89 | 1.74 | 3.12 | 28.77 | 4.08 | 3.49 | 0.20 | 9.56 | 1.98 |
Now let’s read some migration patterns straight of the web. I.e. what municipalitys that had a positive net growth in terms of people living there.
Net inhabitant growth
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17248'
col = u'Befolkningsökning, 2012-2015, procent, treårsbasis'
befolkningsokning = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
.set_index('Kommun')
)
befolkningsokning.head(3)
Befolkningsökning, 2012-2015, procent, treårsbasis | |
---|---|
Kommun | |
Ale | 3.7 |
Alingsås | 3.3 |
Alvesta | 2.9 |
Rating of how good a municipality is for companies
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17257'
col = u'Företagsklimat, ranking 1-290, 2015, ordningstal'
foretagsklimat = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].astype(int)})
.set_index('Kommun')
)
foretagsklimat.head(3)
Företagsklimat, ranking 1-290, 2015, ordningstal | |
---|---|
Kommun | |
Ale | 43 |
Alingsås | 185 |
Alvesta | 221 |
Percent people with a higher education
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17251'
col = u'Andel högskoleutbildade, 2015, procent'
hogskoleutbildning = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
.set_index('Kommun')
)
hogskoleutbildning.head(3)
Andel högskoleutbildade, 2015, procent | |
---|---|
Kommun | |
Ale | 18.9 |
Alingsås | 24.3 |
Alvesta | 15.0 |
Percent early retirees
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17256'
col = u'Andel förtidspensionärer, 2015, procent'
fortidspensionarer = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
.set_index('Kommun')
)
fortidspensionarer.head(3)
Andel förtidspensionärer, 2015, procent | |
---|---|
Kommun | |
Ale | 5.5 |
Alingsås | 6.5 |
Alvesta | 6.0 |
Median income
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17249'
col = u'Medianinkomst, 2014, kronor'
medianinkomst = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].str.replace(u"\xa0", u"").astype(int)})
.set_index('Kommun')
)
medianinkomst.head(3)
Medianinkomst, 2014, kronor | |
---|---|
Kommun | |
Ale | 273684 |
Alingsås | 259211 |
Alvesta | 241077 |
Unemployment rate
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17255'
col = u'Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent'
arbetsloshet = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
.set_index('Kommun')
)
arbetsloshet.head(3)
Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent | |
---|---|
Kommun | |
Ale | 4.9 |
Alingsås | 5.5 |
Alvesta | 9.3 |
Number of crimes per 100k inhabitants
brott = (pd.read_csv("brott_per_1000.txt", sep=";", encoding="latin-1")
.rename(columns={"Region": "Kommun", "/100 000 inv": "Brott per 100k inv"})
.groupby("Kommun")
.sum()
.reset_index()
[['Kommun', 'Brott per 100k inv']]
.assign(Kommun= lambda x: x['Kommun'].str.replace(" kommun", ""))
.set_index('Kommun'))
brott.head(3)
Brott per 100k inv | |
---|---|
Kommun | |
Ale | 8231 |
Alingsås | 10511 |
Alvesta | 8300 |
Violent crimes
valdsbrott = (pd.read_csv("vbrott_per_1000.txt", sep=";", encoding="latin-1")
.rename(columns={"Region": "Kommun", "/100 000 inv": u"Våldsbrott per 100k inv"})
.groupby("Kommun")
.sum()
.reset_index()
[['Kommun', u'Våldsbrott per 100k inv']]
.assign(Kommun= lambda x: x['Kommun'].str.replace(" kommun", ""))
.set_index('Kommun'))
valdsbrott.head(3)
Våldsbrott per 100k inv | |
---|---|
Kommun | |
Ale | 823 |
Alingsås | 870 |
Alvesta | 967 |
Drug crimes
knarkbrott = (pd.read_csv("knarkbrott_per_1000.txt", sep=";", encoding="latin-1")
.rename(columns={"Region": "Kommun", "/100 000 inv": u"Narkotikabrott per 100k inv"})
.groupby("Kommun")
.sum()
.reset_index()
[['Kommun', u'Narkotikabrott per 100k inv']]
.assign(Kommun= lambda x: x['Kommun'].str.replace(" kommun", ""))
.set_index('Kommun'))
knarkbrott.head(3)
Narkotikabrott per 100k inv | |
---|---|
Kommun | |
Ale | 402 |
Alingsås | 1487 |
Alvesta | 327 |
Mean age
url = 'http://www.ekonomifakta.se/Fakta/Regional-statistik/Din-kommun-i-siffror/Nyckeltal-for-regioner/?var=17247'
col = u'Medelålder, 2015, år'
medianalder = (
pd.read_html(url, thousands=' ')[0]
.assign(**{col: lambda x: x[col].str.replace(u",", u".").astype(float)})
.set_index('Kommun')
)
medianalder.head(3)
Medelålder, 2015, år | |
---|---|
Kommun | |
Ale | 39.9 |
Alingsås | 42.0 |
Alvesta | 42.1 |
Percent born in another country than Sweden
utrikesfodda = (pd.read_excel("BE0101-Utrikes-fodda-kom-fland.xlsx", header=5)
.dropna(subset=[u'Kommun'])
.assign(procent= lambda x: x[u'I procent'].astype(float))
.rename(columns={'procent': u'Procent utrikesfödda'})
[['Kommun', u'Procent utrikesfödda']]
.set_index('Kommun')
)
utrikesfodda.head(3)
Procent utrikesfödda | |
---|---|
Kommun | |
Botkyrka | 40.1 |
Danderyd | 15.3 |
Ekerö | 11.0 |
Percent mothers smoking while pregnant
rokande_grav = (
pd.read_excel("Lev03.xlsx", header=3)
.reset_index()
.rename(columns={'index': 'Kommun', '2014': u'Rökande gravida, procent'})
.assign(Kommun=lambda x: x['Kommun'].str.split())
.assign(Kommun=lambda x: x['Kommun'].str[1:])
.assign(Kommun=lambda x: x['Kommun'].str.join(" "))
.dropna()
.set_index('Kommun')
.replace("..", 0)
.assign(**{u'Rökande gravida, procent': lambda x: x[u'Rökande gravida, procent'].astype(float)})
)
rokande_grav.head(3)
Rökande gravida, procent | |
---|---|
Kommun | |
Upplands Väsby | 3.6 |
Vallentuna | 2.3 |
Österåker | 2.9 |
Rapes per 10k inhabitants
valdtakter = (
pd.read_excel("Sex03.xlsx", header=3)
.reset_index()
.rename(columns={'index': 'Kommun', '2015': u'Våldtäkter per 10k inv'})
.assign(Kommun=lambda x: x['Kommun'].str.split())
.assign(Kommun=lambda x: x['Kommun'].str[1:])
.assign(Kommun=lambda x: x['Kommun'].str.join(" "))
.dropna()
.set_index('Kommun')
.assign(**{u'Våldtäkter per 10k inv': lambda x: x[u'Våldtäkter per 10k inv'].astype(float)})
)
valdtakter.head(3)
Våldtäkter per 10k inv | |
---|---|
Kommun | |
Upplands Väsby | 5.0 |
Vallentuna | 1.9 |
Österåker | 2.2 |
Asylum seekers per 1k inhabitants
asylsokande = (pd.read_excel("asylsdochm.xlsx", header=1)
.dropna(subset=[u'Kommun'])
.rename(columns={u'Antal asylsökande per': u'Antal asylsökande per 1000 invånare'})
[['Kommun', u'Antal asylsökande per 1000 invånare']]
.set_index('Kommun')
.assign(**{u'Antal asylsökande per 1000 invånare': lambda x: x[u'Antal asylsökande per 1000 invånare'].astype(float)})
)
asylsokande.head(3)
Antal asylsökande per 1000 invånare | |
---|---|
Kommun | |
Ljusnarsberg | 138.26 |
Norberg | 113.94 |
Hultsfred | 78.33 |
Types of vehicles
fordon = (pd.read_excel("Fordon_lan_och_kommuner_2015.xlsx", header=5, sheetname=2)
.dropna(subset=[u'Kommun'])
.set_index('Kommun')
.drop(['Kommun-', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], axis=1)
.apply(lambda x: x.astype(float))
.apply(lambda x: x/sum(x), axis=1) # Convert to percentages
.dropna()
.rename(index=lambda x: x.strip())
.rename(columns=lambda x: x.strip())
.rename(columns=lambda x: x + ", 2015, procent")
)
fordon.head(3)
Personbilar, 2015, procent | Lastbilar, 2015, procent | Bussar, 2015, procent | Motorcyklar, 2015, procent | Mopeder, 2015, procent | Traktorer, 2015, procent | Snöskotrar, 2015, procent | Terränghjulingar, 2015, procent | Terrängskotrar1), 2015, procent | Släpvagnar, 2015, procent | |
---|---|---|---|---|---|---|---|---|---|---|
Kommun | ||||||||||
Upplands Väsby | 0.757944 | 0.069553 | 0.000088 | 0.042546 | 0.009148 | 0.007748 | 0.006128 | 0.005077 | 0.000175 | 0.101593 |
Vallentuna | 0.672324 | 0.066290 | 0.000000 | 0.051673 | 0.009153 | 0.029813 | 0.008131 | 0.021327 | 0.000489 | 0.140801 |
Österåker | 0.692083 | 0.069446 | 0.000252 | 0.047521 | 0.013716 | 0.015516 | 0.009396 | 0.010332 | 0.000252 | 0.141484 |
Since we now have every KPI as a DataFrame, let’s put them together with concat
.
concat = pd.concat([
votes,
utrikesfodda,
medianalder,
brott,
medianinkomst,
arbetsloshet,
fortidspensionarer,
hogskoleutbildning,
befolkningsokning,
foretagsklimat,
valdsbrott,
knarkbrott,
asylsokande,
fordon,
valdtakter,
rokande_grav,
],
axis=1,
join='inner')
concat.head(5)
SD | M | C | FP | KD | S | V | MP | FI | ÖVR | ... | Bussar, 2015, procent | Motorcyklar, 2015, procent | Mopeder, 2015, procent | Traktorer, 2015, procent | Snöskotrar, 2015, procent | Terränghjulingar, 2015, procent | Terrängskotrar1), 2015, procent | Släpvagnar, 2015, procent | Våldtäkter per 10k inv | Rökande gravida, procent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Kommun | |||||||||||||||||||||
Ale | 10.64 | 21.08 | 4.95 | 3.28 | 3.50 | 33.59 | 6.25 | 5.70 | 0.18 | 10.83 | ... | 0.000141 | 0.054257 | 0.016427 | 0.039285 | 0.002159 | 0.009058 | 0.000188 | 0.150005 | 2.1 | 4.4 |
Alingsås | 7.48 | 21.58 | 6.17 | 10.30 | 6.44 | 29.19 | 6.85 | 10.81 | 0.30 | 0.89 | ... | 0.000068 | 0.052143 | 0.009086 | 0.053940 | 0.001729 | 0.014883 | 0.000203 | 0.154902 | 4.8 | 3.8 |
Alvesta | 14.44 | 19.70 | 14.89 | 1.74 | 3.12 | 28.77 | 4.08 | 3.49 | 0.20 | 9.56 | ... | 0.002950 | 0.042747 | 0.014082 | 0.082378 | 0.000946 | 0.015418 | 0.000000 | 0.182122 | 2.0 | 4.8 |
Aneby | 9.38 | 11.19 | 20.46 | 4.06 | 14.82 | 31.30 | 2.19 | 5.75 | 0.05 | 0.80 | ... | 0.003179 | 0.042438 | 0.011197 | 0.119712 | 0.000968 | 0.017556 | 0.000000 | 0.199198 | 1.5 | 0.0 |
Arboga | 9.75 | 20.54 | 5.53 | 7.57 | 4.08 | 39.04 | 5.77 | 5.62 | 0.14 | 1.96 | ... | 0.001810 | 0.043954 | 0.010256 | 0.060502 | 0.006722 | 0.009997 | 0.000172 | 0.170818 | 2.2 | 13.2 |
5 rows × 35 columns
Pandas gives us the method corr
that by default uses Persons r
to show correlations between columns. Let’s start of by using this simple frequentistic approach.
corr_matrix = concat.corr()
corr = concat.corr()
# Keep only parties as columns
corr = corr.loc[:, corr.columns.isin(partier)]
# Keep only KPIs as rows
corr = corr[~corr.index.isin(partier)]
And use Seaborn:s heatmap
to give us a nice visual representation of the correlation table.
fig1 = plt.figure(figsize=(8, 8))
ax = sns.heatmap(
corr,
annot=True, fmt='.1f', linewidths=.5
)
ax.text(-1., 1.141,
u'Korrelation med andel riksdagsröster',
verticalalignment='bottom', horizontalalignment='left',
transform=ax.transAxes,
color='#000000', fontsize=14)
ax.text(-1., -0.181,
u'Källa: BRÅ/SCB/Val.se/Ekonomifakta.se – Korrelationen är på kommunnivå och Pearson-korr har använts',
verticalalignment='bottom', horizontalalignment='left',
transform=ax.transAxes,
color='#666666', fontsize=8)
plt.savefig('SD_korr.pdf', bbox_inches='tight')
First of, a Persons r
correlation ranges from 0 to 1. Zero meaning no correlation and one full correlation. Depending on textbook, somewhere over 0.4 will be a correlation to take seriously. But let’s get back to that in a moment.
Some interesting correlations are though:
- The right-wing populist party the Sweden Democrats (SD in the figure)
- Correlations with municipailtys having large porportions motorcycles,
- and women smoking while pregnant.
- Also high unemployment rate shows some correlation.
- Right wing Moderaterna (M)
- High percentage mopeds
- High percentage cars
- High education
- High income
- The Greens Miljöpartiet (MP)
- Pretty much the same as Moderaterna. Only not that high income.
- Centerpartiet (C) historically a party for the farmers
- High correlation with tractors
So let’s go into detail what a correlation of 0.5 actually is. Is it a fully linear correlation between smoking while pregnant and votage on xenophobic parties? Let’s plot the municipalitys with these KPI:s on the axis.
def corrplot_SD(kpi):
(concat.replace(0, np.nan).dropna()
.plot(kind='scatter',
x=kpi,
y=u'SD'))
corrplot_SD(u'Rökande gravida, procent')
If you ask me, that looks like a correlation. But that’s also what I want to see since the Sweden Democrats is pretty much as far away from my personal political opinions you can go. I’d pretty much take any chance to misscredit them. So let’s go further down the rabbit hole, but first plot the other KPIs.
corrplot_SD(u'Motorcyklar, 2015, procent')
corrplot_SD(u'Öppen arbetslöshet (Arbetsförmedlingen), 2015, procent')
We are now going to calculate the r
coefficient in a Bayesian fashion with Monte Carlo Markov Chains (MCMC). That will give us a distribution over r
instead of a single number which allows us to give a upper and lower bound of the correlation. This closely follows the method described by Philipp Singer here.
You should read his blog post if you want more details. And in more general, Monte Carlo simulations is a really appealing way to solve problems. Whenever I get the chance, I recommend Bayesian methods for hackers. The topics covered there should really be in any data scientist tool belt.
import pymc as pymc
from pymc import Normal, Uniform, MvNormal, Exponential
from numpy.linalg import inv, det
from numpy import log, pi, dot
import numpy as np
from scipy.special import gammaln
def _model(data):
mu = Normal('mu', 0, 0.000001, size=2)
sigma = Uniform('sigma', 0, 1000, size=2)
rho = Uniform('r', -1, 1)
@pymc.deterministic
def precision(sigma=sigma, rho=rho):
ss1 = float(sigma[0] * sigma[0])
ss2 = float(sigma[1] * sigma[1])
rss = float(rho * sigma[0] * sigma[1])
return inv(np.mat([[ss1, rss],
[rss, ss2]]))
mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)
return locals()
def analyze(kpi):
freq_corr = corr.loc[kpi, 'SD']
print "Standard Persons r correlations gave {}".format(freq_corr)
data = np.array([concat['SD'].values, concat[kpi].values])
model = pymc.MCMC(_model(data))
model.sample(10000, 5000)
print "\nMCMC gives an lower and upper bound for this of {} and {}".format(
*model.stats()['r']['95% HPD interval']
)
analyze(u'Rökande gravida, procent')
Standard Persons r correlations gave 0.483367702644
[-----------------100%-----------------] 10000 of 10000 complete in 25.1 sec
MCMC gives an lower and upper bound for this of 0.384673816867 and 0.569554905111
This tells us that the correlation is statistically sound. No p
-values, just a interval that the true r
should be in with 95 % probability. Even if it’s just the lower bound 0.38 it’s a correlation worth mentioning. But of course, it does not tell us if correlation imply causation.