Many towns in Sweden share the same atomic parts in their names. -hult is common in Småland -arp is common in Skåne. I got curious of how they cluster geographically. Let’s load up a DataFrame with the positions of Swedish towns that I made by querying Google for the position of every Swedish town I could find on Wikipedia.

This notebook is also available as a Jupyter Notebook here if you want to execute the code yourself.

Scroll past the code if you’re only here for the maps.

import pandas as pd

df = pd.read_csv('swe_towns_latlon.csv', encoding='utf-8')
df['town'] = df['town'].str.lower()

And then do some crappy character n-gram extractions.

def find_features(s):
    feats = [s[i:i+7] for i in xrange(len(s)-6)] 
    feats += [s[i:i+6] for i in xrange(len(s)-5)] 
    feats += [s[i:i+5] for i in xrange(len(s)-4)] 
    feats += [s[i:i+4] for i in xrange(len(s)-3)] 
    feats += [s[i:i+3] for i in xrange(len(s)-2)] 
    return feats

features = []
for town in df['town'].values:
    features.extend(find_features(town))

To be able to count the occurances.

features = list(set(features))
final_features = []
occurances = []

for feature in features:
    try:
        occurances.append(sum(df['town'].str.contains(feature)))
        final_features.append(feature)
    except:
        pass
    
df_features = pd.DataFrame({
    'feature': final_features, 
    'occurances': occurances 
})

So now we get interesting parts of town names. Sanity check (at least if you’re Swedish) is that sta, hult and so on are present.

print ", ".join(df_features.sort('occurances', ascending=False).head(100)['feature'].values)

sta, tor, ing, erg, nge, orp,  oc,  och , och , och, ch ,  och, vik, torp, ra , ber, ors, berg, und, sjö, tra, and, str, näs, ter, inge, stra, tad, stad, den, for, fors, mar, ken, sto, olm, ste, äst, storp, stor, lla, lle, all, dal, hol, ngs, holm, äll, lst, tra , ång, ers, orr, stra , red, ran, nda, arp, ham, ill, est, ung, äck, ten, ster, gen, jör, rby, jär, nne, cke, ult, amm, mma, lin, sun, ryd, ack, sby, äng, eby, byn, len, lan, sund, ling, ker, nna, mmar, löv, lun, ike, bäck, bro, tan, nga, bäc, ård, rst, tte

I went ahead and took out the ones I deemed interesting. Now let’s plot them to se how they group geographically.

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Helvetica Neue']
%matplotlib inline

def plot(data, mult=0.9):
    """ Take data with 'part-of-towns-name' and color and plot it. Use `mult` for easy scaling of plot """
    
    def get_coordinates(townpart):
        try:
            townpart = townpart.decode('utf-8')
        except:
            pass
        return df[df['town'].str.contains(townpart)]['lat'].values, df[df['town'].str.contains(townpart)]['lon'].values

    fig = plt.figure(figsize=(20*mult, 16*mult), dpi=200)

    m = Basemap(
        projection='merc',
        resolution='i', 
        area_thresh=250,
        llcrnrlon=9.5, 
        llcrnrlat=54.5,
        urcrnrlon=24.5, 
        urcrnrlat=69.5
    ) 

    m.drawcoastlines(linewidth=0, color="#000000") 
    m.drawcountries()
    m.drawstates()
    m.drawmapboundary()
    m.fillcontinents(color='black', lake_color='white', zorder=0)
    m.drawmapboundary(fill_color='white')
    title = plt.title(u'Town names containing:', fontsize=16*mult) 
    title.set_y(1.01) 

    for townpart in data.keys():
        lats, lons = get_coordinates(townpart)
        x, y = m(lons, lats)
        m.scatter(x, y, marker='o', s=30*mult, alpha=1, label=townpart, edgecolors='none', c=data[townpart])

    plt.legend(loc=2, fontsize=16*mult)
    plt.show()


plot({
    'arp': '#00ff00',
    'holm': '#ff0000',
})

png

plot({
    'sta': '#ff00ff',
    'ing': '#00ffff',
})

png

plot({
    'hult': '#ffff00',
    'fors': '#00ff00',
})

png

plot({
    'vik': '#4f96c5',
    'torp': '#00ff00',
})

png

plot({
    u'näs': '#b15928',
    'ryd': '#08a060',
})

png

plot({
    'tuna': '#33cc33',
    'hammar': '#ff0066',
})

png

plot({
    u'köping': '#cc99ff',
    u'bruk': '#ffff00',
})

png

plot({
    'stor': '#00ccff',
    'sund': '#e3c471',
})

png

plot({
    'stad': '#ffcc00',
    'berg': '#9933ff',
})

png

Berg is really common in the old mining regions called Bergslagen.

plot({
    u'ån': '#ffccff',
    u'ås': '#ee5657'
})

png