Mapping clustered websites

TriggerScrape is a Python script for mapping out clusters of related websites, for example the cluster of Swedish sites containing highly anti-immigrant content.

It does this by the following procedure:

1. Start at an entry point with many outgoing links.
2. Collect all outgoing links.
3. Randomly choose a subsample of them and visit those pages.
4. Count how many trigger words are found on each visited page.
5. Visit them again with a probability set by the previous step.
6. If the number of trigger words per visited link is high, use that site as the next starting point and restart at (1).
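The steps above can be sketched roughly as follows. This is not the TriggerScrape source: it uses only the standard library instead of Grab, and the names `crawl`, `visit`, `fetch`, `sample_size` and `threshold` are made up for illustration. Step 5 is read here as "expand a page's own links with a probability that grows with its trigger count", which is only one possible interpretation, and the threshold rule in step 6 is likewise an assumption.

```python
import random
import re
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def extract_links(base_url, html):
    """Return the absolute http(s) links found in an HTML string."""
    links = {urljoin(base_url, href) for href in HREF_RE.findall(html)}
    return sorted(u for u in links if urlparse(u).scheme in ("http", "https"))


def count_triggers(html, trigger_words):
    """Count trigger words in the raw page text (markup included, for brevity).

    trigger_words is expected to be a set of lowercase words.
    """
    return sum(w in trigger_words for w in re.findall(r"\w+", html.lower()))


def site_of(url):
    """Reduce a URL to scheme and host, e.g. http://example.org."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def visit(url, fetch, trigger_words, stats):
    """Fetch one page, add its trigger count to its site's totals."""
    html = fetch(url)
    hits = count_triggers(html, trigger_words)
    entry = stats.setdefault(site_of(url), {"triggered": 0, "n_links": 0})
    entry["triggered"] += hits
    entry["n_links"] += 1
    return html, hits


def crawl(entry_url, fetch, trigger_words, sample_size=10,
          threshold=1.0, max_rounds=20, seed=None):
    """Run the six steps; fetch is any callable mapping a URL to HTML text."""
    rng = random.Random(seed)
    stats, used, start = {}, set(), entry_url
    for _ in range(max_rounds):
        used.add(site_of(start))
        links = extract_links(start, fetch(start))                    # steps 1-2
        for url in rng.sample(links, min(sample_size, len(links))):   # step 3
            html, hits = visit(url, fetch, trigger_words, stats)      # step 4
            # Step 5 (assumed reading): the more trigger words a page has,
            # the more likely its own links are sampled and scored as well.
            if rng.random() < hits / (hits + 1):
                deeper = extract_links(url, html)
                for sub in rng.sample(deeper, min(sample_size, len(deeper))):
                    visit(sub, fetch, trigger_words, stats)
        # Step 6 (assumed rule): continue from the unused site with the
        # highest trigger count per visited link, if it exceeds the threshold.
        candidates = [
            (s["triggered"] / s["n_links"], site)
            for site, s in stats.items()
            if site not in used and s["triggered"] / s["n_links"] > threshold
        ]
        if not candidates:
            break
        start = max(candidates)[1]
    return stats
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library, so it can be tried against a small in-memory dictionary of pages before pointing it at real sites.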

It looks something like this:

[image]

In the end it produces a list such as:

domain	ratio	triggered	n_links
http://avpixlat.info	7.774193548387097	210	31
http://petterssonsblogg.se	4.835680751173709	817	213
http://gruvmor.wordpress.com	3.8	28	10
http://thoralfalfsson.webblogg.se	3.6484375	339	128
http://tobbesmedieblogg.blogspot.se	2.583333333333333	19	12
http://galnegunnarsblogg.wordpress.com	2.388888888888889	250	180
http://samnytt.se	2.193548387096774	74	62
http://imittsverige.blogspot.se	1.98	49	50

and if you give it enough time, it will map out most of the sites in that cluster.
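For anyone who wants to work with such a list, here is a small sketch of reading it back in. The function name `parse_results` is made up, and only three of the rows above are copied in. In the rows shown, the ratio column happens to equal 1 + triggered / n_links; whether the script actually defines it that way is an assumption on my part.

```python
import csv
import io

# Three rows copied from the listing above (tab-separated).
OUTPUT = (
    "domain\tratio\ttriggered\tn_links\n"
    "http://avpixlat.info\t7.774193548387097\t210\t31\n"
    "http://gruvmor.wordpress.com\t3.8\t28\t10\n"
    "http://imittsverige.blogspot.se\t1.98\t49\t50\n"
)


def parse_results(text):
    """Parse the tab-separated listing into dicts with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "domain": row["domain"],
            "ratio": float(row["ratio"]),
            "triggered": int(row["triggered"]),
            "n_links": int(row["n_links"]),
        })
    return rows


def ratio_matches(row, tolerance=1e-9):
    """Check the observed relation ratio == 1 + triggered / n_links."""
    return abs(row["ratio"] - (1 + row["triggered"] / row["n_links"])) < tolerance


rows = parse_results(OUTPUT)
print(all(ratio_matches(r) for r in rows))
```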

The script is built on top of the excellent Python library Grab and can be found on my GitHub if you are interested.