Mapping clustered websites
TriggerScrape is a Python script for mapping out clusters of related websites, for example the cluster of Swedish sites containing highly anti-immigrant content.
It does this with the following procedure:
1. Start at some entry point with many outgoing links
2. Collect all outgoing links
3. Randomly choose a subsample of them and visit them
4. Count how many trigger words are found on those links
5. Visit them again with a probability set by the previous step
6. If the ratio of trigger words to visited links is high for a site, use that site as the next starting point and restart at (1)
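To make the loop concrete, here is a minimal sketch of the procedure above. It is not the actual TriggerScrape code and does not use Grab: `FAKE_WEB`, `fake_fetch`, `crawl`, the placeholder trigger words, the `threshold` value and the revisit probability `hits / (hits + 1)` are all assumptions made for illustration, and a small in-memory dictionary stands in for the real web so the example runs offline.

```python
import random
from urllib.parse import urlsplit

# Hypothetical stand-in for the real web: url -> (page text, outgoing links).
FAKE_WEB = {
    "http://hub.example": ("", ["http://a.example", "http://b.example", "http://c.example"]),
    "http://a.example": ("alpha beta alpha news", ["http://d.example"]),
    "http://b.example": ("weather sports", []),
    "http://c.example": ("alpha", []),
    "http://d.example": ("beta beta beta", []),
}


def fake_fetch(url):
    """Return (text, links) for a URL; unknown URLs yield an empty page."""
    return FAKE_WEB.get(url, ("", []))


def domain_of(url):
    """Reduce a URL to scheme://host so pages are grouped per site."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def count_triggers(text, trigger_words):
    """Count whitespace-separated tokens that are trigger words."""
    return sum(token in trigger_words for token in text.lower().split())


def crawl(entry, fetch, trigger_words, sample_size=10, threshold=1.0,
          max_rounds=5, rng=None):
    """Follow the steps above; return {domain: [triggered, n_links]}."""
    rng = rng or random.Random()
    stats = {}
    used_entries = set()

    def record(url, hits):
        totals = stats.setdefault(domain_of(url), [0, 0])
        totals[0] += hits   # trigger words seen on this site
        totals[1] += 1      # visits made to this site

    for _ in range(max_rounds):
        used_entries.add(domain_of(entry))
        _, links = fetch(entry)                                    # collect links
        sample = rng.sample(links, min(sample_size, len(links)))   # subsample
        for url in sample:
            text, _ = fetch(url)
            hits = count_triggers(text, trigger_words)
            record(url, hits)
            # Assumed reading of the revisit step: more hits, more likely revisit.
            if rng.random() < hits / (hits + 1):
                text, _ = fetch(url)
                record(url, count_triggers(text, trigger_words))
        # Pick the best-scoring site that has not been an entry point yet.
        candidates = [(t / n, d) for d, (t, n) in stats.items()
                      if d not in used_entries and t / n >= threshold]
        if not candidates:
            break
        entry = max(candidates)[1]
    return stats
```

Calling `crawl("http://hub.example", fake_fetch, {"alpha", "beta"})` walks from the hub to the highest-scoring sites in turn. In a real crawler, `fake_fetch` would be replaced by a function that downloads the page and extracts its text and links.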
In the end it produces a list such as:
domain | ratio | triggered | n_links |
---|---|---|---|
http://avpixlat.info | 7.774193548387097 | 210 | 31 |
http://petterssonsblogg.se | 4.835680751173709 | 817 | 213 |
http://gruvmor.wordpress.com | 3.8 | 28 | 10 |
http://thoralfalfsson.webblogg.se | 3.6484375 | 339 | 128 |
http://tobbesmedieblogg.blogspot.se | 2.583333333333333 | 19 | 12 |
http://galnegunnarsblogg.wordpress.com | 2.388888888888889 | 250 | 180 |
http://samnytt.se | 2.193548387096774 | 74 | 62 |
http://imittsverige.blogspot.se | 1.98 | 49 | 50 |
Given enough time, it will map out most of the sites in that cluster.
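A note on reading the table: every row shown is consistent with `ratio = 1 + triggered / n_links` (for example, 1 + 28 / 10 = 3.8). Whether the script computes it exactly this way is an assumption inferred from the numbers, not taken from its source. The offset does not affect the ordering. A small sketch with hypothetical helper names `ratio` and `to_table` that rebuilds such a listing from per-site counts:

```python
def ratio(triggered, n_links):
    """Score inferred from the published rows: 1 + triggered / n_links."""
    return 1 + triggered / n_links


def to_table(stats):
    """Render {domain: (triggered, n_links)} as a table sorted by ratio, highest first."""
    rows = sorted(
        ((ratio(t, n), domain, t, n) for domain, (t, n) in stats.items()),
        reverse=True,
    )
    lines = ["domain | ratio | triggered | n_links |", "---|---|---|---|"]
    lines += [f"{domain} | {r} | {t} | {n} |" for r, domain, t, n in rows]
    return "\n".join(lines)
```

For instance, `to_table({"http://gruvmor.wordpress.com": (28, 10), "http://avpixlat.info": (210, 31)})` lists avpixlat.info first because its ratio is higher.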
The script is built on top of the excellent Python library Grab and can be found on my GitHub if you are interested.