Bruteforce, Cloud stylez.

Way back when, I wrote about scraping cities in North America using Mechanize: http://demetriusmichael.com/2017/07/24/2013-03-19-scraping-cities.html

60k cities is small time, and yes, brute force will always be satisfying. But it’s way easier to just do the research first. Spend ten seconds googling and you’ll see that some asshole has already aggregated coordinates for every city on the planet (http://www.maxmind.com/en/worldcities), so we never needed to destroy some other poor guy’s web server in the first place!

Like every list on the internet, it needs a little housecleaning. This one doesn’t use a standard encoding, so we either have to transcode the strings to UTF-8 or, in my case, not give two shits and delete whatever doesn’t look remotely like English.
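If you want to see what that looks like, here’s a rough sketch of the “keep only the English-looking rows” approach. The filename and the Latin-1 source encoding are my guesses, not something from the gist:

```ruby
# Transcode each line from Latin-1 to UTF-8, then keep only rows that are
# plain ASCII; anything else gets thrown away. (Filename is assumed.)
File.open("worldcitiespop_clean.txt", "w") do |out|
  File.foreach("worldcitiespop.txt", encoding: "ISO-8859-1:UTF-8") do |line|
    out.write(line) if line.valid_encoding? && line.ascii_only?
  end
end
```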

The other pain point is that the list uses country codes rather than country names, so I have to normalize the data. Not a big deal, but country names are more memorable and easier for us meat sacks to query.
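Normalizing is just a lookup from ISO 3166 two-letter codes to names. A minimal sketch; only a handful of codes are shown here, and in practice you’d load the full table:

```ruby
# Tiny illustrative lookup table; the real thing has ~250 entries.
COUNTRY_NAMES = {
  "us" => "United States",
  "ca" => "Canada",
  "gb" => "United Kingdom",
  "de" => "Germany"
  # ...and so on for the rest of the codes
}

def country_name(code)
  COUNTRY_NAMES.fetch(code.downcase, code) # fall back to the raw code if unknown
end

country_name("CA") # => "Canada"
```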

Here’s the code:

https://gist.github.com/D3MZ/5390631
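I won’t paste the gist inline, but the shape of it is roughly this: parse the CSV, insert each city into Mongo one document at a time. This is a sketch only, written against the current mongo gem API; the column names follow the MaxMind layout, and the “cities” database/collection names are my assumptions:

```ruby
require "csv"
require "mongo"

# Rough sketch, not the gist itself: one insert per CSV row.
client = Mongo::Client.new(["127.0.0.1:27017"], database: "cities")
cities = client[:cities]

CSV.foreach("worldcitiespop_clean.txt", headers: true) do |row|
  cities.insert_one(
    country: row["Country"], # swap in the country-name lookup from above
    name:    row["AccentCity"],
    coords:  [row["Longitude"].to_f, row["Latitude"].to_f]
  )
end
```

One document inserted at a time, millions of rows: that’s where the numbers below come from.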

The script above uses ~3GB of memory and inserts maybe 10-50 cities per second on my laptop. Since the Earth has millions of cities and I’m impatient, I’m going to brute force it anyway: on the cloud.

Now for the fun part:

  1. First, we spin up the biggest, dumbest server money can rent.
  2. Run the script with the “parallel” hook (see the sketch after this list).
  3. Export and zip the Mongo database.
  4. Import it wherever you want it (sketch after the gist below).
  5. You’re done.
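For step 2, I’m reading the “parallel” hook as something along the lines of the Ruby parallel gem; that’s a guess on my part, not gospel. The idea is just to fan the rows out over every core:

```ruby
require "csv"
require "parallel"

# Guesswork sketch of step 2: split the rows across processes.
# insert_city is a stand-in for whatever the script's per-row Mongo insert is.
rows = CSV.read("worldcitiespop_clean.txt", headers: true)

Parallel.each(rows, in_processes: Parallel.processor_count) do |row|
  insert_city(row)
end
```

One wrinkle worth noting: each forked worker should open its own Mongo connection rather than sharing the parent’s.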

https://gist.github.com/D3MZ/5390801
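That gist handles the cloud run itself. The export/import dance (steps 3 and 4) is just mongodump and mongorestore; here’s a minimal sketch shelled out from Ruby, assuming the database is named “cities”:

```ruby
db = "cities" # assumption: the database name used by the import script

# Step 3: dump and compress on the cloud box.
system("mongodump --db #{db} --out dump") or abort("mongodump failed")
system("tar czf #{db}.tar.gz dump/#{db}") or abort("tar failed")

# ...copy the tarball down however you like (scp, S3, etc.)...

# Step 4: restore wherever you want the data.
system("tar xzf #{db}.tar.gz")
system("mongorestore --db #{db} dump/#{db}") or abort("mongorestore failed")
```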