Web scraping geographic coordinates: Global Ecovillage Network

Over the last couple of weeks, I've been learning how to scrape latitude-longitude data from the web. In today's example, I scrape the geographic coordinates from an embedded map on the Global Ecovillage Network's online directory.
Global Ecovillage Network map (left) reproduced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License; both maps copyright 2022 Google, INEGI.

Overview

The Global Ecovillage Network’s (GEN) website contains a directory of hundreds of ecovillages around the world.

My goal: to efficiently transfer all ecovillage data points from GEN’s embedded map to my own Google My Maps.

My approach: download GEN’s geographic data, extract and convert the data to CSV, and upload the CSV to Google My Maps.

Step-by-step

In trying to figure out how to scrape data from embedded maps, I came across this site, which suggested inspecting the HTML for geographic data. Searching the code for terms like “lat”, I discovered that the geographic coordinate data were all buried right in the HTML!

So using wget in bash, I downloaded the HTML:

wget -O GEN_map.html https://ecovillage.org/projects/map/

Exploring the HTML code, I found that the latitude-longitude data for all the points on the map were in one very, very, very long line of code (in this case, line 162):

All of the geographic coordinate data were in one very, very, very long line of HTML code (line 162), which continues well beyond the bottom of this screenshot.
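
For anyone who would rather not scroll through the file by eye, here is a minimal sketch of one way to find which line holds the coordinates. It runs against a tiny made-up stand-in for the page (the sample content and the variable names are assumptions for illustration, not GEN's real HTML); pointing the same awk command at GEN_map.html should work the same way.

```bash
# Build a tiny stand-in for the downloaded page (made-up content, not GEN's real HTML)
sample=$(mktemp)
printf '%s\n' \
    '<html>' \
    '<head><title>Map</title></head>' \
    '<body class="page-template">var pts=[{"ID":"1","post_title":"Example Village","post_type":"project","lat":12.5,"lng":-45.25}];</body>' \
    '</html>' > "$sample"

# Print the line number and length of every line that mentions "lat":
lat_lines=$(awk '/"lat":/ {print NR ": " length($0) " chars"}' "$sample")
echo "$lat_lines"

rm -f "$sample"
```

An unusually large character count next to a line number is a good hint that the whole dataset is packed into that one line.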

Next, I ran a series of piped commands to extract and organize the text and format it as a CSV file:

grep '^<body class=["]page-template' GEN_map.html | # extract line containing coordinate data
    awk '{gsub("},{","\n")}1' | # split most ecovillage records onto separate lines
    awk '{gsub("}","\n")}1' | # break lines at any remaining closing braces
    awk '{gsub("{","\n")}1' | # break lines at any remaining opening braces
    grep '^\"ID\"' | # extract only lines containing ecovillage records
    awk '{gsub("\"ID\":\"","")}1' | # remove string '"ID":"'
    awk '{gsub("\",\"post_title\":\"","\t")}1' | # replace string '","post_title":"' with tab
    awk '{gsub("\",\"post_type\":\"","\t")}1' | # replace string '","post_type":"' with tab
    awk '{gsub("\",\"lat\":","\t")}1' | # replace string '","lat":' with tab 
    awk '{gsub(",\"lng\":","\t")}1' | # replace string ',"lng":' with tab
    awk '{gsub(",","")}1' | # remove commas (in preparation for writing CSV)
    sed 's/\\//g' | # remove backslashes (they mess up the syntax)
    sed 's/\"//g' | # remove quotation marks (they mess up the syntax)
    sed 's/\t/,/g' | # replace tabs with commas
    sed '1s/^/ID,post_title,post_type,lat,long\n/' > GEN_map_LatLong.csv # prepend header row and write to CSV

Figuring out most of the commands above was an iterative process of going back and forth between the script and the output to work out (1) where to put line breaks, (2) what extraneous text to remove, (3) where to put tab breaks, and (4) which troublesome characters I needed to remove for clean syntax. I chose to start by separating fields with tabs and then replace the tabs with commas only after removing all the troublesome commas.
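
To see why the tabs-first order matters, here is a small sketch that applies the same substitutions to a single made-up record whose title contains a comma. The record and its values are hypothetical, shaped only to match what the commands above expect, and the substitutions are folded into one awk call for brevity.

```bash
# One made-up record in the shape the pipeline expects (values are hypothetical)
record='"ID":"7","post_title":"Hill Farm, North","post_type":"project","lat":12.5,"lng":-45.25'

row=$(printf '%s\n' "$record" | awk '{
    gsub("\"ID\":\"", "")                 # drop the leading field label
    gsub("\",\"post_title\":\"", "\t")    # field boundaries become tabs
    gsub("\",\"post_type\":\"", "\t")
    gsub("\",\"lat\":", "\t")
    gsub(",\"lng\":", "\t")
    gsub(",", "")                         # any comma still present sits inside a field
    gsub("\"", "")                        # strip leftover quotation marks
    gsub("\t", ",")                       # only now turn tabs into CSV commas
    print
}')
echo "$row"
```

If the tabs were converted to commas before the stray comma in the title was removed, that title would split into two columns and shift the coordinates out of place.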

Keep in mind that the set of commands I used is specific to GEN’s formatting. This script probably would need to be modified to work with other websites.
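
Because a pipeline like this is easy to get subtly wrong, a quick sanity check on the CSV before importing it can help. The sketch below is not part of the original workflow: it assumes the five-column layout from the header above and flags any data row that has the wrong number of fields or coordinates outside the valid latitude and longitude ranges. The sample rows are made up.

```bash
# Made-up CSV with one good row and two deliberately bad ones
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID,post_title,post_type,lat,long
1,Example Village,project,12.5,-45.25
2,Bad Latitude,project,95.0,10
3,Missing Field,project,1.0
EOF

# Print the line number of every data row that fails the checks:
# exactly 5 fields, latitude within [-90, 90], longitude within [-180, 180]
bad_rows=$(awk -F, 'NR > 1 && (NF != 5 || $4 < -90 || $4 > 90 || $5 < -180 || $5 > 180) {print NR}' "$csv")
echo "$bad_rows"

rm -f "$csv"
```

An empty result means every row passed; otherwise the listed line numbers point at the records worth inspecting by hand.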

Finally, I imported the CSV into Google Maps:

[copyright 2022 Google, INEGI]

The results

Here’s a side-by-side comparison from a zoomed-in portion of the two maps: one from GEN and the one I created.

Zoomed-in portion of the GEN map.
[screenshot from Global Ecovillage Network; reproduced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License; Map copyright 2022 Google, INEGI]
Portion of the map I generated in Google Maps (My Maps) using latitude-longitude data downloaded from GEN. All points from the GEN map were transferred successfully!
[copyright 2022 Google, INEGI]

It looks like all the points were transferred successfully!


Was this useful for you? How would you have done it? I’d love to hear your thoughts in the comments below!

