Scraping Wikipedia User Data with Google Spreadsheets

creative commons licensed (BY-SA) flickr photo shared by nojhan

Alice Campbell in the VCU library hosted a Wikipedia edit-a-thon today. It was interesting, and a variety of faculty and even some students showed up. At one point Gardner jokingly asked whether we had a leaderboard for edits. It got me thinking. I remembered that Wikipedia keeps track of the edits of logged-in users, and I figured I’d take a shot at scraping some of that data so we’d have a rough idea of how many edits were made by our group. I started off by looking at the contributions page. This URL will get you the page for my user name. I used the IMPORTHTML formula in Google Spreadsheets. It was easy because this was the first list on the page. You can see in the image above that you have the choice between trying to grab a list or a table. The other variable is the element’s position on the page, counting from the top. You can see the working document embedded below. I considered parsing out the ..(+30).. but, after talking to Alice, I realized that wasn’t the kind of data that would travel well. She was more interested in the number of edits which, as it turns out, is available on the Edit […]
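For anyone curious what "the first list on the page" means in practice, here is a rough Python sketch of the selection idea: pick either lists or tables, then count from the top. This is not how Google implements IMPORTHTML; the class and function names (`ImportHtmlSketch`, `import_html`) and the sample HTML are made up for illustration, and the sketch assumes lists and tables are not nested inside each other.

```python
from html.parser import HTMLParser


class ImportHtmlSketch(HTMLParser):
    """Collect the rows of the index-th list or table (1-based) in a page.

    Hypothetical helper; assumes no nested lists or tables.
    """

    def __init__(self, query, index):
        super().__init__()
        if query == "table":
            self.container_tags, self.cell_tags = {"table"}, {"td", "th"}
        else:
            self.container_tags, self.cell_tags = {"ul", "ol"}, {"li"}
        self.query = query
        self.index = index
        self.seen = 0         # matching containers opened so far
        self.active = False   # True while inside the requested container
        self.in_cell = False  # True while inside a cell or list item
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag in self.container_tags:
            self.seen += 1
            self.active = self.seen == self.index
        elif self.active and tag == "tr":
            self.rows.append([])
        elif self.active and tag in self.cell_tags:
            # Each list item is its own row; table cells join the current row.
            if self.query == "list" or not self.rows:
                self.rows.append([])
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in self.container_tags:
            self.active = False
        elif tag in self.cell_tags:
            self.in_cell = False

    def handle_data(self, data):
        if self.active and self.in_cell:
            self.rows[-1][-1] += data


def import_html(html, query, index):
    """Return the rows of the index-th "list" or "table" found in html."""
    parser = ImportHtmlSketch(query, index)
    parser.feed(html)
    return [[cell.strip() for cell in row] for row in parser.rows]


# Made-up sample page: two lists with a table between them.
SAMPLE = """
<ul><li>first edit</li><li>second edit</li></ul>
<table><tr><th>User</th><th>Edits</th></tr>
<tr><td>example_user</td><td>12</td></tr></table>
<ul><li>other list</li></ul>
"""

first_list = import_html(SAMPLE, "list", 1)
first_table = import_html(SAMPLE, "table", 1)
second_list = import_html(SAMPLE, "list", 2)
```

Lists and tables are counted separately, so the table in the sample is table number 1 even though a list comes before it.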

Little Trick, Big Numbers

I often want to know just a bit more about numbers I see in tables. As I was looking at something today, I stumbled on the Wikipedia page for “List of Most Viewed YouTube Videos”. After being more than a bit amazed at the utterly staggering numbers, I wanted to know what they translated to in terms of years, because the numbers were just too big. I remembered that Google Spreadsheets will let you pull in a table from a website with no fuss. All I needed to do was put =IMPORTHTML("","table",1) in the first cell on the spreadsheet and voilà, the table is transcluded. I can now add a few more calculations to figure out the important stuff, like how many years’ worth of time have been spent watching Gangnam Style (16,274.24 years, for the record [1]). You can go mess around with the data here.

[1] Assuming I didn’t screw something up.
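The views-to-years conversion is simple enough to sketch in Python. This is my guess at the arithmetic rather than the exact spreadsheet formula: it assumes every view played the whole video and that a year is 365 days. The function name `views_to_years` and the inputs below are hypothetical placeholders, not the real figures from the table.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def views_to_years(views, duration_seconds):
    """Total watch time in years, assuming each view played the full video."""
    return views * duration_seconds / SECONDS_PER_YEAR


# Placeholder inputs, not real data: a 4-minute video with one billion views.
example_years = views_to_years(1_000_000_000, 4 * 60)
print(f"{example_years:,.2f} years")
```

Swapping in the actual view count and video length from the imported table would give the real figure; partial views would make the true number smaller.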