Scraping with Google Spreadsheets Across Instagram, Flickr, YouTube etc.

I remain kind of amazed by how many little tricks can be done with Google Sheets. After seeing Alan’s post today, I wondered how much of the data I could pull (assuming we had the right user names and knew the services . . . really the harder part) just using Google Sheets. Turns out we could get a pretty good amount. What follows is a mix of XPath, regex, and APIs. I started with as little real programming as possible and gradually increased sophistication. These examples are just meant to give a rough idea of how much stuff you’ve got in the various spaces.

Flickr
The URL: http://flickr.com/photos/bionicteaching
The function: =IMPORTXML(C2,"//*[@class='photo-count']")
This uses a basic Google Sheets function to grab the photo-count content. The XPath matches any element whose class is photo-count.

Vimeo
The URL: http://vimeo.com/twwoodward
The function: =INDEX(IMPORTXML(C3,"//*[@class='stat_list_count']"),1)
Pretty similar to the example above but with the addition of INDEX. Several items on the page share the stat_list_count class, and INDEX keeps only the first match.

SoundCloud
The URL: http://soundcloud.com/cogdog
The function: =REGEXEXTRACT(IMPORTXML(C4,"//*[@name='description']/@content"),"([0-9]+) Tracks")
This gets a bit fancier. IMPORTXML brings in a large chunk of content from the page, but it wasn’t structured in a way that let me get the exact information I wanted. REGEX […]
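
For readers who want to see the same two ideas outside a spreadsheet, here is a rough Python sketch of "take the first element with a given class" (the INDEX plus IMPORTXML pattern) and "pull a number out of the description text" (the REGEXEXTRACT pattern). The markup in SAMPLE_PAGE and the helper names first_by_class and track_count are made up for illustration; real pages are messier and usually need a proper HTML parser rather than the standard library XML parser used here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a fetched page.
SAMPLE_PAGE = """
<html>
  <head>
    <meta name="description" content="Listen to cogdog | 12 Tracks. 34 Followers." />
  </head>
  <body>
    <span class="stat_list_count">57</span>
    <span class="stat_list_count">8</span>
  </body>
</html>
""".strip()


def first_by_class(root, class_name):
    """Roughly INDEX(IMPORTXML(url, "//*[@class='...']"), 1): first match only."""
    matches = root.findall(f".//*[@class='{class_name}']")
    return matches[0].text if matches else None


def track_count(root):
    """Roughly REGEXEXTRACT(description, "([0-9]+) Tracks")."""
    meta = root.find(".//*[@name='description']")
    if meta is None:
        return None
    found = re.search(r"([0-9]+) Tracks", meta.get("content", ""))
    return int(found.group(1)) if found else None


root = ET.fromstring(SAMPLE_PAGE)
print(first_by_class(root, "stat_list_count"))
print(track_count(root))
```

As in the Sheets version, the XPath step only narrows things down to a chunk of text; the regex does the final extraction of the number.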

Scraping Instagram #2

flickr photo shared by Marco Gomes under a Creative Commons ( BY ) license

Remember last night when I posted about scraping data from Instagram? I woke up this morning about 5:30 (literally with a start) astounded by how easy the solution to archiving the paginated returns was. So before I even left for work I managed to get this working so much better than my previous attempt. I stripped out all of the previous GitHub stuff as I realized I didn’t really need it. It had provided a nice crutch and let me know I sort of knew what I was doing.

The explanation of what’s going on is in the comments interspersed in the code below. There’s a much cleaner way to do this where I don’t duplicate so much code. I could just call the part that builds the csv1 twice. I may do that at some point but I think having it all in one place will help people new to this sort of thing see what’s going on more clearly.

This is fun stuff. I need to do more of it and more consistently. In the past, I’d do some programming for a few days and then not do any for a number of months. That makes for slow progress and frustration. I’m going to […]
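
The post's own code isn't included in this excerpt, so what follows is only a generic Python sketch of the pagination idea, not the author's script: keep requesting the next page until there isn't one, and send every page through the same row-building code. Everything in it is an assumption made for illustration, including the FAKE_PAGES data, the "pagination" and "next_url" field names, and the fetch, collect_rows, and to_csv helpers. A real version would replace fetch with an HTTP request.

```python
import csv
import io

# Hypothetical stand-in for an API: each "URL" maps to one page of results.
FAKE_PAGES = {
    "page1": {
        "data": [{"id": "a", "likes": 3}, {"id": "b", "likes": 5}],
        "pagination": {"next_url": "page2"},
    },
    "page2": {
        "data": [{"id": "c", "likes": 1}],
        "pagination": {},  # no next_url means this is the last page
    },
}


def fetch(url):
    """Pretend network call; swap in a real HTTP request here."""
    return FAKE_PAGES[url]


def collect_rows(start_url):
    """Follow next_url links, running every page through the same row builder."""
    rows = []
    url = start_url
    while url:
        page = fetch(url)
        for item in page["data"]:
            rows.append([item["id"], item["likes"]])
        url = page.get("pagination", {}).get("next_url")
    return rows


def to_csv(rows):
    """Write a header plus the collected rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "likes"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = collect_rows("page1")
print(to_csv(rows))
```

Because the loop handles the first page and every later page the same way, the row-building code only has to exist once.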