Scraping with Google Spreadsheets Across Instagram, Flickr, YouTube etc.

live long and prosper

I remain kind of amazed with how many little tricks can be done with Google Sheets. After seeing Alan’s post today, I wonder how much of the data I could pull (assuming we had the right user names and knew the services . . . really the harder part) just using Google Sheets. Turns out we could get a pretty good amount. The following is a mix of XPath, regex, and APIs. I started with as little real programming as possible and gradually increased sophistication.

The following are just meant to get a rough idea of how much stuff you’ve got in the various spaces.

Flickr

The URL: http://flickr.com/photos/bionicteaching
The function: =IMPORTXML(C2,”//*[@class=’photo-count’]”)
This uses a basic Google Sheets function to grab the photo-count content. The function is grabbing the div class with the title photo-count.
get total photos from Flickr

Vimeo

The URL: http://vimeo.com/twwoodward
The function: =INDEX(IMPORTXML(C3,”//*[@class=’stat_list_count’]”),1)
Pretty similar to the example above but with the addition of INDEX. That solves the problem that there are multiple items that are all in the stat_list_count class and we only want the first matching item.

Sound Cloud

The URL: http://soundcloud.com/cogdog
The function: =REGEXEXTRACT(IMPORTXML(C4,”//*[@name=’description’]/@content”),”([0-9]+) Tracks”)
This gets a bit fancier. IMPORTXML brings in a large chunk of content from the page but it wasn’t structured in a way that I could get the exact information I wanted. REGEX comes to the rescue and grabs the chunk that has a pattern of numbers [0-9]+ followed by the word Tracks.

Instagram

The URL: I cheated and used PHP to grab the data and just hand it off.
The function: =IMPORTXML(“http://bionicteaching.com/tools/reclaim/reclaim_insta.php?user=”&B5,”//*[@class=’reclaim’]”)

$raw = file_get_contents('https://www.instagram.com/' . $user); //replace with user
preg_match('/\"media\"\:\{\"count\"\:([0-9]+)/', $raw, $m);
echo '<div class="reclaim">' . intval($m[1]). '</div>';

YouTube

This is a custom function in Google Sheets. You’d write =myYT(“bionicteaching”) and it’ll return the number of videos.

function myYT(input) {
  
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheets = ss.getSheets();
  var user = input; //gets the variable from the function
  var sheet = ss.getSheetByName("data");
  var apiKey = 'THEAPIKEY';// replace this with your API client ID
  var url = 'https://www.googleapis.com/youtube/v3/channels?part=statistics&forUsername=' + user + '&key=' + apiKey; 
    
  var response = UrlFetchApp.fetch(url); // get feed
  var json = response.getContentText(); //
  var data = JSON.parse(json);
  var stats = [];
  stats.push(data.items[0].statistics.videoCount);
  return stats; //returns the data to the spreadsheet cell 	 
}

Tumblr

Pretty much the same thing- grab the JSON, parse it, and return it.

function myTumblr(input) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheets = ss.getSheets();
  var site = input; // make sure it doesn't have http leading
  var sheet = ss.getSheetByName("data");
  var apiKey = 'THEAPIKEY';
  var url = 'https://api.tumblr.com/v2/blog/' + site + '/info?api_key=' + apiKey; // replace this with your API client ID
    
  var response = UrlFetchApp.fetch(url); // get feed
  var json = response.getContentText(); //
  var data = JSON.parse(json);
  var stats = [];
  stats.push(data.response.blog.total_posts);
  return stats;
 	 
}

Comments on this post

  1. CogDog said on April 23, 2016 at 10:44 pm

    Now we’re talking, or scraping. I am wary of the scrape / Xpath route; like my first generation Flickr CC Attributor, everything breaks when Yahoo (or whomever becomes their next overlord) alters the display of flickr content. The API way is the way, writing something to get info from flickr is not hard.

    The tripping point is authentication. Getting those keys is a PITA. Wonder about doing some authorization stuff the way Martin Hawksey does the twitter auth in his TAGs sheets.

    Oh, what I meant to say is “thanks” for doing this.

    • Tom Woodward said on April 24, 2016 at 7:58 am

      I think I like xpath because it’s more of a puzzle and reminds me of non-programistan. 🙂 I was also thinking of avoiding people having to get an API Key just to this exploratory stuff. I figured that’d be a roadblock. Thinking about it more though, I think I’ll just use my keys for the service side

      Two of them are official API returns though, I’ll probably do the others that way this evening.

  2. John Johnston said on April 26, 2016 at 2:37 am

    Hi Tom,
    Thanks, This looks like a nice rabbit hole.

    Went to give this a quick try this morning and on my flickr page (and yours) I don’t see the photo count. Wondering if temporary or another change?

    Screenshot:
    https://www.flickr.com/photos/troutcolor/26379416160

    the paragraph is still there so the formula works, but I needed to straighten the quotes

    Thanks Again

  3. ravi janjwadia said on June 5, 2016 at 3:48 am

    Heyy,
    I also want to do a smiliar thing with the ‘followed by’ count of Instagram, can you please help me out?
    And when I run your Instagram code it shows that https wrapper is unavailable.

    • Tom Woodward said on June 7, 2016 at 7:40 am

      Yep. Cogdog’s dire warnings of changes impacting the scraping occurred when Instagram added that hideous icon. I fixed it and added followers/following to the mix. Still better than Instagram’s new stupid API restrictions 🙂

  4. Eric Jonas said on July 23, 2016 at 5:55 pm

    For instagram, could I grab users & its followers? Plus is it also possible to scrap users from a particular location e.g. California (without knowing all the users)? Any help? Thanks.

  5. djahill said on September 20, 2016 at 10:41 am

    Hello Tom, thanks for your article. I’m not very familiar with programming but google spreadsheet is opening some doors to my curiosity.

    I’d like to ask if (‘/\”media\”\:\{\”count\”\:([0-9]+)/’, $raw, $m ) in your Instagram section could be used as REGEX argument in a formula like =REGEXREPLACE(CONCATENATE(IMPORTDATA( instagram url ));”(^.*media: {“”count””: )(\d+)(}.*)”;”$2″) – that is similar to a followers count scrape I’ve found while googling.

    • Tom Woodward said on September 20, 2016 at 12:21 pm

      I spent a long time doing semi-programming in spreadsheets. It’s a friendly place that I highly recommend.

      You’re correct, that’s a regex argument. As mine is in PHP it’s slightly different than what javascript will let you do and Google has other restrictions if I recall correctly.

      Assuming you’ve got the Instagram URL you want in cell A1, then this ought to work to return the numbers of users that person follows.
      =REGEXEXTRACT(concatenate(IMPORTDATA(A1)),”follows: {.count.: [0-9]+”)

      This would return the number of followers they have
      =REGEXEXTRACT(concatenate(IMPORTDATA(A1)),”followed_by: {.count.: [0-9]+”)

      • Tom Woodward said on September 21, 2016 at 11:55 am

        Actually this one ends up being much cleaner
        =REGEXEXTRACT(concatenate(IMPORTDATA(A1)),”(?:followed_by: {.count.: )([0-9]+)”)

        • Sam said on October 5, 2016 at 12:16 am

          I tried using both the php script and this formula to display followers but neither one worked. Is this still working or am I missing something?

          • Tom Woodward said on October 5, 2016 at 1:04 pm

            Assuming the Instagram URL is in cell A1, give this a shot . . .

            =REGEXEXTRACT(concatenate(IMPORTDATA(A1)),”(?:followed_by: {.count.: )([0-9]+)”)

            I did some different PHP stuff more recently. You can see that here.

Leave a Reply

Trackbacks and Pingbacks on this post

No trackbacks.

TrackBack URL