Scraping with Google Spreadsheets Across Instagram, Flickr, YouTube etc.

I remain kind of amazed at how many little tricks can be done with Google Sheets. After seeing Alan’s post today, I wondered how much of the data I could pull (assuming we had the right user names and knew the services . . . really the harder part) just using Google Sheets. Turns out we could get a pretty good amount.

The following is a mix of XPath, regex, and APIs. I started with as little real programming as possible and gradually increased the sophistication. These are just meant to give a rough idea of how much stuff you’ve got in the various spaces.

Flickr

The URL:
The function: =IMPORTXML(C2,"//*[@class='photo-count']")

This uses a basic Google Sheets function to grab the photo count. The XPath matches any element whose class attribute is exactly photo-count. Note that the formula needs straight quotes; curly quotes pasted from a web page will make it fail.

Vimeo

The URL:
The function: =INDEX(IMPORTXML(C3,"//*[@class='stat_list_count']"),1)

Pretty similar to the example above but with the addition of INDEX. Several items on the page share the stat_list_count class and we only want the first match, so INDEX picks row 1 of the results.

SoundCloud

The URL:
The function: =REGEXEXTRACT(IMPORTXML(C4,"//*[@name='description']/@content"),"([0-9]+) Tracks")

This gets a bit fancier. IMPORTXML brings in a large chunk of content from the page, but it wasn’t structured in a way that let me get the exact information I wanted. REGEX […]
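For anyone who wants to check the regex outside of Sheets, here is a minimal Python sketch of the same idea as the REGEXEXTRACT step: find the digits that sit right before the word "Tracks". The pattern is the one from the formula; the function name and the sample description text are made up for illustration and are not what SoundCloud actually returns.

```python
import re

# Same pattern as the REGEXEXTRACT formula: capture the digits before " Tracks".
TRACKS_PATTERN = re.compile(r"([0-9]+) Tracks")


def extract_track_count(description):
    """Return the track count found in a description string, or None."""
    match = TRACKS_PATTERN.search(description)
    return int(match.group(1)) if match else None


# Hypothetical description text, invented for this example.
sample = "Listen to Example Artist | 42 Tracks. 7 Followers."

print(extract_track_count(sample))
print(extract_track_count("No count in this text"))
```

The parentheses in the pattern are what matter: they mark the part to keep, so the word "Tracks" is used to locate the number but is not returned with it.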


YouTube API to Google Spreadsheets

Because I love Alan, here’s the API version in Google Apps Script to grab YouTube stats. It does a bit more than the previous XPath version and you can set it to be triggered repeatedly. I’m going to add a loop to handle multiple videos etc. in the near future, but it’s a good start for anyone who’s doing research on stuff like this.

It is funny what you might notice when you can see the data like this. I triggered it manually twice just to get a few lines in there. Notice that between the first two entries there are no additional views but a chunk more likes/dislikes. Makes me wonder if people are just weighing in without watching or if the data are collected differently, resulting in some delay.

Here’s the script [1] and it’s pretty well commented up. You’ll need an API key. 🙂

You do see some weird stuff in the raw JSON. Like there’s a Favorites field. Does that exist in YouTube? I didn’t really think about it until it came up 0 for every video . . . even Gangnam Style.

Here’s the result of running it every hour on a video that I’m hoping changes a bit. I got it off the trending page so it has to be cool, right?

[1] It took me a good […]
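The script itself isn’t reproduced here, so the following is only a rough Python sketch of the general shape of the idea, not the Apps Script version: build a request URL for the YouTube Data API v3 videos endpoint with part=statistics, then flatten the statistics object into one row per run. The function names, the sample JSON, and its numbers are all invented for illustration; the field names are the ones I’d expect in the statistics object, and any that are missing just default to 0.

```python
import json
from urllib.parse import urlencode

# YouTube Data API v3 videos endpoint.
API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"

# Fields assumed to appear in the "statistics" object; absent ones become 0.
STAT_FIELDS = ("viewCount", "likeCount", "dislikeCount", "favoriteCount", "commentCount")


def build_stats_url(video_id, api_key):
    """Build the request URL for one video's statistics."""
    query = urlencode({"part": "statistics", "id": video_id, "key": api_key})
    return f"{API_ENDPOINT}?{query}"


def stats_row(payload, timestamp):
    """Flatten a parsed API response into one spreadsheet-style row."""
    stats = payload["items"][0]["statistics"]
    # The API returns counts as strings, so convert them to integers.
    return [timestamp] + [int(stats.get(name, 0)) for name in STAT_FIELDS]


# Hypothetical response body with made-up numbers.
SAMPLE_RESPONSE = json.loads("""
{
  "items": [
    {
      "id": "abc123",
      "statistics": {
        "viewCount": "1000",
        "likeCount": "50",
        "dislikeCount": "5",
        "favoriteCount": "0",
        "commentCount": "12"
      }
    }
  ]
}
""")

print(build_stats_url("abc123", "YOUR_API_KEY"))
print(stats_row(SAMPLE_RESPONSE, "2016-01-01 12:00"))
```

Appending a timestamped row on each run is what makes things like the views-versus-likes lag visible, because you can compare consecutive rows.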


YouTube Scraping, XPath, and Google Sheets

APIs can give you much more power but they are often overkill for what people are trying to do around here: lightweight social media. Here’s a lightweight example of how you can use Google Sheets and the IMPORTXML function to grab quite a bit of data from various video pages with no API or technical skills.

Straight off, we’re going to want the URL of the video. We’ll put that in column A and we’ll use it as a variable in all our other formulas.

Getting the Paths to the Data

=IMPORTXML(A2,"(//*[contains(@class, 'watch-title')])[1]")

So how’d that come to be? A2 just says which URL we want to go to. The XPath stuff gets a little more interesting. It looks for any element whose class attribute contains watch-title. I found out the title was in that div by right clicking on the title and choosing Inspect in Chrome. The appended [1] gives us only the first item that meets those qualifications. Otherwise the title shows up twice. As with any Sheets formula, the quotes have to be straight quotes, not curly ones.

The rest of the formulas are pretty much variations on that theme.

View count: =IMPORTXML(A2,"//*[contains(@class, 'watch-view-count')]")
Likes count: =IMPORTXML(A2,"(//*[contains(@class, 'like-button-renderer-like-button')])[1]")
Dislikes count: =IMPORTXML(A2,"(//*[contains(@class, 'like-button-renderer-dislike-button')])[1]")
User name: =IMPORTXML(A2,"(//*[contains(@class, 'yt-user-info')])[1]")

Throw in a video of your own if you’d like.
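To see what the contains(...) test and the trailing [1] are doing, here is a small Python sketch that emulates the same selection on a tiny page. Python’s built-in ElementTree only supports a subset of XPath and has no contains(), so this walks the tree by hand instead of running the XPath itself. The sample markup is invented and far simpler than a real YouTube page (which also isn’t well-formed XML), and first_with_class is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def first_with_class(root, fragment):
    """Emulate (//*[contains(@class, fragment)])[1]: first match in document order."""
    for element in root.iter():
        # contains(@class, ...) is a substring test on the class attribute.
        if fragment in element.get("class", ""):
            return "".join(element.itertext()).strip()
    return None


# Made-up, well-formed markup where the title appears twice.
SAMPLE_PAGE = """
<html><body>
  <h1 class="watch-title-container"><span class="watch-title">Example Video</span></h1>
  <div class="watch-view-count">1,234 views</div>
  <span class="watch-title">Example Video</span>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE)
print(first_with_class(root, "watch-title"))
print(first_with_class(root, "watch-view-count"))
```

Because contains is a substring match, watch-title also matches watch-title-container here, which is one reason the same text can come back more than once and why taking only the first result keeps the cell clean.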