findURL – Custom Google Function Fun

via the magical XKCD

Fun being a fairly relative term . . . but I’m amused.

The Meat

You can write custom functions in Google Spreadsheets and then use them like other built in functions. I didn’t realize that. The script below1 grabs all the URLs from a chunk of text.

You could add it to your spreadsheet in Google by going to Tools>Script Editor and opening a blank project. Replace all of the content with this and then save it. You can now use it like other functions by putting =findULR(A1) (assuming A1 is the text chunk you want). It spits out a single cell with the URLs in it on individual lines and with a count of the URLs found at the top of that cell. I’ve commented up the script below in case you want to understand/change it to better suit your needs.

function findURL(input) {
  var urlsFound = [];
  var endGame = [];
  var regExp = new RegExp("\"https?:\/\/([^\"]*)\"", "gmi"), urlsFound; //this is the regex
  while( urlsFound = regExp.exec(input) ) endGame.push( urlsFound[1] ); //sticks the results in an array
  return endGame.length + ' urls\n' + endGame.join('\n'); //returns the count of the array items then the items with line breaks

Next Steps

The regex works well about 90-95% of the time. So it sure beats doing it by hand but it could be improved. You can try your own stuff here. The post I’m using there is pretty messy so it’s a good thing to check against.

It’s hard to guess what odd things people will do with URLs. I’ve already seen people doing stuff like “”””. No idea why they did double quotes. I have to believe it’s through an odd interaction with the WP WYSIWG editor. Plus, like programming, I only barely understand regex (and the URL regex holly grail is a crazy rabbit hole that can consume your life).

Currently, the regex doesn’t screen out wp-uploads links or links to img files. That could be added or it could be part of a second tier of filtering.

I feel like I shouldn’t need two arrays but . . . it works so I’ll let it ride for the moment.

I could return the results in individual cells with a vanilla return endGame; but it inserts the URLs vertically by default which makes things awkward when dealing with data in spreadsheets. Even horizontal insertions would be kind of hard to deal with given the variability of link numbers.

This particular setup is mainly of use to people doing research after the fact. I used Google Sheets to make it accessible to anyone who felt like using it. The next step is likely a WordPress plugin that’ll give you a page that’s viewable by all authors or by single author. I think I can do this semi-easily with the JSON API and Angular. In my head, it’d be a horizontal bar chart with dates on the X axis and the Y axis being the number of URLs. Selecting the date would display the URLs for that post. There would be a summary of URLs by core domain in order of reference (6 links to other blogs, 3 to etc.).

Some Random Things I Learned

Want to count the occurrences of a string in a single cell?
=LEN(A7)-LEN(SUBSTITUTE(A7,”p”,””)) h/t MS but it also works in Google sheets. Before I used the js option to count the items in the array, I was separating the items with commas and counting the commas.

I was overly-focused on the regex side of things and neglected the javascript. So I asked the wrong question on StackOverflow but someone helped me anyway. It’s still pretty easy for me to get lost and focus on bark instead of the trees/forest. I’m also wandering between a few different places (php, javascript, Angular, regex, various APIs). It might make more sense to focus on one place until I get it mastered but that doesn’t tend to fit my typical patterns. I’m also trying to give back a little by answering the StackOverflow questions that are within my skill set. I’ve now earned the points necessary to make comments so I’ve got that going for me.

1 Which works but feels like it has an extra array it shouldn’t need . . . I’m still really fumbling with fundamentals here. I’m not sure if I say that in defense or in the interest of transparency. You’ll have to ask the psychologist I have yet to hire.