Ever lie awake at night and wonder, “Can I teach my computer to recognise whether a lyric comes from a Taylor Swift song or a Lady Gaga song?” Well, today is your lucky day!
What we’re going to do is feed the lyrics of three songs from each artist to a supervised learning algorithm, then give it a lyric from another song to see if it can correctly figure out who sang it.
So our task isn’t just asking a computer to look something up; it’s asking it to infer the answer from what it has already learned…
What is supervised learning?
Supervised learning is the machine learning task of inferring a function from labeled training data [1]. The training data normally consists of pairs, the first element being the input object and the second being the supervisory signal (the label). In our case, an input object would be “He’s the reason for the teardrops on my guitar”, with the supervisory signal being “Taylor Swift”.
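To make the "pairs" idea concrete, here is a minimal sketch of what labelled training data could look like in Ruby. The `TRAINING_PAIRS` name is just something I made up for illustration, and the two extra lyric lines are borrowed from the examples used later in this post.

```ruby
# Labelled training pairs: [input object, supervisory signal].
# TRAINING_PAIRS is a hypothetical name used only for this illustration.
TRAINING_PAIRS = [
  ["He's the reason for the teardrops on my guitar", "Taylor Swift"],
  ["'Cause I knew you were trouble when you walked in", "Taylor Swift"],
  ["I want your leather studded kiss in the sand", "Lady Gaga"]
].freeze

# Count how many examples we have for each label.
examples_per_label = TRAINING_PAIRS.group_by { |_lyric, label| label }
                                   .transform_values(&:size)

TRAINING_PAIRS.each do |lyric, label|
  puts "#{label}: #{lyric}"
end
p examples_per_label
```

The algorithm never sees a rule like "guitars mean Taylor Swift"; it only sees these pairs and has to work out the pattern itself.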
The supervised learning algorithm we’ll use is the Naive Bayes classifier, a probabilistic classifier that applies Bayes’ theorem under the “naive” assumption that the features (here, the words in a lyric) are independent of one another. It is widely used for text categorisation; a well-known example is the email spam filter, which decides whether the text in an email is “Spam” or “Not Spam”.
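If you're curious what that looks like in practice, here is a toy sketch of the idea: count words per label, then score a new lyric by adding up log-probabilities. This is my own simplified illustration (the `TinyBayes` class and its add-one smoothing are assumptions for the example), not how any particular gem implements it.

```ruby
# Toy Naive Bayes text classifier: per-label word counts with add-one smoothing.
# TinyBayes is a hypothetical class written for this illustration only.
class TinyBayes
  def initialize(*labels)
    # One word-count table per label, defaulting to zero.
    @counts = labels.to_h { |label| [label, Hash.new(0)] }
    # Number of training texts seen per label.
    @docs = Hash.new(0)
  end

  def train(label, text)
    @docs[label] += 1
    tokenize(text).each { |word| @counts[label][word] += 1 }
  end

  # Return the label with the highest score for the given text.
  def classify(text)
    words = tokenize(text)
    @counts.keys.max_by { |label| log_score(label, words) }
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end

  def vocabulary_size
    @counts.values.flat_map(&:keys).uniq.size
  end

  # log P(label) + sum over words of log P(word | label).
  # Adding 1 to every count stops unseen words from zeroing out the score.
  def log_score(label, words)
    total_words = @counts[label].values.sum
    prior = Math.log(@docs[label].to_f / @docs.values.sum)
    words.sum(prior) do |word|
      Math.log((@counts[label][word] + 1.0) / (total_words + vocabulary_size))
    end
  end
end

toy = TinyBayes.new(:taylor_swift, :lady_gaga)
toy.train(:taylor_swift, "'Cause I knew you were trouble when you walked in")
toy.train(:lady_gaga, "I want your leather studded kiss in the sand")

p toy.classify("knew you were trouble")
p toy.classify("leather studded kiss")
```

Working in logarithms is a common trick here: multiplying many small probabilities quickly underflows, whereas adding their logs does not.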
Let’s use a similar process to see if a lyric is from “Taylor Swift” or “Lady Gaga”.
But first, the training data!
Let’s do this with Ruby
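We can grab a song’s lyrics by fetching its MetroLyrics page with open-uri and pulling the text out of the #lyrics-body-text element with Nokogiri:

    require 'open-uri'
    require 'nokogiri'

    # Kernel#open no longer accepts URLs as of Ruby 3.0, so use URI.open
    html = URI.open("http://www.metrolyrics.com/teardrops-on-my-guitar-lyrics-taylor-swift.html")
    lyrics = Nokogiri::HTML(html).css("#lyrics-body-text").text

    puts lyrics
    # => Drew looks at me, I fake a smile so he won't see
    #    That I want and I'm needing everything that we should be
    #    I'll bet she's beautiful, that girl he talks about
    #    ...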
Time to start learning
There’s no need for us to write a Naive Bayes classifier from scratch. Instead, we’ll use the classifier gem, which does the heavy lifting for us.
Using the gem, here’s an example of creating a new classifier with two categories and providing it with training data:
    require 'classifier'

    # Two categories; the gem exposes a train_<category> method for each
    @skynet = Classifier::Bayes.new 'lady_gaga', 'taylor_swift'
    @skynet.train_taylor_swift "'Cause I knew you were trouble when you walked in"
    @skynet.train_lady_gaga "I want your leather studded kiss in the sand"
Putting it together
So let’s get the lyrics of three songs per artist and feed them to our Naive Bayes classifier, which I’ve playfully called Skynet.
    # From http://www.metrolyrics.com/taylor-swift-lyrics.html
    taylor_swift_urls = [
      "http://www.metrolyrics.com/i-knew-you-were-trouble-lyrics-taylor-swift.html",
      "http://www.metrolyrics.com/teardrops-on-my-guitar-lyrics-taylor-swift.html",
      "http://www.metrolyrics.com/we-are-never-ever-getting-back-together-lyrics-taylor-swift.html"
    ]

    taylor_swift_urls.each do |url|
      train_skynet(url, :train_taylor_swift)
    end

    # From http://www.metrolyrics.com/lady-gaga-lyrics.html
    lady_gaga_urls = [
      "http://www.metrolyrics.com/bad-romance-lyrics-lady-gaga.html",
      "http://www.metrolyrics.com/telephone-lyrics-lady-gaga.html",
      "http://www.metrolyrics.com/pokerface-lyrics-lady-gaga.html"
    ]

    lady_gaga_urls.each do |url|
      train_skynet(url, :train_lady_gaga)
    end
The train_skynet method just DRYs up the code a bit:
    def train_skynet(url, trainer)
      # Grab the HTML from the url
      # (Kernel#open no longer accepts URLs as of Ruby 3.0, so use URI.open)
      html = URI.open(url)

      # Extract the lyrics from the #lyrics-body-text div
      lyrics = Nokogiri::HTML(html).css("#lyrics-body-text").text

      # Filter text to ensure only alphanumeric characters and
      # appropriate punctuation are used
      lyrics.gsub!(/[^A-Za-z0-9,.'\s]/, ' ')

      # Feed the whole song to skynet under the given category
      @skynet.send(trainer, lyrics)
    end
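The filtering step is easy to try on its own, without hitting the network. Here is a small sketch using the same regular expression; the `clean_lyrics` helper and the sample string are made up for the example.

```ruby
# Same character filter as in train_skynet: anything that is not a letter,
# digit, comma, full stop, apostrophe or whitespace becomes a space.
LYRIC_FILTER = /[^A-Za-z0-9,.'\s]/

# clean_lyrics is a hypothetical helper, not part of the original script.
def clean_lyrics(text)
  text.gsub(LYRIC_FILTER, ' ')
end

sample  = "Rah-rah (ah-ah-ah)! Don't stop"
cleaned = clean_lyrics(sample)

p cleaned.split
```

Each unwanted character is swapped for a single space, so hyphenated sounds split into separate words while apostrophes in words like "Don't" survive. I've used the non-destructive `gsub` here because `gsub!` returns nil when nothing was replaced, which is an easy trap if you ever chain on its result.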
Now we can see how smart our computer is.
    @skynet.classify "Shake it off, shake it off!"
    # => Taylor swift

    @skynet.classify "At least that's what people say mmm, that's what people say mmm"
    # => Taylor swift

    @skynet.classify "But I won't stop until that boy is mine"
    # => Lady gaga

    @skynet.classify "Papa-paparazzi"
    # => Taylor swift

    @skynet.classify "I'll follow you until you love me"
    # => Lady gaga
Not bad! If Meat Loaf was happy with 2 out of 3, then I can be overjoyed with 4 out of 5.
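If you want the score tallied for you, here is a rough sketch. The predicted column is copied from Skynet's answers above; the expected column is my own labelling of which artist each line comes from, and the `RESULTS` name is made up for the example.

```ruby
# Each row: lyric, expected artist (labelled by hand), Skynet's prediction.
# Labels use the "Taylor swift" / "Lady gaga" spelling that classify returned.
RESULTS = [
  ["Shake it off, shake it off!", "Taylor swift", "Taylor swift"],
  ["At least that's what people say mmm, that's what people say mmm", "Taylor swift", "Taylor swift"],
  ["But I won't stop until that boy is mine", "Lady gaga", "Lady gaga"],
  ["Papa-paparazzi", "Lady gaga", "Taylor swift"],
  ["I'll follow you until you love me", "Lady gaga", "Lady gaga"]
].freeze

correct  = RESULTS.count { |_lyric, expected, predicted| expected == predicted }
accuracy = correct.to_f / RESULTS.size

puts "#{correct}/#{RESULTS.size} correct (#{(accuracy * 100).round}%)"
```

Five lyrics is a tiny test set, so treat the number as a bit of fun rather than a real measure of accuracy.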
Full source code can be found at this gist.