Ever lie awake at night wondering, “Can I teach my computer to recognise whether a lyric comes from a Taylor Swift song or a Lady Gaga song?” Well, today is your lucky day!

What we’re going to do is feed three songs from each artist to a supervised learning algorithm, and then give it a lyric from another song to see if it can correctly figure out who sang it.

So our task isn’t just asking a computer to look something up; it’s asking it to infer the answer from what it has already learned.

Credit: https://www.flickr.com/photos/jmlawlor/3117174819

What is supervised learning?

Supervised learning is the machine learning task of inferring a function from labeled training data [1]. The training data normally consists of pairs: the first element is the input object and the second is the supervisory signal (the label). For us, an input object would be “He’s the reason for the teardrops on my guitar”, with the supervisory signal being “Taylor Swift”.
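As a rough illustration of what "data in pairs" looks like in Ruby, here is a small sketch. The constant name TRAINING_PAIRS and the symbol labels are just choices made for this example; the lyrics are the ones quoted in this post.

```ruby
# Each training example pairs an input object (a lyric) with its
# supervisory signal (the artist label).
TRAINING_PAIRS = [
  ["He's the reason for the teardrops on my guitar", :taylor_swift],
  ["'Cause I knew you were trouble when you walked in", :taylor_swift],
  ["I want your leather studded kiss in the sand", :lady_gaga]
].freeze

# Group the inputs by label to see how many examples each category has.
examples_by_label = TRAINING_PAIRS
  .group_by { |_lyric, label| label }
  .transform_values { |pairs| pairs.map(&:first) }

examples_by_label.each do |label, lyrics|
  puts "#{label}: #{lyrics.size} example(s)"
end
# taylor_swift: 2 example(s)
# lady_gaga: 1 example(s)
```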

The supervised learning algorithm we’ll use is the Naive Bayes classifier, a probabilistic classifier that applies Bayes’ theorem and is widely used for text categorisation. A well-known use of the Naive Bayes classifier is in email spam filters, which decide whether the text in an email is “Spam” or “Not Spam”.
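To make "applying Bayes’ theorem" a little more concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes text classifier. The TinyBayes class, its method names, the tokeniser and the add-one smoothing are all illustrative assumptions for this sketch; it is not how the classifier gem used later in this post is implemented.

```ruby
# A toy multinomial Naive Bayes text classifier (illustrative only).
class TinyBayes
  def initialize(*labels)
    # @word_counts[label][word] => how often `word` was seen under `label`
    @word_counts = labels.to_h { |label| [label, Hash.new(0)] }
    # @doc_counts[label] => how many training texts carried `label`
    @doc_counts = labels.to_h { |label| [label, 0] }
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each { |word| @word_counts[label][word] += 1 }
  end

  # Returns the label with the highest log-probability score.
  def classify(text)
    scores(text).max_by { |_label, score| score }.first
  end

  def scores(text)
    words = tokenize(text)
    vocab_size = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.sum.to_f

    @word_counts.to_h do |label, counts|
      total_words = counts.values.sum
      # log P(label): the share of training texts with this label
      score = Math.log(@doc_counts[label] / total_docs)
      # plus the sum of log P(word | label), with add-one (Laplace)
      # smoothing so an unseen word does not zero out the whole product
      words.each do |word|
        score += Math.log((counts[word] + 1.0) / (total_words + vocab_size))
      end
      [label, score]
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

bayes = TinyBayes.new(:taylor_swift, :lady_gaga)
bayes.train(:taylor_swift, "'Cause I knew you were trouble when you walked in")
bayes.train(:lady_gaga, "I want your leather studded kiss in the sand")

puts bayes.classify("I knew you were trouble")
# => taylor_swift
```

The "naive" part is the assumption that every word is independent of the others given the artist, which is why the per-word log-probabilities can simply be added together.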

Let’s use a similar process to see if a lyric is from “Taylor Swift” or “Lady Gaga”.

But first, the training data!

We are going to get our lyrics from Metro Lyrics, which has a list of Taylor Swift songs and Lady Gaga songs.

Let’s do this with Ruby

Using the OpenURI Ruby module we can quickly load the HTML for lyrics, and then using Nokogiri we can parse that HTML to get the lyrics as text.

require 'open-uri'
require 'nokogiri'

# URI.open is provided by open-uri; on Ruby 3.0+ a bare `open` no longer fetches URLs
html = URI.open("http://www.metrolyrics.com/teardrops-on-my-guitar-lyrics-taylor-swift.html")
lyrics = Nokogiri::HTML(html).css("#lyrics-body-text").text

puts lyrics
# => Drew looks at me, I fake a smile so he won't see
# That I want and I'm needing everything that we should be
# I'll bet she's beautiful, that girl he talks about
# ...

Time to start learning

There’s no need for us to write a Naive Bayes classifier from scratch; instead we’ll use the classifier gem, which does the heavy lifting for us.

Using the gem, here’s an example of creating a new classifier with two categories and providing it with training data:

require 'classifier'

@skynet = Classifier::Bayes.new 'lady_gaga', 'taylor_swift'

@skynet.train_taylor_swift "'Cause I knew you were trouble when you walked in"
@skynet.train_lady_gaga "I want your leather studded kiss in the sand"

Putting it together

So let’s get three songs per artist and feed their lyrics to our Naive Bayes classifier, which I’ve playfully called Skynet.

# From http://www.metrolyrics.com/taylor-swift-lyrics.html
taylor_swift_urls = [
  "http://www.metrolyrics.com/i-knew-you-were-trouble-lyrics-taylor-swift.html",
  "http://www.metrolyrics.com/teardrops-on-my-guitar-lyrics-taylor-swift.html",
  "http://www.metrolyrics.com/we-are-never-ever-getting-back-together-lyrics-taylor-swift.html"
]
taylor_swift_urls.each do |url|
  train_skynet(url, :train_taylor_swift)
end

# From http://www.metrolyrics.com/lady-gaga-lyrics.html
lady_gaga_urls = [
  "http://www.metrolyrics.com/bad-romance-lyrics-lady-gaga.html",
  "http://www.metrolyrics.com/telephone-lyrics-lady-gaga.html",
  "http://www.metrolyrics.com/pokerface-lyrics-lady-gaga.html"
]
lady_gaga_urls.each do |url|
  train_skynet(url, :train_lady_gaga)
end

The train_skynet method just DRYs up the code a bit (in a single script, define it before the loops above run):

def train_skynet(url, trainer)
  # Grab the HTML from the url (URI.open comes from open-uri)
  html = URI.open(url)

  # Extract the lyrics from the #lyrics-body-text div
  lyrics = Nokogiri::HTML(html).css("#lyrics-body-text").text

  # Filter text to ensure only alphanumeric characters and
  # appropriate punctuation are used
  lyrics.gsub!(/[^A-Za-z0-9,.'\s]/, ' ')

  # Feed the cleaned lyrics to skynet via the given training method
  @skynet.send(trainer, lyrics)
end
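To see what the character filter in train_skynet actually does, here is a small sketch that applies the same regular expression to two lyrics quoted in this post. The names LYRIC_FILTER and clean_lyrics are made up for this example, and it uses the non-destructive gsub rather than gsub!.

```ruby
# The same character class used in train_skynet: anything that is not a
# letter, digit, comma, full stop, apostrophe or whitespace gets replaced.
LYRIC_FILTER = /[^A-Za-z0-9,.'\s]/

def clean_lyrics(text)
  # gsub returns a new string every time; gsub! would return nil
  # when nothing needed replacing.
  text.gsub(LYRIC_FILTER, ' ')
end

p clean_lyrics("Papa-paparazzi")
# => "Papa paparazzi"

p clean_lyrics("Shake it off, shake it off!")
# => "Shake it off, shake it off "
```

Replacing with a space rather than an empty string keeps hyphenated words as two separate tokens instead of gluing them together.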

Results

Now we can see how smart our computer is.

@skynet.classify "Shake it off, shake it off!"
# => Taylor swift

@skynet.classify "At least that's what people say mmm, that's what people say mmm"
# => Taylor swift

@skynet.classify "But I won't stop until that boy is mine"
# => Lady gaga

@skynet.classify "Papa-paparazzi"
# => Taylor swift

@skynet.classify "I'll follow you until you love me"
# => Lady gaga

Not bad! Only “Papa-paparazzi” was misattributed. If Meat Loaf was happy with 2 out of 3, then I can be overjoyed with 4 out of 5.
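The score can be tallied with a few lines of Ruby. In this sketch the predicted values are copied from the output above, while the actual column (which artist really recorded each lyric) and the hash layout are added here purely for scoring.

```ruby
# Predictions from the results above, paired with the real artist.
results = [
  { lyric: "Shake it off, shake it off!",
    predicted: "Taylor swift", actual: "Taylor swift" },
  { lyric: "At least that's what people say mmm, that's what people say mmm",
    predicted: "Taylor swift", actual: "Taylor swift" },
  { lyric: "But I won't stop until that boy is mine",
    predicted: "Lady gaga", actual: "Lady gaga" },
  { lyric: "Papa-paparazzi",
    predicted: "Taylor swift", actual: "Lady gaga" },
  { lyric: "I'll follow you until you love me",
    predicted: "Lady gaga", actual: "Lady gaga" }
]

correct = results.count { |r| r[:predicted] == r[:actual] }
accuracy = correct.to_f / results.size

puts "#{correct} out of #{results.size} correct (#{(accuracy * 100).round}%)"
# => 4 out of 5 correct (80%)
```

With only five test lyrics and three training songs per artist this is a fun sanity check rather than a meaningful measure of accuracy.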

Full source code can be found at this gist.

  1. Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar (2012). Foundations of Machine Learning. The MIT Press. ISBN 9780262018258.

A quick guide on how to move a running Linux process into the screen terminal multiplexer.

When connecting to a remote Linux server, most of the time I’m doing a quick task such as looking at log files or doing routine server maintenance. It only takes a few minutes. But sometimes I launch a process that can take many hours to run. Those long-running processes are normally related to technical SEO tasks such as multi-gigabyte log file analysis, site crawls and data scraping, or to admin tasks such as database migrations.

To be as efficient with my own time as possible, a common workflow is to start a long-running process in the afternoon and then, later in the evening from home, monitor the progress and perform any cleanup as needed. The snag is that I’d start the operation from my work computer but be in my home office in the evening. To get that to work we use a terminal multiplexer, which is software that allows us to create, detach from and then reattach to terminal sessions.

My terminal multiplexer of choice is screen, although some people prefer tmux. My needs are simple and screen works just fine. Matt Cutts wrote a quick tutorial on screen in 2007 which explains what it is and how it’s used.

Now, when performing quick and simple tasks I don’t use screen. There’s no point as the session will only last a few minutes. Every now and then though I discover that a running task is taking much longer than expected, or I’ve simply forgotten to use screen to begin with. In that case we’re left with three options:

  • Stay back late in the office for an unknown length of time(!!),
  • Quit the current process, losing any unsaved work and potentially wasting hours of processing, or
  • Move the running process to a new screen shell. This is the solution that I’ll describe in the rest of this post.

Move the running process to a new screen shell

Most of the commands that we’ll be using are shell built-ins, but you’ll need to install two programs to complete this task. The first, rather obviously, is screen. The second is reptyr. If you need some assistance with that, here’s a great guide on installing software on Linux.

The steps we need to take are:

  1. Suspend the process
  2. Resume the process in the background
  3. Disown the process
  4. Launch a screen session
  5. Find the PID of the process
  6. Use reptyr to take over the process

I’ll summarise the steps into a sequence of commands to be used, but first let’s have a quick look at what each step entails.

Suspend the process

The first thing that we need to do is to suspend the process by pressing Ctrl+Z.

When you press Ctrl+Z, the terminal sends the SIGTSTP signal to the foreground process. This halts execution, and the kernel won’t schedule any more CPU time for the process until it is resumed.

Resume the process in the background

Type in the bg command. This sends the SIGCONT signal to the process and it’s now happily running in the background.

Disown the process

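We now run the disown command like so: disown %1. Here %1 refers to job number 1; run jobs to check the number if you have more than one background job.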

Disown removes the process from the table of active jobs, essentially allowing it to be taken over by another session.

Launch a screen session

This is pretty easy: we now run the screen command, which starts the session our process will soon be moved to.

Find the PID of the process

Now we need to find the process ID (PID) of the process we’d like to take over. The method I use and recommend is pgrep. As an example, if our process is called myprogram, we can run the command pgrep myprogram, which will return the PID (or one PID per line if several processes match the name).

Use reptyr to take over the process

Finally we pass the PID to reptyr to take over the process. If pgrep gave us a PID of 1234, we can now use the command: reptyr 1234

Pro tip: You can combine pgrep and reptyr with the following syntax: reptyr $(pgrep myprogram). This only works when pgrep matches exactly one process.
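If you would rather script this step and guard against pgrep matching zero or several processes, something like the following Ruby sketch could help. The single_pid helper is hypothetical (it is not part of pgrep or reptyr); it only parses text in the format pgrep prints, one PID per line.

```ruby
# Hypothetical helper: turn raw pgrep output into exactly one PID,
# raising if the name matched no process or more than one.
def single_pid(pgrep_output)
  pids = pgrep_output.split.map { |token| Integer(token) }
  raise ArgumentError, "no matching process" if pids.empty?
  raise ArgumentError, "ambiguous match: #{pids.join(', ')}" if pids.size > 1

  pids.first
end

puts single_pid("1234\n")
# => 1234
```

In a real script you might feed it the command output with single_pid(%x(pgrep myprogram)) and only hand the result to reptyr once it returns without raising.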

Summary

As a quick reference guide:

  1. Suspend the process with Ctrl+Z
  2. Resume the process in the background with bg
  3. Disown the process with disown %1
  4. Launch a screen session with screen
  5. Find the PID of the process using pgrep <process name>
  6. Use reptyr to take over the process with reptyr <pid>

Success! Now we’re able to detach from screen (Ctrl+A, then D) and reattach from another computer at another time with screen -r.

The Ubuntu Ptrace gotcha

There’s one gotcha with Ubuntu, though. As a security measure, Maverick Meerkat (Ubuntu 10.10) introduced a patch that disallows ptracing of non-child processes by non-root users. This stops reptyr from working.

To disable this setting permanently you can edit the /etc/sysctl.d/10-ptrace.conf file and set kernel.yama.ptrace_scope to 0. You’ll then need to update the kernel parameters by running sudo sysctl -p /etc/sysctl.d/10-ptrace.conf

However, if you only want to disable ptrace scoping temporarily, you can run the following command before reptyr: echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, and then this one afterwards to restore the restriction: echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope
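Before changing anything it can be handy to check what the current setting means. The Ruby sketch below reads the same /proc file and prints a description; the constant and method names are made up for this example, and the descriptions paraphrase the Linux kernel’s Yama documentation.

```ruby
# Meanings of kernel.yama.ptrace_scope, paraphrased from the kernel's Yama docs.
PTRACE_SCOPE_MEANINGS = {
  0 => "classic: a process may attach to any other process running under the same uid",
  1 => "restricted: only descendants (or explicitly permitted processes) can be attached to",
  2 => "admin-only: attaching requires CAP_SYS_PTRACE",
  3 => "no attach: ptrace attach is disabled and the value cannot be changed while running"
}.freeze

PTRACE_SCOPE_PATH = "/proc/sys/kernel/yama/ptrace_scope"

# Takes the raw file contents (for example "1\n") and returns a description.
def describe_ptrace_scope(raw_value)
  PTRACE_SCOPE_MEANINGS.fetch(Integer(raw_value.strip)) { "unknown value" }
end

if File.readable?(PTRACE_SCOPE_PATH)
  puts describe_ptrace_scope(File.read(PTRACE_SCOPE_PATH))
else
  puts "Yama ptrace_scope is not available on this system"
end
```

A value of 0 is the setting this post relies on for reptyr to attach as a non-root user; a stock Ubuntu install reports 1.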

So for Ubuntu our quick reference guide is:

  1. Suspend the process with Ctrl+Z
  2. Resume the process in the background with bg
  3. Disown the process with disown %1
  4. Launch a screen session with screen
  5. Find the PID of the process using pgrep <process name>
  6. Allow ptracing of non-child processes with echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
  7. Use reptyr to take over the process with reptyr <pid>
  8. Restore the ptrace restriction with echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope