Provoke

How to Think About Machine Learning

Written by Rachel Adams, Software Developer at Provoke. 

Being able to pick patterns out from a lot of information is useful. Humans have evolved to be pretty great at this. After our ancestors were attacked by enough bears, we started noticing a pattern that loud roars usually mean getting eaten, so let’s run away if we hear a roar. The same applies to when you first start driving and see cars pulling over when they hear sirens. You quickly learn the pattern that sirens mean get out of the way.

 

Sometimes we imagine patterns that don’t actually exist. My neighbor won’t fly because of the plane crashes he sees on TV. He prefers driving because he drives all the time and only sees a wreck now and then. The truth is, you’re about 2,200 times more likely to die in a car crash than a plane crash[1].

 

Luckily, machines aren’t as imaginative as us (yet), and they can handle a lot more information than us (like roars and sirens from thousands of animals and vehicles at once). Machine learning puts computers to use finding patterns and making decisions.

 

Machine learning is the process of using existing data to make predictions about new data. The computer learns and adapts from the data itself, rather than needing a developer to hardcode which decision to make for every combination of values. If a computer has enough existing data telling it to run away from the roar of a grizzly bear, black bear, brown bear, and polar bear, it can start to predict things like “that new roar is from some kind of bear, and I know bears are generally bad, so I should probably run now.”

 

Real world uses for machine learning

What are you trying to predict? What data can you collect to predict it? Those are the key questions in machine learning. If you know what data to collect and how to interpret it, machine learning has applications everywhere - from determining what content to show a user on your site, to figuring out the subject matter of online posts, building pricing models, or even determining whether customers are satisfied or not based on their tweets.

 

Let’s look at two cases in more depth: determining insurance premiums and recommending movies.

 

Determining insurance premiums. There’s a lot of data out there about automobile accidents that we can use to predict optimal premiums for a given person. Males are more likely to be in an accident than females, so their premiums are higher. People under 23 tend to be in more accidents than people over 23, so their premiums are higher. Using all of that existing data as a training set for machine learning algorithms like a decision tree, we can make an estimate for a new customer’s premium.

The same concept could apply to health insurance. Is the customer a smoker? Do they have a pre-existing condition? Are they over 50? If we’ve recorded this data on all our thousands of past customers and know how much they cost to insure, we can use that data to predict the cost of insuring future customers.
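To make the premium idea concrete, here is a minimal sketch of a decision tree in plain Python. Everything in it is an assumption for illustration: the customer records are invented, and the names `CUSTOMERS`, `build_tree`, and `predict` are hypothetical, not something a real insurer or library uses. The tree simply keeps splitting past customers on whichever question (sex or age) separates their costs best, then averages the costs in each final group.

```python
from statistics import mean

# Invented past customers: (is_male, age, yearly_cost). Not real insurance data.
CUSTOMERS = [
    (1, 19, 1400), (1, 21, 1300), (0, 20, 1100), (0, 22, 1000),
    (1, 35, 800), (1, 50, 700), (0, 30, 600), (0, 45, 500),
]

def sse(values):
    """Sum of squared differences from the mean (how spread out a group is)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, depth=0, max_depth=2):
    """Return a leaf (average cost) or a node (feature, threshold, left, right)."""
    costs = [r[-1] for r in rows]
    if depth == max_depth or len(rows) < 2:
        return mean(costs)
    best = None
    for feature in range(len(rows[0]) - 1):
        for threshold in sorted({r[feature] for r in rows}):
            left = [r for r in rows if r[feature] < threshold]
            right = [r for r in rows if r[feature] >= threshold]
            if not left or not right:
                continue
            error = sse([r[-1] for r in left]) + sse([r[-1] for r in right])
            if best is None or error < best[0]:
                best = (error, feature, threshold, left, right)
    if best is None:
        return mean(costs)
    _, feature, threshold, left, right = best
    return (feature, threshold,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def predict(tree, customer):
    """Walk from the root to a leaf and return that leaf's average cost."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if customer[feature] < threshold else right
    return tree

tree = build_tree(CUSTOMERS)
young_male = predict(tree, (1, 20))     # (is_male, age)
older_female = predict(tree, (0, 40))
```

With this made-up training set, the young male customer ends up with a higher estimate than the older female customer, which matches the pattern described above. A real pricing model would need far more data and far more care than this.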

 

Recommending movies. Let’s look at a more in-depth example. As you watch and rate more and more movies on Netflix, the recommendations for your next movie to watch get better and better. If you were building Netflix, what data would you collect and use to recommend movies to your users?

 

Maybe you’d categorize each movie as comedy, horror, sci-fi, and so on, then keep track of which genres a user rates most highly. If I tend to rate sci-fi and westerns highly, I would receive recommendations for movies from those genres.

 

But that’s tricky because I’ll watch a Keanu Reeves movie regardless of genre (because Keanu’s acting skills transcend genre). So maybe we should keep track of even more metrics for each movie – genre, actors, running time, and language – and use those to find similar movies to recommend to our users.

 

But how do you weigh all those factors for each user? Is the fact that I like sci-fi or that I like Keanu Reeves more important? Or maybe the fact that I can’t make it through a two-hour movie without falling asleep? Or that I hate reading subtitles? The amount of data we could compile is immense, and defining a user as a sum of her favorite genres and actors and penchants for napping would be an extremely complex task.
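Here is a rough sketch of what that hand-weighted approach might look like in Python. The movies, the `PROFILE` dictionary, and every number in it are invented for illustration. Notice that someone has to pick all of those weights by hand for each user, which is exactly the part that gets out of control.

```python
# Invented movies and attributes; none of these come from a real catalog.
MOVIES = {
    "Movie A": {"genre": "sci-fi", "actors": {"Keanu Reeves"}, "minutes": 100, "subtitled": False},
    "Movie B": {"genre": "comedy", "actors": {"Keanu Reeves"}, "minutes": 95, "subtitled": False},
    "Movie C": {"genre": "sci-fi", "actors": {"Someone Else"}, "minutes": 150, "subtitled": True},
}

# Hand-picked weights for one user. Every value here is a guess.
PROFILE = {
    "genre_weights": {"sci-fi": 1.0, "western": 1.0},
    "actor_weights": {"Keanu Reeves": 2.0},
    "max_minutes": 120,          # falls asleep after this
    "long_movie_penalty": 1.5,
    "subtitle_penalty": 1.0,
}

def score(movie, profile):
    """Add up the weights for things the user likes, subtract the penalties."""
    total = profile["genre_weights"].get(movie["genre"], 0.0)
    total += sum(profile["actor_weights"].get(a, 0.0) for a in movie["actors"])
    if movie["minutes"] > profile["max_minutes"]:
        total -= profile["long_movie_penalty"]
    if movie["subtitled"]:
        total -= profile["subtitle_penalty"]
    return total

# Best match first.
ranked = sorted(MOVIES, key=lambda title: score(MOVIES[title], PROFILE), reverse=True)
```

It works for three movies and one user, but every new attribute means another weight to guess, multiplied by every user on the site.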

 

What data should you collect?

 

When I think about real life, the most successful movie recommendations I get come from my sister. We like a few of the same actors and genres, but mostly we just have very similar tastes. Instead of receiving recommendations based on those rigid, pre-defined categories and factors above, what if I was just recommended movies that my sister likes?

 

Let’s pretend Netflix only has a handful of users and movies. Below are users’ ratings for all the movies they’ve watched (if blank, they haven’t seen it).

 

What do we recommend for Rachel to watch next? Looking at our other users we can tell: 

 

- Casey and Rachel have watched five of the same movies.

- Bob and Rachel have only watched three of the same movies.

 

- Casey and Rachel have given those five movies similar ratings.

- Bob and Rachel have given those three movies very different ratings.

 

Based on our data, Casey is the most similar user to Rachel (in machine learning speak, she’s Rachel’s nearest neighbor). So, what do we recommend to Rachel? Movies that Casey rated highly that Rachel hasn’t seen yet. Rachel hasn’t seen Scream or Psycho, and Casey liked Psycho better out of the two. So next time Rachel logs in, she’ll see a big flashy banner saying “Watch Psycho! You’ll love it!”
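The nearest-neighbor idea can be sketched in a few lines of Python. The ratings below are invented stand-ins chosen only to match the description above (five shared movies with Casey, three with Bob, and Casey preferring Psycho to Scream); they are not the article's actual numbers. Measuring similarity as the average rating gap on shared movies is also just one simple assumption among many possible choices.

```python
# Invented ratings on a 1-5 scale. A missing title means the user hasn't seen it.
RATINGS = {
    "Rachel": {"Movie A": 5, "Movie B": 4, "Movie C": 2, "Movie D": 5, "Movie E": 1},
    "Casey": {"Movie A": 5, "Movie B": 4, "Movie C": 1, "Movie D": 4, "Movie E": 2,
              "Scream": 3, "Psycho": 5},
    "Bob": {"Movie A": 1, "Movie C": 5, "Movie E": 4, "Scream": 5},
}

def distance(a, b):
    """Average rating gap over the movies both users have rated (lower = more alike)."""
    shared = RATINGS[a].keys() & RATINGS[b].keys()
    if not shared:
        return float("inf")
    return sum(abs(RATINGS[a][m] - RATINGS[b][m]) for m in shared) / len(shared)

def nearest_neighbor(user):
    """The other user whose ratings are closest to this user's."""
    others = (name for name in RATINGS if name != user)
    return min(others, key=lambda other: distance(user, other))

def recommend(user):
    """The neighbor's highest-rated movie that this user hasn't seen yet."""
    neighbor = nearest_neighbor(user)
    unseen = {m: r for m, r in RATINGS[neighbor].items() if m not in RATINGS[user]}
    return max(unseen, key=unseen.get) if unseen else None

neighbor = nearest_neighbor("Rachel")
suggestion = recommend("Rachel")
```

With these stand-in ratings, Rachel's nearest neighbor is Casey and the suggestion is Psycho. Note that this distance ignores how many movies two users share, so a fuller version would probably also want to reward larger overlaps.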

 

Rather than finding similar movies to the ones Rachel likes, we’ll find similar users to Rachel and see what movies they like. If we keep adding more and more users, we can run our machine learning algorithms to keep an eye on users’ nearest neighbors as we accumulate more data.

And if we add a whole lot of users, they might start to form clusters that we can use to build categories of tastes (k-means clustering, in machine learning speak).
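For the curious, here is a bare-bones sketch of k-means in Python. It assumes each user has been boiled down to two invented numbers (average horror rating, average comedy rating), and it takes the starting centroids as an argument to keep the run repeatable. Real implementations usually pick starting points randomly and run on many more dimensions.

```python
from math import dist
from statistics import mean

# Invented users as (average horror rating, average comedy rating).
USERS = [(5, 1), (4, 2), (5, 2), (1, 5), (2, 4), (1, 4)]

def k_means(points, centroids, iterations=10):
    """Repeatedly assign points to the nearest centroid, then re-center."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the middle of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(mean(axis) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two clusters, seeded with the first horror fan and the first comedy fan.
centroids, clusters = k_means(USERS, [USERS[0], USERS[3]])
```

On this tiny data set, the users settle into a horror-leaning group and a comedy-leaning group, which is the kind of taste category the clustering is meant to surface.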


Go forth and conquer!

 

These examples are only the tip of the machine learning iceberg, intended to spark thought about what can be accomplished with machine learning and how. If you’re not yet capturing data in your business, now is the time to start capturing as much of it as you reasonably can and building up a large training set. Once you have your data, there are an enormous number of ways machine learning can help you find patterns and make predictions or recommendations.

 
