Machine Learning in Ruby

Alexey Vasiliev, Railsware

What is Machine Learning?

Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.

Machine learning focuses on the development of computer programs that can change when exposed to new data. The process is similar to data mining: both search through data to look for patterns. However, instead of extracting data for human comprehension, as data mining applications do, machine learning uses the detected patterns to adjust program actions accordingly.

You (probably) don't need Machine Learning

If you and/or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project(s)

Machine Learning Areas

Practical Machine Learning Problems

  • Spam/Fraud detection
  • Digit Recognition
  • Speech Understanding/Face Detection
  • Product Recommendation
  • Medical Diagnosis
  • Customer Segmentation
  • Autonomous ("self-driving") vehicles

Ruby Ruby Ruby!

Classifier Reborn

Classifier Reborn is a general classifier module that provides Bayesian and Latent Semantic Indexer (LSI) classifiers

gem install classifier-reborn

Classifier Reborn - Bayesian Classifiers


require 'classifier-reborn'
classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
classifier.train "Interesting",
   "Here are some good words. I hope you love them."
classifier.train "Uninteresting",
   "Here are some bad words, I hate you."
classifier.classify "I hate bad words and you." # => "Uninteresting"
classifier.classify "I love" #=> 'Interesting'
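
To show roughly what happens behind a Bayesian classifier, here is a minimal pure-Ruby sketch of multinomial naive Bayes with add-one smoothing. The TinyBayes class and its tokenizer are hypothetical names invented for this illustration; this is not how classifier-reborn is implemented, only the same general idea.

```ruby
# Minimal naive Bayes sketch. TinyBayes is a hypothetical class for
# illustration only; it is not part of the classifier-reborn gem.
class TinyBayes
  def initialize(*categories)
    @word_counts = categories.to_h { |category| [category, Hash.new(0)] }
    @doc_counts  = Hash.new(0)
  end

  # Count every word of the text under the given category.
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each { |word| @word_counts[category][word] += 1 }
  end

  # Pick the category with the highest log prior + log likelihood.
  def classify(text)
    vocabulary_size = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.sum
    words = tokenize(text)

    @word_counts.keys.max_by do |category|
      counts = @word_counts[category]
      total_words = counts.values.sum
      prior = Math.log((@doc_counts[category] + 1.0) / (total_docs + @word_counts.size))
      # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
      words.sum(prior) do |word|
        Math.log((counts[word] + 1.0) / (total_words + vocabulary_size))
      end
    end
  end

  private

  # Naive tokenizer (an assumption): lowercase words, no stemming or stop words.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyBayes.new('Interesting', 'Uninteresting')
classifier.train 'Interesting', 'Here are some good words. I hope you love them.'
classifier.train 'Uninteresting', 'Here are some bad words, I hate you.'
puts classifier.classify('I hate bad words and you.') # prints Uninteresting
puts classifier.classify('I love')                    # prints Interesting
```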
    

Classifier Reborn - Latent Semantic Indexer (LSI)


require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new
strings = [["This text deals with dogs. Dogs.", :dog],
           ["This text involves dogs too. Dogs!", :dog],
           ["This text revolves around cats. Cats.", :cat],
           ["This text also involves cats. Cats!", :cat],
           ["This text involves birds. Birds.", :bird]]
strings.each { |x| lsi.add_item x.first, x.last }
lsi.classify "This text is also about dogs!" #=> :dog
    

Decision Tree

Decision Tree is a Ruby library that implements the ID3 (information gain) algorithm for decision tree learning

gem install decisiontree

Decision Tree

require 'decisiontree'
attributes = ['Temperature']
training = [
  [36.6, 'healthy'],
  [37, 'sick'],
  [38, 'sick'],
  [36.7, 'healthy'],
  [40, 'sick'],
  [50, 'really sick'],
]
    

Decision Tree


# Instantiate the tree and train it on the data ('sick' is the default value)
dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'sick', :continuous)
dec_tree.train

test = [37, 'sick']
decision = dec_tree.predict(test)
puts "Predicted: #{decision} ... True decision: #{test.last}"

# => Predicted: sick ... True decision: sick
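
The information gain criterion behind ID3 can be sketched in plain Ruby. The helper names below (entropy, information_gain, best_threshold) are made up for this illustration and are not the decisiontree gem's API; the sketch only shows how a split threshold on a continuous attribute might be chosen, assuming candidate thresholds are midpoints between sorted values.

```ruby
# Shannon entropy (in bits) of a list of class labels.
def entropy(labels)
  labels.tally.values.sum do |count|
    prob = count.to_f / labels.size
    -prob * Math.log2(prob)
  end
end

# Information gain from splitting rows ([value, label] pairs) at a threshold:
# parent entropy minus the size-weighted entropy of the two halves.
def information_gain(rows, threshold)
  below, above = rows.partition { |value, _| value < threshold }
  parent = entropy(rows.map(&:last))
  children = [below, above].sum do |part|
    part.size.to_f / rows.size * entropy(part.map(&:last))
  end
  parent - children
end

# Try the midpoints between neighbouring distinct values, keep the best one.
def best_threshold(rows)
  values = rows.map(&:first).uniq.sort
  candidates = values.each_cons(2).map { |a, b| (a + b) / 2.0 }
  candidates.max_by { |threshold| information_gain(rows, threshold) }
end

training = [
  [36.6, 'healthy'], [37, 'sick'], [38, 'sick'],
  [36.7, 'healthy'], [40, 'sick'], [50, 'really sick']
]

threshold = best_threshold(training)
puts "Best split: temperature < #{threshold.round(2)}"
puts "Information gain: #{information_gain(training, threshold).round(3)} bits"
```

On this data the best split falls between 36.7 and 37, which separates every 'healthy' row from the rest.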
    

K-nearest neighbors

K Nearest Neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions)

gem install knn

K-nearest neighbors

require 'knn'
data = Array.new(10000) { Array.new(4) { rand } }
knn = KNN.new(data)
knn.nearest_neighbours([0.5, 0.5, 0.5, 0.5], 2)
# [[4929, 0.027298057904151424,
#    [0.5033650144041532, 0.5127064912412195,
#     0.5229515382673083, 0.49324480830032635]],
#  [8060, 0.08873704527823544,
#    [0.553585611436454, 0.5318254655421701,
#     0.45424417942626927, 0.4564524388933011]]]
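
For intuition, a brute-force version of the same idea fits in a few lines of plain Ruby. The sketch below is not the knn gem's implementation; the function names and the tiny labelled data set are assumptions made for illustration. It returns [index, distance, point] triples like the output above and adds a majority vote for classification.

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Returns the k closest points as [index, distance, point] triples,
# sorted from nearest to farthest (brute force over every point).
def nearest_neighbours(data, query, k)
  data.each_with_index
      .map { |point, index| [index, euclidean(point, query), point] }
      .min_by(k) { |_, distance, _| distance }
end

# Classify a query by majority vote among its k nearest labelled points.
def knn_classify(points, labels, query, k)
  nearest_neighbours(points, query, k)
    .map { |index, _, _| labels[index] }
    .tally
    .max_by { |_, votes| votes }
    .first
end

points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels = %i[small small small large large large]

p nearest_neighbours(points, [8, 8], 1)    # => [[3, 0.0, [8, 8]]]
p knn_classify(points, labels, [2, 2], 3)  # => :small
```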
    

K-nearest neighbors - Distance Measures

a, b = [1,1], [2,2]
a.euclidean_distance(b)
# 1.4142135623730951
a.cosine_similarity(b)
# 0.9999999999999998
a.jaccard_index(b)
# 0.0
a.jaccard_distance(b)
# 1.0
a.binary_jaccard_index(b)
# 0.0
a.binary_jaccard_distance(b)
# 1.0
a.tanimoto_coefficient(b)
# 0.6666666666666666
a.haversine_distance(b)
# 157225.43636105652
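
A few of these measures are easy to write by hand, which helps explain the numbers above. This is a sketch using standalone functions rather than the gem's Array extensions, and the set-based reading of the Jaccard index is an assumption that happens to agree with the 0.0 shown above.

```ruby
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Straight-line distance between two points.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine of the angle between two vectors (1.0 means same direction).
def cosine_similarity(a, b)
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
end

# Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)
def tanimoto_coefficient(a, b)
  dot(a, b).to_f / (dot(a, a) + dot(b, b) - dot(a, b))
end

# Jaccard index, treating each array as a set of values (an assumption).
def jaccard_index(a, b)
  (a & b).size.to_f / (a | b).size
end

a, b = [1, 1], [2, 2]
p euclidean_distance(a, b)    # => 1.4142135623730951
p cosine_similarity(a, b)     # approximately 1.0: same direction
p tanimoto_coefficient(a, b)  # => 0.6666666666666666
p jaccard_index(a, b)         # => 0.0, no values in common
```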

Similarity

Similarity is a Ruby library for calculating the similarity between pieces of text using a Term Frequency-Inverse Document Frequency (TF-IDF) method

gem install similarity

Similarity

require 'similarity'
corpus = Corpus.new
doc1 = Document.new(content:
  "A document with a lot of additional words some of which are about chunky bacon")
doc2 = Document.new(content:
  "Another longer document with many words and again about chunky bacon")
doc3 = Document.new(content:
  "Some text that has nothing to do with pork products")
[doc1, doc2, doc3].each { |doc| corpus << doc }
    

Similarity

corpus.similar_documents(doc1).each do |doc, similarity|
 puts "Similarity between doc #{doc1.id} and doc #{doc.id} is #{similarity}"
end

Similarity between doc 70205829269140 and doc 70205829269140 is 1.0000000000000002
Similarity between doc 70205829269140 and doc 70205829206760 is 0.06068602112714361
Similarity between doc 70205829269140 and doc 70205829156640 is 0.04882114791611662
		

Similarity

corpus.similarity_matrix

[  1.000e+00  6.069e-02  4.882e-02
   6.069e-02  1.000e+00  7.359e-02
   4.882e-02  7.359e-02  1.000e+00 ]
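
The TF-IDF idea itself can be sketched without the gem. The code below uses one common weighting (term frequency times the natural log of inverse document frequency) and a naive tokenizer; both are assumptions, so the numbers will not match the gem's output above, but the ranking of the documents comes out the same way.

```ruby
# Naive tokenizer (an assumption): lowercase alphabetic words only.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Build one TF-IDF vector (Hash of term => weight) per document.
def tfidf_vectors(texts)
  docs = texts.map { |text| tokenize(text) }
  doc_freq = Hash.new(0)
  docs.each { |terms| terms.uniq.each { |term| doc_freq[term] += 1 } }

  docs.map do |terms|
    terms.tally.to_h do |term, count|
      tf  = count.to_f / terms.size
      idf = Math.log(docs.size.to_f / doc_freq[term])
      [term, tf * idf]
    end
  end
end

# Cosine similarity between two sparse vectors stored as hashes.
def cosine(u, v)
  dot = u.sum { |term, weight| weight * v.fetch(term, 0.0) }
  norm = ->(vec) { Math.sqrt(vec.values.sum { |w| w * w }) }
  denominator = norm.call(u) * norm.call(v)
  denominator.zero? ? 0.0 : dot / denominator
end

texts = [
  'A document with a lot of additional words some of which are about chunky bacon',
  'Another longer document with many words and again about chunky bacon',
  'Some text that has nothing to do with pork products'
]

vectors = tfidf_vectors(texts)
vectors.each_with_index do |vector, index|
  puts "Similarity between doc 0 and doc #{index} is #{cosine(vectors[0], vector).round(4)}"
end
```

Words that appear in every document (here "with") get an IDF of zero, so they contribute nothing to the similarity.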
		

K-Means clustering

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

Gems: KMeansClusterer, KMeans
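
The partitioning described above is usually computed with Lloyd's algorithm: assign each point to its nearest centroid, recompute each centroid as the mean of its points, repeat until nothing changes. Below is a minimal pure-Ruby sketch. Seeding the centroids with the first k points is a simplifying assumption made here for determinism; real libraries typically use random or k-means++ seeding and several runs.

```ruby
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Coordinate-wise mean of a list of points.
def mean_point(points)
  points.transpose.map { |coords| coords.sum.to_f / coords.size }
end

# Returns [centroids, assignments]; assignments[i] is the cluster of points[i].
def kmeans(points, k, max_iterations: 100)
  centroids = points.first(k) # naive seeding, an assumption of this sketch
  assignments = nil

  max_iterations.times do
    # Assignment step: nearest centroid for every point.
    new_assignments = points.map do |point|
      (0...k).min_by { |c| squared_distance(point, centroids[c]) }
    end
    break if new_assignments == assignments # converged
    assignments = new_assignments

    # Update step: move each centroid to the mean of its members.
    centroids = (0...k).map do |c|
      members = points.each_index.select { |i| assignments[i] == c }.map { |i| points[i] }
      members.empty? ? centroids[c] : mean_point(members)
    end
  end

  [centroids, assignments]
end

data = [[40.71, -74.01], [34.05, -118.24], [39.29, -76.61],
        [45.52, -122.68], [38.9, -77.04], [36.11, -115.17]]

centroids, assignments = kmeans(data, 2)
p assignments # => [0, 1, 0, 1, 0, 1]
p centroids.map { |c| c.map { |v| v.round(2) } }
```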

K-Means clustering - KMeansClusterer

require 'kmeans-clusterer'
data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
        [45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]
labels = ['New York', 'Los Angeles', 'Baltimore',
          'Portland', 'Washington DC', 'Las Vegas']

k = 2 # find 2 clusters in data

kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5

K-Means clustering - KMeansClusterer

kmeans.clusters.each do |cluster|
  puts  cluster.id.to_s + '. ' +
        cluster.points.map(&:label).join(", ") + "\t" +
        cluster.centroid.to_s
end
# 0. Baltimore, Washington DC, New York	[39.63333333333333, -75.88666666666667]
# 1. Las Vegas, Los Angeles, Portland	[38.559999999999995, -118.69666666666667]

puts kmeans.predict [[41.85,-87.65]] # Chicago
# [0], i.e. cluster 0 (Baltimore, Washington DC, New York)

K-Means clustering - KMeans

require 'k_means'
data = [[1,1], [1,2], [1,1],
  [800, 800], [1000, 1000], [500, 500]]
KMeans.new(data, centroids: 2)
# [[0, 1, 2], [3, 4, 5]]
KMeans.new(data, centroids: 2, distance_measure: :jaccard_index)
# [[0, 1, 2, 3, 4, 5], []]
KMeans.new(data, centroids: 2, distance_measure: :haversine_distance)
# [[4], [0, 1, 2, 3, 5]]

Artificial Neural Networks

...a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.

Dr. Robert Hecht-Nielsen

Artificial Neural Networks

Artificial Neural Networks

require 'cerebrum'
network = Cerebrum.new

network.train([
  {input: [0, 0], output: [0]},
  {input: [0, 1], output: [1]},
  {input: [1, 0], output: [1]},
  {input: [1, 1], output: [0]}
])

result = network.run([1, 0])
# => [0.9333206724219677]
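
What network.run does conceptually is a forward pass: each "simple processing element" takes a weighted sum of its inputs, adds a bias, and squashes the result with an activation function. The sketch below is plain Ruby, not Cerebrum's code, and the weights are hand-picked for illustration rather than learned by training. It wires three such neurons into a network that computes XOR.

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# One layer: each neuron is a [weights, bias] pair and
# outputs sigmoid(weights . inputs + bias).
def layer_forward(inputs, neurons)
  neurons.map do |weights, bias|
    sigmoid(weights.zip(inputs).sum { |w, x| w * x } + bias)
  end
end

# A 2-2-1 network with hand-picked weights (an assumption of this sketch).
def xor_network(inputs)
  hidden = [[[20.0, 20.0], -10.0],   # behaves like OR
            [[-20.0, -20.0], 30.0]]  # behaves like NAND
  output = [[[20.0, 20.0], -30.0]]   # behaves like AND
  layer_forward(layer_forward(inputs, hidden), output)
end

[[0, 0], [0, 1], [1, 0], [1, 1]].each do |input|
  puts "#{input.inspect} -> #{xor_network(input).map(&:round).inspect}"
end
# [0, 0] -> [0]
# [0, 1] -> [1]
# [1, 0] -> [1]
# [1, 1] -> [0]
```

Training, as in the Cerebrum example, is the process of finding such weights automatically from the input/output pairs.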

I need more Ruby gems!

OpenCV

OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library

Gems: Ruby-opencv

Apache Mahout

The Apache Mahout project's goal is to build an environment for quickly creating scalable, performant machine learning applications

Gems: JRuby Mahout

Apache PredictionIO

Apache PredictionIO (incubating) is an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task

TensorFlow

TensorFlow is an open source software library for numerical computation using data flow graphs

Playground

Gems: Tensorflow.rb

Conclusion

  • Data management should be your first step before diving into any other data project(s)
  • Ruby is not a bad choice for ML
  • For large amounts of data, it is better to run the ML system as a separate service (or even a cluster)

Thank you! Questions?

Contact information
