Alexey Vasiliev, Railsware
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed
Machine learning focuses on developing computer programs that can change when exposed to new data. The process is similar to data mining: both search through data for patterns. However, instead of extracting data for human comprehension, as data mining applications do, machine learning uses the detected patterns to adjust program actions accordingly
If you or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project
Classifier Reborn is a general classifier module that provides a Bayesian classifier and a Latent Semantic Indexer (LSI)
gem install classifier-reborn
require 'classifier-reborn'
classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
classifier.train "Interesting",
"Here are some good words. I hope you love them."
classifier.train "Uninteresting",
"Here are some bad words, I hate you."
classifier.classify "I hate bad words and you." # => "Uninteresting"
classifier.classify "I love" #=> 'Interesting'
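To make the idea behind the Bayesian classifier concrete, here is a minimal pure-Ruby sketch of multinomial naive Bayes with add-one smoothing. It is not how classifier-reborn is implemented; the class name `TinyBayes` and its tokenizer are assumptions made for illustration, and the code assumes Ruby 2.4 or newer for `Array#sum`.

```ruby
# Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
# TinyBayes is a hypothetical name, not part of classifier-reborn.
class TinyBayes
  def initialize(*categories)
    @word_counts = {}
    categories.each { |category| @word_counts[category] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocabulary = {}
  end

  # Count every word of the text under the given category.
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the category with the highest log posterior score.
  def classify(text)
    words = tokenize(text)
    total_docs = @doc_counts.values.sum.to_f
    @word_counts.keys.max_by do |category|
      counts = @word_counts[category]
      total_words = counts.values.sum
      log_prior = Math.log(@doc_counts[category] / total_docs)
      log_likelihood = words.sum { |word|
        Math.log((counts[word] + 1.0) / (total_words + @vocabulary.size))
      }
      log_prior + log_likelihood
    end
  end

  private

  # Assumed tokenizer: lowercase words only, no stemming or stop words.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

bayes = TinyBayes.new('Interesting', 'Uninteresting')
bayes.train 'Interesting', 'Here are some good words. I hope you love them.'
bayes.train 'Uninteresting', 'Here are some bad words, I hate you.'
puts bayes.classify('I hate bad words and you.')
puts bayes.classify('I love')
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps unseen words from zeroing out a category.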
lsi = ClassifierReborn::LSI.new
strings = [["This text deals with dogs. Dogs.", :dog],
["This text involves dogs too. Dogs!", :dog],
["This text revolves around cats. Cats.", :cat],
["This text also involves cats. Cats!", :cat],
["This text involves birds. Birds.", :bird]]
strings.each { |x| lsi.add_item x.first, x.last }
lsi.classify "This text is also about dogs!" #=> :dog
Decision Tree is a Ruby library that implements the ID3 (information gain) algorithm for decision tree learning
gem install decisiontree
require 'decisiontree'
attributes = ['Temperature']
training = [
[36.6, 'healthy'],
[37, 'sick'],
[38, 'sick'],
[36.7, 'healthy'],
[40, 'sick'],
[50, 'really sick'],
]
# Instantiate the tree and train it on the data (the default label is 'sick')
dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'sick', :continuous)
dec_tree.train
test = [37, 'sick']
decision = dec_tree.predict(test)
puts "Predicted: #{decision} ... True decision: #{test.last}"
# => Predicted: sick ... True decision: sick
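The information gain criterion that ID3 relies on can be sketched in a few lines of plain Ruby. The helper names below (`entropy`, `information_gain`, `best_threshold`) are assumptions for illustration and are not the decisiontree gem's API; the midpoint-based threshold search is one common way to handle a continuous attribute, not necessarily the one the gem uses.

```ruby
# Shannon entropy (in bits) of a list of class labels.
def entropy(labels)
  total = labels.size.to_f
  labels.group_by(&:itself).values.sum do |group|
    p = group.size / total
    -p * Math.log2(p)
  end
end

# Information gain from splitting [value, label] rows at a threshold.
def information_gain(rows, threshold)
  below, above = rows.partition { |value, _| value < threshold }
  return 0.0 if below.empty? || above.empty?

  remainder = [below, above].sum { |part|
    (part.size.to_f / rows.size) * entropy(part.map(&:last))
  }
  entropy(rows.map(&:last)) - remainder
end

# Candidate thresholds are midpoints between consecutive distinct values.
def best_threshold(rows)
  values = rows.map(&:first).uniq.sort
  candidates = values.each_cons(2).map { |a, b| (a + b) / 2.0 }
  candidates.max_by { |t| information_gain(rows, t) }
end

training = [[36.6, 'healthy'], [37, 'sick'], [38, 'sick'],
            [36.7, 'healthy'], [40, 'sick'], [50, 'really sick']]

threshold = best_threshold(training)
gain = information_gain(training, threshold)
puts "Best split: Temperature < #{threshold.round(2)} (gain #{gain.round(3)})"
```

On this data the best split falls between 36.7 and 37, which separates every 'healthy' row from the rest.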
K Nearest Neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions)
gem install knn
require 'knn'
data = Array.new(10000) { Array.new(4) { rand } }
knn = KNN.new(data)
knn.nearest_neighbours([0.5, 0.5, 0.5, 0.5], 2)
# [[4929, 0.027298057904151424,
#   [0.5033650144041532, 0.5127064912412195,
#    0.5229515382673083, 0.49324480830032635]],
#  [8060, 0.08873704527823544,
#   [0.553585611436454, 0.5318254655421701,
#    0.45424417942626927, 0.4564524388933011]]]
a, b = [1,1], [2,2]
a.euclidean_distance(b)
# 1.4142135623730951
a.cosine_similarity(b)
# 0.9999999999999998
a.jaccard_index(b)
# 0.0
a.jaccard_distance(b)
# 1.0
a.binary_jaccard_index(b)
# 0.0
a.binary_jaccard_distance(b)
# 1.0
a.tanimoto_coefficient(b)
# 0.6666666666666666
a.haversine_distance(b)
# 157225.43636105652
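For readers who want to see what happens underneath, the following is a small pure-Ruby sketch of two of the distance measures and of KNN classification by majority vote. The function names and the toy `samples` data are assumptions for illustration; they are not the knn gem's API. `Array#sum` requires Ruby 2.4 or newer.

```ruby
# Straight-line distance between two equal-length numeric vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine of the angle between two vectors (close to 1.0 means same direction).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Classify a query point by majority vote among its k nearest samples.
# `samples` is an array of [vector, label] pairs.
def knn_classify(samples, query, k)
  nearest = samples.min_by(k) { |vector, _| euclidean_distance(vector, query) }
  nearest.map(&:last).group_by(&:itself).max_by { |_, votes| votes.size }.first
end

# Hypothetical toy data: two well-separated groups of points.
samples = [[[1.0, 1.0], :small], [[1.5, 2.0], :small], [[2.0, 1.0], :small],
           [[8.0, 8.0], :large], [[9.0, 7.5], :large], [[8.5, 9.0], :large]]

puts knn_classify(samples, [2.0, 2.0], 3)
puts knn_classify(samples, [7.0, 8.0], 3)
puts euclidean_distance([1, 1], [2, 2])
```

This brute-force version scans every sample per query, which is fine for small data sets but slow for large ones.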
Similarity is a Ruby library for calculating the similarity between pieces of text using the Term Frequency-Inverse Document Frequency (TF-IDF) method
gem install similarity
require 'similarity'
corpus = Corpus.new
doc1 = Document.new(content:
"A document with a lot of additional words some of which are about chunky bacon")
doc2 = Document.new(content:
"Another longer document with many words and again about chunky bacon")
doc3 = Document.new(content:
"Some text that has nothing to do with pork products")
[doc1, doc2, doc3].each { |doc| corpus << doc }
corpus.similar_documents(doc1).each do |doc, similarity|
puts "Similarity between doc #{doc1.id} and doc #{doc.id} is #{similarity}"
end
# Similarity between doc 70205829269140 and doc 70205829269140 is 1.0000000000000002
# Similarity between doc 70205829269140 and doc 70205829206760 is 0.06068602112714361
# Similarity between doc 70205829269140 and doc 70205829156640 is 0.04882114791611662
corpus.similarity_matrix
# [ 1.000e+00 6.069e-02 4.882e-02
#   6.069e-02 1.000e+00 7.359e-02
#   4.882e-02 7.359e-02 1.000e+00 ]
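The TF-IDF idea itself fits in a short pure-Ruby sketch: weight each term by how often it occurs in a document and how rare it is across the corpus, then compare documents by the cosine of their weight vectors. The helper names are assumptions for illustration, and the weighting used here (raw term frequency times `log(N / df)`) is one textbook variant, so the numbers will differ from those the similarity gem prints.

```ruby
# Split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Build one TF-IDF vector (term => weight) per document.
def tfidf_vectors(texts)
  docs = texts.map { |text| tokenize(text) }
  doc_freq = Hash.new(0)
  docs.each { |words| words.uniq.each { |word| doc_freq[word] += 1 } }

  docs.map do |words|
    counts = Hash.new(0)
    words.each { |word| counts[word] += 1 }
    counts.each_with_object({}) do |(word, count), vector|
      tf  = count.to_f / words.size
      idf = Math.log(docs.size.to_f / doc_freq[word])
      vector[word] = tf * idf
    end
  end
end

# Cosine similarity between two sparse vectors stored as hashes.
def cosine_similarity(a, b)
  dot = a.sum { |word, weight| weight * b.fetch(word, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denominator = norm.call(a) * norm.call(b)
  denominator.zero? ? 0.0 : dot / denominator
end

texts = [
  'A document with a lot of additional words some of which are about chunky bacon',
  'Another longer document with many words and again about chunky bacon',
  'Some text that has nothing to do with pork products'
]
d1, d2, d3 = tfidf_vectors(texts)
puts "doc1 vs doc2: #{cosine_similarity(d1, d2).round(4)}"
puts "doc1 vs doc3: #{cosine_similarity(d1, d3).round(4)}"
puts "doc2 vs doc3: #{cosine_similarity(d2, d3).round(4)}"
```

With this variant a word that appears in every document, such as "with", gets a weight of zero, so the second and third documents share no weighted terms at all.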
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster
Gems: KMeansClusterer, KMeans
require 'kmeans-clusterer'
data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
[45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]
labels = ['New York', 'Los Angeles', 'Baltimore',
'Portland', 'Washington DC', 'Las Vegas']
k = 2 # find 2 clusters in data
kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5
kmeans.clusters.each do |cluster|
puts cluster.id.to_s + '. ' +
cluster.points.map(&:label).join(", ") + "\t" +
cluster.centroid.to_s
end
# 0. Baltimore, Washington DC, New York [39.63333333333333, -75.88666666666667]
# 1. Las Vegas, Los Angeles, Portland [38.559999999999995, -118.69666666666667]
puts kmeans.predict [[41.85,-87.65]] # Chicago
# [0] means cluster 0 (Baltimore, Washington DC, New York)
require 'k_means'
data = [[1,1], [1,2], [1,1],
[800, 800], [1000, 1000], [500, 500]]
KMeans.new(data, centroids: 2)
# [[0, 1, 2], [3, 4, 5]]
KMeans.new(data, centroids: 2, distance_measure: :jaccard_index)
# [[0, 1, 2, 3, 4, 5], []]
KMeans.new(data, centroids: 2, distance_measure: :haversine_distance)
# [[4], [0, 1, 2, 3, 5]]
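The assign-then-update loop at the heart of k-means (often called Lloyd's algorithm) can be sketched in plain Ruby as below. This is an illustration only and not the implementation of either gem: the function names are assumptions, and seeding the centroids with the first two points is an arbitrary deterministic choice made for the sketch, whereas real libraries typically use random or k-means++ initialization with several runs.

```ruby
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Component-wise mean of a list of points.
def mean(points)
  points.transpose.map { |coords| coords.sum / coords.size.to_f }
end

# Index of the centroid closest to the point.
def nearest_centroid(point, centroids)
  centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
end

# Lloyd's algorithm: alternate assignment and centroid update until stable.
# Returns [centroids, assignments].
def kmeans(points, centroids, max_iterations: 100)
  assignments = nil
  max_iterations.times do
    new_assignments = points.map { |point| nearest_centroid(point, centroids) }
    break if new_assignments == assignments

    assignments = new_assignments
    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| assignments[j] == i }
                      .map { |j| points[j] }
      members.empty? ? centroids[i] : mean(members)
    end
  end
  [centroids, assignments]
end

data = [[40.71, -74.01], [34.05, -118.24], [39.29, -76.61],
        [45.52, -122.68], [38.9, -77.04], [36.11, -115.17]]
labels = ['New York', 'Los Angeles', 'Baltimore',
          'Portland', 'Washington DC', 'Las Vegas']

centroids, assignments = kmeans(data, data.first(2))
centroids.each_with_index do |centroid, i|
  members = labels.each_index.select { |j| assignments[j] == i }
                  .map { |j| labels[j] }
  puts "#{i}. #{members.join(', ')}\t#{centroid.map { |c| c.round(2) }}"
end
puts nearest_centroid([41.85, -87.65], centroids) # Chicago
```

Treating latitude and longitude as plain Euclidean coordinates is a simplification that works for this toy data but ignores the curvature of the Earth.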
require 'cerebrum'
network = Cerebrum.new
network.train([
{input: [0, 0], output: [0]},
{input: [0, 1], output: [1]},
{input: [1, 0], output: [1]},
{input: [1, 1], output: [0]}
])
result = network.run([1, 0])
# => [0.9333206724219677]
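The XOR data above is a classic test because no single-layer network can represent it; one hidden layer is enough. As a rough sketch of what a trained network computes at run time, here is a forward pass through a 2-2-1 sigmoid network in plain Ruby. The weights are hand-picked for illustration rather than learned, and the function names are assumptions, not Cerebrum's API.

```ruby
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# One fully connected layer; each neuron is a [weights, bias] pair.
def layer(inputs, neurons)
  neurons.map do |weights, bias|
    sigmoid(weights.zip(inputs).sum { |w, x| w * x } + bias)
  end
end

# Feed the input through the hidden layer, then the output layer.
def run(input, hidden, output)
  layer(layer(input, hidden), output)
end

# Hand-picked (not trained) weights, assumed for illustration.
hidden = [[[20.0, 20.0], -10.0],   # behaves roughly like OR
          [[-20.0, -20.0], 30.0]]  # behaves roughly like NAND
output = [[[20.0, 20.0], -30.0]]   # roughly AND of the two hidden units

[[0, 0], [0, 1], [1, 0], [1, 1]].each do |input|
  puts "#{input.inspect} -> #{run(input, hidden, output).first.round(3)}"
end
```

Training, which this sketch leaves out, is the process of finding such weights automatically, typically by backpropagation.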
OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library
Gems: Ruby-opencv
The Apache Mahout project's goal is to build an environment for quickly creating scalable, performant machine learning applications
Gems: JRuby Mahout
Apache PredictionIO (incubating) is an open source machine learning server built on top of a state-of-the-art open source stack, which developers and data scientists can use to create predictive engines for any machine learning task
TensorFlow is an open source software library for numerical computation using data flow graphs
Gems: Tensorflow.rb