What Killed the Curse of Dimensionality?

How does Deep Learning overcome this hurdle in machine learning and why?

To begin we should clearly define the Curse of Dimensionality:

A phenomenon in which the sparsity of the data increases as its dimensionality increases.

Data has dimensionality. The more dimensions the data has, the more difficult it becomes to find patterns. Think of dimensionality as the range of movement of an animal you’re playing tag with. An animal that can only move on the ground can only go in 2 dimensions: left or right (x), forward or backward (y). It’s harder to catch a bird because the bird can move in 3 dimensions: left or right (x), forward or backward (y), up or down (z). We can imagine some mythical time-traveling beast that can move in 4 dimensions: left or right (x), forward or backward (y), up or down (z), past or future (t). The chase gets more difficult as the dimensions increase.

The same problem applies to data and machine learning. As the data’s dimensionality increases, the data becomes sparser, making it harder to ascertain a pattern. Traditional ML has ways around the curse of dimensionality, such as function smoothing and approximation. Deep Learning has overcome this “curse” by its inherent nature, which may be one of the contributors to its increased popularity.
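To make the sparsity claim concrete, here is a minimal sketch in plain Python. It assumes uniformly random points in a unit hypercube split into 4 bins per axis; the function name `occupied_fraction` and all of the numbers are illustrative choices, not something taken from the references below.

```python
import random

def occupied_fraction(n_points, n_dims, bins_per_dim=4, seed=0):
    """Fraction of grid cells that contain at least one random point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Map each coordinate in [0, 1) to the index of the bin it falls in.
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / float(bins_per_dim ** n_dims)

# Same number of points, more and more dimensions.
for dims in (1, 2, 4, 8):
    print(dims, occupied_fraction(1000, dims))
```

The number of cells grows as 4^d while the number of points stays at 1000, so by 8 dimensions at most 1000 of the 65536 cells can hold a point at all.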

Deep Learning

In high-dimensional applications, deep learning does not suffer the same consequences as other machine learning algorithms such as Linear Regression. This is part of the magic that makes modeling with Neural Networks so effective. Neural networks’ resistance to the curse of dimensionality is a helpful characteristic in today’s world of big data.

There are multiple theories on why this occurs. We will now quickly go over the main ideas of each:

Linear Manifold

Manifold Hypothesis [1]

At a high level the Manifold Hypothesis suggests that high dimensional data actually sits on a lower dimensional manifold embedded in higher dimensional space.

In a hand-wavy sense, this means that high dimensional data has some underlying lower dimensional pattern that deep learning methods are good at exploiting. Given a high dimensional matrix that represents images, neural networks excel at finding low dimensional features that are not apparent in the high dimensional representation.

Image from deep learning book. Manifold over high dimensional data

The image above represents a set of data points in a high dimensional space, drawn in 2-D for easy representation in this blog post. The image below it shows that there is some manifold in this high dimensional space where most of the data lies. Neural networks and Deep Learning methods exploit that structure and, in theory, find that manifold.

Visualization of activations from different neurons of a Convolution Neural Network

Sparse Coding [2]

This occurs as data flows through the different neurons in a network. Each neuron in the neural network has its own activation function. When only some of the neurons fire on a given input, the result is a sparse code.

The image above depicts the visualization of activations of a neural network. Each individual neuron is picking up on a different feature found from being fed an image. The representation above shows how the different neurons pick up on different features and each individually contribute to the final output of the network.

For example, the English language is composed of 26 letters, and these letters can make all the words of the language. In terms of neural networks, there is some number of neurons whose firing can be combined with the firing of other neurons to output the correct answer no matter the dimensionality of the inputs.


There’s no single proven theory that explains why Neural Networks overcome the curse of dimensionality. A lot of research is going on to understand the underlying characteristics that allow Deep Learning and Neural Networks to be so effective in practice. Until then, we have a good idea of what types of concepts are at work under the hood.

Further Reading:

Manifold Hypothesis [1]: https://www.ima.umn.edu/2008-2009/SW10.27-30.08/6687

Deep Learning Book [1]: http://www.deeplearningbook.org/version-2015-10-03/contents/manifolds.html

[1] & [2] Lecture from Partha Niyogi: https://www.ima.umn.edu/2008-2009/SW10.27-30.08/6687

Wikipedia [2]: https://en.wikipedia.org/wiki/Neural_coding#Sparse_coding

Writeup on Sparse Coding [2]: http://redwood.berkeley.edu/vs265/handout-sparse-08.pdf

Inspired by this reddit post: https://goo.gl/jJFtP1

Getting Started with Sonnet, DeepMind’s Deep Learning Library

An intro to installing Sonnet, mirroring what is on their GitHub with a little commentary.


DeepMind released a new library built on top of TensorFlow that abstracts building a network into simpler blocks. It is available at https://github.com/deepmind/sonnet and was released April 6th; it already has 3000 stars on GitHub (checked April 10th).


1. Install TensorFlow

I’m on Mac OS X with Python 2.7, so I used the command below. If you’re on another setup, check out https://www.tensorflow.org/install/

pip install --upgrade https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.0.1-py2-none-any.whl

2. Install Bazel

Bazel is Google’s build tool. On a Mac you can install it with Homebrew; for anything else, check out https://bazel.build/versions/master/docs/install.html

brew install bazel

3. Install Sonnet

Sonnet does not yet support Python 3. There is a promising pull request to set this up, but it is not there as of April 10th, 2017.

$ git clone --recursive https://github.com/deepmind/sonnet
$ cd sonnet/tensorflow
$ ./configure
$ cd ../

Configure will ask you some questions about what installation you would like.

Now we will run the install script:

$ mkdir /tmp/sonnet
$ bazel build --config=opt :install
$ ./bazel-bin/install /tmp/sonnet

After that completes (it took a couple of minutes for me):

$ pip install /tmp/sonnet/*.whl

Now to test everything worked 🙂

$ python
>>> import sonnet as snt
>>> import tensorflow as tf
>>> snt.resampler(tf.constant([0.]), tf.constant([0.]))

We should see:

<tf.Tensor 'resampler/Resampler:0' shape=(1,) dtype=float32>

Congratulations, you have now set up DeepMind’s Deep Learning library on top of TensorFlow.

TensorFlow in a Nutshell — Part Three: All the Models

The fast and easy guide to the most popular Deep Learning framework in the world.

Make sure to check out the other articles here.


In this installment we will go over the models that are currently easily available in TensorFlow, describe use cases for each model, and show simple sample code. Full sources of working examples are in the TensorFlow In a Nutshell repo.

A recurrent neural network

Recurrent Neural Networks

Use Cases: Language Modeling, Machine translation, Word embedding, Text processing.

Since the advent of Long Short Term Memory and Gated Recurrent Units, Recurrent Neural Networks have advanced by leaps and bounds past other models in natural language processing. They can be fed vectors representing characters and be trained to generate new sentences based on the training set. The merit of this model is that it keeps the context of the sentence and derives the meaning that “cat sat on the mat” means the cat is on the mat. Since the creation of TensorFlow, writing these networks has become increasingly simple. There are even hidden features, covered by Denny Britz here, that make writing RNNs even simpler. Here’s a quick example.

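import tensorflow as tf
import numpy as np

# Create input data: a batch of 2 sequences, 10 time steps, 8 features each
X = np.random.randn(2, 10, 8)

# The second example is of length 6, so zero out its padding steps
X[1, 6:] = 0
X_lengths = [10, 6]

cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[cell] * 4, state_is_tuple=True)

# sequence_length tells dynamic_rnn where each example really ends
outputs, last_states = tf.nn.dynamic_rnn(
    cell=cell,
    dtype=tf.float64,
    sequence_length=X_lengths,
    inputs=X)

# Evaluate the graph once and fetch the results as NumPy arrays
result = tf.contrib.learn.run_n(
    {"outputs": outputs, "last_states": last_states})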
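To see what passing sequence lengths to tf.nn.dynamic_rnn buys you, here is a small framework-free sketch. It assumes a toy cell that just keeps a running sum (the names `run_with_lengths` and `running_sum_cell` are made up for illustration). Like dynamic_rnn, it emits zeros past the end of each sequence and carries the last valid state through.

```python
def running_sum_cell(x, state):
    """Toy recurrent cell: the state is the sum of everything seen so far."""
    new_state = state + x
    return new_state, new_state  # (output, new state)

def run_with_lengths(batch, lengths, cell, initial_state=0.0):
    outputs, final_states = [], []
    for sequence, length in zip(batch, lengths):
        state = initial_state
        sequence_outputs = []
        for step, x in enumerate(sequence):
            if step < length:
                output, state = cell(x, state)
            else:
                output = 0.0  # padding step: zero output, state left untouched
            sequence_outputs.append(output)
        outputs.append(sequence_outputs)
        final_states.append(state)
    return outputs, final_states

# The second sequence has 2 real steps followed by 2 padding values.
batch = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 9.0, 9.0]]
outputs, final_states = run_with_lengths(batch, [4, 2], running_sum_cell)
print(outputs)
print(final_states)
```

The padding values (9.0) never reach the cell, so the final state of the second sequence reflects only its first two steps.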

Convolution Neural Network

Convolution Neural Networks

Use Cases: Image processing, Facial recognition, Computer Vision

Convolution Neural Networks are unique because they are designed with the assumption that the input will be an image. CNNs apply a sliding window function to a matrix. The window is called a kernel, and it slides across the image creating a convolved feature.

from http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

Creating a convolved feature allows for edge detection, which in turn allows a network to detect objects in pictures.

edge detection from GIMP manual

The kernel that produces this convolved feature looks like the matrix below:

Convolved feature from GIMP manual
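The sliding window described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `convolve2d`, the tiny 5x5 image and the Laplacian-style edge kernel are assumptions chosen for the example, and no padding or striding is handled.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply the window element-wise with the kernel and sum it up.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature.append(row)
    return feature

edge_kernel = [[0, 1, 0],
               [1, -4, 1],
               [0, 1, 0]]

# Dark on the left, bright on the right: a single vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
feature = convolve2d(image, edge_kernel)
print(feature)  # [[1, -1, 0], [1, -1, 0], [1, -1, 0]]
```

The output is non-zero only where the window straddles the edge and zero over the flat bright region, which is what makes this kind of kernel useful for edge detection.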

Here’s a sample of code to identify handwritten digits from the MNIST dataset.

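### Convolutional network
import tensorflow as tf
from tensorflow.contrib import learn  # provides learn.ops and learn.models used below

def max_pool_2x2(tensor_in):
  return tf.nn.max_pool(
      tensor_in, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

def conv_model(X, y):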
  # reshape X to 4d tensor with 2nd and 3rd dimensions being image width and
  # height final dimension being the number of color channels.
  X = tf.reshape(X, [-1, 28, 28, 1])
  # first conv layer will compute 32 features for each 5x5 patch
  with tf.variable_scope('conv_layer1'):
    h_conv1 = learn.ops.conv2d(X, n_filters=32, filter_shape=[5, 5],
                               bias=True, activation=tf.nn.relu)
    h_pool1 = max_pool_2x2(h_conv1)
  # second conv layer will compute 64 features for each 5x5 patch.
  with tf.variable_scope('conv_layer2'):
    h_conv2 = learn.ops.conv2d(h_pool1, n_filters=64, filter_shape=[5, 5],
                               bias=True, activation=tf.nn.relu)
    h_pool2 = max_pool_2x2(h_conv2)
    # reshape tensor into a batch of vectors
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
  # densely connected layer with 1024 neurons.
  h_fc1 = learn.ops.dnn(
      h_pool2_flat, [1024], activation=tf.nn.relu, dropout=0.5)
  return learn.models.logistic_regression(h_fc1, y)

Feed Forward Neural Networks

Use Cases: Classification and Regression

These networks consist of layers of perceptrons that take inputs and pass information on to the next layer. The last layer in the network produces the output. There are no connections between nodes within a given layer. A layer that neither receives the original input nor produces the final output is called a hidden layer.

The goal of this network is similar to that of other supervised neural networks trained with back propagation: to make inputs produce the desired trained outputs. These are some of the simplest effective neural networks for classification and regression problems. We will show how easy it is to create a feed forward network to classify handwritten digits:

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

def model(X, w_h, w_o):
    h = tf.nn.sigmoid(tf.matmul(X, w_h)) # this is a basic mlp, think 2 stacked logistic regressions
    return tf.matmul(h, w_o) # note that we dont take the softmax at the end because our cost fn does that for us

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels
X = tf.placeholder("float", [None, 784])
Y = tf.placeholder("float", [None, 10])
w_h = init_weights([784, 625]) # create symbolic variables
w_o = init_weights([625, 10])
py_x = model(X, w_h, w_o)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=py_x, labels=Y)) # compute costs
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost) # construct an optimizer
predict_op = tf.argmax(py_x, 1)
# Launch the graph in a session
with tf.Session() as sess:
    # you need to initialize all variables
    tf.global_variables_initializer().run()
    for i in range(100):
        # train on mini-batches of 128 examples
        for start, end in zip(range(0, len(trX), 128), range(128, len(trX)+1, 128)):
            sess.run(train_op, feed_dict={X: trX[start:end], Y: trY[start:end]})
        # report test accuracy after each pass over the training data
        print(i, np.mean(np.argmax(teY, axis=1) ==
                         sess.run(predict_op, feed_dict={X: teX, Y: teY})))
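For intuition, the forward pass of a network like this (one sigmoid hidden layer followed by a softmax) can be written out by hand. The sketch below uses plain Python lists and tiny hypothetical weight matrices; the helper names are my own, and training is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(x, weights):
    """Multiply row vector x (length n) by an n x m weight matrix."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, w_h, w_o):
    hidden = [sigmoid(v) for v in matvec(x, w_h)]  # input layer -> hidden layer
    return softmax(matvec(hidden, w_o))            # hidden layer -> class probabilities

# 2 inputs, 3 hidden units, 2 classes; all-zero weights give a uniform guess.
x = [1.0, 2.0]
w_h = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_o = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
probs = forward(x, w_h, w_o)
print(probs)  # [0.5, 0.5]
```

Information only moves forward from one layer to the next, and nothing inside a layer talks to its neighbours.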

Linear Models

Use Cases: Classification and Regression

Linear models take X values and produce a line of best fit used for classification and regression of Y values. For example, if you have a list of house sizes and their prices in a neighborhood, you can predict the price of a house given its size using a linear model.

One thing to note is that linear models can use multiple X features. In the housing example, we can fit a linear model on house size, number of rooms, number of bathrooms and price, and then predict the price of a house given its size, number of rooms and number of bathrooms.

import numpy as np
import tensorflow as tf
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=1)
    return tf.Variable(initial)
# dataset
xx = np.random.randint(0,1000,[1000,3])/1000.
yy = xx[:,0] * 2 + xx[:,1] * 1.4 + xx[:,2] * 3
# model
x = tf.placeholder(tf.float32, shape=[None, 3])
y_ = tf.placeholder(tf.float32, shape=[None])
W1 = weight_variable([3, 1])
y = tf.matmul(x, W1)
# training and cost function
cost_function = tf.reduce_mean(tf.square(tf.squeeze(y) - y_))
train_function = tf.train.AdamOptimizer(1e-2).minimize(cost_function)
# create a session
sess = tf.Session()
# variables must be initialized before training
sess.run(tf.global_variables_initializer())
# train
for i in range(10000):
    sess.run(train_function, feed_dict={x:xx, y_:yy})
    if i % 1000 == 0:
        print(sess.run(cost_function, feed_dict={x:xx, y_:yy}))
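The same fit can be sketched without TensorFlow to show what the optimizer is doing. This version swaps Adam for plain full-batch gradient descent on the mean squared error (my substitution, for simplicity); the data follows the same made-up rule as above, y = 2*x0 + 1.4*x1 + 3*x2, and `fit_linear` is a hypothetical helper name.

```python
import random

def fit_linear(xs, ys, learning_rate=0.5, steps=1000):
    """Full-batch gradient descent on the mean squared error of a linear model."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    n = float(len(xs))
    for _ in range(steps):
        grads = [0.0] * n_features
        for x, y in zip(xs, ys):
            error = sum(w * v for w, v in zip(weights, x)) - y
            for j in range(n_features):
                grads[j] += 2.0 * error * x[j] / n  # d(MSE)/d(w_j)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights

rng = random.Random(0)
xs = [[rng.random() for _ in range(3)] for _ in range(200)]
ys = [2 * x[0] + 1.4 * x[1] + 3 * x[2] for x in xs]
weights = fit_linear(xs, ys)
print([round(w, 3) for w in weights])
```

Because the targets are an exact linear function of the inputs, the learned weights should land very close to 2, 1.4 and 3.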

Support Vector Machines

Use Cases: Currently only Binary Classification

The general idea behind an SVM is that there is an optimal hyperplane for linearly separable patterns. For data that is not linearly separable, we can use a kernel function to transform the original data into a new space. SVMs maximize the margin around the separating hyperplane. They work extremely well in high dimensional spaces and are still effective if the number of dimensions is greater than the number of samples.

import tensorflow as tf

def input_fn():
    return {
        'example_id': tf.constant(['1', '2', '3']),
        'price': tf.constant([[0.6], [0.8], [0.3]]),
        'sq_footage': tf.constant([[900.0], [700.0], [600.0]]),
        'country': tf.SparseTensor(
            values=['IT', 'US', 'GB'],
            indices=[[0, 0], [1, 3], [2, 1]],
            shape=[3, 5]),  # named dense_shape from TensorFlow 1.0 onward
        'weights': tf.constant([[3.0], [1.0], [1.0]])
    }, tf.constant([[1], [0], [1]])

price = tf.contrib.layers.real_valued_column('price')
# bucketize the raw square footage (source column assumed from the feature name)
sq_footage_bucket = tf.contrib.layers.bucketized_column(
    tf.contrib.layers.real_valued_column('sq_footage'),
    boundaries=[650.0, 800.0])
country = tf.contrib.layers.sparse_column_with_hash_bucket(
    'country', hash_bucket_size=5)
sq_footage_country = tf.contrib.layers.crossed_column(
    [sq_footage_bucket, country], hash_bucket_size=10)
svm_classifier = tf.contrib.learn.SVM(
    feature_columns=[price, sq_footage_bucket, country, sq_footage_country],
    example_id_column='example_id',  # assumed: points at the 'example_id' feature
    weight_column_name='weights')    # assumed: points at the 'weights' feature
svm_classifier.fit(input_fn=input_fn, steps=30)
accuracy = svm_classifier.evaluate(input_fn=input_fn, steps=1)['accuracy']
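The margin idea can also be shown by hand. The sketch below computes the standard hinge loss for a linear SVM in plain Python; the weights and points are made up, and this is not a claim about how the TensorFlow estimator is implemented internally.

```python
def decision(weights, bias, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(w * v for w, v in zip(weights, x)) + bias

def hinge_loss(weights, bias, x, label):
    """label is +1 or -1; the loss is zero once the point clears the margin."""
    return max(0.0, 1.0 - label * decision(weights, bias, x))

weights, bias = [1.0, -1.0], 0.0
print(hinge_loss(weights, bias, [2.0, 0.0], +1))  # 0.0: correct and outside the margin
print(hinge_loss(weights, bias, [0.5, 0.0], +1))  # 0.5: correct but inside the margin
print(hinge_loss(weights, bias, [1.0, 0.0], -1))  # 2.0: on the wrong side
```

Points that are correct but too close to the hyperplane are still penalized, which is what pushes the solution toward a wide margin.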

Deep and Wide Models

Use Cases: Recommendation systems, Classification and Regression

Deep and Wide models were covered in greater detail in part two, so we won’t go too deep here. A Wide and Deep Network combines a linear model with a feed forward neural net so that our predictions will have memorization and generalization. This type of model can be used for classification and regression problems. It allows for less feature engineering with relatively accurate predictions, giving us the best of both worlds. Here’s a code snippet from part two’s GitHub.

def input_fn(df, train=False):
  """Input builder function."""
  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values) for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
    indices=[[i, 0] for i in range(df[k].size)],
    values=df[k].values,
    shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols)
  feature_cols.update(categorical_cols)
  # Converts the label column into a constant Tensor.
  if train:
    label = tf.constant(df[SURVIVED_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label
  # Returns only the feature columns when predicting.
  return feature_cols

m = build_estimator(model_dir)
m.fit(input_fn=lambda: input_fn(df_train, True), steps=200)
print(m.predict(input_fn=lambda: input_fn(df_test)))
results = m.evaluate(input_fn=lambda: input_fn(df_train, True), steps=1)
for key in sorted(results):
  print("%s: %s" % (key, results[key]))

Random Forest

Use Cases: Classification and Regression

A Random Forest model builds many different classification trees, and each tree votes for a class. The forest chooses the classification with the most votes.

Adding more trees does not by itself cause a Random Forest to overfit, so you can run as many trees as you want, and training is relatively fast. Give it a try on the iris data with the snippet below:

import numpy as np
import tensorflow as tf

hparams = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams(
        num_trees=3, max_nodes=1000, num_classes=3, num_features=4)
classifier = tf.contrib.learn.TensorForestEstimator(hparams)
iris = tf.contrib.learn.datasets.load_iris()
data = iris.data.astype(np.float32)
target = iris.target.astype(np.float32)
monitors = [tf.contrib.learn.TensorForestLossMonitor(10, 10)]
classifier.fit(x=data, y=target, steps=100, monitors=monitors)
classifier.evaluate(x=data, y=target, steps=10)
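The voting step is simple enough to sketch directly. Here each "tree" is a hypothetical one-rule function on (petal length, petal width) with made-up thresholds; a real forest learns its trees from data.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Ask every tree for a class and return the class with the most votes."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical one-rule trees.
trees = [
    lambda s: "setosa" if s[0] < 2.5 else "versicolor",
    lambda s: "setosa" if s[1] < 0.8 else "versicolor",
    lambda s: "versicolor",
]

print(forest_predict(trees, (1.4, 0.2)))  # setosa (2 votes to 1)
print(forest_predict(trees, (4.5, 1.5)))  # versicolor (3 votes to 0)
```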

Bayesian Reinforcement Learning

Use Cases: Classification and Regression

In the contrib folder of TensorFlow there is a library called BayesFlow. BayesFlow has no documentation except for an example of the REINFORCE algorithm. This algorithm is proposed in a paper by Ronald Williams.

REward Increment = Nonnegative Factor * Offset Reinforcement * Characteristic Eligibility

This network, trying to solve an immediate reinforcement learning task, adjusts its weights after receiving the reinforcement value at each trial. At the end of each trial, each weight is incremented by a learning rate factor, multiplied by the reinforcement value minus the baseline, multiplied by the characteristic eligibility. Williams’ paper also discusses the use of back propagation to train the REINFORCE network.

# Excerpt; the function name is assumed, and plus_1, minus_1, st and
# distributions are defined elsewhere in the example.
def build_split_apply_merge_model():
  """Build the Split-Apply-Merge Model.

  Route each value of input [-1, -1, 1, 1] through one of the
  functions, plus_1, minus_1.  The decision for routing is made by
  4 Bernoulli R.V.s whose parameters are determined by a neural network
  applied to the input.  REINFORCE is used to update the NN parameters.

  Returns:
    The 3-tuple (route_selection, routing_loss, final_loss), where:
      - route_selection is an int 4-vector
      - routing_loss is a float 4-vector
      - final_loss is a float scalar.
  """
  inputs = tf.constant([[-1.0], [-1.0], [1.0], [1.0]])
  targets = tf.constant([[0.0], [0.0], [0.0], [0.0]])
  paths = [plus_1, minus_1]
  weights = tf.get_variable("w", [1, 2])
  bias = tf.get_variable("b", [1, 1])
  logits = tf.matmul(inputs, weights) + bias
  # REINFORCE forward step
  route_selection = st.StochasticTensor(
      distributions.Categorical, logits=logits)
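The weight update itself is short when written out for a single unit. The sketch below assumes a Bernoulli-logistic unit, for which the characteristic eligibility of weight j works out to (action - p) * x_j; the function name and the numbers are illustrative only.

```python
import math

def reinforce_update(weights, x, action, reward, baseline, learning_rate):
    """One REINFORCE step for a single Bernoulli-logistic unit."""
    # Probability that the unit fires (action = 1) for input x.
    p = 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(weights, x))))
    # Characteristic eligibility: d ln Pr(action) / d w_j = (action - p) * x_j
    eligibility = [(action - p) * v for v in x]
    # Increment = learning rate * (reward - baseline) * eligibility
    return [w + learning_rate * (reward - baseline) * e
            for w, e in zip(weights, eligibility)]

new_weights = reinforce_update([0.0, 0.0], x=[1.0, 2.0], action=1,
                               reward=1.0, baseline=0.0, learning_rate=0.1)
print(new_weights)
```

With zero weights the unit fires with probability 0.5, so a rewarded firing nudges the weights in the direction of the input; a reward below the baseline would push them the other way.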

Linear Chain Conditional Random Fields

Use Cases: Sequential Data

CRFs are conditional probability distributions that factorize according to an undirected model. They predict a label for a single sample while keeping context from the neighboring samples. CRFs are similar to Hidden Markov Models. CRFs are often used for image segmentation and object recognition, as well as shallow parsing, named entity recognition and gene finding.

# Train for a fixed number of iterations.
  for i in range(1000):
    tf_unary_scores, tf_transition_params, _ = session.run(
       [unary_scores, transition_params, train_op])
    if i % 100 == 0:
      correct_labels = 0
      total_labels = 0
      for tf_unary_scores_, y_, sequence_length_ in zip(tf_unary_scores, y, sequence_lengths):
        # Remove padding from the scores and tag sequence.
        tf_unary_scores_ = tf_unary_scores_[:sequence_length_]
        y_ = y_[:sequence_length_]

        # Compute the highest scoring sequence.
        viterbi_sequence, _ = tf.contrib.crf.viterbi_decode(
            tf_unary_scores_, tf_transition_params)

        # Evaluate word-level accuracy.
        correct_labels += np.sum(np.equal(viterbi_sequence, y_))
        total_labels += sequence_length_
      accuracy = 100.0 * correct_labels / float(total_labels)
      print("Accuracy: %.2f%%" % accuracy)
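The decoding step in the loop above finds the highest scoring tag sequence given per-step (unary) scores and tag-to-tag transition scores. Here is a plain Python sketch of the standard Viterbi algorithm that plays the same role; it is my own illustration with toy scores, not the library’s code.

```python
def viterbi_decode(unary_scores, transitions):
    """Return (best tag sequence, its score).

    unary_scores[t][j] is the score of tag j at step t;
    transitions[i][j] is the score of moving from tag i to tag j.
    """
    n_tags = len(unary_scores[0])
    best = list(unary_scores[0])  # best score of any path ending in each tag
    backpointers = []
    for scores in unary_scores[1:]:
        new_best, pointers = [], []
        for j in range(n_tags):
            candidates = [best[i] + transitions[i][j] for i in range(n_tags)]
            i_best = max(range(n_tags), key=lambda i: candidates[i])
            new_best.append(candidates[i_best] + scores[j])
            pointers.append(i_best)
        best = new_best
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    path = [max(range(n_tags), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(best)

unary = [[1, 0], [0, 1], [1, 0]]
print(viterbi_decode(unary, [[0, 0], [0, 0]]))    # ([0, 1, 0], 3)
print(viterbi_decode(unary, [[0, -5], [-5, 0]]))  # ([0, 0, 0], 2)
```

In the second call, switching tags is penalized, so the best path stays on tag 0 even though step 2 on its own prefers tag 1. That is the "context from neighboring samples" at work.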


Ever since TensorFlow was released, the community surrounding the project has been adding more packages, examples and use cases for this amazing library. Even at the time of writing this article, more models and sample code are being written. It is amazing to see how much TensorFlow has grown in these past few months. The ease of use and diversity of the package are increasing over time and don’t seem to be slowing down anytime soon.

As always — Feel free to email me any questions or inquiries at camron@camron.xyz

Originally posted at Camron.xyz

Sales Automation Through a Deep Learning Platform


A deep learning platform that actually sells to the customer.

This article outlines a successful sales automation platform that uses deep learning to drive customers down an order flow.


Most sales processes are largely linear and follow a flow. A good sales agent knows how to guide the customer down this path gracefully. If the customer deviates from the flow, the sales agent knows how to sell the product and continue with the order flow. With this sales platform architecture, the deep learning model can sell the product to the client and keep the context of the situation while maintaining the sales order flow.

Automatic voice sales systems are already in place, but such systems cannot sell the product and drive the customer farther down the sales process the way this architecture does.

Platform Design

This architecture is based on having the sales process mapped out into finite states. For example, when ordering a plane ticket, a state could be ‘getting the date for departure’ or ‘getting the date for arrival’. Each state is essentially a single step in the order process. All of these steps are held together in an underlying structure by a finite state machine that keeps track of the states. Each state has its own properties. It’s not necessarily important that these states are numbered in order, because each order may deviate and the platform will have to move the customer between different states.

Each state dictates what the network response will be. The response comes from a bucket that is predetermined by the state. This method is effective because it is very structured and the network will always give a correct response.
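A minimal sketch of such a state machine, using the plane ticket example, might look like the following. The state names, canned responses and allowed transitions are all hypothetical; the real platform’s states are not shown in this article.

```python
class OrderFlow(object):
    """Tiny finite state machine: each state has a canned response and allowed next states."""

    def __init__(self, states, start):
        self.states = states
        self.current = start

    def possible_states(self):
        return self.states[self.current]["next"]

    def response(self):
        return self.states[self.current]["response"]

    def advance(self, next_state):
        if next_state not in self.possible_states():
            raise ValueError("cannot go from %s to %s" % (self.current, next_state))
        self.current = next_state
        return self.response()

# Hypothetical order flow for a plane ticket.
STATES = {
    "departure_date": {"response": "What date would you like to depart?",
                       "next": ["arrival_date"]},
    "arrival_date": {"response": "What date would you like to arrive?",
                     "next": ["departure_date", "payment"]},
    "payment": {"response": "How would you like to pay?",
                "next": ["arrival_date", "done"]},
    "done": {"response": "Your order is complete.", "next": []},
}

flow = OrderFlow(STATES, "departure_date")
print(flow.response())
print(flow.advance("arrival_date"))
print(flow.possible_states())
```

Note that the transitions are not strictly forward: a customer can go back from payment to the arrival date, which is why the states are tracked by name rather than by position.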

A trivial example of an order flow

This platform includes middleware for session storage. The session storage keeps track of customer information such as name, location and other information related to the order process. This temporary storage is an external key-value store keyed by a unique session identifier; the storage used in this case was Redis. It holds data specific to the sale and makes the conversation feel personalized. The sales network responds with information already entered into the system, such as “Alright Mr. Smith, the next step is to get your credit information”.

When the customer gives input, it is assigned a unique sessionID. The sessionID is handed to the finite state machine, which returns an encoded current state matrix and a possible states matrix. These two matrices, along with the raw user input, are passed to a Wide and Recurrent Model. This model is based on the Wide and Deep model but replaces the feed forward network with a recurrent neural network for language processing. The target label in this model is the next state in the finite state machine. Given the current state, all possible states and what the user said, the network determines the correct next state in the order process. As of writing this article, there are no known articles or papers that reference such a model as Wide and Recurrent. An example can be seen in this code using the Keras package.

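from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU, Merge

# max_features (vocabulary size) and num_states are assumed to be defined elsewhere

# recurrent branch: reads the raw user input
model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.2))
model.add(GRU(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(num_states, activation='softmax'))

# wide branches: the encoded current state and possible states
current_state = Sequential()
current_state.add(Dense(num_states, input_dim=num_states, activation="relu"))
possible_states = Sequential()
possible_states.add(Dense(num_states, input_dim=num_states, activation="relu"))

# concatenate the three branches and predict the next state
merge = Merge([current_state, possible_states, model], mode='concat')
final_answer = Sequential()
final_answer.add(merge)
final_answer.add(Dense(num_states, activation='softmax'))
final_answer.compile(optimizer='rmsprop', loss='categorical_crossentropy')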

Each state in the order flow has a hook that detects when users are uncertain or have a question. The hook takes them to an intermediate state that answers their questions and tries to sell to them. This model is a deep memory network with episodic memory. It is fed a set of common questions particular to that business so that it can easily interpret the user’s question and answer accurately. This piece is what sets this platform apart from other automated order processes. The deep memory network allows efficient recall of prepared statements, such as how the product functions or what the warranty on the product is, and other questions customers might ask a sales associate.

The dynamic memory network for selling the customer


For this model to succeed, it needs data sets that pair what a user might say in each state with the correct next state for that response, as well as a data set of frequently asked questions with the answers that successfully drive the customer down the order flow. In practice this data has been generated from domain knowledge and scripting of the order process. The session column indicates the order session, the text is what the customer says, and the label is the correct next state. An example training set for selling a space trip looks like the following:

1,of course,planets
1,Jupiter is cool with me,passengers
1,I want the premium meal plan,housing
1,what ones are good,list_meal
1,the hilton,payment
1,can you tell me which ones there are,list_housing
1,use it,done
2,that would be awesome,planets
2,Mars would be dope,passengers
2,the average one,housing
2,which one is the best?,list_meal
2,the ritz carlton,payment
2,I dont understant,list_housing
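Rows in this session,text,label layout can be read with Python’s csv module. The sketch below is a hypothetical loader, using three of the rows above as inline sample data rather than a real file.

```python
import csv
import io

SAMPLE = """1,of course,planets
1,Jupiter is cool with me,passengers
2,Mars would be dope,passengers
"""

def load_examples(handle):
    """Return (session, text, next_state) tuples from session,text,label rows."""
    return [(int(session), text, label)
            for session, text, label in csv.reader(handle)]

examples = load_examples(io.StringIO(SAMPLE))
print(examples[0])  # (1, 'of course', 'planets')
```

Customer text that itself contains commas would need to be quoted in the file for this three-column split to hold.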


In practice this platform allowed for an MVP of a sales automation process that eliminated the need for human-to-human interaction through voice chat as well as text chat. This platform could scale up and work for large corporations whose primary function is selling through agents. It suits rigid business order processes, where it creates uniformity throughout the sales process and delivers professional sales methods to increase the likelihood of order completion.

TensorFlow in a Nutshell — Part Two: Hybrid Learning


The fast and easy guide to the most popular Deep Learning framework in the world.

Make sure to check out Part One: Basics

In this article we will demonstrate a Wide ‘N Deep Network that uses a wide linear model trained simultaneously with a feed forward network for more accurate predictions than some traditional machine learning techniques. This hybrid learning method will be used to predict the survival probability of Titanic passengers.

These hybrid learning methods are already in production at Google in the Play Store for app suggestions. Even YouTube is using similar hybrid learning techniques to suggest videos.

The code for this article is available here.

Wide and Deep Network

A Wide and Deep Network combines a linear model with a feed forward neural net so that our predictions will have memorization and generalization. This type of model can be used for classification and regression problems. It allows for less feature engineering with relatively accurate predictions, giving us the best of both worlds.
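To show how the two halves fit together, here is a hand-rolled sketch of a single prediction. It assumes one ReLU hidden layer for the deep part and made-up weights; the function name is hypothetical. The key point is that the wide and deep logits are added together before the sigmoid, which is what lets the two parts be trained jointly.

```python
import math

def wide_and_deep_probability(wide_features, wide_weights, deep_features,
                              hidden_weights, output_weights, bias=0.0):
    # Wide part: a plain linear model over (crossed) sparse features.
    wide_logit = sum(w * x for w, x in zip(wide_weights, wide_features))
    # Deep part: one ReLU hidden layer over dense features.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, deep_features)))
              for row in hidden_weights]
    deep_logit = sum(w * h for w, h in zip(output_weights, hidden))
    # Sum the two logits, then squash to a probability.
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit + bias)))

probability = wide_and_deep_probability(
    wide_features=[1, 0], wide_weights=[0.5, -0.5],
    deep_features=[1.0, 2.0],
    hidden_weights=[[1.0, 0.0], [0.0, -1.0]],
    output_weights=[0.5, 1.0])
print(probability)
```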

The Data

We are going to use the Titanic Kaggle data to predict whether or not a passenger will survive based on attributes like Name, Sex, what ticket they had, the fare they paid, the cabin they stayed in, etc. For more information on this data set, check it out at Kaggle.

First off, we’re going to define all of our columns as Continuous or Categorical.

Continuous columns: any numerical value in a continuous range, such as money or age.

Categorical columns: part of a finite set, like male or female, or even what country someone is from.

CATEGORICAL_COLUMNS = ["Name", "Sex", "Embarked", "Cabin"]
CONTINUOUS_COLUMNS = ["Age", "SibSp", "Parch", "Fare", "PassengerId", "Pclass"]

Since we are only looking to see if a person survived, this is a binary classification problem. We predict a 1 if that person survives and a 0… if they do not 🙁. We then create a column solely for our survived category.


The Network

Now we can get to creating the columns and adding embedding layers. When we build our model we’re going to want to change our categorical columns into sparse columns. For our columns with a small set of categories, such as Sex or Embarked (S, Q, or C), we will transform them into sparse columns with keys:

# keys assumed from the category values named in the text
sex = tf.contrib.layers.sparse_column_with_keys(column_name="Sex",
                                                keys=["female", "male"])
embarked = tf.contrib.layers.sparse_column_with_keys(column_name="Embarked",
                                                     keys=["C", "S", "Q"])

The other categorical columns have many more options than we want to list as keys, and since we don’t have a vocab file to map all of the possible categories to integers, we will hash them.
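The idea of hashing a category into a bucket can be sketched without TensorFlow as follows. This is only an illustration: TensorFlow uses its own fingerprint function, and MD5 is a stand-in here because it gives the same answer on every run; `hash_bucket` is a hypothetical helper and the cabin strings are example values.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map an arbitrary string to an integer id in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

for cabin in ["C85", "E46", "C85"]:
    print(cabin, hash_bucket(cabin, 1000))
```

The same string always lands in the same bucket, so no vocabulary file is needed. The trade-off is that two different strings can collide in one bucket, which is why the bucket count is chosen comfortably larger than the number of categories.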

cabin = tf.contrib.layers.sparse_column_with_hash_bucket(
      "Cabin", hash_bucket_size=1000)
name = tf.contrib.layers.sparse_column_with_hash_bucket(
      "Name", hash_bucket_size=1000)

For our continuous columns we want to use their real values. PassengerId is continuous rather than categorical because its values are not strings; they are already integer IDs.

age = tf.contrib.layers.real_valued_column("Age")
passenger_id = tf.contrib.layers.real_valued_column("PassengerId")
sib_sp = tf.contrib.layers.real_valued_column("SibSp")
parch = tf.contrib.layers.real_valued_column("Parch")
fare = tf.contrib.layers.real_valued_column("Fare")
p_class = tf.contrib.layers.real_valued_column("Pclass")

We are going to bucket the ages. Bucketization allows us to find the survival correlation by certain age groups and not by all the ages as a whole, thus increasing our accuracy.

age_buckets = tf.contrib.layers.bucketized_column(age,
                                                        5, 18, 25,
                                                        30, 35, 40,
                                                        45, 50, 55,
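What bucketization does to a single age can be sketched with the standard library. The boundaries below are just the first few from the list above, and `bucketize` is a hypothetical helper; each bucket includes its left boundary and excludes its right one.

```python
import bisect

def bucketize(value, boundaries):
    """Index of the bucket that value falls into, given sorted boundaries."""
    return bisect.bisect_right(boundaries, value)

boundaries = [5, 18, 25, 30]
for age in (3, 5, 22, 40):
    print(age, bucketize(age, boundaries))
```

Ages 3, 5, 22 and 40 land in buckets 0, 1, 2 and 4, so the model sees "under 5" or "18 to 24" as categories instead of a raw number.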

Almost done, we are going to define our wide columns and our deep columns. Our wide columns are going to effectively memorize interactions between our features. Our wide columns don’t generalize our features, this is why we have our deep columns.

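wide_columns = [sex, embarked, p_class, cabin, name, age_buckets,
                tf.contrib.layers.crossed_column([p_class, cabin],
                                                 hash_bucket_size=int(1e4)),  # size assumed
                tf.contrib.layers.crossed_column([age_buckets, sex],
                                                 hash_bucket_size=int(1e6)),  # size assumed
                tf.contrib.layers.crossed_column([embarked, name],
                                                 hash_bucket_size=int(1e4))]  # size assumed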

The benefit of having these deep columns is that they take our sparse, high-dimensional features and reduce them to low-dimensional dense vectors.

deep_columns = [
      tf.contrib.layers.embedding_column(sex, dimension=8),
      tf.contrib.layers.embedding_column(embarked, dimension=8),
      tf.contrib.layers.embedding_column(cabin, dimension=8),
      tf.contrib.layers.embedding_column(name, dimension=8),
]
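
Conceptually, an embedding column is a lookup table: each sparse id selects one short dense row. The sketch below shows that idea in plain Python under stated assumptions. The table is filled with random numbers, whereas in a real model those values are learned during training, and make_embedding_table and embed are hypothetical names.

```python
import random

def make_embedding_table(num_buckets, dimension, seed=0):
    """One row of `dimension` floats per possible sparse id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dimension)]
            for _ in range(num_buckets)]

def embed(bucket_id, table):
    """Look up the dense vector for a sparse id."""
    return table[bucket_id]

table = make_embedding_table(num_buckets=1000, dimension=8)
dense = embed(417, table)  # 417 is an arbitrary example bucket id
print(len(dense))  # 8 values instead of a 1000-wide one-hot vector
```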

We finish off our function by creating our classifier with our deep columns and wide columns:

return tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir, linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 50])

The last thing we have to do before running the network is create mappings for our continuous and categorical columns. By creating this function, which is standard throughout the TensorFlow learn code, we define an input function for our dataframe. It converts the dataframe into tensors that TensorFlow can manipulate. The benefit is that we can change and tweak how our tensors are created. If we wanted, we could pass feature columns into .fit and .predict as individually created columns, as we did above with our features, but this is a much cleaner solution.

def input_fn(df, train=False):
  """Input builder function."""
  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values) for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
    indices=[[i, 0] for i in range(df[k].size)],
    values=df[k].values,
    shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols)
  feature_cols.update(categorical_cols)
  if train:
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[SURVIVED_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label
  # No label: lets us predict results for rows that have none in the csv.
  return feature_cols

Now, after all this, we can write our training function:

def train_and_eval():
  """Train and evaluate the model."""
  df_train = pd.read_csv(
  df_test = pd.read_csv(
  model_dir = "./models"
  print("model directory = %s" % model_dir)
  m = build_estimator(model_dir)
  m.fit(input_fn=lambda: input_fn(df_train, True), steps=200)
  print(m.predict(input_fn=lambda: input_fn(df_test)))
  results = m.evaluate(input_fn=lambda: input_fn(df_train, True), steps=1)
  for key in sorted(results):
    print("%s: %s" % (key, results[key]))

We read in our csv files, which were preprocessed for simplicity’s sake, for example by imputing missing values. Details on how the files were preprocessed, along with the code, are contained in the repo.

These csv files are converted to tensors by our input_fn, which we pass in as a lambda. We build our estimator, then print our predictions and our evaluation results.


Network results

Running our code as is gives us reasonably good results without adding any extra columns or doing any great acts of feature engineering. With very little fine tuning this model can be used to achieve relatively good results.

Adding an embedding layer alongside a traditional wide linear model allows for accurate predictions by reducing sparse, high-dimensional features down to low dimensionality.


This part deviates from traditional Deep Learning to illustrate the many uses and applications of TensorFlow. This article is heavily based on the paper and code provided by Google for wide and deep learning. The research paper is “Wide & Deep Learning for Recommender Systems”. Google uses this model as a product recommendation engine for the Google Play store, and it has helped them increase sales from app suggestions. YouTube has also released a paper about its recommendation system, “Deep Neural Networks for YouTube Recommendations”, which uses hybrid learning as well. These models are becoming more prevalent for recommendation at various companies and will likely continue to be, thanks to their embedding ability.

TensorFlow in a Nutshell — Part One: Basics

The fast and easy guide to the most popular Deep Learning framework in the world.

TensorFlow is a framework created by Google for creating Deep Learning models. Deep Learning is a category of machine learning models that use multi-layer neural networks. The ideas behind deep learning date back to 1943, when neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper on how neurons might work and modeled a simple neural network using electrical circuits.

Many, many developments have occurred since then. These highly accurate mathematical models are extremely computationally expensive. With recent advances in processing power from GPUs and increasing CPU power, Deep Learning has been exploding in popularity.

TensorFlow was created with processing power limitations in mind. Open sourced in November 2015, this library can be run on computers of all kinds, including smartphones. It allows for instant creation of trained production models. At the time of writing this article it is the number 1 Deep Learning framework.

Created by Francois Chollet @fchollet (twitter)

Basic Computational Graph

Everything in TensorFlow is based on creating a computational graph. If you’ve ever used Theano then this section will look familiar. Think of a computational graph as a network of nodes, with each node known as an operation, running some function that can be as simple as addition or subtraction or as complex as some multivariate equation.

An Operation, also referred to as an op, can return zero or more tensors, which can be used later on in the graph. Here’s a list of operations with their output, for example:

Each operation can be handed a constant, array, matrix or n-dimensional matrix. Another word for an n-dimensional matrix is a tensor; a 2-dimensional tensor is equivalent to an m x n matrix.

Our computational graph

The code above creates two constant tensors, multiplies them together and outputs the result. This is a trivial example that demonstrates how you can create a graph and run the session. All inputs needed by the op are run automatically, typically in parallel. This session run actually causes the execution of three operations in the graph: creating the two constants, then the matrix multiplication.
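
The build-then-run idea can be mimicked in a few lines of plain Python, which may help separate the concept from any particular TensorFlow version. This is only a toy sketch: Node, constant, matmul and run are hypothetical stand-ins, not TensorFlow APIs, and the matrix values are made up.

```python
class Node(object):
    """One operation in the graph: a function plus the nodes it depends on."""
    def __init__(self, fn, inputs=()):
        self.fn = fn
        self.inputs = list(inputs)

def constant(value):
    """An op with no inputs that simply returns its value."""
    return Node(lambda: value)

def matmul(a, b):
    """An op that multiplies the matrices produced by nodes a and b."""
    def fn(x, y):
        return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
                for row in x]
    return Node(fn, [a, b])

def run(node):
    """Evaluate a node after first evaluating everything it depends on."""
    args = [run(dep) for dep in node.inputs]
    return node.fn(*args)

matrix1 = constant([[3.0, 3.0]])
matrix2 = constant([[2.0], [2.0]])
product = matmul(matrix1, matrix2)  # builds the graph; nothing is computed yet
result = run(product)               # evaluates both constants, then the matmul
print(result)
```

Asking for product is what triggers the evaluation of its two inputs, which mirrors how a session run executes every op the requested op depends on.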


The constants and operation that we created above were automatically added to the default graph in TensorFlow. The default graph is instantiated when the library is imported. Creating a Graph object instead of using the default graph is useful when creating multiple models in one file that do not depend on each other.

new_graph = tf.Graph()
with new_graph.as_default():
    new_g_const = tf.constant([1., 2.])

Any variables or operations used outside of the with new_graph.as_default() block will be added to the default graph that is created when the library is loaded. You can even get a handle to the default graph with

default_g = tf.get_default_graph()

For most cases it’s best to stick with the default graph.


There are two kinds of Session objects in TensorFlow: tf.Session() and tf.InteractiveSession().


tf.Session() encapsulates the environment in which operations are executed and tensors are evaluated. Sessions can own allocated resources such as variables, queues and readers, so it’s important to call the close() method when the session is over. There are 3 arguments for a Session, all of which are optional.

  1. target — The execution engine to connect to.
  2. graph — The Graph to be launched.
  3. config — A ConfigProto protocol buffer with configuration options for the session

To run one “step” of the TensorFlow computation, Session.run() is called, and all of the dependencies needed for the requested part of the graph are run.


tf.InteractiveSession() is exactly the same as tf.Session() but is targeted at IPython and Jupyter Notebooks. It installs itself as the default session, so you can use Tensor.eval() and Operation.run() instead of calling Session.run() every time you want something computed.

sess = tf.InteractiveSession()
a = tf.constant(1)
b = tf.constant(2)
c = a + b
# instead of sess.run(c)
c.eval()

With InteractiveSession you don’t have to explicitly pass the Session object.


Variables in TensorFlow are managed by the Session. They persist across runs of the graph, which is useful because Tensor and Operation objects are immutable. Variables can be created with tf.Variable().

tensorflow_var = tf.Variable(1, name="my_variable")

Most of the time you will want to create these variables as tensors of zeros, ones or random values:

  • tf.zeros() — creates a matrix full of zeros
  • tf.ones() — creates a matrix full of ones
  • tf.random_normal() — a matrix of random normally distributed numbers
  • tf.random_uniform() — a matrix with random uniform values within an interval
  • tf.truncated_normal() — same as random normal but doesn’t include any numbers more than 2 standard deviations from the mean.

These functions take an initial shape parameter that defines the dimensions of the matrix. For example:

# 4x4x4 tensor, truncated normal with mean 0 and std 1
normal = tf.truncated_normal([4, 4, 4], mean=0.0, stddev=1.0)
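
To see what “truncated” means in practice, here is a small plain-Python sketch that re-draws any sample falling more than two standard deviations from the mean. It is an illustration of the idea only, not TensorFlow’s implementation, and this truncated_normal is a hypothetical stand-in that returns nested lists.

```python
import random

def truncated_normal(shape, mean=0.0, stddev=1.0, seed=None):
    """Nested lists of normal samples, re-drawing anything beyond 2 stddev."""
    rng = random.Random(seed)

    def sample():
        while True:
            x = rng.gauss(mean, stddev)
            if abs(x - mean) <= 2 * stddev:
                return x

    def build(dims):
        if not dims:
            return sample()
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(list(shape))

values = truncated_normal([4, 4, 4], mean=0.0, stddev=1.0, seed=42)
flat = [x for plane in values for row in plane for x in row]
print(len(flat), min(flat), max(flat))  # 64 values, none outside [-2, 2]
```

Cutting off the tails keeps the initial weights from starting at extreme values, which is one reason this initializer is popular for network weights.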

To have your variable set to one of these matrix helper functions:

normal_var = tf.Variable(tf.truncated_normal([4, 4, 4], mean=0.0, stddev=1.0))

To initialize these variables you must create TensorFlow’s variable initialization op and then run it in the session. This way, when multiple sessions are run, the variables start out the same.

init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

If you’d like to completely change the value of a variable you can use the Variable.assign() operation, which must be run in a session to update the value.

initial_var = tf.Variable(1)
changed_var = initial_var.assign(initial_var + initial_var)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
sess.run(changed_var)
# 2
sess.run(changed_var)
# 4
sess.run(changed_var)
# 8
# .... and so on

Sometimes you would like to add a counter inside your model. For this you can use the Variable.assign_add() method, which takes a numeric parameter and increments the variable by it. Similarly, there is Variable.assign_sub().

counter = tf.Variable(0)
sess.run(counter.initializer)
sess.run(counter.assign_add(1))
# 1
sess.run(counter.assign_sub(2))
# -1


To control the complexity of models and make them easier to break down into individual pieces TensorFlow has scopes. Scopes are very simple and even help break down your model when using TensorBoard (which will be covered in Part 2). Scopes can even be nested inside of other scopes.

with tf.name_scope("Scope1"):
    with tf.name_scope("Scope_nested"):
        nested_var = tf.mul(5, 5)

Scopes may not seem that powerful right now, but used together with TensorBoard they’re very useful.


I’ve demonstrated many of the building blocks that TensorFlow offers. Added together, these individual pieces can create very complicated models. There is much more that TensorFlow offers; if there are any requests for features in upcoming parts, let me know.

Creating a Search Engine

The science behind finding a needle in a needlestack.


Language is hard. It’s very difficult to find out what people are trying to say. It’s even harder to try to search through a corpus of text to find something relevant. With so many differences in meaning and phrasing how are we supposed to find what we want? This is where topic modeling comes in.

Topic Modeling is a form of identifying patterns in text using math. These methods allow us to find hidden meaning and relationships that are not obvious to us as humans. To create our search engine and find what we want from the text, we will be using an algorithm known as Latent Semantic Indexing (LSI). Others may refer to it as Latent Semantic Analysis (LSA), but they’re one and the same.

Latent Semantic Indexing

LSI is a method which allows users to find documents based on keywords. LSI works with the assumption that there is an underlying structure to language that is hidden because there are so many words to choose from. I’m going to highlight the main steps then go through each of them in detail.

  1. Organize your data into a matrix for word counts per document using Term Frequency Inverse Document Frequency
  2. Reduce the dimensionality of the matrix using Singular Value Decomposition.
  3. Compare words by taking the cosine similarity between two vectors to find how similar they are.

Organize your data

The first step in using LSI is to transform your documents into a usable form. The most commonly used form is Term Frequency Inverse Document Frequency (TFIDF). First we need to create a list of all the words that appear in the documents. After getting all the words, we remove the ones that don’t add much meaning to a sentence, such as “of” or “because”. These are considered stop words. We can cut the list down even more by removing words that only appear in one document.


This form creates a matrix with one row for each term that appears in any of the documents and one column for each document in the corpus. The cell where a term and a document intersect holds that term’s weight in the document, which is based on how many times the term appears in it.
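
As a rough sketch of this step, the plain-Python code below builds a small term-by-document TFIDF matrix. The stop word list, the example documents and the helper names are all made up for illustration, and it uses idf = log(N / df), which is only one of several common weighting variants.

```python
import math
from collections import Counter

STOP_WORDS = {"of", "because", "the", "a", "is", "on"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_matrix(documents):
    """Return (vocab, matrix) with one row per term and one column per document."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted(set(w for doc in tokenized for w in doc))
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    counts = [Counter(doc) for doc in tokenized]
    matrix = []
    for term in vocab:
        idf = math.log(float(n_docs) / doc_freq[term])
        matrix.append([counts[j][term] * idf for j in range(n_docs)])
    return vocab, matrix

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats chase dogs"]
vocab, matrix = tfidf_matrix(docs)
for term, row in zip(vocab, matrix):
    print(term, [round(x, 3) for x in row])
```

A term that shows up in many documents gets a small idf and so a small weight, while a term concentrated in a few documents stands out.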

Reduce Dimensionality

Since our tfidf matrix is very large and sparse, it can strain memory and be CPU intensive to search through the entire thing every time we have a query. A solution to this problem is to use Singular Value Decomposition (SVD). SVD takes our tfidf matrix and reduces its dimensionality to something manageable. There is some debate around what dimensionality is ideal; anywhere between 50 and 1000 dimensions is considered effective.

SVD comes from linear algebra, where a rectangular matrix A can be broken down into the product of three different matrices. The theorem is usually written as:

A = U S V^T

A is the matrix that is broken into an orthogonal matrix U, a diagonal matrix S and the transpose of an orthogonal matrix V. “Where U^T U = I, V^T V = I; the columns of U are orthonormal eigenvectors of AA^T, the columns of V are orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of eigenvalues from U or V in descending order.” (Kirk Baker)

The idea behind this is that we take our tfidf and break it down into independent components. Then take these independent components and multiply them all together to get A. These components are an abstraction from the noisy correlations in the original data. This gives us the best approximation of the underlying structure of the documents. SVD makes the documents that are similar appear more similar and the documents that were dissimilar appear to be more dissimilar as well. Our goal is not to actually reconstruct the original matrix but to use the reduced dimensionality representation to get similar words and documents.
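
To give a feel for the reduced representation, the sketch below approximates only the largest singular value and its two vectors using power iteration, then rebuilds a rank-1 version of the matrix. This is a stand-in chosen because it fits in a few lines of plain Python; a real LSI system would call a library SVD routine instead. The matrix values and helper names are made up.

```python
import math
import random

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def top_singular_triplet(a, iterations=200, seed=0):
    """Approximate the largest singular value of `a` and its two vectors."""
    rng = random.Random(seed)
    a_t = transpose(a)
    v = [rng.random() + 0.1 for _ in a[0]]  # strictly positive starting vector
    for _ in range(iterations):
        w = matvec(a_t, matvec(a, v))       # one power-iteration step on A^T A
        scale = norm(w)
        v = [x / scale for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return u, sigma, v

def rank_one(u, sigma, v):
    """Rebuild the rank-1 approximation sigma * u * v^T."""
    return [[sigma * ui * vj for vj in v] for ui in u]

# Made-up term-document weights: rows are terms, columns are documents.
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 0.5]]
u, sigma, v = top_singular_triplet(A)
approx = rank_one(u, sigma, v)
for row in approx:
    print([round(x, 3) for x in row])
```

The rank-1 approximation keeps the strong shared pattern between the first two terms and documents and drops the weak third component, which is the kind of noise reduction the paragraph above describes.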

Compute the similarity

You can query your model to find the documents relevant to search keywords. To find what matches the query in the reduced term-document space, the query must be transformed into a pseudo-document. The terms are represented in an m x 1 vector. The appropriate local and global weighting functions for the document collection are run on the terms vector. This vector is compared to all the existing term and document vectors using cosine similarity. Documents with a similarity closer to one are more similar to the query, and documents closer to zero are less similar. A ranked list is returned with all of the cosine similarities.
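
The ranking step can be sketched in a few lines of plain Python. The document vectors and the query vector below are made-up two-dimensional examples standing in for vectors in the reduced space, and rank_documents is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (score, name) pairs sorted from most to least similar."""
    scores = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    return sorted(scores, reverse=True)

doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
query = [1.0, 0.1]
ranking = rank_documents(query, doc_vecs)
for score, name in ranking:
    print(name, round(score, 3))
```

Because cosine similarity only looks at direction, a long document and a short one about the same topic can still score close to one against the same query.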


Latent Semantic Indexing is a helpful technique for wading through massive amounts of documents to find the ones that are useful. This algorithm has been shown to work effectively as a means of indexing documents, and it allows more documents to be added and indexed later. With this in mind, it is easy to see how the algorithm can be used in a search engine: each document is a website, and the search query finds the website that is most relevant to it.

Here are some cool implementations of LSI:

tm — Text Mining in R: https://cran.r-project.org/web/packages/tm/index.html

Gensim — Amazing package for LSI in python: https://radimrehurek.com/gensim/tut2.html

Recurrent Neural Networks for Beginners

What are Recurrent Neural Networks and how can you use them?

In this post I discuss the basics of Recurrent Neural Networks (RNNs) which are deep learning models that are becoming increasingly popular. I don’t intend to get too heavily into the math and proofs behind why these work and am aiming for a more abstract understanding.

General Recurrent Neural Network information

Recurrent Neural Networks were created in the 1980s but have only recently been gaining popularity, thanks to advances in network design and increased computational power from graphics processing units. They’re especially useful with sequential data because each neuron or unit can use its internal memory to maintain information about the previous input. This matters because, in the case of language, “I had washed my house” means something very different from “I had my house washed”. This allows the network to gain a deeper understanding of the statement.

This is important to note because reading through a sentence even as a human, you’re picking up the context of each word from the words before it.

A rolled up RNN

An RNN has loops in it that allow information to be carried across neurons while reading in input.

An unrolled RNN

In these diagrams x_t is some input, A is a part of the RNN and h_t is the output. Essentially you can feed in words from the sentence or even characters from a string as x_t and through the RNN it will come up with a h_t.
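
A minimal numeric sketch of this loop is shown below, assuming a single scalar input, a single scalar hidden state and the common update h_t = tanh(w_x * x_t + w_h * h_prev + b). The weights are made-up constants rather than learned values, and rnn_step and run_rnn are hypothetical names.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent update: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed the inputs in order, carrying the hidden state forward."""
    h = 0.0
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        outputs.append(h)
    return outputs

forward = run_rnn([1.0, 0.0, 0.0])
reordered = run_rnn([0.0, 0.0, 1.0])
print(forward)
print(reordered)
```

The two sequences contain the same values in a different order, yet they end in different hidden states, which is the sense in which the network remembers what came before.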

The goal is to use h_t as output and compare it to the expected output from your training data. You will then get your error rate. With the error rate in hand, you can use a technique called Back Propagation Through Time (BPTT). BPTT works backward through the unrolled network and adjusts the weights based on your error rate. This adjusts the network and makes it learn to do better.

Theoretically RNNs can handle context from the beginning of the sentence, which would allow more accurate predictions of a word at the end of a sentence. In practice this isn’t necessarily true for vanilla RNNs. This is a major reason why RNNs faded out from practice for a while, until some great results were achieved using a Long Short Term Memory (LSTM) unit inside the Neural Network. Adding the LSTM to the network is like adding a memory unit that can remember context from the very beginning of the input.


These little memory units allow RNNs to be much more accurate, and are the recent cause of the popularity around this model. They allow context to be remembered across inputs. Two of these units are widely used today: LSTMs and Gated Recurrent Units (GRUs). The latter is more computationally efficient because it has fewer parameters and so takes up less memory.

Applications of Recurrent Neural Networks

There are many different applications of RNNs. A great application is in collaboration with Natural Language Processing (NLP). RNNs have been demonstrated by many people on the internet who created amazing language models. These language models can take input such as a large set of Shakespeare’s poems, and after training they can generate their own Shakespearean poems that are very hard to differentiate from the originals!

Below is some Shakespeare

Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

Come, sir, I will make did behold your worship.

I'll drink it.

This poem was actually written by an RNN. It comes from an awesome article, http://karpathy.github.io/2015/05/21/rnn-effectiveness/, that goes more in depth on Char RNNs.

This particular type of RNN is fed a dataset of text and reads the input character by character. The amazing thing about these networks, compared to feeding in a word at a time, is that the network can create its own unique words that were not in the vocabulary you trained it on.


This diagram, taken from the article referenced above, shows how the model would predict “hello”. It gives a good visualization of how these networks take in a word character by character and predict the likelihood of the next probable character.
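
The data side of this setup can be sketched in plain Python: each character is one-hot encoded as input, the target is the character that follows it, and a softmax turns the network’s raw scores into likelihoods. The scores fed to softmax below are made-up numbers standing in for real network output, and the helper names are hypothetical.

```python
import math

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def char_training_pairs(text):
    """Build one-hot inputs and next-character target ids for a string."""
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    inputs = [one_hot(char_to_id[ch], len(vocab)) for ch in text[:-1]]
    targets = [char_to_id[ch] for ch in text[1:]]
    return vocab, inputs, targets

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab, inputs, targets = char_training_pairs("hello")
print(vocab)    # the distinct characters, sorted
print(inputs)   # one-hot vectors for "h", "e", "l", "l"
print(targets)  # ids of the characters that should come next

# Made-up raw scores for the character following "e".
probs = softmax([0.1, 0.2, 2.0, 0.3])
print(vocab[probs.index(max(probs))])
```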

Another amazing application of RNNs is machine translation. This method is interesting because it involves training two RNNs simultaneously. In these networks the inputs are pairs of sentences in different languages. For example, you can feed the network an English sentence paired with its French translation. With enough training you can give the network an English sentence and it will translate it to French! This model is called a Sequence to Sequence model or Encoder Decoder model.


This diagram shows how information flows through the Encoder Decoder model. The diagram uses a word embedding layer to get better word representations. A word embedding layer is usually built with the GloVe or Word2Vec algorithm, which takes a bunch of words and creates a weight matrix that allows similar words to be correlated with each other. Using an embedding layer generally makes your RNN more accurate because it is a better representation of how similar words are, so the net has less to infer.


Recurrent Neural Networks have become very popular recently, and for a very good reason. They’re one of the most effective models out there for natural language processing. New applications of these models are coming out all the time, and it’s exciting to see what researchers come up with.

To play around with some RNNs, check out these awesome libraries:

Tensorflow — Google’s machine learning framework, RNN example: https://www.tensorflow.org/versions/r0.10/tutorials/recurrent/index.html

Keras — a high level machine learning package that runs on top of Tensorflow or Theano: https://keras.io

Torch — Facebook’s machine learning framework in Lua: http://torch.ch