The science behind finding a needle in a needlestack.
Language is hard. It’s very difficult to find out what people are trying to say. It’s even harder to try to search through a corpus of text to find something relevant. With so many differences in meaning and phrasing how are we supposed to find what we want? This is where topic modeling comes in.
Topic Modeling is a form of identifying patterns in text using math. Using these methods actually allows us to find hidden meaning and relationships that are not obvious to us as humans. To create our search engine and find what we want from the text, we will be using an algorithm known as Latent Semantic Indexing (LSI) others may refer to it as Latent Semantic Analysis or (LSA) but they’re one in the same.
Latent Semantic Indexing
LSI is a method which allows users to find documents based on keywords. LSI works with the assumption that there is an underlying structure to language that is hidden because there are so many words to choose from. I’m going to highlight the main steps then go through each of them in detail.
- Organize your data into a matrix for word counts per document using Term Frequency Inverse Document Frequency
- Reduce the dimensionality of the matrix using Singular Value Decomposition.
- Compare words by taking the cosine similarity between two vectors to find how similar they are.
Organize your data
The first thing to getting started on using LSI is to transform your documents into a usable form. Most commonly used is Term Frequency Inverse Document Frequency (TFIDF). First off we need to create a list of all the words that appear in the document. After getting all the words we need to remove the words that don’t give us much meaning to the sentece such as “of”, or “because”. These words are considered stop words. We can even cut down this list even more by removing words that only appear in one document.
This form creates an n-dimensional matrix with terms that appear in any of the documents by the different documents in the corpus. The intersection of the two is how many times a particular term appears in the document.
Since our tfidf matrix is a very large and sparse it may be straining on memory, and cpu intensive to search through this entire thing everytime we have a query. A solution to this problem is to use Singular Value Decomposition. What SVD does is takes our tfidf matrix and reduces the dimensionality of it to something manageable. There is some debate around what dimensionality is ideal. Anywhere within the 50–1000 dimensions work are effective.
SVD comes from linear algebra where a rectangular matrix A can be broken down int the product of three different matrices. The theorem is usually written as:
A is the matrix that is broken into an orthogonal matrix U, a diagonal matrix S and the transpose of an orthogonal matrix V. “Where UTU = I,VTV = I; the columns of U are orthonormal eigenvectors of AAT, the columns of V are orthonormal eigenvectors of AT A, and S is a diagonal matrix containing the square roots of eigenvalues from U or V in descending order.” (Kirk Baker)
The idea behind this is that we take our tfidf and break it down into independent components. Then take these independent components and multiply them all together to get A. These components are an abstraction from the noisy correlations in the original data. This gives us the best approximation of the underlying structure of the documents. SVD makes the documents that are similar appear more similar and the documents that were dissimilar appear to be more dissimilar as well. Our goal is not to actually reconstruct the original matrix but to use the reduced dimensionality representation to get similar words and documents.
Compute the similarity
You can query your model to find the documents relevent to search keywords. To find what matches the query in the reduced term-document space, the query must be transformed into a psuedo-document. The terms are represented in a m x 1 vector. The appropriate local and global weighting functions for the document collection are ran on the terms vector. This vector is compared to all the existing term and document vectors using cosine similarity. The documents closer to one are more similar to the query and the documents closer to zero are less similar to the query. A ranked list is returned with all of the cosine similarities.
Latent Semantic Indexing is a helpful technique to wade through massive amounts of documents to find the ones that are useful. This algorithm has been proven to work effectively as a means to index documents and allow for more documents to be added and indexed. With this in mind it can be seen why using this algorithm can be used in a search engine. Each document is a website and the search query can find the website that is most relevant to the query.
Here are some cool implementations of LSI:
tm — Text Mining in R: https://cran.r-project.org/web/packages/tm/index.html
Gensim — Amazing package for LSI in python: https://radimrehurek.com/gensim/tut2.html