Latent Semantic Analysis (LSA) is the most popular corpus-based similarity technique and one way to tackle this problem. Proposed by Deerwester et al. [1], it takes advantage of the implicit higher-order structure in the association of terms with documents.

Latent semantic indexing (LSI) is a machine learning technique that takes as input a collection of documents and produces as output a vector space representation of the documents and terms of the collection. The central feature of LSI is the use of singular value decomposition (SVD) to achieve a large-scale dimensionality reduction. LSI overcomes two of the most problematic constraints of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have more than one meaning (polysemy). Synonymy is often the cause of mismatches between the vocabulary used by the authors of documents and the vocabulary used by the users of information retrieval systems. As a result, Boolean or keyword queries often return irrelevant results and miss information that is relevant.

LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent paragraphs) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows [4].
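As a minimal numerical sketch of this idea (a toy term-document matrix with made-up counts, not our patent data), we can reduce the matrix with a truncated SVD in numpy and compare two words by the cosine of their row vectors:

import numpy as np

# toy term-document counts: rows = terms, columns = documents
terms = ["human", "computer", "interface", "tree", "graph"]
X = np.array([
    [1, 1, 0, 0, 0],   # human
    [1, 1, 1, 0, 0],   # computer
    [1, 0, 1, 0, 0],   # interface
    [0, 0, 0, 1, 1],   # tree
    [0, 0, 0, 1, 1],   # graph
], dtype=float)

# truncated SVD: keep only the k largest singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]          # terms in the reduced latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(term_vectors[0], term_vectors[1]))  # human vs computer -> high
print(cosine(term_vectors[0], term_vectors[3]))  # human vs tree     -> near zero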


Case Study:

Patents are issued by a government and grant exclusivity rights to the owner of an invention; this protects the inventor's intellectual property (IP) from being copied or stolen. Since many businesses are built on novel technology these days, it is useful for them to protect their 'trade secrets' by filing patents, which has made patenting a multi-billion dollar industry. Filing a patent is a tricky business and requires a 'prior art' search, where the patent examiner looks for similar filed patents to make sure an equivalent invention was not filed earlier. A patent search process always starts with a multi-page document describing the invention, and patent search professionals are experts in creating large queries with both keywords and metadata. It is typical for a patent examiner to formulate a set of search queries from a given new patent and examine about 100 patents retrieved by 15 different queries on average. This process of validating patentability takes about 12 hours to complete. What makes this search tricky is the inscrutable text used to describe the invention. Apart from patent filing there are several other use cases, e.g. keeping track of inventions or following the development of a particular set of technologies being developed by your business competitors. Historically the options have been Google Patent Search or the United States Patent and Trademark Office (USPTO) search. But keyword search has its limitations, and here we are interested in going beyond keyword search to supplement the analysis.

Dataset:

Patent data can be obtained directly from the patent offices or from research collections. The US patent corpus has about 10 million documents, which we can get from the IFI CLAIMS patent database or Google's public patents dataset. The USPTO also has its full-text database available online for manual search [9].

Pre-Processing:

The patenting process requires the documents to adhere to a certain structure: there is usually an abstract, a description, and claims. Building a search system involves building a data pipeline that cleans, transforms, and models the data. We use gensim and Spark to pre-process our data. Following the standard pre-processing steps for text data, we perform the following tasks:

  • Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
  • Words that have fewer than 3 characters are removed.
  • All stopwords are removed.
  • Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
  • Words are stemmed — words are reduced to their root form.

The aim of this step is thus to clean the data: remove stop words, handle missing entries, and so on. A sketch of these steps is shown below.
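The following is a hedged sketch of the steps listed above, using gensim's simple_preprocess together with NLTK's lemmatizer and stemmer; the Spark pipeline we actually run differs in plumbing but performs the same steps.

import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer   # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def preprocess(text):
    tokens = []
    # simple_preprocess tokenizes, lowercases, strips punctuation and short words
    for token in gensim.utils.simple_preprocess(text, min_len=3):
        if token in STOPWORDS:
            continue
        token = lemmatizer.lemmatize(token, pos="v")   # lemmatize (verbs to present tense)
        tokens.append(stemmer.stem(token))             # then stem to the root form
    return tokens

preprocess("The EPS user interfaces were managing the systems")
# -> roughly ['eps', 'user', 'interfac', 'manag', 'system']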

‘a document is represented by the features extracted from it, not by its “surface” string form’.

There are several ways to pre-process text data to come up with the right set of features. We need to parse the XML content (stored in Postgres) to extract the title, description, and claims text. These three sections can be combined to create one text blob. For patent documents, the claims text can be separated out to create a separate claims index; this is usually done to increase search accuracy. At the end we generate two different indexes: the description_text index, whose corpus is comprised of description text documents, and the claim_text index, whose corpus is comprised of claims text documents. Next we use a general document representation called bag-of-words (unigram) for each of the given documents to create an unordered list of words, which become the features of the document. We tokenize the documents by splitting on whitespace, with the final result being a Python list of words. This mimics the Deerwester experiment [1].
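The snippet below is a hypothetical sketch of assembling the two corpora; the XML tag names and the fetch_patent_xml_from_postgres helper are illustrative stand-ins for our actual schema and data-access code, and preprocess is the function from the pre-processing sketch above.

import xml.etree.ElementTree as ET

def parse_patent(xml_string):
    # illustrative tag names; the real patent XML schema is more involved
    root = ET.fromstring(xml_string)
    title = root.findtext("title", default="")
    description = root.findtext("description", default="")
    claims = root.findtext("claims", default="")
    return title, description, claims

description_corpus, claims_corpus = [], []
for xml_string in fetch_patent_xml_from_postgres():        # hypothetical helper
    title, description, claims = parse_patent(xml_string)
    # title + description feed one index, claims feed the other
    description_corpus.append(preprocess(title + " " + description))
    claims_corpus.append(preprocess(claims))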

Several other document representation strategies exist as well and can potentially give us better results. In this step, using the bag-of-words list as input, we create a sparse vector representation for each document. In this representation each document is represented by one vector, and each element of that vector is a (word, frequency) pair. It is advantageous to represent the words only by their (integer) ids; the mapping between words and ids is called a dictionary in gensim jargon. Gensim provides a Dictionary class for creating this vector representation. At the end of this step we have a sparse vector (a list of tuples, where each tuple is a (word_id, frequency) pair) representing each document in the corpus.
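A minimal sketch with gensim's Dictionary class, using a toy tokenised corpus as a stand-in for the pre-processed patent texts:

from gensim import corpora

tokenised_docs = [["film", "unit", "lens"],
                  ["lens", "camera", "shutter"]]          # toy stand-in for our corpus

dictionary = corpora.Dictionary(tokenised_docs)            # word <-> integer id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]
print(bow_corpus[0])   # e.g. [(0, 1), (1, 1), (2, 1)] -> (word_id, frequency) pairs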

Transformations and modeling: Given the bag of words extracted from the document, we create a feature vector for the document, where each feature is a word (term) and the feature's value is a term weight; in our case the weight is a TF-IDF value. The entire document is thus a feature vector, and each feature vector corresponds to a point in a vector space. The model for this vector space has an axis for every term in the vocabulary, so the vector space is V-dimensional, where V is the size of the vocabulary. The vector should then conceptually also be V-dimensional, with a feature for every vocabulary term. However, because the vocabulary can be large (on the order of V = 100,000s of terms), a document's feature vector typically contains only the terms that occurred in that document and omits the terms that did not. Such a feature vector is considered sparse.
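Continuing the toy snippet above, the TF-IDF weighting in gensim looks like this (a sketch, not the full pipeline):

from gensim import models

tfidf = models.TfidfModel(bow_corpus)       # learn IDF statistics from the corpus
tfidf_corpus = tfidf[bow_corpus]            # apply the weighting lazily
print(list(tfidf_corpus)[0])                # sparse (word_id, tfidf_weight) pairs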

Method:

Off-the-shelf search and indexing methods face challenges due to neologisms and intentionally over-generalized expressions, because the applicant of a patent wants to claim as broad a chunk of the innovation as possible. Instead, we model the corpus with TF-IDF followed by LSI and query the resulting similarity index, as sketched below.
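This sketch continues from the toy gensim snippets above; the index path and query string are illustrative, and the topic count is the one described in the "Basic Experiment" section.

from gensim import models, similarities

lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=400)
lsi_corpus = lsi[tfidf_corpus]

# a disk-backed similarity index that can hold millions of documents
index = similarities.Similarity("/tmp/patent_index", lsi_corpus, num_features=400)

# query with free text (in practice, the full text of a patent)
query_bow = dictionary.doc2bow("film unit with lens".lower().split())
sims = index[lsi[tfidf[query_bow]]]         # cosine similarity to every document
top_hits = sorted(enumerate(sims), key=lambda item: -item[1])[:20]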

Evaluation:

We need to assess how good our semantic search system is at returning relevant documents that fulfil a specific information need. Real-world systems are challenging to evaluate because a user's information need is dynamic. In practice, to compare different retrieval modeling strategies, we aim for relative performance evaluation between systems: it is good to establish a baseline and then use the scores to rank the modeling approaches (or IR systems) relative to each other. It is also important to understand the information needs of the professional users of a patent search service, who typically have a lower tolerance for errors than normal web search users.

Trippe and Ruthven, for instance, answer the very fundamental question of "What is success?" with risk minimization. The argument is that what the searcher is ultimately trying to achieve is to minimize the risk of having a patent application rejected, infringing on someone else's in-force patent, being excluded from a market because of a new patent, and so on. They map the fundamental metrics of IR effectiveness, precision and recall, to the different search tasks and how they are ordered on the "risk scale." They observe that precision is more appropriate as a match, insofar as it orders the tasks in the same way as the importance of risk minimization.

Simulating user information needs: Generally speaking, in a lab setting we follow the Cranfield approach to measure the effectiveness of an IR retrieval strategy. This requires test collections, which can be used to create an experimental setup where various modeling approaches can be benchmarked against each other. A test collection consists of a document collection, a test suite of information needs expressible as queries, and a set of relevance judgments, typically a binary assessment of either relevant or non-relevant for each query-document pair [7]. The patent research community has produced such collections through the CLEF, TREC, and NTCIR evaluation campaigns.
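For illustration, precision and recall at a cutoff k can be computed from such binary judgments with a few lines of Python (the ids below are made up):

def precision_recall_at_k(ranked_ids, relevant_ids, k):
    retrieved = ranked_ids[:k]
    hits = len(set(retrieved) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# e.g. judgments say patents 12, 40 and 77 are relevant to this query
precision_recall_at_k([12, 5, 40, 9, 3], [12, 40, 77], k=5)   # -> (0.4, 0.666...)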

Sanity Checks (when a test collection isn't available):

About the Deerwester collection: this is a tiny corpus of nine documents, each consisting of only a single sentence.


>>> documents = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]

We use LSI to extract two topics out of this corpus. After the model and index have been created on this dataset, we perform the following query:

SearchQuery: "Human Computer Interaction"

Family Members Test:

Patents often have "family members" which share the same priority date, the same description text but have different claims text.  While the claims are different between family member to family member, they are very likely to be very very semantically similar.For any size corpus, submit the full text of each patent document in the corpus.For each search identify what % of family members are returned in

  • #1 search result
  • Top 20 search result
  • Top 21-100
  • Not in top 100
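A hypothetical sketch of this bucketing, where corpus_patents, search and get_family_members are stand-ins for our own data and helpers:

from collections import Counter

def bucket(rank):
    if rank == 0:
        return "#1 search result"
    if rank < 20:
        return "top 20"
    if rank < 100:
        return "top 21-100"
    return "not in top 100"

buckets = Counter()
for patent in corpus_patents:                              # hypothetical collection
    ranked_ids = search(patent.full_text, top_n=100)       # hypothetical search call
    for member_id in get_family_members(patent.id):        # hypothetical lookup
        rank = ranked_ids.index(member_id) if member_id in ranked_ids else 10**9
        buckets[bucket(rank)] += 1

total = sum(buckets.values())
for name in ["#1 search result", "top 20", "top 21-100", "not in top 100"]:
    print(name, round(100.0 * buckets[name] / total, 1), "%")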


Film Unit with Lens Test:
This test might end up being more of a heuristic than a statistical test, but it is an important test for any successful algorithm to pass. This Chinese patent, when translated to English, uses the word "camera" only as an adjective or in combination with the more esoteric term "camera obscura". The word "camera" is never used in the claims. Rather, the primary noun in the document is the "film unit with lens", which is essentially a "camera" as we know it but is never referred to that way in the claims set.

Here's the test: any successful semantic search engine should return "camera" patents when this Chinese patent is used as a search seed. For example, it should return many patents owned by Kodak, Canon, Fuji Film, etc. Our current search engine does a good job of this.


Performance Evaluation:

Apart from search accuracy we have to do performance evaluation, where we ask: how long does it take to train the model and index the data, and what is the query performance (i.e. how long does it take to process one or more queries)? Some of these can be seen as non-functional requirements.

Basic Experiment:

To evaluate the performance of our system we replicated the experiment done by the gensim authors (https://radimrehurek.com/gensim/wiki.html), in which LSA is applied to a corpus containing the full set of articles in English Wikipedia.

Differences in dataset:

  • The gensim authors used a corpus that contains 3.9M documents, 100K features (distinct tokens), and 0.76G non-zero entries in the sparse TF-IDF matrix. That Wikipedia corpus contains about 2.24 billion tokens in total and is 8 GB compressed.
  • We used this dataset: https://dumps.wikimedia.org/enwiki/20170820/, which is about 5 million documents and about 46 GB of raw text (13 GB compressed, for comparison).

Parameters (same for both experiments):

  • An LSI model was created with 400 topics and a vocabulary size of 2^17 (131,072) terms.

Hardware Differences:

Gensim Experiment - Hardware Details (2012):

  • MacBook Pro, Intel Core i7 2.3GHz, 16GB DDR3 RAM, OS X with libVec.
  • In distributed mode - four workers (Linux, 2 GHz dual-core Xeons, 4 GB RAM, with ATLAS)

Our Hardware Details:

  • The experiment was run on 5 instances of r4.4xlarge.
  • The gensim authors were using ATLAS (a library optimized for large matrix operations). Our system supports ARPACK, but we did not use it in this evaluation; using ARPACK can lead to performance gains of 2-3x at minimum.

Step                                                              | Gensim                                           | Our Implementation
Data Cleaning + Pre-Processing (results in TF-IDF representation) | 9 hrs (can only run on MacBook)                  | ~30 mins (Spark)
Modeling                                                          | 5.25 hrs (on MacBook) or 1 hr 40 mins (cluster)  | ~2 hrs (Spark)
Indexing                                                          | N/A                                              | ~3 hrs (gensim)
Total                                                             | ~14 hrs                                          | ~5 hrs

Conclusion:

In this project we built a semantic search engine that allows us to understand a user's intent through contextual meaning. We started by discussing some of the specificities of the patent domain, its rules and customs, and learned how patents are formed as documents, pre-processed, and modeled. The core technique behind LSA is singular value decomposition, in which we decompose a term-document matrix into a set of orthogonal vectors.


References:

[1] http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf

[2] https://lirias.kuleuven.be/bitstream/123456789/321960/1/MSI_1114.pdf

[3] http://pages.cs.wisc.edu/~jerryzhu/cs769/text_preprocessing.pdf

[4] detailed discussion on advanced topics:  Moens, M. F. (2006). Information extraction: Algorithms and prospects in a retrieval context (The Information Retrieval Series 21).

[5] Paper:  'So many topics, so little time':
http://sigir.org/files/forum/2009J/2009j-sigirforum-roda.pdf

[6] Patent Information Retrieval:  http://www.ir-facility.org/c/document_library/get_file?uuid=a176a998-bfc6-41a4-b0d2-852784e3e720&groupId=10156

[7] http://www.informationr.net/ir/18-2/paper582.html#.WHSxs7Z96QM

[8] https://www.ccs.neu.edu/home/vip/teach/IRcourse/IR_surveys/fntir_2013.pdf

[9] http://patft.uspto.gov/