Part-of-speech tagging, question answering, named entity recognition, speech recognition, text-to-speech, language modeling, translation, speech-to-text, and topic modeling are only a handful of the many tasks that fall within the broad category of NLP. Topic modeling is the process of examining the contents of a text collection from a coarse, high-level perspective.
Patent filings are typically written in legal and highly technical language, with context-sensitive phrases whose meanings may differ greatly from everyday speech. Searching through the corpus of more than 100 million patent documents can be time-consuming and can miss many results because terminology is used broadly and inconsistently. As the patent corpus continues to grow, there is a need for more useful NLP models for this domain.
The Patent Phrase Similarity dataset is a novel human-rated contextual phrase-to-phrase semantic matching dataset. We offer granular rating classes similar to WordNet, such as synonym, antonym, hyponym, holonym, meronym, and domain related, in addition to similarity scores normally included in other benchmark datasets. According to preliminary findings, models that have been fine-tuned on this new dataset outperform conventional pre-trained models.
The Patent Phrase Similarity Dataset
The researchers developed the Patent Phrase Similarity dataset to help train the next generation of state-of-the-art models. Many NLP models struggle with data in which unrelated phrases share similar keywords. The Patent Phrase Similarity dataset contains many such adversarial keyword matches, where phrases with overlapping keywords are in fact unrelated. The dataset consists of 48,548 entries with 973 unique anchors and is split into training (75%), validation (5%), and test (20%) sets.
Establishing the Dataset
To create the Patent Phrase Similarity data, we first process the 140 million patent documents in the Google Patents corpus and automatically extract important English phrases, most of which are noun phrases (for example, “fastener” and “lifting assembly”) and functional phrases (for example, “ink printing”). We then filter the phrases, keep those that appear in at least 100 patents, and randomly choose about 1,000 of them, which we refer to as anchor phrases. For each anchor phrase, we locate all of the matching patents and all of the CPC (Cooperative Patent Classification) classes assigned to those patents. Finally, we randomly sample up to four of the matching CPC classes, which become the context CPC classes for that anchor phrase.
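The anchor selection steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `select_anchors`, the two input dictionaries, and the toy patent IDs and CPC codes are all hypothetical, and the thresholds are exposed as parameters so the sketch can run on tiny data.

```python
import random


def select_anchors(phrase_to_patents, patent_to_cpc, n_anchors=1000,
                   min_patents=100, max_contexts=4, seed=0):
    """Pick anchor phrases and sample context CPC classes for each.

    phrase_to_patents: dict mapping phrase -> set of patent IDs (hypothetical input).
    patent_to_cpc: dict mapping patent ID -> list of CPC classes (hypothetical input).
    """
    rng = random.Random(seed)
    # Keep only phrases that occur in at least `min_patents` patents.
    eligible = sorted(
        phrase for phrase, patents in phrase_to_patents.items()
        if len(patents) >= min_patents
    )
    # Randomly choose the anchors from the eligible phrases.
    anchors = rng.sample(eligible, min(n_anchors, len(eligible)))

    contexts = {}
    for anchor in anchors:
        # Collect every CPC class attached to any patent containing the anchor.
        cpc_classes = sorted({
            cpc
            for patent in phrase_to_patents[anchor]
            for cpc in patent_to_cpc.get(patent, ())
        })
        # Sample up to `max_contexts` of them as the anchor's context classes.
        contexts[anchor] = rng.sample(cpc_classes, min(max_contexts, len(cpc_classes)))
    return contexts


# Toy data; a threshold of 2 patents stands in for the 100 used in the text.
toy_phrases = {"fastener": {"p1", "p2", "p3"}, "rare phrase": {"p1"}}
toy_cpc = {"p1": ["F16B", "B25B"], "p2": ["F16B", "E04B"], "p3": ["A47G", "B60R", "F16B"]}
toy_contexts = select_anchors(toy_phrases, toy_cpc, n_anchors=5, min_patents=2)
```

On the toy data, only “fastener” passes the frequency filter, and four of its five distinct CPC classes are sampled as contexts.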
We employ two techniques for pre-generating target phrases: partial matching and a masked language model (MLM). For partial matching, we randomly choose phrases from the entire corpus that only partially match the anchor phrase (e.g., “abatement” and “noise abatement,” or “material formation” and “formation material”). For MLM, we choose sentences from the patents that contain a given anchor phrase, mask the phrase out, and then use the Patent-BERT model to predict candidates for the masked portion of the text. All of the phrases are then cleaned up, which includes lowercasing, punctuation removal, and the elimination of some stopwords (such as “and,” “or,” and “said”), before being sent to expert raters for review. Each phrase pair is rated independently by two raters who are skilled in the relevant technology area.
Additionally, each rater creates new target phrases with different ratings. In particular, raters are asked to come up with some low-similarity or unrelated targets that partially match the original anchor, as well as some high-similarity targets. Finally, the raters meet to discuss their ratings and agree on the final ratings.
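As a rough illustration of the cleanup and partial-matching steps described above, here is a minimal Python sketch. The helper names `clean_phrase` and `partially_matches` are hypothetical, the stopword list contains only the three examples named in the text (the full list is not given), and word overlap is just one plausible reading of “partially match.”

```python
import string

# Only the three stopwords the text names; the real list is an unknown.
STOPWORDS = {"and", "or", "said"}

# Translation table that deletes ASCII punctuation (curly quotes are not covered).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_phrase(phrase):
    """Lowercase, strip ASCII punctuation, and drop the stopwords above."""
    lowered = phrase.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = [tok for tok in no_punct.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def partially_matches(anchor, candidate):
    """Assumed definition: the phrases differ but share at least one word."""
    shared = set(anchor.split()) & set(candidate.split())
    return anchor != candidate and bool(shared)


cleaned = clean_phrase("Said Noise Abatement, and Control.")
```

Here `cleaned` is `"noise abatement control"`, and `partially_matches("abatement", "noise abatement")` is true while `partially_matches("abatement", "ink printing")` is not.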
The Patent Phrase Similarity dataset was used as the benchmark in the U.S. Patent Phrase to Phrase Matching Kaggle competition. The competition was popular, drawing about 2,000 competitors from around the world. The top-scoring teams successfully applied a variety of approaches, including ensembles of BERT variants and prompting (see the complete discussion for more details). The table below shows the top results from the competition along with several off-the-shelf baselines from our study. Models were scored with the Pearson correlation metric, which measures the linear correlation between the predicted and true scores and rewards models that can distinguish between different similarity ratings.
The baselines in the study are zero-shot because they use off-the-shelf models without any further fine-tuning on the new dataset (we use these models to embed the anchor and target phrases separately and compute the cosine similarity between them). The Kaggle competition results show that models trained on our data can significantly outperform existing NLP models. We have also estimated human performance on this task by comparing a single rater's scores against the combined score of both raters. The findings show that this is not a particularly easy task, even for human experts.
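For readers who want to see what the Pearson metric computes, the following is a small pure-Python sketch. The function name `pearson` and the example scores are made up for illustration; they are not taken from the dataset or the competition.

```python
import math


def pearson(predicted, actual):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Covariance and variances, left unnormalized since the 1/n factors cancel.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_p * var_a)


# Made-up scores: predictions shifted by a constant still correlate perfectly.
perfect = pearson([0.1, 0.6, 1.1], [0.0, 0.5, 1.0])
inverted = pearson([1.0, 0.0], [0.0, 1.0])
```

Because the metric measures linear agreement rather than exact equality, a model whose scores are offset or rescaled relative to the human ratings can still reach a correlation close to 1.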
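The zero-shot scoring recipe (embed each phrase separately, then take the cosine similarity) can be sketched as follows. The embedding function here, `toy_embed`, is a hypothetical bag-of-words stand-in so the example runs without any model download; the study's baselines use pre-trained models instead, and `zero_shot_score` is likewise an illustrative name.

```python
import math
from collections import Counter


def toy_embed(phrase):
    """Hypothetical stand-in for a pre-trained encoder: word-count vector."""
    return Counter(phrase.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty phrases
    return dot / (norm_u * norm_v)


def zero_shot_score(anchor, target, embed=toy_embed):
    """Embed anchor and target separately, then compare them."""
    return cosine_similarity(embed(anchor), embed(target))


swapped = zero_shot_score("material formation", "formation material")
unrelated = zero_shot_score("abatement", "ink printing")
```

With this toy embedding, “material formation” and “formation material” score as identical purely because they share keywords, which is the kind of keyword-driven behavior the dataset is designed to expose. Passing a different `embed` function swaps in another encoder without changing the scoring logic.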
|Model|Training|Pearson correlation|
|---|---|---|
|Kaggle 1st place single|Fine-tuned|0.87|
|Kaggle 1st place ensemble|Fine-tuned|0.88|
Performance of popular models with no fine-tuning (zero-shot), models fine-tuned on the Patent Phrase Similarity dataset as part of the Kaggle competition, and single human performance.
Final Thoughts and Future Work
The patent corpus can be used to build more challenging machine learning benchmarks. For instance, the C4 text dataset used to train the T5 model contains many patent documents. The BigBird and LongT5 models also use patents through the BIGPATENT dataset.
This article is written as a research summary by Marktechpost Staff based on the research paper 'Patents Phrase to Phrase Semantic Matching Dataset'. All credit for this research goes to the researchers on this project.