What is SoftMatcha?
SoftMatcha is a large-scale text search system designed for soft (i.e., semantic) matching over massive corpora.
Unlike traditional keyword-based tools such as grep or exact n-gram retrieval systems like infini-gram, SoftMatcha enables retrieval based on semantic similarity rather than exact surface forms.
SoftMatcha 2 Latest Preprint
A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
Achieves fast semantic search on trillion-scale corpora using a suffix array and corpus-aware pruning, with new support for insertions and deletions in query patterns.
SoftMatcha ICLR 2025
A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches
Relaxes lexical searches with word embeddings while preserving inverted-index efficiency.
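
To make the idea of relaxing lexical search with word embeddings over an inverted index more concrete, here is a minimal toy sketch. It is not SoftMatcha's implementation or API: the embedding values are made up, and the names `EMB`, `build_inverted_index`, and `soft_search` are hypothetical. Each query word is expanded to every vocabulary word whose cosine similarity meets a threshold, and the position lists of those words are shifted and intersected to find phrase matches.

```python
import math
from collections import defaultdict

# Toy 4-d embeddings. These values are invented for illustration only;
# a real system would load pretrained word embeddings.
EMB = {
    "the":    (1.0, 0.0, 0.0, 0.0),
    "cat":    (0.0, 1.0, 0.1, 0.0),
    "kitten": (0.0, 0.9, 0.3, 0.0),
    "dog":    (0.0, 0.1, 1.0, 0.0),
    "sleeps": (0.0, 0.0, 0.0, 1.0),
    "naps":   (0.1, 0.0, 0.0, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_inverted_index(tokens):
    """Map each word to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def soft_search(query, index, threshold):
    """Return start positions where every query word soft-matches the corpus.

    A corpus word soft-matches a query word when their cosine similarity
    is at least `threshold`.
    """
    starts = None
    for offset, q in enumerate(query):
        hits = set()
        for word, positions in index.items():
            if cosine(EMB[q], EMB[word]) >= threshold:
                # Shift each hit back to the candidate start of the phrase.
                hits.update(p - offset for p in positions)
        starts = hits if starts is None else starts & hits
    return sorted(starts or [])


CORPUS = "the cat sleeps the kitten naps the dog sleeps".split()
INDEX = build_inverted_index(CORPUS)

print(soft_search(["cat", "sleeps"], INDEX, 0.9))    # [1, 4]
print(soft_search(["cat", "sleeps"], INDEX, 0.999))  # [1]
```

With the looser threshold, the query "cat sleeps" also retrieves "kitten naps" at position 4; with a near-exact threshold it behaves like an ordinary phrase lookup. The sketch scans the whole vocabulary for each query word, which is fine for a toy example but would need a faster similarity lookup at corpus scale.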