Challenges in Implementing a Full-Text-Search Engine

(bhavaniravi.com)

121 points | by bhavaniravi 1672 days ago

5 comments

  • misterman0 1672 days ago
    FTS is dead. It didn't use to be dead but it is surely dead today. For years and years and years we were told to use stop words, integrate a stemming library, make sure term weights are normalised with TF-IDF, maintain a synonyms dictionary, and use an index that keeps track of term positions to cater for term proximity; if you did all that, the relevance of your search results would be top notch. Top notch! Today none of that is relevant. Today you need to provide semantic search ("SS") or your search will be considered broken because the results will truly suck. To provide SS you need ML.
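
    A minimal sketch of that classic recipe, with scikit-learn standing in for a real engine (stemming, synonyms and positional proximity are left out):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = [
          "the quick brown fox jumps over the lazy dog",
          "search relevance with stop words and tf-idf weighting",
          "semantic search needs machine learning models",
      ]
      # stop-word removal + TF-IDF term weighting in one step
      vectorizer = TfidfVectorizer(stop_words="english")
      doc_matrix = vectorizer.fit_transform(docs)

      query = vectorizer.transform(["tf-idf search relevance"])
      scores = cosine_similarity(query, doc_matrix)[0]
      for rank, i in enumerate(scores.argsort()[::-1], start=1):
          print(rank, round(scores[i], 3), docs[i])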

    FTS is still commonly included in e-commerce where I guess it's not quite dead but only because truly relevant search results are irrelevant to retailers' bottom line.

    SS is at least an order of magnitude more complex to set up, especially if you want to be able to refresh your "index", and even more so if your data is big. What a typical back-ender would whip up in a couple of days using ES/Solr you now need a proper Math Dev to do and for the MD's model to become useful you need them to hand it over to a distributed systems expert.

    SS through ML is commonly a nasty, duct-taped work-flow that at best results in a system that looks more like a POC or DEMO than a proper production system, unless you work at Bing/Google that is (probably, but I have no hands-on experience of those systems).

    I've been trying for years, I'd say at least ten, to "commoditize" this workflow, to make it simpler and more usable for generalist devs (not just ML people), but no matter what I do I keep getting crushed under the weight of the data. To me it seems search is dead and we haven't appointed a new king.

    • ObserverEffect 1672 days ago
      Disclaimer: I work at Algolia (https://www.algolia.com/), a hosted search as a service API.

      While I agree that building a great and relevant search experience with a Lucene-based engine requires lots of extra time and effort to get right, there are other non-TF-IDF based solutions that provide a much faster path to great relevance with far less effort (https://blog.algolia.com/inside-the-algolia-engine-part-1-in...), and it's possible to have semantic ranking without too much machine learning (https://blog.algolia.com/promote-search-results-query-rules/). Not to discount the value of machine learning - we're finding that for specific use cases ML can be a very valuable way to help surface more pertinent content for individuals based on their profile/preferences etc. (https://blog.algolia.com/personalization-announcement/).

      This may be along the lines of what you mentioned around "commoditizing" complex traditional search workflows. I'd be curious to hear more about what kind of use-cases you think are trickiest without SS.

      • misterman0 1672 days ago
        Although I'm a big fan of Algolia search (because it's freakin' fast) I happen to know little to nothing of your search model other than what I have learned from Algolians chipping in right here on HN.

        I used to be quite impressed with Lucene, even at version 1.0 (when a fuzzy search meant a full table scan), then watched in joy when they conquered the search market, before realizing how it struggled (and still does) with, well, I hate to say it (because I'm usually ridiculed when I bring this up), y'know, big (-ish) data. The proposed and popular solution: sharding the data onto a cluster of machines.

        Algolia seems to be a focused, streamlined and more efficient Elasticsearch, at least in the FTS use case.

        I've worked almost exclusively in e-com for ~20 years. Algolia FTS+personalization seems to fit the e-commerce use case pretty darn well.

        I wonder, regarding "Algolia Query Rules" (which also seems like a real killer-feature for e-commerce):

        >> automatically transforming a query word into its equivalent filter (“cheap” would become a filter price< 400...

        How do you translate "cheap" into "price<400"? By maintaining a dictionary? Also, what if some people think 400 is quite expensive?
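
        A naive sketch of the dictionary guess, to make the question concrete (the rules table and the 400 threshold are made up, not Algolia's actual mechanism):

          RULES = {"cheap": ("price", "<", 400)}  # hand-maintained, one-size-fits-all

          def rewrite(query):
              terms, filters = [], []
              for token in query.lower().split():
                  if token in RULES:
                      field, op, value = RULES[token]
                      filters.append(f"{field} {op} {value}")
                  else:
                      terms.append(token)
              return " ".join(terms), filters

          print(rewrite("cheap blue shoes"))  # -> ('blue shoes', ['price < 400'])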

        I want to build or implement a search engine that is inherently self-maintained in the same way you and I are self-maintained. As humans, however, we do have a serious flaw: in order to maintain an index of our knowledge, we need to sleep. To start with I'd like to try to mimic that construct, then move past it.

    • AznHisoka 1672 days ago
      "Today none of that is relevant. Today you need to provide semantic search ("SS") or your search will be considered broken because the results will truly suck."

      I agree semantic search is useful, but what you're proposing sounds vague, like black box magic.

      To provide semantic search, you still do the dirty things you just mentioned: integrate a stemming library, integrate synonyms and a huge corpus of 2+ word topics (i.e. when someone searches for "big data", you should always return documents with those terms together, and never apart). You need those things. ML might help you generate them, but it isn't magically going to tell you that documents X, Y and Z should also be returned for a query even though they don't contain its terms.
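
      A rough sketch of that dirty pipeline (NLTK's PorterStemmer is assumed; the synonym and phrase tables are illustrative stand-ins, not a real corpus):

        from nltk.stem import PorterStemmer

        SYNONYMS = {"ml": "machine learning"}        # illustrative
        PHRASES = ["big data", "machine learning"]   # must match as one unit
        stemmer = PorterStemmer()

        def analyze(query):
            q = query.lower()
            for short, full in SYNONYMS.items():
                q = q.replace(short, full)           # naive substring expansion
            parts = []
            for phrase in PHRASES:
                if phrase in q:
                    parts.append(f'"{phrase}"')      # quoted: terms stay adjacent
                    q = q.replace(phrase, " ")
            parts += [stemmer.stem(t) for t in q.split()]
            return parts

        print(analyze("searching big data with ml"))
        # ['"big data"', '"machine learning"', 'search', 'with']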

    • arafalov 1672 days ago
      > SS through ML is commonly a nasty, duct-taped work-flow that at best results in a system that looks more like a POC or DEMO than a proper production system, unless you work at Bing/Google that is (probably, but I have no hands-on experience of those systems).

      This is being worked on in the Solr community. See the in-progress book (https://www.manning.com/books/ai-powered-search) and a related issue for an example of the work: https://issues.apache.org/jira/browse/SOLR-9418

    • raz32dust 1672 days ago
      I've worked in well-known tech companies, and full-text search is very much alive and well. Sure, semantic search is obviously better. But you immediately add a ton of value to many use cases just with FTS. And upgrading to SS is, in several cases, not important enough to warrant the complexity.
    • yxhuvud 1672 days ago
      > What a typical back-ender would whip up in a couple of days using ES/Solr you now need a proper Math Dev to do and for the MD's model to become useful you need them to hand it over to a distributed systems expert.

      This in itself is a sufficient argument to refute your main claim that FTS is dead. It will not be dead until an alternative is sufficiently commoditized that regular backenders can set it up. It may be edged out of certain use cases where it is mission critical to get fantastic results, but there are MANY use cases where good old boring FTS is plenty good enough. Otherwise it wouldn't have been used in the first place.

    • hdfbdtbcdg 1672 days ago
      FTS is fine for so many applications. Many, many business cases are solved by Elasticsearch in a couple of days, with no need for ML.
    • marcusae313 1670 days ago
      I agree with you for the most part. There are companies providing an extensible and open source platform for incorporating models into production systems and even helping you build those models, like Lucidworks (my employer). The Algolia platform is not powered by any ML afaik, because their system is very closed and not easily tuned.
    • fulmicoton 1671 days ago
      Can you point us to one search engine, known to both you and us, that uses semantic search?
    • ddorian43 1672 days ago
      Have you seen vespa.ai?
      • misterman0 1672 days ago
        No I haven't, so thanks for the link. It looks very good, which makes me cautious. Is it a pig with make-up, I wonder. What's your experience from using vespa.ai?
        • j-e-k 1672 days ago
          Extremely positive.

          Very high performance, and much deeper support for tensors than Elasticsearch has with its new dense_vector type. Sadly they both have the same "gotcha" in that it is only really practical to run any _decent_ ranking model as a second-phase re-rank to avoid linear scans.
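
          The shape of that second-phase re-rank, as a toy sketch (both phases are stand-ins, not Vespa's or Elasticsearch's actual implementations):

            import numpy as np

            rng = np.random.default_rng(0)
            N, D = 10_000, 64
            doc_vecs = rng.normal(size=(N, D)).astype("float32")  # "decent model" embeddings
            doc_terms = rng.integers(0, 50, size=N)               # toy lexical signal

            def search(query_vec, query_term, n_candidates=100, k=10):
                # Phase 1: cheap lexical filter (stand-in for BM25/WAND recall)
                candidates = np.where(doc_terms == query_term)[0][:n_candidates]
                # Phase 2: the expensive model scores only the candidates,
                # never a linear scan over all N documents
                scores = doc_vecs[candidates] @ query_vec
                return candidates[np.argsort(scores)[::-1][:k]]

            print(search(rng.normal(size=D).astype("float32"), query_term=7))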

    • xhgdvjky 1672 days ago
      I agree that FTS is not optimal, but it uses core CS tricks so I think it's worth learning. Word vectors are also interesting as a concept on their own.
  • theandrewbailey 1672 days ago
    I've used the full text search feature in Postgres (even before that, I was vaguely familiar with the topics covered here). It worked unless you misspelled something or split/merged compound words. Trigrams solved that. Whenever I get around to upgrading, I'd love to use the websearch_to_tsquery function.

    https://www.postgresql.org/docs/current/textsearch.html

    https://www.postgresql.org/docs/current/pgtrgm.html
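
    A sketch of those pieces together, assuming psycopg2 and an illustrative articles table (websearch_to_tsquery needs PostgreSQL 11+):

      import psycopg2

      conn = psycopg2.connect("dbname=example")  # hypothetical database
      cur = conn.cursor()
      cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")

      # websearch_to_tsquery understands quoted phrases, OR and -exclusion
      cur.execute("""
          SELECT title FROM articles
          WHERE to_tsvector('english', body)
                @@ websearch_to_tsquery('english', %s)
      """, ('"compound words" -misspelled',))

      # pg_trgm's similarity operator catches misspellings that tsquery misses
      # (%% is psycopg2's escape for a literal % operator)
      cur.execute("SELECT title FROM articles WHERE title %% %s", ("postgress",))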

  • ddorian43 1672 days ago
    This is extremely high level. A nice view of how Lucene internals work, by core committer Adrien Grand:

    https://www.slideshare.net/lucenerevolution/what-is-inalucen...

    https://www.youtube.com/watch?v=T5RmMNDR5XI

  • andrewmatte 1672 days ago
    Nice article. I am interested to see this stuff blended with GPU/ML-powered databases rather than the TF-IDF of decades ago, however well it works.
    • m_ke 1672 days ago
      Someone needs to make a DB with first class support for dense feature vectors (embeddings) and approximate nearest neighbor search.

      These two features would allow you to do visual search, semantic text search, recommendations, learning to rank, etc.

      • mumblemumble 1672 days ago
        I'd love to have something like that. As far as I'm aware, one big limiting factor is that there aren't currently any great ways to build an index for approximate nearest neighbor search that doesn't require you to keep the whole index in memory. A disk-friendly indexing method would make it just a PostgreSQL plugin away.
        • gravypod 1672 days ago
          There are no good exact indexing structures, but there are a lot of very high-performance approximate NN structures. Facebook has an open source implementation of some of these in a project called faiss [0], which does a relatively good job of this.

          [0] - https://github.com/facebookresearch/faiss
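
          Minimal faiss usage along those lines (exact brute force first, then an approximate inverted-file index; the data here is random):

            import numpy as np
            import faiss  # pip install faiss-cpu

            d = 128
            xb = np.random.random((100_000, d)).astype("float32")  # database vectors
            xq = np.random.random((5, d)).astype("float32")        # query vectors

            index = faiss.IndexFlatL2(d)      # exact, brute force
            index.add(xb)
            dist, ids = index.search(xq, 10)  # 10 nearest neighbours per query

            # Approximate variant: cluster vectors into cells, probe a few per query
            quantizer = faiss.IndexFlatL2(d)
            ivf = faiss.IndexIVFFlat(quantizer, d, 256)  # 256 Voronoi cells
            ivf.train(xb)
            ivf.add(xb)
            ivf.nprobe = 8
            dist, ids = ivf.search(xq, 10)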

          • blr246 1672 days ago
            At Frame.ai, we are using both PostgreSQL and faiss (and other tools) in our stack to do several different kinds of inference tasks on semantic representations of text to help companies understand and act on customer chats, emails, and phone call transcripts.

            We've frequently had the same dream of adding more native support for nearest-neighbor type queries, since that is the workhorse of so many useful techniques in the modern NLP stack.

            Right now, we have lots of dense vectors stored in massive TOAST tables in PG. It's faster to fetch them than to recompute them, especially since there are a number of preprocessing steps that limit what we pay attention to.

            The discussion here about full text search versus semantic search is interesting. In our experience, both are highly relevant. Sometimes it's most useful for our customers to segment their conversation data by exact text matches, and other times semantic clustering is most effective. I think there's plenty of reason to offer both kinds of capabilities.
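
            A sketch of that store-don't-recompute pattern, with an illustrative table (large bytea values are what PG moves into TOAST storage):

              import numpy as np
              import psycopg2

              conn = psycopg2.connect("dbname=example")  # hypothetical database
              cur = conn.cursor()
              cur.execute("CREATE TABLE IF NOT EXISTS embeddings"
                          " (doc_id int PRIMARY KEY, vec bytea)")

              vec = np.random.random(768).astype("float32")  # stand-in embedding
              cur.execute("INSERT INTO embeddings VALUES (%s, %s)",
                          (42, psycopg2.Binary(vec.tobytes())))

              # fetching beats re-running preprocessing + the model forward pass
              cur.execute("SELECT vec FROM embeddings WHERE doc_id = %s", (42,))
              fetched = np.frombuffer(cur.fetchone()[0], dtype="float32")
              conn.commit()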

      • donretag 1672 days ago
        Elasticsearch now has that in versions 7.3 and later

        https://www.elastic.co/guide/en/elasticsearch/reference/curr...

        The vectors are only used for scoring, not matching, but they are working on an ANN model for that.
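
        A sketch of that scoring-only usage (index name, field names and the query vector are illustrative; the string-form field argument follows the current docs):

          import requests

          query = {
              "query": {
                  "script_score": {
                      "query": {"match": {"body": "semantic search"}},  # matching stays lexical
                      "script": {
                          # cosine similarity against a dense_vector field only reorders hits
                          "source": "cosineSimilarity(params.q, 'embedding') + 1.0",
                          "params": {"q": [0.1, 0.2, 0.3]},  # toy 3-dim query vector
                      },
                  }
              }
          }
          resp = requests.get("http://localhost:9200/articles/_search", json=query)
          print(resp.json()["hits"]["hits"])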

      • xellisx 1671 days ago
        Sphinx Search has an engine plugin for MySQL
  • avremel 1672 days ago
    I wrote an intro to the Lucene scoring model with a python example:

    https://github.com/avremel/lucene

    Elastic/Solr is a very decent option. Last time I checked, Algolia and other SaaS offerings were too expensive for small businesses.
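
    For reference, the flavour of formula such an intro covers: BM25, Lucene's default similarity since 6.0 (toy corpus, the usual k1/b defaults):

      import math

      docs = [d.split() for d in (
          "the quick brown fox", "brown foxes are quick", "lazy dogs sleep")]
      N = len(docs)
      avgdl = sum(len(d) for d in docs) / N
      k1, b = 1.2, 0.75  # Lucene's defaults

      def idf(term):
          df = sum(term in d for d in docs)  # document frequency
          return math.log(1 + (N - df + 0.5) / (df + 0.5))

      def bm25(query, doc):
          score = 0.0
          for term in query.split():
              tf = doc.count(term)  # raw term frequency, dampened below
              score += idf(term) * tf * (k1 + 1) / (
                  tf + k1 * (1 - b + b * len(doc) / avgdl))
          return score

      for d in docs:
          print(round(bm25("quick brown", d), 3), " ".join(d))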