How fast does interpolation search converge?

(lemire.me)

57 points | by tgymnich 1240 days ago

11 comments

  • unnouinceput 1239 days ago
    As a side note, if you have the option to sort your data before searching it, then implement a trie:

    https://en.wikipedia.org/wiki/Trie

    This is best for data with very few updates/modifications and a lot of queries. One example would be points of interest on a map, like restaurants on Google Maps. It's not as if those restaurants get updated/modified every single day, but you do have clients who, while using your restaurant-finder app, query a lot of them within a given radius around their GPS coordinates.

    I had such a project in the past. My client acquired this data, around 3 billion points of interest residing in a PostgreSQL DB that was around 900 GB. While the server was beefed up with 128 GB of RAM, that was not nearly enough to hold it all in memory, and responses took several seconds, way too long for young, impatient users; my client was seeing a decline in users. I reorganized the data in PostgreSQL to fit a trie, and a query was then solved in milliseconds. Accounting for all the other stuff on top, the app was now able to generate a response in under a second. Trie ftw.
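
    For readers who haven't used one: a trie keys lookups on the characters of the key itself, so query time depends on key length rather than collection size. A minimal in-memory sketch in Python (illustrative only; the commenter's actual layout lived inside PostgreSQL and isn't shown in the thread):

        class TrieNode:
            def __init__(self):
                self.children = {}   # char -> TrieNode
                self.value = None    # payload stored at a terminal node

        class Trie:
            def __init__(self):
                self.root = TrieNode()

            def insert(self, key, value):
                node = self.root
                for ch in key:
                    node = node.children.setdefault(ch, TrieNode())
                node.value = value

            def lookup(self, key):
                node = self.root
                for ch in key:
                    node = node.children.get(ch)
                    if node is None:
                        return None
                return node.value

        # Build once, query many times -- the access pattern described above.
        poi = Trie()
        poi.insert("restaurant:joes", {"lat": 40.7, "lon": -74.0})
        print(poi.lookup("restaurant:joes"))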

    • karterk 1239 days ago
      Tries are insanely good for implementing fuzzy search as well (say, based on Damerau–Levenshtein distance).

      Using this approach for a typo-tolerant instant search engine that I am working on: https://github.com/typesense/typesense
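
      A rough sketch of that combination: walk the trie while carrying one row of the edit-distance DP table per node, pruning subtrees that can no longer match (plain Levenshtein here for brevity rather than Damerau–Levenshtein, and reusing the Trie sketch above):

          def fuzzy_search(root, word, max_dist):
              # Collect (key, distance) pairs within max_dist of `word`.
              results = []
              first_row = list(range(len(word) + 1))

              def walk(node, ch, prev_row, prefix):
                  # DP row for `prefix + ch` against `word`.
                  row = [prev_row[0] + 1]
                  for i in range(1, len(word) + 1):
                      cost = 0 if word[i - 1] == ch else 1
                      row.append(min(row[i - 1] + 1,           # insertion
                                     prev_row[i] + 1,          # deletion
                                     prev_row[i - 1] + cost))  # substitution
                  if row[-1] <= max_dist and node.value is not None:
                      results.append((prefix + ch, row[-1]))
                  if min(row) <= max_dist:   # prune hopeless subtrees
                      for nxt_ch, nxt in node.children.items():
                          walk(nxt, nxt_ch, row, prefix + ch)

              for ch, child in root.children.items():
                  walk(child, ch, first_row, "")
              return results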

    • tpolzer 1239 days ago
      If you're already using Postgres, isn't that precisely what GiST indexes are meant for?
      • unnouinceput 1239 days ago
        Yes, they are. But those are a general-purpose approach. While a good one, my implementation was custom-made for the client's needs, which allowed it to achieve better performance.
  • ntonozzi 1240 days ago
    A related technique is generating an index using a machine-learned model (well explained here: https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...). In interpolation search, we just use a simple linear model. Kraska and co. use much more complex models to beat B-tree indexes. However, the Recursive Model Indexes (RMIs) that they published (https://github.com/learnedsystems/RMI) use Rust to generate a C++ model that is exported as a library, so it definitely doesn't seem ready for production yet!
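
    A one-level sketch of the idea under the simplest possible model: fit a single least-squares line from key to position, remember the worst prediction error, and search only within that bound. (Real RMIs stack several models; everything here is an illustrative simplification.)

        import bisect

        def build_linear_model(keys):
            # keys must be sorted; fit position ~ a*key + b by least squares.
            n = len(keys)
            mean_k = sum(keys) / n
            mean_p = (n - 1) / 2
            var_k = sum((k - mean_k) ** 2 for k in keys)
            a = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys)) / var_k
            b = mean_p - a * mean_k
            # Record the worst prediction error so lookups know how far to look.
            err = max(abs(p - (a * k + b)) for p, k in enumerate(keys))
            return a, b, int(err) + 1

        def learned_lookup(keys, model, target):
            a, b, err = model
            guess = int(a * target + b)
            lo = max(0, guess - err)
            hi = min(len(keys), guess + err + 1)
            # Binary search only within the model's error bound.
            i = bisect.bisect_left(keys, target, lo, hi)
            return i if i < len(keys) and keys[i] == target else None
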
  • rcthompson 1239 days ago
    The mention of log(log(N)) reminded me of something entertaining I once heard from an algorithms professor: "log(log(N)) is less than 10".
    • carlob 1239 days ago
      heh, log(log(number of atoms in the universe)) < 6
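
      For what it's worth, that holds with natural logs, taking the commonly quoted ~10^80 atoms:

          import math
          atoms = 10 ** 80                     # commonly quoted estimate
          print(math.log(math.log(atoms)))     # ~5.2 with natural logs
          print(math.log2(math.log2(atoms)))   # ~8.1 with base-2 logs
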
  • nn3 1239 days ago
    I think Microsoft uses this with B-trees:

    https://github.com/microsoft/ALEX

    Of course they call it "ML", but it's just a linear interpolation for faster tree search.

    • ntonozzi 1239 days ago
      Very cool, seems much higher quality than the RMI library. I’m hoping more of these learned data structures make their way into libraries.
  • Const-me 1239 days ago
    One good thing about binary search is its memory access pattern.

    The element in the middle of the array is tested every single time. Similarly, the elements at 25% and 75% of the array are tested very often, in 50% of searches each.

    Modern computers have very slow memory compared to computation speed, and multi-level caches to compensate. Binary search's RAM access pattern, when repeated many times, is friendly to these multi-level caches. The elements at positions like 25%, 50%, and 75% will stay in the L1D cache.
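
    A quick way to check the claim: count how often each index is probed when binary-searching for every element of an array (a toy experiment, not from the comment):

        from collections import Counter

        def probe_indices(n, target):
            # Indices touched while binary searching for position `target`.
            lo, hi, seen = 0, n - 1, []
            while lo <= hi:
                mid = (lo + hi) // 2
                seen.append(mid)
                if mid == target:
                    break
                elif mid < target:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return seen

        n = 1024
        counts = Counter(i for t in range(n) for i in probe_indices(n, t))
        # The midpoint is probed in all 1024 searches; the two quarter
        # points are each probed in roughly half of them.
        print(counts.most_common(3))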

    • jonstewart 1239 days ago
      It’s true that those places will get cached, but binary search tends to dance around the address space a bit too much with large arrays. That’s one of the advantages of interpolation search—provided the items are uniformly distributed.
  • twotwotwo 1239 days ago
    In college I played with a variation of this where you fix the size of the collection to a power of 2 (sprinkling in zeroes/repeated values if needed). Your initial guess for where to look is just the top bits of the searched-for value; you then estimate the number of slots you were off from the top bits of the _difference_ between the value you found and the one you want (clamping your next guess to the first/last item if you'd otherwise fall off the end). In the concrete example I was playing with, it worked to just stop there and do a linear search, but you could be cleverer about how long to iterate and when/how to bail out to something with a better worst case. (A sketch of this appears below.)

    As Lemire says, the problem (a good "problem"!) is that hash tables and B-trees can be quite fast, robust against different distributions, and you have them right at hand. It'd almost have to be some weird situation, like being handed data in a format that kinda fits the requirements and looking for a way to work with it in place.
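
    A rough sketch of the fixed-size variant described in the first paragraph, with 2^K slots of sorted B-bit values; the constants and names here are illustrative choices, not from the comment:

        K, B = 16, 32   # 2^K slots, B-bit values (illustrative choices)

        def guess_search(table, x):
            i = x >> (B - K)                    # initial guess: top K bits
            # Correction step: each slot covers ~2^(B-K) key values, so the
            # difference's top bits estimate how many slots we are off.
            i += (x - table[i]) >> (B - K)
            i = min(max(i, 0), len(table) - 1)  # clamp to the first/last item
            # Finish with a linear scan toward the target.
            while i > 0 and table[i] > x:
                i -= 1
            while i < len(table) - 1 and table[i] < x:
                i += 1
            return i if table[i] == x else None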

  • sillysaurusx 1240 days ago
    Neat trick. Possibly a bit brittle, since you have to maintain your knowledge of the distribution somewhere -- i.e. it's "state you have to update or things might break," which is always worrisome.

    I wonder what the worst case performance is -- if your guess is completely wrong, will it still converge in no worse than O(log(N))? Do you guess only once, at the beginning, or do you keep guessing where to go based on the distribution?

    If you only guess once, at the beginning, then there's probably no risk. But if you keep trying to aim for the wrong target every iteration, I wouldn't be surprised if you get pathological behavior.

    It would be kind of fun to try to calculate or simulate how bad it can get if you keep making incorrect guesses.

    • mehrdadn 1240 days ago
      > if your guess is completely wrong, will it still converge in no worse than O(log(N))?

      No, but there's a simple but powerful technique you can use to ensure this from the outside. See if you can figure it out. (If you don't, look up introsort.)

      • usmannk 1240 days ago
        Is the trick to quit and start doing a binary search?
        • smallnamespace 1240 days ago
          Ooh, or what about alternating interpolation and binary search steps?

          Interpolation steps make quick progress if your distribution is correct.

          • sritchie 1239 days ago
            This sounds like Brent's method for root finding! Also very similar is his method for univariate minimization. I ported this into Clojure recently from Python / FORTRAN... not quite literate programming, but well documented if you want to give it a peek. https://github.com/littleredcomputer/sicmutils/blob/master/s...
          • sillysaurusx 1239 days ago
            That's actually quite clever. I like it.

            I can't figure out the puzzle's answer. Hopefully they'll post the solution someday.

            • mehrdadn 1239 days ago
              Yup, that was the answer :-) Interleaving the two lets you get the best of both worlds at the cost of a constant-factor penalty. This is a pretty clever & general-purpose meta-algorithm that works for pretty much any problem where you have competing algorithms with different strengths.

              Quitting and starting over as GGP mentioned is another option that also works for this problem, though it might not work as well for other problems where you can't put a nice bound on the number of steps.

              FWIW there's also a very dumb (or smart I guess, depending on your point of view) solution, which is to just have 2 threads run simultaneously, and report the result of the first one. It might make sense for some problems. And in any case, this is basically equivalent to what you're doing on 1 CPU with the previous technique.
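
              A sketch of the interleaved version for sorted integer keys (my own illustration of the idea, not anyone's production code):

                  def alternating_search(a, x):
                      lo, hi, use_interp = 0, len(a) - 1, True
                      while lo <= hi and a[lo] <= x <= a[hi]:
                          if use_interp and a[hi] != a[lo]:
                              # Interpolation step: guess from a linear model.
                              mid = lo + (x - a[lo]) * (hi - lo) // (a[hi] - a[lo])
                          else:
                              # Bisection step: guaranteed halving.
                              mid = (lo + hi) // 2
                          use_interp = not use_interp   # alternate strategies
                          if a[mid] == x:
                              return mid
                          elif a[mid] < x:
                              lo = mid + 1
                          else:
                              hi = mid - 1
                      return None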

              • whatshisface 1239 days ago
                A strategy in use since the 1970s is to quit interpolation search when it starts picking values too close to the edge of the range, switching to binary search afterwards. That's a well-known technique for finding the roots of polynomials.

                > This being said, I am not aware of interpolation search being actually used productively in software today. If you have a reference to such an artefact, please share!

                This is also my response to that note in the article: interpolation search has been well known in the numerical-methods world since ancient times. ;)
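
                For concreteness, a sketch of that guard for sorted integer keys (the 1/8 margin is an arbitrary illustrative choice, not a recommendation from the comment):

                    def guarded_search(a, x):
                        lo, hi, interp_ok = 0, len(a) - 1, True
                        while lo <= hi and a[lo] <= x <= a[hi]:
                            if interp_ok and a[hi] != a[lo]:
                                mid = lo + (x - a[lo]) * (hi - lo) // (a[hi] - a[lo])
                                margin = (hi - lo) // 8
                                if mid - lo < margin or hi - mid < margin:
                                    interp_ok = False    # probes hugging the edge: give up
                                    mid = (lo + hi) // 2
                            else:
                                mid = (lo + hi) // 2
                            if a[mid] == x:
                                return mid
                            elif a[mid] < x:
                                lo = mid + 1
                            else:
                                hi = mid - 1
                        return None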

  • pfdietz 1239 days ago
    The van Emde Boas data structure lets you do searches in sets of integers in O(log log n) time, if the integers have O(log n) bits.
  • whatshisface 1239 days ago
    You don't want to pivot on the average, you want to pivot on the median. You will gain the most information with every step if 50% of the probability mass is above your guess and 50% is below.
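
    One concrete reading of this, assuming you know (or can estimate) the key distribution's CDF F: interpolate in CDF space, so each probe splits the remaining probability mass evenly rather than the raw key range. Note that F is an assumption here; plain interpolation search doesn't have it.

        def cdf_interp_search(a, x, F):
            # F maps a key to its cumulative probability; interpolating on
            # F(key) rather than the raw key aims the probe at the point
            # with half the remaining probability mass on each side.
            lo, hi = 0, len(a) - 1
            while lo <= hi and a[lo] <= x <= a[hi]:
                flo, fhi = F(a[lo]), F(a[hi])
                if fhi == flo:
                    mid = (lo + hi) // 2
                else:
                    mid = lo + int((F(x) - flo) / (fhi - flo) * (hi - lo))
                mid = min(max(mid, lo), hi)
                if a[mid] == x:
                    return mid
                elif a[mid] < x:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return None

        # e.g. for exponentially distributed keys: F = lambda v: 1 - math.exp(-v)
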
  • jqpabc123 1239 days ago
    The problem I see is the assumption of a uniform distribution.

    The problem spaces I normally deal with involve clumps of data, not really uniform except within each clump.

    • kzrdude 1239 days ago
      is there something that's better than interpolation search and binary search, for that kind of data?
      • jqpabc123 1239 days ago
        It's often possible to do "better" by tuning something toward the data set. Finding something that is always better is hard.

        One thing I have done is a double binary search.

        Store prefixes and suffixes separately. A binary search over the prefixes identifies the suffix clump, which is then searched with a second binary search. This involves a slight increase in storage --- each prefix needs a suffix pointer.

        But maybe I will now try an interpolation search on the suffixes.
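
        A hypothetical sketch of that layout for sorted string keys with a fixed prefix length (the split point and names are illustrative assumptions, not from the comment):

            import bisect

            def build(keys, plen):
                # keys must be sorted; plen is the assumed fixed prefix length.
                prefixes, starts, suffixes = [], [], []
                for i, k in enumerate(keys):
                    p = k[:plen]
                    if not prefixes or prefixes[-1] != p:
                        prefixes.append(p)
                        starts.append(i)        # the prefix's suffix pointer
                    suffixes.append(k[plen:])
                starts.append(len(keys))        # sentinel: end of last clump
                return prefixes, starts, suffixes

            def lookup(index, key, plen):
                prefixes, starts, suffixes = index
                j = bisect.bisect_left(prefixes, key[:plen])          # first search
                if j == len(prefixes) or prefixes[j] != key[:plen]:
                    return False
                lo, hi = starts[j], starts[j + 1]                     # the clump
                i = bisect.bisect_left(suffixes, key[plen:], lo, hi)  # second search
                return i < hi and suffixes[i] == key[plen:]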