Sonic: Fast, lightweight and schemaless search back end in Rust

(github.com)

238 points | by louis-paul 1860 days ago

8 comments

manigandham 1860 days ago
Always nice to see more alternatives to Elasticsearch. That project could be so much better with some proper planning and focus.
There's also Toshi: https://github.com/toshi-search/Toshi which is built on top of Tantivy: https://github.com/tantivy-search/tantivy
And for C++, there's Xapiland: https://github.com/Kronuz/Xapiand
And for Go, there's Blast: https://github.com/mosuka/blast built on Bleve: https://github.com/blevesearch/bleve
mrec 1860 days ago
Huh. When this was initially posted it had a weird and commercially restrictive license, but it looks like that's been reverted, possibly after (polite) discussion on /r/rust. It's MPL 2 now.
https://github.com/valeriansaliou/sonic/issues/52#issuecomme...
javitury 1860 days ago
Performance figures are awesome. Also language support is great.
> Sonic only keeps the N most recently pushed results for a given word
This index discards old entries. This is fine for messages, in which aging items lose relevance. Yet the developer uses it for a help desk, which I think should give equal importance to all items.
In this area I would say the main comperitor is Groonga. It can be integrated into PostgreSQL with the PGroonga extension, and it indexes all of the data. However it consumes way more ram.
mellab 1860 days ago
Wow. I’ve spent the last three weeks building a custom search solution in Kotlin - in my case I’m using tokenizers from Lucene and using a radix trie as an index. I actually looked at using Bleve (another rust search lib) initially but it didn’t have the right language support
Glancing over this it looks like a nearly perfect fit for my use case I just wish I had seen this a couple of weeks earlier!
[-]
- networkimprov 1860 days ago
  Bleve, from Couchbase, is in Go :-)
  https://github.com/blevesearch/bleve
- O_H_E 1860 days ago
  And that's why I believe that the search/discovery problem is not solved yet by google
  [-]
  - StavrosK 1860 days ago
    Far from it, search sucks nowadays. Just think about how hard it is to find stuff from Twitter, reddit etc, even though that's where most of the content is.
    And if you don't know what you want exactly, you can't find anything.
  - manigandham 1859 days ago
    Google is a public service to search the internet. This product and thread is more about adding search functionality to other applications and private data.
FridgeSeal 1860 days ago
Talk about perfect timing!
I was looking for something just like this for a project in my team. We had been using this setup where a huge chunk of the data was being stored in triplicate: some of it in ES, some more of it in another database and finally the whole dataset in our data warehouse.
Hopefully I can use this to only provide the index + full text capability and just use the warehouse itself as the main db because the query performance is similar enough and the warehouse is criminally underused for what we pay for it.
[-]
- winrid 1860 days ago
  What's preventing you from using ES for everything? Slow writes?
  [-]
  - FridgeSeal 1860 days ago
    Write speed is fine, it’s more the fact that the dataset is reasonably large, and to run an instance with enough capacity and nodes (even with spill to disk), is silly expensive.
    [-]
    - mverwijs 1860 days ago
      Being in a simular situation what do you consider "reasonably large"?
      [-]
      - FridgeSeal 1860 days ago
        250-300GB.
        Not large by absolute standards sure, but large enough to cause issues.
        I’m sure there’s some kind of solution that involves re-architecting the ES cluster and indices and re-architecting the data flows and stuff. But if our options are go through all that, or seriously slim down our architecture and costs by just running Sonic + our data warehouse, I’m definitely going to give it a go. After all, worst comes to worst we can go down the re-architecting ES route if Sonic doesn’t work out.
        ¯\_(ツ)_/¯
        [-]
        Xylakant 1860 days ago
        I’d be curious what your expectations and constraints are, but from my experience of running clusters in the double digit TB-Size my ballpark figure for that amount of data would be 2 medium size data nodes and a small tiebreaker. Alternatively, if you can live with the reduced resilience and availability, even a single node might just do. Depends on the expectated churn though, ES really does not like document updates.
        [-]
        cinbun8 1859 days ago
        That does not sound like a good idea. You can't even maintain a quorum of 2 replicas with n=3 on a cluster like that. Losing one data node would be disastrous.
        [-]
        Xylakant 1859 days ago
        That’s really not how ES replica works. The quorum is formed on the master eligible nodes (hence a tie-breaker) and is only required to elect a master. The elected master designates a primary shard and as many replica as you configure. However, replica shards are replica and may lag. There’s no read quorum or reconciliation or anything happening. If a primary fails, an (in-sync, depending on the version of ES) replica is auto-promoted. The master keeps track of in-sync replica and you can request that writes are on a number of replica before a write returns, but still, no true quorum.
        You can absolutely run 2 data/master-eligible nodes plus a single master-eligible tie-breaker node as a safe configuration. The only constraint is that you should have an uneven number of master-eligibile node to avoid a split brain. You also need to understand what the resilience guarantees are for any given number of replica (roughly: each replica allows for the loss of a single random node) and how many replica you can allocate on a given cluster (at most one per data node). That would allow you to run a 2-datanode cluster in a configuration that survives the loss of one node.
        [-]
        cinbun8 1859 days ago
        I was not saying that's how they work. Most prod clusters are configured for high availability and multiple replicas. They have at least 3 nodes and 2 replicas configured for the index. Sure you can run this configuration, but do you really want all your traffic to hit this one instance when the other one goes down?
        [-]
        Xylakant 1858 days ago
        > Most prod clusters are configured for high availability and multiple replicas
        I've been doing ES consulting and running clusters since 0.14 and I see very few clusters that run more than a single replica. Most 3 node clusters I see are run with three nodes because you can then have three identical configurations at the cost of throwing more hardware at the problem.
        > but do you really want all your traffic to hit this one instance when the other one goes down
        Whether that's a problem really really depends on whether your cluster is write-heavy or read heavy. Basically all ELK clusters are write heavy and in that case, loosing one of two nodes would also cut writes in half (due to the write amplification that replica cause). Other clusters just have replica for resilience and can survive the read load with half of the nodes available. Whether that is the case for your cluster(s) depends - that's why I explicitly asked what constraints the OP had.
        fnordsensei 1859 days ago
        From what I've learned, running any cluster on fewer than four nodes is not really recommended.
        [-]
        Xylakant 1859 days ago
        I’ve run quite a few cluster on such a configuration or alternatively 3 data/master-eligible nodes. It’s a safe configuration unless you manage to overload the elected master. But if you’re fighting that issue, you’ll have to go beyond 4 nodes and have a triplet of dedicated master-eligible nodes plus whatever data nodes you need.
        I pretty much specifically avoid 4 node clusters. You’d have to either designate 3 of the four nodes as master-eligibile with a quorum of 2 or have all of them master-eligible with a quorum of 3. Both options allow for failure of a single node before the cluster becomes unavailable. Any other configuration would either fail immediately on a node loss (quorum 4) or be unsafe (quorum of 2, allows for split-brain)
        I’d much rather opt for 4 data/master eligible nodes plus a dedicated master eligible node with a quorum of 3.
        You also need to pick the number of replica suitably: each replica allows for the loss of a single random(!) data node while retaining all your data. Note that if losses are not random but you want to safeguard against loss of a rack or an availability zone or such, configurations are possible that distribute primary and replica suitably (“keep a full copy on either side”)
        FridgeSeal 1859 days ago
        Is that double digit TB on ElasticSearch?
        [-]
        Xylakant 1859 days ago
        Yes, certainly.
- dannycastonguay 1860 days ago
  Curious to know if you considered https://www.algolia.com/ as well?
  [-]
  - mellab 1860 days ago
    Algolia is a beautiful product but it’s expensive. At just 1 mil items you’re already paying 500 a month
marmaduke 1860 days ago
This looks like a breath of fresh air. Elasticsearch won’t even start if it can’t preallocate 2 GB
[-]
- Xylakant 1860 days ago
  That’s just false. The default config might set the JVM HEAP to 2GB (though I’m fairly certain it’s 1GB) but ES will start up with half a GB of heap with no issue.
  [-]
  - marmaduke 1859 days ago
    It was the default behavior on my install, documented recommendation is half the system memory and the logs don’t provide useful info when the heap allocation fails.
    Perhaps the upstream default is 1 GB but this was not the default, and there is much confusion on setting these correctly
    https://stackoverflow.com/a/40333263
    Not everyone has time to dig into it; of course if you did that’s good for you
    [-]
    - Xylakant 1859 days ago
      What your system install does and whether the default is good for what you intend to do is a bit different from what the system allows. If you want to use a piece of software, you’re expected to somewhat understand what it does. At least reading the “setup” chapter of the documentation which explains which ENV var to set should be recommended. (I specifically link to the 1.7 docs to show that it’s been in there for quite some time) https://www.elastic.co/guide/en/elasticsearch/reference/1.7/...
      The recommendation is btw: at most half of the systems memory, no more than 31g (due to how the jvm compresses heap pointers), and keep at least as much memory available for disk buffers/caches as you allocate to the heap.
sidcool 1860 days ago
I would like to understand the code for the project. What approach would help?
[-]
- d33 1859 days ago
  Personally, I would advise finding an itch to scratch - something you'd like to see improved. Then try to understand the code from this perspective - where would you put the functionality, which pieces would it be connected to? Try to follow the lead, make notes as you read the code. You'll eventually get a feel of the infrastructure, going from the deal to the big picture - this is my default way of navigating projects.
  If at any moment you feel lost, look through related issues and merge requests, as well as Git history. Perhaps you'll see how things get changed in the project, patterns intrinsic to it.
  Also, keep in mind that you can always try to contact the community/author or invite someone to try figuring out your goal with you - once you collaborate, you'll get a solution tailored to the way you think. It does engage other people, but it also makes coding social and (at least to me) more satisfying.
  Let me know what you think of this approach!
StavrosK 1860 days ago
This looks pretty cool! I'd like to try it, but it looks like I'll have to wait for the Python client library first.
[-]
- MosheZada 1858 days ago
  I just wrote a basic one https://github.com/moshe/asonic
  [-]
  - StavrosK 1858 days ago
    That looks great, thank you! I'll try it out and send you any feedback I have. Also, you might want to link to Sonic in the README, for people unfamiliar with it.
    [-]
    - MosheZada 1858 days ago
      Thanks, feedback is appreciated. I've added a link back to Sonic
- marmaduke 1860 days ago
  The protocol description makes it look fairly trivial to script,
  https://github.com/valeriansaliou/sonic/blob/master/PROTOCOL...
- deadwisdom 1860 days ago
  Wanna build it with me?
  [-]
  - arcticbull 1860 days ago
    The Rust Python bindings are actually pretty good, I hacked together a project to let you deploy Rust micro services in Lambda via Python module bindings a while back.
  - StavrosK 1860 days ago
    Normally I would, but I don't have an immediate use for this, which kills my motivation
- 21stio 1860 days ago
  a grpc endpoint would be nice