Another instance of the Wikipedia page for a product  being more useful than the main site to describe it:
* MemSQL is a distributed, in-memory, SQL database management system.
* It is a relational database management system (RDBMS).
* It compiles Structured Query Language (SQL) into machine code, via termed code generation.
* On April 23, 2013, MemSQL launched its first generally available version of the database to the public.
* MemSQL is wire-compatible with MySQL.
* MemSQL can store database tables either as rowstores or columnstores (The OLAP vs OLTP part I guess).
* A MemSQL database is a distributed database implemented with aggregators and leaf nodes.
* MemSQL durability is slightly different for its in-memory rowstore and an on-disk columnstore.
* A MemSQL cluster can be configured in "High Availability" mode.
* MemSQL gives users the ability to install Apache Spark as part of the MemSQL cluster, and use Spark as an ETL tool.
The main value proposition seems to be the distributed nature, which probably makes it easier to setup out of the box than, say, trying to setup a cluster MySQL or PostgreSQL databases which are not "natively distributed". Also, probably most useful when the data is "big enough" vs resources available on any single server or when reliability is very important.
I've been using MemSQL in development since March 2017. I've seen it evolve into one of the fastest databases for columnar storage and analytical workloads. Having in-memory rowstore as well removed the need to have Aerospike in my infrastructure and simplified the whole stack.
This announcement, that it's now free to 128GB pretty much saved me from having to do a kickstarter to raise funds for my little SaaS project.
If anyone from MemSQL is reading, really thanks for doing this. I think it's a very shrewd move. Should definitely see an upswing of potential customers adding it to their stack and when they grow, become enterprise customers.
I am the Director of Product Management for MemSQL.
The 128 GB limit applies to the whole cluster. So if you have two nodes in the cluster they would each have to be 64GB or less. If you have four nodes they would all have to be 32 GB or less. To have a highly available system we recommend 4 nodes (a master aggregator, a child aggregator and two leaf nodes). You can read more about the cluster architecture here: https://docs.memsql.com/concepts/v6.7/distributed-architectu...
A typical setup for HA would be 2 aggregator nodes and 2 leaf nodes. You can allocate the memory however you want. For example, you could give 32GB to all 4. Or give 16GB to each aggregator and 48GB to each leaf. (I'm a MemSQL Product Manager by the way.)
I tried to come up with a smart and polite comment but I can't. The target audience for memsql aren't developers or engineers. It's the management that has no idea about IT. I don't like closed-source solutions. I don't want to book a demo. I want to be able to read the source. I want to install it, use it, benchmark it, be sure that the results are 100% accurate. Sadly, I see this post as marketing ploy and I can't find any nice words for this product.
You can install it, use it, benchmark it, and check everything yourself.
It's not open-source, but there is plenty of closed-source proprietary software, and plenty of buyers who care about solving their problems and paying money to get that done (and ensure the vendor stays alive).
If you don't want to use a closed-source product then that's your prerogative, but I don't see how you're making a dev/engineering decision by ignoring a product because of that.
because the source is important, it tells the buyer they have a future if they no longer like your services; closed source means they are locked into whatever fresh hell might come upon your company which in turn unleashes fresh hell upon their decision to buy in.. in 2018+ it's just a smart decision to make
The vast majority of companies can barely build their own products, let alone study the source code of complex 3rd-party software. A distributed SQL database is on the extreme end of knowledge required to even understand it, so I'm not sure how open-source is going to help you.
As a counter-example, rethinkdb is open-source but the company failed and nobody cares about using it anymore. What would you do with that? Start building new database features yourself? Or just get your data out and move to a different system?
You will be able to at least maintain it, fix bugs, security issues. Maybe even start working on new features, promote your fork, revive some of the community, find people with relevant expertise, etc.
Databases are so lock-iny and critical that it's only natural for closed source database startups to be considered too risky to touch.
Possible doesn't mean realistic. As stated, 99.99% of companies are not going to come close to understanding, forking, and running their own build of a database.
It's better to practice proper vendor management and weigh all the risks and realities instead. If you're not more capitalized and viable then your vendor, then you have more important things to worry about then your vendor disappearing overnight.
/Widely used/ open source projects will find continued support; the user base of a project is a crucial part of the decision. Rethinkdb had only one serious site, who found it easier to port to a thin layer on top of Postgres when rethink collapsed.
I recently was able to unfuck my way out of a 300GB data loss resulting from a failed DB upgrade. By looking into the commit history of PostgreSQL, finding the commit with the PG_CATALOG_VERSION I needed, and compiling from that revision, I was able to re-run the upgrade with the parameters I needed. I'm not sure what I would have done if that had been MS SQL Server or something else.
That hasn't been my experience, at least not on any suitable time scale.
I strongly suspect that the vast majority of those of us who have worked somewhere "not more capitalized and viable" than the vendor share that experience.
Even when a vendor's support engineer is fully capable of solving the problem, the sense of urgency can't reasonably be expected to match that of a much smaller customer facing potentially catastrophic data loss (or other existential-threat-level consequences).
> Vendors have support plans and SLAs so if you need 24/7 support then make sure that is indeed what you're paying for.
Those are totally useless during an existential crisis without associated indemnity (which any vendor would be crazy to provide) against loss due to failur to perform.
> I do not see how having spare engineering talent capable of reading, editing and running a custom database build is the more realistic or faster option for any business in case of issues.
I don't see how it isn't, considering that "custom database build" could be so simple as to be trivial. In the GP's case, it was merely using a specific version.
Even the characterization of the required engineering talent as "spare" seems incongruous, as, in small companies, the talent requird to handle unexpected problems with technlogies fundamenta to running the business is essential, not superfluous.
The proof is in the pudding. Plenty/most serious users of open source databases become customers of the relevant companies. The commercial license is usually a small fraction of the potential cost of an outage.
While plenty of closed source software systems are still being widely used, I think the world is dramatically moving towards open source. The world has changed and I think these expectations of being able to leverage an open source software is here to stay. This despite the fact that a vast majority may never actually read, fork or modify the source code.
Disclaimer: I am a product manager at MemSQL, so I may be biased :)
There are a variety of ways to try out MemSQL yourself such as installing on Linux, Windows, Mac, AWS, etc, and maybe I am biased since I was an engineer before I became a PM, but we optimize our product currently exactly for technical people such as IT, devops, and of course, engineers. For a list of installation guides, check this link out: https://docs.memsql.com/guides/latest/install-memsql/
Take a look at our docs (docs.memsql.com) and you will see that we all actually are just a bunch of engineers and people with a technical background. Are there certain technical topics you feel are unclear here? I'm also happy to chat privately.
If you still feel this product isn't right for you, that is fine -- MemSQL's focus on query speed may not be for everyone. However, with this release of having a free product for people to try out, we definitely optimized exactly for people that want to try the product out :). I'm actually surprised you mention our product isn't for engineers/technicalPeople, because from our field of view, we actually sometimes see MemSQL as too technical, hence why we focused on usability in this release, ha!
Hope that answers some doubts you may have -- thanks for the comment.
Guys, I believe you are engineers that are doing an amazing job. But the whole "free up to 128GB" promotion is just.. wtf. And the website revolves around attracting IT managers, not engineers that make educated decisions related to project(s). I make decisions based on calculations, not based on shiny websites. I'm not undermining your product. It appears to be amazing. I'd love if it were free and open source. Heck, we'd probably spend a ton of money on it for paid support, scaling planning, consulting, deployment and what not. I just hate your business model, that's all. Let me have your program running without constraints! And because of it, I can't see a reason to use it. I choose to explore other venues that went down the open source route. I'm quite okay with not running the fastest option. No hard feelings, I sincerely wish you make a huge dent in this area and make a ton of money!
if you stored data on disk compressed in MemSQL's columnstore, you would use that 128 GB of RAM for query execution. on-disk data storage would not be limited. if that's not a productive system, then I must have imagined the whole data warehouse and data mart market
What is the benefit of using MemSQL over some other free in memory databases like Apache Ignite ? I see that they have better documentation and support (edit: + competition on Codeforces which winners rarely receive their T-Shirts). What about other things?
Those are some nice features. Unfortunately, your DBMS doesn't do some basic things like return consistent results for a simple SQL query with a group by and having clause. I admit this might be a configuration issue on my company's end, but if so, that is a terrible configuration option and should be hidden away, opt in only, with a huge wall of warnings so people don't actually enable it except in extreme circumstances.
I've seen you mention these inconsistent results twice here in this this thread, but have worked at MemSQL for 5 years and never heard of such an issue. Have you reached out to see if maybe your query / data is not what you expect? I've seen inconsistent results only once, and it was because the default date formats across RDMBSs were different (and was not anticipated).
How would the query/data not be what I expect if If I'm writing the query myself, and looking directly at the sql table definition to create it?
Beyond those considerations, why would the same exact same query (executed several times in rapid succession from the console) produce vastly different results? Also, I should clarify, rewriting the query from "select ... from xyz group by ... having ..." to "select ... from (select * from xyz where ...) group by ..." made the inconsistency goes away, without changing the filtering clause. That does not inspire confidence.
To close the loop on this one. We looked at the query and strictly speaking we should be rejecting it b/c HAVING clause is referencing a column that's NOT in group by and NOT an aggregate expression. The query shape is:
select count(*), a from T group by a having b > 0
In this case b is not allowed to be part of having by ANSI standard.
We let it run b/c some customers migrate from MySQL and MySQL allows this query. You can set MemSQL to be strict about it by setting this variable:
Thanks for taking a look at it. Your position is perfectly reasonable, but given the fact that (at least in my case) the results I got back were subtly wrong, and there's a good chance someone wouldn't notice, it might be a good idea to default this off if it isn't already, with a really stern warning in the config.
If you’re hitting the disk, aren’t you losing some of the advantages of using an in-memory database in the first place? Or would it still be more performant than a traditional RDBMS due to optimized in memory data structures?
Apache Ignite is an in-memory data grid that supports persistence and overflow-to-disk. It primarily started as a cache and now has a full key/value store with SQL-92 on top, but isn't a full relational database. Instead it has other features like distributed data structures, messaging, and is more about connecting your applications together. Easier deployment model with all nodes being identical.
MemSQL is a distributed full-featured relational database that has in-memory rowstore and on-disk columnstore tables with rich support for SQL, fulltext search and JSON. It's a fast RDBMS and does really well with analytical queries.
Do you need a fast cache, key/value, messaging system? Or a RDBMS with fast OLTP + OLAP capabilities?
"Basically like" is very different from "engineered for"
Column-oriented storage itself is many times faster for analytical queries, even if on disk, and combined with the other optimizations of MemSQL will get you far better performance. Along with all the data being able to constantly undergo transactional updates.
I don't understand the high frequency use case they describe. High frequency trading is something very different from "12,000 transactions a minute". What is exactly the use case? Pre-deal checks? Post-deal checks? Book replay? Or is it just a simulation? It's not very clear.
No real HFT system is doing a * database transaction * in the critical trading path. HFT systems are not built like web applications. They are typically built as a tight event loop, reading market data packets directly from the network card, doing a tiny bit of computation and then writing to a userspace TCP stack for order entry.
I guess you could be using MemSQL for post-order or trade analysis but then it would probably overkill since a lot of that can be done considerably slower.
I work for a dark pool ATS that is hit by HFT firms, and we routinely see flows greater than 12k transactions per minute. Ive been benchmarking a variety of compilers, db libs, drivers and platforms. So far, best perfomance ive gotten, single threaded, is I can write a single order to a man store table in MSQL in about 500 microsecs (that was from a .net core app running directly on the same server as MS SQL, ive been able to get comparable performance from a C++ app running on Linux with kernel bypass network IO). Mind, ive not tried to optimize the DB at all, this is purely comparing DB APIs. Worst Ive seen, all other things being equal is about 800 micros.
Can't share the code or schema, but can give a rough approximation of the setup.
We're experimenting with MSSQL's memory optimized tabled and native compiled stored procedures. My timings today, I was getting one call to our stored proc in the 300-400us range, that was inserting one record each into 2 tables.
Test setup for all of my scenarios are do all of the same DB ops, I'm alternating which libs I'm using. Best performance so far ive been able to get from linux talking to MSSQL has been using OTL on top of unixodbc with MS Driver 17. Mind these are physical servers sitting a few feet from a shared router.
One thing I would like to see is the commitment to keep it free till X years at least. OK, MemSQL has "decided" to give it for free up to 128GB of RAM usage now. But they can "decide" 2 years later that they want to charge for any MemSQL usage. Then what options the "small" businesses, who have adopted MemSQL, have?
When it comes to commercial products, it is generally good to check if you can afford their paid offering, then only make the tool core part of your infrastructure. For databases, it's better to stick with fully open source popular options from a long-term perspective.
Distributed. Scalable. OLTP + OLAP queries. High availability. Very fast performance for reads and writes.
They are entirely different systems. Sqlite is meant for self-contained applications that need some relational data persistence with a single file for storage, not for accessing as a central database with many clients storing TBs and scaling across nodes.
There are actually some patches in the works (currently used by LXD from memory) that allow sqlite to be distributed. I'm quite hazy on the details, but apparently this is part of what enables LXD's clustering.
What happens if I set hard limit on MemSQL process to use 128GB RAM and it decides it needs more? Will it die demanding more memory or will it behave like traditional rdbms and manage buffers to work within limits?
The best thing to do in this situation would be to set the maximum memory on the MemSQL server via the `maximum_memory` system variable. The server will then internally manage it's memory usage against this upper bound and fail only specific operations which can't be performed without additional memory.
I mean that an operation that needs more memory than is available will fail, but the server itself will remain operational. An simple example would be loading more data into an in-memory table (row store) than there is available memory. If you're using the columnstore then your storage won't be limited to memory.
The bit I can't see on the FAQ is about how it is as fast as RAM but protects against data loss using disk.
I think most engines guarantee Durability by assuming that once on disk, it won't go anywhere but if it's in RAM, it is susceptible to power outage? If it gets written to disk, it's not as fast as RAM?
MemSQL has two storage modes Rowstore and Columnstore. The Rowstore is "in-memory" and the columnstore is "on-disk" but those are oversimplifications. The rowstore data is stored in memory but we keep a snapshot of the data on disk. We also keep the transaction log (a record of all changes since the snapshot was taken) also on disk. So queries can be satisfied fully from memory (because that is where the current data lives) but writes go to memory and to the transaction log on disk. If the machine reboots then the snapshot is loaded from disk back into memory and the transaction log is replayed. When that is complete you are back to where you were when the machine rebooted with no loss of committed data. Columnstore data is always stored on disk although we use a row store in front of the column store that is hidden from the user but acts as a buffer of sorts so that writes in the column store can be pretty fast. More details on how the columnstore works can be found here: https://docs.memsql.com/concepts/v6.7/columnstore/#how-the-m...
The concept in databases that you’re looking for is called group commit. Transactions running concurrently batch writes together before flushing to disk. This means the minimum latency for committing a transaction is the speed of fsync, but the throughput can be as high as the disk bandwidth.
There are other reasons why an in memory database can run faster than a disk based database, such as not having a buffer pool manager, but I don’t think that’s what you were worried about.
> You can do almost anything with MemSQL, using the free tier, that you can do if you have an Enterprise license, including capabilities and production use. The differences are that you can only configure the free tier of MemSQL to use up to 128GB of RAM usage, and support is only community support; for paid MemSQL support, you need an Enterprise license.
The schemas are the same. Tables behave the same way logically, and its just performance and physical semantics that are different.
We would have recent data in rowstore and move older data into columnstore. You can easily join/union between both for queries. Also some constantly changing data (like budget counters) would always remain in rowstore with many lookups and updates per-second.
MemSQL is an "HTAP" (hybrid transactional-analytical processing) database. They have an in-memory row store table engine for the transactional queries and a disk-backed column store table engine for the analytical queries.
Personally, I would not use memsql as a first line RDBMS. It lacks too many useful features, and it doesn't return consistent results for all queries (I've found inconsistencies when using group by with having, for instance).
Been using MemSQL for a enterprise-level financial services client since mid-2017. We have this in production and are running it in a multi-TB cluster. We've not seen ANY of these issues here and are heavy, crazy query users. Their support has been nothing but on-the-spot and very helpful.
I totally admit that the inconsistency might be a result of misconfiguration by maintainers at my company. That being said, being able to shoot yourself in the foot so subtly and badly via configuration seems like a pretty strong anti-feature.
Since you aren't sharing any actual details and are switching between blaming the product to blaming your devs and config (where no product can magically keep you from breaking settings), your comments come across as rather disingenuous.
Why not share a clear example of exactly what happened, or post on their forum with details, so we can all judge for ourselves?