Faster R with FastR


162 points | by nirvdrum 247 days ago


  • WhompingWindows 247 days ago

    "Moreover, support for dplyr and data.table are on the way. "

    Well, I can't really use it in my day-to-day work, since that almost always involves cleaning and munging via one of those two packages. And it's not like ggplot2 is where my R code is most delayed; usually I'm plotting aggregate data, or a much smaller analytical dataset that needs far less speed. My hang-ups are in the initial munging phases, where the data is still very large - which often calls for data.table over dplyr due to the latter's much slower performance.

    • ekianjo 247 days ago

      Yeah, data.table already provides a significant speedup over dplyr - so much so that dplyr's "better" syntax no longer makes sense when you have to deal with very large datasets. But maybe FastR can change that somewhat?
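
      For a flavor of the trade-off, here's the same grouped aggregation in both syntaxes (a minimal sketch on a toy table):

          library(dplyr)
          library(data.table)

          dt <- data.table(g = rep(1:3, 2), x = 1:6)

          # dplyr: readable verb pipeline
          dt %>% group_by(g) %>% summarise(total = sum(x))

          # data.table: terser, and typically much faster on large data
          dt[, .(total = sum(x)), by = g]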

      • baldfat 247 days ago

        Wait, so the time you lose running your code is more than the time you'd save by working with a "better" syntax?

        I spend hours cleaning up data and only have to run the code once (I normally save the output to a feather and then work with a separate file from there).

        I still believe that the 'tidyverse' is hands down the best thing that has happened to R and is the whole reason why R has grown so fast.

        • WhompingWindows 247 days ago

          Sometimes it can take 12 or more hours to run the code on the millions of observations. There's also competition from other researchers who use the computational resources, which can mean I have to leave something running for hours because the server is being heavily queried. My workflow also doesn't allow easy interruption of the execution; sometimes a run has to finish completely, even when it's wrong, before I can fix an error or change a parameter.

          • baldfat 246 days ago

            I would then say you're using the wrong tool for your problem. I can't imagine 12-hour runs. I would imagine Spark is a better bet - or is that not an option?


            • semi-extrinsic 247 days ago

              I dunno how large your data set is, but I just set up a 16-core Threadripper workstation for work, with 32 GB RAM and a 1 TB M.2 SSD, for approx. $2500. If it can regularly save you hours or days of waiting, getting something equivalent should be a no-brainer.

        • tfehring 247 days ago

          How large are we talking? I haven't had any problems with dplyr performance as long as my data fits in main memory. (I have 16GB, so that means single-digit-GB data frames at most - I realize that doesn't qualify as "very large".) It does slow down considerably for larger data sets, but I assumed that was because it was hitting the pagefile.

          • minimaxir 247 days ago

            I have had the same experience with dplyr.

            In the event that the data doesn't fit into memory, it's better to preprocess w/ SQL at the data-store level. There hasn't been a case where I'd need to feed massive amounts of data into a ggplot2 visualization unaggregated.

          • steve_s 247 days ago

            FastR doesn't alter the semantics of R, so when dplyr copies a vector in GNU-R, FastR has to copy it too. However, FastR does use reference counting (I'm not sure if that's turned on in GNU-R 3.5.1 now), so it may avoid some unnecessary copies.

            • jsmith99 247 days ago

              You can use dplyr syntax on data.tables, usually with data.table speed, especially if you load dtplyr.
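
              dtplyr wraps a data.table so that dplyr verbs translate to data.table operations under the hood - a minimal sketch, assuming its lazy_dt() interface:

                  library(data.table); library(dtplyr); library(dplyr)

                  dt <- data.table(g = rep(1:3, 2), x = 1:6)

                  lazy_dt(dt) %>%                # dplyr verbs now compile to data.table code
                    group_by(g) %>%
                    summarise(total = sum(x)) %>%
                    as_tibble()                  # force evaluation and collect the result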

          • RosanaAnaDana 247 days ago

            Yeppo. Also, for myself, at least in the geospatial realm, I need raster, rgdal, sp, sf, and parallel. The primary allure of R (IMO) is the thousands of packages that let you quickly and easily implement whatever you want to do. Combine those with data.table and parLapply, and you're off to the races.
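
            The parLapply pattern itself is only a few lines (a sketch, with a toy workload standing in for raster work):

                library(parallel)

                cl <- makeCluster(detectCores() - 1)   # leave one core for the OS
                res <- parLapply(cl, 1:8, function(i) sum(runif(1e6)))
                stopCluster(cl)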

          • claytonjy 247 days ago

            Maybe 3-4 years ago there was a big push to speed up R by replacing the runtime; at least 3 competing replacements were talked about pretty actively. None of them achieved much mindshare. R trades runtime speed for dev speed, and we juice performance by writing the slow stuff in C++ and linking Intel's MKL. The RStudio folks are also making the low-level stuff faster and more consistent through the r-lib family of packages, which are awesome.

            Big barriers to adoption here: not a truly drop-in replacement, R people have an aversion to Java (we've all spent hours debugging rJava; luckily most of those packages have been rewritten in C++ now), and nobody likes Oracle.

            I think the best-case scenario here is that progress on FastR pushes the R-Core team to improve GNU-R.

            • truculent 247 days ago

              I never fail to be amazed at all the work RStudio et al. do to push R towards the wonderful programming language/environment it could be, rather than what it has been.

              • digitalzombie 247 days ago

                They recently added a terminal to RStudio. I'm so happy not to be switching between two apps, iTerm2 and RStudio.

                • truculent 246 days ago

                  Yep. The Python support is starting to get pretty decent as well. I much prefer R Markdown for R and Python (or both at the same time!).

                • claytonjy 247 days ago

                  I'm in the same boat, and would have gladly left R years ago if not for all their efforts

                • gameswithgo 247 days ago

                  > R trades runtime speed for dev speed

                  This claim is made about a lot of things: Ruby, Python, etc. I think the important point is that there is no trade going on. It's just that these things are all slower / less efficient than they need to be.

                  • claytonjy 247 days ago

                    Maybe that's true, but I think Julia is the first effort to prove that out in the numerical/statistical world, and while lovely, its ecosystem is far behind because of how much newer it is.

                    • gameswithgo 247 days ago

                      JavaScript showed that dynamically typed languages can be JITted well. It's just hard, and we spread our efforts over so many languages that they don't all have the resources to do it.

                      • jabl 246 days ago

                        What Julia showed is that if you carefully design the language with JIT in mind, the task is MUCH easier.

                        Julia gets very good performance without the massive manpower that has gone into JavaScript VMs.

                        • pjmlp 247 days ago

                          SELF and Dylan were there first.

                          • claytonjy 247 days ago

                            sure, but there are plenty of other reasons why JS isn't a contender in this interactive-data-analysis space

                            • gameswithgo 247 days ago

                              oh for sure, but for Python/R the barrier to speed isn't any of their important productivity features (as far as I know), just the lack of a high-quality compiler/JIT

                              If I was Lord Of Computing I wouldn't let languages out of beta until they had a high quality compiler or JIT. Turns out I am not though.

                    • WorkLifeBalance 247 days ago

                      There's also Microsoft's R Open, which I've found is faster than out-of-the-box R since it supports better multi-threading of computations.

                      • claytonjy 247 days ago

                        IIRC most of that is because they use Intel's MKL and a better BLAS; if you like Docker, the Rocker containers use the better BLAS, and I think adding MKL isn't too hard either.
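
                        A quick way to feel the BLAS difference is a big matrix multiply, which dispatches straight to whatever BLAS your R is linked against:

                            n <- 2000
                            a <- matrix(rnorm(n * n), n, n)
                            system.time(a %*% a)  # multi-threaded MKL/OpenBLAS is many times faster here than reference BLAS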

                    • ellisv 247 days ago

                      This article compares FastR to GNU-R v3.4.0 -- but there were some important changes in v3.5.0 (notably ALTREP).

                      I'm not even sure GNU-R is the most important comparison (although it is an important comparison). How does it compare to R with Intel MKL? How does it compare to other (faster) languages?

                      • steve_s 247 days ago

                        FastR also uses native BLAS and LAPACK libraries. It should be possible to link it with Intel MKL as well.

                        We didn't want to include a comparison to R 3.5.x, because FastR itself is based on the base library of 3.4.0, but the results for GNU-R 3.5.1 are almost the same as for R 3.4.0.

                        AFAIK ALTREP is not used that much yet inside GNU-R itself. They can now do efficient integer sequences (i.e., 1:1000 does not allocate 1000 integers unless necessary), which would save a little bit of memory in this example, but that's about it. FastR also plans to implement the ALTREP interface for packages. Internally, we've already been using things like compact sequences.
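
                        You can see the compact representation from the REPL (a sketch using R's internal inspect function; output trimmed):

                            x <- 1:1e6
                            .Internal(inspect(x))  # on R >= 3.5 this reports a compact integer sequence,
                                                   # i.e. no buffer of a million ints was allocated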

                    • droelf 247 days ago

                      There is also the xtensor initiative, which aims to provide a unified backend for array/statistical computations in C++ and makes it pretty easy to create bindings to all the data science languages (R, Julia, and of course Python). Usually, going to C++ provides a pretty sizeable speedup.


                      Disclaimer: I'm one of the core devs.

                      • claytonjy 247 days ago

                        This is very interesting! Have you gotten any buy-in from the wider R community - is anyone rewriting their packages atop xtensor? Do R 3.5 and ALTREP make such a transition any easier?

                        • droelf 246 days ago

                          I actually can't tell, but it has not yet been significant. It takes quite a bit of time to really get a library like this started. So far we've mostly dealt with people who are using xtensor from C++ or bind it to Python.

                          We've mainly gone through Rcpp for the R language, and that has been working great. I don't know about the changes in R 3.5 or ALTREP. Is there something we should know/change for it?

                      • lottin 247 days ago

                        I recommend watching this video - Making R run fast


                        It's a little disappointing, because the conclusion is that R will probably never "run fast", but very interesting nonetheless.

                      • truculent 247 days ago

                        At this point, the tidyverse packages probably cover >90% of my data analysis workflow, so it'd be great to see all of those compatible with FastR. I'd guess tidyr and dplyr would be the trickiest, and dplyr is already being worked on!

                        Great work, thank you for sharing.

                        • steve_s 247 days ago

                          FastR can actually run all tests of the development version of dplyr with a simple patch. We're working on removing the need for that patch altogether.

                          data.table is a different beast, and we will probably provide and maintain a patched version for FastR. They do things like casting the data of an internal R structure to a byte array and then memcpy-ing it to another R structure. This is very tricky to emulate if your data structures actually live on the Java side and you're handing out only handles to the native code.

                          • truculent 246 days ago

                            That's awesome! Personally, I don't use data.table much/at all, so (selfishly) that's not an issue for me.

                        • tofflos 247 days ago

                            // Evaluate an R function from Java via the GraalVM polyglot API.
                            Context ctx = Context.newBuilder("R").allowAllAccess(true).build();
                            Value rFunction = ctx.eval("R",
                                    "function(table) { " +
                                    // assumption: the conversion on this line was elided in the original snippet
                                    "  table <- as.data.frame(table);" +
                                    "  cat('The whole data frame printed in R:\n');" +
                                    "  print(table);" +
                                    "  cat('---------\n\n');" +
                                    "  cat('Filter out users with ID>2:\n');" +
                                    "  print(table[table$id > 2,]);" +
                                    "}");
                            User[] data = getUsers();
                            rFunction.execute(new UsersTable(data));
                          The example above, combined with "JEP 326: Raw String Literals" and an IDE that understands Java with embedded R code, would be cool to play with.

                          • ubiyubix 247 days ago

                            The thing I miss most in R is 64-bit integers. I'm aware of the bit64 package, but I would prefer native support.
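
                            For reference, the bit64 workaround looks like this (a minimal sketch):

                                library(bit64)

                                x <- as.integer64("9007199254740993")  # 2^53 + 1, already beyond what a double stores exactly
                                x + 1L                                 # arithmetic stays exact in 64-bit integers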

                            • ajay-d 247 days ago

                              This is true. Even if you manage to build 2-billion-plus-element matrices with bit64, I don't know of any modeling packages that can handle those objects.

                              • amelius 247 days ago

                                Can't you use floats with a large mantissa instead?

                                • chrisseaton 247 days ago

                                  That's going to be less than 64 bits of usable space, isn't it? I think the largest integer you can fit precisely in a 64-bit float is 53 bits.

                                  • amelius 247 days ago

                                    Yeah, but it's still better than a 32-bit integer, I suppose.
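
                                    The cutoff is easy to demonstrate - a double has a 53-bit significand, so exact integers run out at 2^53:

                                        2^53 == 2^53 + 1  # TRUE: 2^53 + 1 rounds back to 2^53
                                        2^52 == 2^52 + 1  # FALSE: below 2^53, integers are still exact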

                              • simondanisch 243 days ago

                                If anyone wants to reproduce the benchmarks, I put them into a reproducible article and added a Julia baseline:

                                • shelajev 247 days ago

                                  The last graph is a bit hard to read with the log scale. It's a 10x improvement from GNU-R to FastR+rJava, and another 10x with the native GraalVM interop.

                                  • lliamander 247 days ago

                                    I've actually tried porting some existing R applications that are currently run with RApache to Graal to try and get simpler deployment and better/more consistent operational support. Unfortunately at the time the gsub() function was broken, and that broke some of our core logic.

                                    Hm... looks like the issue may have been fixed. I'll have to try again.

                                    • steve_s 247 days ago

                                      Please open an issue on GitHub if you encounter any more problems with gsub or anything else.

                                      • lliamander 247 days ago

                                        Next time I try it, if it's still an issue then I will report it.


                                    • nerdponx 247 days ago

                                      It'd be great to have something like Numba for R, where you can write a restricted subset of R and have it JIT compiled to native code.

                                      That, or something like Cython where, instead of writing inline C++, you translate a restricted subset of R to C, which is then compiled.
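
                                      The closest thing available today is probably inlining the hot loop with Rcpp, which compiles C++ at run time rather than translating restricted R (a sketch):

                                          library(Rcpp)

                                          # compile a C++ kernel on the fly and expose it as an R function
                                          cppFunction('
                                          double sumSquares(NumericVector x) {
                                            double total = 0;
                                            for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
                                            return total;
                                          }')

                                          sumSquares(runif(1e6))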

                                      • ChrisRackauckas 247 days ago

                                        I think you could get a lot by chopping out R's non-standard evaluation. It's described pretty well here:


                                        Functions in R are not referentially transparent, so replacing an argument with its value is not necessarily the same. That is a clear restriction on optimizations. If you wanted to choose a restricted subset of R to speed up, this would be a good candidate to cut, since the standard place to compile is at the function level (Numba, Cython, and Julia all do it at functions).
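
                                        A tiny illustration of why argument expressions, not just their values, can matter:

                                            f <- function(x) substitute(x)
                                            f(1 + 2)  # the unevaluated expression `1 + 2`, not 3
                                            g <- function(x) x
                                            g(1 + 2)  # 3 - same argument, completely different result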

                                        • claytonjy 247 days ago

                                          I'm not sure this is right; the NSE stuff tends to be at the shell, the user-facing API. The workhorse functions generally are referentially transparent, and writing pure functions is both natural and recommended in R. The slow parts are deeper than the NSE, so removing NSE wouldn't open up much room to optimize.

                                          I suspect pass-by-value is a much bigger barrier to speed in R than non-standard evaluation.

                                          • ChrisRackauckas 247 days ago

                                            Oh yes, I forgot about its pass-by-value. Removing pass-by-value is a double-edged sword, though. I generally dislike it, but you have to admit that having everything pass-by-value is much simpler for a non-programmer. If you chop that out, then the "fast R subset" suddenly can act very differently. To really write efficient code you'd want to start making use of mutation in this fast part. This means throwing a macro on some array-based R code won't really be automatic: it would need a bit of a rewrite for full speed, but the rewritten version would be incompatible with pass-by-value semantics. This is quite an interesting and tough problem to solve. I think it might be better to keep things pass-by-value and try to optimize pure functions.

                                            • nerdponx 247 days ago

                                              What about copy-on-write semantics? Or is that not a big deal (since you can just "not do it")?
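
                                              For what it's worth, base R already copies on modification rather than on assignment, which tracemem() makes visible:

                                                  x <- runif(1e6)
                                                  tracemem(x)  # report whenever x's buffer is duplicated
                                                  y <- x       # no copy yet: x and y share storage
                                                  y[1] <- 0    # the first write triggers the real copy (tracemem prints here)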

                                          • thanatropism 247 days ago

                                            That R is still around while not enjoying the wide array of benefits of general-purpose programming languages is impressive. It must truly have pluses that Python users don't even dream about.

                                            E.g. can you quickly spin up a REST-like HTTP interface for your goods?

                                            • ChrisRackauckas 247 days ago

                                              RStudio is pretty amazing for interactive statistical work. Also, a lot of open-source developers tend to ignore Windows, but the less technical users are on Windows, so proper Windows support is a key win. R's CRAN has a very clean documentation system, and the setup for packages ensures that most things work on Windows (Windows CI is required). Also, its non-standard evaluation and the associated metaprogramming are deeply integrated into the language, so you can build very intuitive APIs. Most users wouldn't know how to program such APIs themselves, but that doesn't matter, since the workflow for the average R user is "package user", not "package developer". So while R does have quite a few downsides, there's a lot that other general-purpose programming languages can pull from it.

                                              • nerdponx 247 days ago

                                                > E.g. can you quickly spin up a REST-like HTTP interface for your goods?

                                                On the contrary, it started life as a Bell Labs project called S, more or less a math/stats DSL. It was reimplemented in the GNU project as R, and R became one of many competing "stats packages" you may or may not be familiar with: SAS, Stata, SPSS, etc.

                                                While it can be used for general-purpose programming, its main advantage is that it is still primarily a math, statistics, and data analysis DSL at heart. The concept of a "data frame" (which you'll be familiar with if you've used Pandas) as a data structure originated, as far as I can tell, in R. Data frames are built into the language, and the language offers custom syntax support for them.

                                                Also, the standard library is full of high-quality statistics tools. Fitted model objects have handsome, human-readable string representations. The formula DSL is elegant and convenient. Manipulating data (replacing missing values, etc.) is easy and relatively concise. Math and linear algebra are similarly easy, and R is linked to BLAS, so it's pretty fast. Plotting is built into the language and is pretty intuitive, even if the defaults aren't that pretty. The language is also fully homoiconic and wildly dynamic, allowing you to introspect and modify pretty much any chunk of code.
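
                                                For instance, fitting a model with the formula DSL and getting a readable summary is two lines of base R:

                                                    fit <- lm(mpg ~ wt + hp, data = mtcars)  # formula DSL on a built-in dataset
                                                    summary(fit)                             # readable coefficients, standard errors, R^2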

                                                And all that's just in the standard library. The package ecosystem is downright enormous. You can write R packages in C/C++ just like in Python if you need something to go fast, aided by Rcpp. There's Shiny, which is a self-contained HTTP server for data-driven web applications. ggplot2 was a minor revolution in elegant data visualization. The tidyverse package collection was similarly mold-breaking by letting users write organic "data pipelines" instead of imperative code. Caret is at least as good as Scikit-learn for general-purpose machine learning. xts takes the pain out of time series manipulation and modeling. data.table can efficiently join and subset billion-row datasets in memory using indexes. The list goes on.

                                                Long story short:

                                                    - domain-specific niceties
                                                    - batteries-included standard library that mimics features found in big monolithic stats packages
                                                    - has general-purpose programming capability
                                                    - extensible in C for speed
                                                    - built-in plotting that's not perfect but it's pretty good
                                                    - huge package ecosystem.
                                                • claytonjy 247 days ago

                                                  > Caret is at least as good as Scikit-learn for general-purpose machine learning

                                                  Oh how I wish this were true! Luckily RStudio hired the author of Caret to develop a family of smaller tidy modeling packages (tidymodels), and with recipes we're finally close to having something like sklearn's Pipelines, which IMO is one of the best parts of sklearn.

                                                  • nerdponx 247 days ago

                                                    True, the pipeline is a great feature. I haven't used tidymodels yet but it looks like the start of a great ecosystem. I do remember seeing Broom at a talk a couple years ago and thought it was a nice idea.

                                                  • thanatropism 247 days ago

                                                    That's interesting. I used to be a professional user of Stata, really day-to-day stuff; but I never saw R positioned as an alternative to Stata.

                                                    • nerdponx 247 days ago

                                                      I only used Stata in school but that's how it turned out for me. "Why learn Stata, SAS, or SPSS when I can just use R?" It made no sense to me (and still doesn't, honestly).

                                                      • jasonpbecker 247 days ago

                                                        Tons of former Stata users are now R users, especially over the last decade. Stata pretty much lives in Econ departments now.

                                                    • darkhorn 247 days ago

                                                      > E.g. can you quickly spin up a REST-like HTTP interface for your goods?

                                                      With R? Why would you want to do that with R? R is not suitable as a web server. Maybe you can write a package for that using C. There are 13,170 packages for R; in fact, 99% of R consists of packages. You don't sit down and write a web server in R.

                                                      R is used for statistical data analysis. I was using R to find the most frequently occurring error in Apache/PHP error logs, with only 2 lines of R code.
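
                                                      Something in the spirit of those 2 lines (a sketch; the real path and log format will differ):

                                                          lines <- readLines("/var/log/apache2/error.log")  # hypothetical log location
                                                          head(sort(table(lines), decreasing = TRUE), 1)    # most frequent error line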

                                                      • peatmoss 247 days ago

                                                        I dunno, I was able to cobble together a time series forecast API using the plumber and forecast packages in an afternoon, which a product team was then able to work against to create demos for customers. Yeah, they'd probably eventually want to rewrite the API to be "production ready." But on the other hand, for prototyping and getting to show something real to prospective customers? Pure dynamite.

                                                        Even then, if the stats being done in the background were hard to reimplement, I suppose plumber & R could still work with the right cloud / load balancing infra. It might end up being more expensive than it needs to be in the final iteration, but in the meantime money could be flowing in and customers gettin' happy.

                                                      • RosanaAnaDana 247 days ago

                                                        The big pluses are the huge range of libraries that make developing analyses easier, faster, and more reproducible. Python has some fine libraries, but it's leagues behind what's available in R.

                                                        • SubiculumCode 247 days ago

                                                          I use R like I use bash for neuroimaging analysis: I call a lot of powerful/specialized tools (e.g. R's lmer, or neuroimaging suites like AFNI), the outputs and inputs of which I link together into a pipeline using R/bash utilities.

                                                          • SubiculumCode 247 days ago

                                                            Admittedly, there are tools like nipype that use Python to create an interface for those different neuroimaging tools, but most of the time bash scripting works perfectly well for this.

                                                          • nirvdrum 247 days ago

                                                            The article mentions that FastR supports GraalVM's polyglot mechanism. One possible option for your task is to do your data analysis with FastR and render it with Node on Graal.js or Sinatra on TruffleRuby. At first blush this might not sound all that different from the CGI of yore, but the key thing is that all Truffle-based languages can optimize with one another. So, when your web server endpoint gets hot, Truffle's partial evaluation can inline nodes from FastR and JIT the whole thing with Graal.

                                                            You get to use the best language for the task at hand and don't have to worry about performance penalties for doing so.

                                                            • closed 247 days ago

                                                              In answer to your question -- my sense is that you can spin up super nice dashboards using Shiny, and those will be opinionated HTTP interfaces. If you want to combine the flexibility of a bona fide web framework with R Shiny dashboards, you're going to have a rough time. Shiny itself has a pretty rough HTTP implementation built in.

                                                              So I'd say the answer is yes, and you'll have a good time as long as you only need the HTTP interface to do certain things (responsive dashboards, which it does well!).

                                                              Web server implementations exist in R, but they haven't had nearly the time/attention put into them that Python's have.

                                                              • claytonjy 247 days ago

                                                                Yes, R has the now-RStudio-supported plumber package, roughly Flask for R.

                                                                There's also OpenCPU, though the pros/cons of one vs. the other have never been clear to me.
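
                                                                For the curious, a plumber endpoint is just an annotated R function (a minimal sketch; the file name is arbitrary):

                                                                    # plumber.R: the annotations turn this function into a GET endpoint
                                                                    #* @param msg The message to echo
                                                                    #* @get /echo
                                                                    function(msg = "") {
                                                                      list(msg = paste0("The message is: '", msg, "'"))
                                                                    }

                                                                    # then, in a session: library(plumber); plumb("plumber.R")$run(port = 8000)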

                                                              • Bootvis 246 days ago

                                                                Shameless plug:

                                                                Doing something like that is definitely possible; all the parts are there and work well. Shiny gives you a lot out of the box, is great for prototyping, and can be customized. I've been working on a less opinionated package that isn't ready for anything yet but gives an idea of what would be possible:


                                                            • ufo 247 days ago

                                                              Is there any information on where Graal+FastR stand right now with respect to memory usage and warm-up speed? Are these benchmarks total wall time or just post-warm-up speed?

                                                              • steve_s 247 days ago

                                                                There is a plot of warm-up curves for this specific example. Search for "To make the analysis of that benchmark complete, here is a plot with warm-up curves".

                                                                However, it is true that warm-up and memory usage are something we need to improve. We're working on providing a native image [1] of FastR. With that, both warm-up and memory usage should get close to GNU-R.