Python or Clojure for Data Analysis?

I’m starting a new role that will be heavy on data analysis. In the past I’ve focused on web dev/CRUD apps using Go, JS, PHP, other web stuff. All that to say, this will be new territory for me. Conventional wisdom seems to be that Python is the tool of choice for data science type roles, along with pandas/numpy and Jupyter notebooks. That or maybe R or Julia. On the other hand I hear about how Clojure is great for data munging type tasks. I have no experience with lisps but I’m interested in dipping my toe in the water if it will be a good fit for what I’m going to be doing. Which one in your experience will pay off more in the long run? Python with its tooling and community or Clojure’s language design?

16 points | by s_c_r 1350 days ago

8 comments

  • daslu 1350 days ago
    Best wishes for the new role!

    Eventually, it may be a good idea to try both Clojure and Python.

    Personally I find Clojure's approach towards data very refreshing. It does require an open mind and a mindset different than usual. Eventually, this can bring joy, simplicity and power.

    This article by Chris Nuernberger nicely explains what it is about: https://cljdoc.org/d/cnuernber/libpython-clj/1.2/doc/so-many...

    Clojure's community is certainly smaller than Python's, but some say it is very friendly.

    Below are some beginner-friendly places to chat about it. If you wish, let us chat there, dive into the details, and think how you could begin exploring.

    Clojurians Zulip https://clojurians.zulipchat.com and especially the data-science stream: https://clojurians.zulipchat.com/#narrow/stream/151924-data-...

    Clojureverse https://clojureverse.org

  • Jugurtha 1350 days ago
    Hi,

    Congratulations on your new role. Are you joining a team, or are you the team? If you're joining a team, then you'll probably use what they're using and learn their tooling before you could endeavor to improve it.

    You're doing it in a professional context, so it will be Python. Many blog posts and articles on popular medium websites address shiny new things, but most of these posts address one of a few scenarios: portfolio/toy projects, a project with one individual working on it, a project with data that fits on disk and in RAM, and/or a Kaggle project where a good part of the heavy lifting has been done for you (data acquisition, cleaning, feature engineering, metric identification), which never happens in real life because that's what you're hired for in the first place.

    A big problem in this field is the fragmented tooling and experience, which means you have to weave tools together, unless the team you're joining has it figured out and has internal tooling dialed in. Python dominates. I'm sure other languages are used at other ML shops (we have used Scala in some of our projects) but I think in your situation, there's no need to complicate things.

    Then again, that is just an opinion. It is not the right answer. The goal is to deliver value.

    All the best,

    • s_c_r 1350 days ago
      Thanks for the insight! I am the team--everything before was done quite manually with spreadsheets. I already work in the organization but have been promoted out of development into this new role so I'm blessed to have a large amount of flexibility. It does seem like Python is the right tool for the job. Your perspective confirms my hunch.
      • s1t5 1350 days ago
        > I am the team--everything before was done quite manually with spreadsheets.

        In that case I would pick between Python and R. R might even win out slightly over Python for your use case. Definitely not Clojure, Scala or even Julia.

  • nikonyrh 1348 days ago
    I have used both professionally in a senior data scientist role so I feel like pitching in. Perhaps due to my background coming from Matlab I never got too keen on dataframes (be it Pandas or whatever Clojure has to offer). Instead I use matrices for homogeneous data or whatever hashmap-of-list-of-sets describes more complex data. When your data is already in a CSV format and you want to do basic analysis on that or fit mathematical models I highly recommend the Python / NumPy / Pandas / SciPy combination. It can be easily extended in whichever direction you want to go, be it PySpark or Keras.

    Clojure taught me a lot about infinite lazy sequences (kinda like Python's generators) and how to model the program as a pipeline. A good analogy can be found in shell programming. There you have stand-alone programs which handle individual tasks and you can pipe the previous program's stdout into the next program's stdin. In Clojure you'd write stand-alone functions which you "pipe" together via the "->" thread-first and "->>" thread-last macros. It also ships with several handy functions such as "frequencies", "group-by" and "partition-by". I have ported these and several others to my own Python projects thanks to their versatility and a kind of universality.
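For readers who have not seen these functions, here is a minimal sketch of what Python versions might look like. It is not nikonyrh's actual code: the names `frequencies`, `group_by`, `partition_by` and `thread_last` are hypothetical stand-ins chosen to mirror the Clojure names, and the sample data is made up for illustration.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import groupby


def frequencies(items):
    """Map each distinct item to its count, like Clojure's `frequencies`."""
    return dict(Counter(items))


def group_by(key_fn, items):
    """Map key_fn(item) to the list of items sharing that key, like `group-by`."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


def partition_by(key_fn, items):
    """Lazily yield runs of consecutive items with equal key_fn values."""
    for _, run in groupby(items, key=key_fn):
        yield list(run)


def thread_last(value, *steps):
    """Feed value through each step in order, roughly like the `->>` macro."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical sample data, for illustration only.
words = ["apple", "avocado", "banana", "blueberry", "cherry", "apple"]

counts = frequencies(words)
by_letter = group_by(lambda w: w[0], words)
runs = list(partition_by(lambda w: w[0], words))

# A small pipeline: keep long words, take their lengths, sort them.
# The first two steps are generator expressions, so they stay lazy.
long_word_lengths = thread_last(
    words,
    lambda ws: (w for w in ws if len(w) > 5),
    lambda ws: (len(w) for w in ws),
    sorted,
)
```

Note that `partition_by` only groups consecutive items, so the trailing "apple" ends up in its own run, while `group_by` collects all three "a" words under one key regardless of position.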

    Oh and speaking of macros, if you want to get fancy you can design your own domain-specific language and express your problem in that, hiding all of the boilerplate under the hood. But to get the highest performance sometimes you need to think about whether to use Clojure's immutable data structures or resort to Java's mutable ones, which could have better performance (or use a library I guess). Well at least on the JVM you can do "real" parallel programming, unlike on the CPython interpreter due to the GIL.

    Clojure is fun and very educational for all kinds of projects, but in a professional data analysis setting I'd start with Python and if it seems like a bad fit then do a PoC with Clojure. :)

    What a huge topic.

  • whalesalad 1350 days ago
    If you want to learn something new and have cycles to burn on that: Clojure. It’s a great language, but learning it is going to be a slower and more scenic route.

    If you want to get things done: Python. You’ll have no problem getting up to speed based on your past experience, and the ecosystem is orders of magnitude larger than Clojure’s.

  • aynyc 1350 days ago
    Why Clojure? I honestly never heard of anyone using Clojure as data tools.

    I use Python and Scala. I use Python for mostly small tasks. When I hit large data, I normally use Spark on EMR (PySpark or Scala).

    • jb1991 1350 days ago
      > I honestly never heard of anyone using Clojure as data tools.

      Clojure is one of the most actively used data analysis languages actually. It is used by many industries for that purpose. Heck, even the widely used Metabase is written in Clojure.

      • tcbasche 1349 days ago
        I don’t believe that. What do you have to back that up? I would have thought Python or Scala - been in the data analysis game for a few years now and have never heard Clojure mentioned once except on Hackernews
    • s_c_r 1350 days ago
      Honestly? I’m interested in it and wanted an excuse to give it a try. No more compelling reason than that.
      • aynyc 1350 days ago
        I think that's the best reason for doing things. Find something interesting and give it a try.

        Using Scala as a functional language has been a fun journey for me. I still code imperatively when solving problems, but when I get a chance to refactor them into functional programming, boy, my mind goes Boom!

        I do hope I get to use Clojure at some point.

    • usgroup 1350 days ago
      I tried ... I hadn’t used a Lisp before and writing Clojure felt like some kind of mind masturbation. Everything was just sooo much more elegant. Having said that, it’s awful for data work since it has close to none of the supported libraries you might be used to in the Python ecosystem.

      Do it just to learn a Lisp ; there is something magical there.

  • dfah 1350 days ago
    Data analysis is alas a big field. I would say you should assess a few factors:

    1) How much basic-ish learning in the data science / statistics / ML areas you expect to be undergoing yourself; the Python ecosystem will probably make this much faster.

    2) How you expect to scale & productionize your analysis tasks (if at all); in my experience Python is a second-class citizen in the Spark world, not far above Clojure's third-class status, and throughput gains from Clojure's native JVM output may outweigh the relative convenience of the PySpark interface. TensorFlow & TFX's interfaces are basically designed from the ground up for Python.

    3) Which major techniques & corresponding libraries you expect to use (e.g. MCMC/Stan, Pyro, TensorFlow, scipy, scikit-learn). Some of these might rule out one language (more likely eliminating Clojure) or the other.

    4) How important data visualization will be for you. This aspect of the work will be much easier & richer in Python than in Clojure.

    5) What kind of data transformation & validation you expect to do. If this is largely statistical in nature (e.g. rescaling distributions) it's probably a wash. If this is viz-heavy it'd favor Python. If this involves complicated structured data, I'd recommend Clojure.
  • auganov 1349 days ago
    As for Jupyter notebooks, REPL-driven development in Clojure gives you the same ease (arguably better) of messing around with code while also scaling to serious software dev. Though it's not as nice for sharing with others.

    If you're working in an environment where there's a lot of collaboration Clojure might be tough. But if you're actually going to be developing software that relies on data analysis (rather than just doing it as a one off) I think Clojure might be worth considering.

  • aprdm 1350 days ago
    I would 100% go with Python. It has much more tooling and a much bigger community, which makes it easier to ask questions and to collaborate with people.

    Once you're comfortable with it, then it might be worth exploring other languages that are less well known but have a (subjectively) better software design.