PandaPy has the speed of NumPy and the usability of Pandas

(github.com)

159 points | by firedup 1524 days ago

15 comments

  • shoyer 1524 days ago
    It's a lovely idea to build pandas-like functionality on top of NumPy's structured dtypes, but these benchmarks comparing PandaPy to Pandas are extremely misleading. The largest input dataset has 1258 rows and 9 columns, so basically all these tests show is that PandaPy has less Python overhead.

    For a more representative comparison, let's make everything 1000x larger, e.g., closing = np.concatenate(1000 * [closing])

    Here's how a few representative benchmarks change:

    - describe: PandaPy was 5x faster, now 5x slower

    - add: PandaPy was 2-3x faster than pandas, now ~15x slower

    - concat: PandaPy was 25-70x faster, now 1-2x slower

    - drop/rename: PandaPy is now ~1000x faster (NumPy can clearly do these operations without any data copies)

    I couldn't test merge because it needs a sorted dataset, but hopefully you get the idea -- these benchmarks are meaningless, unless for some reason you only care about manipulating small datasets very quickly.

    At large scale, pandas has two major advantages over NumPy/PandaPy:

    - Pandas (often) uses a columnar data format, which makes it much faster to manipulate large datasets.

    - Pandas has hash tables which it can rely upon for fast look-ups instead of sorting.
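
    For anyone who wants to reproduce this, a minimal sketch of the scaled-up setup (the structured-array columns below are made up for illustration, not the repo's actual schema):

      import numpy as np
      import pandas as pd

      # stand-ins for the benchmark inputs: a structured array and the
      # equivalent DataFrame (hypothetical columns, ~1258 rows like the original)
      closing = np.array([(1.0, 2.0)] * 1258,
                         dtype=[("open", "f8"), ("close", "f8")])
      closing_df = pd.DataFrame(closing)

      # make both inputs 1000x larger before timing anything
      closing_big = np.concatenate(1000 * [closing])
      closing_df_big = pd.concat(1000 * [closing_df], ignore_index=True)

      # then time comparable operations, e.g. in IPython:
      #   %timeit closing_big["open"] + closing_big["close"]        # structured array
      #   %timeit closing_df_big["open"] + closing_df_big["close"]  # pandas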

    • meowface 1524 days ago
      This is why you can never accept benchmarks provided solely by the software creators. Same for accepting studies about a company's product when the company's commissioned and funded the studies.

      It'd be cool if there were neutral third-parties, kind of like Jepsen, that any project could defer rigorous benchmarking to, perhaps in exchange for a flat fee (everyone pays the same fee, no matter how big or small they are).

      • mhh__ 1524 days ago
        I am writing a small benchmarking site (aimed at asymptotic performance rather than singular tasks) which should hopefully have a fairly generic API/specification, so I'll look into the more "competitive" side of it (Project vs Project, as opposed to "Uh oh, commit f4r0adfja has made the build slower and [test 3] slower for large n").

        The logic itself is easy but I'm having to step into the dark arts of DevOps and web-cancer - should be online by the summer.

      • munmaek 1524 days ago
        And then they learn how to game the benchmarks.

        You just can’t win.

        • skrebbel 1524 days ago
          No, because the trick is that you're paying a knowledgeable person to run the benchmark. That person would presumably actively iterate on the benchmarks and try to detect / avoid cheating.
          • kmbriedis 1524 days ago
            People would probably find out what hardware they use for benchmarks and optimize for that, leading to a performance decrease for many others
    • missosoup 1524 days ago
      Not only that, but the benchmarks ignore the cost of translating back and forth between the pandas and PandaPy formats. By the time you've done a comprehensive benchmark, you'll realize that pandas is actually pretty well optimised and the only opportunity for speedup is writing specialised functions for a narrow set of use cases. This library is a small box of such specialised functions. It will never be able to compete with pandas in general.

      cuDF is a more plausible candidate for replacing pandas in performance-critical scenarios, and even cuDF explicitly aims to supplement pandas rather than replace it.

    • cerved 1524 days ago
      To be fair, it's clearly stated in the readme:

      "The performance claims only hold for small datasets, 1,000-100,000 numpy rows. Pandas perform better with larger data sets, the only functions that improve with a 1000x increase in size is rename, column drop, fillna mean, correlation matrix, value reads, and np calculations even out (np.log, np.exp as well as etc)"

      • shoyer 1524 days ago
        Most of those disclaimers were added tonight, after seeing my comment :) https://github.com/firmai/pandapy/commit/692c968771bb19d4f12...
      • lmeyerov 1524 days ago
        This is a real problem, akin to how modern JITs will have diff run modes for the same code. We're overdue for pandas-without-huge-overhead.

        See this thread for more on why pandas overhead adds up for real settings, and is crazy huge once you go to dask/ray/modin, and even worse, say spark: https://twitter.com/lmeyerov/status/1218093436286296069 + https://twitter.com/lmeyerov/status/1220847229537157121 . For interactive analytics apps, we want scale for bigger workloads, yet also less overhead to achieve < 100ms: ideally moving a slider will trigger lots of calcs over many widgets and their subcomponents, and the underlying stats code looks normal (e.g., df > sql > some imperative/functional lang.) Internally we write C-like buffer manips & hand-written GPU kernels & weird streaming code to achieve < 100ms wall-clock, and are trying to get tech like these up to snuff for interactive data software.

        As per the JIT comment, feels like a matter of time, so happy to see this (and the pressure it creates!)

  • smabie 1524 days ago
    Pandas is usable? I had no idea..

    Pandas is really badly designed, in the same way that most Python libraries are: each function has so many parameters. And a parameter can often be a bunch of different types. Pandas is useful, especially for time-series data, but no one particularly loves it. And, it’s embarrassingly slow. Maybe PandaPy is better, but I doubt it. When you start trying to use Python implemented functions (vs C ones) things are going to get bad no matter what you do.

    Speaking of which, I decided to port over a statistical model for betting from Python to Julia a week ago. I’m not done yet, and this is my first major experience with Julia, but it’s been so much nicer than using Python. The performance can easily be 10x-50x faster without really doing any extra work.

    Also the language feels explicitly designed for scientific computing and really meshes well with the domain. Python the language never really was good for this, but the libraries were pretty compelling. Julia libraries have almost caught up with (or, in some domains like linear algebra, actually exceeded) what’s available for Python. Moreover, if you need to, PyCall is really easy to use.

    I’m going to go out on a limb and say that people shouldn’t be using Python for new scientific computing projects. Julia has arrived, and is better in every way (I’m still unsure about the 1-based indexing, but I’m sure I’ll get over it. 0-based was never that great in the first place).

    • unishark 1524 days ago
      I also found the "speed of numpy" part of the title amusing.

      Reminded me of a quote that went something like "C has the low-level power of an assembly language combined with the high-level usability of an assembly language."

    • playing_colours 1524 days ago
      Ha, I see Julia Evangelism Task Force is grouping up in another thread! I am waiting for the arrival of Commander socialdemocrat soon :)

      I am a big fan of Julia as well, and it is quite amusing to see the recent growing presence of enthusiasts roaming between topics.

    • beowulfey 1524 days ago
      HN hasn’t really taken to Julia yet, but I am also a big fan. I’m not even a programmer really, just had some python experience, and I was able to hack together a program that extracted molecular contact data from protein structures and draw contact maps for them all. It runs in a few minutes for 10000 structures. I did zero optimization on it! The language is incredibly fast and I really think it’s easy to use.
      • ChrisRackauckas 1523 days ago
        I think it's partly just the audience. HN seems to have a lot of professional programmers, which is similar to but not quite the same as computational scientists. That doesn't mean there's no overlap, but for example I haven't met a person (other than one Google person) at a recent physics-informed learning or scientific machine learning conference who is an HN reader, which really explained why none of the stories in those domains have been getting HN traction even though HN seems very ML/AI-positive. Julia is in this weird middle ground where its ML/AI stories seem to get a lot of HN traction as a programming languages topic given the HN audience, but in reality it has a lot more people doing things like research in numerical linear algebra, numerical differential equations, statistics, bioinformatics, etc., and fast methods for block-banded matrices won't hit the science press.
    • julosflb 1523 days ago
      Me. I particularly love pandas. Sure, its API is quite complex and I need to refer to the documentation more often than I would like. But the docs are pretty solid, and the set of features is unmatched in the Python ecosystem. Nothing really beats pandas when you need to quickly do some medium-sized (i.e. in-RAM) data exploration in Python.
    • woah 1524 days ago
      How’s the package management story on Julia? Python package management is a fractal of badness
      • ddragon 1523 days ago
        Usually pretty good. The package management is completely centered around Pkg.jl, which is integrated into the REPL, and you can also import it into your program for more advanced scripting. If you don't create an environment, everything is added to the global user library; if you do create one, it will automatically manage your project's dependency files, and each environment/package can have its own independent versions of each library (so you don't really have dependency-hell issues, but you might end up using more disk space due to multiple versions of the same library, although it will respect semver when keeping multiple versions).

        Pkg.jl is based on git/GitHub, with a central registry that I believe is automatically updated with new packages by bots. The current version also has native support for automatically deploying binaries and other things like datasets, which can be optionally loaded on demand.

        Most of the trouble I hear about is with stricter enterprise firewall scenarios, and perhaps Julia's JIT making it compile the libraries every time (though that's not an issue with the package manager).

      • smabie 1523 days ago
        I’m not sure, I just do

        using Pkg; Pkg.add("foo")

        And it works. I’m sure there’s some dependency file for projects but I don’t know how it works. Thinking about build tools and package management makes me sad, so I try and avoid it as long as possible.

  • fjp 1524 days ago
    Some Python devs seem to pull in Pandas whenever any math is required.

    IMO Pandas documentation somehow manages to document every parameter of every method and yet it’s almost as helpful as no documentation at all. Combined with the fact that it’s a huge package, I avoid it unless I really really need it.

    A version with human-understandable docs could convince me otherwise

    • powowowow 1524 days ago
      I've found Pandas extremely easy to learn and to use; to the point where I find it confusing to see somebody say that it's not human-understandable.

      If you're reading this thread and wondering if it's easy or hard to use, I suggest taking a look at the docs (https://pandas.pydata.org/pandas-docs/stable/index.html) and making your own decision.

      I find the combination of basic intros, user guides, and the API reference to be extremely usable and understandable; and I am reasonably sure I am human. But opinions may vary.

      • jfim 1524 days ago
        Pandas has enough gotchas that it looks friendly until you hit one of them. Examples of gotchas:

        Want to join two dataframes together like you'd join two database tables? df.join(other=df2, on='some_column') does the wrong thing, silently, what you really wanted was df.merge(right=df2, on='some_column')

        Got a list of integers that you want to put into a dataframe? pd.DataFrame({'foo': [1,2,3]}) will do what you want. What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values. Enjoy debugging your joins (sorry, merges) with large integer values.

        Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

        • qwhelan 1524 days ago
          >Want to join two dataframes together like you'd join two database tables? df.join(other=df2, on='some_column') does the wrong thing, silently, what you really wanted was df.merge(right=df2, on='some_column')

          Simply a matter of default type of join - join defaults to left while merge defaults to inner. They use the exact same internal join logic.

          >What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values.

          This was a long standing issue but is no longer true.

          >Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

          Those are 1D types where that's simple to reason about. It's not as straightforward in higher dimensions (what's the truth value of a (0, N) array?), which is why .empty exists
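
          For example (a minimal sketch):

            import pandas as pd

            df = pd.DataFrame({"a": [1, 2, 3]})

            print(df.empty)   # False -- the explicit emptiness check
            print(len(df))    # 3 -- row count is also unambiguous

            try:
                bool(df)      # truthiness is ambiguous in 2D, so pandas refuses
            except ValueError as e:
                print(e)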

          • jfim 1524 days ago
            > Simply a matter of default type of join - join defaults to left while merge defaults to inner.

            No, join does an index merge. For example, if you try to join with string keys, it'll throw an error (because strings and numeric indexes aren't compatible).

              left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
              right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something_else": [4,3,1,2]})
              left.join(other=right, on="abcd")
              
              ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
            
            If you try to join with numeric keys:

              left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [10,20,30,40]})
              right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [40,30,10,20]})
              
              left.join(other=right, on="something", rsuffix="_r")
              
                abcd  something abcd_r  something_r
              0    a         10    NaN          NaN
              1    b         20    NaN          NaN
              2    c         30    NaN          NaN
              3    d         40    NaN          NaN
            
            Or even worse if your numeric values are within the range for indexes, which kind of looks right if you're not paying attention:

              left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
              right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [4,3,1,2]})
              left.join(other=right, on="something", rsuffix="_r")
              
                abcd  something abcd_r  something_r
              0    a          1      c          3.0
              1    b          2      a          1.0
              2    c          3      b          2.0
              3    d          4    NaN          NaN
            
            Whereas merge does what one would expect:

              left.merge(right=right, on="something", suffixes=['', '_r'])
              
                abcd  something abcd_r
              0    a         10      a
              1    b         20      b
              2    c         30      c
              3    d         40      d
            
            >> What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values.

            > This was a long standing issue but is no longer true.

            Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't mention such a change), so that would likely be still the case in the latest stable release.

              pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807]})
              
                          foo
              0  1.000000e+00
              1  2.000000e+00
              2  3.000000e+00
              3  4.000000e+00
              4           NaN
              5  9.223372e+18
            
            It's also a lossy conversion if the integer values are large enough:

              df = pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]})
              
                          foo
              0  1.000000e+00
              1  2.000000e+00
              2  3.000000e+00
              3  4.000000e+00
              4           NaN
              5  9.223372e+18
              6  9.223372e+18
              
              df["foo"].unique()
              
              array([1.00000000e+00, 2.00000000e+00, 3.00000000e+00, 4.00000000e+00, nan, 9.22337204e+18])
            
            >> Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

            > Those are 1D types where that's simple to reason about. It's not as straightforward in higher dimensions (what's the truth value of a (0, N) array?), which is why .empty exists

            It's not very pythonic, though. A definition of "all dimensions greater than 0" would've been much less surprising.

            • qwhelan 1524 days ago
              > Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't mention such a change), so that would likely be still the case in the latest stable release.

              It was released in 0.24.0: https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...

              For example:

                  pd.DataFrame({"foo": [1,2,3,4,None]}, dtype=pd.Int64Dtype())
              
                      foo
                  0     1
                  1     2
                  2     3
                  3     4
                  4  <NA>
              
                  pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]}, dtype=pd.Int64Dtype())
              
                                     foo
                  0                    1
                  1                    2
                  2                    3
                  3                    4
                  4                 <NA>
                  5  9223372036854775807
                  6  9223372036854775806
              • jfim 1524 days ago
                Sure, if you specify the type. It's still a gotcha because the default behavior is to upcast to floating point unless the type is defined for every integer column of every data frame, which isn't very pythonic.

                The example with the (incorrect) join above shows how even other operations can cause this type conversion.

                • qwhelan 1524 days ago
                  Yes, there's a lot of existing code written assuming the old behavior. But most code has only a few ingestion points, so it's pretty simple to turn on.
        • squaresmile 1524 days ago
          If anyone else also wonders what pandas.DataFrame.join does [1]:

          > on : str, list of str, or array-like, optional

          > Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

          > DataFrame.join always uses other’s index but we can use any column in df.

          Interesting choices, but luckily I have never been bitten by this. Maybe because I first looked up on Google how to do a join in pandas. Nowadays, I mostly use pd.merge though.

          [1] https://pandas.pydata.org/pandas-docs/stable/reference/api/p...

        • maxnoe 1524 days ago
          Missing value treatment is one of the major improvements in the upcoming pandas 1.0, including integer, boolean, and string column types with proper missing value support.
        • mcrad 1524 days ago
          So you prefer the word 'join', and 'merge' is tripping you up? It's not SQL!
      • teej 1524 days ago
        You can’t pick up a dictionary and determine how difficult it is to speak a language.

        One of the most difficult things about pandas is knowing how to shape your data frame before you plug it in to something. Most of my frustration comes from finding the magic incantation of symbols that will reshape my data frame the right way.
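
        For what it's worth, the kind of incantation I mean (a sketch with made-up data):

          import pandas as pd

          wide = pd.DataFrame({"id": [1, 2], "jan": [10, 20], "feb": [30, 40]})

          # wide -> long ("melt"): the shape most plotting/stats tools want
          long = wide.melt(id_vars="id", var_name="month", value_name="sales")

          # and back again ("pivot")
          wide_again = long.pivot(index="id", columns="month", values="sales")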

      • anakaine 1524 days ago
        And I also. The other poster, in my opinion, has it backwards. Seeing how each option can be used in a myriad of examples is particularly helpful and saves me much time struggling to work out what the right method/ tool is, and how to use it.
      • edgyquant 1524 days ago
        I learned Python by using pandas so for awhile I was one of the devs you speak of. I can't imagine I'm alone.
      • Seufman 1524 days ago
        Yeah, I agree: it's super straightforward. It's hard for me to understand how people could find it inscrutable / confusing. Pandas is a LIFESAVER for anyone who needs to manipulate datasets programmatically.
        • mattrp 1524 days ago
          Big fan of pandas as well... but I have to admit there are some things at the beginning that were like wtf... some of which are now deprecated, thankfully. But more on topic to this post: I really can’t see why someone would want to solve speed on small datasets by incorporating numpy into a new form of pandas... both projects are so established, so why would you attach your project to some other dev who now has to keep pace with numpy and pandas improvements, when you could just import pandas, import numpy and be done with it?
    • missosoup 1524 days ago
      Pandas has some of the best documentation that I've seen. Maybe the issue is that the documentation makes a lot less sense without a stats and data background, but expecting pandas documentation to be a stats course is silly.

      The other thing is that both numpy and pandas allow for some pretty complex operations by chaining simple commands. The space of what's possible is effectively unbounded, and documentation won't help there. Cookbooks cover some of the more common patterns.

      • acomjean 1524 days ago
        I feel like the documentation is great if you already know how it works. Having learned Pandas recently, I'll confess there are parts that are a bit tricky to get past the basics. I know R dataframes which was helpful.

        Pandas is well worth learning, however, if you're using Python.

      • cerved 1524 days ago
        I imagine OP is referring to things like conditional selection on multiple columns, done not with the logical and/&& operators but with the bitwise & operator.

        I've mostly used comprehensions to filter and manipulate data, and I find the pandas API to be a bit clunky and esoteric.

        There are nice built in functions and it displays well in a jupyter notebook but I'm not a fan of the interface.
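
        For example, the two styles side by side (a minimal sketch, columns made up):

          import pandas as pd

          df = pd.DataFrame({"price": [10, 20, 30], "volume": [100, 5, 50]})

          # pandas: element-wise bitwise &, each condition in parentheses
          selected = df[(df["price"] > 15) & (df["volume"] > 10)]

          # the comprehension style I'd use on plain Python data
          rows = df.to_dict("records")
          selected_rows = [r for r in rows if r["price"] > 15 and r["volume"] > 10]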

        • missosoup 1524 days ago
          I mean yes, it's a bit esoteric, but so is every other data manipulation package/language. Look at numpy, dplyr, q, spark, or even sql for similar use cases. There's no easymode way to express arbitrary data operations, especially not when you want them to be performant.

          If you don't need the advanced data manipulation capabilities of tools like pandas/dplyr (and if you're using comprehensions only, you don't), there are much easier to use options like apache nifi. Horses for courses.

    • Grimm1 1524 days ago
      Pandas' type conversions irk me. I get why columns with NaN convert to float from integer but very rarely do I have data that is complete for every column and converting columns that were intentionally integer has caused headaches when that data then goes to other systems such as a sql db.

      It is currently sitting at the center of an ETL system at my work (not my decision) and causes headaches.

      • squaresmile 1524 days ago
        You can use "Int64" for nullable integer data type.

        https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...
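
        For example (a minimal sketch, assuming a reasonably recent pandas):

          import pandas as pd

          # the nullable integer dtype keeps missing values without forcing float
          s = pd.Series([1, 2, None], dtype="Int64")
          print(s)         # 1, 2, <NA> -- still an integer column
          print(s.dtype)   # Int64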

        • Grimm1 1524 days ago
          Awesome super useful thank you.
      • abakker 1524 days ago
        I mean, whenever I have this problem, I fall back on my (terrible) SPSS habits and just recode NaN to -999. With integer datasets that mostly works fine. You could come up with some alternative solution, too.
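
        Something like this (a minimal sketch):

          import pandas as pd

          df = pd.DataFrame({"count": [1.0, 2.0, None]})  # float because of the NaN

          # recode missing values to a sentinel so the column can go back to integer
          df["count"] = df["count"].fillna(-999).astype(int)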

        Personally, I'm an amateur at best, but Pandas has made it possible for me to do some really handy things.

        Personal favorite feature: multi-indexing on rows AND columns. Multiindex is such a common pattern, and so poorly handled in things like Excel. Pandas really saves a lot of time with re-indexing, pivoting, stacking, or transposing data with multi-indexes.

      • qwhelan 1524 days ago
        You should upgrade your version of pandas if possible - that's been fixed for a few versions now.
        • Grimm1 1524 days ago
          Oh interesting I'll have to check, we did an upgrade pass a while ago maybe we just didn't upgrade pandas for whatever reason.
          • qwhelan 1524 days ago
            As mentioned elsewhere in this thread, it's opt-in to avoid breaking existing behavior. But given that ingestion points are easy to identify, it's pretty straightforward to turn on (especially if you have a schema for your inputs): https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...
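
            For example, at a read_csv ingestion point (a sketch, assuming a recent pandas):

              import io
              import pandas as pd

              csv = io.StringIO("user_id,score\n1,10\n2,\n3,7\n")

              # declare nullable integer dtypes up front; the missing score stays <NA>
              df = pd.read_csv(csv, dtype={"user_id": "Int64", "score": "Int64"})
              print(df.dtypes)
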
            • longemen3000 1524 days ago
              I saw an implementation (a CSV parser in Julia) where the sentinel value was randomly assigned at read time (if a value in the input was equal to the sentinel value, the sentinel was changed randomly). After parsing, the sentinel value would be converted to the appropriate data type (Julia's Missing).
            • Grimm1 1524 days ago
              That makes sense and thanks for the info and the link. It will be very useful going forward.
    • mhh__ 1524 days ago
      I end up using it because it's there for the most part. I use python for hacking stuff, and not because it's fast but because the libraries exist.

      I am almost repulsed by the language, and now that you've said it, I agree wrt the documentation too.

  • gewa 1524 days ago
    I worked with Pandas and numpy on different projects, and I really like the low-level, component-like way numpy works. In most cases where I used Pandas I regretted it at some point. OOP and numpy in the first place would’ve been a better solution, especially because of the ease of Numba integration.
  • sriku 1524 days ago
    Nice to see .. but I think Julia is pretty much targeted at not having to do this kind of jugglery.

    (Don't get me wrong. I actually appreciate the work, but I also use Julia.)

  • anakaine 1524 days ago
    The one reference I didn't see was to chunking. I'm currently using Dask because of its graceful chunking of large and medium data, but PandaPy doesn't make reference to this capability.
  • enriquto 1524 days ago
    My whole work consists of manipulating arrays of numbers, mostly in Python, and I never found any use for pandas. Whenever I receive some code that uses pandas, it is easy to remove this dependency without much ado (it was not really necessary for anything).

    Can anybody point me to a reasonable use case of pandas? I mean, besides printing a matrix with lines of alternating colors.

    • throwaway287391 1524 days ago
      I feel the same way. I do ML research and every time I'm provided a Pandas dataframe, the first thing I do before I try to work with it (after getting over my vague feeling of annoyance) is convert it to a format I understand -- e.g. a numpy array, a dict of numpy arrays, etc. I find the pandas API very unintuitive from an idiomatic Python perspective.* And it's not that I've never used SQL (I've written some small websites/apps that use it), I just haven't encountered a situation in my ML work where I've needed joins etc. to the point where I'd care to learn a whole new API for them.

      I gather it's very useful and extremely well loved for certain types of work though. But "the usability of Pandas" definitely doesn't ring true for me, let alone the implication that numpy is relatively unusable...if you've written lots of idiomatic Python working with its native data structures, NumPy will come very naturally, whereas Pandas is like a whole new language.

      * I have to call `at` instead of using Python indexing to get a row. wat. If I try to iterate over it like "for x in df", x is a column name rather than a row. wat.
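
      The behavior in question, as a minimal sketch:

        import pandas as pd

        df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

        row = df.loc[0]            # a row, via .loc rather than plain indexing
        cell = df.at[0, "a"]       # a single cell

        for col in df:             # iterates over column names: 'a', 'b'
            print(col)

        for idx, row in df.iterrows():   # rows need an explicit method
            print(idx, row["a"])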

    • siddboots 1524 days ago
      My whole work consists of manipulating arrays of numbers, mostly in Python. I use pandas by default, and treat numpy as a specialised tool to be brought out when the use case necessitates it. Some big things for me are a) timeseries methods that just work, e.g. df.resample, b) multi-indices with stack/unstack etc., and c) the best I/O of any Python data library (e.g. read_csv, read_parquet, read_hdf, read_sql, etc.)
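
      A few of those, as a minimal sketch (made-up data):

        import numpy as np
        import pandas as pd

        idx = pd.date_range("2020-01-01", periods=90, freq="D")
        df = pd.DataFrame({"price": np.random.rand(90)}, index=idx)

        # a) timeseries methods that just work
        monthly = df.resample("M").mean()

        # b) multi-index reshaping with stack/unstack
        s = (df.assign(month=df.index.month, day=df.index.day)
               .set_index(["month", "day"])["price"])
        wide = s.unstack("day")   # months as rows, days as columns

        # c) I/O one-liners: pd.read_csv, pd.read_parquet, pd.read_sql, ...
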
    • smabie 1524 days ago
      The multi-index functionality is very nice for financial time-series data. I can easily resample from days to months, filter rows, etc. It’s good for columns that are of different types. I don’t really like using it though, Julia’s DataFrames are easier to use, more performant, and more integrated into the language.
    • Nimitz14 1524 days ago
      I think it's useful for when your feature columns have names. Otherwise yeah not really.
    • cerved 1524 days ago
      Lots of built in statistical stuff and powerful visualization makes exploring datasets easy
      • enriquto 1524 days ago
        > Lots of built in statistical stuff and powerful visualization makes exploring datasets easy

        I see. For linear algebra stuff it does not offer anything essential. You rarely see a matrix as a "dataset".

    • wodenokoto 1524 days ago
      Interesting. How do you do group by's and joins?
      • enriquto 1524 days ago
        I don't know exactly what you mean, but numpy's hstack and vstack seem to be enough for me.
        • wodenokoto 1523 days ago
          SQL-like joins.

          E.g., joining the sum of sales by store ID from the sales table with the table of store names on the store ID. That is a major reason why people use pandas.
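
          In pandas that's roughly (a minimal sketch with made-up tables):

            import pandas as pd

            sales = pd.DataFrame({"store_id": [1, 1, 2], "amount": [10, 20, 5]})
            stores = pd.DataFrame({"store_id": [1, 2], "name": ["North", "South"]})

            # sum of sales per store, then an SQL-style join onto the store names
            totals = sales.groupby("store_id", as_index=False)["amount"].sum()
            result = totals.merge(stores, on="store_id")
            #    store_id  amount   name
            # 0         1      30  North
            # 1         2       5  South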

  • beefield 1524 days ago
    Slightly off-topic: I have been occasionally trying to learn to use pandas, but having worked quite a lot with SQL, there is one thing that I can't get over. Is there a way to force pandas to have the same data type for each element in a column? (In particular, pandas seems to think that NaN is a valid replacement for None, and after that you really can't trust anything to run on a column because the data types may change.)

    Or, more likely, I have missed some idiomatic way to work with pandas.

    • TheGallopedHigh 1524 days ago
      Off the top of my head, there is an astype method to set a column to a type.

      You can also choose how to fill None values, namely what value you want instead. See the fillna function.
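
      Roughly (a sketch, assuming a reasonably recent pandas):

        import pandas as pd

        df = pd.DataFrame({"foo": [1, 2, None]})   # becomes float64 because of the None

        # either pick a fill value and cast back to plain int...
        df["foo"] = df["foo"].fillna(0).astype(int)

        # ...or keep the missing value with the nullable integer dtype
        df["bar"] = pd.array([1, 2, None], dtype="Int64")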

  • gww 1524 days ago
    There's a cool Python library called anndata (https://icb-anndata.readthedocs-hosted.com/en/stable/anndata...). It's designed for single-cell RNA-seq experiments, where datasets have multiple 2D matrices of data along with row/column annotation data. Its use of NumPy structured arrays is interesting.
  • ben509 1524 days ago
    If you've mucked with numpy dtypes, they're shockingly powerful, but this seems like a much nicer way to do it. Great idea!
  • kristianp 1523 days ago
    Has anyone here compared Turi Create with pandas and numpy recently? It was open-sourced by apple: https://github.com/apple/turicreate

    Seems like it's good for creating ml models and deploying them to apple devices.

  • hsaliak 1524 days ago
    nice to see more libraries in python that embrace optional static typing
  • m1cl 1524 days ago
    I agree, benchmarks not good.
  • throwlaplace 1524 days ago
    isn't pandas already built on top of numpy? so what does this mean and owing to what is it faster?
    • skyyler 1524 days ago
      Did you read the README.md? The author discusses the motivations of the project there.