I’ve seen devs just run select * from table and then filter and sort it in their own code. Then they complain “the database is slow” when it’s spending all its time shipping them gigabytes of data they don’t need!
From what I've encountered, this is generally the case when someone is in "analysis/reports" mode. Rather than getting summary statistics on each column, finding the number of nulls, etc. by writing a SQL query, they instead pull the data into a Python/R instance and use general-purpose functions, utilities, etc. The "programmers are expensive" argument probably applies here as well. I'm not trying to be defensive here, just saying that this might be one reason.
If you believe "programmers are expensive", then you should do as much as you can with a declarative data manipulation language (usually SQL; you can also consider sequences of text-manipulation tools connected by pipes) and leave that last 5-15% of high-value work to a more powerful but also more verbose imperative language (usually Python, but any will do).
Asking for what you want is considerably faster than saying how you want it done.
Dang, Ibis doesn't support Redshift or SQL Server. I'm also having trouble understanding what it really is - it's an entire framework for big data it seems and not just a translator. What I'd really like is just that, something that turns pandas dataframe operation into ANSI SQL. So input pandas2sql('tablename["col"]') -> "select col from tablename". Something really simple to use.
I imagine none of you or the GP is talking about in-database cursors, that you open in SQL, use on the same SQL script, close at the end of the script and move along. There isn't really a problem inherent to those, and they stay non-problematic if you are writing your database scripts on Python, C, or whatever.
For a server reading and writing from/on a database, you paginate your queries. On Postgres that would be using "limit" and doing small bulk inserts, but under Oracle your options for inserting are limited, so there is still value on cursors.
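That pattern can be sketched in a few lines. This is a hypothetical example using Python's built-in sqlite3 and an invented `events` table; the keyset idea ("WHERE id > last seen ... LIMIT n") carries over directly to Postgres:

```python
import sqlite3

# Toy setup: an invented table with 250 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("row-%d" % i,) for i in range(250)])

def paginate(conn, batch_size=100):
    """Yield the table in batches using keyset pagination (no open cursor
    held between batches, no OFFSET that degrades as you page deeper)."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # resume after the last key we saw

batches = list(paginate(conn))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch is an independent, resumable query, which also makes error handling and retries straightforward.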
There is the odd occasion where you'll iterate through the entire result set, do something fast for each and every row, and only get something useful at the end. Those are classic problems where cursors are more efficient than the alternatives. But even for those, once you add error handling and recovery, pagination may yield better results.
They're talking about client-side cursors, which are part of the Python DB API. What you shouldn't use are server-side cursors, which you create with DECLARE in PostgreSQL. These keep their state on the server and are intended as optimizations for special cases like streaming data processing or realtime updates. Basically, for the deep internals of realtime systems, not regular queries.
The problem isn't the cursor itself, but that it is usually a symptom of procedural thinking vs set based thinking. In an RDBMS it's typically far faster to puzzle out the joins, CTE's, and set based expressions and functions to use to winnow down a dataset vs a cursor based procedural logic on a row by row basis.
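A minimal sketch of that contrast, using sqlite3 with an invented `orders` table. Both versions produce the same result, but the set-based one is a single statement instead of one per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, discounted REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(i * 10.0,) for i in range(1, 6)])

# Procedural / cursor-style: pull every row to the client, compute there,
# and write each result back individually -- one statement per row.
for row_id, total in conn.execute("SELECT id, total FROM orders").fetchall():
    conn.execute("UPDATE orders SET discounted = ? WHERE id = ?",
                 (total * 0.9, row_id))

# Set-based: one declarative statement; the engine applies it to all rows.
conn.execute("UPDATE orders SET discounted = total * 0.9")
```

On five rows the difference is invisible; on millions, the row-by-row version pays per-statement overhead millions of times while the set-based one lets the planner do the whole thing in one pass.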
Thank you for clarifying that; I was wondering where the concern was coming from.
Without disagreeing with any of the above, one important consideration is what you're going to do with the query. If all you want to know is a column's mean or some other simplified statistical value, there's really no sense in pulling all the data into Python just to calculate it. Do it inside the DB itself with SQL.
On the other hand, if you need that data to do other work (e.g. populate a table in a webpage, or generate a new descriptive data set, or whatever), then the trade-off for pulling it into Python/pandas and running a mean in addition to the other work becomes much smaller.
My approach is usually to do as much data filtering and parsing as possible inside SQL, but things like complex parsing and string manipulation (especially!) I'll do with Python. I can do some simple string work in SQL, but I can almost always do it faster and cleaner in Python.
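To make the aggregate point concrete, here's a toy comparison with sqlite3 (invented table): both compute the same mean, but the first ships one number out of the database while the second ships every row:

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany("INSERT INTO readings VALUES (?)",
                 [(v,) for v in (10.0, 20.0, 30.0)])

# In-database: the DB does the reduction; one number crosses the wire.
db_mean = conn.execute("SELECT AVG(value) FROM readings").fetchone()[0]

# Client-side: every row crosses the wire just to be reduced to one number.
client_mean = mean(v for (v,) in conn.execute("SELECT value FROM readings"))

print(db_mean, client_mean)  # 20.0 20.0
```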
Why? (I am not taking a contrary position by asking).
But I do know that if you didn't want a 20,000 row response to your query to be HTML-ized to be transmitted and then displayed (sometimes over a 56kbps modem), you used cursors in the NSAPI.
I have no idea about whether that was a good design decision at the time, and even less idea now, but I kind of incorporated the practice, and didn't know it was "bad".
It is relevant for large web-API end-point responses (pagination) even now, no?
So! Why is that bad? :-)
EDIT: Please don't be snarky - I spend more time figuring out clock-skew on high-frequency mixed-signal boards than I do talking to a database, and I'd like to learn to be better at the latter.
To my knowledge, you can't get data from an RDBMS without a cursor - kinda required to do even a simple SELECT. I'm guessing what you're referring to, is keeping a cursor open from a Python process that should have been closed after the results were brought into memory.
I wouldn’t be so orthodox, there may be some special cases if a database is used in non standard ways to do strange stuff.
However, if cursors are used to retrieve or update data, i.e. to do actual database stuff, they always look like a capitulation, a retreat into procedural territory. If you use cursors you're not giving the database a chance to shine, and that makes it very sad.
A few years ago I joined a Rails shop, and one thing that always struck me was how many of the engineers didn't know SQL. Most of them had learned to code on Rails, and had always had SQL abstracted away via ActiveRecord.
I know this is not the point of this article, but as data analyst/scientist roles continue to climb in popularity, I'm curious if there won't be a similar trend with Python.
Isn't that one of the selling points of Rails? Time to market is king. Optimize your SQL query performance after you've released and proven that it's an actual bottleneck. Why spend more time and money on an optimized product that might never see the light of day?
There's a difference between basic optimization, and needing to refactor most of your business logic because your assumptions about databases are boneheaded.
In general, it doesn't matter what you're doing, your basic design patterns need to fit around how a database works. If you don't know this, you'll hit scalability issues far too soon, and it'll take too long to fix them.
> Optimize your SQL query performance after you've released and proven that it's an actual bottleneck.
I've seen one project fail because the design patterns around using the ORM were incorrect. Then I joined another project where the bottleneck (from incorrect use of the ORM) was so bad the product couldn't scale beyond being a demo. It took 2 months to refactor, all because one of the programmers used an ORM incorrectly to save a few hours at the beginning.
If you know SQL you can get the best of both worlds though. You still use the ORM for basic queries, but when you know you need to, you can apply optimisations like eager loading basically for free (dev-time wise). The issue is devs not understanding what's going on underneath, creating unnecessary performance problems.
Pretty easy to see how; just look at the Python equivalent in Django. You're so far abstracted away from what is actually going on that it's no wonder no one understands it. There are a lot of gotchas that most people probably never investigate, particularly multiple calls to the database for very simple join operations that aren't apparent in the ORM unless you're watching the SQL logs (and most people aren't).
When you have millions or billions of rows, SQL calls absolutely become a huge bottleneck. I deal with this all the time from shortsighted developers at the office in other ORM environments. Most recent being a SELECT...NOT IN('a','b','c'), causing an INDEX SCAN on 200 million rows, and then complaining it takes 6 minutes to run.
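One hedged sketch of that kind of fix (SQL Server-flavored syntax, with invented table and column names): a negative predicate like NOT IN can't drive an index seek on its own, so the usual remedy is pairing it with a selective positive condition that an index can seek on.

```sql
-- Before: the only predicate is the exclusion list, so the engine has to
-- scan the index across all 200 million rows to test every one.
SELECT id, status, created_at
FROM events
WHERE status NOT IN ('a', 'b', 'c');

-- After: with an index on (created_at, status), the engine can seek to the
-- recent range first and apply the exclusion to far fewer rows.
SELECT id, status, created_at
FROM events
WHERE created_at >= DATEADD(day, -7, GETDATE())
  AND status NOT IN ('a', 'b', 'c');
```

Whether this particular rewrite applies depends on the actual query; the general point is that the shape of the predicate decides whether the optimizer can seek or must scan.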
Look at Django's select_related. It's one of those if you don't understand what's happening under the hood, then you're probably querying way more than you should be.
I like having some sort of db:seed task during early development that seeds 1MM+ rows in all of your tables to help you experience performance issues as they're created.
It's even more important if your production application has millions of rows. Too easy to create a system that runs perfectly on 10 rows but will crash your production server as soon as you deploy it. Forgetting to create an index on the FKs is a classic one.
Yes of course, but let's be realistic: would the people doing this type of clueless work bother to seed a database? It's a self-fulfilling prophecy of naivety. It's like all of the investing and saving advice on Yahoo Finance. The people reading it are already the ones doing it, because they are interested in learning more about it.
select_related caught me when I first started using Django. It's a sneaky one because the queries worked fine when I first wrote the program. Then I started populating the database and, over the course of months, the queries got slower and slower and slower.
Eventually I was forced to pop open the hood, and I was horrified to find a spaghetti bowl of nested, duplicate queries that took a fair bit of work to simplify and optimize. I was not so lucky as to have a DBA I could dump my problems on, and was forced to learn that lesson the hard way.
What was it about select_related that slowed your queries? What'd you do to fix them? Did you have to abandon select_related, or just tweak its parameters?
I'm just building out a Django app now and using select_related, or rather prefetch_related, for retrieving tags (m2m relationship). Seems to work well so far, but I'm sure I'll run into a similar thing of having to optimize all these queries soon.
Similar to what @overcast said: initially I didn't use select_related at all. Almost immediately I saw huge DB utilization with hundreds of thousands of tuples returned for (what I assumed) were pretty simple queries. I realized, as @overcast said, that it was looping instead of asking for it all at once, so I added indexes and appended "select_related" to almost every query. Then I figured it was fixed.
Once my database hit 100GB and a few hundred million rows I had no choice but to sit down and actually learn what each of my ORM commands was asking my database to do. Sometimes I removed a select_related. Sometimes I replaced it with prefetch_related. Sometimes I eliminated an entire filter operation or moved it elsewhere. A few times I injected a greatly simplified raw SQL query instead of relying on complex ORM generated SQL. In four instances I replaced expensive join operations with periodically rebuilt "materialized views" to reduce CPU usage and DB I/O. All was timed with django-debug-toolbar and/or pghero to minimize database impacts and network congestion.
So select_related was sneaky in the sense that I thought I had solved the problem very early on, when I had merely delayed it until much later. If your database always remains small you'll likely never encounter this issue.
The solution is to not fire and forget the application, but to install something like django-debug-toolbar and monitor what your program is doing as the database grows in size. But for heaven's sake, don't worry about that problem today. Get your app working so you can make money. Once it's done, however, remember that your ORM has put a thick collar on your new puppy, and as it grows you'll need to expand that collar or you'll slowly strangle your pet.
select_related is the solution, not the problem. Without it, Django would make a database query on literally every single loop iteration. My point is that without paying attention, Django makes it very easy to query things sub-optimally.
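The N+1 pattern that select_related avoids is easy to show without Django at all. A sketch with sqlite3 and an invented schema, counting the queries by hand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'X', 1), (2, 'Y', 1), (3, 'Z', 2);
""")

# Without select_related: one query for the books, then one more per book
# for its author -- N+1 queries for N books.
queries = 1
books = conn.execute("SELECT id, title, author_id FROM book").fetchall()
for _, _, author_id in books:
    conn.execute("SELECT name FROM author WHERE id = ?", (author_id,)).fetchone()
    queries += 1
print(queries)  # 4 queries for 3 books

# What select_related does conceptually: a single JOIN, one round trip.
joined = conn.execute("""
    SELECT book.title, author.name
    FROM book JOIN author ON author.id = book.author_id
""").fetchall()
print(len(joined))  # 3 rows from 1 query
```

With 3 books the loop costs 4 queries; with a million, a million and one. The JOIN stays at one either way.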
django-debug-toolbar is excellent, and I use it on my django sites as well.
In addition, the documentation does a pretty good job of highlighting some of the common gotchas. Unlike your environment, my page/js weight is very low but I'm querying against a few hundred million records joined across many tables. Even using materialized views to eliminate the impact of joins in postgres, it's required a fairly delicate touch to make the delay for page loads tolerable.
In that respect I would likely redo the project in flask and sqlalchemy, if only because then I wouldn't have to remember the syntax nuances of two separate ORMs. They're similar, but not identical, and it's infuriating at times. Plus I'm very comfortable dipping down into raw SQL in sqlalchemy, and it hasn't been as intuitive for me with Django.
I've found that SQL and data frames work exceptionally well together. I'm a big fan of a Python module called PandaSQL https://pypi.org/project/pandasql. What makes this work so well (for me, at least) is that I can combine pandas operations that transform data frames with SQL operations that transform data frames in the same pipeline, and if I really need to, I can just break it all apart with Python and reassemble it back into a dataframe later. I don't need to recreate Boolean algebra with complex and potentially buggy dataframe operations, and I don't need to recreate loops, conditionals, stats operations and so forth with complex and potentially buggy SQL.
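A rough sketch of that back-and-forth using plain pandas plus the stdlib sqlite3 (pandasql wraps a similar mechanism under the hood); the frame, table, and column names here are invented:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [100, 50, 75]})

# Hand the frame to SQL...
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)

# ...do the part that reads better as a query...
totals = pd.read_sql(
    "SELECT city, SUM(sales) AS total FROM sales GROUP BY city", conn)

# ...then come back to pandas for the part SQL is clumsy at.
totals["total_k"] = totals["total"] / 1000
print(totals)
```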
For clustered work, I've found that Spark SQL data frames give a lot of the same functionality (not quite all, though I think that's because there are some pandas operations that require a full in-memory dataframe and don't lend themselves to distributed solutions).
There have been so many attempts to replace SQL with a different relational-like language. The end result is a new syntax that doesn't work if you try to pull your queries out and run them against a database independently. I'm going on a tangent in a footnote here, but I remember reading "your data will outlast your application." I personally strive for a database that's usable outside the context of the application it was originally created to support. Trust me, eventually someone will want a set of reports that would (in many cases) be far easier to write as queries if you'd made sure your back-end database was a properly designed relational database.
No, though it has been a while, and your question may prompt me to take a look around the landscape. Not that I have had a problem with pandasql, it does work nicely.
This is one of those technologies that I was hoping would make its way into the framework, kind of how connection pooling for databases was once an external module but is now often built into the various web frameworks. DataFrames with common columns are such a natural match to relational tables that it seems that a way to call SQL would be (ok, in my opinion, should be) part of pandas (I think that this is the case for R data frames).
I also just really enjoy writing SQL. To some extent, this is a personal preference - some people just mentally line up with certain ways of thinking. There are things that are clearly better to do in python, things that are clearly better to do in SQL, and then a grey area. I personally lean much farther to the SQL side of that grey area, but there's certainly nothing incorrect about going the other direction.
Python + SQL seems like a good match for many analysis problems, but Excel + SQL is not bad either. I like the ability to combine complex SQL views or SQL functions with Excel pivot functionality, querying the database directly from Excel.
Even after learning python, pandas, and SQL I still find myself firing up Excel first when exploring a new problem. The interface is incredibly simple for pasting, manipulating, and visualizing data. The warts really only start appearing when you try to build business applications on top of it and shortcomings with data integrity (amongst other issues) begin creeping in.
But as a sketchpad for handling a problem? Excel is pretty hard to beat.
I also avoid pandas for simple stuff. But rather than using Excel I use dplyr and other R tidyverse packages. I think pandas is a little bit more powerful but I find R much easier to use (easier than Excel when you get used to it).
Maybe I've not fully understood pivot tables (quite possible), but this example seems to support my understanding... aren't pivot tables basically just aggregate functions such as `sum` and `avg` applied to a window of the table data (pivoted by rows -> columns)?
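Pretty much. A pivot is GROUP BY plus aggregates, with one grouping dimension rotated into columns. A sketch via conditional aggregation (toy data, run here through sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, year INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 2020, 10), ('east', 2021, 20),
        ('west', 2020, 30), ('west', 2021, 40);
""")

# The "pivot": one output row per region, with the year dimension rotated
# into columns -- exactly what a spreadsheet pivot table does internally.
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN year = 2020 THEN amount ELSE 0 END) AS y2020,
           SUM(CASE WHEN year = 2021 THEN amount ELSE 0 END) AS y2021
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(rows)  # [('east', 10, 20), ('west', 30, 40)]
```

What the spreadsheet adds on top is the interactive part: dragging a different field into the columns is a re-run of this query with no typing.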
Some of the executives I work with like to do all their work in Excel, and they just love pivot tables.
I showed one of them the output of a table of data on web page backed by a database and they asked if I could export the data to Excel to make a pivot table that would then be displayed on the web page. Of course, I implemented their pivot table as queries against the database and created a new view to display it.
I've never understood why I would need Excel to make Pivot tables when I already have SQL.
You're absolutely right. As someone coming into Python and SQL from a purely Excel background, the first realization I had was that I had been, in essence, programming Excel. Not in a rational VBA kind of way, but in how I was linking all of my cells (and sheets and books) and then using pivot tables, etc., to extend Excel beyond a single column/row perspective. Microsoft PowerPivot (or whatever it used to be called) was my gateway drug into the world of SQL.
I can generate similar utilities to Excel/PivotTables using Plotly Dash or Pandas or even DataTables, but even after a few years of learning and practising it would take me less than 1/10th the time to generate an equivalent excel pivot and chart and have it displayed in front of a group.
I'm not saying it's better, just that those execs are so comfortable with that visual-feedback Excel approach that it'll be an uphill battle to convert them to a programmatic one. Pivot tables provide a much more rapid means of filtering, modifying output, and aggregating information than is otherwise possible inside Excel. To those coming from a two-dimensional Excel spreadsheet world, discovering pivot tables is like viewing the world in 3D.
I don't see why you would go to Excel pivot tables from SQL; you already have a more powerful tool at your disposal, if you're comfortable with it. Going from Excel to SQL? That hurdle is a bit higher.
Pivot tables are quite useful for quick ad-hoc exploration of different group-by structures with immediate feedback. They empower end users who wouldn't be able to write the queries themselves, and for whom waiting on someone else to do it would make exploration not worth the effort.
If the data allows that, then a data export (usually requiring a bunch of aggregation and preprocessing) to a format suitable for excel pivot tables or some of the many business analytics tools often is quite useful to users who want to analyze the data.
However, it should be expected that a side-effect of such ad-hoc data analysis often is specific reports that become understood to be useful, and which can then be re-implemented "properly" i.e. as a sql query/report that gets run in an automated, tested, reliable way and delivered where needed without that manual analysis step.
If you know SQL well, you have a strong toolbox for sure. But pivot tables are strong too, especially in exploratory "slicing and dicing" of data. It is slightly faster for some purposes and that makes it more convenient. I tend to use SQL for the heavy lifting and then explore further with pivot tables.
Yes, but just in a very basic way that I basically googled. I keep the analysis logic in the SQL layer (views, functions) and do the interactive analysis using pivot functionality. I don't know Power Query well yet.
Tangential question: I'm curious how many people (here on HN, or in general) learned SQL before they learned more traditional programming (e.g. Python, Java, C)? I learned traditional programming (through college) and only stumbled upon SQL years later (someone left a "How to use Microsoft Access" book around at work).
I absolutely love SQL, and am in the middle of writing a book on how to use just SQLite for large, complicated data work. My target audience is programmers, but also non-programmers, because I think SQL is vastly easier to learn than something like Python. The tradeoff of course is that SQL is much more limited a framework, but it's more than enough to do data work (all the stuff that isn't visualization or complex stats).
But I can't imagine many other professional or academic career paths in which someone who ends up learning and using both Python and SQL started out with SQL, then learned Python (i.e. general programming). What would they have done with SQL in school, or as a hobby?
When I graduated University in 2003, I took a front desk job doing business reporting and running adhoc queries. I did this work for many years in a variety of places, always working with a different type of database in each company. There's so much depth to relational data modeling and SQL. Many, many ways to design queries. I didn't get into hardcore application development until many years later.
I've forgotten some of the more advanced SQL that I've written over the years. I look back at SQL I've written as if I'm reading someone else's work.
I enjoy working with postgres but miss the temporary table workflow in sybase.
I'm a data engineer and I learned SQL before I learned python. I started out as a marketing specialist (mostly working in Google AdWords), then I graduated up to BI Analyst, and eventually to data engineer. I've picked up 100% of my python in my current job.
What did you think of Python, and was it difficult to grasp the differences between the two languages/paradigms? I ask because, for myself, any general language I pick up (Ruby,Python,JS,R), I have a good idea of underlying concepts like memory pointers and garbage collection, even if it's all abstracted by the language. With SQL, I have an incredible ignorance of the most basic programmatic concepts, like how to define variables or custom functions. The deepest underlying concept I have of SQL is that it has a query planner that does all the thinking for me.
Yeah, the learning curve was pretty steep for me. I've been learning some C# for my job, and it's been a lot easier after learning basic Python. SQL is kind of weird in the sense that starting out is very very very easy, but mastering it is incredibly hard. I've been writing SQL everyday for about 3 years and I'm still blown away with what the SQL experts at my company can do with the language. I see presentations at conferences that make me feel like an absolute beginner.
I learned SQL in high school as my first "programming" class, though that's more of a quirk of how I moved through the curriculum. This was prior to learning C++ and then transitioning to Python some years later.
It was both language and database theory, all focused on Oracle. IIRC, it was titled "Database Administration", but I remember having to draw out schema diagrams and learn all the particular arrow types (one-to-one, one-to-many, etc) and what different block shapes meant in the diagram. All that only to get into industry 5 years later and find no one goes beyond basic squares and maybe double-ended arrows in practice >.>
SQL+Python is extremely powerful, and the author makes some good points (multivariate regression should be done in Python and not SQL, for example), but the query example in the blog is not a good one. Every modern DB has aggregate and statistical functions like ntile, percent_rank, median, min, max, etc. I'd honestly rather run these functions against the database than do it in Python, especially if you're working with billion+ row datasets. In fact, if all I'm doing is these kinds of statistical/aggregate functions then I definitely wouldn't use Python at all.
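For instance, quartiles can be computed entirely in the database with a window function. A small sketch (invented table; NTILE works the same way in Postgres, SQL Server, Redshift, etc., and in SQLite 3.25+ as used here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(v,) for v in range(1, 9)])

# NTILE(4) assigns each row to a quartile; the database does all the work,
# so only (value, quartile) pairs come back, never the raw data for
# client-side re-bucketing.
rows = conn.execute("""
    SELECT value, NTILE(4) OVER (ORDER BY value) AS quartile
    FROM scores
""").fetchall()
print(rows)  # values 1-2 in quartile 1, ..., values 7-8 in quartile 4
```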
I wonder if people actually time the difference between running a function in the DB vs in Python? Unless your queries are so small (like individual records) that a round trip to the DB is dominated by network lag, I find it's almost always faster to ask the DB to do as much of the lifting as possible.
The downside, of course, is that your code potentially becomes unreadable, harder to migrate between database backends, and more difficult to debug.
SQL Server doesn't make it easy to calculate median values. There are a number of slightly convoluted ways to do it but it's far from straightforward and usually I end up calculating it in C# if I can.
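For reference, one of the less convoluted options on SQL Server 2012+ is PERCENTILE_CONT (table and column names invented). It's a window function rather than an aggregate, hence the DISTINCT to collapse the per-row repetition:

```sql
SELECT DISTINCT
    PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY amount)
        OVER () AS median_amount
FROM dbo.orders;
```

Still clumsier than a plain MEDIAN() aggregate would be, which is presumably why it so often ends up in application code instead.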
One thing that hasn't been mentioned yet, that I found when I started using Python and SQL together, is the importance of well-considered indices. ORM's aren't always too clever when they generate an index, and often (for repeated queries) you can dramatically cut down the processing time by thoughtfully generating an index.
If I know I'm going to be asking the database a certain question a bunch of times I'll even generate a temporary index to speed up my analysis, and then delete it when I'm done. No sense running a SQL query for an hour if I can cut it down to 5 minutes with an index on the target column.
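The throwaway-index workflow is just a pair of statements (Postgres syntax, invented names; CONCURRENTLY avoids locking out writers while the index builds):

```sql
-- Build a temporary helper index for the analysis session.
CREATE INDEX CONCURRENTLY tmp_orders_customer_idx ON orders (customer_id);

-- ... run the repeated queries that filter on customer_id ...

-- Clean up when the analysis is done.
DROP INDEX CONCURRENTLY tmp_orders_customer_idx;
```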
EDIT: I know this only applies to PostgreSQL, and there are numerous alternatives, but a big shout-out to pghero which helped me identify a bunch of duplicate and missing indexes in one of my databases, saving both time and hard drive space. Incidentally, it also sent me down the road to learn (but definitely not master) index optimization.
The move to columnar databases for BI and data science has thankfully eliminated a lot of this type of performance cruft. There are different issues now, but needing to add indexes just to run a new type of query is something I haven’t thought about in awhile.
So, I don't know about setting rules beforehand that work very well universally, but what I personally think is the better method is to just slowly build up your ruleset as you use it every day. I don't think lots of people know how to do this so here it is:
1) Upon using a website you frequent, enable one root at a time until it functions properly (or you can do the more granular per-grid-block enable).
2) Click uMatrix, then click at the top where it says "uMatrix $ver", which will take you to the dashboard.
3) In the dashboard, click on "My rules".
4) On the right, click the "Commit" button.
What this does is commit your changes to the permanent ruleset, and websites you frequent will start to "just work". A word of warning: if you blindly commit after browsing for a while, when you may have done some temporary allows on random websites, those will be committed too. So I suggest either reviewing the commits first (always a good idea) and deselecting the ones you don't want, or having a new session for each of your frequented websites that you then commit from.
I have been planning to write up a tutorial on things like this for family and friends, maybe I'll post it to show hn.
When I visit a site that doesn't work, I usually carefully whitelist things that I am willing to whitelist until it does work. If I can't get it working, then in most cases I don't consider the content of that site to be valuable enough to whitelist things that seem sketchy.
uMatrix is not a "set and forget" thing like uBlock is. Using uMatrix really requires that you understand the (admittedly overwhelming) UI and make decisions for yourself about what you want to allow/block. Personally, I deny nearly everything third-party and make individual exceptions on a per-site basis, which I save for sites I visit regularly.
I had a problem where there was some fixed width data in a sql database - basically someone put mainframe data in a database.
There were 3 fields that had multiple entries in them, fixed width delimited. I had to split the fields by width and re-combine them, then also recombine them with another set of data with weird delimiters and rules.
It took a day and half (not full time) to figure it out in python. I can't even imagine tackling the problem in a non-repl language.
Turns out a good database is really good at data munging. A solution for postgresql might look something like this:
CREATE VIEW my_table_improved AS
SELECT SUBSTRING(the_column, 1, 8) AS col1,
       SUBSTRING(the_column, 9, 8) AS col2,
       SUBSTRING(the_column, 17, 8) AS col3
FROM my_table;
After which you can query my_table_improved like a normal table; col1, col2, and col3 contain the data split out from the_column (note that SUBSTRING in PostgreSQL is 1-indexed). In practice you'd probably also want to do some type conversion, e.g. if col1 is supposed to be an integer you can simply update the view to select `CAST(SUBSTRING(the_column, 1, 8) AS INTEGER) as col1` instead. In production use you might find this to be slow, at which point you will want to create indexes for your new columns. Something like this (adjusted to whichever queries you're running, of course) should work:
CREATE INDEX ON my_table ((SUBSTRING(the_column, 1, 8)));
Of course this is a lot of work if you're unfamiliar with SQL, and the above examples aren't quite complete for your use case, but it should give you an idea of how SQL is exactly the right tool for the job here. Which is somewhat the point many commenters here are making: get yourself familiar with SQL and save yourself a metric tonne of work in the future.
An alternate approach might be to write a query that migrates the fixed-width format to a format where each entry is in their own column. The ease of this mostly depends on if applications depend on that column being in that format.
Side note: the above SQL code is untested but should be roughly correct.
If it was just one column, that would be easy enough. Or if one object was on one row, it would be easy enough. In my case, up to 6 rows could be required to represent one object, and I had to slice 3 columns with an arbitrary number of slices.
Oh, and there were two types of sub-record per object, and they had to be processed in database order.
Well, as long as you can build a query to get the data in the right format (which you almost inevitably can) you can make a view out of it. But honestly the true solution here would be to migrate away from such a brain damaged format.
If you are working a lot with databases, and would like to automate some of the workflow in python, I'd highly recommend Pony ORM, which really feels like LINQ for python, and feels very close to the original SQL.
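For anyone who hasn't seen it, a rough sketch of what Pony queries look like (entity and field names invented here; requires the pony package installed). The generator expression is translated into a single SQL SELECT rather than executed row by row in Python:

```python
from pony.orm import Database, Required, db_session, select

db = Database()

class Product(db.Entity):
    name = Required(str)
    price = Required(float)

db.bind(provider="sqlite", filename=":memory:")
db.generate_mapping(create_tables=True)

with db_session:
    Product(name="widget", price=9.5)
    Product(name="gadget", price=25.0)
    # Reads like Python, runs as SQL: SELECT ... WHERE price > 10
    expensive = select(p.name for p in Product if p.price > 10)[:]
```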
This article is talking about a role typically called “data analyst”. They help product managers, marketing, operations, etc by running reports, building dashboards, (business intelligence), building data models, running an exploratory analysis, analyzing A/B tests (product analytics) or things like building marketing attribution models or timeseries forecasting.
It’s the job that’s 80% of what a data scientist does but without Python or R. But mostly it depends because no one can decide what these roles should be called or what their responsibilities should be.