9 comments

  • encoderer 1042 days ago
    This is a wonderful article. I recently discovered an Aurora gem that saved me from some flaky mysqldump runs: you can save the results directly to S3 with a “select into outfile s3” query (rough sketch below). This is, according to the docs, an optimized operation when run from a read replica.

    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
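
    A rough sketch of what that query can look like, driven from Python with pymysql – the endpoint, credentials, table, bucket URI, and the IAM role the cluster needs for writing to S3 are all placeholders/assumptions, so check the docs for the exact syntax:

        import pymysql  # plain MySQL driver; nothing Aurora-specific needed

        # Placeholder reader endpoint and credentials.
        conn = pymysql.connect(
            host="my-cluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com",
            user="etl",
            password="...",
            database="prod",
        )

        # Aurora MySQL extension: the database writes the result set directly
        # to S3, so the rows never pass through the client. Bucket URI below
        # follows the docs' s3-region://bucket/prefix form and is made up.
        sql = """
            SELECT *
            FROM orders
            INTO OUTFILE S3 's3-us-east-1://my-etl-bucket/exports/orders'
            FIELDS TERMINATED BY ','
            LINES TERMINATED BY '\\n'
        """
        with conn.cursor() as cur:
            cur.execute(sql)
        conn.close()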

  • roopawl 1043 days ago
    Every once in a while there is a well-written blog post about database internals. Uber's Postgres-MySQL switch saga produced a few of them. This one is pretty good too.
    • jeandenis 1043 days ago
      We worked closely with AWS on this (problem and blog) and they were great and quite transparent. Glad it's interesting/useful to you.
  • slownews45 1043 days ago
    The simplest is probably read committed, especially if, like many ETL jobs, you are just going to grab stuff in one read for further processing. Another option: do a read committed and omit the last 15 minutes of data if you are running long jobs, to avoid churn at the end of tables / logs (rough sketch below).

    I see folks doing serializable reads for historic ETL jobs with one read in the transaction - why? Is there some history / tool issue I'm not familiar with?
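
    For concreteness, a minimal sketch of that pattern in Python with pymysql – the connection details, table, and columns are all made up:

        import pymysql

        # Placeholder connection details for a read replica.
        conn = pymysql.connect(host="replica-endpoint", user="etl",
                               password="...", database="prod")

        with conn.cursor() as cur:
            # READ COMMITTED: each statement sees its own snapshot instead of
            # pinning one snapshot for the whole long-running transaction.
            cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")

            # Skip the hot tail of the table so the long scan isn't churning
            # against rows that are still being written.
            cur.execute("""
                SELECT id, payload, created_at
                FROM events
                WHERE created_at < NOW() - INTERVAL 15 MINUTE
            """)
            rows = cur.fetchall()  # hand off to the rest of the ETL pipeline

        conn.close()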

    • bjacokes 1043 days ago
      For Aurora MySQL, the default for read-only replicas is repeatable read. As we mentioned towards the end of the post, read committed support appears to have been introduced to Aurora MySQL just last year. But you're right – now that it's supported, switching to read committed is by far the easiest fix (mechanics sketched below).

      No idea why people would be using serializable reads for ETL jobs though! :O
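
      For the mechanics on an Aurora MySQL reader (from memory of the AWS docs, so double-check for your engine version): READ COMMITTED on a replica is gated behind a session variable, roughly like this – endpoint and credentials are placeholders:

          import pymysql

          # Placeholder reader endpoint and credentials.
          conn = pymysql.connect(host="my-cluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com",
                                 user="etl", password="...", database="prod")

          with conn.cursor() as cur:
              # Aurora-specific switch that allows READ COMMITTED on read
              # replicas; without it the reader stays on REPEATABLE READ.
              cur.execute("SET SESSION aurora_read_replica_read_committed = ON")
              cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")
              # ... long-running ETL reads go here ...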

      • slownews45 1042 days ago
        My own guess was that some ETL jobs were really data integrity jobs – in which case folks maybe got used to higher isolation levels being necessary across many reads to avoid false positives in their cross-checks.
  • whs 1043 days ago
    We had a similar problem where a running ETL job caused a production outage due to binlog pressure.

    One thing that surprised us was that our TAM said that on a single-AZ write-heavy workload, normal MySQL would have higher performance, since Aurora synchronously writes to storage servers in other AZs. On an immediate read-after-write workload, that means it takes longer to acquire a lock.

    • frakkingcylons 1043 days ago
      > One thing that surprised us was that our TAM said that on a single-AZ write-heavy workload, normal MySQL would have higher performance, since Aurora synchronously writes to storage servers in other AZs

      What is surprising about a multi-AZ database having higher latency than one that runs in only one AZ?

      • bjacokes 1043 days ago
        From what I can tell, they provisioned their DB instance(s) in a single AZ, but weren't aware that Aurora automatically provisions its own storage and always uses multiple AZs. We touch on the separation of compute and storage in the post.

        I think the surprise is that it's not possible to have a truly "single AZ" Aurora database, even though you might have thought you provisioned your DB instances that way.

        • frakkingcylons 1043 days ago
          I see. I haven’t used Aurora, but I have experience running write-heavy workloads on RDS. EBS failures would regularly (like monthly) cause our write latency to spike 3-5x. If Aurora’s storage layer architecture is more resilient to those types of problems, that seems like a huge win.
        • goeiedaggoeie 1042 days ago
          Hopefully this is not a surprise if you are using Aurora. The papers on the topic are very clear about how they scale the storage.
    • bjacokes 1043 days ago
      This seems plausible given our understanding of the database internals. In general we found our AWS contacts to be knowledgeable and forthcoming about complex tradeoffs between Aurora and vanilla MySQL, even if some of that information is hard or impossible to find in the docs.
  • georgewfraser 1042 days ago
    I wonder why Aurora shares undo logs between replicas? It’s perfectly possible for the read replicas to each re-create their own copy of the undo logs, and retain those undo logs for different durations based on the different long-running queries on each replica.
  • user3939382 1042 days ago
    It seems like the benefits of cloud infrastructure have normalized vendor lock-in.

    I’ve never used Aurora because I don’t want to code anything to the idiosyncrasies of AWS (or any other cloud provider).

    • sofixa 1041 days ago
      Aurora has compatibility layers, and you interact with it as you would with normal MySQL, MariaDB, or PostgreSQL. Of course there are some underlying differences, but the code and most of the tooling stay the same.
  • exabrial 1042 days ago
    One huge difference we discovered is in locking. Do not expect Aurora to do you any kind of favors if you actually use these features.
  • shepardrtc 1042 days ago
    Really great article! I have a question: in it you say to keep an eye on RollbackSegmentHistoryListLength, and I want to do that, but I don't know at what number it becomes something to worry about. There doesn't seem to be any guidance on AWS' site. I'm seeing ranges of 1,000 to 5,000 and sometimes 100,000.
    • bjacokes 1041 days ago
      Great question, although I'm not sure there's a concrete answer to it other than "it depends". You can think of that metric as representing the number of logs that haven't been garbage collected, so as it goes up, performance will get worse.

      If you're seeing spikes in RollbackSegmentHistoryListLength that coincide with dips in DB performance, you've probably identified the culprit. In the scenario described in our post, that metric would have grown monotonically for the duration of the long-lived ETL query – probably a more overt problem than what you're describing with short spikes to 100,000.
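
      If you want to watch it programmatically rather than in the console, a rough sketch with boto3 – the region, instance identifier, and time window are placeholders:

          import boto3
          from datetime import datetime, timedelta, timezone

          cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

          # Pull the last 24 hours of the metric for one DB instance.
          resp = cloudwatch.get_metric_statistics(
              Namespace="AWS/RDS",
              MetricName="RollbackSegmentHistoryListLength",
              Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-aurora-instance"}],
              StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
              EndTime=datetime.now(timezone.utc),
              Period=300,              # 5-minute buckets
              Statistics=["Maximum"],
          )

          # Print the datapoints in time order; sustained growth (not just the
          # absolute number) is the thing to watch for.
          for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
              print(point["Timestamp"], int(point["Maximum"]))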

      • shepardrtc 1041 days ago
        A number of our 100k spikes spanned about a day, and a cluster of them seem to coincide with serious performance issues we have encountered. We "solved" the problem by increasing the instance size, but I'm starting to see spikes that get larger and larger, so I suspect we will run into this issue again. But now I have something to report on and watch out for. Thank you!
  • wooly_bully 1043 days ago
    H3 tags on this could really use a bump in size and contrast from the regular text.