I'm puzzled how anyone trusts ChatGPT for code

I asked ChatGPT to convert a simple 12-line Python program to Lisp.

The result is broken by mismatched brackets. Yes. In Lisp.

I ask for correction and receive the errant bracket moved, still mismatched.

I ask again and the bracket gets removed. Fine??

No. A loop break has been mistranslated to a return from function, so execution is nonsense.

I wonder how much user faith in ChatGPT is based on examples in which the errors are not apparent ... to a certain kind of user.

133 points | by chrisjj 11 days ago

70 comments

  • delichon 11 days ago
    I'm puzzled why anyone would incorporate code on trust. I certainly don't trust LLMs for code, but I use code from LLMs on most work days ... not because I trust it, but because I understand it and tested it. So, the same as code from anyone, including from myself six months ago. Trust doesn't enter into it. Clear and workable solutions do, and I've gotten an impressive number of those.
    • gordon_freeman 11 days ago
      yes. LLMs are great at giving a template or code block for specific use case and then I use it after understanding, modifying (if needed) and testing it. I am not sure about ChatGPT but Gemini Advanced even allows modifying code and running it within the answer itself, which is convenient for testing out minor changes.

      Also, getting template of 100s of LOC saves time in writing it from scratch even if I know exactly what I'd need to write.

      • chrisjj 10 days ago
        > LLMs are great at giving a template or code block for specific use case and then I use it after understanding,

        I suspect the typical ChatGPT user is using it for code that's beyond his ability to write. That being the whole point of his using it.

        It follows that such code is likely beyond his ability to understand. But worse, beyond his ability to understand he doesn't understand it.

        • Brian_K_White 10 days ago
          I can read and modify all kinds of code I couldn't write in the first place.

          I haven't used any llm for code yet but most code I ever touched was mostly written by others, and I wouldn't have been able to write it from scratch, but was able to modify it, take only chunks out of the whole to use for something else, take only the structure and not the verbatim code and use in some other language, refactor to satisfy my own priorities vs the original, get useful results in languages I don't even know, etc etc.

          It doesn't seem like using some pattern-matched uber-autocorrect code would be much different, especially once you understand the understandingless nature of what you're getting.

      • repelsteeltje 10 days ago
        > Also, getting template of 100s of LOC saves time in writing it from scratch even if I know exactly what I'd need to write.

        I rarely know exactly what I need to write and writing it usually isn't the problem. Sans LLM, there have always been tools and techniques you can lean on, like syntax highlighting, auto completion, your compiler feedback, or unit tests.

        I find that writing code consumes only a small percentage of my time. And while writing code is obviously where bugs originate, it's not where the big problems come from.

        Making software good is the easy part, making the right software is where the trouble lies. It's why I spend most time not coding, but talking to people. Figuring out what needs to be done, how, when and why.

    • GenerocUsername 11 days ago
      Agree. The snippets from ChatGPT are essentially open-source as in they are fully inspectable and can be freely modified before execution. There is zero trust required.
    • yevlasenko 11 days ago
      I would back this one up as well. The code I generate in ChatGPT is rather "dumb" (mostly React components, NestJS modules, etc) and in this case I trust myself to test and use it.
    • fassssst 11 days ago
      Right, hopefully no one is sprinkling eval(prompt) into their codebase.
      • 1f60c 11 days ago
        I had an idea about that the other day!

        What if you could write something like:

          @implement_this
          def prime_sieve(n: int) -> list[int]:
            pass
        
        And the decorator reads the function name and optional docstring, runs it through an LLM and replaces the function with one implementing the desired behavior (hopefully correctly). I remember there was something like this for StackOverflow.
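
        Roughly, a minimal sketch (assuming a hypothetical ask_llm() helper that takes a prompt and returns raw Python source - both the helper and the prompt wording are made up for illustration):

          import inspect

          def implement_this(func):
              # Build a prompt from the stub's signature and docstring.
              prompt = ("Write a complete Python function matching this stub, "
                        "returning only code:\n\n" + inspect.getsource(func))
              source = ask_llm(prompt)  # hypothetical helper: prompt in, source code out
              namespace = {}
              exec(source, namespace)   # blindly trust the generated code (that's the joke)
              return namespace[func.__name__]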
        • matsemann 11 days ago
          I made that as a joke in javascript 8 years ago. Not using an LLM, but using javascript proxies so that if a function doesn't exist it tries to implement it at runtime based on the name of the function, using a SQL-like grammar. I really hope to not see something like that in real use, heh.

          https://github.com/Matsemann/Declaraoids

          Maybe I should make an LLM version of this.
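
          For the curious, a rough Python analogue of the same trick (illustrative only - Declaraoids itself uses JS proxies and a richer grammar than this) is just __getattr__ plus some name parsing:

            class Declaraoid:
                def __init__(self, rows):
                    self.rows = rows  # list of dicts, e.g. [{"name": "Bob", "age": 42}]

                def __getattr__(self, name):
                    # Called only for missing attributes: invent the method from its name,
                    # e.g. d.find_name_equals("Bob") -> rows where row["name"] == "Bob"
                    if name.startswith("find_") and name.endswith("_equals"):
                        field = name[len("find_"):-len("_equals")]
                        return lambda value: [r for r in self.rows if r.get(field) == value]
                    raise AttributeError(name)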

          • LegionMammal978 11 days ago
            The funny part is, Spring Data JPA has a quite serious take on this, in the form of query methods [0]. You create a repository interface with certain method names, and it dynamically creates an implementation that queries the columns of the underlying table, according to the name of the method.

            [0] https://docs.spring.io/spring-data/jpa/reference/jpa/query-m...

        • jackmpcollins 10 days ago
          I'm building magentic https://github.com/jackmpcollins/magentic which has basically this syntax, though it queries the LLM to generate the answer rather than writing + running code.

            from magentic import prompt
            from pydantic import BaseModel
            
            class Superhero(BaseModel):
                name: str
                age: int
                power: str
                enemies: list[str]
            
            @prompt("Create a Superhero named {name}.")
            def create_superhero(name: str) -> Superhero: ...
          
          
          I do have plans to also solve the case you're talking about of generating code once and executing that each time.
        • Nodejsmith 10 days ago
          I've never tried it myself, but Prefect does have something like this with their Marvin AI library for Python.

          https://github.com/PrefectHQ/marvin?tab=readme-ov-file#-buil...

      • _akhe 11 days ago
        Not gonna lie... llama.cpp... LlamaIndex... Ollama... kinda is eval(prompt) a lot of the time!

        Of course AI data pipelines are a totally different conversation than code solutions.

    • chrisjj 10 days ago
      > I'm puzzled why anyone would incorporate code on trust.

      Because almost every programmer these days has learned by a route that relies on incorporating code on trust. E.g. using someone else's compiler.

      > I use code from LLMs on most work days ... not because I trust it, but because I understand it and tested it.

      The snag there is that one can rarely verify test coverage is sufficient.

      > the same as code from anyone, including from myself six months ago.

      But it is not the same. ChatGPT delivers completely untested code. Not even syntax-checked in my case. Would you accept such code from any human?

      > Trust doesn't enter into it. Clear and workable solutions do

      Plenty of clear and workable solutions are discovered unsafe. I suspect that's much more likely when the code comes from a source such as ChatGPT that delivers code in a completely untested state.

      • Brian_K_White 10 days ago
        Syntactically incorrect code is not much different than an outline or pseudocode.

        There are countless stories of people successfully accomplishing tasks by getting an llm to give them a starting point or outline or even a complete solution that only needed a few fixups, hardly any different from debugging your own first draft, so it's too late to say it doesn't or can't work. That ship has already sailed.

        The fact that it doesn't produce finished correct code, or even appropriate outline for every single case, doesn't seem especially remarkable or damning to me. It is what it is, it's not what it's not.

        (I haven't used any myself, but I'm not primarily a coder who is obligated to take advantage of every available tool, and don't like how any of the llm companies are training on open source code without either consent or attribution and then selling that. And don't like how ai and even mere speech to text voice interface before that is being added to everything in general.)

        • dllthomas 10 days ago
          > hardly any different from debugging your own first draft

          I find this take surprising. Leaving AI aside entirely, debugging my first draft seems very different than debugging your first draft.

          • Brian_K_White 10 days ago
            I often don't know what's wrong with my broken code. I only know what I want, not exactly in what way I failed to produce it. I have to debug to figure that out.

            I grant, maybe that is still quite different.

            There is something extra to grasp the theory behind someone else's code sometimes.

    • zouhair 9 days ago
      I am no programmer and I love how ChatGPT can explain a bash script to me line by line; it makes reading those way easier and more enjoyable.
    • argc 11 days ago
      Yeah, to use it effectively you use it for inspiration and generation of template code. The code may need a lot of changes or may need relatively few. But don't copy it expecting it to work without validating and testing it.
    • 2OEH8eoCRo0 11 days ago
      So you've audited all code that you're running on your system?
    • burnte 11 days ago
      > I'm puzzled why anyone would incorporate code on trust.

      Really? You can't understand why people do things based on trust? Do you trust no one?

      • simonw 11 days ago
        Trust but verify. Code review exists because even the best developers make mistakes.
        • burnte 10 days ago
          I totally agree, but let's say Jane writes the code, Frank reviews the code and checks it in. You're trusting both of them. Trust can't be eliminated.
  • Sohcahtoa82 11 days ago
    If you're expecting to just copy/paste from ChatGPT without understanding what it generates, you're really doing it wrong.

    I've only used it for basic snippets where I know what the code I need is, but it's faster to describe it than write it out myself (for example, "Using Python and boto3, take all the files in the current directory that end in .csv or .pdf and upload them to a bucket named "reports". Prepend the S3 keys with "YYYY-MM-DD/" using today's date").

    For simple things like that, it's been great.
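
    For reference, the kind of snippet I'd expect back for that prompt is roughly the following - a sketch, assuming boto3 picks up credentials from the environment and the "reports" bucket already exists:

      import boto3
      from datetime import date
      from pathlib import Path

      s3 = boto3.client("s3")
      prefix = date.today().isoformat()  # "YYYY-MM-DD"

      # Upload every .csv/.pdf in the current directory to the "reports" bucket.
      for path in Path(".").iterdir():
          if path.suffix in (".csv", ".pdf"):
              s3.upload_file(str(path), "reports", f"{prefix}/{path.name}")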

    • paulmd 10 days ago
      It’s also constrained by how many examples it’s seen of a particular language or data format, and the general complexity of the format (even without AI you’ll see diff breaking on json files for much the same reason, unless you go out of your way to avoid it - brackets are semantic white noise and there’s nothing unique for it to lock onto between those bracket tokens). At a minimum it’s going to absolutely chew through tokens to maintain that much context.

      Obscure languages and operating systems are also a weakness - for example I could not get llama3 to reply with FreeBSD commands even after repeatedly telling it not to - there just aren’t that many SO/Reddit/forum posts about BSD in comparison to Linux. And evidently that’s what they’ve trained it on. There isn’t even anything semantically odd about an rc.conf, it spits out correct statements, just for the wrong file in the wrong OS.

      Everyone acknowledges things like multiplication and counting are hard for LLMs because they aren’t statistically encoded. But counting lisp brackets is also basically the same problem. And unlike YAML it’s not really self-healing or obviously correctable once it’s fucked up (as you noticed). I think the implication is there’s clearly some syntaxes that are much less generateable simply because they require intense amounts of context.

      (also, as far as counting… I’m sure far smarter people have looked at this, but maybe multiplication is something that could benefit from log-addition techniques, which might reduce the complexity quite a bit vs a naive “multiply column by column like a human fourth-grader” approach.)
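
      A toy sketch of the log-addition idea, purely the arithmetic identity rather than anything a transformer actually does:

        import math

        # Multiplication becomes addition in log space:
        # a * b == exp(log(a) + log(b)) for positive a, b.
        def log_mul(a: float, b: float) -> float:
            return math.exp(math.log(a) + math.log(b))

        print(log_mul(37, 54))    # ~1998.0, same as 37 * 54
        print(log_mul(123, 456))  # ~56088.0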

      • chrisjj 10 days ago
        > It’s also constrained by how many examples it’s seen of a particular language

        ChatGPT: "I've encountered numerous examples of Lisp code during my training, covering different applications and techniques within the language. Whether it's simple examples or more complex implementations, I've seen quite a bit of Lisp code."

        > the general complexity of the format

        Good then that this example has a very simple format.

        > even without AI you’ll see diff breaking on json files for much the same reason, unless you go out of your way to avoid it - brackets are semantic white noise

        Not a problem in this case.

    • NewsaHackO 11 days ago
      Yep. OP had to ask it to fix mismatched brackets, and is a perfect example of the type of person who definitely should not use LLMs for code.
      • chrisjj 10 days ago
        Actually I didn't have to.

        I chose to. So I could give it another chance to succeed.

        But do tell. What type of person does not have to ask it to fix its mismatched brackets?

      • boredtofears 11 days ago
        I’m sure most people can learn to use LLMs correctly if they can code - it doesn’t matter what “type” of person you are.
      • peteradio 11 days ago
        I've seen self described "product owners" claim to build using LLM outputs without knowing how to code.
    • chrisjj 10 days ago
      > If you're expecting to just copy/paste from ChatGPT without understanding what it generates, you're really doing it wrong.

      Glad I'm not doing that, then. I understand what it is generating as soon as I try to run it - and get a "missing bracket" error. I've understood that what it has generated is crap. And that only good luck made that easily detectable.

  • shinycode 11 days ago
    Our intern uses copilot extensively and its code is riddled with errors. What a pain to review, a lot of time wasted. This is really concerning for newer generations that are not professionals yet but they trust it because the code looks better than what they are able to do. Worse, they outsource their brain and don’t sharpen their senses. How will they become professionals this way? As a help it’s okay, as a cheating tool that makes them faster …
    • zamadatix 11 days ago
      An intern using LLMs to generate bad untested code is no different than an intern using Stack Overflow to generate bad untested code, in that the problem isn't Stack Overflow or the LLM or whatever tool the intern is using, but rather the lack of quality control and code review applied by the intern to their own code. Making them write everything from scratch isn't an instant cure either; you can write some really badly designed stuff right from scratch and push it immediately on up too. This doesn't make you mentally sharper along the way, it makes you confidently sloppy. A better path is to ensure they are spending significantly more time reviewing, testing, and incrementally improving their code than they are typing/generating it. That's where you really sharpen your brain/senses... and also where you keep from driving everyone else nuts with your PRs. Be it LLM or senior dev, if you just say "one shot this functionality so I can push it to main" you're going to have a bad day.
      • chrisjj 10 days ago
        > An intern using LLMs to generate bad untested code is no different than an intern using Stack Overflow to generate bad untested code

        Very different. An LLM lets him generate /far more/ bad untested code per unit of effort.

        And SO encourages other humans to test it. Often an SO answer comes with evaluative comments and ratings that are all I need to reject it. If the LLM was "AI", it would at least provide the same, right?

        > A better path is to ensure they are spending significantly more time reviewing, testing, and incrementally improving their code than they are typing/generating it

        Then let's see the LLM-based workflow that achieves that...

      • shinycode 11 days ago
        The main difference with SO is that the code from SO cannot always be copy/pasted; the context is different so we had to rethink it. With copilot you have the sense that it’s contextualized and it’s the right fit. With pair programming I saw him accept the autocompleted code multiple times without reading it; he just said « wow it got it fast ». That for me is even worse than SO in that regard. For the rest I agree
        • ssl-3 10 days ago
          In all cases (using examples gleaned from Stack Overflow, or seemingly-complete code from the bot, or code crafted/hacked-together from scratch with Knuth as reference material):

          The intern is at the end of this toolchain, whatever that chain is. And when the intern is using that chain to produce code that they won't bother trying to understand before submitting it, then the problem is the intern.

          It is a poor craftsman who blames their tools.

          And what I mean by that is this: Suppose an apprentice is tasked with using a saw to cut some 2x4s to a specified length. It doesn't matter what kind of saw it is: It could be a handsaw, a chop saw, a handheld circular saw, a radial arm saw, a table saw, a FrameRipper 9000 Pushbutton CNC SawSaw saw, or any other manner of wood-cutting saw.

          Maybe they're even using an LLM bot to create the G-code to run that FrameRipper 9000.

          Whatever it is they're using, suppose they produce cuts that are consistently and uselessly wrong -- that are unfit for the purpose.

          And maybe a lot of that is OK -- after all, they're there to learn, right? Certainly, a big part of learning means making mistakes. It's going to happen, and to varying extents it's absolutely expected to happen. It can be blameless -- especially early on.

          But suppose the apprentice won't ever bother with even looking at the cut-up boards that they've produced to see if they're even in the same ballpark of what was requested, and just blindly submits their work (however right or wrong it may be) as a finished task.

          This is no longer blameless.

          Now, at this point: Do you blame the tool, or the person who is using that tool? (Choose exactly one.)

          • shinycode 10 days ago
            The person. That’s what I mean when I say that as a help it’s ok, but as a cheating tool, aka outsourcing your brain, it’s not. « I don’t want to think it through but still appear to have worked on it » is not acceptable. I agree with all the other comments that say we shouldn’t trust the output and should always check it.

            Imagine the horror of trying to read, understand and fix millions of lines of unsupervised code generated in a production environment …

            • ssl-3 10 days ago
              I suck at coding. I always have sucked at coding. I've worked on getting better at it for several decades, but even small projects (regardless of language) get me bogged down very quickly. Historically, most of my coding projects get abandoned as being apparently-insurmountable (by me). Writing code is not something that has ever fully "clicked" for me.

              But in the past year or so, I've finished some coding projects that are useful to me. Mostly, I've done this using a ton of help from the bot.

              But I don't just blindly trust what the bot produces. I read the output until I understand what it does (and why it does it that way), and if I can't understand it or if the method is awful then I get clarifications and/or revisions until I do understand it or the method is sane. I test it to make sure that it does do what I think that it should do, even with weird parameters. And if it's got unreadable spaghetti in it, or it isn't following my specifications, I coach it back towards not fucking up that way and we start at the beginning again for that section of code (wherein, I don't just blindly trust it).

              It is not a particularly quick process, and it does have its rocky moments, though I am getting faster and better at it...and the results are... eh? They're alright.

              Now, sure: I do have all of my previous failings ("learning opportunities") to look back on as I work with the bot to produce the things I want -- I've got a ton of useful (if functionally incomplete) background relating to the production of code under my belt.

              But this is a process that works for me. And I definitely am not cheating. I'm just using a fairly new and widely misunderstood tool to accomplish things that I (despite decades of frustrating effort) was previously broadly unable to accomplish on my own.

              I don't have any regrets about using this process. I don't feel, say, that I'm somehow less of a man because I've found utility in the use of a tool -- I was fully a man before I started using that tool, and I remain fully a man today.

              In fact, in using this tool I've learned more about code in the past year or so than I ever have before. And not just syntax and formatting, but also good structures for that code, and the importance of planning that code before a single line is ever written, and of revising that plan whenever it becomes necessary -- no matter what needs thrown away or grossly-refactored as a result of that revision.

              This makes me feel that the idea of writing code (without a bot) may actually "click" for me someday soon, but it also makes me question the necessity of that: Does it ever have to "click" when I've got tools like this available for me to use?

              At the end of the day, lazy interns were also lazy interns an eternity ago. Lazy interns produced shit code 60 years ago, and lazy interns will still produce shit code today (except now, they can produce shit code even faster -- wheeeee! -- but that's also not new). Supervising lazy interns has always been an arduous task, whether they were using written arithmetic, a slide rule, or a pocket calculator; or whether using relay logic or punch cards or one of those new-fangled glass teletypes. When the output of a lazy intern is wrong because they just don't care about their answer, then that lazy intern is still wrong no matter how they got there or how quickly they're able to show that they're wrong...again.

              Much the same as it always has been.

              But what has changed fairly recently is that we've begun a journey that is now absolutely inevitable, and we're all in for a long ride -- whether we like it and want to be a part of it or not, we're already well on our way.

          • chrisjj 10 days ago
            > It is a poor craftsman who blames their tools.

            But this intern doesn't blame his tools. Unfortunately.

            And one can hardly say it is a poor end-user that blames the intern's tools.

            > Do you blame the tool, or the person who is using that tool? (Choose exactly one.)

            I'll blame the person who created that tool.

      • robryan 11 days ago
        At least with Stack Overflow you can be fairly confident the code works correctly in the context it was provided in, if it has a lot of votes.
        • chrisjj 10 days ago
          Absolutely. Yet where's the "AI" that can do such voting?
    • JTyQZSnP3cQGa8B 10 days ago
      Why aren’t you banning the use of such tools all over the company? I think I would do that with the only reason being the fear that it might leak private source code to random companies.
      • shinycode 10 days ago
        Because it’s not in my power to decide, but I’ve proposed it.
  • jgb1984 11 days ago
    ChatGPT, when I was forced to use it often got confused which is the bigger number, 2 or 3. I don't care how many lines of bottom of the barrel code it can throw up, I'm not using any of it. AI generated spaghetti debt.

    LLM's will turn out to be nothing more than another glittering gadget. Except this one will obliterate any coding skills of future generations. Not to mention the gigatons of AI generated garbage text, images and music that it's filling the world with. What a waste.

    • sdenton4 11 days ago
      User: which is the larger number, 3 or 2?

      ChatGPT: The larger number is 3.

      User: If I have one apple, pick another apple and then get an apple from a friend, and Alice has two apples, who has more apples?

      ChatGPT: You have more apples. You started with one apple, picked another, and got one more from a friend, so you have three apples. Alice has two apples.

      [edit]:

      User: what does good will hunting say to the know-it-all at the bar?

      ChatGPT: After Will Hunting outsmarts the know-it-all student in the bar, he spots the student's female friend outside a shop later on. To tease the student further and show his wit, Will taps on the window to get her attention and, holding up his phone number, famously asks, "Do you like apples?" After she nods, Will triumphantly says, "Well, I got her number. How do you like them apples?" This line became quite iconic and is one of the memorable quotes from the movie.

      • Jensson 10 days ago
        Since ChatGPT doesn't have a central place to do math, its ability to know whether 3 is bigger than 2 depends on the context, so your examples don't prove anything.

        Every time I've tried gpt-4 etc there are such problems. I don't use it for trivial things since then what's the point, and for anything non-trivial it screws up too much.

        It is useful for translations and brainstorming, but not much else. Translating from natural language to code is a part of that, but not letting it code itself.

    • cvhashim04 10 days ago
      I think you're in for a rude awakening. The software engineering field will be an unrecognizable landscape 5-10 years from now and you'll be looked at like the guy who codes in a text editor with no IDE if you aren't using some form of AI to generate code -- notice I said generate; coding from scratch will be a thing of the past also.
      • jgb1984 9 days ago
        That's incredibly funny, since I've been developing software for 20+ years using debian and vim (look mom, no IDE!) and I'm pretty sure that's what I'll be doing for the years to come. Imagine how valuable it will be when you're in the minority of humans who still understand how a linux system operates, and how to write and read your own code! Amazing! I absolutely welcome cloud and AI, happy to let it destroy any and all wisdom of future generations :)
    • superfrank 11 days ago
      > ChatGPT, when I was forced to use it often got confused which is the bigger number, 2 or 3

      I mean... yeah. ChatGPT doesn't know what numbers are or what logic is. It wasn't "confused". It just doesn't do that.

      You're using it for something it's not intended to be used for and, unsurprisingly, it's bad at it. If I said I'm giving up on using hammers because they were terrible for screwing screws into the wall, you'd tell me I'm using the wrong tool. This is no different. ChatGPT has its uses, but it isn't magic. Just like any tool it has limitations.

    • groby_b 11 days ago
      You... might want to inform yourself a bit more before making proclamations about the future.

      LLMs have fairly well understood failure modes. They have incredible strengths, too. And they differ very much by what you use. If the ChatGPT you were forced to use was GPT3.5 (and the behavior sounds very much like it was), you literally have no idea what modern LLMs can actually do.

  • chx 11 days ago
    Every time...

    https://hachyderm.io/@inthehands/112006855076082650

    > You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

    > Alas, that does not remotely resemble how people are pitching this technology.

    • chrisjj 10 days ago
      > I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context”

      Great. Solved the world shortage of bullshit. Oh wait...

  • lolinder 11 days ago
    FWIW, I'd expect mismatched parentheses to be particularly likely to occur in Lisp. There are very few forms of writing where two tokens () account for 30-50% of the total token size of the text. It's unsurprising that with even a tiny bit of temperature the model gets overzealous and places one too early or too late.

    Add to that the fact that matching parentheses is hard for humans to reliably do correctly—especially when typing code into a textbox on Stack Overflow—and now you have a model whose architecture struggles with knowing when to stop in highly repetitive sequences and that was trained on data that is disproportionately likely to have incorrect syntax for the language you're using.

    Other languages don't have those features, and I've never found Copilot to give me a syntax error in, say, JavaScript.

    That said, I also don't know any professional developer who trusts LLMs to code. Everyone I know who uses them uses them as a tool and takes responsibility for cleaning up their output.

    • Leherenn 11 days ago
      Copilot likes to swallow my } in C++ for some reason. E.g. if you generate a new function at the end of a namespace block, it will usually use the bracket of the namespace block as the terminator of the function, leaving you one short.
    • chrisjj 10 days ago
      > FWIW, I'd expect mismatched parentheses to be particularly likely to occur in Lisp.

      FWIW, this example had one more ) than ( !

      I'll post it as a comment to the root.

      • lolinder 10 days ago
        Yep, that's entirely unsurprising. Generative AI in general has a hard time knowing when to stop highly repetitive patterns—image generators struggle with getting the right number of fingers, and transformers will often get stuck in loops if things get too repetitive.

        Lisp is just a uniquely hard language for a transformer to cope with—not that it doesn't make mistakes in other languages, but I've almost never seen a syntax mistake.

  • mindcrime 11 days ago
    I'm with @delichon... I don't "trust" LLM generated code, I review it and decide how appropriate (if at all) it is, possibly modify it, and include it if it makes sense. Trust isn't really part of the equation.
    • jaredsohn 11 days ago
      I also don't trust code from stackoverflow, colleagues, or even code I had written myself in the past.

      It's helpful to always be skeptical of code and try to get an understanding of it.

      Another way I think of it is I find it easier to revise than to write (both in English and code). So when I write something I just try to get my ideas out there and then skeptically reread it and recompose. LLMs let me just speed up that process.

      • mindcrime 11 days ago
        or even code I had written myself in the past.

        Aaawwww hellz no. I hate looking at my old code. I've had too many of those surreal experiences where something breaks, I start looking at how to fix it, and am suddenly stunned to realize "this could never have worked in the first place." There's one or two of those I'm still puzzling over, decades later. :-(

        • zer00eyz 11 days ago
          I have been writing go for a decade now.

          Good go is like brutalist architecture. Strip out all the cute, the smart, the stupid coder tricks. Write ugly, blocky, verbose, spelled out, functional code. Test the ever loving shit out of it.

          Go is fast enough that if you're going back to fix a "performance problem" you either did something head smackingly stupid (we all have) or you have a real bottleneck and it's time to roll up your sleeves and write that elegant solution.

          I tend to revisit code more often to add features or deal with edge case bugs than anything else. That clear and functional code without any magic is a win.

          Edit: you can do the above with almost any language but those that choose packages over standard libs make it HARD.

    • nemothekid 11 days ago
      I don't know if OP meant "trust" as in free from security errors or bugs. I often struggle with "how is this a useful tool?". Fixing bugs in someone else's code seems like more a time sink than just writing the code myself.

      I assume everyone else must be working on projects/languages that have far more representation in the training set, but for me, especially working in Rust, even trying to write interfaces into C libraries, I would constantly find myself dealing with hallucinated parameters or function calls, incorrectly implemented functions or just broken code.

      At the very least if I do it myself, I'll have some intuition about how the code works.

      • mindcrime 11 days ago
        > I assume everyone else must be working on projects/languages that have far more representation in the training set,

        I can't speak for anyone else, but I've used LLM's mainly for generating Java code. The generated code hasn't necessarily ever been perfect, but IME it's been "close enough" that even counting having to make some manual changes, it's still faster than writing it myself from scratch.

        That said, I'm typically only using it for relatively short snippets of code, where I "know" how to do the operation in question, but want to avoid having to stop and look up fiddly details that might not be cached "top of mind" because I haven't written that code in a while.

        I'm sure there are use-cases where LLM's don't save time, but I'm equally sure that there are at least some where they do.

  • pprotas 11 days ago
    Sounds like GPT 3.5 to me, version 4 is much better at code.

    Also, as a general rule you shouldn't trust its code output, but it's a good way to brainstorm or get started on coding something. It can give useful tips or replace stack overflow. I almost never ask it to generate code for me, but I ask it questions about existing code very often. That seems to work very well for me.

    • StewardMcOy 11 days ago
      Sorry to hijack your reply, but I've been having a lot of trouble with ChatGPT4 for code. I don't actually incorporate LLM-generated code into my work, but I often converse with ChatGPT and check what it's capable of every couple weeks so I can stay informed.

      By far, the most common problem is that I will ask ChatGPT 4 to write a small program that does X, and it will create a skeleton. Lots of functions that implement at most some parameter validation and then contain comments like // Implement the rest of this function.

      With enough prodding, I can sometimes get it to actually fill in the function implementations one-by-one. But often times it will implement part of it and then put another one of those "implement the rest" comments in. Like for example, if I asked it to write a function to draw three circles, after prodding, it would write code to draw the first circle and then stop. Obviously, the answer here is for it to create a function that draws 1 circle and call it 3 times, but I can't fault it for taking my instructions very literally.

      Even when I do get it to implement full functions, when I later ask it for the full code listing, it's back to printing out functions that have "implement this" comments in them.

      Perhaps the fault is mine for asking it to do tasks that there aren't hundreds of implementations of on the internet.

      • simonw 11 days ago
        I've seen lots of people complain about this (the "ChatGPT is being lazy" bug, which OpenAI themselves said they would look at late last year) but I've not seen it myself. I wonder if it comes down to prompting style? What kind of prompts are you using that get these responses?
        • StewardMcOy 11 days ago
          I didn't consider prompting style. Perhaps I'm too polite? It's a bit difficult because I usually engage in conversation first to sound out whether my understanding of the problem is sound and whether it agrees on the approach to take. So the chat is quite large before I get to the prompts, and the prompts reference earlier chat messages. But let's see if I can pull up some good examples from my history...

          OK, here's one. I was initially talking with it about how to best simulate restrictions from older consoles in an indie game. We chatted a bit about how to most efficiently do paletted drawing on modern graphics architectures in various APIs (e.g. Vulkan, SDL). I got to asking it how exactly the SNES stored data in VRAM, and asked it to list all the PPU registers in the SNES and what they're used for. It gave me a correct answer (I double-checked). I then asked this:

          > Can you write a header for a C++ file named PPU.hpp with methods to read and write these registers?

          And it responded with this. Note that the response includes a couple "add more as needed" comments.

          Here's a basic outline for PPU.hpp:

            #ifndef PPU_HPP
            #define PPU_HPP

            #include <cstdint>

            class PPU {
            public:
                // Constructor and Destructor
                PPU();
                ~PPU();

                // Methods to read and write PPU registers
                uint8_t readRegister(uint16_t address);
                void writeRegister(uint16_t address, uint8_t value);

                // Specific methods for different PPU functionalities
                void setMode(uint8_t mode);
                void setBGTileSize(uint8_t bg, bool size16x16);
                void setBGCharacterBaseAddress(uint8_t bg, uint16_t address);
                void setBGScreenBaseAddress(uint8_t bg, uint16_t address);
                void setMosaicSize(uint8_t size);
                void setBGScroll(uint8_t bg, uint16_t horizontalOffset, uint16_t verticalOffset);
                // Add more methods as needed...

            private:
                // Memory-mapped PPU registers
                // Assuming a representation of the PPU register space
                uint8_t ppuRegisters[/* size of the PPU register space */];

                // Helper methods to manipulate specific bits/flags in the registers
                // Example: void setBit(uint16_t address, uint8_t bit, bool value);
                // Example: bool getBit(uint16_t address, uint8_t bit);
                // Add more helper methods as needed...
            };

            #endif // PPU_HPP

          • StewardMcOy 10 days ago
            Another example. I chatted with it for quite some time to define a toy language. I asked it to write small programs like an echo server in this toy language, and it did pretty well (though it sometimes hallucinated changes to the language). I then asked for a tree-sitter grammar, and it always got close, but never could fix the problems. Even after I pointed out the problem, its fixes were always wrong. So instead I decided to try a different approach.

            > Let's write a compiler in C that tokenizes a single file in this language, parses it into a syntax tree, and emits an object file in LLVM IR. I'd like the compiler to be readable and understandable by a human, but execute as fast as possible.

            The tokenizer it wrote was good, but when it got to the operators, it implemented + and -, and then contained this comment:

            // Add other operators and delimiters

            When I asked it to fill in the operators, it actually did a good job. However, this was the parser it gave me.

              typedef struct ASTNode {
                  TokenType type;
                  char* value;
                  struct ASTNode* left;
                  struct ASTNode* right;
              } ASTNode;

              ASTNode* parse(Token* tokens) {
                  // Example parsing logic, build your AST based on the tokens
                  return NULL; // Placeholder
              }

            Obviously, it's leaving building the entire AST up to me. So I asked it:

            > Now implement the complete parser.

            The result was long, so I won't paste it here, but it contained all kinds of comments:

              // Skipping parameter parsing for simplicity
              // parameters would go here
              // Simple implementation: only supports return statements for now
              // Simple expression parsing: only literals for now
              // Add more parsing functions as needed...

            So I said:

            > Seriously, I want you to implement the entire thing, not an outline or a framework.

            And it gave me code with fewer, but more complete, parsing functions. The expression function still had this comment.

            // Assume we're only parsing integers and binary +,- operations for simplicity

            And then at the bottom:

            // Assume the rest of the necessary parsing functions are implemented similarly

            • simonw 10 days ago
              That's really interesting, thanks. I wonder if the initial chatting puts it more in the mood to coach you rather than write the code?

              My prompting style is much more direct - things like this: https://chatgpt.com/share/61cd85f6-7002-4676-b204-0349a72323... - more here: https://simonwillison.net/series/using-llms/

              • StewardMcOy 10 days ago
                Interesting. Most of the time the sentences in your prompts don't even contain a predicate.

                I tried that with the two examples from before. Mixed results.

                Still had one comment telling me to finish it myself in the SNES PPU example, but that's a lot fewer than before. Unfortunately, when I told it to do the work, it gave me a more complete, but not totally complete header without mentioning that it was incomplete. It also took a few tries and I had to regenerate a response that started generating garbage code and then error'd out.

                I checked to make sure that ChatGPT had no memory of previous conversations related to the SNES PPU. So I suspect this has more to do with the task given than the prompting style.

                https://chat.openai.com/share/8703a48a-af72-4a9f-976e-688a26...

                The parser was more interesting. I didn't remove the details of our toy language discussion from ChatGPT's memory, but I also didn't specify much in the way of what I wanted. It created some funky parser code, and I didn't try to compile it to see if it was correct, but it didn't appear to display the lazy behavior I saw before. It also titled the conversation in Spanish, even though I've never chatted in Spanish with it and I'm not fluent.

                I suspect I could have gotten more correct output by giving a more detailed description of what I wanted, but I was trying to keep it simple to match your prompting style.

                https://chat.openai.com/share/ccc23d98-6d4f-4be5-b9b6-b4464f...

    • bearjaws 11 days ago
      GPT4 still sucks if you are using anything modern.

      Its knowledge of SvelteKit still seems weighted to pre-1.0, and it hallucinates libraries for shadcn or radix-ui.

      You have to prefix your prompts with "Please use modern JS" otherwise it won't use .filter or .reduce sometimes.

      I agree it's a good replacement for Stack Overflow, but it is often pitched as a replacement for coding.

      I've found using aider to be the best compromise: let it build something small and revert if it doesn't work.

      • water9 11 days ago
        So basically, you’re trying to use stuff that came later than the data set ChatGPT was trained on and complaining it doesn’t know things that are happening in the future..
        • matsemann 11 days ago
          But how will this affect what devs work on in the future? Will new frameworks be so hampered by this that no one will use them? I tried asking this a few days ago but got little response:

          > If I ask some LLM/GPT a react question I get good responses. If I ask it about a framework released after the training data was obtained, it will either not know or hallucinate. Or if it's a lesser known framework the quality will be worse than for a known framework. Same with other things like hardware manuals not being trained on yet etc.

          > As more and more devs rely on AI tools in their work flows, will emerging tech have a bigger hurdle than before to be adopted? Will we regress to the mean?

          https://news.ycombinator.com/item?id=40263033

        • iamacyborg 11 days ago
          How is a user supposed to know what it is and is not trained on?
        • hsbauauvhabzb 11 days ago
          Which presumably will become a problem when gpt can use old scraper data or data full of gpt output and effectively starts eating itself and producing even worse results?
        • bearjaws 10 days ago
          All the things I mentioned are from 2018-2022...
        • not2b 11 days ago
          filter and reduce have been in JS for more than a decade now.
    • chrisjj 10 days ago
      > Sounds like GPT 3.5 to me

      It was.

      > version 4

      Paywalled.

      > is much better at code.

      I'll post the question as comment to root, for you to try if you want.

  • bretpiatt 11 days ago
    Did you try to use GPT 3.5? Our testing is that it isn't great; using GPT 4 or some of the specially trained versions of GPT 4 (there's one with good reviews called Lisp Engineer), our experience has been different.

    It is not replacing engineers, and it isn't a tool where you give it a broad set of requirements and it just goes and builds. It is helping increase productivity and getting folks through areas where they need to bounce ideas off of someone.

    We're coding mostly in Python, C++, and .NET Core where I do expect it'll have a much deeper set of training data than it will for Lisp (and even for those languages we're getting marginally better performance from specialized engines than we are from general GPT 4).

    The other, non-OpenAI coding AIs so far are all performing worse for us than GPT4. We've done testing against LeetCode challenges and a bunch of other things.

    • neonsunset 10 days ago
      If only these LLMs were decent at C#. Unfortunately, they heavily lean towards very old data, call obsolete APIs, and write in a style that is generally against what is considered to be idiomatic and terse.

      For example, I once asked Claude 3 to port some simple XML parsing code from Go (because 10s to ask is faster than 60s to type by hand haha) and it produced this https://pastebin.com/3823LBiA while the correct answer is just this https://gist.github.com/neon-sunset/6ba67f23e58afdb80f6be868...

      Functionally identical, but such cruft accumulates with each single piece of functionality you ask it to implement. And this example is one where the output was at least coherent and did its job; many others are worse.

    • chrisjj 10 days ago
      > Did you try to use GPT 3.5?

      Yes.

      And succeeded! :)

      > We're coding mostly in Python, C++, and .NET Core where I do expect it'll have a much deeper set of training data than it will for Lisp

      I can't imagine how malformed bracketing could be due to an insufficient training set.

      And nor it seems can ChatGPT: "I've encountered numerous examples of Lisp code during my training, covering different applications and techniques within the language. Whether it's simple examples or more complex implementations, I've seen quite a bit of Lisp code."

      • lolinder 10 days ago
        You tried the very first iteration of an LLM-based chat assistant, were unsatisfied with it because it couldn't match Lisp parentheses, and went on to form an opinion about the value of these tools and implicitly the intelligence of the people who use them. That speaks more to your preconceptions than it does to the state of better tools like Copilot or GPT-4.

        You didn't label it (which, btw, is a faux pas), but it's obvious from your replies that this wasn't an Ask HN, it was a Tell HN. You have absolutely no interest in what the rest of us have to say.

        Nevertheless, I'll try once more for luck: Basing your opinions about LLMs on your experience with GPT-3.5 is a mistake. If you don't want to use LLMs at all because you have preconceptions, that's fine, but don't pretend that you've sampled LLMs and found them lacking for professional coding when you haven't tried the professional tools.

        • chrisjj 10 days ago
          > You tried the very first iteration of an LLM-based chat assistant

          Er, V3.5 is "the very first iteration"?

          > don't pretend that you've sampled LLMs and found them lacking for professional coding when you haven't tried the professional tools.

          I think you misread my post. I didn't mention professional.

          And my post wasn't about a "sample of LLMs". It was about this one in particular.

          • lolinder 10 days ago
            > Er, V3.5 is "the very first iteration"?

            Yes. ChatGPT-3.5 was the very first LLM-based chat assistant that was announced on Nov 30, 2022 [0]. It hasn't gotten better since then, just more censored and faster.

            It followed GPT-1 (which was only interesting to people who were already in the know), GPT-2 (which was neat but widely recognized as pretty useless and again, not something normal people noticed) and GPT-3 (which was cool, but didn't provide a chat interface, it could only complete texts, so it made a decent base for the early versions of Copilot).

            [0] "ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022." https://openai.com/index/chatgpt/

        • fragmede 10 days ago
          damn, I wish I could give more than one upvote
  • FaridIO 11 days ago
    Quality very much depends on the language I've noticed. I don't use ChatGPT but I do use Github's Copilot. The main way I've been using it lately is to make manually written algorithms more efficient. So I'll write an algorithm, select it and ask it to figure out a more time/space efficient way of doing it. Still makes a mistake every now and then, but at the very least it helps me quickly see things from a different perspective.
    • dawnerd 11 days ago
      I mainly use copilot for its ability to figure out patterns. It’s pretty nice when working with frontend code with tailwind. Not perfect but has sped development up a bit. One thing I like is that if I put a rem size in CSS and want to comment what the equivalent pixel value is, it does the math.
      • ptx 11 days ago
        But the equivalent pixel value depends on the root element font size, so the comment will be wrong when that changes. If you leave the math to the browser dev tools you'll get accurate results without any AI figuring out patterns.
        • dawnerd 10 days ago
          Yep but in our workflow we’ve never deviated from 16px base. The comments in the code are purely to help when translating designs to rem in particular with tailwind.
  • notatoad 11 days ago
    Don't have faith in LLMs. don't trust them.

    treat them like a new hire - you've got to take some time to judge what their capabilities are, how they're going to fit into your workflow, and how you can make their output useful. they're going to be bad at some things, and they're going to be good at some things. and maybe they're not going to be a good fit at all. but sometimes they will be really helpful, and ignoring that and deciding to do everything yourself instead is going to make you less productive in the long run.

    • chrisjj 10 days ago
      > Don't have faith in LLMs. don't trust them.

      > treat them like a new hire

      I expect a new hire to learn post-hire. These LLM's learn nothing post-hire, right?

      • fragmede 10 days ago
        there's fine tuning happening in some systems, so without knowing which system you're using, a "yes" answer could be incorrect.
  • GregDavidson 10 days ago
    Thanks to influence from Dijkstra, Wirth, et al in my formative years, when I program I construct a (usually informal) correctness proof simultaneously (well, kind of interleaved) with my construction of the code. The two constructions assist one another and converge to procedures which solve the required problem. I find this approach more productive than an ad hoc (hack & debug until done) approach. The correctness proof supports the correctness of the solution and the solution is usually more elegant (simpler, etc.). I annotate the code with elements of the proof to assist with maintenance.

    I think that most programmers do this to some degree, i.e. have an internal argument about why what they're doing will work and include some assertions and comments, etc. When I'm given some hunk of complex procedural code lacking strong types, preconditions, postconditions, invariants, arguments bridging such, etc. I don't find it very useful. Proving the correctness of complex ad hoc code is often harder than just solving the problem again from scratch.

    There are AI automated reasoning systems (based on incremental theorem provers) that can help me write good code and I follow the evolution of such tools. So far, the code I'm seeing from llm systems seems like a maintenance and reliability nightmare.
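
    A toy illustration of what I mean by annotating the code with elements of the proof - informal preconditions, postconditions and a loop invariant carried as assertions and comments (my own example, not LLM output):

      def isqrt(n: int) -> int:
          """Largest r such that r*r <= n."""
          assert n >= 0, "precondition: n must be non-negative"
          r = 0
          # Invariant: r*r <= n holds at the top of every iteration.
          while (r + 1) * (r + 1) <= n:
              r += 1
          # Postcondition: r*r <= n < (r+1)*(r+1), i.e. r is the integer square root.
          assert r * r <= n < (r + 1) * (r + 1)
          return r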
  • DSpinellis 11 days ago
    ChatGPT can generate wrong code but it can also be a great helper. I've summarized the good, the bad, and the ugly in a recently published IEEE Software column: https://doi.org/10.1109/MS.2024.3363848
  • stephc_int13 11 days ago
    From my experience, the more you learn about how the sausage is made, the less you want that stuff in your codebase.

    I think that this tech could be useful at some point, with a lot of well-engineered guardrails around it, maybe. And this will mostly be for boilerplate and the most generic parts of your code.

  • SkyPuncher 11 days ago
    You basically picked the worst possible task for ChatGPT and are drawing a broad conclusion from it.

    ChatGPT is _very_ good at taking human-language descriptions and transforming them into code. It works even better when instructions are concise and clear. My theory is that this stems from the fact that there’s a lot of human description of code for the LLM to train on. However, there’s relatively little machine-to-machine code available.

    The task you’re looking to accomplish will likely have much better success if you break it into two “human-to-code” steps.

    1. Ask ChatGPT to explain what the Python code is doing, line-by-line

    2. Then ask ChatGPT to write the equivalent Lisp code for the description.

    • chrisjj 10 days ago
      > You basically picked the worst possible task for ChatGPT

      Wow, what were the odds on that?? :)

      I'd be interested in verification, so will start a new thread for that.

      > The task you're looking to accomplish will likely have much better success if you break it into two "human-to-code" steps.

      > 1. Ask ChatGPT to explain what the Python code is doing, line-by-line

      The answer is "However, there's a logical flaw and a few issues in the implementation:" followed by points, every one of which is in error, showing ChatGPT has deeply mis"understood" the code.

      > 2. Then ask ChatGPT to write the equivalent Lisp code for the description

      The result fails with "odd number of &KEY arguments".

    • ptx 11 days ago
      Like XMLVM[0] with English instead of XML.

      [0] http://www.xmlvm.org/overview/

    • ssl-3 11 days ago
      Yep. It often "reasons", as much as it does, best with regular language.

      And sometimes, just like with real people, it works better if one or more intermediate steps are interposed. But it doesn't necessarily "know" this -- just as real people mightn't.

      It's not unfair to say "Yeah, well if the bot was so smart then why didn't it just take care of that intermediate step for me?"

      But it's also not unfair to retort "Well, if you were so smart then why did you expect perfect results from an imperfect tool without expending any real effort to understand and/or explain the thing on your part?"

      • Jensson 10 days ago
        > But it's also not unfair to retort "Well, if you were so smart...

        But it was the tool being evaluated, not the person. If the person has to code in natural language then why not just code it directly?

  • xhkkffbf 11 days ago
    Here's what I usually get from them:

    accurate function header
    accurate glue code
    wacky invocation of some API that may or may not exist
    accurate glue code

    It's rare for me to get anything really useful. I suppose the accurate glue code can be helpful.

  • tardisx 11 days ago
    Using LLMs for software development is cargo cult programming, automated.

    The problem is not "incorrect" (syntactically) code. The problem is the developer trusting the AI to understand the nuance of the problem space, and to write the code the "right way".

    As someone wise once said "writing the code was never the hard part".

    A colleague recently used an LLM to complete some work, which required some concurrency - which they were clearly unfamiliar with. The LLM created a byzantine mess of multiple channels (which immediately deadlocked) when all that was needed was a single mutex.

    A two minute human conversation would have turned up "you just need a mutex".
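
    What "you just need a mutex" looks like, as a minimal sketch (Python and the names here are illustrative, not the colleague's actual code):

      import threading

      class Counter:
          """Shared state guarded by a single mutex; no channels required."""
          def __init__(self):
              self._lock = threading.Lock()
              self._value = 0

          def increment(self):
              with self._lock:   # the one lock serialises all writers
                  self._value += 1

          def value(self):
              with self._lock:
                  return self._value

      counter = Counter()
      threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
                 for _ in range(8)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(counter.value())     # 8000, every time
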

    Even in the unlikely scenario that the AI generated solution worked, did the developer learn anything? No. Was the codebase made more or less maintainable? Definitely less.

    I keep hearing that the LLMs will get better at understanding nuance and writing code that fits the problem space. It sounds an awful lot like how the crypto bros kept telling us that blockchain was the future of money and we'd soon all be rich.

    The half-on-the-fence AI users tell me "I agree, but I just use it to autocomplete functions or write small things". I can do that myself - and I get the opportunity for my own brain to do its own concurrency and interrupt me with "actually, maybe this is not the right way to do this".

  • fwlr 11 days ago
    Mismatched brackets seems like a lower-level job - better done by a linter, not an LLM. Likewise, I wouldn’t rely on an LLM to tell me what properties are on the response object - that job is better done by e.g. TypeScript.
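
    For a sense of how mechanical that check is: a paren-balance check is a few lines of ordinary code (a rough sketch, not any particular linter):

      def first_unbalanced_paren(source):
          """Return the index of the first unmatched paren, or None if balanced.

          (Ignores parens inside strings and comments; a real Lisp reader
          tracks those too.)
          """
          stack = []
          for pos, ch in enumerate(source):
              if ch == "(":
                  stack.append(pos)
              elif ch == ")":
                  if not stack:
                      return pos       # a ")" with no matching "("
                  stack.pop()
          return stack[0] if stack else None   # leftover "(" if any

      # e.g. first_unbalanced_paren('(format t "~a" (solve *a*)))')
      # -> index of the extra ")" at the end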

    Analogously, “convert this function from Python to lisp” seems like a higher-level job, better done by a programmer human (or at least, it’s too high for the current level of available code LLMs).

    There is still a valuable level that code LLMs can operate at, in my experience - something like a cross between “a more advanced autocomplete” and “a templating/scaffolding/boilerplate engine”.

    • chrisjj 10 days ago
      > Mismatched brackets seems like a lower-level job - better done by a linter, not an LLM.

      Mismatched brackets is not any /job/. And /matched/ brackets are part of the job this LLM claimed to do. Writing runnable code.

  • chrisjj 9 days ago
    PS Using chatgpt.com ChatGPT 3.5, input was:

      Translate this to Common Lisp.
    
      def solve(graph):
        a = 0
        while a < len(graph):
          for b in range(len(graph)):
            if b > a and graph[a].intersection(graph[b]):
              graph[b] = graph[a] | graph[b]
              graph.pop(a);a-=1
              break
          a+=1
        return graph
    
      a = [{1, 2}, {4, 5}, {1, 3}, {1, 4}]
    
      print((solve( a )))
    
    Output today is:

      (defun solve (graph)
        (let ((a 0))
          (loop while (< a (length graph))
          do (loop for b below (length graph)
             do (when (and (> b a)
               (intersection (elt graph a) (elt graph b)))
            (setf (elt graph b) (union (elt graph a) (elt graph b)))
            (setf graph (remove (elt graph a) graph))
            (decf a)
            (return)))
             (incf a))
          graph))
    
      (defparameter *a* '((1 2) (4 5) (1 3) (1 4)))
    
      (format t "~a" (solve *a*)))
    
    and though this differs from the original attempt's output, it fails to compile the same way - "unmatched close parenthesis" on graph)).

    Using Bing.com GPT-4 the output again differs, but again fails the same.

  • deathanatos 11 days ago
    > I wonder how much user faith in ChatGPT is based on examples in which the errors are not apparent ... to a certain kind of user.

    I think this is it, basically. A certain segment of programmers are simply not skeptical enough. They lack the same sort of skepticism that will keep jepsen.io going until the heat death of the universe, debugging faulty distributed consensus algos.

    My coworker recently generated a bash script with it. The shell was terrible, full of novice shell anti-patterns. But this makes "sense" — what was it trained on, after all? Most shell in its training set would be of low quality, because most shell is of low quality.

    The script was "determine lines added/removed by author in git repo". The `git log` call it made was bizarre: it emitted way more information than it needed to, then filtered the extraneous bits back out with `grep`, when it could have just never output them in the first place. Things like `| grep | awk`, lack of quoting where needed, etc. Multiple loops through the log output where one would do (and having to slurp it all into memory because of that, where a simple `git log | <process output>` would suffice). The regexes to match the log output were wrong, resulting in the wrong output. But that didn't stop multiple people from going "nice" or "cool script" first.
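
    For comparison, the whole job is one pass over the log if you ask git for exactly what you need; a sketch (in Python rather than shell) assuming the standard `--numstat` and `--format=%an` options:

      import subprocess
      from collections import defaultdict

      # One pass: an author line per commit (--format=%an) followed by
      # per-file "<added>\t<removed>\t<path>" lines (--numstat).
      log = subprocess.run(
          ["git", "log", "--numstat", "--format=%an"],
          capture_output=True, text=True, check=True,
      ).stdout

      added, removed = defaultdict(int), defaultdict(int)
      author = None
      for line in log.splitlines():
          parts = line.split("\t")
          if len(parts) == 3 and parts[0].isdigit():   # numstat line (binary files show "-")
              added[author] += int(parts[0])
              removed[author] += int(parts[1])
          elif line.strip():                           # anything else non-empty is an author name
              author = line.strip()

      for name in sorted(set(added) | set(removed)):
          print(f"{name}: +{added[name]} -{removed[name]}")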

  • simonw 11 days ago
    It's more reliable for code than it is for anything else.

    If it hallucinates something, the code won't run - or it won't work correctly when you test it.

    Much easier to "fact check" code than prose.

  • jasonjmcghee 11 days ago
    As others have mentioned - code should rarely involve trust. This is what code review is for. Even your mentor / cracked engineer makes mistakes. Code review and tests are always critical. Whether it's a human or an AI, code and the code that tests it needs to be reviewed / validated.
  • BugsJustFindMe 11 days ago
    I trust ChatGPT as much as I trust my co-workers. That is to say, I don't need to trust them, because I have eyes and a brain.

    "This code doesn't do what you say it will" is not a review comment reserved exclusively for machine-generated code.

    • chrisjj 10 days ago
      > I trust ChatGPT as much as I trust my co-workers. That is to say, I don't need to trust them, because I have eyes and a brain.

      ... and 100 hours a day to review and test their code? :)

      • BugsJustFindMe 10 days ago
        It takes them longer to write the wrong code than it takes me to look at it and tell them where it's wrong.

        And yes, reviewing code is part of my job. And the reason it's part of my job is that people who get hired as programmers are often very bad at not making obvious mistakes.

        • chrisjj 10 days ago
          It had not occurred to me that anyone would hire a programmer who submitted code that fails even to compile, as per ChatGPT.
          • BugsJustFindMe 10 days ago
            > It had not occurred to me that anyone would hire a programmer who submitted code that fails even to compile, as per ChatGPT.

            I would say that this points more to the narrowness of your industry experience than anything else.

            Failing to compile is one of a million different ways that code doesn't produce the claimed or desired or correct behavior. It's not particularly special. All code needs to be reviewed, regardless of who/what wrote it and how.

            • chrisjj 10 days ago
              > I would say that this points more to the narrowness of your industry experience than anything else.

              Good point. I'll correct to:

              It had not occurred to me that anyone would /knowingly/ hire a programmer who submitted code that fails even to compile, as per ChatGPT.

              Anyone who would knowingly could well be delighted to use ChatGPT instead.

              • BugsJustFindMe 10 days ago
                > It had not occurred to me that anyone would /knowingly/ hire a programmer who

                They don't. Hiring happens before the first commit. This incredulity about programmers not all being consistently excellent is what makes it sound like your practical experience is very limited.

                Point of fact, you're here complaining that a product (ChatGPT 3.5) known to be substantively shitty is substantively shitty, and when people say "you know v4 is very significantly better, right?" you say "but that one isn't free". That gives a strong vibe that you know you're doing something bad and just don't care. Is that the vibe of consistent excellence? I would say no.

                > submitted code that fails even to compile

                You seem stuck on this business of not compiling when that failure mode is in fact far less consequential than literally any other failure mode. Wake me up when humans stop writing code that compiles and runs and tests just fine while doing the wrong thing literally all the time.

                Asking how anyone can blindly trust ChatGPT code is a fundamentally broken question. You can't blindly trust human code either. Trust has to be earned by past demonstrations of performance within the specific domain.

                A reviewer earns trust by consistently catching other peoples' mistakes. A programmer earns trust by consistently not making obvious mistakes in the first place. When I see a reviewer who consistently fails to catch other peoples' mistakes, I do not trust that reviewer.

                Giving a human more inherent trust than you give ChatGPT without experiencing sufficient past performance to have earned it would be your own failure. Humans are very stupid and fuck up constantly.

                • chrisjj 10 days ago
                  > This incredulity about programmers not all being consistently excellent

                  I wouldn't say programmers whose code compiles are all excellent. And I didn't.

                  > when people say "you know v4 is very significantly better, right?" you say "but that one isn't free".

                  I think you must be mistaking another poster for me. I didn't say that either.

                  • BugsJustFindMe 9 days ago
                    > > when people say "you know v4 is very significantly better, right?" you say "but that one isn't free".

                    > I think you must be mistaking another poster for me. I didn't say that either.

                    Is this you? https://imgur.com/5wCeTZg

  • m12k 11 days ago
    I find it to be a great way to look up syntax or APIs, and generate simple examples of using these. So for example "In [language] how do you parse JSON?" will give me a usable example to follow much faster than trying to find the anointed Stack Overflow question and then sift through outdated answers to find the modern one.
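
    For Python, for example, the kind of usable snippet it hands back is on the order of this (standard library only; the file name is just an example):

      import json

      raw = '{"name": "widget", "tags": ["a", "b"], "price": 9.99}'
      data = json.loads(raw)            # parse a JSON string into Python objects
      print(data["name"], data["tags"][1], data["price"])

      with open("config.json", "w") as f:
          json.dump(data, f, indent=2)  # write it back out to a file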

    If you need to revive a language you haven't coded in for a while (or regex or sql), ChatGPT can really help get you off the ground much faster, though you'll still need to understand the output it gives you and compose it together yourself.

  • exe34 11 days ago
    I don't trust code from Stack Overflow either. I always put the found code into a minimal working example, where I fix the obvious issues, convince myself it's doing what I want and not doing anything I don't understand, and then put that into my main project.

    I've had a lot of success with ChatGPT telling me what libraries to use, what classes/functions I need, etc., where it would have taken me ages to find out by reading documentation and even longer by going through tutorials that leave out the more interesting stuff.

  • cnotv 10 days ago
    If you are used to planning your code and writing down what you need to do, with lists and everything, it usually does a good job of creating simple functions, RegEx rules, and finding the snippet of documentation you need. Even if I could code everything without a single issue, it saves time.

    If you have not understood yet where it helps, maybe you should play around with it a bit more.

    For writing unit tests it's often perfect, and way better than some of my colleagues, especially the older ones who don't trust tests and have never written any.

  • err4nt 10 days ago
    The people who trust it are the people who either can't tell the difference, or for whom the code ChatGPT outputs is comparable to or better than what they could write themselves.

    It seems to work okay in the other direction though: if code is well named and well commented, ChatGPT can explain what it does in natural language tailored to people of varying backgrounds and levels of experience. But it's not so good at taking natural language and turning it into code, which has to be much stricter.

  • jokethrowaway 11 days ago
    1. I don't trust code from humans either

    2. You can't trust LLMs for anything. They are large language models; you could consider an LLM having knowledge at all to be a bug. Their architecture is not meant to reason.

    3. ChatGPT and co really need a feedback loop with compilers to drastically improve (http://devin.ai/ is working on that)

    4. If you pick a popular language you get good results, because it was trained more on those. JS and Python are OK; Rust and Haskell are never correct

  • speak_plainly 11 days ago
    I’ve had great success with ChatGPT 4 writing some simple Python scripts and it has really helped speed up my workflow and helped me address issues as they cropped up. I did have trouble when I tried to get it to integrate and combine all these scripts into one larger one but I’m sure with some work I could get things working. Perhaps there are issues with the code (and I’m a certain kind of user) but I really just care about the output and having this assistance has been a godsend for me.
  • paradite 10 days ago
    One thing that matters a lot is the prompt. A good prompt can dramatically improve the performance and quality of the code output from ChatGPT.

    It took me months of trial and error to optimize the prompt for using ChatGPT to write code. I wrote a blog post summarizing the best practices:

    https://16x.engineer/2024/02/03/chatgpt-coding-best-practice...

    • chrisjj 10 days ago
      Thanks.

      My prompt followed all the best practice offered there. However that practice appears to assume program generation not translation.

  • calderknight 11 days ago
    What model were you using? What prompt did you use?

    You can learn to trust it for some tasks that it's reliably good at.

    But for the most part, you don't trust it - you read through it and check it.

    • chrisjj 10 days ago
      ChatGPT.

      Translate the following Python program to Lisp.

      Yes, I forgot to add "... and no crap." ;)

      • calderknight 10 days ago
        ChatGPT is just the brand. I guess you're using GPT-4, but if you're using the default model (GPT-3.5) that would certainly explain below-expectation results.
        • chrisjj 10 days ago
          > ChatGPT is just the brand.

          "ChatGPT" is a model, says its provider.

          But yes, the web page says model GPT-3.5.

          • calderknight 10 days ago
            Mystery solved! The answer to your OP is that the reason anyone can trust ChatGPT for code is that they use a much better model than the one you're using! GPT-3.5 is ancient and way behind GPT-4. In fact, there are now tens of organisations who have developed model classes way ahead of GPT-3.5.
            • chrisjj 9 days ago
              > the reason anyone can trust ChatGPT for code is that they use a much better model than the one you're using! GPT-3.5 is ancient and way behind GPT-4.

              Testing that theory now using Bing Copilot GPT-4.

              Failed.

              unmatched close parenthesis Line: 11, Column: 11, File-Position: 408

              Input is:

                Translate this to Common Lisp.
              
                def solve(graph):
                  a = 0
                  while a < len(graph):
                    for b in range(len(graph)):
                      if b > a and graph[a].intersection(graph[b]):
                        graph[b] = graph[a] | graph[b]
                        graph.pop(a);a-=1
                        break
                    a+=1
                  return graph
              
                a = [{1, 2}, {4, 5}, {1, 3}, {1, 4}]
              
                print((solve( a )))
              
              Output is:

                (defun solve (graph)
                  (let ((a 0))
                    (loop while (< a (length graph)) do
                      (loop for b from 0 below (length graph) do
                  (when (and (> b a) (intersection (nth a graph) (nth b graph)))
                    (setf (nth b graph) (union (nth a graph) (nth b graph)))
                    (setf graph (remove-if (lambda (x) (equal x (nth a graph))) graph))
                    (decf a)
                    (return))))
                      (incf a))
                    graph))
              
                (let ((a (list (list 1 2) (list 4 5) (list 1 3) (list 1 4))))
                  (format t "~a~%" (solve a)))
              • calderknight 8 days ago
                Yeah, if going that route the answer to the OP is "most people aren't using it for Lisp".
  • toddmorey 11 days ago
    I think of it as not a solutions engine but a suggestions engine.

    Sometimes what it suggests won’t work at all. But other times it opens my thinking to an approach I had too much tunnel vision to see.

    I end up writing the code ultimately. There was only once a tiny function I could paste in directly, unedited. But still helpful.

    One great technique can be to ask it for options: “List five ways this code could be improved / more readable.” Usually about 1 in 5 suggestions will lead to something interesting.

  • mst 11 days ago
    There's a long tradition of sharing illustrative code snippets on IRC by typing it straight into a pastebin linked to the channel.

    The odds of them containing at least one stupid and one subtle mistake are very high, but they're immensely effective at illustrating a general shape to another developer in a way they can pick up, cross-reference with the docs, and turn into working code in their project.

    Treating LLM generated code the same way has been working out well for me so far.

    • chrisjj 10 days ago
      I guess the problem then could be ChatGPT's success at encouraging users to treat its output as more than that.
  • aranchelk 11 days ago
    Big picture, IMO most devs (especially more junior ones) are far better at writing code than reading it; that's part of the allure of green field projects.

    No, you don’t trust any LLM output, just as you shouldn’t trust code you get from Google, SO, etc. You read it, understand it, and decide to use it, alter it, or toss it.

    Before LLMs put everyone out of a job, I think code literacy, along with (obviously) LLM-specific skills, is going to be very important.

  • icedchai 11 days ago
    Do you trust yourself or your coworkers to write correct code, without being tested or even run? Do you really expect an LLM to do better?
    • chrisjj 10 days ago
      I do not deliver untested code.

      This "AI" does.

      • icedchai 9 days ago
        I really don't see the problem. It's a tool, not a coworker. It's up to you to test it. If you find a solution from googling, you still need to test it. It's the same thing.
        • chrisjj 9 days ago
          The coworker comparison is not mine. I too consider it invalid.
  • itronitron 11 days ago
    IDEs used to be very capable development environments. They could do magical things (by today's standards) because they understood the structure of the language as well as available resources. The current batch of AI doesn't seem to take this into consideration, whether out of laziness, incompetence, or some strategic or cynical view.
  • lejohnq 11 days ago
    I wouldn’t say I trust ChatGPT without verification yet, but it definitely accelerates me by generating good enough code for me to quickly debug.

    I am a much faster debugger than a from-scratch generator, so ChatGPT etc. help accelerate me through those moments even when I know exactly what I want to write.

    This makes it a useful enough tool to use any time I'm writing code.

  • jghn 11 days ago
    In the before times, would you copy/paste something directly from StackOverflow or the like?

    No of course not. You'd look it over, assess if it does what you think is the right thing, and consider if that's really what you want. But more likely you look at it, decide it's not quite what you need, and modify it as appropriate.

    LLMs are no different.

    • itronitron 11 days ago
      There are a lot of comments equating stuff posted on SO with the stuff that ChatGPT produces, but in the before times code posted on SO was likely written with a specific objective in mind, while the GPT generated stuff is not.

      The best analogy I can think of is the distinction between aiming a gun (SO, spitball examples) and waving a gun (AI/ChatGPT). Both require you to double check where the gun is pointed but with the former you at least have some confidence that it's pointed in the right direction.

      • jghn 11 days ago
        Sure, these differences exist. I don't find the quality to be much different between the two sources. I treat both of them as a probably wrong, likely inexperienced developer. Sometimes they pleasantly surprise me. More often than not they don't but in both cases I make use of those resources because they at least move me in the right direction.
    • chrisjj 9 days ago
      > In the before times, would you copy/paste something directly from StackOverflow or the like?

      No. I'd first check the votes and comments from other users.

      > LLMs are no different.

      One difference is no votes and comments from other users.

  • renewiltord 11 days ago
    You don't have faith or trust or anything. You use it like a tool. I don't trust my IDE and select the first autocomplete solution. It's just a tool. When Google came out a lot of people were like "How can you trust Google? I searched for X and it gave me crap". Yeah, you can't trust Google like that.
    • chrisjj 10 days ago
      > I don't trust my IDE

      You trust it not to prang your program, right?

      > and select the first autocomplete solution.

      I have never seen an IDE mess up Lisp brackets.

  • jug 11 days ago
    Key is to not trust it, but use AI as an assistant that you don't trust. Even then, it can save you hours of work.
  • andrewstuart 11 days ago
    You're holding it wrong.

    ChatGPT does not do your job for you.

    It is an assistant - you have to do the debugging and adapting and understanding and verification and fixing.

    And, ChatGPT is better at coding in programming languages that have more training material, so the more popular the languages, the more likely ChatGPT is to give a meaningful result.

  • rokkitmensch 10 days ago
    This is a very Lisper objection. The thing is a token predictor, and can't count levels of depth.

    Think "autistic junior engineer, whose work needs lots of testing but is also prolific at writing tests" instead of "Godlike text generator"; it's much more productive.

    • chrisjj 10 days ago
      > This is a very Lisper objection.

      Why thank you! :)

      > The thing is a token predictor, and can't count levels of depth.

      Er, the average C program has more depth than this small Lisp one.

      If it can't handle levels of depth, then why is it even claiming ability to code?

  • analyte123 11 days ago
    Are you using GPT-4 or the free version?
  • fragmede 10 days ago
    You're holding a hammer and hit your thumb. Do you swear off hammers as the devil, or do you learn to use the tool better? LLMs work on tokens; I wouldn't expect them to be good at matching parens. Even the very best hammer is a terrible screwdriver.
    • chrisjj 10 days ago
      A paren is a token, last time I looked.

      And even Wikipedia agrees.

  • furyofantares 11 days ago
    I read the code. And I would have just added the bracket myself if the code was otherwise correct.
    • chrisjj 10 days ago
      As I said, it wasn't.
  • cirrus3 11 days ago
    I don't think anyone who has used it frequently has any assumptions it is correct. You almost always have to edit it either for bugs or just plain incorrect... but it can still be a very helpful tool to get you a nice starting point.
  • alkonaut 11 days ago
    Try making a 200-line chunk of boilerplate for something mundane like a big table in a React page. This is where it saves you googling parameters and doing copy-paste. And you don’t need to trust it; just test it after it’s done.
    • onkarshahi1 1 day ago
      Creating a doctor dashboard to see their patients with appointments for a new consultation, a follow-up, or a test-report review. The doctor should be able to see the complete timeline of the patient's encounters, symptoms, doctor observations, medication (current or past) and test reports for diagnostics or radiology. They should be able to compare reports to see the progress of medication.
  • petabyt 10 days ago
    The utility of ChatGPT is to spit up common code that it was trained on. It will write a mandelbrot renderer just fine and can also explain it. Try to get it to generate any Brain*ck and it will fail miserably.
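
    It really is well-trodden ground; the whole thing fits in a dozen lines (a sketch of the classic ASCII version, not ChatGPT's actual output):

      # A tiny ASCII Mandelbrot renderer: the kind of program with thousands
      # of near-identical copies in any training set.
      chars = " .:-=+*#%@"
      for row in range(24):
          line = ""
          for col in range(78):
              c = complex(-2.0 + 3.0 * col / 77, -1.2 + 2.4 * row / 23)
              z, it = 0j, 0
              while abs(z) <= 2 and it < 40:
                  z = z * z + c
                  it += 1
              line += chars[int(it / 40 * (len(chars) - 1))]
          print(line)
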
  • superfrank 11 days ago
    I trust it in the same way I trust a junior developer or intern. Give it tasks that I know how to do and can confirm are done right, but don't want to spend time doing myself. That's the sweet spot.
  • Harmohit 11 days ago
    I am not an LLM lover, but I think you are thinking about it the wrong way. I think that most of the value LLMs provide for writing code is in syntax suggestions and being a slightly better autocomplete.
  • certkindausr 11 days ago
    It took a bit of trial and error to have "G" write a local server with live reload. It works, and all security issues were taken care of.

    Faith can be replaced by iterating until you don't need to trust anymore.

    "what security issues might become an issue?" "how can i test that fix?" "explain this part of the code." "browse the net for security issues for this part."

    "how does lisp use brackets and indentation?" "use what you just taught me and fix your mistakes in your translation of these 12 lines of python code."

  • TachyonicBytes 11 days ago
    In an ironic twist, the original Lisp paper has mismatched parentheses.
  • rhelz 11 days ago
    Treat it like the best educated, most well-informed intern in the world. Usually gives good answers, but still, not a seasoned professional yet, so check the work accordingly.
    • chrisjj 10 days ago
      But interns learn from corrections. This product does not.
      • rhelz 9 days ago
        Have you heard the expression "Hire a teenager while they still know everything" :-)
  • nforgerit 11 days ago
    The best thing I can say about evaluating LLMs for code is that they reminded me of the infinite monkey theorem.
  • yencabulator 10 days ago
    It's funny how many LLM enthusiasts think they can notice bugs by reading code. More likely you're just nodding your head along and missing the bugs.

    When someone posted LLM-generated elisp on HN, it used a non-existent escape code in a regexp. Huge conversation, nobody but me realized that bug.

    • lolinder 10 days ago
      I don't think I can notice bugs by reading code. I think I can notice bugs by:

      * Writing code in a language with a static type system.

      * Reading the code and "playing computer".

      And, last but by far the most important:

      * Actually testing the code

      Turns out that using an LLM to generate code doesn't preclude testing it!
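
      Using the thread's own example: the Python solve quoted upthread merges sets that share elements, so a few asserts pin down the behaviour any translation of it has to match (a minimal sketch):

        def solve(graph):
            # The Python original quoted upthread, unchanged in behaviour.
            a = 0
            while a < len(graph):
                for b in range(len(graph)):
                    if b > a and graph[a].intersection(graph[b]):
                        graph[b] = graph[a] | graph[b]
                        graph.pop(a); a -= 1
                        break
                a += 1
            return graph

        # Sets that share any element end up merged into one.
        assert solve([{1, 2}, {4, 5}, {1, 3}, {1, 4}]) == [{1, 2, 3, 4, 5}]
        # Disjoint sets are left alone; the empty input is the empty output.
        assert solve([{1, 2}, {3, 4}]) == [{1, 2}, {3, 4}]
        assert solve([]) == []
        print("all checks passed")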

      • yencabulator 10 days ago
        Writing tests after the fact is way harder than writing code to be testable. You're pretty much guaranteed to miss edge conditions, which tend to be the cases LLMs often screw up.

        The rampant optimism is just weird to me.

        • lolinder 10 days ago
          You seem to be assuming that we're generating large amounts of code with LLMs that we don't read and adapt and make our own. I can definitely see that as a risk for overconfident juniors, but that is not the MO for senior developers who are using these tools.

          95% of the code that I add from an LLM is less than a line, a nice autocomplete. The other 5% is a function that I have ChatGPT get started for me and then I read it and refactor it until I understand it as well as any code I've written myself.

          The rampant fearmongering is just weird to me. It's a tool, and like any tool some people abuse it, but it's weird that some people simply can't believe that there are people who use it well.

          • yencabulator 10 days ago
            I see people overestimating their ability to notice bugs.
  • minikomi 11 days ago
    Which model, which code. We're going to need receipts before going forward.
  • wolpoli 11 days ago
    For context, ChatGPT doesn't seem to be very good at less popular languages. In my experience, it's able to generate good C# code, but it is very poor at generating AutoHotkey code. So it's a similar situation here with Lisp.
  • elicksaur 11 days ago
    People nitpicking the word “trust” are missing the point.

    There’s an obvious and growing dichotomy in experience using LLMs to assist in programming. One camp claims it 2-10x’s their productivity. The other claims it is trash and useless.

    A much more interesting thread to pull on would be how these subjective experiences can be so far apart while using the same products.

    • chrisjj 10 days ago
      The difference is surely QA. High productivity is easy where the quality bar is low.
  • clarkdale 11 days ago
    Sometimes I forget a parenthesis here and there.
  • jeroenhd 11 days ago
    It works fine for C# and Java. Usually works fine for Rust. The context window is too small for anything more than a quick demo and it often just assumes that certain types and classes exist, but it's good enough that you can get a working stub in no time. This is especially useful when dealing with (open source) libraries that lack decent documentation.

    Obviously you inspect and validate the bullshit it generates, but generally it produces some very useful bullshit.

    The output is as good as the data set available on the input side. I don't think many people use Lisp, so the Lisp output will be worse. Other uncommon languages have the exact same problem.

  • boredemployee 11 days ago
    I feel sorry for those it doesn't work for; for me it works like a charm
  • 77pt77 11 days ago
    I asked it to write code to calculate the arithmetic average of a list of ints in OCaml.

    It used only integer operators. Nowhere were floats involved.

    I asked "why is there no +. or /. operator"

    It completely bullshitted me, saying + is polymorphic in OCaml.

    Surprised, I asked since when. It gave me a Caml version over 20 years old.

    I fired up a REPL and, obviously, + only works on ints.

    When I called it out zero fs were given.

    I'll pass, thank you very much.

  • rickcarlino 11 days ago
    From “Cheating is All You Need”:

    https://sourcegraph.com/blog/cheating-is-all-you-need

    One of the craziest damned things I hear devs say about LLM-based coding help is that they can’t “trust” the code that it writes, because it “might have bugs in it”.

    Ah me, these crazy crazy devs.

    Can you trust code you yeeted over from Stack Overflow? NO!

    Can you trust code you copied from somewhere else in your code base? NO!

    Can you trust code you just now wrote carefully by hand, yourself? NOOOO!

    All you crazy MFs are completely overlooking the fact that software engineering exists as a discipline because you cannot EVER under any circumstances TRUST CODE. That’s why we have reviewers. And linters. And debuggers. And unit tests. And integration tests. And staging environments. And runbooks. And all of goddamned Operational Excellence. And security checkers, and compliance scanners, and on, and on and on!

    So the next one of you to complain that “you can’t trust LLM code” gets a little badge that says “Welcome to engineering motherfucker”. You’ve finally learned the secret of the trade: Don’t. Trust. Anything!

    Peeps, let’s do some really simple back-of-envelope math. Trust me, it won’t be difficult math.

    You get the LLM to draft some code for you that’s 80% complete/correct. You tweak the last 20% by hand. How much of a productivity increase is that? Well jeepers, if you’re only doing 1/5th the work, then you are… punches buttons on calculator watch… five times as productive.

  • j45 11 days ago
    A key is to not let it write something you don't understand and couldn't code yourself.

    GPT 4 is much better than the others.

    You should try something like Aider to get better output
