I scraped all of OpenAI's Community Forum

(julep-ai.github.io)

310 points | by alt-glitch 30 days ago

17 comments

  • xfalcox 29 days ago
    That's super cool, thanks for sharing! I will share this as an easy-to-follow example of what we can do with AI.

    > Allowing a Q&A interface using these embeddings over the post contents could speed up research over the community posts (if you know the right questions to ask :P). Let's view some posts similar to this one complaining about function calling

    That's indeed a great thing to surface, and that's exactly how the OpenAI forum selects the "Related Topics" shown at the end of every topic. We use embeddings for this feature, and the entire thing is open source: https://github.com/discourse/discourse-ai/blob/main/lib/embe...
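
    The mechanism behind "Related Topics" is easy to sketch: embed every topic, then rank stored embeddings by cosine similarity to the current one. A minimal pure-Python illustration (the toy 3-d vectors and topic titles below are made up; real deployments use model-generated embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def related_topics(query_vec, topic_vecs, top_k=2):
    """Rank stored topic embeddings by similarity to the query embedding."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in topic_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

# Toy 3-d "embeddings" standing in for real model output.
topics = {
    "Function calling broken": [0.9, 0.1, 0.0],
    "Rate limit questions":    [0.1, 0.9, 0.1],
    "Tool use errors":         [0.8, 0.2, 0.1],
}
print(related_topics([0.85, 0.15, 0.05], topics))
```

    In production the vectors come from an embedding model and live in a vector index, but the ranking step is exactly this similarity sort.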

    We also use embeddings for suggesting tags, categories, HyDE search, and more. It's by far my favorite tech of this new AI/ML gen so far in terms of applicability.

    > Using Twitter-roBERTa-base for sentiment analysis, we generated a post_sentiment label (negative, positive, neutral) and post_sentiment_score confidence score for each post.

    We do the same, with even the same model, and conveniently show that information on the admin interface of the forum. Again all open source: https://github.com/discourse/discourse-ai/tree/main/lib/sent...
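
    The last mile of a classifier like Twitter-roBERTa-base is just a softmax over three class logits, from which the label and confidence score are read off. A sketch of that step only (the logits below are fabricated, and the label order is an assumption, not the model's documented output):

```python
import math

LABELS = ["negative", "neutral", "positive"]  # assumed label order

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_label(logits):
    """Return a (label, confidence) pair like the write-up's
    post_sentiment and post_sentiment_score fields."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], round(probs[best], 4)

# Fabricated logits for an unhappy post about function calling.
print(sentiment_label([2.1, 0.3, -1.5]))  # → ('negative', 0.8385)
```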

    Disclaimer: I'm the tech lead on the AI parts of Discourse, the open source software that powers OpenAI's community forum.

  • wavyknife 29 days ago
    (disclaimer: I work for Discourse)

    Discourse has an AI plugin that admins can run on their community to generate their own sentiment analysis (among other things), though it's not quite as thorough as this write up! https://meta.discourse.org/t/discourse-ai-plugin/259214

    We're always interested to see how public data can be used like this. It's something that can be a lot more difficult on closed platforms.

    • Aachen 29 days ago
      > helps you keep tabs on your community by analyzing posts and providing sentiment and emotional scores to give you an overall sense of your community for any period of time [...]

      > Toxicity can scan both new posts and chat messages and classify them on a toxicity score across a variety of labels

      Is that within the defined data processing purposes of all Discourse setups? Does the tool warn admins they might need to update their policies before being able to run this tool, perhaps needing to seek consent (depending on their jurisdiction and ethics)? It sounds somewhat objectionable, trying to guess my mental state from what I write without opt-in

      Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?

      > tagging NSFW image content in posts and chat messages

      • eddd-ddde 29 days ago
        I don't think there's anything left for you to consent to once you decide to post on a public forum. If I can read your post and guess your mental state, so can any other bot.
        • Aachen 29 days ago
          If you park your car on the side of the road, that also doesn't allow anyone to do with it what they please

          If you write an article and post it on your blog, people can't just come along and take the text verbatim

          If you license your blog as public domain, then someone takes the content and does something objectionable with it, you can (in many countries) still make use of moral rights if you'd wish to correct the situation

          If I post something publicly on a forum, I'm well aware I may have agreed or consented (depending on the forum) to terms that allow this type of processing, but that is not the default. There exist restrictions, both legally and morally (some legal ones are even called moral rights and are inalienable). Hence my question how this plugin handles extending the allowed data processing to cover taking the content and making automated decisions and claims that may or may not be accurate. I would not be comfortable with that being an automated behind-the-scenes process flagging my posts as good or bad towards the moderators, since they likely won't care to read back hundreds of comments and see whether the computer did a good job

      • wavyknife 29 days ago
        Discourse is not a centralized platform, so it's up to individual sites to ensure they're compliant with data and privacy regulations.
        • Aachen 29 days ago
          I mention that in the first non-quote sentence.
      • BadHumans 29 days ago
        More companies and communities than you think already do this without your knowledge let alone consent.
        • david_allison 29 days ago
          That doesn't mean we can't do better
          • BadHumans 29 days ago
            Better at what though? I don't even think it's a problem to begin with.
      • xfalcox 29 days ago
        > Is that within the defined data processing purposes of all Discourse setups?

        It's an optional plugin that can be enabled / disabled by the site admin. Those modules are all disabled by default, and each need to be enabled by the site owner.

        > Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?

        Discourse PMs can be read by admins, see the definition here: https://meta.discourse.org/t/guidance-and-best-practices-on-...

        • Aachen 29 days ago
          Of course an admin can always open up the database and read your forum PMs, that's not surprising. The very first line in the link you provided, however, is what I was worried about:

          > Moderators can read PMs that have an active flag.

          This system is now setting nsfw flags in an automated fashion, specifically seeking out content that the persons involved wouldn't want others to see. Clearly a forum is the wrong place for that content, but people don't always make good decisions (especially kids; I was a kid on forums too and would be very surprised if nothing ever transpired there). The receiving person can already flag anything they deem inappropriate. A system making automated decisions about messages that were intended to be private creates problems and it is not clear to me who this serves

          • wavyknife 29 days ago
            > it is not clear to me who this serves

            customers

  • SunlitCat 29 days ago
    I didn't even know they had community forums. Looking at the main homepage (openai.com), the only external links I can find are to ChatGPT and their docs hosted on platform.openai.com. The other links lead to their socials, GitHub, and SoundCloud (of all places).

    Maybe I'm not looking thoroughly enough, so I may be wrong, tho!

    • hughesjj 29 days ago
      I would also love to see these forums both to post and to lurk
      • djantje 29 days ago
        https://community.openai.com/ (when you are logged in on platform.openai.com, there is a link from the menu)
        • SunlitCat 29 days ago
          Thank you!

          Gone are the days when you simply saw all the important links on the main page, it seems. :)

  • miduil 29 days ago
    That's an interesting write-up, I wonder how this would look for other big Discourse communities such as NixOS.
    • alt-glitch 29 days ago
      This is definitely a workflow we can package into something open-source.

      I wonder how the community moderators would like it.

      • dcreater 29 days ago
        I for one would love it!
  • klooney 29 days ago
    What's the "Day Knowledge Direction" cluster in the Atlas view?
  • fzysingularity 29 days ago
    So epic, thank you for making this dataset available to everyone!
  • alright2565 29 days ago
    I saw this part:

    > Every Discourse Discussion returns data in JSON if you append .json to the URL.

    then this:

    > Raw data was gathered into a single JSONL file by automating a browser using Playwright.

    Kinda seems to me like having a whole browser instance for this isn't necessary? I would have been surprised if this .json pattern didn't continue for all pages, and it turns out that it does in fact also work for the topic list: https://community.openai.com/latest.json
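
    The `.json` trick is mechanical enough that a scraper needs no browser at all, just an HTTP client. A hedged sketch: the helper below only rewrites a Discourse page URL into its JSON endpoint (the endpoint shape follows the pattern described above).

```python
from urllib.parse import urlsplit, urlunsplit

def discourse_json_url(url: str) -> str:
    """Turn a Discourse page URL into its JSON API equivalent
    by appending .json to the path (query string preserved)."""
    parts = urlsplit(url)
    # A bare root URL defaults to the /latest topic list.
    path = parts.path.rstrip("/") or "/latest"
    return urlunsplit(parts._replace(path=path + ".json"))

print(discourse_json_url("https://community.openai.com/latest"))
# → https://community.openai.com/latest.json
# Then fetch with any client, e.g. requests.get(json_url).json()
```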

    The other place I've seen this sort of API pattern is reddit. For example, https://www.reddit.com/r/all.json or (randomly chosen) https://www.reddit.com/r/mildlyinfuriating/comments/1bqn3c0/...

  • velid0 29 days ago
    Now train a gpt based on the data :D
    • testfrequency 29 days ago
      But make sure to call it ClosedData or something so we know it’s not open source

      (sorry, I think openai and sam are gross)

      • davely 29 days ago
        Maybe I don’t understand this sentiment, but are people really that hung up on the name?

        I see this sort of thing posted a lot (i.e., “it should be ClosedAI instead of OpenAI, lol”)

        What if it just means “Open for Business” instead of “Open Access for All”? Or maybe they should just make it an acronym?

        I’m sorry for the confusion on my part, but there’s just been a lot of words dedicated toward expressing frustration with the company because they chose to use “open” in their name.

        Personally, I don’t find it frustrating that Apple doesn’t sell fruit and Intel doesn’t actually give intelligence data.

        • phyzome 29 days ago
          "What if it just means" -- I mean, we don't have to ask "what if". We can look at the original press release:

          https://openai.com/blog/introducing-openai

          « We’re hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies. »

          They never give an explicit explanation for their name, but it's pretty obvious.

        • rootusrootus 29 days ago
          Is the frustration because of the name, or because open [access] was part of their ethos at the beginning, and people think they've abandoned it?
          • startupsfail 29 days ago
            OpenAI is supposed to be a nonprofit. But, when the nonprofit board tried to exercise control, it became very clear that the nonprofit arm is not, in fact in control any longer. The board was wiped out, nearly everyone in the company seemingly was willing to join Microsoft or Sam Altman or what not.

            This doesn’t seem compatible with continuing to loftily call themselves by the same name as the initial nonprofit mission.

        • woopsn 29 days ago
          It's a gimmick. When the nonprofit was organized in 2015, the name certainly did not mean open for business. It meant (loftily) undertaking the quasi-religious quasi-humanist mission "in the spirit of liberty" to generate a new kind of super wealth as "broadly and evenly distributed as possible".

          As in prepare for the end... THE END OF HIGH PRICES!

          > to benefit humanity as a whole, unconstrained by a need to generate financial return

          - https://openai.com/blog/introducing-openai

  • garyiskidding 29 days ago
    This is really amazing. Pretty insightful. Thank you.
  • xandrius 30 days ago
    Love it, just for the sole reason of turning something OpenAI made into a dataset for everyone else :D
    • codetrotter 29 days ago
      I don’t think OpenAI are gonna lose any sleep over this.

      Isn’t a “community forum” like this basically just: “we’re not gonna spend money on providing adequate customer support so instead here is a forum where y’all can talk amongst yourselves and we’ll give you some badges and imaginary points for doing the customer support yourselves”?

      • alt-glitch 29 days ago
        I believe a community forum is absolutely vital for an "ecosystem" company. There needs to be a town square where people can discuss ideas and share feedback about that particular ecosystem.

        OpenAI has a pretty active forum with moderators replying and helping out all the time.

      • solardev 29 days ago
        They probably just sic a customer service GPT on it and use it to train the other ones...
  • dorkwood 29 days ago
    I did a bit of data scraping for fun in the past, but I was never quite sure of the legality of what I was doing. What if I was breaking some law in some jurisdiction of some country? Was someone going to track me down and punish me?

    OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.

    • alt-glitch 29 days ago
      We were really heading someplace with The Semantic Web aka The Real Web 3.0 [1]

      Alas, we have to fight against the machines in order to properly read the internet through machines.

      I believe Discourse knowingly keeps its data easy to scrape though, so kudos to them!

      [1]: https://en.wikipedia.org/wiki/Semantic_Web

    • bsuvc 29 days ago
      > OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.

      Cloudflare gives a shit.

      My household had to use our 5G internet for most things for a week or two until our IP reputation recovered.

      • stoorafa 29 days ago
        Yeah it’s probably worth renting a server if there’s any doubt about whether it’s wholly appropriate to do something
        • htrp 28 days ago
          Some sites just block the entire AWS/GCP ip address range.
    • ifyoubuildit 29 days ago
      Do you think it would be better if someone did track you down and punish you? Which world do you want to live in?
      • n0sleep 29 days ago
        I think large companies should be punished for stealing from people to make themselves richer.
    • EcommerceFlow 29 days ago
      A precursor to this would have been that LinkedIn lawsuit Microsoft lost, allowing that one company to scrape all of LinkedIn (technically "public information").
      • htrp 28 days ago
        hiQ Labs v. LinkedIn
  • enonimal 29 days ago
    > Number of Posts with negative sentiment, grouped by Topic

    > # 1 Result: Python Packaging

    Checks out

    • doctorpangloss 29 days ago
      The Python package is really well engineered, and the startup that is making the OpenAPI client based on it, Stainless, is doing a good job.

      This shows laypeople piling into a hype thing and running immediately into the roadblock of programming.

      Normal people don't want to like, put in effort to feel like they are a part of something.

      They are used to "just" having to turn on Netflix to feel like they are a part of the biggest TV show, or "just" having to click a button to buy a Stanley Cup, or "just" having to click a button to buy Bitcoin. The API and performance issues, IMO, they're not noise, but they are meaningless. To me this also signals how badly Grok and Stability are doing it, they are doubling and tripling down on popular opinions that have a strong, objective meaninglessness to them (like how fast the tokens come out and how much porn you're allowed to make). Whereas the Grok people are looking at this analysis and feeling very validated right now.

      I have no dog in this race, but I would hope that the OpenAI people do not waste any time on Python APIs for dumb people; instead, they should definitely improve their store and have a firmer opinion on how that would look. They almost certainly have a developing opinion on a programming paradigm for chatbots, but I feel like they are hamstrung by needing to quantize their models to meet demand, not by decisions about the look and feel of Python APIs or the crappiness of the Python packaging ecosystem. Another POV is that the Apple development experience remains notoriously crappy, and yet they are the most valuable platform for most companies in the world right now; also, JetBrains could not sustain an audience for the AppCode IDE, because everyone uses middlewares anyway; so I really don't think Python APIs matter as much as the community says they do. It's a Nice to Have, but it Does Not Matter.

      • enonimal 29 days ago
        we may think more similarly than you seem to think...

        this was more a slam on python packaging in general, than it is on the OpenAI implementation.

        I wouldn't be surprised if many of the issues under this topic are more related to Python package version nightmares, than OpenAI's Python implementation itself.

    • minimaxir 29 days ago
      A pro-tip for using the OpenAI API is to not use the official Python package for interfacing with it. The REST API documentation is good, and just using it in your HTTP client of choice like requests is roughly the same LOC without unexpected issues, along with more control.
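
      What "just use the REST API" looks like in practice, assuming the request shape from OpenAI's public chat completions docs. The helper only assembles the call; nothing is sent, and the key and model name are placeholders:

```python
import json

API_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(api_key: str, model: str, user_message: str):
    """Assemble headers and JSON body for a chat completion call.
    Sending is one line: requests.post(API_URL, headers=h, data=body)."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    })
    return headers, body

headers, body = build_chat_request("sk-...", "gpt-4", "Hello!")
print(headers["Content-Type"], json.loads(body)["model"])
# → application/json gpt-4
```
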
      • rockostrich 29 days ago
        I've found this happens with a lot of first party clients. At work, we use LaunchDarkly for feature flags and use their code references tool to keep track of where flags are being referenced. The tool uses their first party Go client to interact with the API but the client doesn't handle rate limiting at all even though they have rate limiting headers clearly documented for their API.
        • klooney 29 days ago
          First party clients are typically an afterthought, and you can't add features without getting a PM to sign off, which strangles the impulse to polish & sand down rough edges.
          • rattray 29 days ago
            Agreed. Any in particular come to mind that you'd like to see improved?

            (my company provides first-party clients with a lot of polish; maybe we could help)

      • rattray 29 days ago
        Hey minimaxir, I help maintain the official OpenAI Python package. Mind sharing what issues you've had with it? (Have you used it since November, when the 1.0 was released?)

        Keen for your feedback, either here or email: alex@stainlessapi.com

        • minimaxir 29 days ago
          There's nothing wrong per se, it works as advertised. But as a result it's a somewhat redundant dependency.
          • rattray 29 days ago
            Ah, gotcha. Thanks, that makes sense. FWIW, here are some things it provides which might be worth having:

            1. typesafety (for those using pyright/mypy) and autocomplete/intellisense

            2. auto-retry (w/ backoff, intelligently so w/ rate limits) and error handling

            3. auto-pagination (can save a lot of code if you make list calls)

            4. SSE parsing for streaming

            5. (coming soon) richer streaming & function-calling helpers (can save / clean up a lot of code)

            Not all of these matter to everybody (e.g., I imagine you're not moved by such benefits as "dot notation over dictionary access", which some devs might really like).

            I would argue that auto-retry would benefit a pretty large percentage of users, though, especially since the 429 handling can paper over a lot of rate limits to the point that you never actually "feel" them. And spurious/temporary network connections or 500s also ~disappear.

            For some simple use-cases, none of these would really matter, and I agree with you - especially if it's not production code and you don't use a type-aware editor.
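
            A generic sketch of the auto-retry idea from point 2: exponential backoff on retryable statuses, preferring the server's `Retry-After` hint on 429s. The names and response shape here are illustrative, not the actual openai-python internals:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(do_request, max_retries=3, base_delay=0.5):
    """Retry transient failures with exponential backoff.

    `do_request` returns (status_code, retry_after_seconds_or_None, body);
    this mirrors the idea, not any real client's interface.
    """
    for attempt in range(max_retries + 1):
        status, retry_after, body = do_request()
        if status not in RETRYABLE or attempt == max_retries:
            return status, body
        # Prefer the server's hint; otherwise back off exponentially.
        delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
        time.sleep(delay)

# Fake transport: two 429s, then success.
responses = iter([(429, 0.0, ""), (429, 0.0, ""), (200, None, "ok")])
print(call_with_retry(lambda: next(responses), base_delay=0.0))
# → (200, 'ok')
```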

    • rattray 29 days ago
    FWIW, here are the only links I could find in the article which were tagged "Python3 Package": https://community.openai.com/t/647723 and https://community.openai.com/t/586484. Note they don't seem to have anything to do with the Python package whatsoever.

      I was pretty disappointed to see this, as I work on the Python package and was hoping for a good place to find feedback (apart from the github issues, which I monitor pretty closely).

      I'm not a data scientist; maybe someone from the Julep team could comment on the labeling? Or how I could find some more specific themes of problems with the Python package? (Was it just that people who have a problem of some kind just happen to also use the Python library?)

  • throwaway98797 29 days ago
    did they have the right to use all their data?

    /s
