Show HN: Train a language model to talk like you

(research.google.com)

209 points | by MasterScrat 1549 days ago

18 comments

MasterScrat 1549 days ago
You may have seen my recent post about [Chatistics: a Python tool to parse your Messenger/Hangouts/WhatsApp/Telegram chat logs into DataFrames](https://news.ycombinator.com/item?id=22069699).
This notebook uses the exported chat logs to train a simple GPT/GPT2 conversational model! It uses Google Colab, a notebook platform that allows you to train complex models online for free.
The approach is super simple: it takes all your chat logs, turns them into this format:
> <speaker1> Hi
> <speaker2> Hey - how are you?
> <speaker1> Great, thanks!
> ...
...then simply trains a GPT model on this corpus. In practice, I found that the default parameters (including using GPT and not GPT2) give the best results for this setup.
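A minimal sketch of the formatting step described above, assuming the chat log is already available as (sender, text) pairs. The function name `to_training_lines`, the choice to map your own messages to `<speaker1>`, and the choice to map everyone else to `<speaker2>` are assumptions made for illustration, not the notebook's actual code:

```python
from typing import Iterable, List, Tuple


def to_training_lines(messages: Iterable[Tuple[str, str]], me: str) -> List[str]:
    """Tag each message with <speaker1> (assumed: me) or <speaker2> (anyone else)."""
    lines = []
    for sender, text in messages:
        tag = "<speaker1>" if sender == me else "<speaker2>"
        # Collapse internal newlines and runs of spaces so each message stays on one line.
        flat = " ".join(text.split())
        lines.append(tag + " " + flat)
    return lines


# Hypothetical three-message chat, mirroring the example in the post.
chat = [
    ("alice", "Hi"),
    ("bob", "Hey - how are you?"),
    ("alice", "Great, thanks!"),
]
corpus = "\n".join(to_training_lines(chat, me="alice"))
print(corpus)
```

The resulting `corpus` string is the kind of plain-text file a language model could then be fine-tuned on; the training itself is outside the scope of this sketch.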
This notebook will be part of our workshop "Meet your Artificial Self" happening this Saturday at AMLD 2020 in Lausanne, Switzerland: https://appliedmldays.org/workshops/meet-your-artificial-sel...
Feedback is welcome! :D
- prophesi 1548 days ago
  I definitely need to give this a whirl. Does it use Python 2 or 3, and is it as simple as importing its ipynb to run it in a local Jupyter notebook?
capableweb 1548 days ago
I got a bit tricked by the title here on HN. Maybe we can replace `talk` with `write`? I thought this was something that could learn how I speak and generate sound from that, but it seems to be just about written language, which is not nearly as interesting (for me).
- moron4hire 1548 days ago
  Yeah, Microsoft has had NN-based speech generators that can mimic your own voice for about a year now. Thought this was going to be a competing service.
- thedirt0115 1548 days ago
  How about this combined with recent advances in text-to-speech like Deep Voice that can sound like you? (Edit: punctuation)
arethuza 1549 days ago
I'm disappointed that this is about typed text rather than actual talking - I had hoped that training something that talked like me might assist technology vendors in actually creating voice recognition technology that works for me.
And yes my problems with voice recognition are probably due to my Scottish accent.... ;-)
Tenoke 1548 days ago
I've been playing with training different sizes[0] of gpt on my own chat data precisely for this reason.
Coincidentally, today I was even planning to publish my last post and notebook for training gpt2-1.5b and then chatting to oneself with the model. I left it for tomorrow though.. Maybe a mistake.
There is quite a lot you can do, and talking to my trained model, which responds to me as me, can be real weird at times. It's definitely the most engaged I've been with gpt while talking to myself.
Having said that you seem to train here on very little. Still - cool demo.
[0] https://svilentodorov.xyz/blog/gpt-345M-finetune/
- MasterScrat 1548 days ago
  I would be very curious to see your notebook - while this simple approach works well with GPT, we are not getting the results we'd want with a more complex question/answer model that uses GPT2. So I'd love to see your implementation details!
  > Having said that you seem to train here on very little.
  The datasets provided in the notebook are really meant to be fallbacks for people who are not willing to use their own chat log data. When training on my own data, I have about 500k messages, which starts being enough to get interesting results.
  edit: wow, I see you're training on "14M facebook messages", that's impressive - do you actually chat that much?!
  - Tenoke 1547 days ago
    I just pushed it, the blog post (which includes the notebook) is here[0].
    It's 14mb of data, not sure how many messages it actually comes down to but FB Messenger has been my main platform for talking to friends for the last decade.
    0. https://svilentodorov.xyz/blog/gpt-15b-chat-finetune/
    - MasterScrat 1547 days ago
      Awesome :D
      We've published the third notebook: https://colab.research.google.com/drive/1XYNef9zcHhTjt6kM6yd...
      We would gladly have your feedback on our approach
- hug 1548 days ago
  I run a discord which is just a collection of people from my local city, with no real fixed subject or agenda. I trained the 345M GPT-2 against the "general" channel, and then set up a discord bot such that every message has a 2% chance that the past 5 messages will be read in as context, and a couple of sentences spat out in response.
  It's sometimes very lucid, sometimes insane, but most of all it's very entertaining.
  As a complete aside, I also tried 'transfer learning' against a huge amount of Marxist literature and then a small amount of erotic fiction, but that experiment hasn't quite worked out. Sexy Robot Marx will have to wait.
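The trigger logic in hug's comment (a small chance per message of replying, using the most recent messages as context) could be sketched roughly as follows. The class name `ReplyTrigger`, its parameters, and the decision to count the incoming message as one of the five context messages are all assumptions; the call to the fine-tuned model and the Discord integration are deliberately left out:

```python
import random
from collections import deque
from typing import Deque, List, Optional


class ReplyTrigger:
    """Keeps a rolling window of messages and occasionally hands it back as context."""

    def __init__(self, chance: float = 0.02, context_size: int = 5,
                 rng: Optional[random.Random] = None) -> None:
        self.chance = chance
        # deque(maxlen=...) silently drops the oldest message once the window is full.
        self.history: Deque[str] = deque(maxlen=context_size)
        self.rng = rng if rng is not None else random.Random()

    def observe(self, message: str) -> Optional[List[str]]:
        """Record a message; return the recent context if the bot should reply."""
        self.history.append(message)
        if self.rng.random() < self.chance:
            return list(self.history)
        return None


# Usage: feed every channel message through observe(); a non-None result
# is the prompt context to pass to whatever text generator is being used.
trigger = ReplyTrigger(chance=0.02, context_size=5)
context = trigger.observe("anyone up for lunch?")
if context is not None:
    print("Would generate a reply from:", context)
```

Passing in a seeded `random.Random` makes the behaviour reproducible, which is handy for checking the window logic without waiting on a 2% event.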
perturbation 1549 days ago
This is cool - might be worth training a simple discriminator model to identify your utterances, and then you can use the plug-and-play language model (PPLM - https://github.com/huggingface/transformers/blob/master/exam...) to generate utterances modeling a specific speaker without special tokens. Could also take less time to fine-tune.
the-dude 1549 days ago
I totally missed that Lyrebird was acquired : https://news.ycombinator.com/item?id=21006405
data_ders 1549 days ago
My curiosity is tempered by the fact that I've seen this episode of Black Mirror before... :)
https://en.wikipedia.org/wiki/Be_Right_Back
- ferCats99 1548 days ago
  I think it's more like White Christmas
bryanrasmussen 1549 days ago
A computer trained to talk like me would spend a lot of time swearing and whining about how it can't take it anymore, which I admit would be pretty funny.
raidicy 1548 days ago
This is part of a workshop series[0]. Does anyone know if the talks/shops will be recorded?
[0] https://appliedmldays.org/workshops
- MasterScrat 1548 days ago
  We don't have any plan to record it currently.
  But we will release the two other notebooks used during the workshop, and plan to write a blog post detailing the full content.
  - raidicy 1548 days ago
    That sounds great! This and the main workshop site have many subjects I'm really interested to check out. Thank you!
thisisastopsign 1549 days ago
I’ve never used PyTorch before... is this running on my local machine, or is there some API in here that’s also sending data to Google to train their models? Asking from a privacy point of view.
- heybrandons 1549 days ago
  The Python notebook is hosted on Google Colab, which will execute it on free (for you) Google servers. If you’re concerned about privacy, you probably shouldn't upload your personal chat logs. You could also download the notebook and install resources on a machine you control. There also appear to be alternative datasets to test with, for Obama and movie dialogues.
woefulregret 1548 days ago
throwaway, duh.
When I was a teenager I wrote a very graphic and very disturbing work of fiction that was archived on a popular erotica text website.. I have had anxiety for many years now that eventually someone will glue the authorship of that story to my identity.. If people in my real life discover my fantasies from years back because of my writing signature, I do not want to guess where that will leave me.. I am not looking forward to the future!!
fudged71 1549 days ago
Could you train this on a Q&A/FAQ corpus and get somewhat relevant results? (And is there any better tool for doing this?)
- cyorir 1549 days ago
  Along these lines, I worked on a team project in a university course to create an automated Q&A system making use of IBM Watson. We chose to focus on a Q&A system for business regulation in the state of Illinois. However, just using existing FAQs isn't sufficient. To build a corpus, we scraped several websites belonging to the state of Illinois for any information that would be relevant to businesses operating in Illinois. Then, we created sample question-answer pairs, with answers taken directly from the corpus. Using both the provided QA pairs and the rest of the unlabeled corpus, Watson trained a model to answer questions it hadn't been trained on by providing excerpts from the corpus. Because the model only returned excerpts from the corpus, we didn't have to worry that we were providing (too much) incorrect information; most of the time, the answers were relevant, too. Of course, you could create a similar system without using proprietary IBM software.
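As a rough illustration of the "answer with an excerpt from the corpus" idea in cyorir's comment, here is a toy retriever that returns the corpus excerpt most similar to a question. It swaps in plain bag-of-words cosine similarity, which is not what Watson does and uses none of the labelled question-answer pairs; the function names and the three sample excerpts are invented placeholders:

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_excerpt(question: str, excerpts: List[str]) -> str:
    """Return the excerpt whose word counts are closest to the question's."""
    q = tokenize(question)
    return max(excerpts, key=lambda e: cosine(q, tokenize(e)))


# Invented placeholder excerpts; a real corpus would come from scraped pages.
excerpts = [
    "Businesses must renew their license every year.",
    "Sales tax returns are filed monthly.",
    "Office hours are nine to five.",
]
print(best_excerpt("When do I file a sales tax return?", excerpts))
```

Because the answer is always a verbatim excerpt, the system cannot make up text of its own, which is the property the comment relies on; the trade-off is that it can only be as relevant as the closest excerpt.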
MadWombat 1548 days ago
Oh, oobee doo
I wanna be like you
I wanna walk like you, talk like you, too
You'll see it's true someone like me
Can learn to be like someone like you
alfonsodev 1549 days ago
This is going to be useful for when we fully turn into cyborgs.
- pjmorris 1548 days ago
  Maybe we're already there. Example: I've got a friend who worked in tech support long enough that he built a soundboard of recordings of his voice asking typical tech support questions in response to user problems.
  - whatshisface 1548 days ago
    They did that on the show "The IT Crowd."
nickster 1549 days ago
I wonder if they are using this in Android Messenger or Gmail for the suggested responses.
- neodymiumphish 1548 days ago
  I really don't send many emails through Gmail, but when I do it is INSANELY accurate in its suggested sentence completion. Sometimes simple stuff like an address or whatever, but it can get really creepy when I'm sending something to my wife as a reference for some bill or interaction with our landlord and it knows exactly what I'm trying to say after just a word or two (sometimes, something like "Hey, I just..." and it has the rest of the sentence ready to go).
heybrandons 1549 days ago
Thanks for sharing MasterScrat! This looks fun!
brainzap 1549 days ago
Train it on Fred Rogers