OpenVoice: Versatile instant voice cloning

(research.myshell.ai)

473 points | by ulrischa 30 days ago

29 comments

  • randkyp 30 days ago
    This is HN, so I'm surprised that no one in the comments section has run this locally. :)

    Following the instructions in their repo (and moving the checkpoints/ and resources/ folders into the "nested" openvoice subfolder), I managed to get the Gradio demo running. Simple enough.

    It appears to be quicker than XTTS2 on my machine (RTX 3090), and uses approximately 1.5GB of VRAM. The Gradio demo is limited to 200 characters, perhaps out of resource-usage concerns, but it seems to run at around 8x realtime (8 seconds of speech for about 1 second of processing time).

    EDIT: patched the Gradio demo for longer text; it's way faster than that. One minute of speech only took ~4 seconds to render. Default voice sample, reading this very comment: https://voca.ro/18JIHDs4vI1v I had to write out acronyms -- XTTS2 to "ex tee tee ess two", for example.
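    The acronym workaround can be scripted as a pre-processing pass before the text reaches the TTS. A sketch; the spoken-out forms in the table are just examples, not anything from the OpenVoice codebase:

```python
import re

# Spoken-out forms for acronyms the TTS otherwise mangles.
# Extend this table with whatever your input text contains.
SPOKEN = {
    "XTTS2": "ex tee tee ess two",
    "GPU": "gee pee you",
    "VRAM": "vee ram",
}

def expand_acronyms(text: str) -> str:
    """Replace known acronyms with phonetic spellings before synthesis."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, SPOKEN)) + r")\b")
    return pattern.sub(lambda m: SPOKEN[m.group(1)], text)

print(expand_acronyms("It appears to be quicker than XTTS2 on my GPU."))
```

    Running input through a pass like this before synthesis saves hand-editing each acronym.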

    The voice clarity is better than XTTS2, too, but the speech can sound a bit stilted and, well, robotic/TTS-esque compared to it. The cloning consistency is definitely a step above XTTS2 in my experience -- XTTS2 would sometimes have random pitch shifts or plosives/babble in the middle of speech.

    • bambax 30 days ago
      I am trying to run it locally but it doesn't quite work for me.

      I was able to run the demos all right, but when trying to use another reference speaker (in demo_part1), the result doesn't sound at all like the source (it's just a random male voice).

      I'm also trying to produce French output, using a reference audio file in French for the base speaker, and a text in French. This triggers an error in api.py line 75 that the source language is not accepted.

      Indeed, in api.py line 45 the only two source languages allowed are English and Chinese; simply adding French to language_marks in api.py line 43 avoids the error, but produces a weird/unintelligible result with a super-heavy English accent and pronunciation.

      I guess one would need to generate source_se again, and probably mess with config.json and checkpoint.pth as well, but I could not find instructions on how to do this...?
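      For anyone attempting the same hack, the shape of it is roughly this. An illustrative sketch only: the dict contents and helper below are assumptions about api.py's structure, not the actual source, and as noted above the base speaker checkpoint would still need French training data (and a regenerated source_se) to sound right:

```python
# Illustrative sketch of the language-whitelist hack described above.
# The dict contents and helper are assumptions, not the actual api.py source.
language_marks = {
    "English": "EN",
    "Chinese": "ZH",
}

def check_language(lang: str) -> str:
    """Mimic the source-language check that rejects French."""
    if lang not in language_marks:
        raise ValueError(f"source language {lang!r} not accepted")
    return language_marks[lang]

# The naive patch: whitelist French so the check passes...
language_marks["French"] = "FR"

# ...but the base speaker model was only trained on EN/ZH phonemes,
# hence the heavy English accent on French text.
```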

      Edit -- tried again on https://app.myshell.ai/ The result sounds French alright, but still nothing like the original reference. It would be absolutely impossible to confuse one with the other, even for someone who didn't know the person very well.

      • randkyp 30 days ago
        I played with it some more and I have to agree. For actual voice _cloning_, XTTS2 sounds much, much closer to the original speaker. But the resulting output is also much more unpredictable and sometimes downright glitchy compared to OpenVoice. XTTS2 also tries to "act out" the implied emotion/tone/pitch/cadence in the input text, for better or worse.

        But my use case is just to have a nice-sounding local TTS engine, and current text-to-phoneme conversion quirks aside, OpenVoice seems promising. It's fast, too.

        • echelon 30 days ago
          And StyleTTS2 generalizes out of domain even better than that.
      • dragonwriter 29 days ago
        > but when trying to use another reference speaker (in demo_part1), the result doesn’t sound at all like the source

        I’ve noticed the same thing and I wonder if there is maybe some undocumented information about what makes a good voice sample for cloning, perhaps in terms of what you might call “phonemic inventory”. The reference sample seems really dense.

        > Indeed, in api.py line 45 the only two source languages allowed are English and Chinese

        If you look at the code, outside of what the model itself does, the surrounding infrastructure converts the input text to the International Phonetic Alphabet (IPA) as part of the process, and only has that implemented for English and Mandarin (though cleaners.py has broken references to routines for Japanese and Korean).
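        The structure is roughly a per-language dispatch where most branches simply don't exist. A toy sketch (the tiny lexicons here are purely illustrative; real front ends use trained grapheme-to-phoneme models, not lookup tables):

```python
# Toy per-language grapheme-to-IPA dispatch, illustrating why only
# English and Mandarin work: the other branches simply don't exist.
# The tiny lexicons are illustrative; real front ends use G2P models.
LEXICONS = {
    "english": {"open": "ˈoʊpən", "voice": "vɔɪs"},
    "mandarin": {"你好": "ni˨˩˦ xɑʊ˨˩˦"},
    # japanese/korean: referenced in cleaners.py, but not implemented
}

def to_ipa(words, language):
    """Convert a word list to IPA; unknown words pass through as-is."""
    try:
        lexicon = LEXICONS[language]
    except KeyError:
        # Mirrors the "source language not accepted" error upthread.
        raise NotImplementedError(f"no IPA front end for {language!r}")
    return " ".join(lexicon.get(w, w) for w in words)

print(to_ipa(["open", "voice"], "english"))  # ˈoʊpən vɔɪs
```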

    • causi 30 days ago
      We're so close to me being able to open a program, feed in an epub, and get a near-human level audiobook out of it. I'm so excited.
      • aedocw 30 days ago
        Give https://github.com/aedocw/epub2tts a look; the latest update enables use of MS Edge's cloud-based TTS, so you don't need a local GPU, and the quality is excellent.
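        Under the hood, any epub-to-audiobook tool has to do the same core steps: extract chapter text, split it into chunks the TTS engine will accept, synthesize each chunk, and concatenate the audio. A sketch of the chunking step (the sizes are arbitrary; the 200-character Gradio limit mentioned upthread is why it matters):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-aligned chunks of roughly max_chars
    each, so every piece fits a TTS engine's input limit. A single
    sentence longer than max_chars is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second one here. Third.", max_chars=30))
# → ['First sentence.', 'Second one here. Third.']
```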
        • causi 27 days ago
          Interesting. Seems like a pain to get running but I'll give it a shot. Thanks.
      • lessolives 29 days ago
        [dead]
      • jurimasa 30 days ago
        I think this is creepy and dangerous as fuck. Not worth the trouble it will be.
        • _zoltan_ 30 days ago
          you're gonna be REALLY surprised out there in the real world.
        • CamperBob2 30 days ago
          Other sites beckon.
    • aftbit 30 days ago
      I want to try chaining XTTS2 with something like RVCProject. The idea is to generate the speech in one step, then clone a voice in the audio domain in a second step.
    • fellowniusmonk 29 days ago
      I'm running it locally on my M1. The reference voices sound great, but when I try to clone my own voice, it doesn't sound remotely like me.
    • epiccoleman 30 days ago
      I have got to build or buy a new computer capable of playing with all this cool shit. I built my last "gaming" PC in 2016, so its hardware isn't really ideal for AI shenanigans, and my Macbook for work is an increasingly crusty 2019 model, so that's out too.

      Yeah, I could rent time on a server, but that's not as cool as just having a box in my house that I could use to play with local models. Feels like I'm missing a wave of fun stuff to experiment with, but hardware is expensive!

      • sangnoir 30 days ago
        > its hardware isn't really ideal for AI shenanigans

        FWIW, I was in the same boat as you and decided to start cheap: old gaming machines can handle AI shenanigans just fine with the right GPU. I use a 2017 workstation (Zen1) and an Nvidia P40 from around the same time, which can be had for <$200 on eBay/Amazon. The P40 has 24GB of VRAM, which is more than enough for a good chunk of quantized LLMs or diffusion models, and is in the same perf ballpark as the free Colab tensor hardware.

        If you're just dipping your toes without committing, I'd recommend that route. The P40 is a data center card and expects higher airflow than desktop GPUs, so you'll probably have to buy a "blower kit" or 3D-print a fan shroud, and make sure it fits inside your case. That's another $30-$50. The bigger the fan, the quieter it can run. If you already have a high-end gamer PC/workstation from 2016, you can dive into local AI for $250 all-in.

        Edit: didn't realize how cheap P40s now are! I bought mine a while back.
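        Whether a given model fits in that 24GB is easy to sanity-check with back-of-envelope math: weight memory is roughly parameter count times bits-per-weight divided by 8, plus headroom for activations and the KV cache. A rough calculator (the 20% overhead factor is a guess, not a measurement):

```python
def vram_gb(params_billions: float, bits_per_weight: float,
            overhead: float = 1.2) -> float:
    """Rough VRAM estimate (GB) for LLM inference: weight bytes are
    params * bits / 8, inflated by a fudge factor for activations and
    KV cache. Ballpark only; the 1.2 overhead is a guess."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 13B model at 4-bit quantization fits a 24GB P40 comfortably:
print(round(vram_gb(13, 4), 1))   # 7.8
# Even a 30B model at 4-bit squeezes in:
print(round(vram_gb(30, 4), 1))   # 18.0
```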

      • beardedwizard 30 days ago
        I would love a recommendation for an off the shelf "gpu server" good for most of this that I can run at home.
        • macrolime 30 days ago
          Mac Studio or MacBook Pro if you want to run the larger models. Otherwise, just a gaming PC with an RTX 4090, or a used RTX 3090 if you want something cheaper. A used dual-3090 setup can also be a good deal, but that's more in the build-it-yourself category than off the shelf.
          • pksebben 30 days ago
            I went the 4090 route myself recently, and I feel like all should be warned: memory is a major bottleneck. For a lot of tasks, folks may get more mileage out of multiple 3090s if they can get them set up to run in parallel.

            Still waiting on being able to afford the next 4090 + egpu case et al. There are a lot of things this rig struggles with running OOM, even on inference with some of the more recent SD models.

          • ckl1810 30 days ago
            Depending on what models you want to run, RTX 4090 or RTX 3090 may not be enough.

            Grok-1 was run on an M2 Ultra with 192GB of RAM.

            https://twitter.com/ibab_ml/status/1771340692364943750

          • 101008 30 days ago
            Sorry if this is a silly question - I was never a Mac user, but I quickly googled Mac Studio and it seems it's just the computer. Can I plug any monitor / keyboard / mouse into it, or do I need to use everything from Apple with it?
            • macrolime 30 days ago
              You can, but with some caveats. Not all screen resolutions work well with macOS, though with BetterDisplay it will still usually work. If you want Touch ID, it's better to get the Magic Keyboard with Touch ID.
            • timschmidt 30 days ago
              Any monitor and keyboard will work; however, Apple keyboards have a couple of extra keys not present on Windows keyboards, so some key remapping is required to access all the typical shortcut combinations.
              • spectre3d 30 days ago
                Mainly to swap the Windows and Alt keys, which you can do in System Settings without any additional software.

                If you use a mouse with more than right-click and scroll wheel, with side buttons for example, then you’ll need extra software.

        • lakomen 30 days ago
          I'm clueless about AI, but here's a benchmark list https://www.videocardbenchmark.net/high_end_gpus.html

          Imo the 4070 Super is the best value, and it consumes the fewest watts of any card in the top 10, at 220W.

          So anything with one of those and some ECC RAM (aka AMD) should be fine. Intel non-Xeons need the expensive W680 boards and very specific RAM per board.

          ECC because you wrote server. We're professionals here after all, right?

          • w4ffl35 28 days ago
            I have a 2080 Super and build my AI software to target it and above. A 4090 is a good purchase.
          • antonvs 30 days ago
            What if I enjoy gambling with cosmic ray bitflips?
            • GTP 30 days ago
              Maybe they would make your AI model evolve into an AGI over time :D
        • batch12 30 days ago
          So I went really cheap and got a Thunderbolt GPU dock and a secondhand Intel NUC that supported it. So far it has met my needs.
      • holtkam2 30 days ago
        I'm in exactly the same boat. Yeah, of course you can run LMs on cloud servers, but my dream project would be to build a new gaming PC (mine is too old) and serve an LM on it, then serve an AI agent app which I can talk to from anywhere.

        Has anyone had luck buying used GPUs, or is that something I should avoid?

        • ssl-3 30 days ago
          I bought some used GPUs during the last mining thing. They all worked fine except for some oddball Dell models that the seller was obviously trying to fix a problem on (and they took them back without question, even paying return shipping).

          And old mining GPUs are A-OK, generally: despite warnings from the peanut gallery for over a decade that mining ruins video cards, this has never really been the case. Profitable miners have always tended to treat these things very carefully, undervolt (and often underclock) them, and pay attention to them so they could be run as cool and inexpensively as possible. Killing cards is bad for profits, so they aimed to keep them alive.

          GPUs that were used for gaming are also OK, usually. They'll have fewer hours of hard[er] work on them, but will have more thermal cycles as gaming tends to be much more intermittent than continuous mining is.

          The usual caveats apply as when buying anything else (used, "new", or whatever) from randos on teh Interwebz. (And fans eventually die, and so do thermal interfaces (pads and thermal compound), but those are all easily replaceable by anyone with a small toolkit and half a brain worth of wit.)

    • zoklet-enjoyer 30 days ago
      I forgot all about Vocaroo!
  • tonnydourado 30 days ago
    I might be missing something, but what are the non-questionable, or at least non-evil, uses of this technology? Because every single application I can think of is fucked up: porn, identity theft, impersonation, replacing voice actors, stealing the likeness of voice actors, replacing customer support without letting the customers know you're using bots.

    I guess you could give realistic voices to people who have lost their voices by using old recordings, but there's no way that's a market that justifies the investment.

    • paczki 30 days ago
      The ability to use my own voice in other languages, so I can do localization on my own YouTube videos, would be huge.

      With game development as well, being able to be my own voice actor would save me an immense amount of money that I do not have and give me even more creative freedom and direction of exactly what I want.

      It's not ready yet, but I do believe that it will come.

      • Capricorn2481 30 days ago
        People are already doing this, and it was hugely controversial in The Finals.
        • fennecfoxy 26 days ago
          There's a big thing at the moment around human creative work being replaced by AI; even if a voice isn't cloned but just generated by AI, it gets people frothing, as if human hands in factories hadn't been replaced by robots, or a rider on horseback hadn't been replaced by an engine.
        • rbits 30 days ago
          I feel like the main reason, though, was that they could easily afford real voice actors.
          • itsTyrion 29 days ago
            that's... exactly what they did. They hired real VAs that recorded some lines that are used as-is in the game, and stuff for training, which is to make more dynamic commentary about the match. It was a controversy because people saw "AI" and lost their marbles. Nothing wrong if it's contracted/licensed training data.
            • mewpmewp2 29 days ago
              Ability to do it realtime in video games in general seems huge.
    • AnonC 30 days ago
      > what are the non-questionable, or at least non-evil, uses of this technology?

      iPhone Personal Voice [1] is one. It helps people who are physically losing their voice, and those around them, to still have that voice in a different way. Apple does take long voice samples of various texts for this, though.

      [1]: https://www.youtube.com/watch?v=ra9I0HScTDw

      • tonnydourado 30 days ago
        That's kinda what I was thinking of in the second paragraph. Still, it's gotta be a small market.
    • tompetry 30 days ago
      I have the same concerns generally. But one non-evil popped into my head...

      My dad passed away a few months ago. Going through his things, I found all of his old papers and writings; they have great meaning to me. It would be so cool to have them as audio files, my dad as the narrator. And for shits, try it with a British accent.

      This may not abate the concerns, but I'm sure good things will come too.

      • block_dagger 30 days ago
        Serious question: is this a healthy way to treat ancestors? In the future will we just keep grandma around as an AI version of her middle aged self when she passes?
        • tompetry 30 days ago
          Fair question. People have kept pictures, paintings, art, belongings, etc. of their family members for countless generations. AI will surely be used to create new ways to remember loved ones. I think that's a big difference from "keeping around grandma as an AI version of herself" and pretending they are still alive, which I agree feels unhealthy.
          • jaakl 28 days ago
            Made me think how different generations can be, and what counts as countless. I can count back two, and have never seen anything of my grandfather, who was born 101 years before me (1875). At least I have his father's "signature" as XXX, from the 1860s, on a plan of the farmland, from when he bought it out of slavery. And that actual farmland. Good luck AI'ing that heritage.
        • mynameisash 30 days ago
          I think everyone's entitled to their opinion here. As for me, though: my brother died at 10 years old (back in the 90s). While there are some home videos with him talking, it's never for more than a few seconds at a time.

          Maybe a decade ago, I came across a cassette tape that he had used to record himself reading from a book for school - several minutes in duration.

          It was incredibly surprising to me how much he sounded like my older brother. It was a very emotional experience, but personally, I can't imagine using that recording to bootstrap a model whereby I could produce more of his "voice".

        • Narishma 30 days ago
          There's a Black Mirror episode about something like that, though I don't remember the details.
          • GTP 30 days ago
            I remember a journalist actually doing it, but just the AI part of course, not the robot.
          • oli-g 30 days ago
            Yup, "Be Right Back", S2E1

            And possibly another one, but that would be a spoiler

        • ssl-3 30 days ago
          It seems unhealthy for us to sort out what is and is not a healthy way for someone else to mourn, or to remember, their own grandmother.

          It is healthier for us to just let others do as they wish with their time without passing judgement.

        • gremlinsinc 30 days ago
          It worked for Superman; he seemed well adjusted after talking to his dead parents.
      • hypertexthero 30 days ago
        Not sure if this is related to this tech, but I think it is worthwhile: The Beatles - Now And Then - The Last Beatles Song (Short Film)

        https://www.youtube.com/watch?v=APJAQoSCwuA

    • CuriouslyC 30 days ago
      Text-to-speech is very close to being able to replace voice actors for a lot of lower-budget content. Voice cloning will let directors and creators get just the sound they want for their characters: imagine being able to say "I want something that sounds like Harrison Ford with a French accent." Of course, there will be debates about how closely you can clone someone's voice/diction/etc. Both extremes are wrong: perfect cloning will hurt artists without bringing extra value to directors/creators, but if we outlaw anything that sounds similar, the technology will be neutered to uselessness.
      • unraveller 29 days ago
        Years ago, The Indiana Jones podcast show made a feature-length radio adventure drama (with copyright blessing for the music/story of Indiana Jones), and it has a voice actor who sounds 98% exactly like Harrison Ford. No one was hurt by it, because cultural artefacts rely on mass distribution first.

        https://archive.org/details/INDIAN1_20190413?webamp=default

      • tonnydourado 30 days ago
        That's basically replacing voice actors and stealing their likeness: both are arguably evil, and both were mentioned. So no, I haven't missed them.

        P.S.: "but what about small, indie creators" that's not who's gonna embrace this the most, it's big studios, and they will do it to fuck over workers.

        • CuriouslyC 30 days ago
          As someone involved in the AI creator sphere, that's a very cold take. Big studios pay top shelf voice talent to create the best possible experience because they can afford it. Do you think Blizzard is using AI to voice Diablo/Overwatch/Warcraft? Of course not. On the other hand, there are lots of small indie games being made now that utilize TTS, because the alternative is no voice, the voice of a friend or a very low quality voice actor.

          Do I want people making exact clones of voice actors? No. The problem is that if you say "you can't get 90% close to an existing voice actor," then the technology will be able to create almost no human voices; it'll constantly refuse, like Gemini, even when the request is reasonable. This technology is incredibly powerful and useful, and we shouldn't avoid using it because it'll force a few people to change careers.

          • tonnydourado 30 days ago
            Have you seen how big studios treat vfx artists? They absolutely will replace voice actors with AI.

            Also:

            > This technology is incredibly powerful and useful

            At what, exactly? The only "useful" case you presented is "actually, replacing voice actors with AI isn't so bad".

            • CuriouslyC 30 days ago
              You want a world where only the rich can create beautiful experiences. You're either rich or short sighted.

              Edit: If you've got a cadre of volunteer voice actors that don't suck hidden somewhere, you need to share buddy. That's the only way your comments make sense.

              • tonnydourado 30 days ago
                I don't know what else to tell you, I just think people deserve to be paid for the work they do.

                Your vision of a world where anyone can create voices for their projects on the cheap CANNOT exist without someone getting exploited. Nor is it sustainable, really.

                You said that this world would be worth some people losing their careers, but what do we gain? More games/audiobooks of questionable quality? Is that really worth fucking over a whole profession?

                • CuriouslyC 30 days ago
                  We agree that people should be paid for the work that they *DO*. Your view smacks of elitism, and voice actors don't have any more right to be able to make decent money peddling their voice than indie game devs have to peddle games with synthetic voices.
                  • tonnydourado 30 days ago
                    Your view smacks of contempt for workers, particularly in the arts. Especially the emphasis on "do," as if voice actors don't actually work and just live off royalties or something. The kind of worldview that the rich and the deluded working poor tend to share.
                • amarant 30 days ago
                  Professions disappear, it's a natural side effect of progress. Stablehands aren't really that common anymore, because most people drive cars instead of horses.

                  I really hope we can deprecate a whole bunch of professions related to fossil fuels, including coal miners and oil drillers etc.

                  I sympathise with the people working in those professions, I do, but times change and professions come and go, and I don't buy the argument that we should stop inventing new stuff because it might outcompete people.

                  As for positive uses of this technology, it might be used to immortalise a voice actor. For example Sir David Attenborough probably won't be around forever, but thanks to this technology, his iconic voice might be!

                  • wsintra2022 30 days ago
                    I made an ebook of Carl Rogers narrated by David Attenborough; it turned out decent. I used Coqui, which has sadly shut down with all my API credits.
              • Osmose 30 days ago
                You have a narrow view of what a beautiful experience is. It does not require professional-level voice acting.

                It is not unfair that, in order to have voice acting, you must have someone perform voice acting. You don't have the natural right to professional-level voice acting for free, nor do you need it to create beautiful things.

                The tech is simply something that may be possible, and it has tradeoffs, and claiming that it's an accessibility problem does not grant you permission to ignore the tradeoffs.

                • ben_w 30 days ago
                  > You don't have the natural right to professional-level voice acting for free

                  I also don't have the natural right to work as a professional-level voice actor.

                  "Natural rights" aren't really a thing, the phrase is a thought-terminating cliché we use for the rhetorical purpose of saying something is good or bad without having to justify it further.

                  > The tech is simply something that may be possible, and it has tradeoffs, and claiming that it's an accessibility problem does not grant you permission to ignore the tradeoffs.

                  A few times as a kid, I heard the meme that the American constitution allows everything then tells you what's banned, the French one bans everything then tells you what's allowed, and the Soviet one tells you nothing and arrests you anyway.

                  It's not a very accurate meme, but still, "permission" is the wrong lens: it's allowed until it's illegal. You want it to be illegal to replace voice actors with synthetic voices, you need to campaign to make it so as this isn't the default. (Unlike with using novel tech for novel types of fraud, where fraud is already illegal and new tech doesn't change that).

              • Riverheart 30 days ago
                “You want a world where only the rich can create beautiful experiences. You're either rich or short sighted.”

                Being rich is not required to create a beautiful experience, nor does creating one require a synthetic voice.

                It does require effort and being rich can reduce that effort for sure.

          • Osmose 30 days ago
            The lightness with which you treat forcing tens of thousands of people to change their career is absurd. Indie games are hardly suffering for a lack of voice acting, even if you only look at it from a market perspective and ignore that voice acting is a creative interpretation and not simply reading the words the way the director wants.

            Yes, we should avoid using it, because it will upend the lives of a significant number of artists for the primary benefit of "some indie games will have more voice acting and big game companies will be able to save money on voice actors". That's not worth it; how could you think it is?

            • waterhouse 30 days ago
              Suppose all existing voice actors and, to be maximally generous, everyone who had spent >1 year training to be a voice actor, were given a pension for some years, paying them the greater of their current income or some average voice actor income. And then there would be no limits on using AI voices to substitute for voice actors.

              Would you be happy with that outcome, or do you have another objection?

            • ben_w 30 days ago
              > The lightness with which you treat forcing tens of thousands of people to change their career is absurd.

              Only tens of thousands? Cute. For most of the 2010s, I was expecting self-driving cars to imminently replace truck drivers, which is a few millions in the US alone and I think around 40-45 million worldwide. I still do expect AI to replace humans for driving, I just don't know how long it will take. (I definitely wasn't expecting "creative artistry" to be an easier problem than "don't crash a car", I didn't appreciate that nobody minds if even 90% of the hands have 6 fingers while everyone minds if a car merely equals humans by failing to stop in 1 of every (3.154e7 seconds per year * 1.4e9 vehicles / 30000 human driving fatalities per year ~= 1.47e+12) seconds of existence).
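              The parenthetical arithmetic, spelled out with the same round numbers (the 30,000 fatalities figure is the one the comment uses):

```python
# Seconds of worldwide driving-fleet operation per fatal failure,
# using the comment's own round numbers.
seconds_per_year = 3.154e7       # ~seconds in a year
vehicles = 1.4e9                 # ~vehicles worldwide
fatalities_per_year = 30_000     # driving fatalities per year (comment's figure)

seconds_per_fatality = seconds_per_year * vehicles / fatalities_per_year
print(f"{seconds_per_fatality:.2e}")  # 1.47e+12
```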

              Almost every nation used to be around 90% farm workers; now it's more like 1-5% (similar numbers to truckers), and even those who remain are scared of automation. The immediate shift was into factory jobs, but those too have moved into service roles as the factories were automated, and what's left is scared of automation (and outsourcing).

              Those service-sector roles? "Computer" used to be a job; graphical artists are upset about Stable Diffusion; anyone working with text, from Hollywood script writers to programmers to lawyers, is having to justify their own wages vs. an LLM (for now, most of us are winning this argument, but for how long?).

              If we get this wrong, it's going to be a disaster; if we get it right, we're all living better than the 0.1%.

              > Indie games are hardly suffering for a lack of voice acting, even if you only look at it from a market perspective and ignore that voice acting is a creative interpretation and not simply reading the words the way the director wants.

              I tried indie game development for a bit. I gave up with something like £1,000 in my best year. (You can probably double that to account for inflation since then).

              This is because the indie game sector is also not suffering from a lack of developer talent, meaning there's a lot of competition that drives prices below the cost of living. Result? Hackathons where people compete for the fun of it, not for the end product. Those hackathons are free to say if they do or don't come with rules about GenAI; but in any case, they definitely come with no budget.

              > Yes, we should avoid using it because it will upend the lives of a significant amount of artists for the primary benefit of "some indie games will have more voice acting and big game companies will be able to save money on voice actors". That's not worth it, how could you think it is?

              A few hours ago I was in the Deutsches Technikmuseum; there's a Jacquard Loom by the cafe: https://technikmuseum.berlin/ausstellungen/dauerausstellunge...

              The argument you give here is much the same argument used against that machine, back in the day: https://spectrum.ieee.org/the-jacquard-loom-a-driver-of-the-...

              Why do you think those textile workers lost the argument?

              And to pre-empt what I think is a really obvious counter, I would also add that the transition we face must be handled with care and courtesy to the economic fears — to all those who read my comment and think "and therefore this will be easy and we should embrace it, just dismiss the nay-sayers as the Luddites they are": why do you think Karl Marx wrote the Communist Manifesto?

          • ceejayoz 30 days ago
            > Do you think Blizzard is using AI to voice Diablo/Overwatch/Warcraft? Of course not.

            Do you think Blizzard won't when the tech gets cheap and good enough?

            • CuriouslyC 30 days ago
              Probably not, because the voice actors are a community draw. In fact, one of the top threads in the overwatch subreddit right now is pictures of all the voice actors. They go to cons and interact with fans and they don't cost so much that losing that value to save a few bucks is worth it.
              • infinitezest 29 days ago
                Based on the way consumers have behaved over my 30-odd years of life, I seriously doubt they will care enough about the fates of voice actors, developers, or any other folks who are mistreated or discarded in the project of creating yet another Call of Duty iteration. We're an atomized society that's being trained all day, every day, to just purchase things, consequences be damned. I want to believe you are right, but I suspect you are not. I suspect you are coping, and to be honest, I want to do that too.
                • CuriouslyC 29 days ago
                  There are some cases where that's true, but people connect with actors too, which is why a AAA star will get paid 20+ million dollars to just associate their name with a project.

                  I think the truth will be in the middle. Nobody's choosing AI over Brad Pitt or Henry Cavill, and in the various niches I think the highest performing humans will still do very well. Overwatch is a good example of that, see https://blizzardwatch.com/2018/01/24/rolling-stone-calls-ove.... AI is going to destroy the bottom of the barrel though, stuff like Fiverr and a lot of mediocre trained voice actors are going to have to get better or change careers.

        • ben_w 30 days ago
          I disagree on three of your points.

          It is creating a new and fully customisable voice actor that perfectly matches a creative vision.

          To the extent that a skilled voice actor can already blend existing voices together to get, say, French Harrison Ford, for it to be evil for a machine to do it would require it to be evil for a human to do it.

          Small indie creators have a budget of approximately nothing, this kind of thing would allow them to voice all NPCs in some game rather than just the main quest NPCs. (And that's true even in the absence of LLMs to generate the flavour text for the NPCs so they're not just repeating "…but then I took an arrow to the knee" as generic greeting #7 like AAA games from 2011).

          Big studios may also use this for NPCs to the economic detriment of current voice actors, but I suspect this will be a tech which leads to "induced demand"[0] — though note that this can also turn out very badly and isn't always a good thing either: https://en.wikipedia.org/wiki/Cotton_gin

          [0] https://en.wikipedia.org/wiki/Induced_demand

        • allannienhuis 30 days ago
          I don't disagree with the thought that large companies are going to try to use these technologies too, with typical lack of ethics in many cases.

But some of this thinking is a bit like protesting the use of heavy machinery in roadbuilding/construction, because it displaces thousands of people with shovels. One difference with this type of technology is that using it doesn't require massive amounts of capital like the heavy machinery example, so more of those shovel-wielders will be able to compete with those who are only bringing capital to the table.

          • tonnydourado 30 days ago
            I'm not saying that this should be forbidden or something. I just wonder what is the motivation for the people pitching and actually developing this. I'm all for basic, non-profit-driven, research, but at some point you gotta ask yourself "what am I helping create here?"
            • CrazyStat 30 days ago
              Saying something is evil would seem to suggest that you think it should be forbidden. Maybe you should choose a different word if that’s not your intention.
    • drusepth 30 days ago
Super-niche use-case: our game studio prototyped a multiplayer horror game where we played with cloning player voices to be able to secretly relay messages to certain players as if they came from one of their team-mates (e.g. "go check out below deck" to split a pair of players up, or "I think Bob is trying to sabotage us" to sow inter-player distrust, etc).

      Less-niche use-case: if you use TTS for voice-overs and/or NPC dialogue, there can still be a lot of variance in speech patterns / tone / inflections / etc when using a model where you've just customized parameters for each NPC -- using a voice-clone approach, upon first tests, seems like it might provide more long-term consistency.

      Bonus: in a lot of voiced-over (non-J)RPGs, the main character is text-only (intentionally not voiced) because they're often intended to be a self-insert of the player (compared to JRPGs which typically have the player "embody" a more fleshed-out player with their own voice). If you really want to lean into self-insert patterns, you could have a player provide a short sample of their voice at the beginning of the game and use that for generating voice-overs for their player character's dialogue throughout the game.

      • Terr_ 30 days ago
        The idea of a personalized protagonist voice is interesting, but I'd worry about some kind of uncanny valley where it sounds like myself but is using the wrong word-choices or inflections.

        Actually, getting it to sound "like myself" in the first place is an extra challenge! For many people even actual recordings sound "wrong", probably because your self-perception involves spoken sound being transmitted through your neck and head, with a different blend of frequencies.

        After that is solved, there's still the problem of bystanders remarking: "Is that supposed to sound like you? It doesn't sound like you."

        • Kinrany 29 days ago
          Being able to show friends your internal voice would be cool.
      • Jordrok 30 days ago
        > Super-niche use-case: our game studio prototyped a multiplayer horror game where we played with cloning player voices to be able to secretly relay messages to certain players as if they came from one of their team-mates (e.g. "go check out below deck" to split a pair of players up, or "I think Bob is trying to sabotage us" to sow inter-player distrust, etc).

        That's an insanely cool idea, and one I hadn't really considered before. Out of curiosity, how well did it work? Was it believable enough to fool players?

    • pksebben 30 days ago
      There's a huge gap in uses where listenable, realistic voice is required, but the text to be spoken is not predetermined. Think AI agents, NPCs in dynamically generated games, etc. These things are currently not really doable with the current crop of TTS because either they take too long to run or they sound awful.

      I think the bulk of where this stuff will be useful isn't really visible yet b/c we haven't had the tech to play around with enough.

      There is also certainly a huge swath of bad-actor stuff that this is good for. I feel like a lot of the problems with modern tech falls under the umbrella of "We're not collectively mature enough to handle this much power" and I wish there were a better solution for all of that.

      • gremlinsinc 30 days ago
        eh, you mean the solution isn't, so here's even more power... see you next week!
        • pksebben 30 days ago
          If I'm getting your meaning, that is - we don't have a fix for "we can but we ought not to", then yeah I see what you mean.

          Even that is not straightforward, unfortunately. There's this thing where the tech is going to be here, one way or the other. What we may have some influence on isn't whether it shows up, but who has it.

          ...which brings me to what I see as the core of contention between anyone conversing in this space. Who do you think is a bigger threat? Large multinational globocorps or individual fanatics, or someone else that might get their hands on this stuff?

          From my perspective, we've gone a long time handing over control of "things" (society, tax dollars, armaments, law) to the larger centralized entities (globocorps, political parties, Wall Street, and so on). Things throughout this period have become incrementally worse and worse, and occasionally (here's looking at you, September '08) rapidly worse.

          Put in short, huge centralized single-points-of-failure are the greater evil. "Terrorists", "Commies", "Malcontents" (whatever you wanna call folks with an axe to grind) make up a much lesser (but still present!) danger.

          So that leaves us in a really awkward position, right? We have these things that (could) amount to digital nukes (or anything on a spectrum toward such), and we're having this conversation about whether to keep going on them while everyone knows full well that on some level, we can't be trusted. It's not great, and I'll be the first to admit that.

          But, I'm much more concerned about the people with strike drones and billions of dollars of warchest having exclusive access to this stuff than I am about joey-mad-about-baloney having them.

          Joey could do one-time damage to some system, or maybe fleece your grandma for her life savings (which is bad, and I'm not trying to minimize it).

          Globocorp (which in this scenario could actually be a single incredibly rich dude with a swarm of AI to carry out his will) could institute a year-round propaganda machine that suppresses dissent while algorithmically targeting whoever it deems "dangerous" with strike drones and automated turrets. And we'd never hear about it. The 'media AI' could just decide not to tell us.

          So yeah, I'm kinda on the side of "do it all, but do it in the open so's we can see it". Not ideal, but better than the alternatives AFAICT.

    • mostrepublican 30 days ago
      I used it to translate a short set of tv shows that were only available in Danish with no subtitles in any other language and made them into English for my personal watching library.

      The episodes are about 95% just a narrator with some background noises.

      Elevenlabs did a great job with it and I cranked through the 32 episodes (about 4 mins each) relatively easily.

      There is a longer series (about 60 hours) only in Japanese that I want to do the same thing for. But don't want to spend Elevenlabs prices to do.

      • ukuina 30 days ago
        OpenAI TTS is very competitively priced: $15/1M chars.
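        A back-of-the-envelope sketch of what that rate implies for the jobs mentioned upthread (the 32 Danish episodes and the 60-hour Japanese series). The speaking rate is an assumption (~150 wpm narration, roughly 12.5 characters per second including spaces), not a measured figure:

        ```python
        # Rough TTS cost estimate at the quoted $15 per 1M characters.
        # CHARS_PER_SECOND is an assumed narration rate, not a measured value.

        PRICE_PER_MILLION_CHARS = 15.00  # USD, quoted OpenAI TTS rate
        CHARS_PER_SECOND = 12.5          # assumed: ~150 wpm * ~5 chars/word / 60 s

        def tts_cost_usd(hours_of_audio: float) -> float:
            """Estimated cost to synthesize the given hours of speech."""
            chars = hours_of_audio * 3600 * CHARS_PER_SECOND
            return chars * PRICE_PER_MILLION_CHARS / 1_000_000

        # 32 episodes of ~4 minutes vs. the 60-hour series:
        print(f"32 x 4 min: ${tts_cost_usd(32 * 4 / 60):.2f}")  # $1.44
        print(f"60 hours:   ${tts_cost_usd(60):.2f}")           # $40.50
        ```

        Even with the narration-rate assumption off by a factor of two, the 60-hour project lands in the tens of dollars rather than the hundreds.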
    • victorbjorklund 30 days ago
      Why is replacing voice actors evil? How is it worse than replacing any other job using a machine/software?
      • buu700 30 days ago
        Agreed. I think the framing of "stealing" is a needlessly pessimistic prediction of how it might be used. If a person owns their own likeness, it would be logical to implement legal protections for AI impersonations of one's voice. I could imagine a popular voice actor scaling up their career by using AI for a first draft rendering of their part of a script and then selectively refining particular lines with more detailed prompts and/or recording them manually.

        This raises a lot of complicated issues and questions, but the use case isn't inherently bad.

      • machomaster 30 days ago
        The problem is not about replacing actors with technology. It is about replacing the particular actors with their computer-generated voice. It's about likeness-theft.
    • accrual 30 days ago
      A long term goal of mine is to have a local LLM trained on my preferences and with a very long memory of past conversations that I could chat with in real time using TTS. It would be amazing to go on a walk with Airpods and chat with it, ask questions, learn about topics, etc.
      • willsmith72 30 days ago
        I do that already with the chatgpt mobile app, but not with my own voice.

        I'd like it if there were more (and non-american) voice options, but I don't think I'd ever want it to be my voice I'm hearing back.

        • accrual 30 days ago
          Yeah, I wouldn't necessarily want it to be my own voice either, but it would be very cool to make it be the voice of someone I enjoy listening to. :)
    • andrewmcwatters 30 days ago
      I want to preserve samples of my voice as I age so that when voice replication technology improves in the future, I can hear myself from a different time of my life in ways that are not prerecorded.

      I would also like to give my children this as a novelty of preserved family history so if I so desire, I can have fun with them by letting them hear me from different ages.

    • lenerdenator 30 days ago
      > there's no way that this is a market that justify the investment.

      It's not just about justifying the investment. You can make just about anything worth the investment as measured by a 90-day window of fiscal reporting. Hitmen were a wildly profitable venture for La Cosa Nostra.

      It's about not justifying the societal risk.

    • spyder 30 days ago
      Huh? Replacing human labor with machines is evil? You wouldn't even be able to post this comment without that happening, because computers wouldn't exist, or we wouldn't have time for it because many of us would be working on farms to produce enough food without human-replacing technologies.

      In the same way that machines allowed us to produce an abundance of food with less labor, voice AI combined with AI translation can make information more accessible to the world. Voice actors wouldn't be able to voice-act all the useful information in the world (especially for the more niche topics and the smaller languages), because it wouldn't be worth paying them, and humans are also slower than machines. We are not far from almost-realtime voice translation from any language to any other. Sure, we can do it with text-only translation, but voice makes it more accessible for a lot of people. (For example, between 5–10% of the world has dyslexia.)

    • thatguysaguy 30 days ago
      To think of non-evil versions just consider cases where right now there's no voice actor to replace, but you could add a voice. E.g. indie games.
    • swores 30 days ago
      What about for remembering lost loved ones? There are dead people I would love to hear talk again, even if I know it's not their personality talking just their voice (and who knows, maybe with LLM training on a single person it could even be roughly their personality, too).

      I can imagine a fairly big market of both people setting it up before they die, with maybe a whole load of written content and a schedule of when to have it read in future, and people who've just lost someone, and want to recreate their voice to help remember it.

      • tonnydourado 30 days ago
        > I can imagine a fairly big market (...)

        I can't, and if I could, I think this would be fairly dystopian. Didn't Black Mirror have an episode about something similar? I vaguely remember an Asimov/Arthur C. Clarke short story about the implications of time-travel(ish) tech in a similar context. Sounds like a case of "we've built the torment nexus from classic sci-fi novel 'Do Not Build the Torment Nexus'".

      • grugagag 30 days ago
        We already have ways to preserve the voices of people past their lives. Cloning their voices and writing things in their names is not only wrong but deceptive.
      • dotancohen 30 days ago
        Jack Crusher did something similar for Wesley.
    • wdb 30 days ago
      You can use it to easily fix voice-overs on your videos without needing to re-record, etc.
      • tonnydourado 30 days ago
        Reasonable, but I'm skeptical of the market
    • kajecounterhack 30 days ago
      I like the idea of cloning my own voice and having it speak in a foreign language
    • thatguysaguy 30 days ago
      I'm 100% going to clone my voice and use it on my discord bot.
    • SunlitCat 30 days ago
      Maybe having better real time conversations in computer games. Like game characters saying your name in voiceovers.
    • onel 29 days ago
      Podcast ads

      I used to work in the space and host read ads are the best performers compared to other types of audio ads. Imagine you can automate that ad creation as a podcaster/platform. You just give it the script and it creates a new ad in that voice.

      Now, is this wrong? I think this is a separate discussion

    • layer8 30 days ago
      It lets you use your favorite audiobook reader's voice for all your TTS needs. E.g. you can have HN comments read to you by Patrick Stewart, or by the Honest Trailers voice. Maybe you find that questionable? ;)
      • zdragnar 30 days ago
        So, replacing voice actors with unpaid clones of their voices, effectively stealing their identity.

        The range of use goes from totally harmless fun to downright evil.

        • layer8 30 days ago
          If I take pictures of someone and hang my home with AI-generated copies of those pictures, I’m not stealing their identity.
        • RyanCavanaugh 30 days ago
          The existence of Photoshop doesn't mean that you can put Kobe Bryant on a Wheaties box without paying him. There's no reason that a voice talent's voice can't be subject to the same infringement protections as a screen actor's or athlete's likeness.
          • popalchemist 30 days ago
            You absolutely can put Kobe on a Wheaties box without problems legally, IF you do not sell it. That's "fair use." It has not been tested in court yet, but precedent seems to suggest that creating voice clones for private use is also fair use, ESPECIALLY if that person is a celebrity, because privacy rights are limited for celebrities.
      • johncalvinyoung 30 days ago
        Utterly questionable.
    • IMTDb 30 days ago
      Non robotic screen readers for blind people
      • tonnydourado 30 days ago
        That would be non-evil, sure. But I wonder if blind people even want it? They're already listening to screen readers at insane speeds, up to 6-8x, I think. Do they even care that it doesn't sound "realistic"?
        • blackqueeriroh 30 days ago
          Well, I’m sure the blind readers of HN (which I am certain exist) can answer this question, and you, a sighted person, don’t need to even wonder from your position of unknowing.
          • tonnydourado 30 days ago
            I mean, I explicitly used "wonder" because I don't wanna assume about blind people's experiences and needs. What else should I have done so you wouldn't come in kicking me in the nuts?
            • SamPatt 30 days ago
              In this thread there's a bunch of "non-evil" responses, and your replies are all "I'm skeptical" or just dismissing them outright.

              It appears from the outside that you've decided this is Officially Bad technology and aren't genuinely seeking evidence otherwise.

              • tonnydourado 30 days ago
                You're assuming worse of me than I'm assuming of the technology.

                There's almost no reply here with a use that is a) not somewhat bad and b) has enough of an upside to compensate for the downsides.

                Except maybe this one, but I do know enough about accessibility to know how blind people generally use computers, which is why I asked the question.

    • YoshiRulz 30 days ago
      It could be used by people who can write English fluently, but are slow at speaking it, as a more personal form of text-to-speech.

      Personally, I'm eager to have more control over how my voice assistant sounds.

      • Zambyte 30 days ago
        Similarly, a real-time voice to voice translation system that uses the speakers voice would be really cool.
    • albert_e 30 days ago
      If I am learning new content I can make my own notes and convert them into an audiobook for my morning jog or office commute using my own voice.

      If I am a content creator I can generate content more easily by letting my AI voice narrate my slides say. Yes that is cheap and lower quality than a real narrator who can deliver more effective real talks ...but there is a long tail of mediocre content on every topic. Who cares as long as I am having fun, sharing stuff, and not doing anything illegal or wrong.

    • allannienhuis 30 days ago
      I think better-quality audio content generated from text could be a killer application. As someone else mentioned, pipe in an epub, output an audiobook or video game content. With additional tooling (likely via AI/LLM analysis), this could enable things like dramatic storytelling with specific character voices and dynamics interpreted from the content of the text.

      I can see it empowering solo creators in similar ways that modern music tools enable solo or small-budget musicians today.

      • devinprater 30 days ago
        Or, when it gets fast enough, someone could have their own personal dub of video games (BlazBlue Central Fiction) or TV shows and such.
      • latexr 30 days ago
        > pipe in an epub, output an audiobook or video game content.

        That falls into “replacing voice actors”, mentioned by the OP.

        • blackqueeriroh 30 days ago
          No, it really doesn’t. There are thousands of very smart and talented creators without the budget to hire voice actors. This lets them get a start. AI voices let you lower the barrier to entry, but they won’t replace most voice actors because the higher you go up the stack, the more the demand for real actors will also go up because AI voices aren’t anywhere near being able to replace real voice actors.
          • latexr 30 days ago
            > AI voices let you lower the barrier to entry, but they won’t replace most voice actors because the higher you go up the stack, the more the demand for real actors will also go up

            That is as absurd as saying LLMs are increasing the demand for writers.

            > because AI voices aren’t anywhere near being able to replace real voice actors.

            Even if that were true—which it is not; the current crop is more than adequate to read long texts—it assumes the technology has reached its limit, which is equally absurd.

          • tonnydourado 30 days ago
            As another reply put it, I'm very skeptical that the benefits for small content creators will offset the damage to society as a whole from increased fraud and harassment.
        • albert_e 30 days ago
          What if I want to listen to my notes in my own voice

          Or my favorite books in my own voice.

          Or my lecture notes in my professor's voice.

    • dougmwne 30 days ago
      It seems like it would be great for any kind of voiceover work or any recorded training or presentation. If you want to correct a misspeak or add some information, instead of re-recording the entire segment you could seamlessly update a few words or sentences.
    • bigcoke 30 days ago
      AI girlfriend... ok I'm done.
      • lenerdenator 30 days ago
        It's 2024. Are nerds still trying to turn any technology of sufficient ability into Kelly LeBrock?
        • bigcoke 30 days ago
          this is going to be a real thing for gen z, but replace Kelly with any girl from anime
          • lenerdenator 30 days ago
            Jeeze, I can't imagine why women feel so alienated from the tech industry.

            It's almost as if any time some sort of way to make computers more human-like emerges, the first thing a subset of the men in the space do is think "How can I use this to make a woman who has absolutely no function other than my emotional, practical, and physical gratification?"

            • amenhotep 30 days ago
              Humans in desiring deep emotional and sexual connections with people of their desired gender and being driven to weird behaviours when they can't achieve it in the way you personally approve of shock
              • lenerdenator 30 days ago
                Then work on it. Ask friends for feedback. Go to therapy. Have some damned introspection instead of just reducing 51% of the people on the planet to a bangmaid.
    • Mkengine 30 days ago
      I don't know how stressful my life will be then, but I thought about reading to my kids later and creating audiobooks with my voice for them, for when I am traveling for work, so they can still listen to me "reading" to them.
    • bdcravens 30 days ago
      The first couple of uses I've come up with are training courses at scale, or converting videos with accents you have a hard time understanding into one you can (and there's no one you'll understand better than yourself).
    • wongarsu 30 days ago
      Organized crime should be happy to invest in that. Especially the "indian scam callcenter" type of crime.
    • unraveller 29 days ago
      New uses will be fashioned. Imagine altering the output accent of your assistant on the fly to convey more information, like sarcasm, or a posh accent for assuredness, so no one listening over your shoulder can detect the true private meaning without your permission. We needn't be stuck within the confines of public dialogue.
    • sunshine_reggae 30 days ago
      You forgot plausible deniability, AKA "I never said that".
  • thorum 30 days ago
    OpenVoice currently ranks second-to-last in the Huggingface TTS arena leaderboard, well below alternatives like styletts2 and xtts2:

    https://huggingface.co/spaces/TTS-AGI/TTS-Arena

    (Click the leaderboard tab at the top to see rankings)

    • KennyBlanken 30 days ago
      Having gone through almost ten rounds of the TTS Arena, XTTS2 has tons of artifacts that instantly make it sound non-human. OpenVoice doesn't.

      It wouldn't surprise me if people recognize different algorithms and purposefully promote them over others, or alter the page source with a userscript to see the algorithm before listening and click the one they're trying to promote. Looking at the leaderboard, it's obvious there's manipulation going on, because Metavoice is highly ranked but generates absolutely terrible speech with extremely unnatural pauses.

      Elevenlabs was scarily natural sounding and high quality; the best of the ones I listened to so far. Pheme's speech overall sounds really natural, but the sound quality is terrible, which is probably why it isn't ranked higher. If Pheme could produce higher-quality audio, it'd probably match Elevenlabs.

    • carbocation 30 days ago
      I would like to see the new VoiceCraft model on that list eventually (weights released yesterday, discussion at [1]).

      1 = https://news.ycombinator.com/item?id=39865340

    • m463 28 days ago
      I haven't tried OpenVoice, but I did try WhisperSpeech and it will do the same thing. You can optionally pass in a file with a reference voice, and the TTS uses it.

      https://github.com/collabora/whisperspeech

      I found it kind of creepy hearing it in my own voice. I also tried it with a friend of mine who has a French Canadian accent, and strangely the output didn't have his accent.

    • ckl1810 30 days ago
      Is there a benchmark for compute needed? Curious to see if anyone is building / has built a Zoom filter, or Mobile app, whereby I can speak English, and out comes Chinese to the listener.
    • abdullahkhalids 29 days ago
      The HF TTS Arena asks whether the text-to-speech sounds human-like. That's somewhat different from voice cloning. A model might produce audio that is less human-like but still sounds closer to the target voice.
    • Jackson__ 30 days ago
      As someone who has used the arena maybe ~3 times, the subpar voice quality in the demo linked immediately stood out to me.
    • c0brac0bra 30 days ago
      I'd like to see Deepgram Aura on here.
  • muglug 30 days ago
    It’s funny how a bunch of models use Musk’s voice as a proof of their quality, given how disjointed and staccato he sounds in real life. Surely there are better voices to imitate.
    • iinnPP 30 days ago
      Proving the handling of uncommon speech is definitely a great example to use alongside the other common and uncommon speech examples on the page.
    • m463 27 days ago
      I would imagine folks with really great voices like Morgan Freeman¹ or Don LaFontaine² are already voice actors, and using their voice might be seen as soul stealing (or competing with their career)

      1: https://en.wikipedia.org/wiki/File:Morgan_freeman_bbc_radio4...

      2 https://youtu.be/USrkW_5QPa0

    • ianschmitz 30 days ago
      Especially with all of the crypto scams using Elon’s voice
  • futureshock 30 days ago
    In related news, Voicecraft published their model weights today.

    https://github.com/jasonppy/VoiceCraft

  • smusamashah 30 days ago
    The quality here is good (very good if I can actually run it locally). As per github it looks like we can run it locally.

    https://github.com/myshell-ai/OpenVoice/blob/main/docs/USAGE...

    • 486sx33 30 days ago
      Still a bit robotic but better highs and lows for sure. The Catalog is huge! Thanks for posting
  • andrewstuart 30 days ago
    If someone can come up with a voice cloning product that I can run on my own computer, not in the cloud, and if it's super simple to install and use, then I'll pay.

    I find it hard to understand why so much money is going into ai and so many startups are building ai stuff and such a product does not exist.

    It’s got to run locally because I’m not interested in the restrictions that cloud voice cloning services impose.

    Complete, consumer level local voice cloning = payment.

    • dsign 30 days ago
      I couldn't agree more.

      I've tried some of these ".ai" websites that do voice cloning, and they tend to use the following dark pattern:

      - Demand you create a cloud account before trying.

      - Sometimes, demand you put your credit card before trying.

      - Always: the product is crap. Sometimes it does voice cloning sort of as advertised, but you have to wait for training and queued execution, because cloud GPUs are expensive and they need to manage a queue since it's a cloud product. At least that part could be avoided if they shipped a VST plugin one could run locally, even if it's restricted to NVIDIA GPUs[^2].

      [^1]: To those who say "but the devs must get paid": yes. But subscriptions misalign incentives, and some updates are simply not worth the minutes of productivity lost waiting for their shoehorned installation.

      [^2]: Musicians and creative types are used to spending a lot on hardware and software, and there are inference GPUs that are cheaper than some sample libraries.

      • andrewstuart 30 days ago
        I don’t mind if the software is a subscription it just has to be installable and not spyware garbage.

        Professional consumer level software like a game or productivity app or something.

      • andoando 30 days ago
        I made a voice cloning site: https://voiceshift.ai No login, nothing required. It's a bit limited, but I can add any of the RVC models. Working on a feature to let you upload your own model.

        I can definitely make it a local app.

      • riwsky 30 days ago
        How do you figure subscriptions misalign incentives? The alternative, selling upgrades, incentivizes devs to focus on new shiny shit that teases well. I'd rather they focus on making something I get value out of consistently.
        • dsign 30 days ago
          - A one-off payment makes life infinitely simpler for accounting purposes. In my jurisdiction, a software license owned by the business is an asset, shows up as such on the balance sheet, and can be subject to a depreciation schedule just like any other asset.

          - Mental peace: if product X does what I need right now and I can count on being able to use product X five years from now to do the same thing, then I'm happy to pay a lump sum that I see as an investment. Even better, I feel confident that I can integrate product X into my workflows. I don't get that with a subscription product in the hands of a startup seeking product-market fit.

    • jeroenhd 30 days ago
      RVC does live voice changing with a little latency: https://github.com/RVC-Project/Retrieval-based-Voice-Convers...

      The product isn't exactly spectacular, but most of the work seems to have been done. It just needs someone to go over the UI and make it less unstable, really.

    • rifur13 30 days ago
      Wow perfect timing. I'm working on a sub-realtime TTS (only on Apple M-series silicon). Quality should be on-par or better than XTTS2. Definitely shoot me a message if you're interested.
    • smusamashah 30 days ago
      But this one is supposed to be runnable locally. It has complete instructions on GitHub, including downloading the models, installing Python, setting it up, and running it.
      • andrewstuart 30 days ago
        I'm wanting to download an installer and run it - consumer level software.
    • endisneigh 30 days ago
      I see these types of comments all the time, but the fact is that folks at large who wouldn't use the cloud version won't pay. The kind of person who has a 4090 to run these sorts of models would just figure out how to do it themselves.

      The other issue is that paying for the software once doesn’t capture as much of the value as a pay per use model, thus if you wanted to sell the software you’d either have to say you can only use it for personal use, or make it incredibly expensive to account for the fact that a competitor would just use it.

      Suppose there were such a thing - then folks may complain that it’s not open source. Then it’s open sourced, but then there’s no need to pay.

      In any case, if you’re willing to pay $1000 I’m sure many of us can whip something up for you. Single executable.

      • andoando 30 days ago
        I have a 2070 and it works just fine, as long as you're not doing real-time conversion. You can try it at https://voiceshift.ai if you're curious.
    • washadjeffmad 30 days ago
      I mean this at large, but I just can't get over this "sell me a product" mentality.

      You already don't need to pay; all of this is happening publication to implementation, open and local. Hop on Discord and ask a friendly neon-haired teen to set up TorToiSe or xTTS with cloning for you.

      Software developers and startups didn't create AGI, a whole lot of scientists did. A majority of the services you're seeing are just repackaging and serving foundational work using tools already available to everyone.

      • TuringTest 30 days ago
        I agree, but playing devil's advocate, it's true that people without the time and expertise to set up their own install can find this packaging valuable enough to pay for it.

        It would be better for all if, in Open Source fashion, this software had a FLOSS easy-to-install packaging that provided for basic use cases, and developers made money by adapting it to more specific use cases and toolchains.

        (This one is not FLOSS in the classic sense, of course. The above would be valid for MIT-licensed or GPL models).

      • lancesells 30 days ago
        The answer is convenience. Why use dropbox when you can run Nextcloud? You can say the same thing about large companies. Why does Apple use Slack (or whatever they use) when they could build their own? Why doesn't Stripe build their own data centers?

        If I had a need for an AI voice for a project I would pay the $9 a month, use it, and be done. I might have the skills to set this up on my machine but it would take me hours to get up to speed and get it going. It just wouldn't be worth it.

      • nprateem 30 days ago
        You can extend that reasoning to anything, but time and energy are limited
    • ipsum2 30 days ago
      How much would you pay? I can make it.
      • andrewstuart 30 days ago
        You can’t sell this cause the license doesn’t allow it.
        • pmontra 30 days ago
          "This repository is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which prohibits commercial usage"

          People could pay somebody for the service of setting up the model on their own hardware, then use the model for non commercial usage.

          • GTP 30 days ago
            IANAL, but this looks like a grey area to me: it could be argued that the person/company getting paid to do the setup is using the model commercially.
        • GTP 30 days ago
          Doesn't allow it yet, but on the readme, they write "This will be changed to a license that allows Free Commercial usage in the near future". So someone will soon be able to sell it to you.
        • ipsum2 30 days ago
          Not using this model, but something similar. How much would you pay?
          • ipsum2 30 days ago
            Based on the lack of replies, the answer appears to be $0.
        • ddtaylor 30 days ago
          Bark is MIT licensed for commercial use.
    • palmfacehn 30 days ago
      XTTS2 works well locally. Maybe someone else here can recommend a front end.
    • ddtaylor 30 days ago
      I can show you how to use Bark AI to do voice cloning.
      • rexreed 30 days ago
        What local hardware is needed to run Bark AI? What is the quality? Looking for something as good or better than Eleven Labs.
        • ddtaylor 30 days ago
          It can run on CPU without much issue, takes up a few gigs of RAM, and produces output at roughly realtime speed. If you GPU accelerate, you only need about 8GB of video memory and it will be at least 5X faster.

          Out of the box it's not as good as Eleven Labs based on their demos, but those are likely cherry picked. There are some tunable parameters for the Bark model and most consider the output high enough quality to pass into something else that can do denoising.

      • mdrzn 30 days ago
        Please do!
  • Havoc 30 days ago
    Tried it locally - can't get anywhere near the clone quality of the clips on their site.

    Not even close. Perhaps I'm doing something wrong...

  • joshspankit 30 days ago
    Does anyone know which local models are doing the “opposite”: Identify a voice well enough to do speaker diarization across multiple recordings?
    • Teleoflexuous 30 days ago
      Whisper doesn't, but WhisperX <https://github.com/m-bain/whisperX/> does. I am using it right now and it's perfectly serviceable.

      For reference, I'm transcribing research-related podcasts, meaning speech doesn't overlap a lot, which would be a problem for WhisperX from what I understand. There's also a lot of accents, which are straining for Whisper (though it's also doing well), but surely help WhisperX. It did have issues with figuring out the number of speakers on its own, but that wasn't a problem for my use case.

      • joshspankit 30 days ago
        WhisperX does diarization, but I don’t see any mention of it fulfilling my ask which makes me think I didn’t communicate it well.

        Here’s an example for clarity:

        1. AI is trained on the voice of a podcast host. As a side effect it now (presumably) has all the information it needs to replicate the voice

        2. All the past podcasts can be processed with the AI comparing the detected voice against the known voice which leads to highly-accurate labelling of that person

        3. Probably a nice side bonus: if two people with different registers are speaking over each other the AI could separate them out. “That’s clearly person A and the other one is clearly person C”
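        Step 2 above is at heart an embedding comparison: pull a fixed-size speaker embedding out of each diarized segment and score it against the embedding of the known voice. A minimal sketch of just that matching step, with toy vectors and cosine similarity (a real pipeline would get the embeddings from a speaker-encoder model; the threshold and names here are made up for illustration):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_segments(segments, known_voice, threshold=0.8):
    """Label each diarized segment as the known speaker or 'unknown'.

    segments: list of (segment_id, embedding) pairs.
    known_voice: embedding extracted from the host's training audio.
    """
    labels = {}
    for seg_id, emb in segments:
        score = cosine_sim(emb, known_voice)
        labels[seg_id] = "host" if score >= threshold else "unknown"
    return labels

# Toy 4-dimensional "embeddings"; real ones have hundreds of dimensions.
host = np.array([1.0, 0.2, 0.1, 0.0])
segments = [
    ("00:00-00:30", np.array([0.9, 0.25, 0.1, 0.05])),  # close to host
    ("00:30-01:00", np.array([0.0, 0.1, 1.0, 0.8])),    # a guest
]
print(label_segments(segments, host))
```

        In practice the threshold has to be tuned per embedding model, and overlapping speech (point 3) needs source separation first, but the scoring itself stays about this simple.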

        • c0brac0bra 30 days ago
          You can check out PicoVoice Eagle (paid product): https://picovoice.ai/docs/eagle/

          You pass N number of PCM frames through their trainer and once you reach a certain percentage you can extract an embedding you can save.

          Then you can identify audio against the set of identified speakers and it will return percentage matches for each.

    • Drakim 30 days ago
      On my wishlist would be a local model that can generate new voices based on descriptions such as "rough detective-like hard boiled man" or "old fatherly grampa"
      • mattferderer 30 days ago
        You might be interested in this cool app that Microsoft made that I don't think I've seen anyone talk about anywhere called Speech Studio. https://speech.microsoft.com/

        I don't recall their voices being the most descriptive, but they had a lot. They also let you lay out a bunch of text and have different voices speak each line, just like a movie script.

    • satvikpendem 30 days ago
      Whisper can do diarization but not sure it will "remember" the voices well enough. You might simply have to stitch all the recordings together, run it through Whisper to get the diarized transcript, then process that how you want.
      • beardedwizard 30 days ago
        Whisper does not support diarization. There are a number of projects that try to add it.
    • c0brac0bra 30 days ago
      Picovoice says they do this but it's a paid product. It supposedly runs on the device but you still need a key and have to pay per minute.
  • Fripplebubby 30 days ago
    Really interesting! Reading the paper, it sounds like the core of it is broken into two things:

    1. Encoding speech sounds into an IPA-like representation, decoding IPA-like into target language

    2. Extracting "tone color", removing it from the IPA-like representation, then adding it back in into the target layer (emotion, accent, rhythm, pauses, intonation)

    So as a result, I am a native English speaker, but I could hear "my" voice speaking Chinese with similar tone color to my own! I wonder, if I recorded it, and then did learn to speak Chinese fluently, how similar it would be? I also wonder whether there is some kind of "tone color translator" that is needed to translate the tone color markers of American English into the relevant ones for other languages, how does that work? Or is that already learned as part of the model?
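    As a toy analogy for point 2 (not the paper's actual model, which learns the decomposition): if you picture "tone color" as the per-dimension statistics of a feature sequence, then removing it is normalization and adding it back is re-scaling with the target speaker's statistics. Everything below is made up for illustration:

```python
import numpy as np

def strip_tone_color(features: np.ndarray):
    """Normalize per-dimension stats, leaving a 'content' representation.

    Returns the normalized features plus the removed (mean, std),
    a crude stand-in for a tone-color embedding.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8
    return (features - mean) / std, (mean, std)

def apply_tone_color(content: np.ndarray, tone_color):
    """Re-impose a target speaker's statistics onto the content."""
    mean, std = tone_color
    return content * std + mean

rng = np.random.default_rng(0)
source = rng.normal(loc=2.0, scale=0.5, size=(100, 8))   # source speaker features
target = rng.normal(loc=-1.0, scale=2.0, size=(100, 8))  # target speaker features

content, _ = strip_tone_color(source)      # what was said, tone removed
_, target_tone = strip_tone_color(target)  # how the target sounds
converted = apply_tone_color(content, target_tone)
```

    The real system extracts tone color with a trained encoder and recombines it with a learned decoder, so it captures far more than means and variances, but the remove-then-reapply structure is the same.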

  • burcs 30 days ago
    There's a "vocal fry" aspect to all of these voice cloning tools, a sort of uncanny valley where they can't match tones correctly, or get fully away from this subtle Microsoft Sam-esque breathiness to their voice. I don't know how else to describe it.
    • blackqueeriroh 30 days ago
      Yeah, this is why I’m nowhere near worried about this replacing voice actors for the vast majority of work they currently get paid for.
  • akashkahlon 30 days ago
    So soon every novel can be a movie, made by the author themselves using Sora, with the audio bought from all the suitable actors
    • Multicomp 30 days ago
      I hope so. Then those of us who want to tell a story (writers, whether comic or novelist or short story or screenplay or teleplay or whatever) will be able to compete more and more on the quality and richness of the story itself, not on the current comparative advantage, where media choices are made for most storytellers based on difficulty to render.

      Words on page are easier than still photos, which are easier than animation, which are easier than live-action TV, which are easier than IMAX movies etc.

      If we move all of the rendering of the media into automation, then it's just who can come up with the best story content, and you can render it whatever way you like: book, audiobook, animation, live action TV, web series, movie, miniseries.

      Granted - the AI will come for us writers too; it already is in some cases. Then the Creator Economy itself will be consumed, eventually becoming 'who can meme the fastest' on an industrial scale for daily events on the one end, and who has taken the time to paint / playact / do rendering out in the real world on the other.

      But I sure would love to be able to make a movie out of my unpublished novel, and realistically today, that's impossible in my lifetime. Do I want the entire movie-making industry to die so I and others like me can have that power? No. But if the industry is going to die / change drastically anyways due to forces beyond my control, does that mean I'm not going to take advantage of the ability? Still no.

      IDK. I don't have all the answers to this.

      But yes, this (amazingly accurate voice cloner after a tiny clip?! wow) product is another step towards that brave new world.

    • rcarmo 30 days ago
      This can’t really do a convincing Sean Connery yet.
    • _zoltan_ 30 days ago
      just buy more NVDA. :-)
  • lordofgibbons 30 days ago
    I've noticed that all TTS systems have a "metallic" sound to them. Can this be fixed automatically using some kind of post-processing?
    • huytersd 30 days ago
      Try cutting out some of the highs?
  • duggan 30 days ago
    With a bit of coaxing I managed to get this running on my M2 mac with Python 3.11.

    Updated setup.py (mostly just bumping versions), and demo output: https://gist.github.com/duggan/63b7de9b5f6e8e74fe4b05af64dbe...

    • speedylight 30 days ago
      I barely managed to get one good output (by following the Linux installation instructions) before the 1st Jupyter demo crapped out on me. I didn’t have the patience to investigate the error, but I am very impressed with the quality considering I’m also running it on an M2 laptop.
  • chenxi9649 30 days ago
    I am the most impressed by the cross-lingual voice cloning...

      https://research.myshell.ai/open-voice/zero-shot-cross-lingu... I can only speak to their Dutch -> Chinese voice cloning, but it's better than anything else I've tried. There is basically no English/Dutch accent in the Chinese at all. Whereas the ElevenLabs Chinese voice (cloning or not) is so much worse...

  • ckl1810 30 days ago
  • pantsforbirds 30 days ago
    I wonder if in < 5 years I can make a game with a local LLM + AI TTS to create realistic NPCs. With enough of these tools I think you could make a very cool world-sim type game.
    • rcarmo 30 days ago
      I’m much more interested in the dismal possibility of using this in politics. Nation state actors, too.
      • fennecfoxy 26 days ago
        Will we ever fix the problem of humans in a democracy voting for their team/flag rather than policies? Or humans seeing misinformation and not questioning it in the slightest?

        Those are the real issues, it's a people problem really.

  • jasonjmcghee 30 days ago
    The quality of the output is really fantastic compared with other open source (next best XTTSv2).

    The voice cloning doesn't seem as high quality as other products I've used / seen demos for. Most of the examples match pitch well, but lose the "recognizable" aspect. The Elon one just doesn't sound like Elon, for example- interestingly the Australian accent sounds more like him.

  • yogorenapan 30 days ago
    Note: the open source version is watered down compared to their commercial offering. Tried both out and the quality doesn’t come close.
  • mattizzle81 26 days ago
    Hmm for being so resource intensive that demo doesn't sound that great to me, sounds robotic. Piper will run on an iPhone and sounds about the same.
  • bsenftner 30 days ago
    Is there a service here somewhere? The video mentions lower expense, but I can't find any such service sign up... (ah, all the usage info is all on github)

    Has anyone tried self hosting this software?

  • trollied 30 days ago
    Look up iPhone “personal voice”. People don’t seem to know about it.
  • speedbird 30 days ago
    Not convinced. The second reference has a slight Indian accent that isn’t carried over into the generated samples.

    Training data bias?

    • opdahl 30 days ago
      What are you talking about? I am not noticing it at all.
  • lacoolj 30 days ago
    That season of 24 is coming true
  • treprinum 30 days ago
    Did this just obliterate ElevenLabs?
    • htrp 30 days ago
      Eleven's advantage is being able to have consistent outputs through high quality training data.
  • starwin1159 30 days ago
    Cantonese can't be imitated
  • paraschopra 30 days ago
    yay!
  • nonrandomstring 30 days ago
    I also lost my voice in a bizarre fire breathing accident and urgently need to log into my telephone banking account.

    Can anyone here give me a short list of relatively common ethical use cases for this technology. I'm trying to refute the ridiculous accusation that only deceptive, criminal minded people would have any use for this. Thanks.

    ----

    Edit: Thanks for the many good-faith replies. So far I see these breaking down into;

    Actual disability mitigation (of course I was joking about the circus accident). These are rare but valid. Who wouldn't want their _own_ voice restored?

    Entertainment and games

    Education and translation

    No further comment on the ethics here, but FWIW I'm nervously looking at this having just written a study module on social engineering attacks, stalking, harassment and confidence tricks. :/

    And yes, as bare tech, it is very cool!

    • pmontra 30 days ago
      I want to use (almost) my own voice in an English video without my country's accent?
    • CapsAdmin 30 days ago
      The practical main use case I can think of is entertainment. Games could use it, either dynamically or prerecorded. Amateur videos could also use it for fun.

      Outside of that, more versatile text to speech is generally useful for blind people.

      More emotional and non-robotic narration of text can also be useful for non-blind people on the go.

      • andrewmcwatters 30 days ago
        It would be neat to have your game client locally store a reference sentence on your system and generate voice chat for you at times when you couldn’t speak and could only type.
    • freedomben 30 days ago
      The reason I am looking for something like this is that a friend of mine died of cancer but left some voice samples, and I want to narrate an audiobook for his kids in his voice.

      In general, though, I agree: the legitimate use cases for something like this seem relatively minor compared to the illegitimate ones. However, the technology is here, and simply depriving everyone of it isn't going to stop the scammers, as has already been evidenced. In my opinion, the best thing for us to do is to rapidly get to a place where everybody knows that you cannot trust the voice on the other end anymore, as it could be cloned. Fortunately, the best way to accomplish that is also the same way that we allow average people to benefit from the technology: make it widely available.

      • nonrandomstring 30 days ago
        > In my opinion, the best thing for us to do is to rapidly get to a place where everybody knows that you cannot trust the voice on the other end anymore,

        Strongly agree with this. Sadly I don't think that transition to default distrust of voice will be rapid. We are wired at quite a low level to respond to voice emotionally, which bypasses our rational scepticism and vigilance. That's why this is a rather big win for the tricksters.

    • Larrikin 30 days ago
      Home Assistant is making huge progress in creating an open source version of Alexa, Siri, etc. You can train it to use your voice, but the obvious use is celebrity voices for your home. Alexa had them, then took them away, and refused to refund people.
      • diggan 30 days ago
        > but the obvious use is celebrity voices for your home

        Besides the fact that it seems more like an "entertainment" use case than a "functional" one, is it really ethical to use someone's voice without asking/having the rights to use it?

        Small concern, granted, but parent seems to have specifically asked for ethical use cases.

        • ChrisMarshallNY 30 days ago
          I believe that a number of celebrities (I think Tom Hanks is one) have already sued companies for using deepfakes of their voices. Of course, the next year (in the US) is gonna see a lot of stuff generated by AI.
    • laurentlb 30 days ago
      I think there are lots of applications for good Text-To-Speech.

      Cloning a voice is a way to get lots of new voices to use as TTS.

      I'm personally building a website with stories designed for language learners. I'd like to have a variety of realistic voices in many languages.

    • RobotToaster 30 days ago
      Could be combined with translation to automatically create dubs for videos/tv/etc.
    • corobo 30 days ago
      I imagine Stephen Hawking would have found a use for this had it been available before everyone got used to his computer speaking voice. Anything that may cause someone to lose their ability to speak along the lines of your example.

      Another might be for placeholdering - you could use an array of (licensed and used appropriately) voices to pitch a TV show, film, radio show, podcast, etc to give people a decent idea of how it would sound to get financing and hire actual people to make the real version. Ofc you'll need an answer to "why don't we just use these AI voices in the actual production?" from people trying to save a few quid.

      Simple one- for fun. I'm considering AI cloning my voice and tinkering around until I find something useful to do with it. Maybe in my will I'll open source my vocal likeness as long as it's only to be used commercially as the voice of a spaceship's main computer or something. I'll be a sentence or two factoid in some Wikipedia article 300 years from now, haha.

      Universal translator - if an AI can replicate my voice it could have me speak all sorts of languages in real-time.. sucks to be a human translator admittedly in this use case. Once the tech is fully ironed out and reliable we could potentially even get rid of "official" languages (eg you have to speak fluent English to be an airline pilot - heck of a learning curve on top of learning to be a pilot if you're from a country that doesn't teach English by default!)

      I dunno if it'd be a weird uncanny valley thing, I wonder how an audiobook would sound reading a book in my own voice - unless I'm fully immersed in fiction that's generally how I take in a book, subvocalising with my voice in my head - maybe it'd help things bed in a bit better if it's my own voice reading it to me? If so I wonder how fast I could have AI-me read and still be able to take in the content with decent recall.. Might have to test this one!

      Splintering off the audiobook idea - I wonder if you could help people untrain issues with their speaking in this manner? Like would hearing a non-stuttering version of their voice help someone with a stutter? I am purely in the land of hypothesis at this stage, but might be worth trying! Even if it doesn't help in that way, the person with a stutter would at least have a fallback voice if they're having a bad day of it :)

      E: ooh, having an AI voice and pitch shifting it may help in training your voice to sound different, as you'd have something to aim for - "I knew I could do it because I heard it being done" sort of theory. The first example that popped into my head was someone transitioning between genders and wanting to adjust their voice to match the change.

      I imagine there's other fields where this may be useful too - like if you wanted a BBC news job and need to soften out your accent (if they still require Received Pronunciation, idk)

      Admittedly I could probably come up with more abuse cases than use cases if I put my mind to it, but wanted to stick to the assignment :)

      • mywacaday 30 days ago
        Charlie Bird, a very well-known Irish journalist and broadcaster who recently passed away from motor neuron disease, went through the process of getting a digitized version of his voice done as part of a TV program, as he was losing his voice rapidly at the time. The result was very good, as they had a large body of his news reports to train the model on. Most Irish people would be very familiar with his voice, and the digitized version was very convincing. I would imagine something like this would be great for people who don't have a huge volume of recordings to work off. A short video by the company that provided the tablet with his voice is here: https://www.youtube.com/watch?v=UGjJHVUyi0M
    • RyanCavanaugh 30 days ago
      I'd really like to make some video content (on-screen graphics + voice), but the thought of doing dozens of voice takes and learning to use editing software is really putting me off from it. I'd really rather just write a transcript, polish it until I'm satisfied with it, and then have the computer make the audio for me.

      I'll probably end up just using OpenAI TTS since it's good enough, but if it could be my actual voice, I'd prefer that.

    • idle_zealot 30 days ago
      It's mostly interesting to me for artistic applications, like voicing NPC or video dialog, or maybe as a digital assistant voice. Being able to clone existing voices would be useful for parody or fanworks, but I suspect that it is also possible to mix aspects of multiple voices to synthesize new ones to taste.
    • raudette 30 days ago
      For creating games/entertainment/radio drama, allows 1 person to voice act multiple roles
    • _agt 30 days ago
      At my university, we’re using this tech to insert minor corrections into lecture recordings (with instructor’s consent of course). Far more efficient than bringing them into a studio for a handful of words, also less disruptive to content than overlaid text.
    • napkin 30 days ago
      I’m currently using xtts2 to make language learning more exciting, by training models on speakers I wish to emulate. I’m really into voices, and this has helped tremendously for motivation when learning German.
    • serbrech 30 days ago
      On the fly speech translation but in the voice of the speaker
      • 7373737373 30 days ago
        Or a different voice if the voice of the speaker or the way they talk is annoying
    • BriggyDwiggs42 30 days ago
      Idk but its kinda cool
  • smashah 30 days ago
    Terrifying.
    • riskable 30 days ago
      I know, right‽ Soon everything is going to be AI-enabled and our toothbrushes will be singing us Happy Birthday!
      • smashah 29 days ago
        Both of those things already exist