Base 65536

(github.com)

162 points | by leoh 1308 days ago

21 comments

  • tomxor 1308 days ago
    Similarly, we do a lot of Unicode bit packing on Dwitter.net, since the code size limit is in characters (140), not bytes.

    There are a few compressors that increase the number of usable ASCII chars, the best allowing 194 chars including the decoder; the compressors are themselves dweets!: https://www.dwitter.net/d/11852 This is not the most efficient way to store bits in Unicode, but there is a balance on Dwitter because you need to fit the decoder within the same 140 characters.

    There are also some more specialist methods, such as packing image or other binary data and decoding it with string.charCodeAt() (e.g. https://www.dwitter.net/d/12808). This is not limited to exact bit word sizes; for instance, I packed this 3 color image efficiently by encoding it numerically in base 3, at ~1.6 bits per pixel. Decoding just requires division by powers of 3 and mod 3 (a generalization of bitwise shifting and masking): https://www.dwitter.net/d/19166
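
    A rough sketch of that base-3 unpacking idea (not the actual dweet code; the function name is made up):

      // Each 8-bit char code carries 5 base-3 pixels (3^5 = 243 <= 256).
      // Repeated division by 3 plus mod 3 is the base-3 analogue of
      // bit-shifting and masking.
      function unpackPixels(packed) {
        const pixels = [];
        for (let i = 0; i < packed.length; i++) {
          let v = packed.charCodeAt(i);
          for (let p = 0; p < 5; p++) {
            pixels.push(v % 3); // pixel color: 0, 1 or 2
            v = (v - v % 3) / 3;
          }
        }
        return pixels;
      }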

    [edit] I should explain: that last example combines the general code compressor with the image bit-packing technique. The image is encoded as 5 pixels (3 color) per 8-bit character, but doesn't use the higher bits of Unicode, so that the whole code can then be encoded using the above compressor, which fits two chars of up to 8 bits each into one UTF-16 character and wraps the result in the decoder.
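
    And a sketch of the "two chars per UTF-16 character" part (made-up names, and ignoring details like surrogate ranges that a robust version would need to avoid):

      // Pack pairs of <=8-bit chars into single UTF-16 code units and back.
      const pack = s => s.replace(/[\s\S]{2}/g,
        p => String.fromCharCode((p.charCodeAt(0) << 8) | p.charCodeAt(1)));
      const unpack = s => s.split('').map(c =>
        String.fromCharCode(c.charCodeAt(0) >> 8, c.charCodeAt(0) & 255)).join('');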

    You can see the source by replacing "eval" with "throw".

    • srtjstjsj 1307 days ago
      How about adding standard decompressors to the Dwitter JS environment so they can be executed with less code?
      • tomxor 1307 days ago
        Do you mean to add the decompressor wrapper automatically so that more UTF-16 packed code can fit?

        That would effectively increase the limit to 280 usable ASCII chars, i.e. a double dweet. This has been discussed before, but I don't think the authors will implement it.

        There are a few other JS golfing sites that have multiple size brackets, and I believe they are less active as a result. Although Dwitter is not exactly popular, I think this clear limitation is what keeps it focused and the community actively engaged. The 194-byte hack is a slight weakening of that, but that's often reflected in the votes: people are less impressed when you've resorted to that trick unless it was really worth it... It also feels like a fair compromise because you cannot, for instance, use emojis inside that compression method, so it's not entirely free.

  • bazzargh 1308 days ago
    For those looking for motivation: you can see examples of the related Base2048 encoding used in tweets to the BBC Micro Bot Twitter account: https://www.dompajak.com/bbcmicrobot.html Encoding and decoding (to learn from others' coding tricks) is easy with the editor: https://editor.8bitkick.cc/?compress=true

    Without these tools I'd never have managed to fit this in a tweet: https://twitter.com/bbcmicrobot/status/1248312174054948864

  • ithkuil 1308 days ago
    Initially designed to pack more data in a tweet. Thanks to this article I learned about:

    https://blog.twitter.com/engineering/en_us/topics/insights/2...

    • laurent92 1308 days ago
      So tweet lengths differ according to the Unicode range being used. But Germans are still cheated compared to English, which is quite a concise language. (Of course, the article focuses on Japanese because of the huge difference from Latin languages.)
  • userbinator 1308 days ago
    Base65536 uses only "safe" Unicode code points - no unassigned code points, no whitespace, no control characters, etc.

    The string encodes two bytes per code point.

    Is it really 64k then? It is hard to understand how this works without documentation on the format.

    I wonder how many others were able to mentally decode the example input string just on sight, from having worked with ASCII extensively.

    • Sniffnoy 1308 days ago
      It encodes two bytes per code point; the encoding of that code point may be longer than two bytes. (Indeed it mentions that it's optimized for UTF-32.) It accomplishes having the full 2^16, while only using safe characters, by not restricting to the BMP.

      If you read the details of how it was constructed, you can see that the reason the 2^17 version isn't ready yet is that there aren't yet 2^17 safe characters in Unicode! (There are a total of 123,813 characters in Unicode 13 that are judged "safe" by qntm's criteria.)

    • Dylan16807 1308 days ago
      A code point is just a number. (Well, a number below 1.1 million.)

      The simplest way to turn two bytes into a code point is to output the same number.

      But some code points are invalid, so you have to skip them. The simplest way is to split your 64k into two ranges. For example numbers 0-50k can get code points 1k-51k, and numbers 50k-64k can get code points 70k-84k.

      Base65536 wants to avoid code points that are likely to misbehave, so it uses about ten different ranges of code points, carefully selected to avoid problems (and to look visually cohesive).
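
      A toy version of that simplified two-range scheme (illustrative numbers only, not the real Base65536 tables):

        // Map a 16-bit value into one of two code point ranges and back.
        const toCodePoint = v => v < 50000 ? v + 1000 : v - 50000 + 70000;
        const fromCodePoint = cp => cp < 70000 ? cp - 1000 : cp - 70000 + 50000;

        // Two bytes per code point, as in the real thing (no handling here
        // for a trailing odd byte, which the real format covers with its
        // 256 extra "padding" code points).
        const encode = bytes => {
          let out = '';
          for (let i = 0; i < bytes.length; i += 2)
            out += String.fromCodePoint(toCodePoint((bytes[i] << 8) | bytes[i + 1]));
          return out;
        };
        const decode = str => {
          const bytes = [];
          for (const ch of str) {
            const v = fromCodePoint(ch.codePointAt(0));
            bytes.push(v >> 8, v & 255);
          }
          return bytes;
        };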

    • masklinn 1308 days ago
      > Is it really 64k then?

      There are significantly more than 64k possible codepoints (the total codepoint space has about 1.1 million possible values) so why not? In baseX the "X" is the number of symbols in the encoded output e.g. base32 uses 32 separate output symbols and base64 uses 64.

      The problem was finding 64k codepoints which are mapped, tweet-safe (e.g. not control characters), and which twitter only counts for 1.

      > It is hard to understand how this works without documentation on the format.

      The concept is explained in a separate document: https://qntm.org/safe

    • amptorn 1308 days ago
      It's actually 65,792 code points total: 65,536 plus 256 for "padding" characters if the input has an odd number of octets.
  • octoberfranklin 1308 days ago
    https://github.com/qntm/base65536#why

    > Why?

    > Erm.

    > I wanted people to be able to share HATETRIS replays via Twitter.

    My brain is melting.

    • yebyen 1307 days ago
      So I went ahead and allowed myself to ask, "why is it called HATETRIS" and tried it out without reading anything first.

      This has been a wonderful rollercoaster, thank you for sharing and pointing this out :D I only played the game for 30 seconds before I could already guess exactly why it's called that!

      • lisper 1307 days ago
        HATETRIS is actually a surprisingly engaging challenge, particularly since it has no time constraints. After a few minutes I found I enjoyed it at least as much if not more than regular Tetris. Figuring out how to make all the pieces appear makes quite the juicy little puzzle, as does figuring out how to maximize the number of cleared lines.
        • yebyen 1307 days ago
          I was not able to clear more than 2 lines per game. I would like to become good at this game. I'm not sure it's possible. But the prospect of a community that shares hatetris replays is intriguing...
    • codetrotter 1307 days ago
      ИටฎਕذඥʣݹஐఢƔݹமقݩܥआƣະࡆ౫ටࠃݹԫइາকԥටݧऍ௨ටຜݹߝƐໃԓஸҨȣݓ௨ʈȣџƝۇȥɞୡइไऎਉضƖݹ௨پЈஆjതਈࡨ௬ࡃࠆٽߤಈϼஅԬݑȣ1

      Wonder if HN comments will include all of the characters in this replay or if any of them are removed, since HN strips most emojis, for example.

      Edit: Yup, all of the characters in this replay were included.

  • Slartie 1308 days ago
    I love this! Idea, execution, everything.

    This reminds me of the time when I encoded a binary database of several megabytes, with indices and everything, in the 53 integer bits of huge arrays of 64-bit floating point numbers, because that was the only way to use that data in the Lua runtime environment of a computer game's UI that didn't expose direct file system access to UI plugins.
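
    Sketched in JS rather than Lua (both use IEEE doubles, and the names here are made up), the idea is roughly:

      // Any integer up to 2^53 - 1 is exact in a 64-bit float, so 6 bytes
      // (48 bits) fit comfortably in each number.
      const packBytes = bytes => {
        const nums = [];
        for (let i = 0; i < bytes.length; i += 6) {
          let n = 0;
          for (let j = 0; j < 6; j++) n = n * 256 + (bytes[i + j] || 0);
          nums.push(n);
        }
        return nums;
      };
      // Read byte k back out by dividing by the right power of 256.
      const byteAt = (nums, k) =>
        Math.floor(nums[Math.floor(k / 6)] / 256 ** (5 - k % 6)) % 256;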

  • toastal 1308 days ago
    I want to marvel at the characters and languages our species has created, and that we've also found a way to encode and display them on computers.
  • kps 1307 days ago
    > maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.

    Oh, that's why some carefully composed tweets don't fit — things like General Punctuation and Currency Symbols count double. I think I'm up to the Gillette5 version of Hanlon's Razor where Twitter is concerned.

  • lifthrasiir 1308 days ago
    Superseded by Base2048 [1] by now.

    [1] https://github.com/qntm/base2048

    • thih9 1308 days ago
      Assuming the goal is to transmit data via Twitter.

      Other environments may treat characters from U+1100 upwards without penalty.

      See: “Rationale” section at: https://github.com/qntm/base2048#rationale

      > following some experimentation and examination of the new web client code, we now understand that maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.

      • lifthrasiir 1308 days ago
        You are right that Base2048 is tailor-made for Twitter, but I think so is Base65536.

        More specifically, the explicit length limitation is easily one of the weirdest aspects of Twitter. It makes the Twitter experience somewhat like that of IRC, facilitating knee-jerk reactions and misunderstandings. And the limit is severe even for casual conversation, to the point that it had to be changed once emojis broke its core assumptions. I don't think Base65536 has much use outside of Twitter and the like.

        • Dylan16807 1308 days ago
          Base65536 is useful anywhere people are pasting codes to each other with sizes around 20-1000 bytes. Taking up half as much screen space as base64 is a big feature.
  • ignoramous 1307 days ago
    I needed something similar for a javascript project and stumbled upon this excellent library by pieroxy that worked flawlessly: https://pieroxy.net/blog/pages/lz-string/index.html

    base65536 looks pretty similar to that, except that base65536 is not so much focused on compression as on "copy-paste-ability".

    Shoco is another excellent library that comes to mind for shorter strings: https://ed-von-schleck.github.io/shoco/

  • outsomnia 1308 days ago
    Mightn't the raw data be a good candidate for Huffman coding?

    https://en.wikipedia.org/wiki/Huffman_coding

    • lifthrasiir 1308 days ago
      Base65536 is not a compression algorithm but rather an encoding scheme that can convey any data, compressed or not. Compression algorithms have an implicit input of the list of output symbols, which is commonly, but not necessarily, octets (0 through 255). If your output format is the tweet, you will eventually end up with something like Base65536 for the output symbols.
      • GuB-42 1308 days ago
        The article mentions compression briefly. They are using basic run-length encoding.
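
        (A minimal sketch of what run-length encoding over a move string could look like; just an illustration, not the author's code:)

          // Collapse runs of identical moves into (count, move) pairs.
          const rle = moves => moves.match(/(.)\1*/g).map(r => [r.length, r[0]]);
          // rle("DDDDLLLRRRRRRD") -> [[4,"D"],[3,"L"],[6,"R"],[1,"D"]]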

        There are probably better algorithms. GP suggested Huffman but I don't think it will work. Something like the algorithm used by crinkler [1] (context modeling + arithmetic coding) may work better. It can be made even better if combined with the encoding (the base doesn't have to be a power of two).

        Anyway, no need to overdo it; the author's technique works, at least for the newer Base2048 version, optimized for long tweets.

        [1] https://xoofx.com/blog/2010/12/28/crinkler-secrets-4k-intro-...

    • ziml77 1308 days ago
      If they went out of their way to make base65536, I'd be surprised if they didn't evaluate at least a few compression schemes. It's mentioned in the readme that they are using RLE so it seems reasonable to assume that their data tends to have long runs that compress well with RLE.
    • Dylan16807 1308 days ago
      The HATETRIS moves? You could try, but because Huffman coding works at the bit level and there are only 4 moves, it's going to have absolutely awful efficiency. You'd want arithmetic coding at least.
      • recursive 1307 days ago
        Left and right seem like they'd be a lot more common than whatever the other two moves are, so maybe a 1-bit code could be assigned to one of those, and still do better than RLE.
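
        Something like this, say (the move names and frequencies are guesses, and it's a fixed skewed prefix code rather than true Huffman):

          // 1 bit for the guessed-most-common move, more for the rest.
          const CODE = { L: '0', R: '10', D: '110', U: '111' };
          const encodeMoves = moves => [...moves].map(m => CODE[m]).join('');
          // encodeMoves("LLRLD") -> "00100110": 8 bits for 5 moves vs 10 at a flat 2 bits each.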
    • redrobein 1308 days ago
      How would this be considered more efficient if you need an additional character to mark each character?
  • iflp 1308 days ago
    Make sure to check out the game. Best tetris ever.

    https://qntm.org/files/hatetris/hatetris.html

  • frou_dh 1308 days ago
    On the other end of the scale, we have base-1, e.g. scratching a row of 1's on a surface as a tally.

    I guess base-0 can't really exist, since 0⁰ is undefined.

  • samcheng 1308 days ago
    Javascript really has come a long way.

    I can remember when hacks like this were in CPAN, and then in pip. Now npm?!

  • praptak 1308 days ago
    "To encode one additional bit per code point, we need to double the number of code points we use from 65,536 to 131,072."

    I'd argue that going into fractional bits per code point would not make the project significantly geekier than it already is :-)
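
    (For reference, with the 123,813 "safe" code points mentioned elsewhere in the thread, the rate would already be fractional:)

      Math.log2(65536)  // 16 bits per code point
      Math.log2(123813) // ~16.92 bits per code point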

  • Zamicol 1307 days ago
    Related: An "arbitrary base converter" that treats all strings as numbers: https://convert.zamicol.com/
  • p2t2p 1308 days ago
    Reminds me of Jira. Jira (at least server and data centre editions) uses base 64 numbers to impose global ordering onto issues.
  • captn3m0 1308 days ago
    Has anyone tried to tweet compressed data in any of these encodings?

    Wondering if there are any famous pieces of text that would fit.

  • spiritplumber 1308 days ago
    I like Base1. Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  • intricatedetail 1308 days ago
    Using a platform to do something it was not designed for could constitute hacking and therefore get you into legal trouble.