Java String Compression – How to Compress a String to a Fixed Length in Java

algorithmcompressionjavastring

I need to achieve string compression to a certain length.

The input is a string consisting of 3 to 16 characters, which can include the English alphabet in lowercase, numbers (0-9), and underlines (or regex: ^[a-z0-9_]{3,16}$).

The English alphabet in any case, numbers, underlines, brackets, and other various utf8 characters can be used to compress the original string. The compressed string must support the reverse process: from the compressed string to the original string.

In my case, the specific length of the compressed string is 6. However, I do not know if it is possible to compress a 16-character string into a "6-character code". If this is not possible, then I suggest increasing it to 7-8 characters.

Example:
input: bestnickname
output: D_1(oP

I've been trying to find various algorithms to solve this problem, but I haven't found one.

Best Answer

If, instead of UTF-8 (as specified in the question), the OP really wants to encode into UTF-16, which is how characters are encoded in Java (since the type char is referenced in the comments and the question is tagged "Java"), then indeed the specified 3 to 16 symbols where each symbol takes one of 37 values can be encoded into six 16-bit values constrained to valid UTF-16, with no surrogate values required, i.e. the old UCS-2. In fact, only 14 bits need to be encoded into each UTF-16 character, so a subset of characters can be chosen to be more readable.

To implement the encoder, first choose 16,384 Unicode characters within the valid code points 0x0000 to 0xffff (which excludes 0xd800 to 0xdfff) to permit in the encoded form. Make a map of 0..16383 to your chosen Unicode characters. Then assign an integer to each 3 to 16 37-value sequence. The calculation can be done using the BigInteger class. This result is an 84-bit integer. Break the 84-bit integer into six 14-bit pieces (fits perfectly!), and write the corresponding code point for each of the 14-bit pieces from your map.