Class BOCSU

java.lang.Object
org.graalvm.shadowed.com.ibm.icu.impl.coll.BOCSU

public class BOCSU extends Object

Binary Ordered Compression Scheme for Unicode

Users are strongly encouraged to read the ICU paper on BOCU before attempting to use this class.

BOCU is used to compress unicode text into a stream of unsigned bytes. For many kinds of text the compression compares favorably to UTF-8, and for some kinds of text (such as CJK) it does better. The resulting bytes will compare in the same order as the original code points. The byte stream does not contain the values 0, 1, or 2.

One example of a use of BOCU is in org.graalvm.shadowed.com.ibm.icu.text.Collator#getCollationKey(String) for a RuleBasedCollator object with collation strength IDENTICAL. The result CollationKey will consist of the collation order of the source string followed by the BOCU result of the source string.

Unlike a UTF encoding, BOCU-compressed text is not suitable for random access.

Method: Slope Detection
Remember the previous code point (initial 0). For each code point in the string, encode the difference with the previous one. Similar to a UTF, the length of the byte sequence is encoded in the lead bytes. Unlike a UTF, the trail byte values may overlap with lead/single byte values. The signedness of the difference must be encoded as the most significant part.

We encode differences with few bytes if their absolute values are small. For correct ordering, we must treat the entire value range -10ffff..+10ffff in ascending order, which forbids encoding the sign and the absolute value separately. Instead, we split the lead byte range in the middle and encode non-negative values going up and negative values going down.

For very small absolute values, the difference is added to a middle byte value for single-byte encoded differences. For somewhat larger absolute values, the difference is divided by the number of byte values available, the modulo is used for one trail byte, and the remainder is added to a lead byte avoiding the single-byte range. For large absolute values, the difference is similarly encoded in three bytes. (Syn Wee, I need examples here.)

BOCU does not use byte values 0, 1, or 2, but uses all other byte values for lead and single bytes, so that the middle range of single bytes is as large as possible.

Note that the lead byte ranges overlap some, but that the sequences as a whole are well ordered. I.e., even if the lead byte is the same for sequences of different lengths, the trail bytes establish correct order. It would be possible to encode slightly larger ranges for each length (>1) by subtracting the lower bound of the range. However, that would also slow down the calculation. (Syn Wee, need an example).

For the actual string encoding, an optimization moves the previous code point value to the middle of its Unicode script block to minimize the differences in same-script text runs. (Syn Wee, need an example.)

Since:
release 2.2, May 3rd 2002
  • Method Details

    • writeIdenticalLevelRun

      public static int writeIdenticalLevelRun(int prev, CharSequence s, int i, int length, ByteArrayWrapper sink)
      Encode the code points of a string as a sequence of byte-encoded differences (slope detection), preserving lexical order.

      Optimize the difference-taking for runs of Unicode text within small scripts:

      Most small scripts are allocated within aligned 128-blocks of Unicode code points. Lexical order is preserved if "prev" is always moved into the middle of such a block.

      Additionally, "prev" is moved from anywhere in the Unihan area into the middle of that area. Note that the identical-level run in a sort key is generated from NFD text - there are never Hangul characters included.