The length of a string can be interpreted in several ways:

  • number of chars (16-bit UTF-16 code units) in the string
  • number of characters (Unicode code points) in the string
  • number of bytes in the string, which depends on the encoding

String.length() returns exactly the number of chars in the string.

However, a char is not necessarily a complete character. Why?
Unicode includes supplementary characters, whose code points lie above the Basic Multilingual Plane: their values are greater than 0xFFFF and extend all the way up to 0x10FFFF. A single 16-bit char cannot hold such a value.

In Java, these supplementary characters are represented as surrogate pairs, pairs of char units that fall in a specific range. The leading or high surrogate value is in the 0xD800 through 0xDBFF range. The trailing or low surrogate value is in the 0xDC00 through 0xDFFF range.
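To make the surrogate ranges concrete, here is a minimal sketch that builds one supplementary character and inspects its two chars. The class name SurrogatePairDemo and the helper startsSurrogatePair are made up for this illustration; U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient sample code point.

```java
class SurrogatePairDemo {
    // Hypothetical helper: true when the char at index is a high surrogate
    // immediately followed by a low surrogate.
    static boolean startsSurrogatePair(String s, int index) {
        return index + 1 < s.length()
                && Character.isHighSurrogate(s.charAt(index))
                && Character.isLowSurrogate(s.charAt(index + 1));
    }

    public static void main(String[] args) {
        // U+1D11E is above 0xFFFF, so it needs two chars in a Java String.
        String clef = new String(Character.toChars(0x1D11E));

        // Print both code units in hex to compare against the ranges above.
        System.out.printf("high=%04X low=%04X%n",
                (int) clef.charAt(0), (int) clef.charAt(1));
        System.out.println("surrogate pair at 0: " + startsSurrogatePair(clef, 0));
    }
}
```

The first char should land in the 0xD800 through 0xDBFF range and the second in the 0xDC00 through 0xDFFF range.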

J2SE 5.0 added a String method, codePointCount(int beginIndex, int endIndex), which tells you how many Unicode code points lie between the two indices. The index values refer to code unit (char) positions, so for the entire String, beginIndex is 0 and endIndex is the String's length.

As before, get the number of chars:
int charLength = myString.length();

Then count the code points:
int characterLength = myString.codePointCount(0, charLength);

Unless your text contains supplementary characters (for example rare CJK ideographs, historic scripts, or emoji), charLength and characterLength will be equal. Software that is internationalized or accepts arbitrary user input should not assume they are.
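The two counts can be compared side by side. The following is only a sketch: the class name LengthDemo and the sample strings are assumptions for illustration, while length() and codePointCount() are the methods discussed above.

```java
class LengthDemo {
    // Number of chars (UTF-16 code units).
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points over the whole string.
    static int characterLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String plain = "hello";
        // "a", then U+1D11E (a supplementary character), then "b"
        String mixed = "a" + new String(Character.toChars(0x1D11E)) + "b";

        System.out.println(charLength(plain) + " chars, "
                + characterLength(plain) + " code points");   // 5 and 5
        System.out.println(charLength(mixed) + " chars, "
                + characterLength(mixed) + " code points");   // 4 and 3
    }
}
```

For the plain string the counts agree; for the mixed one the supplementary character takes two chars but counts as a single code point.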

So how many bytes are in a String?
int byteCount = myString.getBytes().length;

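getBytes() encodes the String's Unicode characters using the platform's default charset and returns the result as a byte array. That default is often a legacy charset, although it can also be UTF-8, which is a multibyte encoding of Unicode rather than a legacy charset. Because the byte count depends on the charset, name one explicitly, for example myString.getBytes("UTF-8"), when you need a predictable result.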
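Since the byte count depends on the encoding, it can help to see the same string measured under several charsets. This is a sketch under assumptions: the class name ByteCountDemo and its helper method are invented for the example, and it uses the StandardCharsets constants, which require Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    // Hypothetical helper: bytes needed to encode s in the given charset.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "naïve" written with an escape so the source file encoding does not matter
        String word = "na\u00efve";

        System.out.println("chars:      " + word.length());                                   // 5
        System.out.println("ISO-8859-1: " + byteCount(word, StandardCharsets.ISO_8859_1));    // 5
        System.out.println("UTF-8:      " + byteCount(word, StandardCharsets.UTF_8));         // 6
        System.out.println("UTF-16BE:   " + byteCount(word, StandardCharsets.UTF_16BE));      // 10
    }
}
```

UTF_16BE is used here rather than UTF_16 because the latter prepends a two-byte byte order mark when encoding, which would inflate the count.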

Hat tip: Joconner