In Java, String.length() returns the number of characters (char values) in the string, while String.getBytes().length returns the number of bytes needed to represent the string in a given encoding. With no argument, getBytes() uses the platform's default charset, which is derived from the system property file.encoding and can be chosen at JVM startup, for example with -Dfile.encoding=UTF-8 or -Dfile.encoding=Cp1252 (calling System.setProperty("file.encoding", "XXX") after startup has no effect, because the default charset is read once during JVM initialization). In many cases String.length() and String.getBytes().length return the same value, but in some cases they do not.
String.length() is the number of UTF-16 code units needed to represent the string, that is, the number of char values used to store it (and therefore equal to toCharArray().length). This is usually the same as the number of Unicode characters (code points) in the string, except when UTF-16 surrogate pairs are used. Since a char is two bytes in Java, String.getBytes().length will be roughly twice String.length() if the encoding is UTF-16.
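This matters for characters outside the Basic Multilingual Plane, which need a surrogate pair (two char values) per code point. A minimal sketch (the G clef character below is just an arbitrary supplementary-plane example):

    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so it is stored as a surrogate pair
    String clef = new String(Character.toChars(0x1D11E));

    System.out.println(clef.length());                          // 2 -- UTF-16 code units (char values)
    System.out.println(clef.codePointCount(0, clef.length()));  // 1 -- Unicode code points
    System.out.println(clef.toCharArray().length);              // 2 -- same as length()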
String.getBytes().length is the number of bytes needed to represent the string in the platform's default encoding (or in whatever charset is passed to getBytes()). For example, if the default encoding were UTF-16, it would be about twice the value returned by String.length(); Java's "UTF-16" charset also prepends a two-byte byte order mark. For UTF-8, the two lengths are often not the same.
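As a quick sanity check of the UTF-16 case (a sketch; "UTF-16BE" and "UTF-16LE" write no byte order mark, while "UTF-16" does):

    import java.nio.charset.StandardCharsets;

    String s = "AB";  // two BMP characters

    System.out.println(s.length());                                    // 2
    System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 4 = 2 * length(), no BOM
    System.out.println(s.getBytes(StandardCharsets.UTF_16).length);    // 6 = 2 * length() + 2-byte BOM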
The relationship between the two values is simple if the string contains only ASCII characters: they are equal. For non-ASCII characters it is more involved: String.getBytes().length is usually larger, because it counts the bytes needed to encode the string, while length() counts two-byte UTF-16 code units.
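A quick sketch of both cases, passing StandardCharsets.UTF_8 explicitly so the result does not depend on the platform default:

    import java.nio.charset.StandardCharsets;

    String ascii = "hello";
    System.out.println(ascii.length());                                    // 5
    System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);     // 5 -- one byte per ASCII character

    String accented = "héllo";  // 'é' is U+00E9
    System.out.println(accented.length());                                 // 5
    System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);  // 6 -- 'é' takes two bytes in UTF-8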
Next, let's dig deeper into UTF-8 to see exactly where the extra bytes come from.
    try {
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        char c = 65504;
        System.out.println("c = " + c);
        String s = new String(new char[]{c});
        System.out.println("s = " + s);
        System.out.println("s.length = " + s.length());
        byte[] bytes = s.getBytes("UTF-8");
        System.out.println("bytes = " + Arrays.toString(bytes));
        System.out.println("bytes.length = " + bytes.length);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
Let's see what the output is first:
    file.encoding = UTF-8
    c = ￠
    s = ￠
    s.length = 1
    bytes = [-17, -65, -96]
    bytes.length = 3
From the output we can see that String.getBytes() returns three bytes. Why is that? c is 65504, which is 0xFFE0 in hexadecimal (1111 1111 1110 0000 in binary). Looking at the definition of UTF-8:
Bits | First | Last | Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Note: The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.
It takes three bytes to store the value 65504 in UTF-8: its 16 bits, 1111 1111 1110 0000, are split into 1111, 111111, and 100000 and filled into the pattern 1110xxxx 10xxxxxx 10xxxxxx. The three bytes are:
11101111 10111111 10100000
The integer values of these three bytes, interpreted as signed (two's complement) Java bytes, are:
-17 -65 -96
That is why we get the output above.
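We can reproduce those values by hand with the shift-and-mask arithmetic from the three-byte row of the table (a small sketch for the character 65504 used above):

    char c = 65504;  // 0xFFE0, in the U+0800..U+FFFF range, so three bytes are needed

    byte b1 = (byte) (0xE0 | (c >> 12));          // 1110xxxx <- top 4 bits    -> 11101111
    byte b2 = (byte) (0x80 | ((c >> 6) & 0x3F));  // 10xxxxxx <- middle 6 bits -> 10111111
    byte b3 = (byte) (0x80 | (c & 0x3F));         // 10xxxxxx <- low 6 bits    -> 10100000

    System.out.println(b1 + " " + b2 + " " + b3); // -17 -65 -96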
Next, let's look at the JDK implementation of this conversion. In Java 8 it is in the sun.nio.cs.UTF_8 class.
    public int encode(char[] sa, int sp, int len, byte[] da) {
        int sl = sp + len;
        int dp = 0;
        int dlASCII = dp + Math.min(len, da.length);

        // ASCII only optimized loop
        while (dp < dlASCII && sa[sp] < '\u0080')
            da[dp++] = (byte) sa[sp++];

        while (sp < sl) {
            char c = sa[sp++];
            if (c < 0x80) {
                // Have at most seven bits
                da[dp++] = (byte) c;
            } else if (c < 0x800) {
                // 2 bytes, 11 bits
                da[dp++] = (byte) (0xc0 | (c >> 6));
                da[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (Character.isSurrogate(c)) {
                if (sgp == null)
                    sgp = new Surrogate.Parser();
                int uc = sgp.parse(c, sa, sp - 1, sl);
                if (uc < 0) {
                    if (malformedInputAction() != CodingErrorAction.REPLACE)
                        return -1;
                    da[dp++] = repl;
                } else {
                    da[dp++] = (byte) (0xf0 | ((uc >> 18)));
                    da[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                    da[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                    da[dp++] = (byte) (0x80 | (uc & 0x3f));
                    sp++;  // 2 chars
                }
            } else {
                // 3 bytes, 16 bits
                da[dp++] = (byte) (0xe0 | ((c >> 12)));
                da[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                da[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return dp;
    }
Prior to Java 8, the code was in the sun.io.CharToByteUTF8 class.
    public int convert(char[] input, int inOff, int inEnd,
                       byte[] output, int outOff, int outEnd)
            throws ConversionBufferFullException, MalformedInputException {
        char inputChar;
        byte[] outputByte = new byte[6];
        int inputSize;
        int outputSize;

        charOff = inOff;
        byteOff = outOff;

        if (highHalfZoneCode != 0) {
            inputChar = highHalfZoneCode;
            highHalfZoneCode = 0;
            if (input[inOff] >= 0xdc00 && input[inOff] <= 0xdfff) {
                // This is legal UTF16 sequence.
                int ucs4 = (highHalfZoneCode - 0xd800) * 0x400
                           + (input[inOff] - 0xdc00) + 0x10000;
                output[0] = (byte) (0xf0 | ((ucs4 >> 18)) & 0x07);
                output[1] = (byte) (0x80 | ((ucs4 >> 12) & 0x3f));
                output[2] = (byte) (0x80 | ((ucs4 >> 6) & 0x3f));
                output[3] = (byte) (0x80 | (ucs4 & 0x3f));
                charOff++;
                highHalfZoneCode = 0;
            } else {
                // This is illegal UTF16 sequence.
                badInputLength = 0;
                throw new MalformedInputException();
            }
        }

        while (charOff < inEnd) {
            inputChar = input[charOff];
            if (inputChar < 0x80) {
                outputByte[0] = (byte) inputChar;
                inputSize = 1;
                outputSize = 1;
            } else if (inputChar < 0x800) {
                outputByte[0] = (byte) (0xc0 | ((inputChar >> 6) & 0x1f));
                outputByte[1] = (byte) (0x80 | (inputChar & 0x3f));
                inputSize = 1;
                outputSize = 2;
            } else if (inputChar >= 0xd800 && inputChar <= 0xdbff) {
                // this is in UTF-16
                if (charOff + 1 >= inEnd) {
                    highHalfZoneCode = inputChar;
                    break;
                }
                // check next char is valid
                char lowChar = input[charOff + 1];
                if (lowChar < 0xdc00 || lowChar > 0xdfff) {
                    badInputLength = 1;
                    throw new MalformedInputException();
                }
                int ucs4 = (inputChar - 0xd800) * 0x400
                           + (lowChar - 0xdc00) + 0x10000;
                outputByte[0] = (byte) (0xf0 | ((ucs4 >> 18)) & 0x07);
                outputByte[1] = (byte) (0x80 | ((ucs4 >> 12) & 0x3f));
                outputByte[2] = (byte) (0x80 | ((ucs4 >> 6) & 0x3f));
                outputByte[3] = (byte) (0x80 | (ucs4 & 0x3f));
                outputSize = 4;
                inputSize = 2;
            } else {
                outputByte[0] = (byte) (0xe0 | ((inputChar >> 12)) & 0x0f);
                outputByte[1] = (byte) (0x80 | ((inputChar >> 6) & 0x3f));
                outputByte[2] = (byte) (0x80 | (inputChar & 0x3f));
                inputSize = 1;
                outputSize = 3;
            }

            if (byteOff + outputSize > outEnd) {
                throw new ConversionBufferFullException();
            }
            for (int i = 0; i < outputSize; i++) {
                output[byteOff++] = outputByte[i];
            }
            charOff += inputSize;
        }
        return byteOff - outOff;
    }
This code does exactly what the table above describes.
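Note the surrogate handling in both implementations: a valid surrogate pair (two char values) is combined into a single code point and written as one four-byte sequence. A quick sketch showing that path (the emoji is just an arbitrary supplementary-plane example):

    import java.nio.charset.StandardCharsets;

    String emoji = new String(Character.toChars(0x1F600));  // U+1F600, outside the BMP

    System.out.println(emoji.length());                                 // 2 -- stored as a surrogate pair
    System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);  // 4 -- one four-byte UTF-8 sequence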
For other charsets, you can follow the same approach to work out which bytes String.getBytes() will return.
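For example, here is a sketch comparing UTF-8 with windows-1252 for a single accented character; it assumes the windows-1252 charset is available in your JVM (it is in standard JDK distributions):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    String s = "é";  // U+00E9

    System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));          // [-61, -87] -> 0xC3 0xA9, two bytes
    System.out.println(Arrays.toString(s.getBytes(Charset.forName("windows-1252")))); // [-23]      -> 0xE9, one byte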
One takeaway from this post: be careful when you call String.getBytes().length and expect it to equal String.length(), especially when your application does low-level byte operations such as data encryption and decryption.
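In practice, a simple way to avoid surprises is to pass the charset explicitly instead of relying on the platform default, and to size buffers by the byte array rather than by String.length(). A minimal sketch of that pattern:

    import java.nio.charset.StandardCharsets;

    String message = "héllo";

    // Explicit charset: the byte count is the same on every platform, regardless of file.encoding
    byte[] payload = message.getBytes(StandardCharsets.UTF_8);

    // payload.length (6) != message.length() (5) -- size buffers by bytes, not by chars
    System.out.println(message.length() + " chars, " + payload.length + " bytes");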