String.length() vs String.getBytes().length in Java

  Pi Ke        2015-04-01 22:22:23       115,219        0    

In Java, String.length() is to return the number of characters in the string, while String.getBytes().length is to return the number of bytes to represent the string with the specified encoding. By default, the encoding will be the value of system property file.encoding, the encoding name can be set manually as well by calling System.setProperty("file.encoding", "XXX"). For example, UTF-8, Cp1252. In many cases, String.length() will return the same value as String.getBytes().length, but in some cases it's not the same.

String.length() is the number of UTF-16 code units needed to represent the string. That is, it is the number of char values that are used to represent the string (thus equal to toCharArray().length). This is usually the same as the number of unicode characters (code points) in the string - except when UTF-16 surrogates are used. Since char is 2 bytes in Java, so String.getBytes().length will be 2x of String.length() if the encoding is UTF-16.

String.getBytes().length is the number of bytes needed to represent the string in the platform's default encoding. For example, if the default encoding was UTF-16, it would be exactly 2x the value returned by String.length(). For UTF-8, the two lengths may not be the same.

The relationship between these two lengths is simple if the string contains only ASCII codes as they will return the same value. But for non-ASCII characters, the relationship is a bit complicated. Outside of ASCII strings, String.getBytes().length is likely to be longer, as it counts bytes needed to represent the string, while length() counts 2-byte code units.

Next we will take encoding UTF-8 as an example to illustrate this relationship.

try{
	System.out.println("file.encoding = "+System.getProperty("file.encoding"));
	char c = 65504;
	
	System.out.println("c = "+c);
	
	String s = new String(new char[]{c});
	System.out.println("s = "+s);
	System.out.println("s.length = "+s.length());
	
	byte[] bytes = s.getBytes("UTF-8");
	System.out.println("bytes = "+Arrays.toString(bytes));
	System.out.println("bytes.length = "+bytes.length);
} catch (Exception ex){
	ex.printStackTrace();
}

Let's see what the output is first:

file.encoding = UTF-8
c = ï¿ 
s = ï¿ 
s.length = 1
bytes = [-17, -65, -96]
bytes.length = 3

From the output, we can see the String.getBytes() returns three bytes. How is this happening? Since c is 65504, in hexdecial is 0xFFE0(1111 1111 1110 0000). Based on the UTF-8 definition, we can see:

Bits First
Last
Bytes Byte 1Byte 2Byte 3Byte 4Byte 5Byte 6
  7 U+0000 U+007F 1 0xxxxxxx
11 U+0080 U+07FF 2 110xxxxx 10xxxxxx
16 U+0800 U+FFFF 3 1110xxxx 10xxxxxx 10xxxxxx
21 U+10000 U+1FFFFF 4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
26 U+200000 U+3FFFFFF 5 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31 U+4000000 U+7FFFFFFF 6 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Note : The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

It needs three bytes o store this value 65504 in UTF-8. The three bytes are :

11101111 10111111 10100000

The integer value of this three bytes in two's complement form are :

-17 -65 -96

That's why we are getting this above output.

Next let's see the JDK implementation of this conversion. It's in the sun.nio.cs.UTF8.java class in Java 8.

public int encode(char[] sa, int sp, int len, byte[] da) {
    int sl = sp + len;
    int dp = 0;
    int dlASCII = dp + Math.min(len, da.length);

    // ASCII only optimized loop
    while (dp < dlASCII && sa[sp] < '\u0080')
        da[dp++] = (byte) sa[sp++];

    while (sp < sl) {
        char c = sa[sp++];
        if (c < 0x80) {
            // Have at most seven bits
            da[dp++] = (byte)c;
        } else if (c < 0x800) {
            // 2 bytes, 11 bits
            da[dp++] = (byte)(0xc0 | (c >> 6));
            da[dp++] = (byte)(0x80 | (c & 0x3f));
        } else if (Character.isSurrogate(c)) {
            if (sgp == null)
                sgp = new Surrogate.Parser();
            int uc = sgp.parse(c, sa, sp - 1, sl);
            if (uc < 0) {
                if (malformedInputAction() != CodingErrorAction.REPLACE)
                    return -1;
                da[dp++] = repl;
            } else {
                da[dp++] = (byte)(0xf0 | ((uc >> 18)));
                da[dp++] = (byte)(0x80 | ((uc >> 12) & 0x3f));
                da[dp++] = (byte)(0x80 | ((uc >>  6) & 0x3f));
                da[dp++] = (byte)(0x80 | (uc & 0x3f));
                sp++;  // 2 chars
            }
        } else {
            // 3 bytes, 16 bits
            da[dp++] = (byte)(0xe0 | ((c >> 12)));
            da[dp++] = (byte)(0x80 | ((c >>  6) & 0x3f));
            da[dp++] = (byte)(0x80 | (c & 0x3f));
        }
    }
    return dp;
}

Prior to Java 8, the code is in sun.io.CharToByteUTF8.java.

 

 public int convert(char[] input, int inOff, int inEnd,
	       byte[] output, int outOff, int outEnd)
throws ConversionBufferFullException, MalformedInputException
{
	char inputChar;
	byte[] outputByte = new byte[6];
	int inputSize;
	int outputSize;

	charOff = inOff;
	byteOff = outOff;

	if (highHalfZoneCode != 0) {
	    inputChar = highHalfZoneCode;
	    highHalfZoneCode = 0;
	    if (input[inOff] >= 0xdc00 && input[inOff] <= 0xdfff) {
		// This is legal UTF16 sequence.
		int ucs4 = (highHalfZoneCode - 0xd800) * 0x400
		    + (input[inOff] - 0xdc00) + 0x10000;
		output[0] = (byte)(0xf0 | ((ucs4 >> 18)) & 0x07);
		output[1] = (byte)(0x80 | ((ucs4 >> 12) & 0x3f));
		output[2] = (byte)(0x80 | ((ucs4 >> 6) & 0x3f));
		output[3] = (byte)(0x80 | (ucs4 & 0x3f));
		charOff++;
		highHalfZoneCode = 0;
	    } else {
		// This is illegal UTF16 sequence.
		badInputLength = 0;
		throw new MalformedInputException();
	    }
	}

	while(charOff < inEnd) {
	    inputChar = input[charOff];
	    if (inputChar < 0x80) {
		outputByte[0] = (byte)inputChar;
		inputSize = 1;
		outputSize = 1;
	    } else if (inputChar < 0x800) {
		outputByte[0] = (byte)(0xc0 | ((inputChar >> 6) & 0x1f));
		outputByte[1] = (byte)(0x80 | (inputChar & 0x3f));
		inputSize = 1;
		outputSize = 2;
	    } else if (inputChar >= 0xd800 && inputChar <= 0xdbff) {
		// this is  in UTF-16
		if (charOff + 1 >= inEnd) {
		    highHalfZoneCode = inputChar;
		    break;
		}
		// check next char is valid 
		char lowChar = input[charOff + 1];
		if (lowChar < 0xdc00 || lowChar > 0xdfff) {
		    badInputLength = 1;
		    throw new MalformedInputException();
		}
		int ucs4 = (inputChar - 0xd800) * 0x400 + (lowChar - 0xdc00)
		    + 0x10000;
		outputByte[0] = (byte)(0xf0 | ((ucs4 >> 18)) & 0x07);
		outputByte[1] = (byte)(0x80 | ((ucs4 >> 12) & 0x3f));
		outputByte[2] = (byte)(0x80 | ((ucs4 >> 6) & 0x3f));
		outputByte[3] = (byte)(0x80 | (ucs4 & 0x3f));
		outputSize = 4;
		inputSize = 2;
	    } else {
		outputByte[0] = (byte)(0xe0 | ((inputChar >> 12)) & 0x0f);
		outputByte[1] = (byte)(0x80 | ((inputChar >> 6) & 0x3f));
		outputByte[2] = (byte)(0x80 | (inputChar & 0x3f));
		inputSize = 1;
		outputSize = 3;
	    } 
	    if (byteOff + outputSize > outEnd) {
		throw new ConversionBufferFullException();
	    }
	    for (int i = 0; i < outputSize; i++) {
		output[byteOff++] = outputByte[i];
	    }
	    charOff += inputSize;
	}
	return byteOff - outOff;
}

 

This piece of code is doing what is described in the above table.

For other character set, you can follow the same way to figure out what the bytes are when calling String.getBytes().

One takeaway from this post is that be careful when trying to call String.getBytes().length and expects it's be the same as String.length(), especially when there are low level byte operations in your application, for example, data encryption and decryption.

JAVA  STRING  ENCODING  SAMPLE  UTF8 

       

  RELATED


  0 COMMENT


No comment for this article.



  RANDOM FUN

After looking away for a few seconds