String.length() vs String.getBytes().length in Java

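In Java, String.length() returns the length of the string in char values (UTF-16 code units), while String.getBytes().length returns the number of bytes needed to represent the string in a given encoding. With no argument, getBytes() uses the platform's default charset. Historically that was taken from the system property file.encoding (for example UTF-8 or Cp1252); since JDK 18 the default is UTF-8 unless overridden. The property is read at JVM startup, so set it on the command line with -Dfile.encoding=XXX; calling System.setProperty("file.encoding", "XXX") at runtime is not a reliable way to change the default. In many cases String.length() returns the same value as String.getBytes().length, but in some cases it does not.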
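To see which charset the no-argument getBytes() will use on a given machine, a small check like the following can help. This is only a sketch; the class name DefaultCharsetDemo and the sample string are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultCharsetDemo {
    /** True when the no-argument getBytes() matches encoding with the default charset. */
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Both values depend on the JVM version and launch options.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        // "caf\u00E9" is "café"; the escape keeps the source file ASCII-only.
        System.out.println(usesDefaultCharset("caf\u00E9")); // true
    }
}
```

Running it with and without -Dfile.encoding=... on the command line shows whether the option affects the default on your JDK.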

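String.length() is the number of UTF-16 code units needed to represent the string. That is, it is the number of char values used to represent the string (and so equals toCharArray().length). This is usually the same as the number of Unicode characters (code points) in the string, except when UTF-16 surrogates are used. Since a char is 2 bytes in Java, encoding the string as UTF-16 without a byte order mark (UTF-16BE or UTF-16LE) yields exactly 2 times String.length() bytes. Java's plain "UTF-16" charset prepends a 2-byte byte order mark when encoding, so a string such as "abc" takes 8 bytes rather than 6.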
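The difference between code units and code points can be checked directly. The sketch below uses U+1F600 as an arbitrary character outside the Basic Multilingual Plane; the class and method names are illustrative only.

```java
final class CodeUnitsVsCodePoints {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a valid surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        String emoji = new String(Character.toChars(0x1F600));
        String mixed = "a" + emoji;

        System.out.println(codeUnits(emoji));   // 2
        System.out.println(codePoints(emoji));  // 1
        System.out.println(codeUnits(mixed));   // 3
        System.out.println(codePoints(mixed));  // 2
        System.out.println(mixed.toCharArray().length == mixed.length()); // true
    }
}
```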

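String.getBytes().length is the number of bytes needed to represent the string in the platform's default encoding. For example, if the default encoding were UTF-16BE, it would be exactly 2 times the value returned by String.length(). For UTF-8, the two lengths may differ.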
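The "2 times" relationship depends on which UTF-16 variant is used, because some variants write a byte order mark. A minimal sketch, with an illustrative class name, that passes the charset explicitly instead of relying on the default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf16ByteCount {
    /** Encoded size of s in the given charset. */
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "abc"; // length() == 3

        // UTF-16BE and UTF-16LE write no byte order mark: exactly 2 bytes per char.
        System.out.println(byteCount(s, StandardCharsets.UTF_16BE)); // 6
        System.out.println(byteCount(s, StandardCharsets.UTF_16LE)); // 6

        // Plain UTF-16 prepends a 2-byte byte order mark when encoding.
        System.out.println(byteCount(s, StandardCharsets.UTF_16));   // 8
    }
}
```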

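If the string contains only ASCII characters and the encoding is ASCII-compatible (UTF-8, ISO-8859-1, Cp1252 and so on), the relationship is simple: both return the same value. For non-ASCII characters it is more complicated. Outside ASCII, String.getBytes().length is often larger in UTF-8, because it counts the bytes needed to encode each character (up to three bytes per char), whereas length() counts 2-byte code units regardless of their encoded size.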
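A quick comparison across charsets makes the ASCII and non-ASCII cases concrete. The sample strings and the class name below are chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiVsNonAscii {
    /** Encoded size of s in the given charset. */
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String cjk = "\u4E2D";      // a CJK ideograph in the BMP

        // ASCII only: byte count equals length() in ASCII-compatible charsets.
        System.out.println(ascii.length());                                // 5
        System.out.println(byteCount(ascii, StandardCharsets.UTF_8));      // 5
        System.out.println(byteCount(ascii, StandardCharsets.ISO_8859_1)); // 5

        // Non-ASCII: each string is one char, but the byte count varies.
        System.out.println(byteCount(accented, StandardCharsets.UTF_8));      // 2
        System.out.println(byteCount(accented, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteCount(cjk, StandardCharsets.UTF_8));           // 3
    }
}
```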

Next, we use UTF-8 encoding as an example to illustrate this relationship.

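import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LengthDemo {
    public static void main(String[] args) {
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        char c = 65504; // U+FFE0, FULLWIDTH CENT SIGN

        System.out.println("c = " + c);

        String s = new String(new char[]{c});
        System.out.println("s = " + s);
        System.out.println("s.length = " + s.length());

        // An explicit charset avoids depending on the platform default
        // and needs no checked-exception handling.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes = " + Arrays.toString(bytes));
        System.out.println("bytes.length = " + bytes.length);
    }
}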

Let's first look at the output:

file.encoding = UTF-8
c = ￠
s = ￠
s.length = 1
bytes = [-17, -65, -96]
bytes.length = 3

From the output, we can see that String.getBytes() returns three bytes. Why? c is 65504, which is 0xFFE0 in hexadecimal (1111 1111 1110 0000 in binary). According to the UTF-8 definition:

Bits	First code point	Last code point	Bytes	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+0000	U+007F	1	`0xxxxxxx`
11	U+0080	U+07FF	2	`110xxxxx`	`10xxxxxx`
16	U+0800	U+FFFF	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
21	U+10000	U+1FFFFF	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
26	U+200000	U+3FFFFFF	5	`111110xx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
31	U+4000000	U+7FFFFFFF	6	`1111110x`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

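Note: the original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, RFC 3629 restricted UTF-8 to end at U+10FFFF, to match the constraints of the UTF-16 character encoding. This removed all 5-byte and 6-byte sequences, and about half of the 4-byte sequences.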
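The U+10FFFF limit is visible from Java itself. The following sketch, with an illustrative class and helper name, encodes the largest code point and the first supplementary code point:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MaxCodePointUtf8 {
    /** UTF-8 bytes of a single code point. */
    static byte[] utf8Of(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Java's own upper bound matches the RFC 3629 limit.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff

        // The largest code point is F4 8F BF BF in UTF-8, shown as signed bytes.
        System.out.println(Arrays.toString(utf8Of(Character.MAX_CODE_POINT))); // [-12, -113, -65, -65]

        // The first code point past the BMP already needs four bytes.
        System.out.println(utf8Of(0x10000).length); // 4
    }
}
```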

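Storing the value 65504 in UTF-8 takes three bytes. The 16 bits are split 4 + 6 + 6 and placed into the pattern 1110xxxx 10xxxxxx 10xxxxxx, which gives: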

11101111 10111111 10100000

Read as signed (two's complement) Java byte values, these three bytes are:

-17 -65 -96

That is why we get the output above.
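The same bit manipulation can be done by hand. The sketch below covers only the three-byte row of the table; encodeThreeBytes is a hypothetical helper written for this illustration, not a JDK method.

```java
import java.util.Arrays;

final class ThreeByteUtf8 {
    /** Encodes a non-surrogate char in U+0800..U+FFFF into its three UTF-8 bytes. */
    static byte[] encodeThreeBytes(char c) {
        if (c < 0x800 || Character.isSurrogate(c)) {
            throw new IllegalArgumentException("not a three-byte character");
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (c & 0x3F))         // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] bytes = encodeThreeBytes((char) 0xFFE0);
        System.out.println(Arrays.toString(bytes)); // [-17, -65, -96]

        // Mask with 0xFF to print each byte as an unsigned bit pattern.
        for (byte b : bytes) {
            System.out.println(Integer.toBinaryString(b & 0xFF));
        }
        // prints 11101111, 10111111, 10100000 on separate lines
    }
}
```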

Next, let's look at the JDK implementation of this conversion. In Java 8 it is in the class sun.nio.cs.UTF8 (UTF8.java).

public int encode(char[] sa, int sp, int len, byte[] da) {
    int sl = sp + len;
    int dp = 0;
    int dlASCII = dp + Math.min(len, da.length);

    // ASCII only optimized loop
    while (dp < dlASCII && sa[sp] < '\u0080')
        da[dp++] = (byte) sa[sp++];

    while (sp < sl) {
        char c = sa[sp++];
        if (c < 0x80) {
            // Have at most seven bits
            da[dp++] = (byte)c;
        } else if (c < 0x800) {
            // 2 bytes, 11 bits
            da[dp++] = (byte)(0xc0 | (c >> 6));
            da[dp++] = (byte)(0x80 | (c & 0x3f));
        } else if (Character.isSurrogate(c)) {
            if (sgp == null)
                sgp = new Surrogate.Parser();
            int uc = sgp.parse(c, sa, sp - 1, sl);
            if (uc < 0) {
                if (malformedInputAction() != CodingErrorAction.REPLACE)
                    return -1;
                da[dp++] = repl;
            } else {
                da[dp++] = (byte)(0xf0 | ((uc >> 18)));
                da[dp++] = (byte)(0x80 | ((uc >> 12) & 0x3f));
                da[dp++] = (byte)(0x80 | ((uc >>  6) & 0x3f));
                da[dp++] = (byte)(0x80 | (uc & 0x3f));
                sp++;  // 2 chars
            }
        } else {
            // 3 bytes, 16 bits
            da[dp++] = (byte)(0xe0 | ((c >> 12)));
            da[dp++] = (byte)(0x80 | ((c >>  6) & 0x3f));
            da[dp++] = (byte)(0x80 | (c & 0x3f));
        }
    }
    return dp;
}

Before Java 8, the code was in sun.io.CharToByteUTF8.java.

 public int convert(char[] input, int inOff, int inEnd,
	       byte[] output, int outOff, int outEnd)
throws ConversionBufferFullException, MalformedInputException
{
	char inputChar;
	byte[] outputByte = new byte[6];
	int inputSize;
	int outputSize;

	charOff = inOff;
	byteOff = outOff;

	if (highHalfZoneCode != 0) {
	    inputChar = highHalfZoneCode;
	    highHalfZoneCode = 0;
	    if (input[inOff] >= 0xdc00 && input[inOff] <= 0xdfff) {
		// This is legal UTF16 sequence.
		int ucs4 = (highHalfZoneCode - 0xd800) * 0x400
		    + (input[inOff] - 0xdc00) + 0x10000;
		output[0] = (byte)(0xf0 | ((ucs4 >> 18)) & 0x07);
		output[1] = (byte)(0x80 | ((ucs4 >> 12) & 0x3f));
		output[2] = (byte)(0x80 | ((ucs4 >> 6) & 0x3f));
		output[3] = (byte)(0x80 | (ucs4 & 0x3f));
		charOff++;
		highHalfZoneCode = 0;
	    } else {
		// This is illegal UTF16 sequence.
		badInputLength = 0;
		throw new MalformedInputException();
	    }
	}

	while(charOff < inEnd) {
	    inputChar = input[charOff];
	    if (inputChar < 0x80) {
		outputByte[0] = (byte)inputChar;
		inputSize = 1;
		outputSize = 1;
	    } else if (inputChar < 0x800) {
		outputByte[0] = (byte)(0xc0 | ((inputChar >> 6) & 0x1f));
		outputByte[1] = (byte)(0x80 | (inputChar & 0x3f));
		inputSize = 1;
		outputSize = 2;
	    } else if (inputChar >= 0xd800 && inputChar <= 0xdbff) {
		// this is a high-half zone code (high surrogate) in UTF-16
		if (charOff + 1 >= inEnd) {
		    highHalfZoneCode = inputChar;
		    break;
		}
		// check next char is valid 
		char lowChar = input[charOff + 1];
		if (lowChar < 0xdc00 || lowChar > 0xdfff) {
		    badInputLength = 1;
		    throw new MalformedInputException();
		}
		int ucs4 = (inputChar - 0xd800) * 0x400 + (lowChar - 0xdc00)
		    + 0x10000;
		outputByte[0] = (byte)(0xf0 | ((ucs4 >> 18)) & 0x07);
		outputByte[1] = (byte)(0x80 | ((ucs4 >> 12) & 0x3f));
		outputByte[2] = (byte)(0x80 | ((ucs4 >> 6) & 0x3f));
		outputByte[3] = (byte)(0x80 | (ucs4 & 0x3f));
		outputSize = 4;
		inputSize = 2;
	    } else {
		outputByte[0] = (byte)(0xe0 | ((inputChar >> 12)) & 0x0f);
		outputByte[1] = (byte)(0x80 | ((inputChar >> 6) & 0x3f));
		outputByte[2] = (byte)(0x80 | (inputChar & 0x3f));
		inputSize = 1;
		outputSize = 3;
	    } 
	    if (byteOff + outputSize > outEnd) {
		throw new ConversionBufferFullException();
	    }
	    for (int i = 0; i < outputSize; i++) {
		output[byteOff++] = outputByte[i];
	    }
	    charOff += inputSize;
	}
	return byteOff - outOff;
}

This code performs the operations described in the table above.
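The branch structure of those encoders can be reused to compute the UTF-8 size of a string without allocating a byte array. This is a simplified sketch, not JDK code; utf8Length is a hypothetical helper, and it assumes unpaired surrogates are replaced by a single '?' byte, which is how String.getBytes behaves for UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    /** Number of bytes String.getBytes(UTF_8) would produce for s. */
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                      // 7 bits
            } else if (c < 0x800) {
                bytes += 2;                      // 11 bits
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                      // surrogate pair: one code point
                i++;                             // consume the low surrogate too
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                      // assumption: replaced by '?'
            } else {
                bytes += 3;                      // 16 bits
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a\u00E9\uFFE0" + new String(Character.toChars(0x1F600));
        System.out.println(utf8Length(s));                             // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 10
    }
}
```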

For other charsets, you can work out which bytes String.getBytes() produces in the same way.

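One takeaway from this post: be careful when calling String.getBytes().length and expecting it to equal String.length(), especially when the application does low-level byte manipulation such as encrypting and decrypting data.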
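As one illustration of that pitfall, consider a length-prefixed binary frame. The sketch below uses a made-up framing format and class name; the point is only that the prefix and the buffer size come from the encoded byte count, never from String.length().

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedString {
    /** Frames s as a 4-byte big-endian byte count followed by its UTF-8 bytes. */
    static byte[] encode(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length);
        buf.putInt(payload.length); // byte count, not s.length()
        buf.put(payload);
        return buf.array();
    }

    /** Reads back a frame produced by encode. */
    static String decode(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\uFFE0"; // one char, three UTF-8 bytes
        byte[] framed = encode(s);
        System.out.println(framed.length);            // 7
        System.out.println(decode(framed).equals(s)); // true
    }
}
```

Sizing the buffer with s.length() instead would allocate 5 bytes here and fail with a BufferOverflowException on the put.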
