How Do Java Strings Handle Unicode Characters? - 杰瑞科技汇

Let's look closely at how Java handles Unicode strings. This is a fundamental concept in Java, and understanding it is key to writing robust, internationalized applications.


The Core Idea: Java Strings are Unicode

At its heart, a String in Java is not just an array of bytes. It is an immutable sequence of 16-bit char values, each of which is one UTF-16 code unit. Most characters occupy a single char, but some need two.

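This is the most important takeaway: the String API exposes text as UTF-16 code units. (Since Java 9, the JVM may store Latin-1-only strings more compactly behind the scenes, but length(), charAt(), and indexing still behave as UTF-16.)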
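To make the "sequence of code units" idea concrete, here is a minimal sketch that prints every char of a string as four hex digits. The class and method names (`CodeUnitDump`, `dump`) are made up for illustration, and the sample string is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
class CodeUnitDump {
    // Formats each UTF-16 code unit (char) of s as four uppercase hex digits.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A', the Euro sign (U+20AC), and U+1F60A written as its two code units
        String s = "A\u20AC\uD83D\uDE0A";
        System.out.println(dump(s)); // 0041 20AC D83D DE0A
    }
}
```

The last two values belong to a single character, which is why the string has four chars but only three code points.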

The `char` Data Type and the BMP

char size: A char in Java is a primitive type that occupies 2 bytes (16 bits).
Basic Multilingual Plane (BMP): Unicode was originally designed as a 16-bit code with room for 65,536 (2¹⁶) code points. That range, U+0000 to U+FFFF, is now known as the Basic Multilingual Plane. It includes:
- Latin, Greek, Cyrillic alphabets
- The most common Chinese, Japanese, and Korean (CJK) ideographs
- A few older symbol-style emoji such as ☺ (U+263A) and ❤ (U+2764); most emoji live outside the BMP
- Many special symbols.

Since a 16-bit char can represent any code point in the BMP, a one-to-one mapping between a Java char and a Unicode code point held for as long as Unicode itself fit in 16 bits.

Example (BMP Character):


char euroSymbol = '€'; // The Euro sign is in the BMP
System.out.println(euroSymbol); // Prints €
System.out.println((int) euroSymbol); // Prints 8364, its Unicode code point
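Whether a given code point fits in one char can be checked programmatically. The sketch below uses `Character.isBmpCodePoint` and `Character.charCount` from the standard library; the wrapper class `PlaneCheck` and its `describe` method are hypothetical names used only for this illustration, and the method assumes it is given a valid code point.

```java
class PlaneCheck {
    // Describes a valid code point: which range it is in and how many chars it needs.
    static String describe(int codePoint) {
        String plane = Character.isBmpCodePoint(codePoint) ? "BMP" : "supplementary";
        return String.format("U+%04X %s, %d char(s)",
                codePoint, plane, Character.charCount(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x20AC));  // U+20AC BMP, 1 char(s)
        System.out.println(describe(0x1F60A)); // U+1F60A supplementary, 2 char(s)
    }
}
```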

The Problem: Supplementary Characters (Beyond the BMP)

As the Unicode standard grew, it surpassed the 65,536-code-point limit of the BMP. To accommodate many more characters (historical scripts, rare symbols, most emoji, etc.), the Unicode Consortium extended the code space up to U+10FFFF and introduced supplementary characters.

These characters have code points that are greater than U+FFFF (i.e., they require more than 16 bits to represent).

How does Java (with its 16-bit char) handle this?

Java uses a clever mechanism called UTF-16 surrogate pairs.


A single supplementary character is represented by two consecutive char values.
The first char is the high surrogate (in the range \uD800 to \uDBFF).
The second char is the low surrogate (in the range \uDC00 to \uDFFF).
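The mapping between a supplementary code point and its two surrogates is plain bit arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The following is a sketch of that calculation for illustration; `SurrogateMath`, `encode`, and `decode` are hypothetical names, and in real code the library methods `Character.toChars` and `Character.toCodePoint` do the same job. The sketch assumes its input is a supplementary code point (above U+FFFF).

```java
import java.util.Arrays;

class SurrogateMath {
    // Splits a supplementary code point into its high and low surrogate.
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;                  // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));      // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));     // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a high and low surrogate into the original code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F60A);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE0A
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F60A))); // true
        System.out.println(Integer.toHexString(decode(pair[0], pair[1])));   // 1f60a
    }
}
```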

Example (Supplementary Character - Emoji): The "Smiling Face with Smiling Eyes" emoji (😊) has a Unicode code point of U+1F60A.

// The code point for the smiling face emoji
int codePoint = 0x1F60A;
// This character CANNOT be stored in a single char
// char emoji = '😊'; // Does not compile: this literal needs TWO chars
// To get the number of chars (code units) in the string
String emoji = "😊";
System.out.println("String length in chars: " + emoji.length()); // Prints 2, not 1!
// To get the actual number of characters (code points)
System.out.println("String length in code points: " + emoji.codePointCount(0, emoji.length())); // Prints 1

This distinction is crucial and a common source of bugs.
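A typical instance of such a bug is cutting a string at a char index that falls between the two halves of a surrogate pair. The sketch below shows one possible way to detect the damage; `SplitPairDemo` and `hasUnpairedSurrogate` are hypothetical names for this illustration, built only on `Character.isHighSurrogate` and `Character.isLowSurrogate`.

```java
class SplitPairDemo {
    // Returns true if s contains a surrogate that is not part of a valid high+low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    return true; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "Hi\uD83D\uDE0A";    // "Hi" followed by U+1F60A
        String cut = s.substring(0, 3); // cuts the emoji in half
        System.out.println(hasUnpairedSurrogate(s));   // false
        System.out.println(hasUnpairedSurrogate(cut)); // true
    }
}
```

A string with an unpaired surrogate is still a legal Java String, but it usually displays as a replacement glyph and may not survive encoding to UTF-8 intact.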

Key Classes and Methods for Unicode Handling

Because of the surrogate pair issue, looping over a String char by char is unsafe whenever the text may contain supplementary characters. Iterate by code point instead.
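As a sketch of what code-point iteration can look like, the loop below advances the index by `Character.charCount(cp)` so that a surrogate pair is consumed in one step. The class and method names (`CodePointLoop`, `codePointsOf`) are invented for this example; `s.codePoints()` (Java 8+) is an equivalent stream-based alternative.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoop {
    // Returns the code points of s in order, formatted as U+XXXX.
    static List<String> codePointsOf(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads one or two chars
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);       // advance by 1 or 2
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A', U+1F60A (as a surrogate pair), 'B'
        System.out.println(codePointsOf("A\uD83D\uDE0AB")); // [U+0041, U+1F60A, U+0042]
    }
}
```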

`String.codePointAt(int index)`

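Returns the Unicode code point (as an int) that starts at the given char index. If the char at that index is a high surrogate followed by a low surrogate, the method combines the two into one supplementary code point. Note that the index still counts chars: pointing it at the low half of a pair returns just that low surrogate.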

String emoji = "😊";
int codePoint = emoji.codePointAt(0);
System.out.println("Code Point: " + codePoint); // Prints 128522 (decimal for U+1F60A)
System.out.println("Hex: " + Integer.toHexString(codePoint)); // Prints 1f60a

`String.codePointCount(int beginIndex, int endIndex)`

Returns the number of Unicode code points in the specified range of the String.

String complex = "A😊B"; // A, emoji, B
System.out.println(complex.length()); // Prints 4 (chars: 'A', high surrogate, low surrogate, 'B')
System.out.println(complex.codePointCount(0, complex.length())); // Prints 3 (code points: 'A', emoji, 'B')

`String.offsetByCodePoints(int index, int codePointOffset)`

Offsets an index by a given number of code points, correctly navigating surrogate pairs.
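One place this method is handy is truncating text to a maximum number of code points without splitting a surrogate pair. The helper below is a sketch under that assumption; `OffsetDemo` and `firstCodePoints` are hypothetical names, and the method expects a non-negative `n`.

```java
class OffsetDemo {
    // Returns the first n code points of s, or all of s if it has fewer (n must be >= 0).
    static String firstCodePoints(String s, int n) {
        if (n >= s.codePointCount(0, s.length())) {
            return s;
        }
        int end = s.offsetByCodePoints(0, n); // char index just past the n-th code point
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // U+1F60A, U+1F44D, then "ok": 4 code points stored in 6 chars
        String s = "\uD83D\uDE0A\uD83D\uDC4Dok";
        System.out.println(s.offsetByCodePoints(0, 2));     // 4
        System.out.println(firstCodePoints(s, 1).length()); // 2 (one emoji, two chars)
    }
}
```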

`Character` class utility methods

The java.lang.Character class is full of static methods for checking character properties.

char c = 'a';
System.out.println(Character.isLetter(c));      // true
System.out.println(Character.isDigit(c));      // false
System.out.println(Character.isUpperCase(c));  // false
System.out.println(Character.isLowerCase(c));  // true
System.out.println(Character.isWhitespace(c)); // false
// For code points, use the overloads that take an int
int cp = 0x1F60A;
System.out.println(Character.isEmoji(cp)); // true (requires Java 21 or later)
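Most of these checks also have an overload that takes an int code point, and that is the one to use for supplementary characters. The sketch below counts letters by code point and uses U+10400 (a capital letter from the Deseret alphabet, outside the BMP) as test data; `CodePointProps` and `countLetters` are names made up for this example.

```java
class CodePointProps {
    // Counts letters by code point, so supplementary letters are recognized.
    static long countLetters(String s) {
        return s.codePoints().filter(cp -> Character.isLetter(cp)).count();
    }

    public static void main(String[] args) {
        int deseret = 0x10400; // supplementary uppercase letter
        System.out.println(Character.isLetter(deseret));    // true
        System.out.println(Character.isUpperCase(deseret)); // true
        System.out.println(Integer.toHexString(Character.toLowerCase(deseret))); // 10428

        // 'a', U+10400 as a surrogate pair, then "1!"
        String s = "a\uD801\uDC00" + "1!";
        System.out.println(countLetters(s)); // 2
    }
}
```

A char-based check would miss the second letter here, because neither surrogate half is a letter on its own.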

`String.getBytes()` and Constructors

When you need to convert a String to a byte array (for storage or transmission), you should specify a character encoding explicitly. If you don't, the JVM's default charset is used, which can differ between machines and Java versions and lead to data corruption.

❌ WRONG (Platform Dependent):

// The encoding used here is the JVM default: UTF-8 since JDK 18, OS-dependent before that (e.g., Windows-1252 on Windows)
byte[] bytes = "Héllö".getBytes();

✅ CORRECT (Explicit UTF-8):

// Always specify the encoding! (needs: import java.nio.charset.StandardCharsets;)
byte[] utf8Bytes = "Héllö".getBytes(StandardCharsets.UTF_8);
byte[] utf16Bytes = "Héllö".getBytes(StandardCharsets.UTF_16); // big-endian, prefixed with a byte-order mark
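The choice of encoding changes how many bytes the same text occupies. The following sketch compares a few of the charsets in `StandardCharsets`; the class name `EncodedLengths` and the helper `utf8Length` are invented for illustration, and the text is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes s occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "H\u00E9ll\u00F6"; // 5 chars, two of them accented
        System.out.println(utf8Length(text));                                 // 7
        System.out.println(text.getBytes(StandardCharsets.UTF_16).length);    // 12 (2-byte BOM + 5 * 2)
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length);  // 10 (no BOM)
        System.out.println(utf8Length("\uD83D\uDE0A"));                       // 4 (one supplementary code point)
    }
}
```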

Constructing a String from bytes: You must also specify the encoding to correctly interpret the bytes.

byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // These are the UTF-8 bytes for 'é'
String s = new String(bytes, StandardCharsets.UTF_8);
System.out.println(s); // Prints é
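To see what goes wrong when the wrong charset is used, the sketch below decodes the same two bytes twice. Decoding them as ISO-8859-1 turns each byte into its own character, which is the classic "Ã©" mojibake. `DecodeMismatch` and `decodedLength` are hypothetical names for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatch {
    // Number of chars produced when the bytes are decoded with the given charset.
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of U+00E9

        System.out.println(decodedLength(bytes, StandardCharsets.UTF_8));      // 1
        System.out.println(decodedLength(bytes, StandardCharsets.ISO_8859_1)); // 2

        // Round trip with the same charset on both sides restores the original bytes.
        String s = new String(bytes, StandardCharsets.UTF_8);
        byte[] back = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(java.util.Arrays.equals(bytes, back)); // true
    }
}
```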

Practical Examples

Example 1: Reversing a String Correctly

A naive reversal that iterates by char will break supplementary characters.

// --- WRONG WAY: reversing char by char splits surrogate pairs ---
String emoji = "😊👍";
StringBuilder naive = new StringBuilder();
for (int i = emoji.length() - 1; i >= 0; i--) {
    naive.append(emoji.charAt(i)); // each low surrogate now comes before its high surrogate
}
System.out.println("Original: " + emoji);
System.out.println("Wrong Reversed: " + naive); // Prints unpaired surrogates (usually shown as ? or replacement glyphs)
// --- CORRECT WAY: reverse by code point ---
public static String reverseCodePoints(String s) {
    int[] codePoints = s.codePoints().toArray();
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = codePoints.length - 1; i >= 0; i--) {
        sb.appendCodePoint(codePoints[i]);
    }
    return sb.toString();
}
String correctReversed = reverseCodePoints(emoji);
System.out.println("Correct Reversed: " + correctReversed); // Prints 👍😊
// Note: new StringBuilder(emoji).reverse() is also safe; it treats each surrogate pair as one unit.

Example 2: Counting Characters

String sentence = "I have a 😊 emoji!";
int charCount = sentence.length(); // Counts 16-bit chars (UTF-16 code units)
int codePointCount = sentence.codePointCount(0, sentence.length()); // Counts code points
System.out.println("Length in chars: " + charCount);       // 18
System.out.println("Length in code points: " + codePointCount); // 17

Summary Table

| Task | Method / Approach | Why? |
|---|---|---|
| Get number of 16-bit units | `myString.length()` | Fast, but not the number of code points for strings with supplementary characters such as most emoji. |
| Get number of code points | `myString.codePointCount(0, myString.length())` | Counts each surrogate pair as a single code point. |
| Iterate over characters | `myString.codePoints().forEach(...)` or a `for` loop with `codePointAt` | Essential for processing strings with supplementary characters correctly. |
| Get code point at a char index | `myString.codePointAt(index)` | Returns the full code point when `index` is at the start of a surrogate pair. |
| Convert String to bytes | `myString.getBytes(StandardCharsets.UTF_8)` | Ensures consistent, platform-independent encoding. Avoids data loss. |
| Convert bytes to String | `new String(bytes, StandardCharsets.UTF_8)` | Ensures bytes are interpreted correctly according to the specified encoding. |

Conclusion

Java String is Unicode-based: It uses UTF-16 to represent characters.
char is 16-bit: This means a single char can only hold a BMP code point; a supplementary character needs two chars (a surrogate pair).
