杰瑞科技汇

How Do Java Strings Handle Unicode Characters?

Let's dive into how Java handles Unicode strings. This is a fundamental concept in Java, and understanding it is key to writing robust, internationalized applications.

The Core Idea: Java Strings are Unicode

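At its heart, a String in Java is not just an array of bytes. It is an immutable sequence of 16-bit char values, each of which is one UTF-16 code unit. Most characters take one char; characters outside the Basic Multilingual Plane take two.

This is the most important takeaway: the String API exposes text as UTF-16. (Since Java 9 the JVM may store Latin-1-only strings more compactly internally, but this is invisible to your code.)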
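
To make the "sequence of 16-bit code units" idea concrete, here is a minimal sketch that prints each char of a string as four hex digits. The class and method names (CodeUnitDump, hexUnits) are illustrative and not part of any library; the sample text is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
class CodeUnitDump {
    // Formats each UTF-16 code unit of s as four hex digits, separated by spaces.
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A', e with acute accent (U+00E9), Euro sign (U+20AC): all in the BMP
        String text = "A\u00E9\u20AC";
        System.out.println(hexUnits(text)); // 0041 00E9 20AC
    }
}
```

Each of these three characters occupies exactly one code unit, which is why length() matches the visible character count here.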


The char Data Type and the BMP

  • char size: A char in Java is a primitive type that occupies 2 bytes (16 bits).
  • Basic Multilingual Plane (BMP): The original Unicode design had room for 65,536 (2¹⁶) code points, the range U+0000 to U+FFFF, now known as the Basic Multilingual Plane. This includes:
    • Latin, Greek, Cyrillic alphabets
    • The most common Chinese, Japanese, and Korean (CJK) ideographs
    • A few older emoji and pictographs (e.g., ☺ U+263A); most modern emoji lie outside the BMP
    • Many special symbols.

Since a 16-bit char can perfectly represent any character in the BMP, for a long time, a one-to-one mapping existed between a Java char and a Unicode code point.

Example (BMP Character):

char euroSymbol = '€'; // The Euro sign is in the BMP
System.out.println(euroSymbol); // Prints €
System.out.println((int) euroSymbol); // Prints 8364, its Unicode code point

The Problem: Supplementary Characters (Beyond the BMP)

As the Unicode standard grew, it surpassed the 65,536-code-point limit of the BMP. To accommodate many more characters (historical scripts, rare symbols, most emoji, etc.), the Unicode Consortium extended the code space up to U+10FFFF, just over 1.1 million code points, and introduced supplementary characters.

These characters have code points that are greater than U+FFFF (i.e., they require more than 16 bits to represent).

How does Java (with its 16-bit char) handle this?

Java uses a clever mechanism called UTF-16 surrogate pairs.

  • A single supplementary character is represented by two consecutive char values.
  • The first char is the high surrogate (in the range \uD800 to \uDBFF).
  • The second char is the low surrogate (in the range \uDC00 to \uDFFF).
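
The arithmetic behind a surrogate pair can be sketched by hand and checked against the JDK. The example below assumes the code point U+1F60A as input; the class and method names (SurrogateMath, toSurrogates) are hypothetical, while Character.toChars and Character.toCodePoint are the standard library calls that do the same job.

```java
import java.util.Arrays;

class SurrogateMath {
    // Splits a supplementary code point (above U+FFFF) into its two surrogates by hand.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F60A;
        char[] manual = toSurrogates(cp);
        char[] library = Character.toChars(cp);

        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]);     // D83D DE0A
        System.out.println(Arrays.equals(manual, library));                     // true
        System.out.println(Character.toCodePoint(manual[0], manual[1]) == cp);  // true
    }
}
```

In real code there is rarely a reason to do this by hand; Character.toChars is the safer choice, and the manual version is shown only to explain where the two ranges come from.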

Example (Supplementary Character - Emoji): The "Smiling Face with Smiling Eyes" emoji (😊) has a Unicode code point of U+1F60A.

// The code point for the 😊 emoji
int codePoint = 0x1F60A;
// This character CANNOT be stored in a single char
// char emoji = '😊'; // Does not compile: the character needs TWO chars
// To get the number of chars (code units) in the string
String emoji = "😊";
System.out.println("String length in chars: " + emoji.length()); // Prints 2, not 1!
// To get the actual number of characters (code points)
System.out.println("String length in code points: " + emoji.codePointCount(0, emoji.length())); // Prints 1

This distinction is crucial and a common source of bugs.


Key Classes and Methods for Unicode Handling

Because of the surrogate pair issue, looping over a String's char values is unsafe whenever the text may contain supplementary characters. Iterate over code points instead.

String.codePointAt(int index)

Returns the Unicode code point (as an int) at the given index. It correctly handles surrogate pairs, looking ahead if necessary.

String emoji = "😊";
int codePoint = emoji.codePointAt(0);
System.out.println("Code Point: " + codePoint); // Prints 128522 (decimal for U+1F60A)
System.out.println("Hex: " + Integer.toHexString(codePoint)); // Prints 1f60a
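
Note that the index passed to codePointAt is a char index, not a code point index, so a loop over code points has to advance by one or two chars per step. A minimal sketch of that loop follows; the class and method names (CodePointLoop, codePointsOf) are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoop {
    // Returns every code point of s formatted as "U+XXXX", in order.
    static List<String> codePointsOf(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A', the U+1F60A emoji written as a surrogate pair, 'B'
        String s = "A\uD83D\uDE0AB";
        System.out.println(codePointsOf(s)); // [U+0041, U+1F60A, U+0042]
    }
}
```

The stream form, s.codePoints(), yields the same sequence and is often more convenient when no index is needed.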

String.codePointCount(int beginIndex, int endIndex)

Returns the number of Unicode code points in the specified range of the String.

String complex = "A😊B"; // A, emoji, B
System.out.println(complex.length()); // Prints 4 (chars: 'A', high surrogate, low surrogate, 'B')
System.out.println(complex.codePointCount(0, complex.length())); // Prints 3 (code points: 'A', emoji, 'B')

String.offsetByCodePoints(int index, int codePointOffset)

Offsets an index by a given number of code points, correctly navigating surrogate pairs.
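
One plausible use is truncating a string to a maximum number of code points without cutting a surrogate pair in half. The sketch below assumes that goal; SafeTruncate and truncate are hypothetical names, and only offsetByCodePoints, codePointCount, and substring come from the String API.

```java
class SafeTruncate {
    // Returns the first maxCodePoints code points of s, never splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s; // already short enough
        }
        int end = s.offsetByCodePoints(0, maxCodePoints); // char index after N code points
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'A', the U+1F60A emoji written as a surrogate pair, 'B'
        String s = "A\uD83D\uDE0AB";
        System.out.println(s.offsetByCodePoints(0, 2)); // 3
        System.out.println(truncate(s, 2).length());    // 3 ('A' plus both surrogates)
        // A plain substring(0, 2) would end on a lone high surrogate:
        System.out.println(Character.isHighSurrogate(s.substring(0, 2).charAt(1))); // true
    }
}
```

The length check comes first because offsetByCodePoints throws IndexOutOfBoundsException when the string holds fewer code points than requested.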

Character class utility methods

The java.lang.Character class is full of static methods for checking character properties.

char c = 'a';
System.out.println(Character.isLetter(c));      // true
System.out.println(Character.isDigit(c));      // false
System.out.println(Character.isUpperCase(c));  // false
System.out.println(Character.isLowerCase(c));  // true
System.out.println(Character.isWhitespace(c)); // false
// For code points
int cp = 0x1F60A;
System.out.println(Character.isEmoji(cp)); // true (Character.isEmoji requires Java 21 or later)
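
Most of these methods have both a char overload and an int (code point) overload, and the difference matters for supplementary characters. The sketch below uses U+1D400 (MATHEMATICAL BOLD CAPITAL A) as an assumed example of a letter outside the BMP; the class and helper names (CodePointProperties, isLetterAt) are illustrative.

```java
class CodePointProperties {
    // Checks whether the code point starting at the given char index is a letter.
    static boolean isLetterAt(String s, int index) {
        return Character.isLetter(s.codePointAt(index));
    }

    public static void main(String[] args) {
        int boldCapitalA = 0x1D400; // a letter outside the BMP
        String s = new String(Character.toChars(boldCapitalA));

        // The char overload only sees the high surrogate, which is not a letter.
        System.out.println(Character.isLetter(s.charAt(0)));                 // false
        // The int overload sees the whole code point.
        System.out.println(isLetterAt(s, 0));                                // true
        System.out.println(Character.isSupplementaryCodePoint(boldCapitalA)); // true
    }
}
```

When text may contain supplementary characters, passing s.codePointAt(i) rather than s.charAt(i) avoids this class of false negatives.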

String.getBytes() and Constructors

When you need to convert a String to a byte array (for storage or transmission), you should specify a character encoding explicitly. If you don't, the default charset is used. Since Java 18 (JEP 400) the default is UTF-8, but on earlier versions it depends on the OS and locale, which can corrupt data when the bytes are read elsewhere.

❌ WRONG (Platform Dependent):

// On Java 17 and earlier the encoding depends on the OS and locale (e.g., windows-1252 on Windows, UTF-8 on most Linux systems)
byte[] bytes = "Héllö".getBytes(); 

✅ CORRECT (Explicit UTF-8):

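// Always specify the encoding! (requires: import java.nio.charset.StandardCharsets;)
byte[] utf8Bytes = "Héllö".getBytes(StandardCharsets.UTF_8);
byte[] utf16Bytes = "Héllö".getBytes(StandardCharsets.UTF_16); // big-endian, prefixed with a byte-order mark (BOM)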

Constructing a String from bytes: You must also specify the encoding to correctly interpret the bytes.

byte[] bytes = {(byte) 0xC3, (byte) 0xA9}; // These are the UTF-8 bytes for 'é'
String s = new String(bytes, StandardCharsets.UTF_8);
System.out.println(s); // Prints é
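
A short round trip shows both why the encoding must match on each side and how byte counts differ from char counts. This is a sketch under the assumption that the bytes were produced as UTF-8; the class and method names (EncodingRoundTrip, utf8Length) are illustrative.

```java
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Number of bytes s occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String original = "H\u00E9ll\u00F6"; // "Hello" with accented e and o
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(original.length()); // 5 chars
        System.out.println(utf8.length);       // 7 bytes: each accented letter takes 2

        // Decoding with the matching charset restores the text.
        String right = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(right.equals(original)); // true

        // Decoding with the wrong charset maps every byte to its own char.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());         // 7
        System.out.println(wrong.equals(original)); // false

        // A supplementary character (U+1F60A) takes 4 bytes in UTF-8.
        System.out.println(utf8Length("\uD83D\uDE0A")); // 4
    }
}
```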

Practical Examples

Example 1: Reversing a String Correctly

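A naive reversal that iterates by char will break supplementary characters, because it swaps the two halves of each surrogate pair.

// --- WRONG WAY: reverse char by char ---
String emoji = "😊👍";
StringBuilder wrong = new StringBuilder();
for (int i = emoji.length() - 1; i >= 0; i--) {
    wrong.append(emoji.charAt(i)); // each low surrogate now precedes its high surrogate
}
String wrongReversed = wrong.toString();
System.out.println("Original: " + emoji);
System.out.println("Wrong Reversed: " + wrongReversed); // Surrogate pairs are split; the result contains unpaired surrogates
// --- CORRECT WAY: reverse code point by code point ---
// (Helper method; declare it at class level.)
public static String reverseCodePoints(String s) {
    int[] codePoints = s.codePoints().toArray();
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = codePoints.length - 1; i >= 0; i--) {
        sb.appendCodePoint(codePoints[i]);
    }
    return sb.toString();
}
String correctReversed = reverseCodePoints(emoji);
System.out.println("Correct Reversed: " + correctReversed); // Prints 👍😊
// Note: StringBuilder.reverse() also treats surrogate pairs as single characters,
// so new StringBuilder(emoji).reverse().toString() gives the same correct result.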

Example 2: Counting Characters

String sentence = "I have a 😊 emoji!";
int charCount = sentence.length(); // Counts 16-bit chars
int codePointCount = sentence.codePointCount(0, sentence.length()); // Counts actual characters
System.out.println("Length in chars: " + charCount);       // 18
System.out.println("Length in code points: " + codePointCount); // 17

Summary Table

| Task | Method / Approach | Why? |
|------|-------------------|------|
| Get number of 16-bit units | myString.length() | Fast, but not the number of code points for strings with emojis/supplementary chars. |
| Get number of code points | myString.codePointCount(0, myString.length()) | Correctly counts each surrogate pair as a single code point. |
| Iterate over code points | myString.codePoints().forEach(...) or a for loop with codePointAt | Essential for processing strings with supplementary characters correctly. |
| Get code point at a char index | myString.codePointAt(index) | Returns the full code point, handling surrogate pairs automatically. The index counts chars, not code points. |
| Convert String to bytes | myString.getBytes(StandardCharsets.UTF_8) | Ensures consistent, platform-independent encoding. Avoids data loss. |
| Convert bytes to String | new String(bytes, StandardCharsets.UTF_8) | Ensures bytes are interpreted correctly according to the specified encoding. |

Conclusion

  • Java String is Unicode-based: It uses UTF-16 to represent characters.
  • char is 16-bit: This means a single char can only hold