杰瑞科技汇

How do you convert between Unicode and GBK encodings in Java?

Converting between Unicode and GBK is a fundamental concept in Java programming, especially when dealing with text files, network communication, or user input. The sections below break it down step by step.

The Core Concept: Unicode vs. Encoding (GBK)

First, it's essential to understand the difference between the character set and the encoding.

  1. Unicode (The Character Set):

    • Think of Unicode as a giant, universal dictionary. It assigns a unique number (a code point) to every character in every language you can imagine, plus symbols, emojis, and even historical scripts.
    • A code point is written as U+XXXX; for example, A is U+0041, 中 is U+4E2D, and 😂 is U+1F602.
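    • Java's char type and String class are built on Unicode. When you write String s = "Hello世界";, Java exposes the text as a sequence of UTF-16 code units: each char is 16 bits, so most characters (including 中) fit in one char, while supplementary characters such as U+1F602 occupy two chars (a surrogate pair).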
  2. GBK (The Encoding):

    • An encoding is a set of rules for how to represent those Unicode code points as bytes for storage or transmission.
    • Unicode defines the "what" (the character), while an encoding like GBK defines the "how" (the byte sequence).
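    • GBK is a character encoding standard developed in China. It's an extension of the GB2312 standard and is widely used in Mainland China. It covers roughly 21,000 simplified and traditional Chinese characters, as well as English letters, Japanese kana, Cyrillic (Russian) letters, and other symbols. It does not cover all of Unicode; emoji, for instance, have no GBK representation.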
    • GBK is a variable-width encoding. English characters (like 'A', 'B') are represented by 1 byte, while Chinese characters are represented by 2 bytes.
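The distinction between code points, Java chars, and GBK bytes can be checked directly. The following is a minimal sketch, assuming the JVM ships the GBK charset (standard JDK builds do); the class name CodePointDemo and the helper gbkLength are made up for illustration. The literal "\u4E2D" is the character 中.

```java
import java.nio.charset.Charset;

class CodePointDemo {
    // Number of bytes a string occupies once encoded as GBK.
    static int gbkLength(String s) {
        return s.getBytes(Charset.forName("GBK")).length;
    }

    public static void main(String[] args) {
        String zhong = "\u4E2D";
        // Code point of the first character, formatted as U+XXXX.
        System.out.printf("U+%04X%n", zhong.codePointAt(0));          // U+4E2D
        // ASCII letters take one byte in GBK, Chinese characters two.
        System.out.println(gbkLength("A"));                            // 1
        System.out.println(gbkLength(zhong));                          // 2

        // U+1F602 lies outside the Basic Multilingual Plane, so Java
        // stores it as a surrogate pair: two chars, one code point.
        String emoji = new String(Character.toChars(0x1F602));
        System.out.println(emoji.length());                            // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));   // 1
    }
}
```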

The Key Takeaway: Java strings are always Unicode in memory. GBK (or any other encoding like UTF-8) only comes into play when you need to convert that Java string into a byte array (for writing to a file or sending over a network) or when you need to convert a byte array (from a file or network) back into a Java string.
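For data already in memory, the conversion described above comes down to two calls: String.getBytes(Charset) and new String(byte[], Charset). The sketch below is one way to re-encode GBK bytes as UTF-8 by passing through a String; GbkRoundTrip and gbkToUtf8 are illustrative names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GbkRoundTrip {
    static final Charset GBK = Charset.forName("GBK");

    // Re-encodes GBK bytes as UTF-8 by passing through a Java String.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String text = new String(gbkBytes, GBK);        // bytes -> Unicode
        return text.getBytes(StandardCharsets.UTF_8);   // Unicode -> bytes
    }

    public static void main(String[] args) {
        String original = "\u4E2D"; // the character U+4E2D
        byte[] gbk = original.getBytes(GBK);
        System.out.println(Arrays.toString(gbk));       // [-42, -48]

        byte[] utf8 = gbkToUtf8(gbk);
        System.out.println(utf8.length);                // 3

        // Decoding with the matching charset restores the original text.
        String restored = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(restored.equals(original));  // true
    }
}
```

There is no direct "GBK to UTF-8" step in this sketch: the String in the middle is the Unicode form, and each charset only describes how to get into or out of it.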

How to Handle GBK in Java

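The most important classes for this are InputStreamReader, OutputStreamWriter, and Charset. Note that the StandardCharsets class only has constants for the charsets every JVM must support (such as UTF_8), so GBK is obtained with Charset.forName("GBK").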

Reading a GBK-encoded File into a Java String

When you read from a file, you get bytes. You must tell Java how to interpret those bytes as characters. This is where InputStreamReader shines.

Scenario: You have a file named gbk_text.txt encoded in GBK.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
public class ReadGbkFile {
    public static void main(String[] args) {
        // The path to your GBK encoded file
        String filePath = "gbk_text.txt";
        // The charset we are reading the file with
        Charset gbkCharset = Charset.forName("GBK");
        try (
            // FileInputStream reads raw bytes from the file
            FileInputStream fis = new FileInputStream(filePath);
            // InputStreamReader converts the bytes to characters using the specified charset
            InputStreamReader isr = new InputStreamReader(fis, gbkCharset);
            // BufferedReader provides efficient line-by-line reading
            BufferedReader br = new BufferedReader(isr)
        ) {
            String line;
            System.out.println("Reading file with GBK encoding:");
            while ((line = br.readLine()) != null) {
                // At this point, 'line' is a normal Java String, fully Unicode.
                // The magic of InputStreamReader has already happened.
                System.out.println(line);
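                // Look the character up instead of assuming its position,
                // so short or empty lines cannot cause an exception.
                int idx = line.indexOf('中');
                if (idx >= 0) {
                    System.out.println("The internal representation of '中' is: " + (int) line.charAt(idx)); // Prints 20013 (U+4E2D)
                }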
            }
        } catch (IOException e) {
            System.err.println("Error reading the file: " + e.getMessage());
        }
    }
}

What's happening here?

  1. FileInputStream reads the raw bytes from gbk_text.txt (e.g., the GBK bytes for '中' are 0xD6 0xD0, which Java shows as the signed values -42 -48).
  2. InputStreamReader takes these bytes and, because it was told to use the GBK charset, correctly translates them into the Unicode character '中'.
  3. The resulting String object now holds the Unicode character, which is what all Java string operations expect.
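The same three steps can be tried without a file by feeding the two bytes through a ByteArrayInputStream. This is only a sketch: DecodeGbkBytes and firstLine are hypothetical names, and the byte array stands in for the file contents.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class DecodeGbkBytes {
    // Decodes the first line of a GBK byte stream into a Java String.
    static String firstLine(byte[] gbkBytes) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(gbkBytes),
                                      Charset.forName("GBK")))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xD6, (byte) 0xD0}; // GBK bytes for U+4E2D
        String line = firstLine(raw);
        // One char comes out of two bytes.
        System.out.println(line.length());         // 1
        System.out.println((int) line.charAt(0));  // 20013
    }
}
```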

Writing a Java String to a GBK-encoded File

When you write a string to a file, you must tell Java how to convert your Unicode characters into bytes. This is the job of OutputStreamWriter.

Scenario: You have a Java String and you want to save it to a file using GBK encoding.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
public class WriteGbkFile {
    public static void main(String[] args) {
        String textToWrite = "This is a test. 这是一个测试。";
        String filePath = "output_gbk.txt";
        // The charset we want to use for writing the file
        Charset gbkCharset = Charset.forName("GBK");
        try (
            // FileOutputStream writes raw bytes to the file
            FileOutputStream fos = new FileOutputStream(filePath);
            // OutputStreamWriter converts characters to bytes using the specified charset
            OutputStreamWriter osw = new OutputStreamWriter(fos, gbkCharset)
        ) {
            osw.write(textToWrite);
            System.out.println("Successfully wrote text to '" + filePath + "' using GBK encoding.");
        } catch (IOException e) {
            System.err.println("Error writing to the file: " + e.getMessage());
        }
    }
}

What's happening here?

  1. Java takes the String textToWrite.
  2. OutputStreamWriter, using the GBK charset, converts each Unicode character into its corresponding byte representation. For example, 'T' becomes 84 (1 byte), while a Chinese character such as '中' becomes -42 -48 (2 bytes, 0xD6 0xD0).
  3. FileOutputStream then writes these raw bytes to output_gbk.txt.
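To inspect the bytes that OutputStreamWriter produces, the file can be swapped for a ByteArrayOutputStream. The sketch below also checks whether a string can be represented in GBK at all, since characters outside GBK (emoji, for example) cannot be written faithfully. EncodeToGbk, encode, and fitsInGbk are illustrative names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.Arrays;

class EncodeToGbk {
    static final Charset GBK = Charset.forName("GBK");

    // Encodes text through an OutputStreamWriter and returns the bytes written.
    static byte[] encode(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, GBK)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) at this point.
        return sink.toByteArray();
    }

    // True if every character of the text has a GBK representation.
    static boolean fitsInGbk(String text) {
        return GBK.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("T\u4E2D"))); // [84, -42, -48]

        String emoji = new String(Character.toChars(0x1F602));
        System.out.println(fitsInGbk("T\u4E2D"));               // true
        System.out.println(fitsInGbk(emoji));                   // false
    }
}
```

Calling fitsInGbk before writing is one possible guard; whether to reject, replace, or escape unencodable characters is a design decision the application has to make.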

Best Practices and Modern Java (NIO.2)

The java.io package (with InputStreamReader/OutputStreamWriter) is classic and works well. However, Java 7 introduced the java.nio.file package (NIO.2), whose Files class is often preferred for its more concise API.

NIO.2 Example: Reading and Writing with GBK

The key class here is Files, which has static methods for reading and writing that take a Charset directly.

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
public class NioGbkExample {
    public static void main(String[] args) {
        Path readPath = Paths.get("gbk_text.txt");
        Path writePath = Paths.get("output_nio_gbk.txt");
        Charset gbkCharset = Charset.forName("GBK");
        // --- READING with NIO.2 ---
        try {
            // Files.readAllLines() reads the entire file into a List of Strings,
            // correctly decoding the bytes using the provided charset.
            List<String> lines = Files.readAllLines(readPath, gbkCharset);
            System.out.println("Read from file using NIO.2:");
            for (String line : lines) {
                System.out.println(line);
            }
            // --- WRITING with NIO.2 ---
            // Files.writeString (Java 11+) writes a single String;
            // Files.write(path, lines, charset) writes a List of Strings instead.
            String textToWrite = "NIO.2 is great! NIO.2很棒!";
            Files.writeString(writePath, textToWrite, gbkCharset);
            System.out.println("\nSuccessfully wrote text using NIO.2 and GBK encoding.");
        } catch (IOException e) {
            System.err.println("An I/O error occurred: " + e.getMessage());
        }
    }
}
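Files.readAllLines loads the whole file into memory, which may not suit large inputs. As a hedged alternative, the sketch below writes with Files.write(Path, Iterable, Charset) and reads back line by line with Files.newBufferedReader, both available since Java 7. The class name, method name, and the "gbk-demo" temp-file prefix are invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NioGbkStreaming {
    static final Charset GBK = Charset.forName("GBK");

    // Writes the lines to a temporary file as GBK, then streams them back.
    static List<String> roundTripViaTempFile(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("gbk-demo", ".txt");
            try {
                // Each element is encoded as GBK and followed by a line separator.
                Files.write(tmp, lines, GBK);
                List<String> result = new ArrayList<>();
                // The reader decodes as it goes, so the file is not loaded whole.
                try (BufferedReader reader = Files.newBufferedReader(tmp, GBK)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        result.add(line);
                    }
                }
                return result;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("NIO.2", "\u4E2D\u6587");
        System.out.println(roundTripViaTempFile(lines).equals(lines)); // true
    }
}
```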

Common Pitfalls and Problems

Problem 1: Reading a GBK file with the wrong encoding (e.g., UTF-8)

If you try to read a GBK-encoded file as if it were UTF-8, you will get garbled characters (mojibake).

  • GBK file contains: D6 D0 B9 E3 (four bytes; the first two, D6 D0, are "中")
  • A UTF-8 decoder sees these bytes and interprets them as:
    • D6 is a valid lead byte of a 2-byte sequence, but D0 is not a valid continuation byte, so D6 is rejected as malformed.
    • The decoder then reads D0 B9 as a single character (й, U+0439) and hits a truncated sequence at E3, resulting in something like �й�.
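By default Java does not raise an error here; it silently substitutes U+FFFD for the bad sequences. One way to catch a wrong charset early is a strict CharsetDecoder configured with CodingErrorAction.REPORT. The following is a sketch using the four bytes from the example; WrongCharsetDemo and decodesCleanly are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Returns true if the bytes are well-formed text in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xB9, (byte) 0xE3};

        // Lenient decoding replaces bad sequences with U+FFFD instead of failing.
        String garbled = new String(gbkBytes, StandardCharsets.UTF_8);
        System.out.println(garbled.indexOf('\uFFFD') >= 0);                      // true

        // Strict decoding reports the mismatch.
        System.out.println(decodesCleanly(gbkBytes, StandardCharsets.UTF_8));    // false
        System.out.println(decodesCleanly(gbkBytes, Charset.forName("GBK")));    // true
    }
}
```

A clean decode does not prove the charset is right (many byte sequences are valid in several encodings), but a failed one is a strong sign it is wrong.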

Problem 2: Writing a Java String to a file without specifying an encoding

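This is a very common mistake. If you call String.getBytes() with no argument, or create a FileWriter, InputStreamReader, or OutputStreamWriter without passing a charset, Java falls back to its default charset. Before Java 18 that default comes from the platform; from Java 18 onward it is UTF-8 unless overridden. (Files.write(path, bytes) and FileOutputStream deal only in raw bytes, so the risk lies in how those bytes were produced.)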

  • On a Windows machine in China: The default might be GBK. Your code might work by accident.
  • On a Linux machine in the US: The default is likely UTF-8. Your Chinese characters will be written as UTF-8 bytes. If another program tries to read it expecting GBK, it will be garbled.

Always be explicit about your character encoding!
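The difference between the implicit and explicit forms is easy to see in a few lines. This is a small sketch with made-up names (ExplicitCharsetDemo, encodeExplicitly); the first two values it prints depend on the JVM version and operating system, so no expected output is given for them.

```java
import java.nio.charset.Charset;

class ExplicitCharsetDemo {
    // Encodes with a named charset, so the result is the same on every machine.
    static byte[] encodeExplicitly(String text) {
        return text.getBytes(Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // the character U+4E2D

        // Whatever this JVM considers its default; varies between environments.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Implicit: the length depends on the default charset above.
        System.out.println(text.getBytes().length);

        // Explicit: always two bytes in GBK.
        System.out.println(encodeExplicitly(text).length); // 2
    }
}
```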

Summary

| Task | Old I/O (java.io) | Modern NIO (java.nio.file) | Key Point |
| --- | --- | --- | --- |
| Read GBK -> Java String | new InputStreamReader(fis, "GBK") | Files.readAllLines(path, Charset.forName("GBK")) | You must specify the encoding to correctly interpret the incoming bytes. |
| Write Java String -> GBK | new OutputStreamWriter(fos, "GBK") | Files.writeString(path, text, Charset.forName("GBK")) | You must specify the encoding to correctly convert your Unicode characters to bytes. |
| Internal Java String Representation | N/A | N/A | Always Unicode. The encoding is only for I/O. |
| Best Practice | Good, but more verbose. | Preferred. Cleaner, safer, and more powerful. | Always be explicit with your Charset. Avoid relying on the platform's default. Use StandardCharsets.UTF_8 for UTF-8 or Charset.forName("GBK") for GBK. |