杰瑞科技汇

How Do You Correctly Convert GBK to UTF-8 in Java?

Converting text from GBK to UTF-8 in Java is a common task, especially when dealing with data from legacy systems or applications in mainland China. The whole process rests on Java's character decoding and encoding mechanisms.

Here’s a comprehensive guide, from the basic concept to complete, runnable code examples.

The Core Concept: Reader & Writer vs. String

The most important thing to understand is that a String in Java is a sequence of UTF-16 code units, no matter which bytes it was created from. It has no "encoding" of its own. Encoding comes into play only when you need to:

  1. Decode a sequence of bytes (e.g., from a file or network) into a String.
  2. Encode a String into a sequence of bytes (e.g., to write to a file or send over a network).

So, when you "convert GBK to UTF-8", you are actually:

  1. Reading bytes that are encoded in GBK and creating a String (UTF-16).
  2. Writing that String to a new location using the UTF-8 encoding.
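The two steps can be seen on a tiny in-memory example. This is a minimal sketch, not part of the file converters in this guide: the class name GbkUtf8RoundTrip and the helper gbkToUtf8 are made up for illustration, and it assumes the runtime provides the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkUtf8RoundTrip {
    // Hypothetical helper: decode GBK bytes into a String, then encode that String as UTF-8.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String text = new String(gbkBytes, Charset.forName("GBK")); // step 1: decode
        return text.getBytes(StandardCharsets.UTF_8);               // step 2: encode
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D U+6587), written as Unicode escapes
        // so that the encoding of this source file does not matter.
        String sample = "\u4e2d\u6587";
        byte[] gbk = sample.getBytes(Charset.forName("GBK"));
        byte[] utf8 = gbkToUtf8(gbk);

        System.out.println("GBK bytes:   " + gbk.length);   // 4 (two bytes per character)
        System.out.println("UTF-8 bytes: " + utf8.length);  // 6 (three bytes per character)
        System.out.println("Same text:   "
                + sample.equals(new String(utf8, StandardCharsets.UTF_8))); // true
    }
}
```

The text is identical before and after; only its byte representation changes.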

Method 1: The Correct & Recommended Way (Using InputStreamReader and OutputStreamWriter)

This is the standard, most robust way to handle character encoding when reading from or writing to streams (files, network connections, etc.). It avoids loading the entire file into memory, making it suitable for large files.

Scenario: Converting a File from GBK to UTF-8

Let's say you have a file named input_gbk.txt encoded in GBK.

import java.io.*;
import java.nio.charset.StandardCharsets;
public class GbkToUtf8Converter {
    public static void main(String[] args) {
        // 1. Define source and target file paths
        String sourceFilePath = "path/to/your/input_gbk.txt"; // Your GBK encoded file
        String targetFilePath = "path/to/your/output_utf8.txt"; // The new UTF-8 file to be created
        // 2. Use try-with-resources to automatically close streams
        try (
            // Create an InputStream to read the raw bytes from the source file
            FileInputStream fis = new FileInputStream(sourceFilePath);
            // Wrap it in an InputStreamReader that decodes the bytes using the GBK charset
            InputStreamReader isr = new InputStreamReader(fis, "GBK");
            // Wrap the Reader in a BufferedReader for efficient line-by-line reading
            BufferedReader br = new BufferedReader(isr);
            // Create an OutputStream to write raw bytes to the target file
            FileOutputStream fos = new FileOutputStream(targetFilePath);
            // Wrap it in an OutputStreamWriter that encodes characters to UTF-8 bytes
            OutputStreamWriter osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);
            // Wrap the Writer in a BufferedWriter for efficient writing
            BufferedWriter bw = new BufferedWriter(osw)
        ) {
            String line;
            // 3. Read line by line from the GBK file
            while ((line = br.readLine()) != null) {
                // 4. Write each line to the UTF-8 file
                bw.write(line);
                // Add the newline character back, as readLine() strips it
                bw.newLine();
            }
            System.out.println("File converted successfully from GBK to UTF-8.");
        } catch (UnsupportedEncodingException e) {
            System.err.println("Error: The GBK encoding is not supported by this JVM.");
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            System.err.println("Error: One of the files was not found.");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("An I/O error occurred during the conversion.");
            e.printStackTrace();
        }
    }
}

Explanation of the Code:

  1. FileInputStream / FileOutputStream: These read and write raw bytes.
  2. InputStreamReader(fis, "GBK"): This is the key part for reading. It takes the raw byte stream (fis) and uses the "GBK" character set to correctly interpret those bytes as characters, creating a Reader.
  3. BufferedReader: A wrapper for efficiency, allowing us to read the file line by line with readLine().
  4. OutputStreamWriter(fos, StandardCharsets.UTF_8): This is the key part for writing. It takes an OutputStream and a character set, and encodes the characters we write into bytes using UTF-8 before passing them to the raw byte stream (fos).
  5. StandardCharsets.UTF_8: It's best practice to use the constants in the StandardCharsets class for common encodings like UTF-8. They are guaranteed to be supported on every JVM and are more type-safe than a string literal. GBK has no such constant, which is why it is passed by name.
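One side effect of the readLine()/newLine() loop is that line endings are rewritten to the platform default, and a final newline is added even if the source file had none. If the output should keep the original line endings, a buffer-by-buffer copy is one alternative. The following is a minimal sketch under that assumption; the class name GbkToUtf8BufferCopy and the method convert are illustrative names, and the source and target paths are taken from the command line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class GbkToUtf8BufferCopy {
    // Copies characters in fixed-size chunks, so line endings pass through unchanged.
    static void convert(Path source, Path target) throws IOException {
        Charset gbk = Charset.forName("GBK"); // throws UnsupportedCharsetException if GBK is missing
        try (BufferedReader reader = Files.newBufferedReader(source, gbk);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java GbkToUtf8BufferCopy <gbk-input> <utf8-output>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Note that Files.newBufferedReader is stricter than new InputStreamReader(fis, "GBK"): it throws a MalformedInputException when the input contains bytes that are not valid GBK, instead of silently substituting a replacement character. For a conversion tool that is often the safer behavior, because a wrongly labeled input file fails loudly.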

Method 2: The In-Memory Way (Using String Constructors)

This method is simpler but consumes far more memory: the GBK byte array, the decoded String, and the UTF-8 byte array all exist in memory at the same time. It's only suitable for small files.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
public class GbkToUtf8StringConverter {
    public static void main(String[] args) {
        String sourceFilePath = "path/to/your/input_gbk.txt";
        String targetFilePath = "path/to/your/output_utf8_inmemory.txt";
        try {
            // 1. Read all bytes from the GBK file into a byte array
            byte[] gbkBytes = readAllBytesOrExit(new File(sourceFilePath));
            // 2. Create a String from the byte array, specifying the source encoding (GBK)
            // This decodes the GBK bytes into a UTF-16 String.
            String content = new String(gbkBytes, "GBK");
            // 3. Get the UTF-8 bytes from the String, specifying the target encoding (UTF-8)
            // This encodes the UTF-16 String into UTF-8 bytes.
            byte[] utf8Bytes = content.getBytes(StandardCharsets.UTF_8);
            // 4. Write the UTF-8 bytes to the target file.
            // Files.write creates the file if it is missing and overwrites it if it exists.
            Files.write(Paths.get(targetFilePath), utf8Bytes);
            System.out.println("File converted successfully from GBK to UTF-8 (in-memory).");
        } catch (UnsupportedEncodingException e) {
            System.err.println("Error: The GBK encoding is not supported by this JVM.");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("An I/O error occurred during the conversion.");
            e.printStackTrace();
        }
    }
    // Helper method to read all bytes from a file
    private static byte[] readAllBytesOrExit(File file) throws IOException {
        return Files.readAllBytes(file.toPath());
    }
}

Note: This example uses java.nio.file.Files, which is available in Java 7 and later. It's a very convenient utility for file operations.

Explanation of the Code:

  1. new String(gbkBytes, "GBK"): This constructor takes a byte array and a charset name. It interprets the bytes using the "GBK" encoding and constructs a String.
  2. content.getBytes(StandardCharsets.UTF_8): This method on the String object takes a charset and returns a byte array representing the string's characters encoded in that charset (in this case, UTF-8).
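One caveat with new String(bytes, charset) is that invalid input does not raise an error; in practice, bytes that are not valid GBK end up as the replacement character U+FFFD in the result. If you would rather have the conversion fail, a CharsetDecoder configured to report errors is one option. This is a sketch of that idea, not a required part of Method 2; StrictGbkDecoder and decodeStrict are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictGbkDecoder {
    // Decodes GBK bytes and throws instead of substituting U+FFFD for bad input.
    static String decodeStrict(byte[] gbkBytes) throws CharacterCodingException {
        return Charset.forName("GBK")
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(gbkBytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xD6, (byte) 0xD0}; // GBK bytes for U+4E2D
        byte[] truncated = {(byte) 0xD6};          // lead byte with its trail byte missing
        for (byte[] input : new byte[][] {valid, truncated}) {
            try {
                System.out.println("decoded " + decodeStrict(input).length() + " char(s)");
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e);
            }
        }
    }
}
```

The resulting String can then be encoded with getBytes(StandardCharsets.UTF_8) exactly as in the listing above.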

Important Prerequisite: JVM Support for GBK

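For the code to work, your Java runtime must provide the GBK charset. This has nothing to do with fonts: charsets are part of the Java class library. The only charsets every Java implementation is required to support are US-ASCII, ISO-8859-1, UTF-8, and the UTF-16 variants. GBK is not on that list, but in practice the full JDK and JRE distributions from Oracle and OpenJDK include it.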

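If you get an UnsupportedEncodingException (or an UnsupportedCharsetException from Charset.forName), it means your runtime does not provide "GBK". You might need to:

  • Ensure you are using a full, standard JVM distribution (like Oracle's or OpenJDK's) rather than a stripped-down runtime.
  • On Java 9 and later, check whether a custom runtime image built with jlink left out the jdk.charsets module, which holds the extended charsets on some platforms. On Java 8 and earlier, check that lib/charsets.jar is present in the JRE.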
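A quick way to find out whether a given runtime provides GBK is to ask the Charset class directly. This is a small diagnostic sketch; the class name GbkSupportCheck and the method isGbkSupported are illustrative.

```java
import java.nio.charset.Charset;

class GbkSupportCheck {
    // True if this runtime can decode and encode GBK.
    static boolean isGbkSupported() {
        return Charset.isSupported("GBK");
    }

    public static void main(String[] args) {
        if (isGbkSupported()) {
            Charset gbk = Charset.forName("GBK");
            System.out.println("GBK is available as " + gbk.name()
                    + ", aliases: " + gbk.aliases());
        } else {
            System.out.println("GBK is not available in this runtime.");
        }
    }
}
```

Run it with the same java executable that runs your application, since different installations on one machine can differ in the charsets they ship.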

Summary: Which Method to Use?

| Feature | Method 1: InputStreamReader / OutputStreamWriter | Method 2: String Constructors |
| --- | --- | --- |
| Memory Usage | Low. Processes data in chunks (line by line or buffer by buffer). | High. Loads the entire file into memory. |
| Performance | Good for large files. | Good for small files. Can be faster for tiny files due to less overhead. |
| Complexity | Slightly more verbose due to wrapping streams. | Very concise and easy to read. |
| Recommendation | Highly recommended. The standard, production-ready way. | Use only for small files or quick scripts where memory is not a concern. |