杰瑞科技汇

How Do You Convert GBK to UTF-8 Encoding in Java?

Converting text from GBK to UTF-8 in Java is a common task, especially when dealing with legacy systems or data from regions such as mainland China, where GBK is prevalent.

The core principle is to use Java's built-in character encoding support, primarily with InputStreamReader and OutputStreamWriter. These classes act as bridges between byte streams (which deal with raw data) and character streams (which deal with text).

Here’s a comprehensive guide, from the basic concept to complete, practical examples.


The Core Concept: InputStreamReader and OutputStreamWriter

When you read a text file, Java needs to know how to interpret the raw bytes as characters. This is where the character encoding comes in.

  • InputStreamReader(InputStream in, String charsetName): Reads bytes from an input stream and decodes them into characters using the specified charsetName (e.g., "GBK").
  • OutputStreamWriter(OutputStream out, String charsetName): Writes characters to an output stream and encodes them into bytes using the specified charsetName (e.g., "UTF-8").

The process is:

  1. Read: Read bytes from a source file (e.g., data.txt encoded in GBK).
  2. Decode: Use InputStreamReader with the "GBK" charset to convert those bytes into Java's internal char representation (UTF-16).
  3. Encode: Use OutputStreamWriter with the "UTF-8" charset to convert the chars into UTF-8 bytes.
  4. Write: Write the resulting UTF-8 bytes to a destination file (e.g., data_utf8.txt).
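The four steps above can be tried without touching the file system. The following is a minimal sketch, assuming the running JDK ships the GBK charset; the class name StreamTranscodeSketch and the helpers transcode and toHex are hypothetical names introduced for illustration. It feeds the GBK bytes of "你好" through an InputStreamReader and an OutputStreamWriter and prints the resulting UTF-8 bytes in hex.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamTranscodeSketch {
    // Hypothetical helper: decodes GBK bytes from 'in' and writes UTF-8 bytes to 'out'.
    public static void transcode(InputStream in, OutputStream out) throws IOException {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        try (Reader reader = new InputStreamReader(in, gbk);                       // steps 1-2: read + decode
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) { // steps 3-4: encode + write
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // Hypothetical helper: renders bytes as space-separated hex pairs.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // "你好" encoded in GBK: two bytes per character
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(gbkBytes), out);
        // The same text in UTF-8 takes three bytes per character
        System.out.println(toHex(out.toByteArray())); // prints: E4 BD A0 E5 A5 BD
    }
}
```

The text itself never changes; only its byte representation does, which is why the four GBK bytes become six UTF-8 bytes.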

Basic Code Example (File to File)

This is the most common scenario: converting an entire text file.

import java.io.*;
public class GbkToUtf8Converter {
    public static void main(String[] args) {
        // Define the source and destination file paths
        String sourceFile = "path/to/your/source_gbk_file.txt";
        String destFile = "path/to/your/destination_utf8_file.txt";
        try {
            // 1. Create input and output streams
            // FileInputStream reads raw bytes from the source file
            FileInputStream fis = new FileInputStream(sourceFile);
            // FileOutputStream writes raw bytes to the destination file
            FileOutputStream fos = new FileOutputStream(destFile);
            // 2. Create readers and writers with the specified charsets
            // InputStreamReader decodes GBK bytes into Java characters (UTF-16)
            InputStreamReader isr = new InputStreamReader(fis, "GBK");
            // OutputStreamWriter encodes Java characters into UTF-8 bytes
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
            // 3. Use a buffered reader/writer for efficiency
            BufferedReader br = new BufferedReader(isr);
            BufferedWriter bw = new BufferedWriter(osw);
            // 4. Read from the source and write to the destination line by line
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine(); // Write the platform's line separator (readLine() strips the original one)
            }
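            // 5. Close all resources in reverse order of creation.
            // Closing flushes any buffered data and releases system resources.
            // Note: if an exception is thrown above, these calls are skipped and the
            // files stay open; the try-with-resources version avoids that.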
            bw.close();
            br.close();
            osw.close();
            isr.close();
            fos.close();
            fis.close();
            System.out.println("File converted from GBK to UTF-8 successfully!");
        } catch (UnsupportedEncodingException e) {
            System.err.println("Error: The specified encoding is not supported.");
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            System.err.println("Error: Source file not found.");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("An I/O error occurred during the conversion.");
            e.printStackTrace();
        }
    }
}

A More Modern Approach (Java 7+): Try-with-resources

The try-with-resources statement automatically closes resources that implement the AutoCloseable interface, making your code cleaner and safer.

import java.io.*;
public class GbkToUtf8ConverterModern {
    public static void main(String[] args) {
        String sourceFile = "path/to/your/source_gbk_file.txt";
        String destFile = "path/to/your/destination_utf8_file.txt";
        // The try-with-resources statement ensures all resources are closed automatically
        try (
            // Readers and Writers are created here
            FileInputStream fis = new FileInputStream(sourceFile);
            FileOutputStream fos = new FileOutputStream(destFile);
            InputStreamReader isr = new InputStreamReader(fis, "GBK");
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
            BufferedReader br = new BufferedReader(isr);
            BufferedWriter bw = new BufferedWriter(osw)
        ) {
            // The code to read and write goes here
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
            System.out.println("File converted from GBK to UTF-8 successfully!");
        } catch (UnsupportedEncodingException e) {
            System.err.println("Error: The specified encoding is not supported.");
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            System.err.println("Error: Source file not found.");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("An I/O error occurred during the conversion.");
            e.printStackTrace();
        }
    }
}
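One side effect of the readLine()/newLine() loop used above is that every line ending is rewritten to the platform's line separator, and a trailing newline is added even if the source file had none. If the original line endings need to survive the conversion, a character-buffer copy is one option. The following is a minimal sketch under that assumption; the class name GbkToUtf8BufferCopy and the method convert are hypothetical, and the demo in main uses temporary files rather than real paths.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class GbkToUtf8BufferCopy {
    // Hypothetical helper: copies characters in chunks, so "\r\n" and "\n" pass through unchanged.
    public static void convert(Path source, Path dest) throws IOException {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        try (Reader reader = new InputStreamReader(Files.newInputStream(source), gbk);
             Writer writer = new OutputStreamWriter(Files.newOutputStream(dest), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("gbk-demo", ".txt");
        Path dest = Files.createTempFile("utf8-demo", ".txt");
        // "你" CRLF "好", written as Unicode escapes so the demo does not depend
        // on the encoding of this source file
        String text = "\u4F60\r\n\u597D";
        Files.write(source, text.getBytes(Charset.forName("GBK")));

        convert(source, dest);

        String result = new String(Files.readAllBytes(dest), StandardCharsets.UTF_8);
        System.out.println("Line endings preserved: " + result.equals(text));
    }
}
```

The line-based loop is still a reasonable choice when normalizing line endings is acceptable or even desired.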

Converting a String in Memory

Sometimes you don't have a file, but GBK-encoded data already in memory, for example bytes received from a network request.

Important Note: A Java String is always a sequence of UTF-16 code units, regardless of where its text came from. You can't have a "GBK string". What you have is a byte array that represents GBK text. The conversion process involves decoding the bytes into a String and then encoding that String into a new byte array (UTF-8).

import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
public class GbkStringConversion {
    public static void main(String[] args) {
        String originalText = "你好,世界!这是一个GBK编码的测试。";
        // getBytes(String) and new String(byte[], String) throw the checked
        // UnsupportedEncodingException, so both calls must sit inside the try block.
        try {
            // This is a byte array that was encoded using the GBK charset.
            // For demonstration, we'll create it manually. In a real scenario,
            // you might get this from a network request or a file.
            byte[] gbkBytes = originalText.getBytes("GBK");
            // Print the raw byte values; decoding them with the platform default
            // charset would usually produce garbled text.
            System.out.println("Original GBK Bytes: " + Arrays.toString(gbkBytes));
            // --- Conversion Process ---
            // 1. Decode the GBK byte array into a Java String (UTF-16)
            String decodedString = new String(gbkBytes, "GBK");
            System.out.println("Decoded String (internal UTF-16): " + decodedString);
            // 2. Encode the Java String into a UTF-8 byte array
            byte[] utf8Bytes = decodedString.getBytes(StandardCharsets.UTF_8);
            // Or: byte[] utf8Bytes = decodedString.getBytes("UTF-8");
            // You can now use the utf8Bytes, for example, to write to a file
            // or send over a network.
            String finalUtf8Text = new String(utf8Bytes, StandardCharsets.UTF_8);
            System.out.println("Final UTF-8 Text: " + finalUtf8Text);
        } catch (UnsupportedEncodingException e) {
            // This is unlikely, as standard JDK distributions ship both GBK and UTF-8.
            e.printStackTrace();
        }
    }
}
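One caveat with new String(bytes, charset): byte sequences that are not valid in the given charset are silently replaced with the replacement character U+FFFD rather than reported. When the input may be truncated or may not be GBK at all, a CharsetDecoder configured to report errors can surface the problem instead. The following is a minimal sketch; the class name StrictGbkDecode and the method decodeStrict are hypothetical names introduced for illustration. A successful strict decode does not prove the data was GBK, because many byte sequences from other encodings also happen to be valid GBK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictGbkDecode {
    // Hypothetical helper: decodes GBK bytes and throws instead of substituting U+FFFD.
    public static String decodeStrict(byte[] gbkBytes) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("GBK") // assumption: GBK is available in this JDK
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(gbkBytes)).toString();
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3}; // "你好" in GBK
        byte[] truncated = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA};          // last character cut in half

        try {
            System.out.println("Valid input decoded: " + decodeStrict(valid));
            System.out.println("Truncated input decoded: " + decodeStrict(truncated));
        } catch (CharacterCodingException e) {
            // Reached for the truncated array: the lone lead byte 0xBA is malformed input
            System.err.println("Input is not valid GBK: " + e);
        }
    }
}
```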

Practical Utility: Converting an Entire Directory

This is a useful utility for converting all .txt files located directly inside a directory. Note that Files.newDirectoryStream does not descend into subdirectories.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
public class DirectoryConverter {
    public static void main(String[] args) {
        // The directory to scan for GBK files
        Path sourceDir = Paths.get("path/to/your/directory_with_gbk_files");
        // The directory where UTF-8 files will be saved
        Path destDir = Paths.get("path/to/your/utf8_output_directory");
        // Create the destination directory if it doesn't exist
        if (!Files.exists(destDir)) {
            try {
                Files.createDirectories(destDir);
            } catch (IOException e) {
                System.err.println("Could not create destination directory.");
                e.printStackTrace();
                return;
            }
        }
        // Use a try-with-resources for the directory stream
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sourceDir)) {
            for (Path path : stream) {
                if (Files.isRegularFile(path) && path.toString().endsWith(".txt")) {
                    System.out.println("Processing: " + path.getFileName());
                    convertFile(path, destDir.resolve(path.getFileName()));
                }
            }
        } catch (IOException e) {
            System.err.println("Error reading the source directory.");
            e.printStackTrace();
        }
    }
    public static void convertFile(Path sourcePath, Path destPath) {
        try (
            // Using Files.newBufferedReader/Writer is a convenient shortcut
            // StandardCharsets has no GBK constant, so the charset is looked up by name
            BufferedReader br = Files.newBufferedReader(sourcePath, java.nio.charset.Charset.forName("GBK"));
            BufferedWriter bw = Files.newBufferedWriter(destPath, StandardCharsets.UTF_8)
        ) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
            System.out.println("  -> Converted to: " + destPath.getFileName());
        } catch (IOException e) {
            System.err.println("Failed to convert " + sourcePath.getFileName() + ": " + e.getMessage());
        }
    }
}
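The DirectoryStream loop above only visits files directly inside sourceDir. If subdirectories need converting as well, Files.walk is one way to do it. The following is a minimal sketch, not a drop-in replacement; the class name RecursiveDirectoryConverter and the method convertTree are hypothetical, and it assumes every .txt file under the source directory really is GBK-encoded. It mirrors the relative path of each file under the destination directory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveDirectoryConverter {
    // Hypothetical helper: converts every .txt file under sourceDir and returns how many were converted.
    public static int convertTree(Path sourceDir, Path destDir) throws IOException {
        List<Path> textFiles;
        // Collect the file list first so the walk is closed before any output is written
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            textFiles = walk.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .collect(Collectors.toList());
        }
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        for (Path source : textFiles) {
            // Keep the same relative layout under the destination directory
            Path target = destDir.resolve(sourceDir.relativize(source));
            Files.createDirectories(target.getParent());
            try (BufferedReader br = Files.newBufferedReader(source, gbk);
                 BufferedWriter bw = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    bw.write(line);
                    bw.newLine();
                }
            }
        }
        return textFiles.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java RecursiveDirectoryConverter <sourceDir> <destDir>");
            return;
        }
        int count = convertTree(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Converted " + count + " file(s).");
    }
}
```

Unlike InputStreamReader, a reader obtained from Files.newBufferedReader throws an IOException on malformed input instead of substituting a replacement character, so a file that is not valid GBK stops this sketch rather than being silently damaged.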

Key Takeaways

  1. Use InputStreamReader and OutputStreamWriter for stream-based I/O (files, network).
  2. Use String(byte[], charset) and String.getBytes(charset) for in-memory conversions between byte arrays and Strings.
  3. Always specify the encoding explicitly (e.g., "GBK", "UTF-8"). Never rely on the platform's default charset (new String(byteArray)), as it leads to inconsistent behavior across different environments.
  4. Prefer try-with-resources to ensure your streams and readers are always closed properly.
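  5. Handle UnsupportedEncodingException when you pass charset names as strings: UTF-8 is guaranteed to be available on every Java platform, while GBK is shipped by standard JDK distributions but is not in the guaranteed set. The overloads that take a Charset object (StandardCharsets.UTF_8, Charset.forName("GBK")) do not throw this checked exception.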
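Takeaways 3 and 5 can be combined by working with Charset objects instead of charset names. The following is a minimal sketch; the class name CharsetLookupSketch and the method gbkToUtf8 are hypothetical names introduced for illustration. It checks once whether GBK is available and then converts without any checked exception handling.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLookupSketch {
    // Hypothetical helper: converts GBK bytes to UTF-8 bytes using Charset objects.
    public static byte[] gbkToUtf8(byte[] gbkBytes) {
        // UTF-8 is always present; GBK is checked explicitly because it is not guaranteed
        if (!Charset.isSupported("GBK")) {
            throw new IllegalStateException("This JVM does not provide the GBK charset");
        }
        Charset gbk = Charset.forName("GBK");
        return new String(gbkBytes, gbk).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shown only for diagnostics; the conversion never depends on this value
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3}; // "你好" in GBK
        byte[] utf8Bytes = gbkToUtf8(gbkBytes);
        System.out.println("GBK length: " + gbkBytes.length + ", UTF-8 length: " + utf8Bytes.length);
    }
}
```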