Of course. Converting text from GBK to UTF-8 in Java is a common task, especially when dealing with legacy systems or data from regions like Mainland China where GBK is prevalent.
The core principle is to use Java's built-in character encoding support, primarily with InputStreamReader and OutputStreamWriter. These classes act as bridges between byte streams (which deal with raw data) and character streams (which deal with text).
Here’s a comprehensive guide, from the basic concept to complete, practical examples.
The Core Concept: InputStreamReader and OutputStreamWriter
When you read a text file, Java needs to know how to interpret the raw bytes as characters. This is where the character encoding comes in.
- InputStreamReader(InputStream in, String charsetName): Reads bytes from an input stream and decodes them into characters using the specified charsetName (e.g., "GBK").
- OutputStreamWriter(OutputStream out, String charsetName): Writes characters to an output stream and encodes them into bytes using the specified charsetName (e.g., "UTF-8").
The process is:
- Read: Read bytes from a source file (e.g., data.txt encoded in GBK).
- Decode: Use InputStreamReader with the "GBK" charset to convert those bytes into Java's internal char representation (UTF-16).
- Encode: Use OutputStreamWriter with the "UTF-8" charset to convert the chars into UTF-8 bytes.
- Write: Write the resulting UTF-8 bytes to a destination file (e.g., data_utf8.txt).
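The four steps above can be sketched as a single character-buffer copy. This is a minimal sketch, not a definitive implementation: the class name GbkPipelineSketch and the helper convert are hypothetical, and it assumes the running JVM ships the GBK charset. Copying by buffer rather than by line leaves the original line endings untouched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class GbkPipelineSketch {

    // Hypothetical helper: copies the text of a GBK file into a UTF-8 file.
    static void convert(Path source, Path dest) throws IOException {
        // Throws UnsupportedCharsetException (unchecked) if this JVM lacks GBK.
        Charset gbk = Charset.forName("GBK");
        try (InputStream in = Files.newInputStream(source);          // read raw bytes
             Reader reader = new InputStreamReader(in, gbk);         // decode GBK bytes to chars
             OutputStream out = Files.newOutputStream(dest);         // write raw bytes
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) { // encode chars to UTF-8
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("gbk-in", ".txt");
        Path dest = Files.createTempFile("utf8-out", ".txt");
        // "\u4F60\u597D" is the text "你好", escaped so the source file's own encoding does not matter.
        Files.write(source, "\u4F60\u597D\r\n".getBytes(Charset.forName("GBK")));
        convert(source, dest);
        System.out.println(Files.size(source) + " bytes in, " + Files.size(dest) + " bytes out");
    }
}
```

The Charset-based constructors used here do not throw UnsupportedEncodingException, which keeps the exception handling down to IOException.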
Basic Code Example (File to File)
This is the most common scenario: converting an entire text file.
import java.io.*;

public class GbkToUtf8Converter {
    public static void main(String[] args) {
        // Define the source and destination file paths
        String sourceFile = "path/to/your/source_gbk_file.txt";
        String destFile = "path/to/your/destination_utf8_file.txt";
        try {
            // 1. Create input and output streams
            // FileInputStream reads raw bytes from the source file
            FileInputStream fis = new FileInputStream(sourceFile);
            // FileOutputStream writes raw bytes to the destination file
            FileOutputStream fos = new FileOutputStream(destFile);
            // 2. Create readers and writers with the specified charsets
            // InputStreamReader decodes GBK bytes into Java characters (UTF-16)
            InputStreamReader isr = new InputStreamReader(fis, "GBK");
            // OutputStreamWriter encodes Java characters into UTF-8 bytes
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
            // 3. Use a buffered reader/writer for efficiency
            BufferedReader br = new BufferedReader(isr);
            BufferedWriter bw = new BufferedWriter(osw);
            // 4. Read from the source and write to the destination line by line
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                // readLine() strips the line terminator, so write one back.
                // newLine() uses the platform's line separator, which may differ from the source file's.
                bw.newLine();
            }
            // 5. Close all resources in reverse order of creation
            // This is crucial to flush any buffered data and release system resources.
            bw.close();
            br.close();
            osw.close();
            isr.close();
            fos.close();
            fis.close();
            System.out.println("File converted from GBK to UTF-8 successfully!");
        } catch (UnsupportedEncodingException e) {
            System.err.println("Error: The specified encoding is not supported.");
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            System.err.println("Error: Source file not found.");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("An I/O error occurred during the conversion.");
            e.printStackTrace();
        }
    }
}
A More Modern Approach (Java 7+): Try-with-resources
The try-with-resources statement automatically closes resources that implement the AutoCloseable interface, making your code cleaner and safer.
import java.io.*;

public class GbkToUtf8ConverterModern {
    public static void main(String[] args) {
        String sourceFile = "path/to/your/source_gbk_file.txt";
        String destFile = "path/to/your/destination_utf8_file.txt";
        // The try-with-resources statement ensures all resources are closed automatically
        try (
            // Readers and Writers are created here
            FileInputStream fis = new FileInputStream(sourceFile);
            FileOutputStream fos = new FileOutputStream(destFile);
            InputStreamReader isr = new InputStreamReader(fis, "GBK");
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
            BufferedReader br = new BufferedReader(isr);
            BufferedWriter bw = new BufferedWriter(osw)
        ) {
            // The code to read and write goes here
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
            System.out.println("File converted from GBK to UTF-8 successfully!");
        } catch (UnsupportedEncodingException e) {
            System.err.println("Error: The specified encoding is not supported.");
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            System.err.println("Error: Source file not found.");
            e.printStackTrace();
        } catch (IOException e) {
            System.err.println("An I/O error occurred during the conversion.");
            e.printStackTrace();
        }
    }
}
Converting a String in Memory
Sometimes you don't have a file, but GBK-encoded bytes that are already in memory (for example, from a network response).
Important Note: A Java String is always a sequence of UTF-16 code units, whatever encoding the data originally had. You can't have a "GBK string". What you have is a byte array that represents GBK text. The conversion process involves decoding the bytes into a String and then encoding that String into a new byte array (UTF-8).
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class GbkStringConversion {
    public static void main(String[] args) {
        String originalText = "你好,世界!这是一个GBK编码的测试。";
        try {
            // This is a byte array that was encoded using the GBK charset.
            // For demonstration, we'll create it manually. In a real scenario,
            // you might get this from a network request or a file.
            // getBytes(String) declares UnsupportedEncodingException, so it must sit inside the try block.
            byte[] gbkBytes = originalText.getBytes("GBK");
            // Print the length rather than new String(gbkBytes), which would decode
            // with the platform default charset and may produce garbled text.
            System.out.println("GBK byte count: " + gbkBytes.length);
            // --- Conversion Process ---
            // 1. Decode the GBK byte array into a Java String (UTF-16)
            String decodedString = new String(gbkBytes, "GBK");
            System.out.println("Decoded String (internal UTF-16): " + decodedString);
            // 2. Encode the Java String into a UTF-8 byte array
            byte[] utf8Bytes = decodedString.getBytes(StandardCharsets.UTF_8);
            // Or: byte[] utf8Bytes = decodedString.getBytes("UTF-8");
            // You can now use the utf8Bytes, for example, to write to a file
            // or send over a network.
            String finalUtf8Text = new String(utf8Bytes, StandardCharsets.UTF_8);
            System.out.println("Final UTF-8 Text: " + finalUtf8Text);
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for "UTF-8", which every JVM must support. "GBK" is not
            // one of the required charsets, so a stripped-down runtime could lack it.
            e.printStackTrace();
        }
    }
}
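One caveat with new String(bytes, charset) is that invalid byte sequences are silently replaced with U+FFFD instead of raising an error. Where corrupted input should be detected, a CharsetDecoder configured to report errors is one option. The following is a sketch under that assumption; the class name StrictGbkDecoding and the helper gbkToUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictGbkDecoding {

    // Hypothetical helper: re-encodes GBK bytes as UTF-8 and fails on invalid input.
    static byte[] gbkToUtf8(byte[] gbkBytes) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("GBK").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        String text = decoder.decode(ByteBuffer.wrap(gbkBytes)).toString();
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "\u4F60\u597D" is the text "你好": two bytes per character in GBK, three in UTF-8.
        byte[] gbkBytes = "\u4F60\u597D".getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = gbkToUtf8(gbkBytes);
        System.out.println(gbkBytes.length + " GBK bytes -> " + utf8Bytes.length + " UTF-8 bytes");
        // prints "4 GBK bytes -> 6 UTF-8 bytes"
    }
}
```

Whether to fail or substitute on bad input depends on the data: strict decoding suits batch migrations where silent corruption is unacceptable, while the lenient String constructor may be preferable for best-effort display.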
Practical Utility: Converting an Entire Directory
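This utility converts every .txt file directly inside a directory. Files.newDirectoryStream lists only the top level, so subdirectories are not visited; a recursive variant would need something like Files.walk.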
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class DirectoryConverter {
    public static void main(String[] args) {
        // The directory to scan for GBK files
        Path sourceDir = Paths.get("path/to/your/directory_with_gbk_files");
        // The directory where UTF-8 files will be saved
        Path destDir = Paths.get("path/to/your/utf8_output_directory");
        // Create the destination directory if it doesn't exist
        if (!Files.exists(destDir)) {
            try {
                Files.createDirectories(destDir);
            } catch (IOException e) {
                System.err.println("Could not create destination directory.");
                e.printStackTrace();
                return;
            }
        }
        // Use a try-with-resources for the directory stream
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sourceDir)) {
            for (Path path : stream) {
                if (Files.isRegularFile(path) && path.toString().endsWith(".txt")) {
                    System.out.println("Processing: " + path.getFileName());
                    convertFile(path, destDir.resolve(path.getFileName()));
                }
            }
        } catch (IOException e) {
            System.err.println("Error reading the source directory.");
            e.printStackTrace();
        }
    }

    public static void convertFile(Path sourcePath, Path destPath) {
        // StandardCharsets has no GBK constant, so look the charset up by name.
        Charset gbk = Charset.forName("GBK");
        try (
            // Using Files.newBufferedReader/Writer is a convenient shortcut
            BufferedReader br = Files.newBufferedReader(sourcePath, gbk);
            BufferedWriter bw = Files.newBufferedWriter(destPath, StandardCharsets.UTF_8)
        ) {
            String line;
            while ((line = br.readLine()) != null) {
                bw.write(line);
                bw.newLine();
            }
            System.out.println(" -> Converted to: " + destPath.getFileName());
        } catch (IOException e) {
            System.err.println("Failed to convert " + sourcePath.getFileName() + ": " + e.getMessage());
        }
    }
}
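If subdirectories need to be covered as well, one possible variant walks the tree with Files.walk and mirrors the relative layout under the destination. This is a rough sketch rather than a drop-in replacement: the class name RecursiveConverterSketch and the method convertTree are hypothetical, and it assumes the destination directory lies outside the source tree.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RecursiveConverterSketch {

    // Hypothetical helper: converts every .txt file under sourceDir (at any depth)
    // and returns the number of files converted.
    static int convertTree(Path sourceDir, Path destDir) throws IOException {
        Charset gbk = Charset.forName("GBK");
        List<Path> files;
        // Files.walk returns a lazy stream that holds directory handles, so close it promptly.
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".txt"))
                        .collect(Collectors.toList());
        }
        for (Path source : files) {
            // Recreate the file's path relative to sourceDir underneath destDir.
            Path dest = destDir.resolve(sourceDir.relativize(source));
            Files.createDirectories(dest.getParent());
            try (BufferedReader br = Files.newBufferedReader(source, gbk);
                 BufferedWriter bw = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)) {
                char[] buffer = new char[8192];
                int n;
                while ((n = br.read(buffer)) != -1) {
                    bw.write(buffer, 0, n); // buffer copy keeps the original line endings
                }
            }
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("gbk-src");
        Path dst = Files.createTempDirectory("utf8-dst");
        Path nested = Files.createDirectories(src.resolve("sub"));
        // "\u4F60\u597D" is the text "你好".
        Files.write(nested.resolve("note.txt"), "\u4F60\u597D".getBytes(Charset.forName("GBK")));
        System.out.println("Converted files: " + convertTree(src, dst));
    }
}
```

Unlike convertFile above, this sketch lets the first IOException abort the whole run; catching it per file would be the alternative when a partial conversion is acceptable.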
Key Takeaways
- Use InputStreamReader and OutputStreamWriter for stream-based I/O (files, network).
- Use String(byte[], charset) and String.getBytes(charset) for in-memory conversions between byte arrays and Strings.
- Always specify the encoding explicitly (e.g., "GBK", "UTF-8"). Never rely on the platform's default charset (new String(byteArray)), as it leads to inconsistent behavior across different environments.
- Prefer try-with-resources to ensure your streams and readers are always closed properly.
- Handle UnsupportedEncodingException: It is thrown only by the overloads that take the charset name as a String. UTF-8 is guaranteed on every JVM, but GBK is not one of the charsets Java requires, so the exception is possible on a reduced runtime. Overloads that take a Charset object do not declare it.
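The last two points can be combined by resolving the charset once, up front, and passing Charset objects everywhere afterwards. A minimal sketch, assuming it is acceptable to fail fast when a charset is missing; the class name CharsetLookupSketch and the helpers requireCharset and transcode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLookupSketch {

    // Hypothetical helper: resolves a charset by name or fails with a readable message.
    static Charset requireCharset(String name) {
        if (!Charset.isSupported(name)) {
            throw new IllegalStateException("Charset not available on this JVM: " + name);
        }
        return Charset.forName(name);
    }

    // Hypothetical helper: decodes with one charset and re-encodes with another.
    // The Charset overloads declare no checked exceptions.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset gbk = requireCharset("GBK");
        // "\u4F60\u597D" is the text "你好".
        byte[] gbkBytes = "\u4F60\u597D".getBytes(gbk);
        byte[] utf8Bytes = transcode(gbkBytes, gbk, StandardCharsets.UTF_8);
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```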
