Handling Chinese Character Encoding Issues in Oracle Database

  sonic0002        2024-07-12 21:09:14

When working with Chinese data in Oracle databases, character encoding issues often lead to garbled text that cannot be displayed or processed correctly. This typically happens when the character set of the database, the client, or the application does not match the actual encoding of the data. For example, a client whose NLS_LANG is not set falls back to US7ASCII, which cannot represent Chinese characters. Aligning the character sets of the database, client, and application lets data be converted correctly and avoids garbled text. This article introduces several methods to resolve Chinese character encoding issues in Oracle databases.

Background

Oracle is a widely used DBMS that can be self-managed or accessed through cloud services. However, users sometimes encounter garbled text when storing or querying Chinese data. This article describes the common causes and how to address them.

Symptoms of Database Garbled Text

When data retrieved from Oracle displays as garbled text or corrupted characters, it indicates an issue with the character set. Common symptoms include:

Inconsistent Character Sets Between Database and Application

If the application uses a different character set than the database and no conversion takes place between them, query results may appear garbled because the application decodes the returned bytes with the wrong character set. For example, reading UTF-8 encoded data in an application that assumes GB2312 may result in garbled text.

For example:

  • Database Character Set: UTF-8
  • Application Character Set: GB2312

Original Text (UTF-8): 你好, 世界
Garbled Text (GB2312): 浣犲ソ, 涓栫晫

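The characters appear as nonsensical symbols because the application interprets bytes meant for UTF-8 as GB2312. Strictly speaking, some of the byte pairs shown only exist in GBK, the superset of GB2312 that most Chinese-locale applications actually use.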
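The mismatch can be reproduced outside the database. The following is a minimal Python sketch, not Oracle-specific: it encodes the text as UTF-8 and decodes the same bytes as GBK. The function name `misread` is illustrative, and Python's `gbk` codec is assumed to stand in for the application's character set.

```python
def misread(text: str, stored_as: str, read_as: str) -> str:
    """Encode text with one character set and decode the bytes with another."""
    raw = text.encode(stored_as)
    # errors="replace" substitutes U+FFFD for byte sequences invalid in read_as
    return raw.decode(read_as, errors="replace")


original = "你好, 世界"
garbled = misread(original, "utf-8", "gbk")
print(garbled)

# The underlying bytes are undamaged, so as long as no replacement occurred,
# reversing the wrong decode recovers the original text.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered)
```

Because the bytes themselves are intact in this case, the fix is to correct the declared character set rather than to modify the data.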

Illegal Characters in the Database

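If the stored data contains illegal characters, such as control characters or raw binary data, query results may appear garbled. Similarly, when a column holds characters that the client character set cannot represent, for example 4-byte UTF-8 characters read through a 2-byte character set, those characters are typically replaced with a substitute such as ? during conversion.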

For example:

  • Database Character Set: UTF-8
  • Data: Contains control characters or binary data

Original Text: 你好\x01\x02, 世界
Garbled Text: 你好□, 世界

Here, the control characters \x01\x02 are rendered as a placeholder symbol (□) or not shown at all, depending on the terminal and font.
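One way to check fetched text for such characters is to scan it by Unicode category. This is a minimal Python sketch; the helper names `find_control_chars` and `strip_control_chars` are hypothetical, and keeping tab, newline, and carriage return is an assumption about what counts as legitimate data.

```python
import unicodedata

ALLOWED = "\t\n\r"  # assumed to be legitimate in the data


def find_control_chars(text: str) -> list:
    """Return (index, code point) pairs for control characters in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED
    ]


def strip_control_chars(text: str) -> str:
    """Drop control characters, keeping tab, newline and carriage return."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in ALLOWED
    )


sample = "你好\x01\x02, 世界"
print(find_control_chars(sample))
print(strip_control_chars(sample))
```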

Incorrect Database Character Set Configuration

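An incorrect database character set configuration can cause data stored in the database to appear garbled. For instance, if the database character set is configured as UTF-8 but the bytes actually stored are encoded in GB2312, garbled text will appear. This typically happens when the client declares the same character set as the database, so the bytes are stored without conversion or validation.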

For example:

  • Database Character Set Configured As: UTF-8
  • Actual Data Encoding: GB2312

Original Text (GB2312): 你好, 世界
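Garbled Text (Misinterpreted as UTF-8): ���, ���� (the exact number of replacement characters depends on the decoder)

The data stored as GB2312 is incorrectly interpreted as UTF-8. Because most GB2312 byte sequences are not valid UTF-8, the result is usually replacement characters (�) or question marks rather than other Chinese characters.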
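This case can be detected because GB2312/GBK bytes for Chinese text rarely form valid UTF-8. The following minimal Python sketch illustrates this under the assumption that Python's `gbk` codec stands in for the GB2312 data; `is_valid_utf8` is a hypothetical helper.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


gbk_bytes = "你好, 世界".encode("gbk")

# Strict decoding fails, which is a strong hint that the data is not UTF-8.
print(is_valid_utf8(gbk_bytes))

# Lenient decoding hides the error behind U+FFFD replacement characters.
lenient = gbk_bytes.decode("utf-8", errors="replace")
print(lenient)

# Decoding with the character set the data was really written in works.
print(gbk_bytes.decode("gbk"))
```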

Solutions

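Specify Column Types and Length Semantics When Creating Tables

Oracle has no per-table character set: CREATE TABLE does not accept a CHARACTER SET clause (that syntax belongs to MySQL). VARCHAR2 and CHAR columns always use the database character set, while NVARCHAR2 and NCHAR columns use the national character set. What a table definition can control is how column lengths are counted. In a UTF-8 database a Chinese character usually occupies 3 bytes, so declaring lengths in characters (CHAR) rather than bytes (BYTE) avoids "value too large for column" errors (ORA-12899). Below is an example for a database whose character set is already Unicode:

CREATE TABLE mytable (
  id NUMBER(10),
  name VARCHAR2(50 CHAR)
)
TABLESPACE mytable
PCTFREE 10
INITRANS 1
STORAGE (
  BUFFER_POOL DEFAULT
)
NOCOMPRESS NOLOGGING;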
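Length semantics matter for Chinese text because a column declared in bytes, such as VARCHAR2(50 BYTE), holds fewer characters than the number suggests. The following is a minimal Python sketch that uses Python's `utf-8` and `gbk` codecs as stand-ins for a Unicode and a GBK database character set (an assumption for illustration); `fits_in_bytes` is a hypothetical helper.

```python
def fits_in_bytes(text: str, limit: int, encoding: str) -> bool:
    """Check whether text fits a column limited to `limit` bytes."""
    return len(text.encode(encoding)) <= limit


name = "你好世界" * 5  # 20 Chinese characters

chars = len(name)
utf8_bytes = len(name.encode("utf-8"))  # 3 bytes per character here
gbk_bytes_len = len(name.encode("gbk"))  # 2 bytes per character here
print(chars, utf8_bytes, gbk_bytes_len)

# The same 20 characters fit a 50-byte limit in GBK but not in UTF-8.
print(fits_in_bytes(name, 50, "gbk"))
print(fits_in_bytes(name, 50, "utf-8"))
```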

Character Set Conversion

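During Oracle database migration or data import involving different character sets, character set conversion is necessary. The database character set is reported as NLS_CHARACTERSET, but it is fixed when the database is created and is not an initialization parameter, so it cannot be changed with ALTER SYSTEM. Moving an existing database to Unicode is normally done with Oracle's Database Migration Assistant for Unicode (DMU), or by exporting the data and importing it into a new database created with the AL32UTF8 character set. Before planning a conversion, check the current settings with the following query:

SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');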
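Conceptually, converting between character sets means decoding the bytes with the source character set and re-encoding the resulting text with the target one. The following minimal Python sketch shows only that idea, not Oracle's internal mechanism; `transcode` is a hypothetical helper, and strict error handling is chosen so that a wrongly declared source fails loudly instead of silently corrupting data.

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Decode raw from the source encoding and re-encode it in the target."""
    # Strict decoding/encoding raises on bytes or characters that do not fit.
    return raw.decode(source).encode(target)


gbk_data = "你好, 世界".encode("gbk")
utf8_data = transcode(gbk_data, "gbk", "utf-8")
print(utf8_data.decode("utf-8"))

# Declaring the wrong source character set is detected rather than ignored.
try:
    transcode(gbk_data, "utf-8", "gbk")
    wrong_source_detected = False
except UnicodeDecodeError:
    wrong_source_detected = True
print(wrong_source_detected)
```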

Character Set Conversion During Data Import

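Oracle provides the original Export/Import utilities (exp and imp) and the Data Pump utilities (expdp and impdp) for data export and import. Both convert character data automatically: the dump file records the character set it was written in, and the data is converted to the target database character set on import. For imp, the conversion also goes through the client session character set, which is taken from the NLS_LANG environment variable. The CHARSET parameter of imp is obsolete, CONTENT is a Data Pump parameter that imp does not accept, and FULL=Y cannot be combined with FROMUSER/TOUSER. For example, the following commands import a GBK encoded dump file (GBK is a superset of GB2312) into a UTF-8 Oracle database, with NLS_LANG set to match the character set of the dump file:

export NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
imp userid=test/test file=dump.dmp fromuser=test touser=test2 ignore=y feedback=1000000 buffer=1000000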
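Oracle client tools read the session character set from the NLS_LANG environment variable, which has the form LANGUAGE_TERRITORY.CHARSET. The following minimal Python sketch builds that value and an environment for launching a client tool. The mapping table and the helper names `nls_lang` and `client_env` are assumptions for illustration, and the AMERICAN_AMERICA defaults are only one possible choice.

```python
import os

# Assumed mapping from Python codec names to Oracle character set names.
ORACLE_CHARSETS = {
    "utf-8": "AL32UTF8",
    "gbk": "ZHS16GBK",
    "gb2312": "ZHS16CGB231280",
}


def nls_lang(codec: str, language: str = "AMERICAN",
             territory: str = "AMERICA") -> str:
    """Build an NLS_LANG value of the form LANGUAGE_TERRITORY.CHARSET."""
    return f"{language}_{territory}.{ORACLE_CHARSETS[codec.lower()]}"


def client_env(codec: str) -> dict:
    """Copy the current environment and set NLS_LANG for a child process."""
    env = os.environ.copy()
    env["NLS_LANG"] = nls_lang(codec)
    return env


print(nls_lang("gbk"))
print(client_env("utf-8")["NLS_LANG"])
```

The returned dictionary could be passed as the `env` argument of `subprocess.run` when starting a tool such as imp, so the setting applies only to that process.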

Use Unicode Character Set

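To prevent compatibility issues between different character sets, using a Unicode database character set is a good choice. Oracle recommends AL32UTF8, which covers Chinese together with virtually all other scripts; the older character set named UTF8 does not store supplementary characters as standard 4-byte UTF-8. With a Unicode database and a correctly configured client character set, data can be stored, accessed, and transmitted without loss.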
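The practical difference shows up with characters outside the legacy Chinese character sets. The following minimal Python sketch compares which encodings can represent a rare CJK character and an emoji, again using Python codecs as stand-ins for database character sets (an assumption for illustration); `can_encode` is a hypothetical helper.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


rare = "\U00020BB7"   # a CJK Extension B character (supplementary plane)
emoji = "\U0001F600"  # an emoji (supplementary plane)

for sample in ("你好", rare, emoji):
    print(
        ascii(sample),
        can_encode(sample, "utf-8"),
        can_encode(sample, "gbk"),
        can_encode(sample, "gb2312"),
    )

# Supplementary characters occupy 4 bytes in UTF-8.
rare_len = len(rare.encode("utf-8"))
print(rare_len)
```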

By following these practices, you can avoid and resolve garbled text issues when working with Chinese characters in Oracle databases.

Reference: https://segmentfault.com/a/1190000044984948
