When working with Chinese data in Oracle databases, character encoding problems often produce garbled text that cannot be displayed or processed correctly. A common cause is a character set left at US7ASCII, Oracle's historical default, which cannot represent Chinese characters; another is a mismatch between the character sets of the database, the client, and the application. Aligning these settings so that data is converted to the correct character set avoids the problem. This article introduces several methods to resolve Chinese character encoding issues in Oracle databases.
Background
Oracle is a widely used DBMS that can be self-managed or consumed as a cloud service. However, users sometimes run into garbled text when storing or querying Chinese data. This article describes the typical causes and how to address them.
Symptoms of Database Garbled Text
When data retrieved from Oracle displays as garbled text or corrupted characters, it indicates an issue with the character set. The most common situations are:
Inconsistent Character Sets Between Database and Application
If the application uses a different character set than the database and no conversion takes place between them, query results may appear garbled because the application decodes the returned bytes with the wrong encoding. This happens, for instance, when UTF-8 encoded data is read by an application that assumes GB2312.
For example:
- Database Character Set: UTF-8
- Application Character Set: GB2312
Original Text (UTF-8): 你好, 世界
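Garbled Text (read as GB2312/GBK): 浣犲ソ, 涓栫晫
The characters appear as nonsensical symbols because the application interprets the UTF-8 bytes as GB2312 (in practice its superset, GBK). Each Chinese character occupies three bytes in UTF-8, and the decoder regroups those bytes into two-byte pairs, so the six bytes of 你好 become three unrelated characters. ASCII characters such as the comma are unaffected.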
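The mismatch above can be reproduced outside the database. The following is a minimal Python sketch, assuming the application effectively decodes with Python's "gbk" codec; the variable names are illustrative only.

```python
# Reproduce the mojibake: encode as UTF-8, then decode the same bytes as GBK.
original = "你好, 世界"
utf8_bytes = original.encode("utf-8")

# Wrong decoder: the 3-byte UTF-8 characters are regrouped into 2-byte GBK pairs.
garbled = utf8_bytes.decode("gbk")
print(garbled)

# The bytes themselves were never damaged, so undoing the wrong step
# (re-encode as GBK, decode as UTF-8) restores the original text.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered)
```

Because the underlying bytes are intact, this kind of garbling is a display problem and can be fixed by correcting the client or application encoding rather than the stored data.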
Illegal Characters in the Database
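If a text column contains characters that cannot be displayed, such as control characters or binary data, the results may appear garbled. Similarly, reading a column that holds characters requiring 4 bytes in UTF-8 (for example rare ideographs or emoji) through a character set limited to 2 bytes per character, such as GBK, may also result in garbled or substituted text.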
For example:
- Database Character Set: UTF-8
- Data: Contains control characters or binary data
Original Text: 你好\x01\x02, 世界
Garbled Text: 你好□□, 世界
Here, the control characters \x01 and \x02 are not printable, so they are displayed as placeholder symbols (□) or not at all, depending on the client.
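One way to confirm that control characters are the cause is to scan the retrieved string for them. The sketch below is an illustration in Python using the standard unicodedata module; the helper names find_control_chars and strip_control_chars are hypothetical, and whether stripping is appropriate depends on the data.

```python
import unicodedata

# Whitespace controls that are normally legitimate in text columns.
ALLOWED = "\t\n\r"

def find_control_chars(text):
    """Return (index, code point) pairs for unexpected control characters."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED
    ]

def strip_control_chars(text):
    """Return text with unexpected control characters removed."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in ALLOWED
    )

sample = "你好\x01\x02, 世界"
print(find_control_chars(sample))
print(strip_control_chars(sample))
```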
Incorrect Database Character Set Configuration
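An incorrect character set configuration can cause data to be stored in the wrong encoding. For instance, if the database character set is UTF-8 and the client also declares UTF-8 while actually sending GB2312 bytes, Oracle sees matching character sets, performs no conversion, and stores the GB2312 bytes unchanged. Any client that later reads them as UTF-8 will see garbled text.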
For example:
- Database Character Set Configured As: UTF-8
- Actual Data Encoding: GB2312
Original Text (GB2312): 你好, 世界
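Garbled Text (Misinterpreted as UTF-8): ����, ���� (exact output varies by client)
The GB2312 bytes are not valid UTF-8 sequences, so a client that decodes them as UTF-8 typically shows replacement characters (�) or question marks instead of the original text.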
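A short Python sketch of this case, assuming Python's "gb2312" and "utf-8" codecs stand in for the actual and the configured encodings. It shows that a strict UTF-8 decoder rejects the bytes, while a lenient one substitutes U+FFFD replacement characters.

```python
original = "你好, 世界"
gb_bytes = original.encode("gb2312")

# A strict UTF-8 decoder refuses the GB2312 bytes outright.
try:
    gb_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for every invalid sequence.
lossy = gb_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the data was really written in recovers it.
restored = gb_bytes.decode("gb2312")

print(strict_ok)
print(lossy)
print(restored)
```

Once replacement characters have been written back to the database, the original bytes are gone, so the data should be repaired from the raw bytes, not from the already-decoded text.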
Solutions
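Account for the Character Set When Creating Tables
Unlike some other databases, Oracle has no table-level character set clause: VARCHAR2 and CHAR columns always use the database character set (NLS_CHARACTERSET), and NVARCHAR2 and NCHAR columns use the national character set (NLS_NCHAR_CHARACTERSET). What can be controlled per column is the length semantics. In a UTF-8 database a Chinese character usually takes 3 bytes, so declaring the length in characters (CHAR) rather than bytes (BYTE) prevents values from being rejected as too long. Below is an example of creating a table with a character-based column length:
CREATE TABLE mytable (
    id NUMBER(10),
    -- 50 characters rather than 50 bytes, so 50 Chinese characters fit
    name VARCHAR2(50 CHAR)
)
TABLESPACE mytable
PCTFREE 10
INITRANS 1
STORAGE (
    BUFFER_POOL DEFAULT
)
NOCOMPRESS NOLOGGING;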
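How many Chinese characters fit into a VARCHAR2 column whose length is counted in bytes depends on the encoding in use. The following Python sketch illustrates the arithmetic; the helper names encoded_length and fits_in_bytes are hypothetical, and Python's "utf-8" and "gbk" codecs are assumed to approximate the corresponding database character sets.

```python
def encoded_length(text, encoding):
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

def fits_in_bytes(text, limit, encoding):
    """True if the encoded text fits within a byte-based column limit."""
    return encoded_length(text, encoding) <= limit

name = "你" * 20  # 20 Chinese characters

for encoding in ("utf-8", "gbk"):
    print(encoding, encoded_length(name, encoding),
          fits_in_bytes(name, 50, encoding))
```

Twenty Chinese characters need 60 bytes in UTF-8 but only 40 bytes in GBK, so the same value can fit a 50-byte column in one database and be rejected in another.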
Character Set Conversion
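During Oracle database migration or data import involving different character sets, character set conversion is necessary. The database character set is recorded in the NLS_CHARACTERSET parameter, which is chosen when the database is created. It is not an initialization parameter and cannot be changed with ALTER SYSTEM; migrating an existing database to Unicode is normally done with Oracle's Database Migration Assistant for Unicode (DMU), or by exporting the data and importing it into a new database created with the target character set. Note that Oracle's name for standard UTF-8 is AL32UTF8; the older UTF8 character set predates supplementary characters and is kept mainly for compatibility. The following query shows the current database and national character sets:
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');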
Character Set Conversion During Data Import
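Oracle provides the original exp and imp utilities and the Data Pump expdp and impdp utilities for data export and import. Character data is converted automatically during import when the character set recorded in the dump file differs from that of the target database, so no conversion parameter is needed. With the original imp utility the client NLS_LANG setting also takes part in the conversion, so it is usually set to the character set of the export file to avoid an extra conversion step. For example, the following commands import a dump file exported from a GBK (ZHS16GBK) database into a UTF-8 (AL32UTF8) database:
# Match the client character set to that of the export file
export NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
# full=y is omitted because it cannot be combined with fromuser/touser
imp userid=test/test file=dump.dmp fromuser=test touser=test2 \
    ignore=y feedback=1000000 buffer=1000000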
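When the data to be loaded is a plain text file and not an Oracle dump, the conversion can also be done before loading. The following is a minimal Python sketch, assuming the source bytes are GBK and the target is UTF-8; the helper name reencode is hypothetical. Decoding is strict on purpose, so bytes that are not valid in the assumed source encoding raise an error instead of being silently corrupted.

```python
def reencode(data, source="gbk", target="utf-8"):
    """Convert bytes from the source encoding to the target encoding."""
    # Strict decoding: invalid input raises UnicodeDecodeError.
    return data.decode(source).encode(target)

gbk_bytes = "你好, 世界".encode("gbk")
utf8_bytes = reencode(gbk_bytes)

print(len(gbk_bytes), len(utf8_bytes))
print(utf8_bytes.decode("utf-8"))
```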
Use Unicode Character Set
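To prevent compatibility issues between different character sets, using a Unicode character set is a good choice; for new databases Oracle recommends AL32UTF8. A Unicode database can store Chinese alongside any other script, but the client character set (NLS_LANG) must still match the encoding the application actually uses, otherwise garbled text can still occur during conversion.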
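The advantage of Unicode can be illustrated by checking which character sets are able to represent a given string. The sketch below is in Python; the mapping from Oracle character set names to Python codecs is an assumption made for illustration (the codecs are close but not guaranteed identical to Oracle's implementations), and can_represent is a hypothetical helper.

```python
# Assumed mapping from Oracle character set names to the closest Python codecs.
ORACLE_TO_PYTHON = {
    "US7ASCII": "ascii",
    "ZHS16GBK": "gbk",
    "AL32UTF8": "utf-8",
}

def can_represent(text, oracle_charset):
    """True if every character of text can be encoded in the character set."""
    codec = ORACLE_TO_PYTHON[oracle_charset]
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

sample = "你好, 世界 😀"  # Chinese text plus a character outside GBK
for name in ORACLE_TO_PYTHON:
    print(name, can_represent(sample, name))
```

Only the Unicode character set accepts the whole sample: the ASCII set cannot encode the Chinese characters, and GBK cannot encode the emoji.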
By following these practices, you can avoid and resolve garbled text issues when working with Chinese characters in Oracle databases.
Reference: https://segmentfault.com/a/1190000044984948