translation: Add the translation of the data structure chapter (#1007)

* Add the translation of the data structure chapter. Synchronize the headings in mkdocs-en.yml
* Fix a typo

Showing 14 changed files with 477 additions and 44 deletions.

@@ -0,0 +1,164 @@

# Fundamental Data Types

When we think of data in computers, we imagine various forms such as text, images, videos, voice, and 3D models. Despite their different organizational forms, they are all composed of various fundamental data types.

**Fundamental data types are those that the CPU can operate on directly**, and they are used directly in algorithms. They mainly include the following.

- Integer types: `byte`, `short`, `int`, `long`.
- Floating-point types: `float`, `double`, used to represent decimals.
- Character type: `char`, used to represent letters, punctuation, and even emojis in various languages.
- Boolean type: `bool`, used for "yes" or "no" decisions.

**Fundamental data types are stored in computers in binary form**. One binary digit equals 1 bit. In most modern operating systems, 1 byte consists of 8 bits.

The range of values a fundamental data type can represent depends on the amount of space it occupies. Below, we take Java as an example; a short sketch follows the list.

- The integer type `byte` occupies 1 byte = 8 bits and can represent \(2^8\) numbers.
- The integer type `int` occupies 4 bytes = 32 bits and can represent \(2^{32}\) numbers.
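
As a quick sanity check of the two points above, here is a minimal Python sketch. The signed ranges in the comments assume two's complement, which is what Java uses:

```python
# Number of representable values and signed (two's complement) range for n bits
def signed_range(n_bits: int) -> tuple[int, int]:
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

print(2 ** 8, signed_range(8))    # 256 (-128, 127) -> byte
print(2 ** 32, signed_range(32))  # 4294967296 (-2147483648, 2147483647) -> int
```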

The following table lists the space occupied, value range, and default value of each fundamental data type in Java. You do not need to memorize this table; a rough understanding is enough, and you can refer back to it when needed.

<p align="center"> Table <id> Space Occupied and Value Range of Fundamental Data Types </p>

| Type    | Symbol   | Space Occupied | Minimum Value              | Maximum Value             | Default Value    |
| ------- | -------- | -------------- | -------------------------- | ------------------------- | ---------------- |
| Integer | `byte`   | 1 byte         | \(-2^7\) (\(-128\))        | \(2^7 - 1\) (\(127\))     | 0                |
|         | `short`  | 2 bytes        | \(-2^{15}\)                | \(2^{15} - 1\)            | 0                |
|         | `int`    | 4 bytes        | \(-2^{31}\)                | \(2^{31} - 1\)            | 0                |
|         | `long`   | 8 bytes        | \(-2^{63}\)                | \(2^{63} - 1\)            | 0                |
| Float   | `float`  | 4 bytes        | \(1.175 \times 10^{-38}\)  | \(3.403 \times 10^{38}\)  | \(0.0\text{f}\)  |
|         | `double` | 8 bytes        | \(2.225 \times 10^{-308}\) | \(1.798 \times 10^{308}\) | 0.0              |
| Char    | `char`   | 2 bytes        | 0                          | \(2^{16} - 1\)            | 0                |
| Boolean | `bool`   | 1 byte         | \(\text{false}\)           | \(\text{true}\)           | \(\text{false}\) |

Please note that the above table is specific to Java's fundamental data types. Each programming language has its own data type definitions, and their space occupied, value ranges, and default values may differ.

- In Python, the integer type `int` can be of any size, limited only by available memory (see the sketch after this list); the floating-point type `float` is a 64-bit double-precision number; there is no `char` type, as a single character is actually a string `str` of length 1.
- C and C++ do not specify the exact size of the fundamental data types, which varies with the implementation and platform. The above table follows the LP64 [data model](https://en.cppreference.com/w/cpp/language/types#Properties), used by 64-bit Unix-like operating systems, including Linux and macOS.
- The size of `char` in C and C++ is 1 byte, while in most other programming languages it depends on the character encoding scheme, as detailed in the "Character Encoding" chapter.
- Even though representing a boolean value only requires 1 bit (0 or 1), it is usually stored in memory as 1 byte. This is because modern CPUs typically use 1 byte as the smallest addressable memory unit.
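
To illustrate the first point, a minimal sketch of Python's arbitrary-precision `int` (the values in the comments are what CPython 3 prints):

```python
# Python's int is not fixed-width: it grows as needed, limited only by memory
x = 2 ** 100
print(x)               # 1267650600228229401496703205376
print(x.bit_length())  # 101, well beyond a 64-bit integer
```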

So, what is the connection between fundamental data types and data structures? We know that data structures are ways to organize and store data in computers. The focus here is on "structure" rather than "data".

If we want to represent "a row of numbers", we naturally think of using an array, because the linear structure of an array can represent the adjacency and order of the numbers. Whether the stored content is an integer `int`, a decimal `float`, or a character `char` is irrelevant to the "data structure".

In other words, **fundamental data types provide the "content type" of data, while data structures provide the "way of organizing" data**. For example, in the following code, we use the same data structure (an array) to store and represent different fundamental data types, including `int`, `float`, `char`, `bool`, etc.
=== "Python" | ||
|
||
```python title="" | ||
# Using various fundamental data types to initialize arrays | ||
numbers: list[int] = [0] * 5 | ||
decimals: list[float] = [0.0] * 5 | ||
# Python's characters are actually strings of length 1 | ||
characters: list[str] = ['0'] * 5 | ||
bools: list[bool] = [False] * 5 | ||
# Python's lists can freely store various fundamental data types and object references | ||
data = [0, 0.0, 'a', False, ListNode(0)] | ||
``` | ||
|
||
=== "C++" | ||
|
||
```cpp title="" | ||
// Using various fundamental data types to initialize arrays | ||
int numbers[5]; | ||
float decimals[5]; | ||
char characters[5]; | ||
bool bools[5]; | ||
``` | ||
|
||
=== "Java" | ||
|
||
```java title="" | ||
// Using various fundamental data types to initialize arrays | ||
int[] numbers = new int[5]; | ||
float[] decimals = new float[5]; | ||
char[] characters = new char[5]; | ||
boolean[] bools = new boolean[5]; | ||
``` | ||
|
||
=== "C#" | ||
|
||
```csharp title="" | ||
// Using various fundamental data types to initialize arrays | ||
int[] numbers = new int[5]; | ||
float[] decimals = new float[5]; | ||
char[] characters = new char[5]; | ||
bool[] bools = new bool[5]; | ||
``` | ||
|
||
=== "Go" | ||
|
||
```go title="" | ||
// Using various fundamental data types to initialize arrays | ||
var numbers = [5]int{} | ||
var decimals = [5]float64{} | ||
var characters = [5]byte{} | ||
var bools = [5]bool{} | ||
``` | ||
|
||
=== "Swift" | ||
|
||
```swift title="" | ||
// Using various fundamental data types to initialize arrays | ||
let numbers = Array(repeating: Int(), count: 5) | ||
let decimals = Array(repeating: Double(), count: 5) | ||
let characters = Array(repeating: Character("a"), count: 5) | ||
let bools = Array(repeating: Bool(), count: 5) | ||
``` | ||
|
||
=== "JS" | ||
|
||
```javascript title="" | ||
// JavaScript's arrays can freely store various fundamental data types and objects | ||
const array = [0, 0.0, 'a', false]; | ||
``` | ||
|
||
=== "TS" | ||
|
||
```typescript title="" | ||
// Using various fundamental data types to initialize arrays | ||
const numbers: number[] = []; | ||
const characters: string[] = []; | ||
const bools: boolean[] = []; | ||
``` | ||
|
||
=== "Dart" | ||
|
||
```dart title="" | ||
// Using various fundamental data types to initialize arrays | ||
List<int> numbers = List.filled(5, 0); | ||
List<double> decimals = List.filled(5, 0.0); | ||
List<String> characters = List.filled(5, 'a'); | ||
List<bool> bools = List.filled(5, false); | ||
``` | ||
|
||
=== "Rust" | ||
|
||
```rust title="" | ||
// Using various fundamental data types to initialize arrays | ||
let numbers: Vec<i32> = vec![0; 5]; | ||
let decimals: Vec<f32> = vec![0.0, 5]; | ||
let characters: Vec<char> = vec!['0'; 5]; | ||
let bools: Vec<bool> = vec![false; 5]; | ||
``` | ||

=== "C"

    ```c title=""
    // Using various fundamental data types to initialize arrays
    int numbers[10];
    float decimals[10];
    char characters[10];
    bool bools[10];
    ```

=== "Zig"

    ```zig title=""
    // Using various fundamental data types to initialize arrays
    var numbers: [5]i32 = undefined;
    var decimals: [5]f32 = undefined;
    var characters: [5]u8 = undefined;
    var bools: [5]bool = undefined;
    ```

Binary file added (+60.2 KB): docs-en/chapter_data_structure/character_encoding.assets/ascii_table.png

Binary file added (+20.3 KB): docs-en/chapter_data_structure/character_encoding.assets/unicode_hello_algo.png

Binary file added (+26.4 KB): docs-en/chapter_data_structure/character_encoding.assets/utf-8_hello_algo.png

@@ -0,0 +1,87 @@

# Character Encoding *

In computers, all data is stored in binary form, and the character type `char` is no exception. To represent characters, we need to establish a "character set" that defines a one-to-one correspondence between characters and binary numbers. With a character set, a computer can convert binary numbers into characters by looking them up in a table.

## ASCII Character Set

"ASCII" is one of the earliest character sets, officially known as the American Standard Code for Information Interchange. It uses 7 binary digits (the lower 7 bits of a byte) to represent a character, allowing for a maximum of 128 different characters. As shown in the figure below, ASCII includes the uppercase and lowercase English letters, the digits 0 ~ 9, some punctuation marks, and some control characters (such as newline and tab).

![ASCII Code](character_encoding.assets/ascii_table.png)
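
As a small illustration, a Python sketch of the character-to-number mapping (assuming Python 3, where `ord` and `chr` return Unicode code points, which coincide with ASCII codes for these characters):

```python
# ASCII maps each of these characters to a number in the range 0-127 (7 bits)
print(ord("A"), ord("a"), ord("0"))  # 65 97 48
print(chr(72), chr(105))             # H i
print("A".encode("ascii"))           # b'A', a single byte
```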

However, **ASCII can only represent English characters**. With the globalization of computing, a character set called "EASCII" was developed to represent more languages. It extends ASCII from 7 bits to 8 bits, enabling the representation of 256 different characters.

Around the world, a series of region-specific EASCII character sets emerged. The first 128 characters of these sets are always ASCII, while the remaining 128 characters are defined differently to cater to the requirements of various languages.

## GBK Character Set

Later, it was found that **EASCII still could not meet the character requirements of many languages**. For instance, there are nearly a hundred thousand Chinese characters, with several thousand used in everyday life. In 1980, China's National Standards Bureau released the "GB2312" character set, which included 6763 Chinese characters, essentially meeting the computer processing needs for Chinese.

However, GB2312 could not handle some rare and traditional Chinese characters. The "GBK" character set, an extension of GB2312, includes a total of 21886 Chinese characters. In the GBK encoding scheme, ASCII characters are represented with one byte, while Chinese characters use two bytes.
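
These byte counts can be checked with a minimal sketch, assuming Python 3 and its built-in `gbk` codec:

```python
# Under GBK, ASCII characters occupy 1 byte and Chinese characters occupy 2 bytes
print(len("A".encode("gbk")))     # 1
print(len("你".encode("gbk")))    # 2
print(len("你好".encode("gbk")))  # 4
```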

## Unicode Character Set

With the rapid development of computer technology and a plethora of character sets and encoding standards, numerous problems arose. On the one hand, these character sets generally only defined characters for specific languages and could not function properly in multilingual environments. On the other hand, the existence of multiple character set standards for the same language caused garbled text when information was exchanged between computers using different encoding standards.

Researchers of that era wondered: **what if we introduced a comprehensive character set covering all the world's languages and symbols? Wouldn't that solve the problems of cross-language environments and garbled text?** Driven by this idea, the extensive character set Unicode was born.

The Chinese name for "Unicode" is "统一码" (Unified Code). It can theoretically accommodate over a million characters. Unicode aims to incorporate characters from all over the world into a single set, providing a universal character set for processing and displaying various languages and reducing the garbled-text issues caused by differing encoding standards.

Since its release in 1991, Unicode has been continually expanded to include new languages and characters. As of September 2022, Unicode contains 149,186 characters, including characters, symbols, and even emojis from various languages. In the vast Unicode character set, commonly used characters occupy 2 bytes, while some rare characters take up 3 or even 4 bytes.

Unicode is a universal character set that assigns a number (called a "code point") to each character, **but it does not specify how these code points should be stored in a computer**. One might ask: when code points of varying lengths appear in a text, how does the system tell the characters apart? For example, given a 2-byte code, how does the system determine whether it represents a single 2-byte character or two 1-byte characters?
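
To see why the question arises, a Python sketch of code points of different magnitudes (assuming Python 3, where `ord` returns the Unicode code point):

```python
# Code points vary in magnitude, so they need different numbers of bytes
print(hex(ord("H")))    # 0x48    -> fits in 1 byte
print(hex(ord("算")))   # 0x7b97  -> needs 2 bytes
print(hex(ord("😄")))   # 0x1f604 -> needs 3 bytes (beyond the Basic Multilingual Plane)
```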

A straightforward solution to this problem is to store all characters with an equal-length encoding. As shown in the figure below, each character in "Hello" occupies 1 byte, while each character in "算法" (algorithm) occupies 2 bytes. We could encode every character of "Hello 算法" as 2 bytes by padding the higher bits with zeros. This way, the system can parse one character every 2 bytes and recover the content of the phrase.

![Unicode Encoding Example](character_encoding.assets/unicode_hello_algo.png)

However, as ASCII has shown us, encoding English requires only 1 byte. Using the above approach would double the space occupied by English text compared to ASCII encoding, which wastes memory. Therefore, a more efficient Unicode encoding method is needed.

## UTF-8 Encoding

Currently, UTF-8 has become the most widely used Unicode encoding method internationally. **It is a variable-length encoding**, using 1 to 4 bytes to represent a character depending on the character's complexity. ASCII characters need only 1 byte, Latin and Greek letters require 2 bytes, commonly used Chinese characters need 3 bytes, and some other rare characters need 4 bytes.

The encoding rules of UTF-8 are not complex and can be divided into two cases (a short sketch follows the list):

- For 1-byte characters, set the highest bit to $0$ and fill the remaining 7 bits with the Unicode code point. Notably, ASCII characters occupy the first 128 code points of the Unicode set, which means that **UTF-8 is backward compatible with ASCII** and can be used to parse legacy ASCII text.
- For a character of length $n$ bytes (where $n > 1$), set the highest $n$ bits of the first byte to $1$ and the $(n + 1)^{\text{th}}$ bit to $0$; from the second byte onward, set the highest 2 bits of each byte to $10$; the remaining bits are filled with the Unicode code point.
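
These rules can be checked with a short Python sketch (assuming Python 3; the byte values in the comments are what CPython produces):

```python
# UTF-8 is variable-length: these ASCII characters take 1 byte each,
# and these common Chinese characters take 3 bytes each
print(len("Hello算法".encode("utf-8")))  # 11 = 5 * 1 + 2 * 3
print("算".encode("utf-8"))              # b'\xe7\xae\x97'
# The first byte 0xe7 = 11100111 starts with three 1s (a 3-byte character);
# the continuation bytes 0xae = 10101110 and 0x97 = 10010111 both start with 10
print(f"{0xe7:08b} {0xae:08b} {0x97:08b}")
```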

The figure below shows the UTF-8 encoding of "Hello算法". It can be observed that since the highest $n$ bits of the first byte are set to $1$, the system can determine a character's length $n$ by counting the number of leading $1$ bits.

But why set the highest 2 bits of the remaining bytes to $10$? Actually, this $10$ serves as a kind of check. If the system starts parsing the text from an incorrect byte, the $10$ at the beginning of a continuation byte helps the system quickly detect the anomaly.

The reason $10$ can act as a check is that, under the UTF-8 encoding rules, the first byte of a character can never start with $10$. This can be proven by contradiction: if the first byte of a character started with $10$, it would indicate that the character's length is $1$, i.e., an ASCII character. However, the highest bit of an ASCII character must be $0$, which contradicts the assumption.

![UTF-8 Encoding Example](character_encoding.assets/utf-8_hello_algo.png)

Apart from UTF-8, other common encoding methods include:

- **UTF-16 encoding**: uses 2 or 4 bytes to represent a character. All ASCII characters and many commonly used non-English characters are represented with 2 bytes; a few characters require 4 bytes. For 2-byte characters, the UTF-16 encoding equals the Unicode code point.
- **UTF-32 encoding**: every character uses 4 bytes. This means UTF-32 occupies more space than UTF-8 and UTF-16, especially for text with a high proportion of ASCII characters.

From the perspective of storage space, UTF-8 is highly efficient for representing English characters, requiring only 1 byte per character; UTF-16 may be more efficient for encoding some non-English characters (such as Chinese), as it requires only 2 bytes, while UTF-8 may need 3 bytes.
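
A rough comparison of the three encodings in Python (the `-le` codec variants are used so that no byte-order mark is added; the counts in the comments are from CPython 3):

```python
# Encoded byte counts for an English word and a Chinese word
for text in ["Hello", "算法"]:
    print(text,
          len(text.encode("utf-8")),      # Hello: 5   算法: 6
          len(text.encode("utf-16-le")),  # Hello: 10  算法: 4
          len(text.encode("utf-32-le")))  # Hello: 20  算法: 8
```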

From a compatibility standpoint, UTF-8 is the most versatile, and many tools and libraries support UTF-8 as a priority.

## Character Encoding in Programming Languages

In many classic programming languages, strings are held in memory during program execution using a fixed-length encoding such as UTF-16 or UTF-32. This allows strings to be treated as arrays, which offers several advantages:

- **Random access**: strings encoded in UTF-16 can be accessed randomly with ease. With UTF-8, which is a variable-length encoding, locating the $i^{th}$ character requires traversing the string from the start to the $i^{th}$ position, taking $O(n)$ time.
- **Character counting**: similar to random access, counting the number of characters in a UTF-16 encoded string is an $O(1)$ operation. However, counting the characters in a UTF-8 encoded string requires traversing the entire string.
- **String operations**: many string operations, such as splitting, concatenating, inserting, and deleting, are easier on UTF-16 encoded strings. On UTF-8 encoded strings, these operations generally require extra computation to ensure the result is still valid UTF-8.

The design of character encoding schemes in programming languages is an interesting topic that involves various factors:

- Java's `String` type uses UTF-16 encoding, with each character occupying 2 bytes. This was based on the initial belief that 16 bits would be sufficient to represent all possible characters, a judgment later proven incorrect. As the Unicode standard expanded beyond 16 bits, a character in Java may now be represented by a pair of 16-bit values, known as a "surrogate pair".
- JavaScript and TypeScript use UTF-16 encoding for reasons similar to Java's. When JavaScript was first introduced by Netscape in 1995, Unicode was still in its early stages, and 16-bit encoding was sufficient to represent all Unicode characters at the time.
- C# uses UTF-16 encoding, largely because the .NET platform was designed by Microsoft, and many Microsoft technologies, including the Windows operating system, make extensive use of UTF-16.

Due to this underestimation of the number of characters, these languages had to resort to "surrogate pairs" to represent Unicode characters beyond 16 bits. This approach has drawbacks: strings containing surrogate pairs may have characters occupying 2 or 4 bytes, which loses the advantage of fixed-length encoding, and handling surrogate pairs adds complexity and debugging difficulty to programming.
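
A small Python sketch of what a surrogate pair looks like. Python itself stores one code point per character, so the pair only appears once the string is encoded as UTF-16 (the surrogate values in the comments assume the emoji U+1F604):

```python
import struct

s = "😄"                                   # U+1F604, outside the Basic Multilingual Plane
print(len(s))                              # 1 code point in Python
units = s.encode("utf-16-le")
print(len(units) // 2)                     # 2 UTF-16 code units, i.e. a surrogate pair
print([hex(u) for u in struct.unpack("<2H", units)])  # ['0xd83d', '0xde04']
```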

For these reasons, some programming languages have adopted different encoding schemes:

- Python's `str` type uses Unicode with a flexible representation, in which the per-character storage depends on the largest Unicode code point in the string: if all characters are ASCII, each occupies 1 byte; if any character exceeds ASCII but stays within the Basic Multilingual Plane (BMP), each occupies 2 bytes; if any character exceeds the BMP, each occupies 4 bytes (see the sketch after this list).
- Go's `string` type uses UTF-8 encoding internally. Go also provides the `rune` type for representing individual Unicode code points.
- Rust's `str` and `String` types use UTF-8 encoding internally. Rust also offers the `char` type for individual Unicode code points.
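
A rough illustration of the first point under CPython (PEP 393 describes this flexible representation; the exact byte counts vary by version, so only the trend matters):

```python
import sys

# Per-character storage grows with the largest code point in the string:
# each of these strings has 5 characters, but the memory footprint increases
for s in ["hello", "算法算法算", "😄😄😄😄😄"]:
    print(len(s), sys.getsizeof(s))
```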

It's important to note that the above discussion concerns how strings are stored in programming languages, **which is a different issue from how strings are stored in files or transmitted over networks**. For file storage and network transmission, strings are usually encoded in UTF-8 for optimal compatibility and space efficiency.