Primitive Type char
A character type.
The char type represents a single character. More specifically, since
‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode
scalar value’.
This documentation describes a number of methods and trait implementations on the
char type. For technical reasons, there is additional, separate
documentation in the std::char module as well.
§Validity and Layout
A char is a ‘Unicode scalar value’, which is any ‘Unicode code point’
other than a surrogate code point. This has a fixed numerical definition:
code points are in the range 0 to 0x10FFFF, inclusive.
Surrogate code points, used by UTF-16, are in the range 0xD800 to 0xDFFF.
No char may be constructed, whether as a literal or at runtime, that is not a
Unicode scalar value. Violating this rule causes undefined behavior.
Unicode scalar values are also the exact set of values that may be encoded in UTF-8. Because
char values are Unicode scalar values and functions may assume incoming str values are
valid UTF-8, it is safe to store any char in a str or read
any character from a str as a char.
The gap in valid char values is understood by the compiler, so in the
below example the two ranges are understood to cover the whole range of
possible char values and there is no error for a non-exhaustive match.
All Unicode scalar values are valid char values, but not all of them represent a real
character. Many Unicode scalar values are not currently assigned to a character, but may be in
the future (“reserved”); some will never be a character (“noncharacters”); and some may be given
different meanings by different users (“private use”).
char is guaranteed to have the same size, alignment, and function call ABI as u32 on all
platforms.
§Representation
char is always four bytes in size. This is a different representation than
a given character would have as part of a String. For example:
```rust
let v = vec!['h', 'e', 'l', 'l', 'o'];

// five elements times four bytes for each element
assert_eq!(20, v.len() * size_of::<char>());

let s = String::from("hello");

// five elements times one byte per element
assert_eq!(5, s.len() * size_of::<u8>());
```

As always, remember that a human intuition for ‘character’ might not map to Unicode’s definitions. For example, despite looking similar, the ‘é’ character is one Unicode code point while ‘é’ is two Unicode code points:
```rust
let mut chars = "é".chars();
// U+00e9: 'latin small letter e with acute'
assert_eq!(Some('\u{00e9}'), chars.next());
assert_eq!(None, chars.next());

let mut chars = "é".chars();
// U+0065: 'latin small letter e'
assert_eq!(Some('\u{0065}'), chars.next());
// U+0301: 'combining acute accent'
assert_eq!(Some('\u{0301}'), chars.next());
assert_eq!(None, chars.next());
```

This means that the contents of the first string above will fit into a
char while the contents of the second string will not. Trying to create
a char literal with the contents of the second string gives an error:
```text
error: character literal may only contain one codepoint: 'é'
let c = 'é';
        ^^^
```

Another implication of the 4-byte fixed size of a char is that
per-char processing can end up using a lot more memory:
§Implementations

impl char
1.83.0 · pub const MIN: char = '\0'
The lowest valid code point a char can have, '\0'.
Unlike integer types, char actually has a gap in the middle,
meaning that the range of possible chars is smaller than you
might expect. Ranges of char will automatically hop this gap
for you:
```rust
let dist = u32::from(char::MAX) - u32::from(char::MIN);
let size = (char::MIN..=char::MAX).count() as u32;
assert!(size < dist);
```

Despite this gap, the MIN and MAX values can be used as bounds for
all char values.
§Examples
1.52.0 · pub const MAX: char = '\u{10ffff}'
The highest valid code point a char can have, '\u{10FFFF}'.
Unlike integer types, char actually has a gap in the middle,
meaning that the range of possible chars is smaller than you
might expect. Ranges of char will automatically hop this gap
for you:
```rust
let dist = u32::from(char::MAX) - u32::from(char::MIN);
let size = (char::MIN..=char::MAX).count() as u32;
assert!(size < dist);
```

Despite this gap, the MIN and MAX values can be used as bounds for
all char values.
§Examples
1.93.0 · pub const MAX_LEN_UTF8: usize = 4

The maximum number of bytes required to encode a char in UTF-8.
1.93.0 · pub const MAX_LEN_UTF16: usize = 2

The maximum number of 16-bit code units required to encode a char in UTF-16.
1.52.0 · pub const REPLACEMENT_CHARACTER: char = '�'
U+FFFD REPLACEMENT CHARACTER (�) is used in Unicode to represent a
decoding error.
It can occur, for example, when giving ill-formed UTF-8 bytes to
String::from_utf8_lossy.
1.52.0 (const: 1.81.0) · pub const unsafe fn from_u32_unchecked(i: u32) -> char
Converts a u32 to a char, ignoring validity.
Note that all chars are valid u32s, and can be cast to one with
as:
However, the reverse is not true: not all valid u32s are valid
chars. from_u32_unchecked() will ignore this, and blindly cast to
char, possibly creating an invalid one.
§Safety
This function is unsafe, as it may construct invalid char values.
For a safe version of this function, see the from_u32 function.
§Examples
Basic usage:
1.0.0 (const: 1.67.0) · pub const fn to_digit(self, radix: u32) -> Option<u32>
Converts a char to a digit in the given radix.
A ‘radix’ here is sometimes also called a ‘base’. To give some common values: a radix of two indicates a binary number; a radix of ten, decimal; and a radix of sixteen, hexadecimal. Any radix from 2 to 36 is supported.
‘Digit’ is defined to be only the following characters:
0-9a-zA-Z
§Errors
Returns None if the char does not refer to a digit in the given radix.
§Panics
Panics if given a radix smaller than 2 or larger than 36.
§Examples
Basic usage:
Passing a non-digit results in failure:
Passing a large radix, causing a panic:
Passing a small radix, causing a panic:
1.0.0 (const: 1.52.0) · pub const fn len_utf8(self) -> usize
Returns the number of bytes this char would need if encoded in UTF-8.
That number of bytes is always between 1 and 4, inclusive.
§Examples
Basic usage:
```rust
let len = 'A'.len_utf8();
assert_eq!(len, 1);

let len = 'ß'.len_utf8();
assert_eq!(len, 2);

let len = 'ℝ'.len_utf8();
assert_eq!(len, 3);

let len = '💣'.len_utf8();
assert_eq!(len, 4);
```

The &str type guarantees that its contents are UTF-8, and so we can compare the length it
would take if each code point was represented as a char vs in the &str itself:
```rust
// as chars
let eastern = '東';
let capital = '京';

// both can be represented as three bytes
assert_eq!(3, eastern.len_utf8());
assert_eq!(3, capital.len_utf8());

// as a &str, these two are encoded in UTF-8
let tokyo = "東京";

let len = eastern.len_utf8() + capital.len_utf8();

// we can see that they take six bytes total...
assert_eq!(6, tokyo.len());

// ... just like the &str
assert_eq!(len, tokyo.len());
```