Input format
This chapter describes how a source file is interpreted as a sequence of tokens.
See Crates and source files for a description of how programs are organised into files.
Source encoding
Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
It is an error if the file is not valid UTF-8.
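As an illustration of the encoding rule, the following is a minimal sketch of how a tool might reject a file that is not valid UTF-8. The function name decode_source and the error message text are hypothetical and are not taken from any compiler; only the standard library function std::str::from_utf8 is assumed.

```rust
/// Interpret raw file bytes as UTF-8 source text.
/// `decode_source` is a hypothetical helper name used for illustration only.
fn decode_source(bytes: &[u8]) -> Result<&str, String> {
    std::str::from_utf8(bytes).map_err(|e| {
        // `valid_up_to` reports how many leading bytes were valid UTF-8.
        format!("source is not valid UTF-8 (error at byte {})", e.valid_up_to())
    })
}
```

A caller would typically read the file as bytes and report the error rather than attempting a lossy conversion.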
Byte order mark removal
If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
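The rule can be sketched in a few lines. The helper name strip_bom is an assumption made for this example; the sketch removes at most one leading U+FEFF and leaves any later occurrence untouched.

```rust
/// Remove a single leading U+FEFF (BYTE ORDER MARK), if present.
/// `strip_bom` is a hypothetical helper name used for illustration only.
fn strip_bom(src: &str) -> &str {
    // `strip_prefix` returns `None` when the prefix is absent,
    // in which case the input is returned unchanged.
    src.strip_prefix('\u{FEFF}').unwrap_or(src)
}
```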
CRLF normalization
Each pair of characters U+000D (CR) immediately followed by U+000A (LF) is replaced by a single U+000A (LF).
Other occurrences of the character U+000D (CR) are left in place (they are treated as whitespace).
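A minimal sketch of this step, assuming a hypothetical helper named normalize_crlf: each CR LF pair becomes a single LF, and a CR that is not immediately followed by LF is kept.

```rust
/// Replace every CR LF pair with a single LF.
/// `normalize_crlf` is a hypothetical helper name used for illustration only.
fn normalize_crlf(src: &str) -> String {
    // `str::replace` scans left to right and only matches the two-character
    // sequence, so a lone CR is copied through unchanged.
    src.replace("\r\n", "\n")
}
```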
Shebang removal
If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A (LF) are removed from the sequence.
For example, the first line of the following file would be ignored:
#!/usr/bin/env rustx
fn main() {
    println!("Hello!");
}
As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ token, nothing is removed.
This prevents an inner attribute at the start of a source file from being removed.
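The rule and its exception can be sketched as follows. This is a simplified illustration, not the compiler's implementation: the names strip_shebang and skip_trivia are hypothetical, whitespace is approximated with str::trim_start rather than the exact whitespace set used by the lexer, and all comments (including doc comments) are skipped alike.

```rust
/// Skip leading whitespace and comments, returning the remaining text.
/// Hypothetical helper; whitespace handling is an approximation.
fn skip_trivia(mut s: &str) -> &str {
    loop {
        let trimmed = s.trim_start();
        if let Some(rest) = trimmed.strip_prefix("//") {
            // Line comment: runs to the end of the line.
            s = match rest.find('\n') {
                Some(i) => &rest[i + 1..],
                None => "",
            };
        } else if let Some(rest) = trimmed.strip_prefix("/*") {
            // Block comment: track nesting depth until it closes.
            let bytes = rest.as_bytes();
            let (mut depth, mut i) = (1usize, 0usize);
            while depth > 0 && i < bytes.len() {
                if bytes[i..].starts_with(b"/*") {
                    depth += 1;
                    i += 2;
                } else if bytes[i..].starts_with(b"*/") {
                    depth -= 1;
                    i += 2;
                } else {
                    i += 1;
                }
            }
            s = &rest[i..];
        } else {
            return trimmed;
        }
    }
}

/// Remove a leading shebang line unless `#!` starts an inner attribute.
/// Hypothetical helper name used for illustration only.
fn strip_shebang(src: &str) -> &str {
    let Some(after) = src.strip_prefix("#!") else {
        return src; // no `#!` at the start: nothing to do
    };
    if skip_trivia(after).starts_with('[') {
        return src; // `#![...]` is an inner attribute: keep it
    }
    // Drop everything up to and including the first LF.
    match src.find('\n') {
        Some(i) => &src[i + 1..],
        None => "",
    }
}
```

Under these assumptions, a file starting with #!/usr/bin/env rustx loses its first line, while a file starting with #![allow(unused)] is returned unchanged.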
Note: The standard library include! macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The include_str! and include_bytes! macros do not.
Tokenization
The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.