— Compilers and Runtimes, Rust — 6 min read
Rust is a system programming language developed by Mozilla. The language boasts the performance of C/C++ with the safety and ease of use of languages like Python. Rust achieves this by enforcing a strong type system and static analysis of the source code.
This article provides an overview of the compilation process and structure of Rust compiler. The main focus will be on the Mid-level Intermediate Representation (MIR) and understanding it.
The core tool behind Rust which helps statically ensure that the code is memory and concurrency safe is its borrow checker. It ensures the following properties in the source code :
Note: When we say move we are referring to the move semantics. Only memory allocated on the heap is moved.
A simple example explaining move semantics:
1let mut x = vec![1,2,3] // initializes a vector. Vectors are allocated on heap.2let mut y = x; // x moved into y.3y.push(5); // works fine4x.push(5); // compiler throws an error saying "borrow of moved value: `x`"
As mentioned earlier; this article does not focus on how safety is guaranteed in Rust, rather on the mechanism which helps make the guarantees.
More details about the borrow checker here.
All programs that are compiled in Rust are safe, but that doesn't imply all safe programs are accepted by Rust. There are times when the programmer knows that the program is safe and won't lead to any undefined behaviors, but the Rust compiler may not be able to guarantee that. This is where Unsafe blocks come into the picture.
Unsafe blocks gives programmers some extra functionalities, these code snippets have to be enclosed in unsafe blocks. Unsafe blocks allow developers to :
The Unsafe blocks do not bypass the borrow checker though. Bare metal programming (Embedded systems) generally use Unsafe blocks. The Standard library of Rust uses Unsafe blocks extensively. Though it might seem that the safety guarantees are compromised, It is not so. Unsafe does not imply Undefined behavior, it just means that in that section of code the responsibility of safety is on the developer and the compiler cannot make any strong guarantees.
The process of compilation can be understood in two phases:
Parsing phase is the phase where the source code is analyzed for errors and is converted to a simplified form so that it can be translated to machine code easily. All the safety checks are done in this phase.
The Parsing phase consists of Rust Source, HIR and MIR.
MIR makes borrow checking easier, it helps have faster compile times and have better execution times. MIR enables more flexible borrowing rules than earlier which makes writing Rust code much easier.
MIR representation simplifies the code and makes it into a very primitive form to make the analysis easier. It is not recommended to use the MIR output of a code snippet to develop tools for analysis and profiling directly. MIR parsing can be done by accessing the Rust compiler internals in which the MIR is represented in data structures.
If you wish to view the MIR output locally on your system, run the following on the command-line :
1rustc main.rs --emit=mir # where main.rs is the code file
An easier way is to write code on Rust Playground and check the MIR output right there (Option to view MIR output is available in the dropdown menu next to the Run button).
The MIR output actually corresponds to a control flow graph. It is very similar to a flow chart. It helps understand program flow and
Before looking into MIR, Let's look into some definitions which help understand it.
Let's take a look at the MIR for a few code samples. Note that we won't be looking at the whole MIR translation but only the relevant parts. To view the MIR output, copy paste the code on Rust playground and view it.
The variables in the MIR representation do not have any names, as mentioned above they are represented as _x where x is any positive integer.
MIR output generally has comments which make it mostly self-explanatory, but we shall go through a few examples to get a basic understanding of MIR.
1fn main() {2 let mut x = 5;3 x = x + 10;4}
The MIR representation :
1fn main() -> () {2 let mut _0: (); // return place in scope 0 at main.rs:1:11: 1:113 let mut _1: i32; // "x" in scope 0 at main.rs:2:9: 2:144 let mut _2: i32; // in scope 0 at main.rs:3:9: 3:105 let mut _3: (i32, bool); // in scope 0 at main.rs:3:9: 3:1567 bb0: {8 StorageLive(_1); // bb0[0]: scope 0 at main.rs:2:9: 2:149 _1 = const 5i32; // bb0[1]: scope 0 at main.rs:2:17: 2:181011 StorageLive(_2); // bb0[2]: scope 1 at main.rs:3:9: 3:1012 _2 = _1; // bb0[3]: scope 1 at main.rs:3:9: 3:1013 _3 = CheckedAdd(move _2, const 10i32); // bb0[4]: scope 1 at main.rs:3:9: 3:151415 assert(!move (_3.1: bool), "attempt to add with overflow") -> bb1; // bb0[5]: scope 1 at main.rs:3:9: 3:1516 }1718 bb1: {19 _1 = move (_3.0: i32); // bb1[0]: scope 1 at main.rs:3:5: 3:1520 StorageDead(_2); // bb1[1]: scope 1 at main.rs:3:14: 3:1521 StorageDead(_1); // bb1[2]: scope 0 at main.rs:4:1: 4:222 return; // bb1[3]: scope 0 at main.rs:4:2: 4:223 }24}
A few points to note and understand here :
const 5i32
is an Rvalue, it is assigned to _1 (Line 2 in source code).const 10i32
and stores the result in _3.tuple (i32, bool)
. This is to check integer overflow during addition. The CheckedAdd function stores the resultant value in _3.0 (the first field in the tuple), and if integer overflow occurred in _3.1.assert(...)
checks if overflow occurred and prints "attempt to add with overflow" and exits, else jumps to the basic block bb1. Notice that this action is a terminator.1fn main() {2 let mut y = 1;3 // scope 14 {5 let z = 2;6 }7 y = 3;8}
The MIR representation :
1fn main() -> () {2 let mut _0: (); // return place in scope 0 at main.rs:1:11: 1:113 let mut _1: i32; // "y" in scope 0 at main.rs:2:9: 2:144 scope 1 {5 let _2: i32; // "z" in scope 1 at main.rs:4:13: 4:146 scope 2 {7 }8 }910 bb0: {11 StorageLive(_1); // bb0[0]: scope 0 at main.rs:2:9: 2:1412 _1 = const 1i32; // bb0[1]: scope 0 at main.rs:2:17: 2:1813 StorageLive(_2); // bb0[2]: scope 1 at main.rs:4:13: 4:1414 _2 = const 2i32; // bb0[3]: scope 1 at main.rs:4:17: 4:1815 StorageDead(_2); // bb0[4]: scope 1 at main.rs:5:5: 5:616 _1 = const 3i32; // bb0[5]: scope 1 at main.rs:6:5: 6:1017 StorageDead(_1); // bb0[6]: scope 0 at main.rs:7:1: 7:218 return; // bb0[7]: scope 0 at main.rs:7:2: 7:219 }20}
Something important to note is how _2 (z) is killed (as its scope ends there) before the reassignment of _1 (y) to 3 (which is source code is out of scope1).
A very important thing to not is that, as mentioned above statements in MIR are very primitive, thus keywords like if, for etc are not available. These statements are represented using the terminators mentioned above. The terminators depict a branching in the code flow.
1fn main() {2 let x = 5;3 let y;4 5 if x == 5 {6 y = 1;7 } else {8 y = 0;9 }10}
The Mir Representation :
1fn main() -> () {2 let mut _0: (); // return place in scope 0 at main.rs:1:11: 1:113 let mut _1: i32; // "x" in scope 0 at main.rs:2:9: 2:144 let mut _3: bool; // in scope 0 at main.rs:5:8: 5:145 let mut _4: i32; // in scope 0 at main.rs:5:8: 5:967 scope 1 {8 let _2: i32; // "y" in scope 1 at main.rs:3:9: 3:109 scope 2 {10 }11 }1213 bb0: {14 StorageLive(_1); // bb0[0]: scope 0 at main.rs:2:9: 2:1415 _1 = const 1i32; // bb0[1]: scope 0 at main.rs:2:17: 2:1816 StorageLive(_2); // bb0[2]: scope 1 at main.rs:3:9: 3:1017 StorageLive(_3); // bb0[3]: scope 2 at main.rs:5:8: 5:1418 StorageLive(_4); // bb0[4]: scope 2 at main.rs:5:8: 5:919 _4 = _1; // bb0[5]: scope 2 at main.rs:5:8: 5:920 _3 = Eq(move _4, const 1i32); // bb0[6]: scope 2 at main.rs:5:8: 5:1421 StorageDead(_4); // bb0[7]: scope 2 at main.rs:5:13: 5:1422 switchInt(_3) -> [false: bb1, otherwise: bb2]; // bb0[8]: scope 2 at main.rs:5:5: 9:623 }2425 bb1: {26 _2 = const 0i32; // bb1[0]: scope 2 at main.rs:8:9: 8:1427 goto -> bb3; // bb1[1]: scope 2 at main.rs:5:5: 9:628 }2930 bb2: {31 _2 = const 1i32; // bb2[0]: scope 2 at main.rs:6:9: 6:1432 goto -> bb3; // bb2[1]: scope 2 at main.rs:5:5: 9:633 }3435 bb3: {36 StorageDead(_3); // bb3[0]: scope 2 at main.rs:9:5: 9:637 _1 = const 2i32; // bb3[1]: scope 2 at main.rs:11:5: 11:1038 StorageDead(_2); // bb3[2]: scope 1 at main.rs:12:1: 12:239 StorageDead(_1); // bb3[3]: scope 0 at main.rs:12:1: 12:240 return; // bb3[4]: scope 0 at main.rs:12:2: 12:241 }42}
A few points to note and understand here :
_3 = Eq(move _4, const 1i32)
is where the condition is checked.switchInt
leads two 2 branches, if false goto bb1, else if true goto bb2.1fn main() {2 let mut x = 1;3 loop {4 x += 1;5 if x>5 {6 break;7 }8 }9 x = 10;10}
The Mir Representation :
1fn main() -> () {2 let mut _0: (); // return place in scope 0 at main.rs:1:11: 1:113 let mut _1: i32; // "x" in scope 0 at main.rs:2:9: 2:144 let mut _2: (i32, bool); // in scope 0 at main.rs:5:9: 5:155 let mut _3: bool; // in scope 0 at main.rs:7:12: 7:156 let mut _4: i32; // in scope 0 at main.rs:7:12: 7:137 scope 1 {8 }910 bb0: {11 StorageLive(_1); // bb0[0]: scope 0 at main.rs:2:9: 2:1412 _1 = const 1i32; // bb0[1]: scope 0 at main.rs:2:17: 2:1813 goto -> bb1; // bb0[2]: scope 1 at main.rs:4:5: 10:614 }1516 bb1: {17 _2 = CheckedAdd(_1, const 1i32); // bb1[0]: scope 1 at main.rs:5:9: 5:1518 assert(!move (_2.1: bool), "attempt to add with overflow") -> bb2; // bb1[1]: scope 1 at main.rs:5:9: 5:1519 }2021 bb2: {22 _1 = move (_2.0: i32); // bb2[0]: scope 1 at main.rs:5:9: 5:1523 StorageLive(_3); // bb2[1]: scope 1 at main.rs:7:12: 7:1524 StorageLive(_4); // bb2[2]: scope 1 at main.rs:7:12: 7:1325 _4 = _1; // bb2[3]: scope 1 at main.rs:7:12: 7:1326 _3 = Gt(move _4, const 5i32); // bb2[4]: scope 1 at main.rs:7:12: 7:1527 StorageDead(_4); // bb2[5]: scope 1 at main.rs:7:14: 7:1528 switchInt(_3) -> [false: bb3, otherwise: bb4]; // bb2[6]: scope 1 at main.rs:7:9: 9:1029 }3031 bb3: {32 StorageDead(_3); // bb3[0]: scope 1 at main.rs:10:5: 10:633 goto -> bb1; // bb3[1]: scope 1 at main.rs:4:5: 10:634 }3536 bb4: {37 StorageDead(_3); // bb4[0]: scope 1 at main.rs:10:5: 10:638 _1 = const 10i32; // bb4[1]: scope 1 at main.rs:12:5: 12:1139 StorageDead(_1); // bb4[2]: scope 0 at main.rs:13:1: 13:240 return; // bb4[3]: scope 0 at main.rs:13:2: 13:241 }42}
Notice how the program flow for the loop looks like :
x=10
1static glob: u32 = 10;23fn main() {4 let x = glob + 5;5}
The Mir Representation :
1fn main() -> () {2 let mut _0: (); // return place in scope 0 at main.rs:3:11: 3:113 let _1: u32; // "x" in scope 0 at main.rs:4:9: 4:104 let mut _2: u32; // in scope 0 at main.rs:4:13: 4:175 let mut _3: (u32, bool); // in scope 0 at main.rs:4:13: 4:216 scope 1 {7 }89 bb0: {10 StorageLive(_1); // bb0[0]: scope 0 at main.rs:4:9: 4:1011 StorageLive(_2); // bb0[1]: scope 0 at main.rs:4:13: 4:1712 _2 = (glob: u32); // bb0[2]: scope 0 at main.rs:4:13: 4:1713 _3 = CheckedAdd(move _2, const 5u32); // bb0[3]: scope 0 at main.rs:4:13: 4:2114 assert(!move (_3.1: bool), "attempt to add with overflow") -> bb1; // bb0[4]: scope 0 at main.rs:4:13: 4:2115 }1617 bb1: {18 _1 = move (_3.0: u32); // bb1[0]: scope 0 at main.rs:4:13: 4:2119 StorageDead(_2); // bb1[1]: scope 0 at main.rs:4:20: 4:2120 StorageDead(_1); // bb1[2]: scope 0 at main.rs:5:1: 5:221 return; // bb1[3]: scope 0 at main.rs:5:2: 5:222 }23}2425static glob: u32 = {26 let mut _0: u32; // return place in scope 0 at main.rs:1:14: 1:1727 bb0: {28 _0 = const 10u32; // bb0[0]: scope 0 at main.rs:1:20: 1:2229 return; // bb0[1]: scope 0 at main.rs:1:1: 1:2330 }31}
The examples above only focus on stack allocations, let us take a look at heap allocated memory management.
1fn main() {2 let x = vec![1,2,3];3}
The Mir Representation (Only the code section of interest) :
1fn main() -> () {2 let mut _0: (); // return place in scope 0 at main.rs:1:11: 1:113 let _1: std::vec::Vec<i32>; // "x" in scope 0 at main.rs:2:9: 2:1045 scope 1 {6 }78 bb0: {9 // ... Vector initialization10 goto bb1;11 }1213 bb1: {14 drop(_1) -> bb2; // bb1[1]: scope 0 at main.rs:3:1: 3:215 }1617 bb2: {18 StorageDead(_1); // bb2[0]: scope 0 at main.rs:3:1: 3:219 return; // bb2[1]: scope 0 at main.rs:3:2: 3:220 }21}
Most of the code is boilerplate code for the vector initialization. Whats important to note is the function call drop(_1)
where _1 is the variable x (a Vector) in the source code. The drop function takes a value and calls the destructor of the object on it.
1fn main() {2 let x = vec![1,2,3];3 move_function(x);4}56fn move_function(x: Vec<u32>){78}
The Mir Representation :
1fn move_function(_1: std::vec::Vec<u32>) -> () {2 let mut _0: (); // return place in scope 0 at main.rs:6:30: 6:3034 bb0: {5 drop(_1) -> bb1; // bb0[0]: scope 0 at main.rs:8:1: 8:26 }78 bb1: {9 return; // bb1[0]: scope 0 at main.rs:8:2: 8:210 }11}1213fn main() -> () {14 let mut _0: (); // return place in scope 0 at main.rs:1:11: 1:1115 let _1: std::vec::Vec<u32>; // "x" in scope 0 at main.rs:2:9: 2:1016 let _5: (); // in scope 0 at main.rs:3:5: 3:2117 let mut _6: std::vec::Vec<u32>; // in scope 0 at main.rs:3:19: 3:2018 19 scope 1 {20 }2122 bb0: {23 // ... Vector initialization24 goto bb1;25 }2627 bb1: {28 StorageLive(_6); // bb1[2]: scope 1 at main.rs:3:19: 3:2029 _6 = move _1; // bb1[3]: scope 1 at main.rs:3:19: 3:2030 _5 = const move_function(move _6) -> bb2; // bb1[4]: scope 1 at main.rs:3:5: 3:2131 }3233 bb2: {34 StorageDead(_6); // bb2[0]: scope 1 at main.rs:3:20: 3:2135 StorageDead(_1); // bb2[2]: scope 0 at main.rs:4:1: 4:236 return; // bb2[3]: scope 0 at main.rs:4:2: 4:237 }38}
Notice how _1(x) has been moved into 6 and passed as an argument to move_function
, which implies _1 is not valid in main
from this point. Also, note how drop is not called on _1 in main
, but drop is called in move_function
.
Developers and Researchers are working on exploring the potential of MIR and other use-cases. miri, a interpreter for Rust's MIR is an outcome of as part of an undergraduate research course in 2015 at the University of Saskatchewan. miri can run binaries rust projects and detect certain classes of undefined behavior.
MIR representation opens doors for more static analysis over the source code. Due to it representing the source code in a control-flow graph, flow-sensitive analysis and profiling on the codebase will get much easier.