Everything you'll need to know, from Path to Cow
Strings are pretty complicated in Rust. Those coming to Rust from other languages might be confused as to why there are so many different kinds of strings. Most languages, like Python, Java, and C#, only have one kind of string that the programmer has to worry about in most cases.
Before we dive in, it’s helpful to just very briefly look at how strings work in C. This will be very instructive, although we won’t dive into too many details.
In general, in C, there is no explicit “string” type - you actually have to make character arrays:
#include <stdio.h>

void example() {
    char foo[20] = "Hello, World!";
    printf("%s\n", foo);
}
This data is stored in an array which holds 20 chars. foo, then, very explicitly, is the entire array. However, in practice it operates as a pointer to the first character of the array when used. You might notice that “Hello, World!” only has 13 characters. You might also notice that foo itself contains no information about the length of the string it holds. This is a source of many memory bugs in C - if I tried to do something like foo[22], I would be reading past the end of the array, which is undefined behaviour: the program might crash, or it might quietly read garbage. To mark where the text ends, C strings are terminated with the null character \0 (here the compiler also fills the unused part of the array with zeros) - and when working with the array we can loop through until we encounter the null character, letting us know the string is over. Because foo is a local variable, the whole array lives on the stack.
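To make the null terminator idea concrete, here is a minimal Rust sketch of the scan that printf("%s") effectively performs: walk a fixed-size buffer until the first zero byte. The function name c_style_len is an illustrative assumption, and the 20-byte buffer simply mirrors the C example above.

```rust
/// Returns the number of bytes before the first `\0`, or `None` if the
/// buffer has no terminator (the case that causes trouble in C).
fn c_style_len(buf: &[u8]) -> Option<usize> {
    buf.iter().position(|&byte| byte == 0)
}

fn main() {
    // Mimic `char foo[20] = "Hello, World!";`: 13 bytes of text, the rest zeroed.
    let mut foo = [0u8; 20];
    foo[..13].copy_from_slice(b"Hello, World!");

    assert_eq!(c_style_len(&foo), Some(13));
    // Without a terminator there is no answer; C would keep reading past the end.
    assert_eq!(c_style_len(b"no terminator"), None);
    println!("length found by scanning: {:?}", c_style_len(&foo));
}
```

Rust strings avoid this scan entirely because, as we will see, they store their length alongside the pointer.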
In this blog, we are going to be discussing strings in Rust, and covering where, exactly, in memory they are stored. It’s well worth refreshing the difference between the stack and the heap before we continue. This won’t be an in-depth covering of these concepts though, which is well beyond the scope of this blog.
The stack is a last-in, first-out (LIFO) area of memory. Because of this, memory management on the stack is automatic. As a result of that, adding and removing items from the stack is extremely fast. Generally, when defining primitives they are stored on the stack.
The heap is a large pool of memory which doesn’t have the benefits of automatic memory management. Therefore, we need to manage that memory ourselves - in C you would use calls to malloc and free. Typically, in Rust, you won’t explicitly call malloc- and free-like functions yourself, but they are still happening, and to understand the performance of your Rust programs it’s helpful to know when these calls are being made. Because of these allocator calls, using the heap is generally slower than using the stack. However, it is dynamic, so data inside it can grow and shrink at runtime, and this allows us to handle data whose length is only known at runtime.
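As a rough illustration of those hidden allocator calls, the sketch below grows a String one character at a time and counts how often its capacity changes, which is when the heap buffer had to be reallocated. The helper name count_reallocations is an assumption made for this example, and the exact count depends on the standard library’s growth strategy.

```rust
/// Pushes `n` characters onto an empty String and counts how many times
/// the heap buffer had to be reallocated (observed as a capacity change).
fn count_reallocations(n: usize) -> usize {
    let mut text = String::new(); // no heap allocation yet
    let mut last_capacity = text.capacity();
    let mut reallocations = 0;
    for _ in 0..n {
        text.push('x');
        if text.capacity() != last_capacity {
            reallocations += 1;
            last_capacity = text.capacity();
        }
    }
    reallocations
}

fn main() {
    let on_stack: i32 = 42; // fixed size, lives in the current stack frame
    let on_heap: Box<i32> = Box::new(42); // the Box (a pointer) is on the stack, the i32 on the heap
    println!("{} {}", on_stack, on_heap);
    println!("reallocations for 1000 pushes: {}", count_reallocations(1000));
}
```

The count is far smaller than the number of pushes because the buffer grows in chunks, but each of those reallocations is still a trip to the allocator.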
Most of the time, strings will be of two different types - either &str or String. You can also have &String, but a bare str cannot be used as the type of a variable. It’s worth pausing to understand a fundamental Rust concept here, the idea of sized. In order to hold a value of a type directly (e.g. String, i32), the compiler has to know the size of that type. Think about i32 - its size is known at compile time. It’s 32 bits. In Rust, sized is represented by the trait Sized. If something does not implement Sized then we can only use it behind a pointer such as a reference, not as a value in and of itself. We’ll look in a little bit at how String can implement Sized but str cannot.
Let’s start with str. This is a slice of bytes, in other words, a [u8]. The contents are always a valid UTF-8 sequence in Rust. As mentioned, we can only ever have a str behind a pointer. Most commonly it will show up as a &str, but we can also use smart pointers, like Box<str>.
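The “bytes that are always valid UTF-8” point can be checked directly. The following is a small sketch using only standard library calls; the helper byte_and_char_len is a name invented for this example.

```rust
use std::str;

/// Returns (number of bytes, number of characters) for a string slice.
fn byte_and_char_len(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // "h\u{e9}llo" is "hello" with an e-acute, which takes two bytes in UTF-8.
    let word: &str = "h\u{e9}llo";
    assert_eq!(byte_and_char_len(word), (6, 5));

    // The same data viewed as the underlying byte slice.
    let bytes: &[u8] = word.as_bytes();
    assert_eq!(bytes[1], 0xC3);
    assert_eq!(bytes[2], 0xA9);

    // Going the other way is checked: not every byte sequence is valid UTF-8.
    assert!(str::from_utf8(bytes).is_ok());
    assert!(str::from_utf8(&[0xFFu8, 0xFE]).is_err());
}
```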
So, where in memory is that [u8]? With &str it can actually appear in three possible places.
First, in the read-only data section of the generated binary (known as the “rodata”). This occurs with a string literal - a string which is literally included in the binary. You would get this in the following example:
let a: &str = "Hello, World!";
Here, a is a &str - its size is known as it is a reference to the actual data in memory, which is a [u8] in the rodata. The reference itself (a pointer plus a length) is stored on the stack.
Second, on the stack. This is a little less common, but in some instances you could create a fixed-size byte array, which would be stored on the stack. For example, if you did something like let arr: [u8; 5] = [104, 101, 108, 108, 111]; then this array would be stored on the stack. You could then use std::str::from_utf8(&arr) to try to borrow it as a &str (here it would succeed and give "hello"). You’ll probably never do this.
Third, on the heap. This is probably the most common situation. The &str pointer will be stored on the stack, and the actual data it points to will be on the heap. This will be the case if the reference came from a String or a Box<str>.
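The three cases can be put side by side. This is a minimal sketch, with shout being a hypothetical function that stands in for any code taking a &str; it neither knows nor cares where the bytes live.

```rust
use std::str;

/// Accepts a string slice regardless of where its bytes are stored.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

fn main() {
    // 1. Read-only data: the bytes are baked into the binary.
    let from_literal: &'static str = "hello";

    // 2. Stack: a fixed-size byte array local to this function.
    let stack_bytes: [u8; 5] = *b"hello";
    let from_stack: &str = str::from_utf8(&stack_bytes).expect("valid UTF-8");

    // 3. Heap: the buffer owned by a String.
    let owned: String = String::from("hello");
    let from_heap: &str = owned.as_str();

    println!("{}", shout(from_literal));
    println!("{}", shout(from_stack));
    println!("{}", shout(from_heap));
}
```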
An alternative way to get a pointer to a str is with Box - you can have the type Box<str>. This is because Box is also a pointer type, and for a local variable the pointer itself is again stored on the stack. However, there’s a key difference between the two. &str is unowned - meaning the reference is borrowing data. The underlying data must always outlive the &str. However, Box<str> owns the data stored, so when the Box is dropped the data is also freed, preventing memory leaks. Also, with Box the data is always stored on the heap. Box can be used when you want to force data to be on the heap, for example, if we now did
let a: Box<str> = Box::from("Hello, World!");
then what was before a string literal stored in the rodata has been copied into heap data. (Also note that we use Box::from here, NOT Box::new as you might be used to - Box::new("hello world") would actually make a Box<&str>.)
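Here is a short sketch contrasting the ways of building these boxes. The helper to_boxed is an illustrative name; everything else is standard library API.

```rust
/// Copies a borrowed string slice into its own heap allocation.
fn to_boxed(text: &str) -> Box<str> {
    Box::from(text)
}

fn main() {
    // Copies the literal's bytes from read-only data into a new heap allocation.
    let boxed: Box<str> = to_boxed("Hello, World!");

    // A String can hand its buffer over to a Box<str>, dropping any spare capacity.
    let from_string: Box<str> = String::from("Hello, World!").into_boxed_str();
    assert_eq!(boxed, from_string);

    // Box::new boxes the *reference*, so the text itself stays in read-only data.
    let boxed_ref: Box<&str> = Box::new("Hello, World!");
    assert_eq!(*boxed_ref, "Hello, World!");

    // Borrowing from the box gives an ordinary &str.
    let view: &str = &boxed;
    assert_eq!(view.len(), 13);
}
```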
This brings us neatly along to the String. A String is actually a struct which contains three basic elements - a pointer to a string buffer, the length of the string in that buffer and the capacity of the buffer. Savvy Rust users might notice that this also describes a Vec very well, and that is not a coincidence. A String in Rust is actually just a Vec<u8> in disguise! So to be clear, with a String what we actually have is an object on the stack which contains three fields of known size, one of which is a pointer to data on the heap.
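The relationship with Vec<u8> is visible in the public API. A minimal sketch follows, where through_bytes is a made-up helper that unwraps a String into its bytes and wraps it back up again.

```rust
/// Turns a String into its underlying Vec<u8> and back again.
fn through_bytes(text: String) -> String {
    let bytes: Vec<u8> = text.into_bytes(); // no copy, just unwraps the Vec
    // Going back requires a UTF-8 check, which is what String adds on top of Vec<u8>.
    String::from_utf8(bytes).expect("bytes came from a String, so they are valid UTF-8")
}

fn main() {
    // Reserve room for at least 32 bytes up front; nothing is stored yet.
    let mut text = String::with_capacity(32);
    assert_eq!(text.len(), 0);
    assert!(text.capacity() >= 32);

    text.push_str("Hello");
    assert_eq!(text.len(), 5); // length tracks the bytes in use, not the capacity

    assert_eq!(through_bytes(text), "Hello");
    // Arbitrary bytes are rejected: this is an invalid (overlong) UTF-8 sequence.
    assert!(String::from_utf8(vec![0xC0, 0x80]).is_err());
}
```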
We’re now in a place to understand why String can implement Sized but str cannot. A string, in general, can be of any length - from a single character all the way to the complete works of William Shakespeare (and beyond!). If we put that as bytes into an array, we have no idea how long it will be. A single byte? A million bytes? str represents these bytes - its size can’t be known at compile time. &str though is a reference to this data, and we always know the size of references (for a &str, a pointer plus a length). String, on the other hand, is in and of itself only three fields in a struct - as mentioned, one pointer and two numbers. All of these have known sizes, too. In Rust, any struct whose fields all implement Sized also automatically implements Sized.
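These sizes can be inspected with std::mem. The sketch below assumes a typical current Rust toolchain, where a &str is two machine words wide; the exact layout is an implementation detail rather than a language guarantee, and str_size is a name invented for this example.

```rust
use std::mem::{size_of, size_of_val};

/// Size in bytes of the str data behind a reference, only known at runtime.
fn str_size(text: &str) -> usize {
    size_of_val(text)
}

fn main() {
    let word_size = size_of::<usize>();

    // A &str is a "fat" pointer: address plus length.
    assert_eq!(size_of::<&str>(), 2 * word_size);

    // String has the same footprint as the Vec<u8> it wraps, however long the text is.
    assert_eq!(size_of::<String>(), size_of::<Vec<u8>>());

    // The size of the str data itself varies from value to value.
    assert_eq!(str_size("hi"), 2);
    assert_eq!(str_size("a much longer piece of text"), 27);

    // let bare: str = *"hi"; // does not compile: str is not Sized
}
```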
So, why use String if we already have &str? One of the big reasons is that String owns its buffer: internally it holds a raw pointer to the string buffer data (this is nowhere near as scary as it sounds - all pointers in C/C++ are raw by Rust standards!). The consequence of this is that it’s possible to mutate, and grow, the data on the heap. A &str is a shared borrow, so there’s no way to modify the underlying data through it. But String has methods like push_str which allow the actual data itself to be modified.
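A minimal sketch of that difference, with add_suffix as a hypothetical helper: the borrowed slice cannot grow, so the text is first copied into an owned String and modified there.

```rust
/// Appends a suffix in place; needs `&mut String` because a `&str` cannot grow.
fn add_suffix(text: &mut String, suffix: &str) {
    text.push_str(suffix);
}

fn main() {
    let greeting: &str = "Hello";
    // greeting.push_str(", World!"); // does not compile: no such method on &str

    // Copy the borrowed text into an owned, growable buffer first.
    let mut owned: String = greeting.to_string();
    add_suffix(&mut owned, ", World!");

    assert_eq!(owned, "Hello, World!");
    // The original slice is untouched.
    assert_eq!(greeting, "Hello");
}
```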
It’s also fairly common to see &String - while this might seem redundant, there are plenty of situations in which we need it. Typically, if we want to pass a String to several functions we need to pass a reference, otherwise the first called function will take ownership. For example, the following code won’t compile:
fn main() {
    let a: String = "Hello, World!".to_string();

    foo(a); // `a` is moved into `foo` here
    bar(a); // error[E0382]: use of moved value: `a`
}

fn foo(b: String) {
    println!("{}", b);
}

fn bar(b: String) {
    println!("{}", b);
}
This is because foo takes ownership of a. There are two ways to fix this problem. We can clone, or we can reference. Cloning would look like:
fn main() {
    let a: String = "Hello, World!".to_string();

    foo(a.clone());
    bar(a);
}

fn foo(b: String) {
    println!("{}", b);
}

fn bar(b: String) {
    println!("{}", b);
}
While this seems like a neat solution at first, keep in mind that clone makes a duplicate in memory of everything - not only do we have a new struct on the stack (a pointer and two integers), but it also copies the underlying memory in the heap. If you had the entire works of William Shakespeare in memory, you now have two copies of the entire works of William Shakespeare. This can be very slow. Only do this if you really need it. The alternative is to reference:
fn main() {
    let a: String = "Hello, World!".to_string();

    foo(&a);
    bar(&a);
}

fn foo(b: &String) {
    println!("{}", b);
}

fn bar(b: &String) {
    println!("{}", b);
}
Now, it works. Also, there’s another way to do this:
fn main() {
    let a: String = "Hello, World!".to_string();

    foo(&a);
    bar(a.as_str());
}

fn foo(b: &str) {
    println!("{}", b);
}

fn bar(b: &str) {
    println!("{}", b);
}
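In general, it’s better to make functions take a &str rather than a &String, where possible. This is because we can make a &str from a String very easily and cheaply, and such a function also accepts string literals and other string slices. Also notice in the above example we made this reference by &a and by a.as_str() - in this context they are equivalent. Writing &a gives a &String, which the compiler automatically converts to a &str because String implements Deref<Target = str> (this is called deref coercion). Rust is generally wary of implicit conversions, but this is one of the few common ones in the standard library.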
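To show why &str is the more flexible parameter type, here is a small sketch in which one function is called with several different string types. count_words is an illustrative name, not something from the standard library.

```rust
/// Takes the most general borrowed form, so callers can pass any string type.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    let literal: &str = "one two three";
    let owned: String = String::from("one two three");
    let boxed: Box<str> = Box::from("one two three");

    assert_eq!(count_words(literal), 3);
    assert_eq!(count_words(&owned), 3); // &String is coerced to &str
    assert_eq!(count_words(owned.as_str()), 3); // the explicit spelling
    assert_eq!(count_words(&boxed), 3); // &Box<str> is coerced too
}
```

Had count_words taken a &String instead, only the second call would compile without first allocating a new String.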
So to summarise:
- &str - a reference, stored on the stack, pointing at byte data in the rodata, on the stack or on the heap
- String - object on the stack containing a pointer to byte data on the heap
OsString
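An OsString is a different string type which has specific applications when representing data that comes from the operating system. This is because in Rust, String and &str are always valid UTF-8, and may contain zero bytes. But operating systems have different rules: on UNIX systems, strings are often arbitrary sequences of non-zero bytes, and on Windows, arbitrary sequences of non-zero 16-bit values, neither of which has to be valid Unicode.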
Just like String and &str there are two flavours - OsString and &OsStr. They are essentially analogous to one another. OsString is owned and mutable, and all of the same rules about memory apply. &OsStr is borrowed and immutable, and again, all of the same rules about memory apply. Again, like str, you can’t have a bare OsStr - only a &OsStr (or another pointer to one). The reason for all of this is that OsString and &OsStr can be thought of as the counterparts of String and &str respectively, wrapping a platform-specific byte buffer that is not required to be valid UTF-8.
Therefore, the data behind an OsStr is stored in the rodata, on the heap or (in extremely rare circumstances) on the stack. It is unsized, requiring the reference, where the reference is stored on the stack. OsString is sized, and does not require a pointer. It is stored on the stack, but contains a pointer to the actual byte data, which is always on the heap.
The most common time that you’ll encounter an OsString is when working with files on disk. Consider the following dummy function, which prints out the contents of a folder:
use std::ffi::OsString;
use std::fs;
use std::path::Path;

fn list_files_in_directory(dir_path: &Path) -> std::io::Result<()> {
    // Read the directory entries
    for entry in fs::read_dir(dir_path)? {
        let entry = entry?;
        let file_name: OsString = entry.file_name(); // Get the file name as OsString

        // Print the file name as OsString (platform-specific)
        println!("File name (OsString): {:?}", file_name);

        // Attempt to convert to UTF-8 &str, if valid
        match file_name.to_str() {
            Some(valid_str) => println!("File name (UTF-8): {}", valid_str),
            None => println!("File name contains non-UTF-8 characters"),
        }
    }
    Ok(())
}
Notice how the standard library returns file names as an OsString, which then provides methods such as to_str for converting that into a Rust string. For the reasons discussed, this can fail, so make sure to handle that case!
This is another example where Rust seems needlessly pedantic at first - few other languages bother with this distinction. However, when Rust effortlessly handles all of these edge cases you’ll be thankful that an unpaired surrogate code unit in a filename didn’t keep you up till 3am searching for a bug.
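When a name only needs to be displayed, there are a few conversion options to choose from. The sketch below shows the strict, lossy and owning conversions; strict_name and display_name are hypothetical wrappers around the standard to_str and to_string_lossy methods, and the lossy one returns a Cow, a type covered later in this post.

```rust
use std::borrow::Cow;
use std::ffi::{OsStr, OsString};

/// Strict conversion: `None` if the name is not valid UTF-8.
fn strict_name(name: &OsStr) -> Option<&str> {
    name.to_str()
}

/// Lossy conversion: invalid sequences become U+FFFD, so this never fails.
fn display_name(name: &OsStr) -> Cow<'_, str> {
    name.to_string_lossy()
}

fn main() {
    let name: OsString = OsString::from("report.txt");
    assert_eq!(strict_name(&name), Some("report.txt"));
    assert_eq!(display_name(&name), "report.txt");

    // Taking ownership: Ok(String) for valid UTF-8, otherwise the OsString is handed back.
    let as_string: Result<String, OsString> = name.into_string();
    assert_eq!(as_string, Ok(String::from("report.txt")));
}
```

The lossy form is usually fine for log messages, while the strict form is the safer choice when the result will be used to open the file again.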
Path
Path and PathBuf are once again analogous to &str and String. Paths are in general tricky - different operating systems represent paths in different ways. For example, UNIX systems always use forward slashes e.g. Documents/folder/file.txt but Windows conventionally uses backslashes e.g. Documents\folder\file.txt (which you will often see written with doubled backslashes, Documents\\folder\\file.txt, because the backslash has to be escaped inside string literals).
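One practical consequence is that it is usually better to let the standard library insert the separators. A minimal sketch, where join_all is a helper name invented for this example:

```rust
use std::path::{Path, PathBuf};

/// Joins path pieces without hard-coding a separator.
fn join_all(base: &Path, parts: &[&str]) -> PathBuf {
    let mut path = base.to_path_buf();
    for part in parts {
        path.push(part);
    }
    path
}

fn main() {
    let path = join_all(Path::new("Documents"), &["folder", "file.txt"]);
    // Displayed with the separator that the current platform uses.
    println!("{}", path.display());

    // Paths are compared component by component, not as raw text.
    assert_eq!(path, Path::new("Documents/folder/file.txt"));
    assert_eq!(path.components().count(), 3);
}
```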
Many of the same concepts already covered apply once more here - Path and PathBuf are actually just thin wrappers around OsStr and OsString respectively. Once more, Path is unsized and requires a reference. The reference is stored on the stack while the byte data can be stored in the rodata, heap or (in very rare circumstances) on the stack. PathBuf is sized, and stored on the stack. The PathBuf object itself contains a pointer to the underlying byte data, which is always on the heap.
It’s best to use Path in circumstances where the path isn’t going to be modified and only read. For example, we can use this example function to print some information about a file:
use std::path::Path;

fn print_path_info(path: &Path) {
    println!("Path: {:?}", path);

    // Check if the path exists
    if path.exists() {
        println!("Path exists");
    } else {
        println!("Path does not exist");
    }

    // Check if the path is a file or directory
    if path.is_file() {
        println!("It is a file.");
    } else if path.is_dir() {
        println!("It is a directory.");
    }

    // Display the file name if present
    if let Some(file_name) = path.file_name() {
        println!("File name: {:?}", file_name);
    }

    // Display the parent directory if present
    if let Some(parent) = path.parent() {
        println!("Parent directory: {:?}", parent);
    }
}
But we can use PathBuf when we want to modify a path. Consider the following example:
use std::path::PathBuf;

fn build_path() -> PathBuf {
    let mut path_buf = PathBuf::new();

    // Start with the home directory
    path_buf.push("/home/user");

    // Add subdirectories or files
    path_buf.push("documents");
    path_buf.push("rust_programming");
    path_buf.push("project.txt");

    // Return the fully built path
    path_buf
}
As a tip, if you’re making a library and don’t care if the functions take a Path or a PathBuf, you can use the following to achieve that:
use std::path::Path;

fn foo<T: AsRef<Path>>(path: T) {
    let path = path.as_ref();
    // You can now use `path` as a &Path
    // ...
}
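From the caller’s side, that pattern means the same function accepts string slices, owned strings and both path types. A brief sketch, where extension_of is a hypothetical function written in the same style:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: returns the extension of any path-like argument.
fn extension_of<T: AsRef<Path>>(path: T) -> Option<String> {
    path.as_ref()
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_string())
}

fn main() {
    let as_str: &str = "notes/todo.txt";
    let as_string: String = String::from("notes/todo.txt");
    let as_path_buf: PathBuf = PathBuf::from("notes/todo.txt");
    let as_path: &Path = Path::new("notes/todo.txt");

    let expected = Some(String::from("txt"));
    assert_eq!(extension_of(as_str), expected);
    assert_eq!(extension_of(&as_string), expected);
    assert_eq!(extension_of(&as_path_buf), expected);
    assert_eq!(extension_of(as_path), expected);
    assert_eq!(extension_of("README"), None);
}
```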
Cow
_________________________________________
/ Cow is a type, and gets its name from a \
\ concept, that of copy on write. /
-----------------------------------------
\ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
Cow is a funny name, which comes from “copy on write”. A Cow<str> can essentially contain either a &str or a String. It’s useful when you want to do something to the data which may or may not modify it. In Rust, &str is immutable. If we want to mutate that data, we will need to copy it into a String, and Cow defers that copy until the first time it is actually needed.
The same rules for &str or String apply as to where they are stored in memory. &str is again a pointer, where the actual data is in either the rodata or on the heap (and in very rare instances the stack, but that will probably not happen). The String is also, just like before, a Rust struct containing a pointer to the heap.
Here’s a demonstrative example of Cow:
use std::borrow::Cow;

fn main() {
    let hello: &str = "Hello, world!";
    let mut cow_string: Cow<str> = Cow::Borrowed(hello); // Cow from &str - nothing has been copied at this point

    println!("Initially: {}", cow_string); // Output: "Hello, world!"
    println!("Is Cow borrowed? {}", matches!(cow_string, Cow::Borrowed(_))); // true

    // Modify the Cow (this will cause it to convert into an owned String)
    cow_string.to_mut().push_str(" How are you?"); // Cow now contains a String

    println!("After modification: {}", cow_string); // Output: "Hello, world! How are you?"
    println!("Is Cow borrowed? {}", matches!(cow_string, Cow::Borrowed(_))); // false

    // Notice how the original variable `hello` is *still available here*
    // The `Cow` only *borrowed* it and then made a *clone*
    // This means we have *two copies of that data in memory*
}
Notice how by the end of the program we have two copies of the data in memory - the hello variable’s data "Hello, world!" still exists in the rodata (it is a string literal), and the cow now also owns its own data "Hello, world! How are you?" on the heap. This is why we need to be thoughtful when using Cow - if we aren’t, we can end up with a lot of data being cloned!
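Used carefully, Cow does the opposite and avoids clones. A common pattern, sketched here with a made-up function underscore_spaces, is to return the input untouched when nothing needs changing and allocate only when it does.

```rust
use std::borrow::Cow;

/// Replaces spaces with underscores, allocating only when there is something to replace.
fn underscore_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    let untouched = underscore_spaces("no_spaces_here");
    assert!(matches!(untouched, Cow::Borrowed(_))); // no allocation happened

    let changed = underscore_spaces("some spaces here");
    assert!(matches!(changed, Cow::Owned(_))); // a new String was built
    assert_eq!(changed, "some_spaces_here");

    // Either variant derefs to &str, so callers rarely need to care which one they got.
    println!("{} / {}", untouched, changed);
}
```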
While strings might seem daunting in Rust at first, just remember the following key points:
- &str is immutable, String is mutable
- str is unsized so it always needs a reference
- Path and OsStr are the borrowed counterparts of str
- PathBuf and OsString are the owned counterparts of String
- Cow can hold either a &str or a String, and which one it holds can change across program execution
As a final piece of advice, Rust generally encourages less abstract thinking than other programming languages. You’ll find it much easier when you think about things physically - what kind of thing do you have: the actual data, or just a reference? What does the data look like? Where in memory is it stored?