A Detailed View of Strings in Rust

Everything you'll need to know, from Path to Cow

Indigo Curnick
September 25, 2024

Strings are pretty complicated in Rust. Those coming to Rust from other languages might be confused as to why there are so many different kinds of strings. Most languages, like Python, Java, and C#, only have one kind of string that the programmer has to worry about in most cases.

Before we dive in, it’s helpful to just very briefly look at how strings work in C. This will be very instructive, although we won’t dive into too many details.

In general, in C, there is no explicit “string” type - you actually have to make character arrays:

#include <stdio.h>

void example() {
    char foo[20] = "Hello, World!";
    printf("%s\n", foo);
}

This data is stored in an array which holds 20 chars. foo, then, very explicitly, is the entire array. However, in practice it operates as a pointer to the first character of the array when used. You might notice that “Hello, World!” only has 13 characters. You might also notice that foo itself contains no information about its own length. This is a source of many memory bugs in C - if I tried to do something like foo[22] I would be reading past the end of the array, which is undefined behaviour: the program might crash, or it might silently read garbage. Therefore, C strings end with the null character \0 (the compiler adds it after a string literal automatically, and here it also fills the unused rest of the array with zeros) - and when working with the array we can loop through till we encounter the null character, letting us know the string is over.

Since foo is a local variable, that array lives on the stack. In this blog, we are going to be discussing strings in Rust, and covering where, exactly, in memory they are stored. It’s well worth refreshing the difference between the stack and the heap before we continue. This won’t be an in-depth treatment of these concepts though, which is well beyond the scope of this blog.

The stack is a last-in, first-out (LIFO) area of memory. Because of this, memory management on the stack is automatic. As a result of that, adding and removing items from the stack is extremely fast. Generally, local variables of a known size, such as primitives, are stored on the stack.

The heap is a large pool of memory which doesn’t have the benefits of automatic memory management. Therefore, we need to manage that memory ourselves - in C you would use calls to malloc and free. Typically, in Rust, you won’t explicitly call malloc-like and free-like functions yourself, but they are still happening, and to understand the performance of your Rust programs it’s helpful to know when these calls are being made. Because of these allocation calls, using the heap is much slower than using the stack. However, it is dynamic, so data inside it can grow and shrink at runtime, and this allows us to handle input data of variable length.
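To make those hidden allocation calls a little more visible, here is a minimal sketch that counts how often a String's heap buffer has to be regrown. It uses String, which is only introduced later in this blog, and the helper name count_regrowths is made up for this illustration. The exact number of regrowths is an implementation detail of the standard library, so the sketch only checks whether any happened.

```rust
// Hypothetical helper: pushes `pushes` single-byte characters and counts
// how often the capacity changed, i.e. how often the heap buffer was regrown.
fn count_regrowths(mut s: String, pushes: usize) -> usize {
    let mut regrowths = 0;
    let mut capacity = s.capacity();
    for _ in 0..pushes {
        s.push('x');
        if s.capacity() != capacity {
            regrowths += 1;
            capacity = s.capacity();
        }
    }
    regrowths
}

fn main() {
    // Starts with no buffer at all, so it has to be regrown as the data grows.
    println!("regrowths from empty: {}", count_regrowths(String::new(), 1000));

    // Space for 1000 bytes is reserved up front, so no regrowth is needed.
    println!(
        "regrowths when pre-sized: {}",
        count_regrowths(String::with_capacity(1000), 1000)
    );
}
```

If you know roughly how much data is coming, reserving the space once with String::with_capacity is usually cheaper than letting the buffer regrow repeatedly.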

Basic Strings

Most of the time, strings will be of two different types - either &str or String. You can also have &String, but you can never hold a bare str directly. It’s worth pausing to understand a fundamental Rust concept here, the idea of sized. In order to have a type in and of itself (e.g. String, i32), we have to know the size of that object. Think about i32 - its size is known at compile time. It’s 32 bits. In Rust, sized is represented by the trait Sized. If something does not implement Sized then we can only use it behind a pointer such as a reference, not as the type in and of itself. We’ll look in a little bit at how String can implement Sized but str can not.
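As a small sketch of the idea, the standard functions std::mem::size_of and std::mem::size_of_val show the difference: the first answers at compile time and only accepts Sized types, the second measures one particular value at runtime. The helper name byte_len is just for illustration.

```rust
use std::mem::{size_of, size_of_val};

// Hypothetical helper: the size of a `str` can only be measured at runtime,
// through a reference to one particular value.
fn byte_len(s: &str) -> usize {
    size_of_val(s)
}

fn main() {
    // i32 is Sized: every i32 occupies 4 bytes, known at compile time.
    println!("i32: {} bytes", size_of::<i32>());

    // str is not Sized, so this line would not compile:
    // println!("{}", size_of::<str>());

    // Two different str values have two different sizes.
    println!("\"Hi\": {} bytes", byte_len("Hi"));
    println!("\"Hello, World!\": {} bytes", byte_len("Hello, World!"));
}
```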

Let’s start with str. This is a slice of bytes, in other words, a [u8]. The contents are always a valid UTF-8 sequence in Rust. As mentioned, we can only ever have a reference to a str, most commonly it will show up as a &str, but we can also use smart pointers, like Box<str>.
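Here is a short sketch of what “a [u8] that is always valid UTF-8” means in practice. The helper name lengths is hypothetical. The character é is written as the escape \u{e9} so that the example does not depend on how this page is encoded.

```rust
use std::str;

// Hypothetical helper: returns (length in bytes, number of characters).
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // 'é' (U+00E9) takes two bytes in UTF-8, so the byte length and the
    // character count differ.
    let word = "h\u{e9}llo";
    println!("bytes: {:?}", word.as_bytes());
    println!("(bytes, chars): {:?}", lengths(word));

    // Not every byte sequence is a valid str: 0xFF never appears in UTF-8,
    // so the checked conversion reports an error instead of producing a &str.
    let bad: [u8; 3] = [0x68, 0x69, 0xFF];
    println!("invalid bytes rejected: {}", str::from_utf8(&bad).is_err());
}
```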

So, where in memory is that [u8]? With &str it can actually appear in three possible places.

First, in the read-only data section of the generated binary (known as the “rodata”). This occurs with a string literal - a string which is literally included in the binary. You would get this in the following example

let a: &str = "Hello, World!";

Here, a is a &str - its size is known as it is a reference to the actual data in memory, which is a [u8] in the rodata. The reference itself (a pointer plus a length) is stored on the stack.

Second, on the stack. This is a little less common, but in some instances you could create a fixed-size array of bytes, whose size is known at compile time and which would be stored on the stack. For example, if you did something like let arr: [u8; 5] = [72, 101, 108, 108, 111]; then this array would be stored on the stack. You could then use std::str::from_utf8 to borrow it as a &str (it takes a &[u8], so an array of i32 would not work). You’ll probably never do this.
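For completeness, a minimal sketch of the stack case. The byte values spell “Hello” in ASCII, and the helper name view_as_str is made up for this example.

```rust
use std::str;

// Hypothetical helper: views a byte slice as a &str without copying,
// or returns None if the bytes are not valid UTF-8.
fn view_as_str(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // A fixed-size array of bytes, stored on the stack.
    let arr: [u8; 5] = [72, 101, 108, 108, 111]; // ASCII for "Hello"

    // The resulting &str points straight at the stack array;
    // nothing is copied and nothing is allocated on the heap.
    let s = view_as_str(&arr);
    println!("{:?}", s);
}
```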

Third, on the heap. This is probably the most common situation. The &str pointer will be stored on the stack, and the actual data it points to will be on the heap. This will be the case if the reference came from a String or a Box<str>.
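A small sketch of the heap case: a &str borrowed from a String points into the String's own heap buffer rather than at a copy. The helper name first_word is hypothetical.

```rust
// Hypothetical helper: borrows the first whitespace-separated word.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

fn main() {
    let owned: String = String::from("Hello, World!");

    // `word` is a &str that borrows part of the String's heap buffer.
    let word: &str = first_word(&owned);
    println!("{}", word);

    // Both start at the same address, so no bytes were copied.
    println!("same start address: {}", word.as_ptr() == owned.as_ptr());
}
```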

An alternative way to hold a str is with Box - you can have the type Box<str>. This is because Box is also a pointer type, and for a local variable the pointer itself is again stored on the stack. However, there’s a key difference between these two implementations. &str is unowned - meaning the reference is borrowing data. The underlying data must always outlive the &str. However, Box<str> owns the data stored, so when the Box is dropped the data is also dropped, preventing memory leaks. Also, with Box the data is always stored on the heap. Box can be used when you want to force data to be on the heap, for example, if we now did

let a: Box<str> = Box::from("Hello, World!");

What was before a string literal stored in the rodata is now heap data (also note that we use Box::from here NOT Box::new as you might be used to - Box::new("hello world") would actually make a Box<&str>).
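As a sketch, here are two ways to end up with a Box<str>: copying a literal with Box::from, or converting an existing String with the standard into_boxed_str method. The helper name freeze is hypothetical.

```rust
// Hypothetical helper: turns a String into a Box<str>,
// giving up any spare capacity the String had reserved.
fn freeze(s: String) -> Box<str> {
    s.into_boxed_str()
}

fn main() {
    // Copies the literal's bytes from the rodata onto the heap.
    let from_literal: Box<str> = Box::from("Hello, World!");

    // Converts an existing heap-allocated String.
    let mut s = String::with_capacity(64);
    s.push_str("Hello, World!");
    let frozen: Box<str> = freeze(s);

    // Box<str> dereferences to str, so the usual &str methods work.
    println!("length: {}", from_literal.len());
    println!("equal contents: {}", frozen == from_literal);
}
```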

This brings us neatly along to the String. A String is actually a struct which contains three basic elements - a pointer to a string buffer, the length of the string buffer and the capacity of the string buffer. Savvy Rust users might notice that this also describes a Vec very well, and that is not a coincidence. A String in Rust is actually just a Vec<u8> in disguise (with the added guarantee that the bytes are valid UTF-8)! So to be clear, with a String what we actually have is an object on the stack which contains three variables of known size, one of which is a pointer to data on the heap.

We’re now in a place to understand why String can implement Sized but str can not. A string, in general, can be of any length - from a single character all the way to the complete works of William Shakespeare (and beyond!). If we put that as bytes into an array, we have no idea how long it will be. A single byte? A million bytes? str represents these bytes - its size can’t be known at compile time. &str though is a reference to this data, and we always know the size of references. String on the other hand, is in and of itself only three variables in a struct - as mentioned, one pointer and two numbers. All of these have known sizes, too. In Rust, any struct made of data types which themselves implement Sized also automatically implements Sized.
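A quick sketch that measures these handles in machine words (the size of a usize). The helper name words is hypothetical. That a String is exactly three words is how current standard library releases lay it out rather than a documented promise, so treat that number as an observation.

```rust
use std::mem::size_of;

// Hypothetical helper: size of a type measured in machine words.
fn words<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    // A reference to a str carries a pointer and a length.
    println!("&str:     {} words", words::<&str>());

    // Box<str> is also a pointer plus a length, but it owns the data.
    println!("Box<str>: {} words", words::<Box<str>>());

    // String adds a capacity on top of the pointer and the length.
    println!("String:   {} words", words::<String>());
}
```

None of these sizes depend on how long the text is, which is exactly why all three types are Sized.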

So, why use String if we already have &str? One of the big reasons is that String owns its string buffer: internally it holds a raw pointer to the data (this is nowhere near as scary as it sounds - all pointers in C/C++ are unsafe by Rust standards!) and exposes it through a safe API. The consequence of this is that it’s possible to mutate, and even grow, the data on the heap. A &str is a shared borrow, so there’s no way to modify the underlying data through it. But String has methods like push_str which allow the actual data itself to be modified.
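A minimal sketch of that difference: the function below changes a String in place through a mutable reference, which has no equivalent for a &str. The helper name add_name is made up for this example.

```rust
// Hypothetical helper: appends to the caller's String in place.
fn add_name(greeting: &mut String, name: &str) {
    greeting.push_str(", ");
    greeting.push_str(name);
    greeting.push('!');
}

fn main() {
    let mut s = String::from("Hello");
    add_name(&mut s, "World");

    // The same String now holds more data; its heap buffer grew if needed.
    println!("{} (length {}, capacity {})", s, s.len(), s.capacity());

    // A &str offers no such methods, so this would not compile:
    // let t: &str = "Hello";
    // t.push_str(", World!");
}
```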

It’s also fairly common to see &String - while this might seem redundant, there are plenty of situations in which we need to do this. Typically, if we want to pass around a String we need to do this, otherwise the called function will take ownership. For example, the following code won’t compile

fn main() {
    let a: String = "Hello, World!".to_string();

    foo(a); // `a` is moved into `foo` here
    bar(a); // error: use of moved value `a`
}

fn foo(b: String) {
    println!("{}", b);
}

fn bar(b: String) {
    println!("{}", b);
}

This is because foo takes ownership of a. There are two ways to fix this problem. We can clone, or we can reference. Cloning would look like

fn main() {
	let a: String = "Hello, World!".to_string();
	
	foo(a.clone());
	bar(a);
}

fn foo(b: String) {
	println!("{}", b);
}

fn bar(b: String) {
	println!("{}", b);
}

While this seems like a neat solution at first, keep in mind that clone makes a duplicate in memory of everything - not only do we have a new struct on the stack (a pointer and two integers), but it also clones the underlying memory in the heap. If you had the entire works of William Shakespeare in memory, you now have two copies of the entire works of William Shakespeare. For large strings this is slow and wasteful. Only do this if you really need it. The alternative is to reference.

fn main() {
    let a: String = "Hello, World!".to_string();

    foo(&a);
    bar(&a);
}

fn foo(b: &String) {
    println!("{}", b);
}

fn bar(b: &String) {
    println!("{}", b);
}

Now, it works. Also, there’s another way to do this

fn main() {
    let a: String = "Hello, World!".to_string();

    foo(&a);
    bar(a.as_str());
}

fn foo(b: &str) {
    println!("{}", b);
}

fn bar(b: &str) {
    println!("{}", b);
}

In general, it’s better to make functions take a &str rather than a &String, where possible. This is because we can make a &str from a String very easily and cheaply, and a &str parameter also accepts string literals. Also notice in the above example we made this reference by &a and by a.as_str() - here they have the same effect. Strictly speaking, &a is a &String, but String implements the Deref trait, so the compiler automatically converts a &String into a &str wherever a &str is expected (this is called deref coercion).
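To illustrate why &str is the more flexible parameter type, here is a sketch of one function being called with several different kinds of string. The helper name shout is hypothetical.

```rust
// Hypothetical helper: accepts anything that can be viewed as a &str.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let literal: &str = "hello";
    let owned: String = String::from("hello");
    let boxed: Box<str> = Box::from("hello");

    println!("{}", shout(literal));        // already a &str
    println!("{}", shout(&owned));         // &String is coerced to &str
    println!("{}", shout(owned.as_str())); // explicit conversion
    println!("{}", shout(&boxed));         // &Box<str> is coerced to &str
}
```

Had shout taken a &String instead, the calls with the literal and the Box<str> would each have needed a fresh String to be allocated first.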

So to summarise

  • &str - a reference, stored on the stack, pointing at byte data on the rodata, stack or heap
  • String - object on the stack containing a reference to byte data on the heap

OsString

An OsString is a different string type which has specific applications when representing data that comes from the operating system. This is because in Rust, String and &str are always UTF-8, which can contain zero bytes. But operating systems have different rules

  • UNIX like systems usually represent strings as non-zero bytes, often interpreted as UTF-8
  • In Windows, strings are non-zero 16 bit values, usually interpreted as UTF-16

Just like String and &str there are two flavours - OsString and &OsStr. They are essentially analogous to one another. OsString is mutable and owns its data, and all of the same rules about memory apply. &OsStr is immutable, and again, all of the same rules about memory apply. Again, like str, you can’t have a bare OsStr - only a &OsStr. The reason for all of this is that OsString and &OsStr can be thought of as the counterparts of String and &str which hold whatever the platform considers a string, without the guarantee that it is valid UTF-8.

Therefore, OsStr is stored in the rodata, on the heap or (in extremely rare circumstances) on the stack. It is unsized, requiring the reference, where the reference is stored on the stack. OsString is sized, and does not require a pointer. It is stored on the stack, but contains a reference to the actual byte data, which is always on the heap.

The most common time that you’ll encounter an OsString is when working with files on the disk. Consider the following dummy function, which prints out the contents of a folder

use std::ffi::OsString;
use std::fs;
use std::path::Path;

fn list_files_in_directory(dir_path: &Path) -> std::io::Result<()> {
    // Read the directory entries
    for entry in fs::read_dir(dir_path)? {
        let entry = entry?;
        let file_name: OsString = entry.file_name(); // Get the file name as OsString

        // Print the file name as OsString (platform-specific)
        println!("File name (OsString): {:?}", file_name);

        // Attempt to convert to UTF-8 &str, if valid
        match file_name.to_str() {
            Some(valid_str) => println!("File name (UTF-8): {}", valid_str),
            None => println!("File name contains non-UTF-8 characters"),
        }
    }
    Ok(())
}

Notice how the standard library returns file names as an OsString, which then provides functions for converting that into a Rust &str or String. For the reasons discussed, this can fail, so make sure to handle those errors!
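As a rough guide, the standard library offers three conversions out of an OsString or OsStr, and the sketch below shows each of them. The helper name display_name is hypothetical, and the file name used is plain ASCII so that the example behaves the same on every platform.

```rust
use std::ffi::{OsStr, OsString};

// Hypothetical helper: produces a printable name, replacing anything that
// is not valid Unicode with the U+FFFD replacement character.
fn display_name(name: &OsStr) -> String {
    name.to_string_lossy().into_owned()
}

fn main() {
    let name = OsString::from("report.txt");

    // Borrowing conversion: Option<&str>, None if the data is not valid UTF-8.
    println!("{:?}", name.to_str());

    // Lossy conversion: always succeeds, but may alter invalid data.
    println!("{}", display_name(&name));

    // Consuming conversion: Result<String, OsString>, which hands the
    // original OsString back on failure so that nothing is lost.
    match name.into_string() {
        Ok(s) => println!("valid UTF-8: {}", s),
        Err(original) => println!("not valid UTF-8: {:?}", original),
    }
}
```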

This is another example where Rust seems needlessly pedantic at first - few other languages bother with this distinction. However, when Rust effortlessly handles all of these edge cases you’ll be thankful that an unpaired surrogate code unit in a filename didn’t keep you up till 3am searching for a bug.

Path

Path and PathBuf are once again analogous to &str and String. Paths are in general tricky - different operating systems represent paths in different ways. For example, UNIX systems always use forward slashes e.g. Documents/folder/file.txt, but Windows traditionally uses backslashes e.g. Documents\folder\file.txt (which you will often see written as Documents\\folder\\file.txt in source code, because a backslash has to be escaped inside a string literal).

Many of the same concepts already covered apply once more here - Path and PathBuf are actually just thin wrappers around OsStr and OsString respectively. Once more, Path is unsized and requires a reference. The reference is stored on the stack while the byte data can be stored in the rodata, heap or (in very rare circumstances) on the stack. PathBuf is sized, and stored on the stack. The PathBuf object itself contains a reference to the underlying byte data, which is always on the heap.
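Here is a minimal sketch of how Path and PathBuf hide the separator question: the path is assembled from its parts, and the standard library inserts whichever separator the current platform uses. The helper name join_all is hypothetical.

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper: joins the parts using the platform's own separator.
fn join_all(parts: &[&str]) -> PathBuf {
    let mut path = PathBuf::new();
    for part in parts {
        path.push(part);
    }
    path
}

fn main() {
    let path = join_all(&["Documents", "folder", "file.txt"]);

    // Displayed with / on UNIX-like systems and with \ on Windows.
    println!("{}", path.display());

    // The components are the same on every platform.
    for component in path.components() {
        println!("{:?}", component.as_os_str());
    }

    // A &Path can be borrowed from a PathBuf, just like &str from String.
    let borrowed: &Path = path.as_path();
    println!("{:?}", borrowed.file_name());
}
```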

It’s best to use Path in circumstances where the path isn’t going to be modified and only read. For example, we can use this example function to print some information about a file

use std::path::Path;

fn print_path_info(path: &Path) {
    println!("Path: {:?}", path);

    // Check if the path exists
    if path.exists() {
        println!("Path exists");
    } else {
        println!("Path does not exist");
    }

    // Check if the path is a file or directory
    if path.is_file() {
        println!("It is a file.");
    } else if path.is_dir() {
        println!("It is a directory.");
    }

    // Display the file name if present
    if let Some(file_name) = path.file_name() {
        println!("File name: {:?}", file_name);
    }

    // Display the parent directory if present
    if let Some(parent) = path.parent() {
        println!("Parent directory: {:?}", parent);
    }
}

But we can use PathBuf when we want to modify a path. Consider the following example

use std::path::PathBuf;

fn build_path() -> PathBuf {
    let mut path_buf = PathBuf::new();

    // Start with the home directory
    path_buf.push("/home/user");

    // Add subdirectories or files
    path_buf.push("documents");
    path_buf.push("rust_programming");
    path_buf.push("project.txt");

    // Return the fully built path
    path_buf
}

As a tip, if you’re making a library and don’t care if the functions take a Path or a PathBuf, you can use the following to achieve that

use std::path::Path;

fn foo<T: AsRef<Path>>(path: T) {
    let path = path.as_ref();
    // You can now use `path` as a &Path
    // ...
}
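A short usage sketch of that pattern, showing the kinds of argument it accepts. The helper name extension_of is made up for this example, and it simply returns None when the extension is missing or is not valid UTF-8.

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper written in the AsRef<Path> style.
fn extension_of<P: AsRef<Path>>(path: P) -> Option<String> {
    let path = path.as_ref();
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_string())
}

fn main() {
    // The same function accepts all of these without any conversion
    // at the call site.
    println!("{:?}", extension_of("notes.txt"));               // &str
    println!("{:?}", extension_of(String::from("notes.txt"))); // String
    println!("{:?}", extension_of(Path::new("notes.txt")));    // &Path
    println!("{:?}", extension_of(PathBuf::from("notes")));    // PathBuf
}
```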

Cow

 _________________________________________
/ Cow is a type, and gets its name from a \
\ concept, that of copy on write.         /
 -----------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Cow is a funny name, which comes from “copy on write”. The Cow can essentially contain either a &str or a String. It’s useful when you want to do something to the data, which may or may not modify it. In Rust, &str is immutable. If we want to mutate that data, we will need to copy it into a String.

The same rules for &str or String apply as to where they are stored in memory. &str is again a pointer, where the actual data is in either the rodata or on the heap (and in very rare instances the stack, but that will probably not happen). The String is also, just like before, a Rust struct containing pointers to the heap.

Here’s a demonstrative example of Cow

use std::borrow::Cow;

fn main() {
    let hello: &str = "Hello, world!";
    let mut cow_string: Cow<str> = Cow::Borrowed(hello); // Cow from &str - at this point it only borrows

    println!("Initially: {}", cow_string); // Output: "Hello, world!"
    println!("Is Cow borrowed? {}", matches!(cow_string, Cow::Borrowed(_))); // true

    // Modify the Cow (this will cause it to convert into an owned String)
    cow_string.to_mut().push_str(" How are you?"); // Cow now contains a String

    println!("After modification: {}", cow_string); // Output: "Hello, world! How are you?"
    println!("Is Cow borrowed? {}", matches!(cow_string, Cow::Borrowed(_))); // false

    // Notice how the original variable `hello` is *still available here*
    // The `Cow` only *borrowed* it and then made a *clone*
    // This means we have *two copies of that data in memory*
}

Notice how by the end of the program we have two copies of the data in memory - the hello variable’s data "Hello, world!" still exists in the rodata (it is a string literal), and the cow now also owns its own data "Hello, world! How are you?" on the heap. This is why we need to be thoughtful when using Cow - if we aren’t, we can end up with a lot of data being cloned!
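Where Cow tends to pay off is in functions that usually leave their input alone. Here is a minimal sketch, assuming a made-up helper called underscore which replaces spaces with underscores: it only allocates a new String when the input actually contains a space, and otherwise hands the borrowed data straight back.

```rust
use std::borrow::Cow;

// Hypothetical helper: copies the data only when a change is needed.
fn underscore(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        // A modification is needed, so allocate a new String.
        Cow::Owned(input.replace(' ', "_"))
    } else {
        // Nothing to change, so just borrow the input.
        Cow::Borrowed(input)
    }
}

fn main() {
    let untouched = underscore("hello");
    let changed = underscore("hello world");

    println!("{} - borrowed? {}", untouched, matches!(untouched, Cow::Borrowed(_)));
    println!("{} - borrowed? {}", changed, matches!(changed, Cow::Borrowed(_)));
}
```

The caller can treat both results as a &str and only pays for a copy in the second case.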

Conclusion

While strings might seem daunting in Rust at first, just remember the following key points.

  • &str is immutable, String is mutable
  • str is unsized so it always needs a reference
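  • OsStr is the platform-native counterpart of str, and Path is a thin wrapper around OsStr
  • OsString is the platform-native counterpart of String, and PathBuf is a thin wrapper around OsString
  • Cow<str> holds either a &str or a String, and which one it holds can change across program execution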

As a final piece of advice, Rust generally encourages less abstract thinking than other programming languages. You’ll find it much easier when you think about things physically - what kind of thing do you have: the actual data, or just a reference? What does the data look like? Where in memory is it stored?
