Ask ChatGPT: Use Rust to find duplicate files in a directory

This article is part of a series of ChatGPT experiments, in which I give ChatGPT a prompt, share its response, and offer some reflections.

Prompt

Can you write a Rust program that will scan a directory and all subdirectories and identify duplicate files within them?

Response

Yes, here is a simple program in Rust that scans a directory and all its subdirectories and identifies duplicate files based on their content:

use std::collections::HashMap;
use std::fs;
use std::path::Path;
use std::hash::{Hash, Hasher};
use std::fs::File;
use std::io::Read;

fn hash_file<T: Hash>(file_path: &Path) -> Result<T, std::io::Error> {
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    let mut file = File::open(file_path)?;
    let mut buffer = [0; 1024];

    while let Ok(bytes_read) = file.read(&mut buffer) {
        if bytes_read == 0 {
            break;
        }
        hasher.write(&buffer[..bytes_read]);
    }

    Ok(hasher.finish())
}

fn find_duplicates<T: Hash>(root: &Path) -> Result<HashMap<T, Vec<String>>, std::io::Error> {
    let mut file_hashes = HashMap::new();

    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let path = entry.path();

        if path.is_dir() {
            continue;
        }

        let hash = hash_file(&path)?;
        file_hashes.entry(hash).or_default().push(
            path.to_string_lossy().to_string()
        );
    }

    let duplicates = file_hashes
        .into_iter()
        .filter_map(|(_, files)| {
            if files.len() > 1 {
                Some((files[0].clone(), files[1..].to_vec()))
            } else {
                None
            }
        })
        .collect();

    Ok(duplicates)
}

fn main() -> Result<(), std::io::Error> {
    let duplicates = find_duplicates(Path::new("."))?;

    for (file, duplicates) in duplicates {
        println!("{} is duplicated in:", file);
        for duplicate in duplicates {
            println!("\t{}", duplicate);
        }
    }

    Ok(())
}

Reflections

This experiment is a follow-up to a ChatGPT prompt that asked for the same solution in Go. I wanted to try it in Rust because I figured there would be less Rust code for ChatGPT to train on, so I could get a feel for how well ChatGPT performs with less reference material. I also expected that a Rust version of this script could be faster and more memory efficient than the Go version.

I was initially impressed by the output, partly because ChatGPT offered a full response to the prompt on the first try. With some other code-generation prompts, ChatGPT stopped responding halfway through "typing" the code, so I had to re-submit the prompt and let ChatGPT try again.

The Good

I appreciate that ChatGPT is aware of:

  • how to import different Rust modules cleanly in a single import block
  • how to reference those modules from the rest of the code
  • how to give variables and functions meaningful names
  • how to follow indentation conventions in Rust (four spaces per indentation level)
  • how to identify duplicate files by comparing their hashes
  • how to use all of the different characters in the expansive Rust syntax

Clearly, ChatGPT has a strong grasp of where characters should go - there isn't a stray character anywhere. At first glance the code looks complete: all opening brackets, braces, and parentheses have a closing partner.

The Bad

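Unfortunately, ChatGPT seems to have a limited understanding of Rust's type system and how to use generics. The code fails to compile because type annotations are incorrect or missing altogether in specific places. For example, hash_file is declared to return a generic T, but it actually returns the u64 produced by hasher.finish(). Likewise, find_duplicates promises a HashMap<T, Vec<String>> but collects pairs whose key is a String (the first file path in each group), and main never says what T should be.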
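One way to fix the generics problem in hash_file is to drop the type parameter and return the concrete u64 that the hasher produces. The following is a minimal sketch under a few assumptions, not the code from my final program: it keeps ChatGPT's DefaultHasher purely for illustration, and the 8 KiB buffer size is an arbitrary choice. It also propagates read errors with ? instead of quietly ending the loop, which is what the original while let Ok(...) would do on a failed read.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::Path;

/// Hash a file's contents in chunks and return a concrete `u64`.
/// No generic parameter is needed: `Hasher::finish` always yields `u64`.
fn hash_file(file_path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut file = File::open(file_path)?;
    // Buffer size is an assumption; any reasonable chunk size works.
    let mut buffer = [0u8; 8192];

    loop {
        // `?` surfaces I/O errors to the caller instead of hiding them.
        let bytes_read = file.read(&mut buffer)?;
        if bytes_read == 0 {
            break; // end of file
        }
        hasher.write(&buffer[..bytes_read]);
    }

    Ok(hasher.finish())
}
```

With a concrete return type, callers no longer need to name a type for T, and the compiler can infer the key type of any HashMap that stores these hashes.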

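Additionally, this code only checks the files in the immediate directory: find_duplicates calls fs::read_dir once and skips every entry that is a directory instead of descending into it. My prompt asked for subdirectories to be scanned as well, which I think is the more common and realistic use case.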
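The recursive scan my prompt asked for can be sketched with the standard library alone. This is an illustrative sketch rather than the traversal from my final program: collect_files is a hypothetical helper name, and skipping symlinks is a simplifying assumption that avoids following symlink cycles.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect the paths of all regular files under `dir`.
fn collect_files(dir: &Path, files: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // `DirEntry::file_type` does not follow symlinks, so a symlink is
        // neither a directory nor a regular file here and is skipped.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            // Descend into the subdirectory instead of skipping it.
            collect_files(&path, files)?;
        } else if file_type.is_file() {
            files.push(path);
        }
    }
    Ok(())
}
```

Calling collect_files(Path::new("."), &mut files) with an empty Vec fills it with every regular file below the current directory, and each of those paths can then be hashed.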

Finally, this code uses std::collections::hash_map::DefaultHasher (through the std::hash::Hasher trait) to hash files. The algorithm behind this built-in hasher is unspecified and can change between Rust versions, so its hashes shouldn't be relied on across releases. It also produces only a 64-bit value and isn't designed for cryptographic use, so there's no guarantee that the underlying algorithm will be secure or performant for hashing file contents.

The Fix

To address these issues and actually get the program to compile, I relied on several sources of information:

  • Google (to find other sources)
  • Stack Overflow
  • The Rust Language Docs
  • Specific Rust Crate Docs
  • GitHub Copilot
  • VS Code rust-analyzer extension
  • ChatGPT (for more specific Rust questions)

Even though ChatGPT didn't give me a complete or valid solution, it did give me a solid starting point. As they say, you can't edit a blank page - it's better to start somewhere and iterate from there than to expect perfection on the first try. I found that ChatGPT did a good job of getting me past the "blank page" stage of the project and on to actually iterating toward my goal.

In the end, I was able to use the general framework provided by ChatGPT to create a working program. You can find it on my GitHub here: Rust: Find Duplicate Files. It uses the BLAKE3 algorithm for fast and secure hashing, and it iterates through subdirectories as my original prompt requested.
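The grouping step of that general framework can be sketched independently of the hashing algorithm. This is an illustrative sketch rather than a copy of the final program: group_duplicates is a hypothetical helper, and it assumes each file has already been hashed into some key type (a u64 in the usage below; a BLAKE3 digest could play the same role). Unlike ChatGPT's find_duplicates, the generic parameter here is fully determined by the input, so the compiler can infer it.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::path::PathBuf;

/// Group paths by their content hash and keep only the groups that
/// contain more than one file, i.e. the likely duplicates.
fn group_duplicates<K, I>(hashed: I) -> Vec<Vec<PathBuf>>
where
    K: Eq + Hash,
    I: IntoIterator<Item = (K, PathBuf)>,
{
    let mut by_hash: HashMap<K, Vec<PathBuf>> = HashMap::new();
    for (hash, path) in hashed {
        // Paths sharing a hash accumulate in the same bucket.
        by_hash.entry(hash).or_default().push(path);
    }

    by_hash
        .into_values()
        .filter(|paths| paths.len() > 1)
        .collect()
}
```

Each returned group lists files whose hashes match. With a short, non-cryptographic hash, a match only suggests identical content, which is one reason to prefer a stronger hash for this job.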