Recovering deleted source code on a Mac

You should have backed up. You should have used version control. You didn’t. Now what should you do to get the damn file back?

I’ve just been in that situation, so let’s walk through it together. I hope this saves you time and nerves.

The situation

As you were working on your project, each time you saved a file, a new file was created and the old one deleted. That’s the way it’s usually done today (for safety), and the effect is that, as far as the actual file contents (not the abstract file system) are concerned, new versions of files do not overwrite the old ones. So chances are that, somewhere in the sea of bytes of your unused disk space, there are multiple versions of your deleted files.
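To make the "new file, then delete the old one" behaviour concrete, here is a minimal Python sketch of the general write-then-rename pattern. It is an assumption about how a typical editor saves, not a description of any particular one, and the name safe_save is made up for illustration.

```python
import os
import tempfile


def safe_save(path, data):
    """Write data to a new temporary file, then rename it over path.

    The old contents are never overwritten in place: their blocks are
    merely released, which is why old versions can linger on disk.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new version next to the target so the rename stays
    # on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Swap the new file in; the old file's data blocks become free space.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Every call leaves the previous version's bytes somewhere in unallocated space until something else happens to reuse those blocks.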

In the case of a damaged volume or file system, the situation is essentially the same.

For the sake of simplicity I will consider recovering a single source code file from now on. Generalize as necessary.

Hypothetically the file could consist of several fragments, but given today’s file systems and the typically small size of source code files, it’s highly unlikely.

The approach

There are file recovery tools that use file system structure, which has the hypothetical advantage of recovering files including their file name, metadata etc. So I first gave M3 Data Recovery a shot. It didn’t find any version of my file on my HFS+ drive. Actually the number of files it discovered was relatively low. I guess only some of the recently deleted files can be recovered this way.

So instead I opted for a simple but effective data carving tool, Foremost.

Note: Foremost probably works on any Unix-like OS and is relatively file-system-agnostic. But the details of this guide apply particularly to a Mac environment, i.e. OS X.

The steps

Step 1: Stop using that file system if possible. If the drive is physically damaged you may want to copy it to another drive or image using dd. Otherwise you may as well just boot from another drive and not mount this one. If this is a low-stakes recovery you may continue using the file system. I had actually been using mine for several days without noticing that the file had been deleted. Just minimize writes to the file system.

Step 2: Download, build and install the foremost tool. The simplest way is to use MacPorts: run sudo port install foremost.

Note: Use sudo only if you know what you are doing and are prepared to bear the consequences.

Step 3: Configure foremost to find your files by creating a foremost.conf file. Foremost supports several file types out of the box, but I was searching for a Python script with a specific header (which happened to be quite unique on my file system).

So I created a foremost.conf file with a rule made of these tab-separated fields:

  • py will be used as an extension for the recovered files (you can choose whatever you like).
  • y means case-sensitive header.
  • 100000 is the number of bytes to dump starting from the header (as I didn’t have a reliable footer of the file).
  • #!/usr/bin/env\spython3 is the header itself, in which ‘\s’ denotes a space.

I did not use the ASCII options as my source file was not ASCII (UTF-8 with lots of Japanese characters). If you need more information about the config file, have a look at the tool’s man page (man foremost) and the sample configuration file (in /opt/local/etc/foremost.conf.sample if you use MacPorts).
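Tabs and the \s escape are easy to get wrong in an editor, so here is a small Python sketch that assembles such a rule from the four fields above. The function names are made up for illustration, and it only handles the header-without-footer case described here.

```python
def foremost_rule(extension, case_sensitive, max_size, header):
    """Build one tab-separated foremost.conf rule (no footer field)."""
    fields = [
        extension,                      # extension for the recovered files
        "y" if case_sensitive else "n", # case-sensitive header?
        str(max_size),                  # bytes to dump starting at the header
        header.replace(" ", r"\s"),     # spaces must be written as \s
    ]
    return "\t".join(fields)


def write_conf(path, rule):
    """Write a single rule to a config file at path."""
    with open(path, "w", encoding="ascii") as conf:
        conf.write(rule + "\n")


if __name__ == "__main__":
    print(foremost_rule("py", True, 100000, "#!/usr/bin/env python3"))
```

Calling write_conf("foremost.conf", ...) with that rule would produce a file matching the description above.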

Step 4: Carve the data. You will need the config you just created, a directory to recover files to (on another volume), and the path to your file system’s device. Use mount to get the last one.

In my case, the output of the mount command tells me that my startup volume (root directory) is on the /dev/disk1 block device, but with foremost you actually have to use the corresponding /dev/rdisk1 character device. (Just prepend r to the device name.)
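If you’d rather not eyeball it, the lookup can be sketched in Python. The sample text below is illustrative only (it is not my actual output), and the parsing assumes the usual OS X line shape of "device on path (options)"; both function names are made up.

```python
def device_for_mount_point(mount_output, mount_point="/"):
    """Return the block device mounted at mount_point, or None if absent."""
    for line in mount_output.splitlines():
        # Each line is expected to look like: <device> on <path> (<options>)
        device, separator, rest = line.partition(" on ")
        if not separator:
            continue
        path = rest.rsplit(" (", 1)[0]
        if path == mount_point:
            return device
    return None


def raw_device(block_device):
    """Turn /dev/diskN into /dev/rdiskN by prepending r to the name."""
    prefix = "/dev/"
    if not block_device.startswith(prefix):
        raise ValueError("not a /dev path: " + block_device)
    return prefix + "r" + block_device[len(prefix):]


# Illustrative sample, not real output from my machine.
SAMPLE_MOUNT_OUTPUT = """\
/dev/disk1 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
"""

if __name__ == "__main__":
    print(raw_device(device_for_mount_point(SAMPLE_MOUNT_OUTPUT)))
```

To run it against the real thing, feed it subprocess.run(["mount"], capture_output=True, text=True).stdout instead of the sample.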

Now let’s run foremost:
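The exact invocation depends on your paths, so here is a hedged Python sketch that just assembles and prints the command. The output directory /Volumes/Other/recovered is a hypothetical placeholder, and I’m assuming foremost’s -c (config file), -o (output directory) and -i (input) options alongside the -v and -q flags.

```python
import shlex


def foremost_command(config, output_dir, raw_device, verbose=True, quick=True):
    """Return the argument list for a foremost run over a raw device."""
    argv = ["sudo", "foremost"]
    if verbose:
        argv.append("-v")
    if quick:
        argv.append("-q")
    # -c: config file, -o: output directory, -i: device or image to carve
    argv += ["-c", config, "-o", output_dir, "-i", raw_device]
    return argv


if __name__ == "__main__":
    # Placeholder paths: substitute your own before running anything.
    command = foremost_command(
        "foremost.conf", "/Volumes/Other/recovered", "/dev/rdisk1"
    )
    print(shlex.join(command))
```

It only prints the command; copy it into a terminal once you have checked every path twice.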

Note: Now things might hypothetically get scary with sudo. Back up everything and make sure you know what you’re doing. ;-)

The -v and -q flags mean verbose (not too verbose, recommended) and quick. In quick mode only the start of each sector is searched for matching headers. Use it if you know that’s enough for you, which it should be if your header really is what the file starts with.

There’s also the -d flag. The man page says “Turn on indirect block detection, this works well for Unix file systems.” This is probably worth trying for larger, fragmented files, but I doubt it works with HFS+ file systems. (Give it a try if you use ext2.)

Step 5: Clean up the files. The files contain some zeros and garbage after the source code. Here’s a simple Python script to copy the contents of a file A before the first zero to file B:
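A minimal sketch of such a script, assuming each carved file fits comfortably in memory (with the config above they are at most 100000 bytes). It works on raw bytes, so UTF-8 source survives intact, since UTF-8 text never contains a zero byte of its own.

```python
import sys


def strip_after_first_zero(data):
    """Return the bytes before the first zero byte (all of data if none)."""
    end = data.find(b"\x00")
    return data if end == -1 else data[:end]


def copy_until_zero(src_path, dst_path):
    """Copy src_path to dst_path, stopping before the first zero byte."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_after_first_zero(data))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        copy_until_zero(sys.argv[1], sys.argv[2])
    else:
        print("usage: python scriptname.py file_A file_B", file=sys.stderr)
```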

Invoke as python scriptname.py file_A file_B.

The results

Now you have quite a few files (roughly 256 in my case) and it’s up to you to find the right one. Note that file metadata (such as file modification dates) is not recovered using this method.

In the next step I used MD5 hashes to identify duplicate files (just a few in my case), then separated syntactically invalid files, sorted the files by size and started diffing… But that’s up to you.
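That triage can be sketched in Python as follows. This is a reconstruction under assumptions rather than the exact commands I ran: it treats the cleaned-up files as Python source, uses ast.parse as the syntax check, and the function names are made up.

```python
import ast
import hashlib
from pathlib import Path


def is_valid_python(source):
    """Return True if source (bytes or str) parses as Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers leftover zero bytes and undecodable input.
        return False
    return True


def triage(directory):
    """Drop MD5 duplicates, then split files into (valid, invalid).

    Both lists are sorted by file size, largest first.
    """
    seen = set()
    valid, invalid = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            continue  # identical content was already listed
        seen.add(digest)
        (valid if is_valid_python(data) else invalid).append(path)

    def by_size(p):
        return p.stat().st_size

    valid.sort(key=by_size, reverse=True)
    invalid.sort(key=by_size, reverse=True)
    return valid, invalid
```

From there, diffing neighbouring entries of the valid list is usually enough to spot the most recent version.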

Happy recovery!

PS: Here’s a page about using foremost on Linux and about other similar recovery tools.