Wednesday, January 2, 2008

How to turn a mirror into a RAIDZ

People occasionally ask on the mailing lists and in #opensolaris how to add a disk to a ZFS mirror to turn it into a raidz. Today, a new SATA controller and a new disk arrived in the mail, so I found myself in exactly that circumstance.

There's a drought of information on the topic on the internet, probably due in large part to the typical deployment of ZFS ( i.e. large shops that have a ton of spare disks lying around, or have otherwise planned out a migration path beforehand ), rather than the small home user.

So, here's what I did:

At a high level, we have to remember what sort of replication we've got for any given RAID level. More precisely, we need to know how many disks can die before the whole thing falls apart.

When we've got a single drive, that drive can't die, or we lose everything (obvious). With a two-disk mirror, we can survive one dead drive, but not two. A 3-disk RAIDZ ( raid5 ) requires at least 2 operational disks out of 3. So, when moving from a 2-disk mirror to a 3-disk raidz, we obviously don't have enough disks to keep both pools fully populated at once, even if we break the mirror down to a single disk.

But if we count the number of disks that are allowed to be dead across both pools at any given time ( 2: one from the mirror, one from the raidz ) against the number of disks we actually have ( 3 ), we can spread them out such that two degraded pools coexist: one single-disk pool ( the broken mirror ) and one degraded raidz ( 2 disks + a missing device ).

So the procedure we'll use to reach this state is: break the mirror, build a raidz out of the new disk, the freed mirror disk, and a stand-in third device, copy the data over, destroy the old pool, and finally swap the old pool's last disk in for the stand-in. The condensed version is sketched just below; the rest of this post walks through each step.

For the purpose of demonstration, I'll use the disks I've got attached to the system: c2d0, c3d0, and c4d1.
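
Assuming those device names, the whole dance in condensed form looks something like this ( don't run it blind; each step is explained below ):

# zpool detach xenophanes c2d0                          # 1. break the mirror
$ dd if=/dev/zero of=/xenophanes/disk.img bs=1024k seek=149k count=1   # 2. make a sparse stand-in
# lofiadm -a /xenophanes/disk.img                       # 3. expose it as /dev/lofi/1
# zpool create heraclitus raidz c2d0 c4d1 /dev/lofi/1   # 4. build the raidz
# zpool offline heraclitus /dev/lofi/1                  # 5. degrade it right away
# ... create filesystems and copy the data over ...     # 6.
# zpool destroy xenophanes                              # 7. drop the old pool
# zpool replace heraclitus /dev/lofi/1 c3d0             # 8. resilver onto the freed disk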

first, the starting condition:

$ zpool status
  pool: xenophanes
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        xenophanes    ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c3d0      ONLINE       0     0     0
            c2d0      ONLINE       0     0     0

errors: No known data errors


Now, to break the mirror:

# zpool detach xenophanes c2d0


so, what I've got now is a single-disk zpool consisting of c3d0, and two free disks, c2d0 and c4d1.

To create a raidz, you need 3 devices. We only have two. We can solve this problem, however, with sparse files and loopback.

Loopback allows you to use a file the same way you'd use any other block device in /dev. Linux has it ( mount -o loop ), Solaris has it ( lofiadm ). It's a pretty common thing.
A sparse file is a file for which the filesystem records the size and metadata, but doesn't actually allocate blocks for regions that have never been written; the unwritten regions simply read back as zeros. This allows us to do things like create a 140GB file on a 140GB disk with plenty of room to spare. And that's precisely what we'll do.

You can create a sparse file easily with dd(1) like so:

$ dd if=/dev/zero of=/xenophanes/disk.img bs=1024k seek=149k count=1


bs is the block size ( 1024k, i.e. 1MB ). seek is the number of blocks to skip before writing ( 149k blocks of 1MB each, which works out to roughly the size of the drive ), and count tells dd(1) to write a single block at the very end, which is what makes the file's apparent size stick.
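
If you want to convince yourself the file really is sparse, compare its apparent size with the space it actually occupies:

$ ls -lh /xenophanes/disk.img
$ du -h /xenophanes/disk.img

ls reports the full ~149GB apparent size, while du shows only the single block we actually wrote.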

and we can create a device like so:

# lofiadm -a /xenophanes/disk.img
/dev/lofi/1
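
( a handy aside: if you ever lose track of what's hooked up where, running lofiadm with no arguments lists the current file-to-device mappings:

# lofiadm
)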


So, to recap, here's what we have: a zpool, two spare disks ( c2d0 and c4d1 ), and a sparse file the size of those disks hooked up with lofi. And if you'll notice, that's precisely the three devices we need.

From here on out, we need to create the raidz and then immediately degrade it ( otherwise data will start landing on the sparse file; since it claims to be the same size as the real disks, it'll run the filesystem it lives on out of space, stuff will break... it won't be pretty ).

# zpool create heraclitus raidz c2d0 c4d1 /dev/lofi/1

ta da! a raidz. Now let's break it.

# zpool offline heraclitus /dev/lofi/1 && lofiadm -d /dev/lofi/1 && rm /xenophanes/disk.img

and here's what we get:

# zpool status
  pool: heraclitus
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        heraclitus       DEGRADED     0     0     0
          raidz1         DEGRADED     0     0     0
            /dev/lofi/1  OFFLINE      0     0     0
            c4d1         ONLINE       0     0     0
            c2d0         ONLINE       0     0     0

errors: No known data errors

  pool: xenophanes
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        xenophanes    ONLINE       0     0     0
          c3d0        ONLINE       0     0     0

errors: No known data errors


as you can see, heraclitus is degraded, but operational.

so, we can just create our filesystems and copy the data over:

# zfs create heraclitus/home && zfs create heraclitus/opt
# cd /xenophanes/home && cp -@Rp * /heraclitus/home/ && cd /xenophanes/opt && cp -@Rp * /heraclitus/opt
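
One caveat worth knowing: a bare * glob won't match dotfiles at the top level of the directory, so check that nothing hidden got left behind. If home and opt are their own ZFS filesystems on xenophanes, a snapshot-and-send would sidestep that entirely ( and carry filesystem properties along too ). A rough sketch, in which case you'd skip the zfs create step, since recv creates the target itself:

# zfs snapshot xenophanes/home@migrate
# zfs send xenophanes/home@migrate | zfs recv heraclitus/home
# zfs snapshot xenophanes/opt@migrate
# zfs send xenophanes/opt@migrate | zfs recv heraclitus/opt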


and go have a cup of coffee or 12. When that's complete, we destroy the old pool

# zpool destroy xenophanes


and replace the lofi disk with the old zpool's disk

# zpool replace heraclitus /dev/lofi/1 c3d0
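
the replace kicks off a resilver onto c3d0; the pool stays online ( and degraded ) while it runs. You can keep an eye on the progress with:

# zpool status heraclitus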


and there you have it: a 3-disk raidz out of a 2-disk mirror, with no data juggling, tape drives, or extra disks necessary.
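
Once the resilver finishes, it doesn't hurt to run a scrub so ZFS verifies every checksum on the new pool:

# zpool scrub heraclitus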