There's a drought of information on this topic on the internet, probably due in large part to the typical ZFS deployment ( i.e. large shops that have a ton of spare disks lying around, or have otherwise planned out a migration path beforehand ) rather than the small home user's situation.
So, here's what I did:
At a high level, we have to remember what sort of redundancy we've got at any given RAID level. More precisely, we need to know how many disks can fail before the whole thing falls apart.
When we've got a single drive, that drive can't die, or we lose everything ( obvious ). With a mirror, we can lose one drive, but not both. A 3-disk RAIDZ ( raid5 ) requires at least 2 operational disks out of 3. So, when moving from a 2-disk mirror to a 3-disk raidz, we obviously don't have enough disks to keep both pools fully populated, even after breaking the mirror down to a single disk.
But if we count the number of disks that are allowed to be dead at any given time ( 2: one from the mirror, one from the raidz ) against the number of disks we actually have ( 3 ), we can spread them out such that two redundancy-less pools coexist: one single-disk pool ( the broken mirror ) and one degraded raidz ( 2 disks + a missing member ).
So the procedure we'll use to reach that state is: break the mirror; create a raidz from the new disk, the freed mirror disk, and a placeholder device; copy the data over; destroy the old pool; and finally swap the placeholder for the old pool's remaining disk. The whole dance, in command form, looks roughly like the sketch below.
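( A rough sketch only, using the device and pool names from this post; each step is explained in detail afterwards: )
# zpool detach xenophanes c2d0                          ( break the mirror )
# zpool create heraclitus raidz c2d0 c4d1 /dev/lofi/1   ( new raidz, one member backed by a sparse file )
# zpool offline heraclitus /dev/lofi/1                  ( degrade it before anything lands on the sparse file )
( ...copy the data from /xenophanes to /heraclitus... )
# zpool destroy xenophanes                              ( retire the old pool )
# zpool replace heraclitus /dev/lofi/1 c3d0             ( swap the placeholder for the freed disk )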
For the purpose of demonstration, I'll use the disks I've got attached to the system: c2d0, c3d0, and c4d1.
First, the starting condition:
$ zpool status
  pool: xenophanes
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        xenophanes    ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c3d0      ONLINE       0     0     0
            c2d0      ONLINE       0     0     0

errors: No known data errors
Now, to break the mirror:
# zpool detach xenophanes c2d0
So what I've got now is a single-disk zpool consisting of c3d0, and two free disks, c2d0 and c4d1.
To create a raidz, you need 3 devices. We only have two. We can solve this problem, however, with sparse files and loopback.
Loopback allows you to use a file the same way you'd use any other block device in /dev. Linux has it ( mount -o loop ), Solaris has it ( lofiadm ). It's a pretty common thing.
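( Aside: on Linux the analogous step to the lofiadm command used below would be losetup; this is just a sketch with a placeholder path, not something used in the rest of this post. )
# losetup -f --show /path/to/disk.img    ( -f grabs the first free loop device, --show prints its name, e.g. /dev/loop0 )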
A sparse file is a file whose blocks aren't actually allocated until something writes to them; the filesystem just records the metadata and the file's nominal size. This allows us to do things like create a 140GB file on a 140GB disk with plenty of room to spare. And that's precisely what we'll do.
You can create a sparse file easily with dd(1) like so:
$ dd if=/dev/zero of=/xenophanes/disk.img bs=1024k seek=149k count=1
bs is the block size ( 1MB here ), seek is the number of blocks to skip before writing ( 149k blocks of 1MB each works out to roughly the size of the drive ), and count tells dd(1) to write a single block at the end. The result is a file whose apparent size matches the disk, but which takes up almost no space until something writes to it.
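If you want to convince yourself the file really is sparse, compare its apparent size with the space it actually occupies:
$ ls -l /xenophanes/disk.img    ( apparent size: the full ~149GB )
$ du -k /xenophanes/disk.img    ( kilobytes actually allocated: next to nothing )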
And we can create a device from it like so:
# lofiadm -a /xenophanes/disk.img
/dev/lofi/1
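Running lofiadm with no arguments lists the current file-to-device mappings, which makes for a quick sanity check:
# lofiadm    ( prints each lofi block device and the file behind it )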
So, to recap, here's what we have: the old zpool, two spare disks ( c2d0 and c4d1 ), and a sparse file the size of those disks hooked up through lofi. And if you'll notice, that's precisely the three devices we need.
From here on out, we need to create the raidz and then immediately degrade it ( otherwise we'd start filling up a sparse file that claims to be the same size as the real disks; the pool it actually lives on would run out of space, writes would fail, and it wouldn't be pretty ).
# zpool create heraclitus raidz c2d0 c4d1 /dev/lofi/1
Ta-da! A raidz. Now let's break it.
# zpool offline heraclitus /dev/lofi/1 && lofiadm -d /dev/lofi/1 && rm /xenophanes/disk.img
And here's what we get:
# zpool status
  pool: heraclitus
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        heraclitus       DEGRADED     0     0     0
          raidz1         DEGRADED     0     0     0
            /dev/lofi/1  OFFLINE      0     0     0
            c4d1         ONLINE       0     0     0
            c2d0         ONLINE       0     0     0

errors: No known data errors

  pool: xenophanes
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        xenophanes    ONLINE       0     0     0
          c3d0        ONLINE       0     0     0

errors: No known data errors
As you can see, heraclitus is degraded, but operational.
So we can just create our filesystems and copy the data over:
# zfs create heraclitus/home && zfs create heraclitus/opt
# cd /xenophanes/home && cp -@Rp * /heraclitus/home/ && cd /xenophanes/opt && cp -@Rp * /heraclitus/opt
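( Two asides here: the * glob won't pick up dotfiles, so anything hidden at the top level of those directories needs to be copied separately. And if home and opt are actual datasets on the old pool, zfs send/recv would do the same job while preserving snapshots and properties; a rough sketch with a made-up snapshot name, used in place of the zfs create and cp above: )
# zfs snapshot xenophanes/home@migrate
# zfs send xenophanes/home@migrate | zfs recv heraclitus/home    ( and likewise for opt )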
And go have a cup of coffee or twelve. When that's complete, we destroy the old pool:
# zpool destroy xenophanes
And replace the lofi device with the old zpool's disk:
# zpool replace heraclitus /dev/lofi/1 c3d0
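The replace kicks off a resilver onto c3d0; you can keep an eye on it with zpool status until the pool reports ONLINE again:
# zpool status heraclitus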
And there you have it: a 3-disk raidz out of a 2-disk mirror. No data juggling, tape drives, or extra disks necessary.
Comments:
Very nicely documented. I was expecting you to use zfs send/recv to migrate the data, though. Any reason you didn't?
None in particular, just preference I guess.
Indeed. Great post, well written and tremendously useful.
I'm having trouble making the sparse file exactly the same size as the other disks; zfs complains when they are not the same. Can you give me any pointers?
Think something like this is possible for migrating zpools among hosts via network connections? I think I'll try it with iscsi later on today.
John S, I've been looking for clarification on precisely what type(s?) of files can be used as ZFS vdev-elements (vdev-components), if/when using files instead of entire disks or disk-slices (for testing or experiments or something like your example). I've not been able to find anything definitive but your post comes closest, so far. I was assuming that the "mkfile" command --which can be used to create files for use as (part of) swap-space-- would be the appropriate way to create such files for ZFS vdev-components. Now I see your example of using "dd" to create a sparse file. Do you know whether or not this "dd" technique is the only way to do this? Do you know whether or not mkfile-created files would also work? Thanks! --John R Avery
This technique implies that resilvering is automatic once the real disk is added.
I have a 3-disk RAID and I want to add a 1-disk member to the RAID.
So the technique might work, assuming the resilver goes well.
Anyone tried this? I'm at the edge of my experience, but will put in the time to compare.