Wednesday, January 2, 2008

How to turn a mirror into a RAIDZ

People occasionally ask on the mailing lists and in #opensolaris how to add a disk to a zfs mirror to make a raidz. Today, I received in the mail a new SATA controller and a new disk, so I was left in the same circumstance.

There's a drought of information on the topic on the internet, probably because the typical ZFS deployment is a large shop that has a ton of spare disks lying around, or has otherwise planned out a migration path beforehand, rather than a small home user.

So, here's what I did:

On a high level, we have to remember what sort of replication we've got for any given RAID level. More accurately, we need to know how many disks can be broken before the whole thing falls apart.

When we've got a single drive, that drive can't die, or we lose everything (obvious). A 2-disk mirror survives one dead drive, but not two. A 3-disk RAIDZ ( raid5 ) requires at least 2 operational disks out of 3. So, when moving from a 2-disk mirror to a 3-disk raidz with only 3 disks in total, we obviously can't have both pools fully populated at once: even if we break the mirror down to a single disk, that disk plus a complete raidz would take 4.

But if we count the disks the two pools can run without ( 1 for the mirror and 1 for the raidz, so 2 ) against the number we actually have ( 3 ), we can spread them out such that two degraded pools exist at the same time: one single-disk pool ( the broken mirror ) and one degraded raidz ( 2 disks + a missing third ).
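The counting above can be written out as a few lines of shell arithmetic. This is only a sketch of the reasoning; the variable names are made up for illustration, and the numbers are the ones from the paragraph.

```shell
# Disk budget for the migration (numbers taken from the text above).
disks_available=3

mirror_disks=2   # old pool: 2-way mirror
mirror_spare=1   # a 2-way mirror keeps working with 1 disk missing
raidz_disks=3    # new pool: 3-disk raidz (single parity)
raidz_spare=1    # a single-parity raidz keeps working with 1 disk missing

# Disks needed with both pools complete, and with both pools degraded.
full=$((mirror_disks + raidz_disks))
degraded=$((full - mirror_spare - raidz_spare))

echo "both pools complete: $full disks"
echo "both pools degraded: $degraded disks"

if [ "$degraded" -le "$disks_available" ]; then
    echo "fits in $disks_available disks"
else
    echo "does not fit in $disks_available disks"
fi
```

Both pools complete would take 5 disks, both degraded take 3, which is exactly what's on hand.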

So the procedure we'd use to attain this state is: break the mirror, create a raidz from the new disk, the detached mirror disk, and a stand-in for the third member, degrade the raidz by removing the stand-in, copy the data over, destroy the old pool, and hand its remaining disk to the new raidz in place of the stand-in.

For the purpose of demonstration, I'll use the disks I've got attached to the system: c2d0, c3d0, and c4d1.

First, the starting condition:

$ zpool status

pool: xenophanes
state: ONLINE
scrub: none requested

xenophanes ONLINE 0 0 0
c3d0 ONLINE 0 0 0
c2d0 ONLINE 0 0 0

errors: No known data errors

Now, to break the mirror:

# zpool detach xenophanes c2d0

So what I've got now is a single-disk zpool made up of c3d0, and two free disks, c2d0 and c4d1.

To create a raidz, you need 3 devices. We only have two. We can solve this problem, however, with sparse files and loopback.

Loopback allows you to use a file the same way you'd use any other block device in /dev. Linux has it ( mount -o loop ), Solaris has it ( lofiadm ). It's a pretty common thing.
A sparse file is a type of file where the filesystem records the file's size but only allocates blocks for the parts that have actually been written. The rest of the contents aren't stored until you begin to write to them. This allows us to do things like create a 140GB file on a 140GB disk with plenty of room to spare. And that's precisely what we'll do.

You can create a sparse file easily with dd(1) like so:

$ dd if=/dev/zero of=/xenophanes/disk.img bs=1024k seek=149k count=1

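bs is the block size, 1024k ( i.e. 1MB, not 1kb ). seek is the number of blocks to skip in the output file before writing ( 149k blocks of 1MB each, so roughly 149GB; pick a value that matches the size of your drive ), and count tells dd(1) to copy one block. Only that one block is actually written, so everything before it stays sparse.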
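As a rough check of that arithmetic, and of the claim that the skipped region takes up no space, here's a small sketch. The first half works out the size the dd(1) command above produces, assuming the k suffix means a multiple of 1024. The second half builds a scaled-down sparse file in a scratch directory ( the temporary path and the 1MB size are just for illustration ) and compares its apparent size with what the filesystem has really allocated.

```shell
# Size produced by: dd bs=1024k seek=149k count=1
bs=$((1024 * 1024))     # 1024k bytes per block
seek=$((149 * 1024))    # 149k blocks skipped before writing
count=1                 # one block actually written
full_size=$(( (seek + count) * bs ))
echo "full-size image: $full_size bytes"

# Scaled-down version of the same trick: skip 1024 blocks of 1 KiB,
# then write a single 1 KiB block of zeros at the end.
scratch=$(mktemp -d)
img="$scratch/disk.img"
dd if=/dev/zero of="$img" bs=1024 seek=1024 count=1 2>/dev/null

apparent=$(wc -c < "$img" | tr -d ' ')           # size the file claims
allocated_kb=$(du -k "$img" | awk '{print $1}')  # space really in use

echo "apparent size: $apparent bytes"
echo "allocated:     $allocated_kb KiB"

rm -rf "$scratch"
```

On a filesystem that supports sparse files, the allocated figure should come out far smaller than the apparent size. To size the image for a different drive, change seek and rerun the first half.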

and we can attach that file as a block device like so:

# lofiadm -a /xenophanes/disk.img

So, to recap, here's what we have. We have a zpool, two spare disks ( c2d0 and c4d1 ) and a sparse file the size of those disks hooked up with lofi. And if you'll notice, that's precisely what we need.

From here on out, we need to create the raidz and then immediately degrade it by taking the sparse file offline ( otherwise we'll start filling up a sparse file that claims to be as big as the disk it lives on, it'll run out of space, stuff will break... it won't be pretty ).

# zpool create heraclitus raidz c2d0 c4d1 /dev/lofi/1

ta da! a raidz. Now let's break it.

# zpool offline heraclitus /dev/lofi/1 && lofiadm -d /dev/lofi/1 && rm /xenophanes/disk.img

and here's what we get:

# zpool status
pool: heraclitus
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: none requested

heraclitus DEGRADED 0 0 0
raidz1 DEGRADED 0 0 0
/dev/lofi/1 OFFLINE 0 0 0
c4d1 ONLINE 0 0 0
c2d0 ONLINE 0 0 0

errors: No known data errors

pool: xenophanes
state: ONLINE
scrub: none requested

xenophanes ONLINE 0 0 0
c3d0 ONLINE 0 0 0

errors: No known data errors

as you can see, heraclitus is degraded, but operational.

so, we can just create our filesystems and copy data over

# zfs create heraclitus/home && zfs create heraclitus/opt
# cd /xenophanes/home && cp -@Rp * /heraclitus/home/ && cd /xenophanes/opt && cp -@Rp * /heraclitus/opt

and go have a cup of coffee or 12. When that's complete, we destroy the old pool

# zpool destroy xenophanes

and replace the lofi disk with the old zpool's disk

# zpool replace heraclitus /dev/lofi/1 c3d0

And there you have it: a 3-disk raidz out of a 2-disk mirror. No data juggling, tape drives, or extra disks necessary.
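If you'd rather see the whole thing in one place, here's a rough recap as a script. It's a sketch, not something I ran as-is: the run and migrate helpers and the variable names are made up for illustration, the pool and device names are the ones from this walkthrough, and /dev/lofi/1 is assumed to be the device lofiadm hands back. By default it only prints each step.

```shell
#!/bin/sh
# Dry-run recap of the mirror -> raidz migration described in this post.
# With DRY_RUN=1 (the default) every step is only printed.
# With DRY_RUN=0 the commands are executed for real, including
# 'zpool destroy' on the old pool, so read it through first.

DRY_RUN=${DRY_RUN:-1}

OLD_POOL=xenophanes
NEW_POOL=heraclitus
KEEP_DISK=c3d0            # stays in the old pool until the copy is done
FREED_DISK=c2d0           # detached from the mirror
NEW_DISK=c4d1             # the newly bought disk
IMG=/$OLD_POOL/disk.img   # sparse placeholder file
LOFI=/dev/lofi/1          # assumption: the device lofiadm -a returns

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate() {
    # Break the mirror, leaving the old pool on a single disk.
    run zpool detach "$OLD_POOL" "$FREED_DISK" &&
    # Build the sparse placeholder and attach it as a block device.
    run dd if=/dev/zero of="$IMG" bs=1024k seek=149k count=1 &&
    run lofiadm -a "$IMG" &&
    # Create the raidz, then degrade it before any data is written.
    run zpool create "$NEW_POOL" raidz "$FREED_DISK" "$NEW_DISK" "$LOFI" &&
    run zpool offline "$NEW_POOL" "$LOFI" &&
    run lofiadm -d "$LOFI" &&
    run rm "$IMG" &&
    # Recreate the filesystems and copy the data across.
    run zfs create "$NEW_POOL/home" &&
    run zfs create "$NEW_POOL/opt" &&
    run sh -c "cd /$OLD_POOL/home && cp -@Rp * /$NEW_POOL/home/" &&
    run sh -c "cd /$OLD_POOL/opt && cp -@Rp * /$NEW_POOL/opt/" &&
    # Retire the old pool and give its disk to the raidz.
    run zpool destroy "$OLD_POOL" &&
    run zpool replace "$NEW_POOL" "$LOFI" "$KEEP_DISK"
}

migrate
```

The steps are chained with && so that a failure anywhere stops the run before the old pool gets destroyed.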


Darren Moffat said...

Very nicely documented. I was expecting you to use zfs send/recv to migrate the data though, any reason you didn't ?

JohnS said...

None in particular, just preference I guess.

stevel said...

Indeed. Great post, well written and tremendously useful.

Nigel said...

I'm having trouble making the sparse file to be the exact size of the other disks. zfs complains when they are not the same. Can you give me any pointers ?

Dr. Kenneth Noisewater said...

Think something like this is possible for migrating zpools among hosts via network connections? I think I'll try it with iscsi later on today..

John said...

John S, I've been looking for clarification on precisely what type(s?) of files can be used as ZFS vdev-elements (vdev-components), if/when using files instead of entire disks or disk-slices (for testing or experiments or something like your example). I've not been able to find anything definitive but your post comes closest, so far. I was assuming that the "mkfile" command --which can be used to create files for use as (part of) swap-space-- would be the appropriate way to create such files for ZFS vdev-components. Now I see your example of using "dd" to create a sparse file. Do you know whether or not this "dd" technique is the only way to do this? Do you know whether or not mkfile-created files would also work? Thanks! --John R Avery

stopher said...

this technique infers that silvering is automated upon adding the real disk.

i have a 3 disk RAID and I want to add a 1 disk member to the RAID.

so the technique might work, assuming the resilver is good.

anyone try this ? i am at the edge of my experience, but will put in the time to compare.
