LVM on RAID

An introduction, overview, and quick reference to using LVM on RAID with Linux.

Michael J Evans is a long time Linux user and home server/network administrator with a degree in Computer Engineering Technology. <mjevans1983 AT gmail DOT com>

Revision History

0.0.3   2009-08-11   mje   New exporter and editorial improvements.
0.0.1   2009-07-22   mje   Style and peer review draft, basic topics complete.
0.0.0   2009-06-23   mje   Draft copy, no polishing passes. Not yet complete.

Introduction

Preface

This document shows how to combine RAID and LVM at a basic level, and presents commonly useful command options at each stage. The manual pages are most useful when someone already has a good concept of what each subsystem does and a good idea of how to combine them to produce a desired result. Sometimes a more topic-based reference, or instruction guide, is desired. That is precisely what this HOWTO aims to be: a good introduction and quick topical reference guide.

KB, KiB etc?

I am aware that the iB units exist. However, I've never liked the way they sound, and few programs allow their use as input. Computers operate most effectively in their native base: binary. In binary, engineering units are inherently base 2. Applying base-10 engineering units to such systems is inefficient at best, and quite likely irresponsible. With storage devices from RAM to hard drives using, and even being addressed in, power-of-two blocks, it doesn't make sense to consider an alternate alignment.

That is why, in this guide, and in the manuals of any programs mentioned in it, you should assume data units are powers of two; KB means the same thing as KiB. As a general rule, sizes are binary units unless someone is selling capacity; exceptions are only those explicitly stated in writing.

Copyright © 2009-06-23, 2009-07-22, 2009-08-11 by Michael Evans.

You are free:

Any of these conditions can be waived if you get permission from the author.

Disclaimer

Use the information in this document at your own risk. I disavow any potential liability for the contents of this document. Use of the concepts, examples, and/or other content of this document is entirely at your own risk.

All copyrights are held by their respective owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark.

Naming of particular products or brands should not be seen as endorsements.

You are strongly advised to make a backup of your system and to verify and update your backups at regular intervals.

Credits, Contributions, Resources

I would like to thank everyone responsible for creating and maintaining the software RAID and logical volume management kernel and user-space tools for Linux.

I would also like to thank the following guide writers.

**FIXME**

Additional resources beyond the Guides, HOWTOs, and FAQs may be found at these locations.

Basic Concepts

Meta-data / Headers / Labels

There are currently (at least) two popular meta-device systems used in Linux. Software RAID is provided by md. Dynamic mapping of logical volumes across physical ones is provided by dm.

Each of these systems requires its own header and, optionally, further accounting data. The location of the header varies from version to version: some are placed at the end of the device, others at the start, and others 4KB from the start.

RAID (md) version 1 meta-data is at least 128KB (more if optional extras are used, but those are beyond the scope of this HOWTO at this time).

LVM (version 2) labels appear to use -at least- 32KB at the start of a device. **FIXME** I've yet to find a reference that describes how much space will be used.

Redundant Array of Independent Devices

Historically known as Redundant Array of Inexpensive Disks and marketed by vendors as Redundant Array of Independent Disks, I feel a more proper name is Redundant Array of Independent Devices. This name stresses the four key components of modern RAID solutions. First, for any non-zero (non-null) RAID level, redundancy is involved to protect against data loss where possible. Second, an array of multiple devices is involved, further partitioning the possible points of failure. Third, where possible those devices should be independent of each other, both to boost performance and so that they continue to operate even when another device in the array fails. Finally, 'devices' is used instead of 'disks' or 'drives' to reflect the reality that -any- block device, potentially even a character device, might be used.

There are four commonly used levels of RAID and three historic levels.

RAID 0 – Striping only, no redundancy

RAID 1 – Mirrored devices

RAID 5 – Striping with a rotating parity device (each stripe has one parity chunk)

RAID 6 – Striping and rotating parity such that any two devices can fail safely

RAID 4 – (In my opinion) Historic; Striping with a dedicated parity device (each stripe has one parity chunk)

RAID 2 – Historic; see references on raid

RAID 3 – Historic; see references on raid

LVM Components

The Linux logical volume manager allows administrators to dynamically create, shrink, grow, and delete partitions without shifting underlying data on physical devices. As an example, if you were using LVM for all your partitioning and /usr ran out of space while /tmp was always too large, you could shift that free space. First shrink the file-system on /tmp. Then shrink /tmp's logical volume accordingly. Next add the freed extents to /usr's volume. Finally extend /usr's file-system to use the re-allocated space. Or, if you add a new storage device, you could add the new free space and extend any or all of your file-systems.
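
As a rough sketch of that /tmp-to-/usr example (my illustration, not a fixed recipe; it assumes ext4 on logical volumes named tmp and usr in a volume group named vg, all of which are placeholder names):

# shrink the file-system first, then its logical volume (the file-system must be unmounted)
umount /tmp
e2fsck -f /dev/vg/tmp
resize2fs /dev/vg/tmp 1G
lvreduce -L 1G /dev/vg/tmp
mount /tmp
# grow the other volume first, then its file-system (ext3/4 can grow while mounted)
lvextend -L +1G /dev/vg/usr
resize2fs /dev/vg/usr

Note the ordering: shrink the file-system before its volume, and grow the volume before its file-system; this is covered again in the Maintenance chapter.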

LVM is split into small, simple pieces to reduce complexity and failure points.

Chunks and Stripes

For most RAID levels in use today, data is split into building blocks called chunks. A typical chunk size is 64KB; it must usually be a power of two (1, 2, 4, 8 ... KB) and a multiple of the block device's block size.

A stripe is composed of one chunk per device; the size of a stripe is thus always a multiple of the chunk size.
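
For example (a worked case of my own): a 5-device RAID 5 with 64KB chunks has a 5 * 64KB = 320KB stripe, of which 4 * 64KB = 256KB holds data and one 64KB chunk holds parity.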

For RAID 1 there is no striping, since each device contains a full copy. All other modern RAID levels use stripes.

Performance considerations are a factor when striping. Even on devices without parity, large writes are most efficient when distributed nearly evenly across all the data drives. Reads benefit even more: using multiple devices yields higher sustained and higher burst read rates. A smaller stripe size may make more sense for applications that try to place a small number of records on each device.

Performance should be evaluated from the most critical and the most common cases. Larger chunk sizes, up to a point, increase performance on large writes, such as archives, videos, and even large graphics or compressed audio files. My gut feeling is that a reasonable chunk size, like the 64KB default that mdadm currently uses, combined with a flush-every-few-seconds policy, would be good enough for most applications.

RAID 0

Striping without parity.

Pros:

Cons:

Example, 4 devices:

Dev0    Dev1    Dev2    Dev3
A       B       C       D
E       F       G       H

RAID 1

Mirror (duplicate) a device.

Pros:

Cons:

Example, 4 devices (3x redundancy, unrealistic, usually only 1 or 2x redundancy):

Dev0    Dev1    Dev2    Dev3
A       A       A       A
B       B       B       B

RAID 5

Striping with rotating parity.

Pros:

Cons:

Example, 5 devices.

Dev0    Dev1    Dev2      Dev3      Dev4
A       B       C         D         P (A-D)
E       F       G         P (E-H)   H

RAID 6

Striping with rotating parity, 2x.

Pros:

Cons:

Example, 5 devices.

Dev0    Dev1    Dev2        Dev3        Dev4
A       B       C           P1 (A-C)    P2 (A-C)
D       E       P1 (D-F)    P2 (D-F)    F

My views on RAID 5-6 vs. RAID 10 (1+0)

In my experience RAID 5 and 6 are more useful for large, read-mostly data, where writes arrive either in cached batches of small files or as large writes spanning multiple stripes. RAID 1+0 (either several RAID 1 mirrors used as the members of a RAID 0 stripe, or two RAID 0 stripes built into a mirror) is typically thought to offer the highest performance, at the cost of literally doubling everything to attain redundancy. As a middle ground between duplicating whole systems and merely duplicating storage to reduce downtime, it is cost effective for applications seeking extra speed at reasonable cost.

Raid 5 and 6 also begin to make more sense with -at least- 5-7 drives in a stripe. The following table compares the usable capacity, in whole drives, of each solution:

Drives   Raid 10 (1x safe)   Raid 5 (1x safe)   Raid 10 (2x safe)   Raid 6 (2x safe)
2        1#*                 -                  -                   -
3        -                   2*                 1#                  1
4        2#                  3*                 -                   2
5        -                   4                  -                   3
6        3                   5                  2                   4
7        -                   6                  -                   5
8        4                   7                  -                   6
9        -                   8                  3                   7
10       5                   9                  -                   8
11       -                   10                 -                   9
12       6                   11                 4                   10
13       -                   12                 -                   11
14       7                   13                 -                   12
15       -                   14                 5                   13
16       8                   15                 -                   14
17       -                   16                 -                   15
18       9                   17                 6                   16

Raid 10 with 2x redundancy always yields one-third storage density, using the other two-thirds to attain safety against any two failures. Raid 6 uses only two drives' worth of capacity for parity, and thus begins to make sense from 5 drives up; it may actually be faster in real-world loads with 7 or more drives. (I don't yet have data for that many drives, and it will likely be system and workload dependent; you may want to test it on your system if there are no other considerations.) Raid 5 meanwhile has similar considerations after 5-6 drives.
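
As a worked example of the table (mine, not the original footnotes): with six 1TB drives, RAID 5 yields (6 - 1) = 5TB usable, RAID 6 yields (6 - 2) = 4TB, RAID 10 with two copies yields 6 / 2 = 3TB, and RAID 10 with three copies yields 6 / 3 = 2TB.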

Performance-wise, the biggest factor I can see affecting the decision is the real write load. Frequent, sparsely spaced small writes make more sense on RAID levels that don't require a read-modify-write cycle to replace data. Conversely, infrequent or large writes are where RAID 5/6 really shines.

General RAID caveats

Any RAID built on the premise of resisting a given number of failures assumes a low risk of correlated failures. However, the way hardware is typically purchased raises that risk. The devices used in a RAID array, especially one built by an amateur or someone with little choice, are usually of the same age, manufacturer and even batch. Even when ordering special 'RAID' drives drawn from multiple batches, the issue of the devices aging together remains a factor.

One possible way around both issues is to buy drives with similar specifications from different manufacturers. Another is to buy drives over time and grow your array during use. Please test whether your system is capable of such behavior first; if you're planning on using Linux software RAID you might be interested in the test scripts (such as the example test script in the appendices).

RAID 5 and 6, as mentioned above, are more computationally expensive than RAID 0 or 1. However, on modern systems it is often less an issue of CPU time, given multiple cores clocked at over a gigahertz each, and more an issue of I/O bandwidth to the CPU. **FIXME** Real world data from real systems?

When using software RAID it is important to know what your boot-loader supports. In most cases you can get away with /boot on a RAID 1 partition provided you put the RAID metadata at the end of the device (metadata format 0.90 or 1.0). This allows, for example, GRUB 0.9x to use it like a normal file-system. Unfortunately, that setup works for those boot-loaders for the same reason it is risky when using recovery CDs and similar tools: each member looks exactly like a normal file-system even to tools that don't understand the data is supposed to be mirrored to other devices.
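
A minimal sketch of such a /boot mirror (device and array names are only examples; -e1.0, or -e0.90 for kernel auto-detection, keeps the superblock at the end of the partition so old boot-loaders still see a normal file-system at the start):

mdadm --create /dev/md/1 -e1.0 -l1 -n2 /dev/sda1 /dev/sdb1
mkfs.ext2 /dev/md/1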

Layering

Linux is extremely flexible. Both RAID and LVM take any generic block device as input, and provide the same thing as their output. Thus Linux allows virtually unlimited stacking, in any combination an administrator could imagine. However, there are a few general principles that should be followed to produce the most effective combinations.

Redundancy layers always come closest to the hardware (unless duplicating whole services or systems).

If there are two or more redundancy layers, the ones that reduce risk the most should be used first. Strictly speaking the only time to use more than one redundancy layer would be when addressing multiple failure modes in different ways; such cases should use this HOWTO as inspiration and command reference only.

  1. RAID 1, 5 or 6 should generally be the first layer, if using redundancy at all.
  2. Next would come RAID 0, if in use.
  3. As a final layer, place LVM on the fast, fault-resistant device the lower layers provide; a sketch of the full stack follows this list.
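
A minimal sketch of that layering, assuming four example partitions /dev/sd[a-d]2 (all names and sizes here are illustrative; each command is covered in detail in the Setup chapter):

# redundancy layer closest to the hardware: RAID 6 across four partitions
mdadm --create /dev/md/10 --name=store -e1.1 -l6 -n4 /dev/sd[a-d]2
# LVM on top of the fast, fault-resistant device
pvcreate -M2 --dataalignment 64K /dev/md/10
vgcreate -s 256m Storage /dev/md/10
lvcreate -L 100G -n data Storage
mkfs.ext4 /dev/Storage/data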

Setup

Partitioning

Partition to a size (or a multiple of a base size) smaller than the whole device. Do not rely on the available size offered as an indication that all drives of that type will always be that large. Even within the same batch, some drives arrive with more available sectors than others.

Growth is another consideration; modern drives seem to follow a pattern of doubling in size every few years. My personal observation is that marketing likes to release drives close to multiples of human-readable units where possible. Desktop drives since 250 (billion bytes) have hit 500, 1000, 1500, and 2000 billion bytes. Depending on your current drives you would probably want to pick one of the first two, maybe three, as your base unit: 8*250 would yield 2000, as would 4*500 or 2*1000.

However, don't forget hard drives are sold in decimal engineering units, not powers of two. Example (500G): 500000000000 / 512 equals 976562500 sectors.

Next, reserve some overhead for the LVM and MD labels. I recall MD as using 128KB for the version 1 labels; for LVM I'm not certain. Aligning to 256KB, or even a megabyte, shouldn't be a bad idea or waste too much space on a modern device; thus reserve 2048 sectors. Then determine what size you'd like to allocate blocks by; for my systems I tend to gravitate to a number like 512MB. Dividing the marketed size of the drive by the number of sectors in one such block ( 976562500 / (512 * 1024 * 2) ) gives 931.3 blocks. Rounding that down, it looks like a few hundred MB are actually spare.
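
The same arithmetic in shell form (a sketch; the 500G drive and 512MB allocation block are just the figures from the example above):

SECTORS=$((500000000000 / 512))   # 976562500 sectors on a marketed 500G drive
BLOCK=$((512 * 1024 * 2))         # one 512MB allocation block, in 512-byte sectors
echo "$((SECTORS / BLOCK)) whole 512MB blocks fit"   # prints 931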

If you want to use more than 4 partitions, or more than 2TB of space (a modern 2-trillion-byte drive shouldn't be an issue; a 2.5-trillion-byte drive, however, would be), then you will have to use a more current disk-label, such as the GUID Partition Table (GPT). If using GPT, make sure that your boot-loader or boot process supports it in some form. There are patches for GRUB 0.97, and your distribution may already include them; this shouldn't be an issue for most Linux distributions released in 2009 or later.

parted /dev/whatever print

Versions of parted I use seem to ignore gpt as a valid label unless entered interactively OR unless using the -s script flag (but then no syntax warnings are issued...), so I invoke mklabel as a separate command in my scripts.

parted /dev/whatever mklabel gpt

Normal parted commands can be chained as long as their syntax is perfect and complete. You can invoke them from a parted shell as well, including the 'select DEVICE' command to switch devices.

parted /dev/whatever unit s mkpart logical START END mkpart logical START END mkpart logical START END

Most commonly used block devices have variable transfer speed, with the fastest sectors first (the platters spin at a constant rate, so the outer tracks pass under the heads faster). Removable media are often the exception; CDs and DVDs vary in length and size, so only the inner ring is assumed to exist and the spiral must grow outward from it.

mdadm (RAID)

Once the partitions are created they can be added to RAID sets. The following options are typically used; others exist but are more application-specific.

Device name: some distributions' current stable versions of mdadm require a normal device name, /dev/md# or /dev/md/#.

--name=

Assign a name to the device. -After- udev rules are processed it will show up according to the following pattern (checked on Gentoo): /dev/disk/by-id/md-name-NAME **FIXME** one version of the name didn't work with Gentoo's mdadm version.

-c

--chunk=

Chunk size: the length of each write to a drive within a single stripe. In general there are two factors for performance. The first is to make bulk read/write transfers efficient, and the second is to make -partial- write requests efficient. The default size is 64KB, which is probably a good default for most systems. Even moderately sized web pages and graphics are likely to span a stripe on smaller systems, while on larger systems one or two files would fill a stripe. For individual drives, 32-64KB reads/writes have also been around the sweet spot of performance over various generations.

-e

--metadata=

Short version: 0.90 for legacy auto-detection support, 1.1 otherwise; don't bother with 1.2 unless documentation explicitly requests that version. Most people will use one of two main choices here. (0 || 0.90) is the older format, but currently the more likely to be supported and auto-assembled by the Linux kernel. (1 || 1.x) is the newer format; it supports more than 26 devices in an array and many other enhancements. There are currently three places version 1 metadata can be stored: at the beginning (1.1), 4KB from the beginning (1.2), or at the end of the device (1.0). Since (most/all?) file-systems and LVM put their labels in the first few sectors of the device, use format 1.1, which places the version 1 super block at the start so the member can't be mistaken for a plain file-system. Version 1.0 places the metadata at the end, which boot-loaders might require.

-n

--raid-devices=

The number of devices to use for storage and parity.

-x

--spare-devices=

These are 'hot spares'. Spares can be shared through named spare groups; please read the manual page for detailed behavior and edit your mdadm.conf to configure the groups.
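
A sketch of what a shared spare group might look like in mdadm.conf (the array names and UUIDs are placeholders; mdadm's monitor mode moves spares between arrays that share a spare-group, see mdadm.conf(5)):

ARRAY /dev/md/100 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx spare-group=shared0
ARRAY /dev/md/101 UUID=yyyyyyyy:yyyyyyyy:yyyyyyyy:yyyyyyyy spare-group=shared0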

-l

--level=

“linear, raid0, 0, stripe, raid1, 1, mirror, raid4, 4, raid5, 5, raid6, 6, raid10, 10, multipath, mp, faulty. Obviously some of these are synonymous.” (mdadm manual)

-p

--layout=

“Finally, the layout options for RAID10 are one of 'n', 'o' or 'f' followed by a small number.” (mdadm manual) The number must be between 2 and the number of devices; it specifies how many copies of the data to store (obviously 1 would give no redundancy). 'n' (near) stores the copies back to back, 'o' (offset) duplicates each stripe rotated by one device, and 'f' (far) stores the copies in widely separated sections of each device. **FIXME** pictures?

-b

--bitmap=

internal, none, or an absolute pathname. Please see the mdadm manual for details. A bitmap is used to speed up recovery by noting which areas of the array were recently active. However, due to the write ordering (barriers) involved, it performs best when stored on an independent device.

Devices: enumerate the initial devices. For any missing redundancy devices, literally use missing as the device name.

mdadm [mode] < raiddevice > [options] < component-devices >

Example: create a device named /dev/md/100 of type raid10 (combined striping and mirroring), using 4 devices. '-pf2' specifies making two copies of the data using the far pattern. The far pattern makes the devices operate similarly to a carefully constructed pair of mirrored RAID 0 devices that share the same drives, but are laid out such that the same data isn't stored on the same drive.

mdadm --create /dev/md/100 --name=test -e1.1 -lraid10 -pf2 -n4 /dev/sd[d-g]4

LVM – Physical Volume CREATE

pvcreate must first be used on a block device (like a fault resistant device offered by RAID) to initialize it for use with the Linux logical volume manager.

-M2

Use the newer version 2 meta-data format.

--dataalignment

-VERY- important for RAID devices. Please set this to the size of a chunk when using an underlying RAID device. Format: Size[kKmMgGtT] (Kilo, Mega, Giga, Tera)

pvcreate -M2 --dataalignment 64K /dev/md/100

LVM – Volume Group EXTEND (if adding to an existing VG)

vgextend VolumeGroupName PhysicalDeviceList

Example:

vgextend LVMStorage /dev/md/100 /dev/md/102 /dev/disk/by-id/md-name-tera0

LVM – Volume Group CREATE (if starting a VG)

-s

--physicalextentsize

Format: Size[kKmMgGtT] (Kilo, Mega, Giga, Tera). Must be a power of two, and should also be a multiple of the chunk size.

-c

--clustered

Defaults to y (yes) if clustered locking is enabled. “If the new Volume Group contains only local disks that are not visible on the other nodes, you must specify --clustered n.” (vgcreate manual)

vgcreate -s 256m VolumeGroupName DeviceList

vgcreate -s 256m Storage /dev/disk/by-id/md-name-test /dev/disk/by-id/md-name-newraid /dev/disk/by-id/md-name-etcetera

LVM – Logical Volume CREATE

I'll refrain from mentioning any lvcreate options that are duplicated by RAID and could confuse a relatively new user glancing through this guide. There are additional options and they are documented in the manual for lvcreate.

lvcreate < options > VolumeGroup [PV paths]

-l

--extents

Size in Volume Group extents. It can also be given as %VG for a percentage of the Volume Group's size, %PVS for a percentage of the free space in the given Physical Volume(s), or %FREE for a percentage of the free space in the entire Volume Group.

-L

--size

Size with a capital power-of-two suffix (K, M, G, T, P, E); Mega is the default.

-n

--name

What to call the logical volume. Logical volumes appear on udev systems under two names (**FIXME** on all systems?): /dev/mapper/VGname-LVname and /dev/VGname/LVname.

-r

--readahead

ReadAheadSectors|auto|none ; the default of auto is sufficient for most systems. none/zero may be desirable for advanced uses.

lvcreate -L 2560G -n massive Storage

File-systems and /etc/fstab

Now there are devices /dev/mapper/VGname-LVname and /dev/VGname/LVname, and there are choices. This is the point where I'd usually consider applying any encryption layers I want, since they are then built on reliable and re-sizable storage. Another reason to encrypt at this point, as opposed to lower in the stack, is that volumes need not share the same access key(s).

Whenever you modify file-systems or add new ones, you should review /etc/fstab to make sure the old entries still make sense and to add the new ones. Please remember to use either UUIDs or the /dev/disk/by-id/md-name-NAME and /dev/mapper/NAME devices; the latter may be preferable from a documentation perspective, but your system may require one or the other in order to work with its native init/early userspace.
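
An illustrative fstab line using names from the earlier examples (the mount point and options are placeholders):

/dev/mapper/Storage-massive   /srv/massive   ext4   defaults   0 2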

One area where I see a huge deficiency, as of the date I first wrote this (2009-06-23), is init/early-userspace support. There's /etc/fstab, but there is no Linux standard for communicating to the init-image tooling which tools the image needs to contain and what to do in order to enable, and possibly unlock, the root file-systems. It seems like every Linux distribution I've tried setting up encryption on has a different idea of how to do it. Having multiple options isn't such a bad idea, but it would be great if they were shared, and maybe if a few popular front-runners could be tested on a single distribution.

File-systems – RAID specific options

ext4 (3/2)

mkfs.ext(4|3|2) should be used to ensure you get the correct default choices. Ext3 is largely ext2 with an -optionally used- journal; ext3 can be cleanly unmounted and then used as an ext2 file-system without issue. Ext4, however, cannot be mounted by older ext drivers, though the ext4 driver can mount older ext file-systems.

Use the extended options (-E) to enable RAID optimizations:

mkfs.ext4 ... -E COMMA,SEPARATED,OPTIONS …

stride=

Set this to the array's chunk size expressed in file-system blocks (blocks are usually 4KB).

stripe-width=

This should be set to stride * data-chunks-per-stripe, i.e. the data portion of a full stripe in file-system blocks (do not count the parity chunks). If you are planning to grow the underlying structure (soon) after setup, set this to the final size.

resize=

Reserve space for the file-system to grow to a device containing this many file-system blocks (usually -b 4096). (Useful if adding additional storage and growing everything in phases.)
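
A worked example of my own (not the author's figures): a 6-device RAID 6 with 64KB chunks has 4 data chunks per stripe; with 4KB file-system blocks that gives stride = 64/4 = 16 and stripe-width = 16 * 4 = 64.

mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/Storage/massive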

Maintenance

Setting up a practice sandbox

I don't recommend attempting any of the following steps without first getting a little practice. I recommend getting that practice with an example setup; below I outline the process used in some scripts I wrote to partly automate testing RAID 5 expansion and potential reshaping to RAID 6.

It operates along these lines:

  1. Create the first backing file by dd'ing from /dev/zero.
  2. Copy it to create the additional device storage files.
  3. Use losetup to map the backing files to loopback devices.
  4. Follow the steps in the previous section that are relevant to a given test.
  5. Add some canary files to the file-system by dd'ing from /dev/urandom.
  6. Pick any reasonable checksum tool, such as md5sum, to generate and check hashes of the test data.
  7. Now you can practice any scenario you want.
  8. Unmount, deactivate the volume groups with vgchange -an, stop the md array, detach the loop devices with losetup -d, and then remove the backing files; cleanup is complete.

LVM – Where is my data living?

lvs -o +devices

(Abridged and converted to a table)

LV       VG   Attr     LSize   Devices
large0   vg   -wi-ao   2.00T   /dev/md23(5000)
large0   vg   -wi-ao   2.00T   /dev/md24(0)
large0   vg   -wi-ao   2.00T   /dev/md25(0)

This information will be very useful when planning what you want to shrink, move, change, etc.

LVM – Resizing a Logical Volume

To shrink, work from the top down; to grow, work from the bottom up. If you're not sure what I mean, there is a metaphor in the appendices.

Resizing the file-system:

Here are some commands that may be of interest for resizing the file-systems. As this is an advanced operation, I recommend reading the manual pages in question.

resize2fs [ -fFpPM ] [ -d debug-flags ] [ -S RAID-stride ] device [ size ]

resize_reiserfs [ -s [+|-]size[K|M|G] ] [ -j dev ] [ -fqv ] device

Grow Only file systems

(JFS via kernel driver) mount -o remount,resize /mount/point

xfs_growfs [ -dilnrxV ] [ -D size ] [ -e rtextsize ] [ -L size ] [ -m maxpct ] [ -t mtab ] [ -R size ] mount-point xfs_info [ -t mtab ] mount-point

lvresize -l <[+|-]Extents[%{
    VG   (relative to the volume group size) |
    LV   (relative to the current logical volume size) |
    PVS  (relative to free space on the PVs listed on the command line) |
    FREE (relative to free space in the VG)
}]> logical-volume-path [pv-paths]

lvresize -L <[+|-]Size[kKmMgGtTpPeE]> logical-volume-path [pv-paths] 

A volume can be resized to an absolute or relative (+/-) size. If using extent based sizing it may also be proportional from among a choice of listed references.
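
A hedged example of shrinking an ext4 logical volume by 10G, working top down (the names and sizes follow the earlier examples and are placeholders; the file-system must be unmounted and checked before shrinking):

umount /srv/massive
e2fsck -f /dev/Storage/massive
resize2fs /dev/Storage/massive 2550G
lvresize -L -10G /dev/Storage/massive
mount /srv/massive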

LVM – remove a Physical Volume

The plethora of options and extras can be intimidating; thankfully the defaults produce the desired results for normal cases. These are the steps for the simple and easy case of moving the storage anywhere else in the volume group except the source physical device; as always, there must be adequate free space elsewhere in the VG.

pvmove source-pv [destination PVs] 

vgreduce volume-group-name <physical-volumes> 
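
For example (device names taken from the lvs output above, purely for illustration), to empty /dev/md23 and then drop it from the volume group vg:

pvmove /dev/md23
vgreduce vg /dev/md23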

RAID – growing a RAID array

Add spare devices

mdadm --add /dev/mdDEV <BlockDevices> 

mdadm --grow /dev/mdDEV -n<NewActiveDeviceCount> 

cat /proc/mdstat 

mdadm --detail /dev/mdDEV 

Resize the PhysicalVolume if used in LVM

pvresize /dev/mdDEV 

The Volume Groups using that PV should now have additional free space.
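
Putting those steps together for a hypothetical RAID 5 array /dev/md/101 being grown from 4 to 5 active devices (all names are placeholders):

mdadm --add /dev/md/101 /dev/sdh4
mdadm --grow /dev/md/101 -n5
cat /proc/mdstat            # watch the reshape progress
mdadm --detail /dev/md/101
pvresize /dev/md/101        # only if the array is an LVM Physical Volume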

lvm - pvresize - too many metadata areas for pvresize

This "too many metadata areas for pvresize" is telling you that pvresize doesn't know how to resize with metadata at the end of the device. Follow these steps as a workaround. These steps are potentially dangerous but have been tested and worked for me.

  1. vgcfgbackup -v -f /some/file volumeGroup
  2. vgchange -an volumeGroup
  3. pvcreate -ff --metadatacopies 1 --restorefile /some/file --uuid UUID /dev/device
  4. vgcfgrestore -v -f /some/file volumeGroup
  5. pvresize /dev/device

NOTE: Step 3 requires that you specify the UUID even though it exists in the backup file.

A similar set of operations can be used to increase metadatacopies to 2; however that doesn't seem to make sense until the lvm tools support working with 2 copies of metadata OR they support -easily- reducing/increasing the copies.

You may want to change the default configuration setting for metadatacopies in your lvm configuration file to 1. After doing so, or by explicitly specifying one copy on the command line, you may also be able to use an easier workaround: move the data onto PVs that already contain only one copy of meta-data.

RAID – recovering from a device failure

This topic should probably have a whole chapter dedicated to it. In general, the root cause of the failure and the failed device(s) should first be determined. Some RAID failures, including a dreaded controller failure (such as one due to silent firmware corruption or hardware failure), can lead to -major- data loss. Other failures are more easily diagnosed and involve more routine drive failure.

Detailed Linux specific RAID information and recovery steps:

RAID – getting back in from a rescue system

I've had to do this twice now myself, thanks to various grub update issues. Each time I've had to search the web until I found a hint to remind me why --assemble --scan doesn't work. The reason is that --scan doesn't tell mdadm to scan the devices you provide and then assemble them. Instead it tells mdadm to scan the configuration file, which is exactly the data you don't have at the moment it tries to assemble devices. The answer is to examine the devices to generate the configuration file, and then assemble.

You may also want to add --no-degraded as there may be a reason you have to get back in the hard way.

WARNING: Don't do this until you understand that you're clobbering the mdadm.conf file and replacing it with whatever you find.

mdadm --examine --scan /dev/[sh]d* /dev/mapper/* > /etc/mdadm.conf ; mdadm --assemble --scan

Appendices

A – A Metaphor for LVM on RAID

Modern storage exists as barges (hard drives) drifting in a sea of entropy. RAID adds a layer on top of the barges that distributes the weight so that there is a given redundancy to allow it to remain floating even if a pre-determined number of barges sink from beneath it. Logical Volumes are like those shipping areas at major transit hubs where blocks of storage float on top of a bed of rollers so they can move with little resistance.

Growing a logical volume, therefore, is as simple as expanding the number of units a given division has for storing its data. As long as there is free space (and the file-system can grow; most modern ones can), more boxes can be added, then the data inside can grow to use the new boxes.

Shrinking a logical volume is slightly more difficult. First the data using those boxes must be consolidated, so that empty boxes are left behind; otherwise it's like dumping the data overboard into that sea, which ruins it. Then it is safe to move those boxes out of the division and put them back on the unused list.

B – LVM2 meta data

Since the LVM documentation wasn't kind enough to offer me the label location easily, I decided to create a Physical Volume on a 4MB loop-back file. Here's the hexdump.

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000200  4c 41 42 45 4c 4f 4e 45  01 00 00 00 00 00 00 00  |LABELONE........|
00000210  90 37 a9 b6 20 00 00 00  4c 56 4d 32 20 30 30 31  |.7.. ...LVM2 001|
00000220  53 55 57 71 66 55 4c 59  71 58 70 71 38 48 71 36  |SUWqfULYqXpq8Hq6|
00000230  4a 46 67 33 59 6c 49 4c  4a 75 48 4c 4e 70 57 57  |JFg3YlILJuHLNpWW|
00000240  00 00 40 00 00 00 00 00  00 00 03 00 00 00 00 00  |..@.............|
00000250  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000260  00 00 00 00 00 00 00 00  00 10 00 00 00 00 00 00  |................|
00000270  00 f0 02 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000280  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  1b c7 35 60 20 4c 56 4d  32 20 78 5b 35 41 25 72  |..5` LVM2 x[5A%r|
00001010  30 4e 2a 3e 01 00 00 00  00 10 00 00 00 00 00 00  |0N*>............|
00001020  00 f0 02 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

C – mdadm device names

(Retrieved 2009-06-25: http://markmail.org/message/v3rzfdtkwpmchjeh )

From: Neil Brown (nei...@suse.de)

Date: Oct 26, 2008 3:55:49 pm

List: org.kernel.vger.linux-raid

Greeting. This is a Request For Comments....

Device naming in mdadm is a bit of a mess. We have partitioned devices (mdp) and non-partitioned (md) We have names in /dev/md/ (/dev/md/d0) and directly in /dev (/dev/md_d0). We have support for user-friendly names (/dev/md/home) and for "kernel-internal" names (/dev/md0).

All this can produce extra confusion when udev is brought into the picture. And it can leave lots of litter lying around in /dev if we aren't careful (which we aren't).

I hope to release mdadm-3.0 this year, and maybe that gives me a chance to get it "right". I don't want to break backwards compatibility in a big way, but I think I am happy to introduce little changes if it means a more consistent model.

In 2.6.28, partitioned devices (mdp) wont be needed any more as md will make use of the "extended partition" functionality recently added. All md devices can be partitioned. The device number for the partitions will be very different to that of the whole device, but udev should hide all of that. So we don't have to worry too much about mdp devices.

So I think the following is how I want things to work. I am very open to comments and suggestions. Particularly I want to know what (if anything) this will break.

1/ The only device nodes created will be /dev/mdX and /dev/md_dX along with partitions /dev/mdXpY and /dev/md_dXpY as appropriate. These will be created by mdadm in accordance with the "--auto" flag unless something in mdadm.conf says to leave it to udev. In that case, mdadm will create a temporary node (/dev/.mdadm.whatever) and remove it once udev has created the real thing.

2/ There will be various symlinks to these devices. a/ if "symlinks=yes" is given in mdadm.conf, symlinks from /dev/md/X or /dev/md/dX will be created. b/ if udev is configured like on Debian, /dev/disk/by-id/md-name-XXXX and /dev/disk/by-id/md-uuid-UUUU will be created (by udev). c/ If there is a 'name' associated with the array then /dev/md/name will be created as a link. d/ if an explicit device name of /dev/name was given, either on a -A, -B, -C, command or in mdadm.conf, then the 'name' must match the name of the array, and /dev/name will be used as well as /dev/md/name.

3/ For a 'NAME' to be used, with as md-name-NAME or /dev/md/NAME, we need a high degree of confidence that the array was intended for "this" host, or otherwise is not going to conflict with an array that is meant for "this" host. We get this confidence in a number of ways: a/ If the name is listed in /etc/mdadm.conf e.g. ARRAY /dev/md/home UUID=XXXX..... b/ If the name was given on the command line b/ If the name is stored in the metadata of an array which is explicitly identifed in mdadm.conf or by the command line. c/ If the name is of the form "host:name" and "host" matches this host. We then use just "name". d/ If the name is of the form "host:name" and "host" does not match this host, we can still assume that "host:name" is unique and use that. e/ For 0.90 metadata, if the uuid has the host name encoded in it then it was intended for 'this' host.

Thus unsafe names are names extracted from the metadata of arrays which are auto-detected, where there is no hint in the metadata that the array is built for 'this' host.

If the NAME is not known to be safe, we can still assemble the array, but we use a "random" high minor number, and allow it to be found primarily by the by-id/md-uuid-UUUUU... link or some other link created based on array content: e.g. disk/by-label/ Also the array will be assembled "auto-readonly" so no resync etc will happen until the array is actually used.

mdadm-3.0 will be able to support "containers" such as a set of devices with DDF metadata. These can then contain a number of different arrays. If the 'container' is known to be local to 'this' host, then we assume that all contained arrays are too.

I'm contemplating creating a link based on the metadata type with a sequential number. e.g. /dev/md/ddf1 or /dev/md/imsm2. I'm not sure if there should be in /dev/md/ or directly in /dev/. I'm also not sure if I should leave the creation to udev, and whether I should use a small sequential number, or just whatever number was allocated as the minor number of the device.

4/ When we stop an array, mdadm will remove anything from /dev that it probably created. In particular, it will remove the device node as described in 1, any partitions, and any symlinks in /dev or /dev/md which point to any of those. I need to be certain that this won't confuse udev.

5/ I want to enable assembly without having to give an explicit device name, thus requiring mdadm to automatically assign one just as it would for auto-assembly. In particular, the "ARRAY" line in mdadm.conf will no longer require an array name. That would mean that "-Es" wouldn't need to produce an array name (which is not always easy). So: mdadm -Es > /tmp/mdadm.conf mdadm -Asc/tmp/mdadm.conf would leave the choice of device name to the "-A" stage which is the only time that unique non-predictable names can be chosen.

6/ I'm thinking that if the array name given to --create or --assemble looks as though it identifies a metadata type, by having the name of a metadata type followed by some digits, e.g. /dev/ddf0 or /dev/md/imsm3 then we insist that the array have that metadata type. That could mean that a future metadata type might conflict with a previously valid usage, which would be a bore. Maybe if there are trailing digits, then it *must* identify a metadata type, or be "mdNN".

Some issues that all of this needs to address:

1/ People want auto-assembly. I've always fought against it (we don't auto-mount all filesystems do we?). But it is a loosing battle. And on a modern desktop, when you plug in a new drive the filesystem is automatically mounted. So my argument is falling apart.

2/ Auto-assembly of new arrays must not conflict with auto-assembly of previously existing arrays, even if the devices comprising the new arrays are discovered earlier. This is what the 'homehost' concept is for. Your array will only get assembled with a predictable name if it is known to be attached to 'this' host.

3/ Auto-assembly needs to handle incremental arrival of devices correctly. There are no easy solutions to this, particularly when e.g. ext3 can write to the device even when mounted read-only (for journal replay). I think the best that I can do for now is assemble things 'read-auto' to delay any writes a long as possible in the hope that all available devices will be connected by then. Adding in-memory bitmaps for all degraded array to accelerate rebuild would help but won't be in 2.6.28.

4/ auto-assembly needs to do the right thing on a SAN where multiple hosts can each see multiple arrays. Clearly only one host should write to any one array at one time (until I get some cluster-awareness going, which I had hoped to work on this year, but it doesn't look like I will). In this case, I don't think read-auto is enough. We either need to not assemble arrays when aren't known to belong to us, or we need to assemble them read-only and require and explicit read-write setting.

So we need some way to know which devices could be visible to other hosts. I could have a global flag in mdadm.conf "Options SAN" I could have a SAN-DEVICES to match "DEVICES", but as just about everything is "/dev/sd*" these days, I don't know if that would work.

Any suggestions concerning this would be welcome.

I'm also wondering if I should include a udev 'rules' file for md in the mdadm distribution. Obviously it would be no more than a recommendation, but it might give me a voice in guiding how udev interacted with mdadm.

Any thoughts of any of this would be most welcome.

Thanks, NeilBrown


D – raid 5 to raid 5 copying (9 disk to 3 disk)

System: AMD BE-2350 @ 2100 MHz (real), 4GB ECC memory, AM2+ motherboard, SATA II drives. Reading and writing at up to 30MB/sec (large files), as reported by rsync's transfer-speed display; performance seems I/O bound. Writing to an array with more devices would probably improve performance.

top (P s 60)
Cpu0  :  1.7%us,  9.6%sy,  0.0%ni, 82.4%id,  5.9%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu1  : 17.1%us, 29.1%sy,  0.0%ni,  7.8%id, 41.1%wa,  3.6%hi,  1.2%si,  0.0%st
Mem:   3780080k total,  3760556k used,    19524k free,   230624k buffers
Swap:  4199340k total,      272k used,  4199068k free,  2498056k cached
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7809 root      20   0 25252 1128  368 R   22  0.0   0:20.05 rsync
 7807 root      20   0 25340 1916  896 R   22  0.1   0:19.70 rsync
 3826 root      15  -5     0    0    0 S    7  0.0  22:27.36 md100_raid5
  397 root      20   0     0    0    0 S    2  0.0   0:09.28 pdflush
  398 root      15  -5     0    0    0 S    2  0.0   0:14.81 kswapd0
 6391 root      20   0     0    0    0 S    1  0.0   0:01.27 pdflush
 7786 root      15  -5     0    0    0 S    1  0.0   0:00.94 kjournald2

E – Useful Commands

EXT3/4: make a journal device (on another disk for maximum performance)

mke2fs -b 4096 -O journal_dev -L LABEL_LOG /dev/disk/by-id/md-name-whatever

EXT3/4: options I used to make my device

mkfs.ext4 -b 4096 -E stride=$((128/4)),stripe-width=$((128/4*6)) -J device=LABEL=LABEL_LOG -L LABEL -m 0.2 -O dir_index,flex_bg,extent,uninit_bg,resize_inode,sparse_super -G 32 -T ext4 -v -i $((1048576/16)) /dev/mapper/whatever

find RELATIVEONLY_PATH -xdev -depth -print0 | cpio --null -pmuvd --block-size=$((128*2*2)) DESTINATION

F – Example test script

I wrote this script to test whether my current Linux kernel and mdadm support reshaping from RAID 5 to RAID 6. Depending on whether that support exists it may or may not complete, but it provides a good example of the general testing approach.

#!/bin/bash
# -e : add 'set -e' here if you want the script to abort on the first error

function create_block {
# SizeMB , Name
dd if=/dev/zero bs=1024k count="$1" of="$2"
}

function setup_blockdevs {
# count, Name
ii_setup_bd=0;
while [[ $ii_setup_bd -lt $1 ]]
do
cp "$2" "$2"."$ii_setup_bd"
devs[$ii_setup_bd]=`losetup -f --show "$2"."$ii_setup_bd"`
ii_setup_bd=$(($ii_setup_bd + 1))
done
}

function print_list {
# list
while [[ -n "$1" ]]
do
echo $1
shift
done
}

function delete_blockdevs {
# list, shifted out
while [[ -n "$1" ]]
do
#echo rm $1
losetup -d $1
shift
done
}

create_block 33 test_devs
setup_blockdevs 6 test_devs

## Create array, RAID 5 with 3 devices
mdadm --create /dev/md90 -e 1.2 -c 64 --verbose --level=5 --raid-devices=3 ${devs[0]} ${devs[1]} ${devs[2]}
uuid=`mdadm --detail /dev/md90 | awk '/UUID/ {print $3}'`
echo "$uuid"
#sleep 5

if [[ -z "$uuid" ]] ; then exit ; fi

## Create filesystem and test data
mkfs.ext4 /dev/md90
mkdir "/tmp/test$uuid"
mount /dev/md90 "/tmp/test$uuid"
pushd "/tmp/test$uuid"
dd if=/dev/urandom of="random-data" bs=1024k count=10
popd
umount "/tmp/test$uuid"

## test
mount /dev/md90 "/tmp/test$uuid"
pushd "/tmp/test$uuid"
md5sum -b random-data > random-data.md5
md5sum -c *.md5
popd
umount "/tmp/test$uuid"

## grow
mdadm --add /dev/md90 ${devs[3]}
mdadm --detail /dev/md90
echo "Trying to grow RAID5 by one device"
mdadm --grow /dev/md90 --level=5 -n4
mdadm --detail /dev/md90
sleep 5

## test
mount /dev/md90 "/tmp/test$uuid"
pushd "/tmp/test$uuid"
# verify against the hashes generated after the initial write
md5sum -c *.md5
popd
umount "/tmp/test$uuid"

## grow
mdadm --add /dev/md90 ${devs[4]}
mdadm --detail /dev/md90
echo "Trying to grow RAID5 by one device to RAID6"
mdadm --grow /dev/md90 --level=6 -n5
mdadm --add /dev/md90 ${devs[5]}
mdadm --detail /dev/md90
echo "Trying to grow RAID6 by one device, or RAID5 by two to RAID6 (or if that returns failure, just raid 5)"
mdadm --grow /dev/md90 --level=6 -n6 || mdadm --grow /dev/md90 --level=5 -n6
mdadm --detail /dev/md90
sleep 5

## test
mount /dev/md90 "/tmp/test$uuid"
pushd "/tmp/test$uuid"
# verify against the hashes generated after the initial write
md5sum -c *.md5
popd
umount "/tmp/test$uuid"

sleep 5
mdadm --stop /dev/md90
delete_blockdevs ${devs[@]}
unset uuid
unset devs
unset ii_setup_bd
rm test_devs*
