intellectual mastication

complex procedures, concisely elucidated, for efficient recollection upon subsequent variations of the endeavors herein described



RAID Intro

No, not the military action. RAID stands for Redundant Array of Inexpensive Disks, and it’s great for when you need more storage in one place than a single drive can provide. It’s a way to use software (or hardware) to glue together a bunch of disks into something that looks to the computer like one big disk.

There’s a bunch of different ways to do the gluing, and they each have pros and cons. If you’re really worried about losing some valuable data, you can get 2 or more identical disks and put the same data on all of them in parallel (like making multiple backups of your data - RAID just makes it super simple). Then you lose one disk and you haven’t lost any data. That’s “RAID 1”. Or, if you take a more cowboy approach, you can do a “RAID 0”, where you just sum the drives together without any redundancy…lose one, though, and you lose everything.

The rest of the RAID variations are compromises between these two approaches. The table below describes exactly what the tradeoffs of each RAID level are.

Space efficiency is the fraction of total array capacity that is usable; read/write performance is relative to a single drive; fault tolerance is the number of drives that can fail without data loss. n is the number of drives in the array. Thus, in a 4-drive RAID 6 configuration, the space efficiency is 1 − 2/n = 1 − 2/4 = 1/2, which means the available space is half the total capacity of the 4 drives.

RAID      Minimum   Space      Fault      Read      Write performance
level     drives    efficiency tolerance  perf.

RAID 0    2         1          None       n         n
RAID 1    2         1/n        n − 1      n [a]     1 [c]
RAID 4    3         1 − 1/n    1 [b]      n − 1     n − 1 [e]
RAID 5    3         1 − 1/n    1          n [e]     1/4 [e]
RAID 6    4         1 − 2/n    2          n [e]     1/6 [e]

[a]  Theoretical maximum, as low as single-disk performance in practice.
[b]  Just don't lose the parity disk.
[c]  If disks with different speeds are used in a RAID 1 array, overall write 
     performance is equal to the speed of the slowest disk. 
[e]  That is the worst-case scenario, when the minimum possible data (a single
     logical sector) needs to be written. Best-case scenario, given 
     sufficiently capable hardware and a full sector of data to write: n − 1.
     This is because data is written in predetermined 'chunk' sizes; if the 
     data is much smaller than the chunk, the whole chunk must still be 
     written.

RAID 0 is just an aggregation of all the disks, with no fault tolerance. If one of the drives dies, the whole thing is toast.

RAID 1 consists of an exact copy (or mirror) of a set of data on two or more disks. The array can only be as big as the smallest member disk. This layout is useful when read performance or reliability is more important than write performance or the resulting data storage capacity. The array will continue to operate so long as at least one member drive is operational.

Random read performance of a RAID 1 array may equal up to the sum of each member’s performance, while the write performance remains at the level of a single disk. However, if disks with different speeds are used in a RAID 1 array, overall write performance is equal to the speed of the slowest disk.

RAID 5 consists of block-level striping with distributed parity. Parity information is distributed among the drives. It requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks.

RAID 6 is any form of RAID that can continue to execute read and write requests to all of a RAID array’s virtual disks in the presence of any two concurrent disk failures. RAID 6 does not have a performance penalty for read operations, but it does have a performance penalty on write operations because of the overhead associated with parity calculations. RAID 6 can read up to the same speed as RAID 5 with the same number of physical drives.
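
To make the space-efficiency column concrete, here’s a tiny sketch of my own (not part of any RAID tooling) that turns the table’s formulas into usable-capacity numbers; the script name and the whole-number drive sizes are just illustrative.

#!/bin/bash
# usage: bash raid_capacity.sh <level> <num_drives> <drive_size_in_TB>
# hypothetical helper; the formulas come straight from the table above
level=$1; n=$2; size=$3

case "$level" in
    0) usable=$(( n * size )) ;;        # striping: every byte is usable
    1) usable=$size ;;                  # mirroring: one drive's worth, n copies
    5) usable=$(( (n - 1) * size )) ;;  # one drive's worth of parity
    6) usable=$(( (n - 2) * size )) ;;  # two drives' worth of parity
    *) echo "unsupported RAID level: $level"; exit 1 ;;
esac

echo "RAID $level, $n x ${size}TB drives: ${usable}TB usable"

For example, 'bash raid_capacity.sh 6 4 4' reports 8TB usable - half of the 16TB of raw disk, matching the worked example above.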


Expanding a RAID array with an additional disk

Question. Given a RAID array built with mdadm, is it possible to add another disk to it and expand its storage?

Answer. Yes, but it’s annoying. You have to prep the disk, add it, and then expand the filesystem. And you have to do it all by hand.

I tried making a script to do this automatically, but I’m pretty sure I’d never actually trust that script not to screw up - with the amount of sanity checking I was putting into it, it’s better to just put the whole process here and do it manually.

So here goes.

Before we start

This works with a variety of RAID levels - and if your raid array is currently missing a disk, follow this walkthrough to prep the drive and add it as a hot spare, and the computer will take care of the rest.

If you ever want to see what’s up with all your RAID arrays, use this command:

cat /proc/mdstat
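
For more detail on one specific array - member devices, state, and spare count - mdadm itself can report it; the device name below is just an example:

sudo mdadm --detail /dev/md3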
  • You’ll need local access to a drive not associated with the RAID array. This is for a small backup file, which (far as I can tell) keeps a second record of how the disks are arranged and how the data is broken up; if the raid alterations fail, this backup file is used to recover everything.
  • Make sure you know what the name of the RAID device is that you’re adding to. This will look something like /dev/md3, but likely with a different number at the end.
  • Figure out how many disks are currently in the RAID array; you’ll need that number later. Below is a quick script that can help. Save it as count_raid_disks.sh.
#!/bin/bash
# usage: bash count_raid_disks.sh 'md3'

[ -z "$1" ] && echo "provide the RAID device, e.g. 'md3'" && exit 1

# the matching /proc/mdstat line looks like:
#   md3 : active raid5 sdd1[2] sdc1[1] sdb1[0]
# so the disk count is the word count minus the first 4 fields
wordcount=$(grep "$1" /proc/mdstat | wc -w)
echo "number of disks in RAID array $1: $((wordcount - 4))"
  • Same goes for the disk you’re adding. It’ll be something like /dev/sdj, but with a different letter at the end.
  • Also, do some sanity checks. The last thing you want is to realize the drive was already in use somewhere.
# say what your disk is. don't include a partition number ('sdj1') or '/dev/'
new_disk='sdj'

# check that the new disk is not mounted conventionally
mount | grep -q "/dev/$new_disk" && echo "new disk is already in use" && exit

# check that the new disk is not part of a RAID device
grep -q "$new_disk" /proc/mdstat && echo "disk is part of a RAID device" && exit
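
One more quick eyeball check before wiping anything, assuming util-linux’s lsblk is available (it is on nearly every Linux distro): list the disk along with any partitions, filesystems, and mountpoints it currently has.

# show partitions, filesystems, and mountpoints on the new disk
lsblk -f /dev/$new_disk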

Prep the new disk

If the disk is clean and ready to roll, then go ahead and wipe it.

Remember, adding the disk to RAID will wipe everything.

This is your final warning.

If you ignore it, run any of the following commands, and then run crying back because you lost some data, then this is my best advice: you shouldn’t be a sysadmin. Maybe you should stick to Windows. And there’s a chance – just a chance – that computers are just not your thing.

# specify the disk.
new_disk='sdj'

# zero out the first 1 GB of the drive; effectively clears all partition tables
sudo dd if=/dev/zero of=/dev/$new_disk bs=1M count=1000 status=progress

# make sure that none of the previous disk write is being held in buffer
sudo sync

# create new partition label
sudo parted /dev/$new_disk mklabel gpt && sync

# create new primary partition
sudo parted -a optimal /dev/$new_disk mkpart primary 0% 100% && sync

# turn on raid on the new addition
sudo parted /dev/$new_disk set 1 raid on && sync

# verify the formatting
sudo parted /dev/$new_disk print

The disk is ready to be added to the array!

Add the disk to the RAID array as a spare

First we add the disk as a hot spare. If we wanted, we could stop at that point and we’d have a RAID array of the same size, but with extra redundancy - if a drive were to fail, the hot spare would be automatically used as a replacement. Then the failed drive could be removed later.

Hot spares aren’t strictly necessary, because there’s already redundancy built into the array - RAID 5 can lose 1 disk without data loss, RAID 6 can lose 2. But when a disk is lost, the array needs a replacement added quickly; a hot spare means the array can immediately begin rebuilding onto the replacement, instead of waiting around in a degraded, high-risk state with less room for further disk failure.

# specify the disk
new_disk='sdj'

# specify the raid device
raid_device='md3'

# add the disk as a hot spare
sudo mdadm /dev/$raid_device --add /dev/$new_disk
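
To confirm the disk was actually registered - it should show up at the bottom of the device list marked as a spare - ask mdadm directly:

sudo mdadm --detail /dev/$raid_device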

When the machine hosting the RAID array boots, the array starts out as just a bunch of disks attached to the computer. If they’re supposed to be automatically assembled into a single RAID array at startup, you need to say so.

In case you haven’t yet noticed, we’re using a tool called mdadm to work with this RAID array. In fact, these are considered ‘mdadm RAID arrays’. That’s because there’s a lot of leeway in how each RAID variation can be implemented: size of the data chunks, how the records are kept of which piece is where, how the chunks are checked for errors, etc. Some of that (sometimes a lot of that, or even all of that) is up to the programmer who builds their version of RAID. It’s not expected (so far as I know) that one dude’s implementation of RAID be compatible with someone else’s version; it appears that the standard definitions of the RAID levels just focus on the high-level stuff like ‘number of drives that can fail without data loss’.

All that to say, we need to edit an mdadm config file and tell mdadm that we want the RAID array assembled on startup. There’s going to be an entry already for the raid device we’re working on, since what we’re doing is a modification of an existing raid array. Comment out that entry with a #; we’re going to add a new entry in a moment that will reflect the additional disk we just added.

sudo nano /etc/mdadm/mdadm.conf

Then we want to add a new entry, for the updated RAID array.

# identify the raid device
raid_device='md3'

# append current raid configuration
sudo mdadm --detail --brief /dev/$raid_device | sudo tee -a /etc/mdadm/mdadm.conf
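
The appended line will look roughly like this (the name and UUID here are made-up placeholders; yours will differ):

# illustrative only - do not copy this line
ARRAY /dev/md3 metadata=1.2 name=myhost:3 UUID=1234abcd:5678ef90:aabbccdd:11223344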

We now have a hot spare! If that’s all we wanted, then we’re done: if we were to shut down and restart the host server, the RAID array would automatically reassemble. If one of the disks were to fail on us right now, the array would automatically detect that, and start rebuilding onto the hot spare we just added. (Because that’s the rub: the spare doesn’t actually hold any data until something breaks…and when it does, it’s going to take hours, or even days, to recompute the missing data from the remaining drives in the RAID array and copy it over to the automatically-promoted hot spare [which isn’t spare any more].)

Now, before you keep going, there’s a point here worth mentioning. If you’re trying to add multiple disks to the array, you can grow onto them simultaneously - you don’t need to add them individually and wait hours/days for the array to grow over each one, one by one. If you have another disk to add, go back and do what we just did, again, with the other disk. Come back when you’ve added it to the RAID array just like the first.

When you’re done with that, the RAID array should have two hot spares attached.

Grow the RAID array onto the spare

Now we tell the RAID array that it’s bigger than it thought, and that it should expand itself onto the ‘spare’ that’s now available. First we need the number of disks currently in the array; let’s say there are 4 in there now.

bash count_raid_disks.sh 'md3'

Then, we tell the RAID array that hey, actually, there’s (4 preexisting + 1 new hot spare) 5 disks in the array. This is also where we need to know where we can put that backup file.

By the way, if you’re doing that thing where you add multiple hot spares at a time, then here you should tell the RAID array that you have e.g. (4 preexisting disks, and 2 hot spares = ) 6 disks in the array, instead of 5.

# identify the raid device
raid_device='md3'

# 4 preexisting disks + 1 new hot spare disk
new_disk_count=5

# place we can put a small (less than 10mb) backup file
raid_backup_filepath='/root/md3_grow.bak'

# tell the RAID array it has more disks than before
sudo mdadm --grow --raid-devices=$new_disk_count --backup-file=$raid_backup_filepath /dev/$raid_device

This will take a long time. Possibly days. To view the progress, check the contents of this file:

cat /proc/mdstat
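
If you’d rather not keep re-running that by hand, and the watch utility is installed, something like this will refresh it for you:

# re-read the reshape progress every 60 seconds
watch -n 60 cat /proc/mdstat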

When this is done, check the filesystem for errors. This requires that the filesystem be unmounted, so do that first (if it’s like mine, and used in passthrough mode by a virtual machine – good luck! That’s going to be a pain and a half to shut down, detach, and free up).

# don't do this until you're done!
sudo umount /dev/md3

By the way, what kind of partition is on that drive? EXT3? EXT4? You’ll need to figure that out if you want to run the right disk checker program. Mine is ext4, but yours might not be. This might help:

# what partitions are associated with the raid array?
ls /dev/md3*

# if that command returned '/dev/md3p3', the following should give the
# partition type
df -T /dev/md3p3

You might have to detach from guest VMs, if that’s relevant, before you get a proper partition read. Also, it only worked for me after I mounted the drive again on the host. So, uh, good luck.
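
If mounting the array just to read the filesystem type is a hassle, blkid (or lsblk -f) can usually report it straight from the unmounted device; point it at whatever 'ls /dev/md3*' returned:

# report the filesystem type without mounting
sudo blkid /dev/md3p3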

Choose whichever of the following matches your EXT partition type. (If you’re not using EXT_, I’m sorry for you and I have no suggestions. Perhaps a new career?)

# if the filesystem lives on a partition (e.g. the /dev/md3p3 from above)
# rather than directly on /dev/md3, point these at that partition instead
sudo fsck.ext4 -f /dev/md3
sudo fsck.ext3 -f /dev/md3

Let it fix any issues, then resize the filesystem so it expands to fill the newly available space (again, target the partition instead if that’s where the filesystem lives).

sudo resize2fs /dev/md3

At this point, you should be done!
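
Once the filesystem is mounted again (or handed back to its VM), a quick size check confirms the extra capacity actually showed up. The mount point here is hypothetical - use wherever /dev/md3 normally lives on your system.

# '/mnt/raid' is a made-up example mount point
df -h /mnt/raid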


OpenBSD Router, 2nd draft

Welcome back to the thunderdome: a one-stop-shop to get your brain thoroughly bashed with incomprehensible ideas and ineffable concepts.

That is to say, welcome back to the OpenBSD router discussion. This is the short version: a bit of interstitial text, but mostly just a place to put all the separate config files and scripts that need to be synchronized.

We assume this is a fresh installation of OpenBSD 6.9 on a dedicated physical box which has separate ethernet ports for each subnet and the egress point.

When each of the following pieces has been added, reboot the router. That’s the simplest way to get everything started properly.

network design

                      0.0.0.0/0
+-----------------------------+
| open internet               |
+-+---------------------------+
  |
  |
  |                 123.12.23.2
+-+---------------------------+
| ISP-assigned router         |
| device: em2                 |
+-+---------------------------+
  |
  |
  |               192.168.1.102
+-+---------------------------+
| firewall / router           |
| (OpenBSD)                   |
+-+--+------------------------+
  |  |
  |  |
  |  |                 10.0.1.1/24   +------------+-----------------------+
  |  |  +------------------------+   |            |                       |
  |  +--+ (insecure network)     +---+   +--------+---------+  +----------+----------+
  |     | device: em1            |       |wifi access points|  |laptops, servers, etc|
  |     +------------------------+       +------------------+  +---------------------+
  |
  |
  |                    10.0.0.1/24   +------------+-----------------------+
  |     +------------------------+   |            |                       |
  +-----+ (secure network)       +---+   +--------+---------+  +----------+----------+
        | device: em0            |       |wifi access points|  |laptops, servers, etc|
        +------------------------+       +------------------+  +---------------------+

Note, depending on the OpenBSD firewall/router box configuration, there can also be virtual machines inside that box which are connected to one or more of the subnets.

The router will have an IP address on each of the subnets; this is referred to as the ‘gateway address’, since the router is literally a gateway out of the sub-network. That IP address is usually the first in the available range: i.e., 10.0.0.1 and 10.0.1.1 in the two subnets above.

I like having the wireless clients on the same subnet as the wired clients, but that’s my use case. It can be nearly as easy to put all the wifi APs on their own subnet.

router construction

Allow packets to be forwarded between network devices on the OpenBSD machine.

echo 'net.inet.ip.forwarding=1' > /etc/sysctl.conf
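
The file only takes effect at boot; to flip the switch in the running kernel right away (optional, since we reboot at the end anyway):

sysctl net.inet.ip.forwarding=1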

Request an IP address using DHCP for the network device port connected to the open internet (em2).

echo 'dhcp' > /etc/hostname.em2

It should (IIRC) now be possible to get online by restarting the system network connections with the new configuration.

sh /etc/netstart
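
A quick way to check whether em2 actually picked up a lease is to look at the interface status and address:

ifconfig em2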

Define the subnets provided by each network device port.

echo 'inet 10.0.0.1 255.255.255.0 10.0.0.255 description   "secure network"' > /etc/hostname.em0
echo 'inet 10.0.1.1 255.255.255.0 10.0.1.255 description "insecure network"' > /etc/hostname.em1
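
These, too, only apply at boot; to bring the two internal interfaces up with their new addresses immediately:

sh /etc/netstart em0 em1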

DHCP (IP address assignment)

Enable the dhcpd daemon (so it starts at boot), and tell it which network device ports need DHCP service.

rcctl enable dhcpd
rcctl set dhcpd flags em0 em1

Modify the dhcpd config file.

/etc/dhcpd.conf

option domain-name "ninthiteration.lab";

subnet 10.0.0.0 netmask 255.255.255.0 {
        option routers 10.0.0.1;
        option domain-name-servers 10.0.0.1;
        range 10.0.0.10 10.0.0.254;
}
subnet 10.0.1.0 netmask 255.255.255.0 {
        option routers 10.0.1.1;
        option domain-name-servers 10.0.1.1;
        range 10.0.1.10 10.0.1.254;
}
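
The daemon will come up on its own at the final reboot, but you can start it now and confirm it’s actually running:

rcctl start dhcpd
rcctl check dhcpd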

PF (firewall)

An exhaustive (believe me, it was exhausting to make) explanation of the contents of this file is in the longer version of this post.

/etc/pf.conf

secure   = "em0"
insecure = "em1"

table <martians> { 0.0.0.0/8 10.0.0.0/8 127.0.0.0/8 169.254.0.0/16     \
                   172.16.0.0/12 192.0.0.0/24 192.0.2.0/24 224.0.0.0/3 \
                   192.168.0.0/16 198.18.0.0/15 198.51.100.0/24        \
                   203.0.113.0/24 }

set block-policy drop
set loginterface egress
set skip on lo0

match in all scrub (no-df random-id max-mss 1440)
match out on egress inet from !(egress:network) to any nat-to (egress:0)

antispoof quick for { egress $secure $insecure }
block in quick on egress from <martians> to any
block return out quick on egress from any to <martians>
block all

pass out quick inet
pass in on { $secure $insecure } inet
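
pf will load this file on reboot, but it’s worth a syntax check now, and you can load the new ruleset immediately if you like:

# -n parses the ruleset without loading it
pfctl -nf /etc/pf.conf

# load it for real
pfctl -f /etc/pf.conf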

DNS

Enable the unbound DNS daemon.

rcctl enable unbound

The unbound daemon puts itself in a chroot after starting, so its config file is buried a little deeper than the others.

/var/unbound/etc/unbound.conf

server:
    interface: 10.0.0.1
    interface: 10.0.1.1
    interface: 127.0.0.1

    access-control: 0.0.0.0/0   refuse
    access-control: 10.0.0.0/24 allow
    access-control: 10.0.1.0/24 allow
    do-not-query-localhost: no
    hide-identity: yes
    hide-version: yes

forward-zone:
        name: "."
        forward-addr: 192.168.43.79  # IP of the upstream resolver
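
Same idea here: make sure the config parses, then start the daemon rather than waiting for the reboot (unbound-checkconf is in the base system; pointing it at the chroot’d config path explicitly keeps it honest):

unbound-checkconf /var/unbound/etc/unbound.conf
rcctl start unbound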

Nameserver

We want the router to use its own unbound DNS cache.

/etc/resolv.conf

nameserver 127.0.0.1
nameserver 192.168.43.79
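
A quick sanity check that the local cache actually answers queries (dig ships in the OpenBSD base system these days):

dig @127.0.0.1 openbsd.org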

Problem is, this file gets overwritten every time dhclient runs, because the DHCP lease on em2 includes the ISP router’s choice of DNS servers, and dhclient dutifully writes those into resolv.conf. We run our own resolver and want to keep using it instead.

/etc/dhclient.conf

interface "em2" { 
    ignore domain-name-servers; 
}
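
To make dhclient re-request the lease with the new setting in place (instead of waiting for the final reboot), kick the interface one more time:

sh /etc/netstart em2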