In a RAID array I have, I periodically lost a drive here and there over the past several months. I was always able to re-add and resync without losing data. However, at some point it looks like some minor corruption crept in, and that makes DRBD unhappy.
Using fsck did not help either.
Dec 19 06:01:45 storageboxtest4 kernel: [19005.945890] EXT3-fs error (device drbd0): ext3_get_inode_loc: unable to read inode block - inode=22184379........
Before getting into the output, here is my typical experience with SMART: there is what I call a "bad disk", with pending and uncorrectable sectors that cannot be reallocated.
It has repeatedly caused kernel panics and system crashes, as we can see from the logs.
Yet SMART says it has "PASSED" its self-assessment. SMART is still useful to me, but mainly for looking at Current_Pending_Sector.
Any time I have had anything but 0 for that attribute it........
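As a rough illustration of what I mean by "looking at Current_Pending_Sector" (assuming smartmontools is installed and the disk is /dev/sda, both of which are placeholders):
smartctl -A /dev/sda | egrep 'Current_Pending_Sector|Offline_Uncorrectable'   # anything other than 0 here worries me far more than the overall PASSED verdict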
Uh oh
[17925926.174277] block drbd0: Handshake successful: Agreed network protocol version 96
[17925926.174325] block drbd0: conn( WFConnection -> WFReportParams )
[17925926.174342] block drbd0: Starting asender thread (from drbd0_receiver [1682])
[17925926.174432] block drbd0: data-integrity-alg:
[17925926.174581] block drbd0: drbd_sync_handshake:
[17925926.174586] block drbd0: self 2AAE66AF9252D6DB:2815BF........
Every time I've seen this error "/dev/drbd0: State change failed: (-2) Need access to UpToDate data" it has been because DRBD has no disk attached:
cat /proc/drbd
version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by root@sighted, 2012-10-09 12:47:51
0: cs:Connected ro:Secondary/Secondary ds:Diskless/Inconsistent A r-----
ns:0 nr:0 dw:0 dr:0 al........
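If the backing device is actually still present and healthy, reattaching it is often all that is needed; a minimal sketch, assuming the resource is named r0 (yours may differ):
drbdadm attach r0
cat /proc/drbd   # ds: should move from Diskless to something like Inconsistent or UpToDate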
Tired of checking iotop and seeing that your drbd partition is using 99.99% of IO all the time, and finding that your drbd device performs slowly in general?
This is especially an issue with versions of DRBD in the 8.3 tree; one documented case is on 8.3.13, but it likely applies to other versions as well.
The symptoms are that resyncing is fine and normal, but any reasonable amount of activity is very slow, lags badly, and creates a high server load and con........
I have not found the root cause of this yet, but it seems that drbd and ext4 may not play well together; I still have to confirm that.
In any case, an older DRBD setup with older hard drives shows little to no iowait, and the main difference is that its drbd partition is ext3 rather than ext4. I will experiment and see whether that fixes this; then we will know whether DRBD and ext4 have issues.........
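For comparing the two setups I mainly mean watching iowait and per-device utilization; a rough sketch, assuming sysstat and iotop are installed:
iostat -x 1   # extended per-device stats every second; compare %util and await on the DRBD backing disks
iotop -o      # only show processes that are actually doing IO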
drbd 8.3 hard drive failure recovery
drbdadm attach r0
DRBD module version: 8.3.10
userland version: 8.3.8
you should upgrade your drbd tools!
0: Failure: (119) No valid meta-data signature found.
==> Use 'drbdadm create-md res' to initialize meta-data area. ........
This is what fixed it:
[root@box13 ~]# dd if=/dev/zero of=/dev/md160 bs=512 count=500
Basically you need to wipe out more than just the 512-byte partition table; 512 bytes * 500 (about 250 KB) is more than enough to make DRBD happy and treat the partition as empty.
The reason this happens is that DRBD gets confused when the device you are using still holds data from a previous partition.
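Putting the recovery together, the sequence looks roughly like this; r0 and /dev/md160 are from my box, so substitute your own resource and backing device:
dd if=/dev/zero of=/dev/md160 bs=512 count=500   # wipe any old signatures at the start of the device
drbdadm create-md r0                             # write fresh DRBD meta-data
drbdadm attach r0                                # attach the backing device to the resource
cat /proc/drbd                                   # confirm it is no longer Diskless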
[root@box13 ~]# d........
I used the matching 8.3.13 utilities and it didn't work, but strangely the newer 8.3.16 userland, which makes DRBD complain about the version mismatch, works just fine.
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by root@sighted, 2012-10-09 12:47:51
0: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent A r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:5236960
On primary node
drbdadm connect all
On secondary node
drbdadm -- --discard-my-data connect all
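If the nodes are stuck in StandAlone and refuse to connect at all, the slightly longer sequence I usually end up running looks something like this (still using "all" as above; whether you need the demote/disconnect steps depends on the state you are in):
On the node whose data will be thrown away:
drbdadm secondary all
drbdadm disconnect all
drbdadm -- --discard-my-data connect all
On the node whose data survives:
drbdadm connect all
cat /proc/drbd   # the discarded side should show up as SyncTarget while it catches up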
........
pxe-32 tftp open timeout
The solution was to enable tftp in xinetd with "chkconfig tftp on".
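Roughly, on a CentOS-style box (the usual service names, but treat this as a sketch):
chkconfig tftp on        # enable the xinetd-managed tftp service
service xinetd restart   # make sure xinetd picks up the change
chkconfig --list tftp    # verify it now reports "on"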
See the troubleshooting below:
chkconfig --list
NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
acpid 0:off&n........
This assumes that you've at least created the correct partition for your DRBD already.
Notice that I am "Diskless"; that's because either your DRBD backing partition doesn't exist or has been renamed (e.g. sdb becomes sda when sdb dies and you reboot), or because that drive really is dead or gone.
*If you need to permanently change the partition/device for your resource, be sure to edit /etc/drbd.conf on both hosts and reload the config.
(replace r0 with........
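As a sketch of the sort of edit I mean (the resource name, hostnames, IPs, and device paths below are all examples, not your values):
resource r0 {
  on box13 {
    device    /dev/drbd0;
    disk      /dev/sdb1;        # point this at the new backing partition
    address   10.0.0.13:7789;
    meta-disk internal;
  }
  on box14 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.14:7789;
    meta-disk internal;
  }
}
Then on both hosts:
drbdadm adjust r0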
CPU/Kernel/MB/RAID problem?
Jan 5 12:45:05 testbox kernel: [653298.890004] BUG: soft lockup - CPU#0 stuck for 61s! [hal-acl-tool:4168]
Jan 5 12:45:05 testbox kernel: [653298.890005] Modules linked in: vmnet vmci vmmon binfmt_misc drbd video output input_polldev ocfs2_stackglue ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs k8temp hwmon_vid lp snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi........
I've only used it on CentOS, so I thought I'd make a quick Debian guide:
Install the DRBD Package
apt-get install drbd8-utils
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
libswfdec-0.8-0
Use 'apt-get autoremove' to remove them.
The following........
I separated the 2 drives in the RAID 1 array.
One is the old drive, /dev/sda, which is out of date; the other, /dev/sdc, was attached elsewhere, mounted, and used, so it holds more recent data (updated).
I wonder how mdadm will handle this:
usb-storage: device scan complete
md: md127 stopped.
md: bind
md: md127: raid array is not clean -- starting background reconstruction
raid1: raid set md127 active with 1 out of 2 m........
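Before letting it rebuild, I would compare the metadata on both halves to see which one md considers newer; a sketch, assuming the members are sda1 and sdc1:
mdadm --examine /dev/sda1 /dev/sdc1 | egrep 'Update Time|Events'   # the higher event count / later update time is the fresher half
cat /proc/mdstat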
This has stumped me a few times because I keep forgetting that CentOS 5.5 ships with a default iptables configuration that ends up blocking DRBD traffic. I tried all the normal things and couldn't understand why my usual DRBD config wouldn't work. So if you have WFConnection problems and have tried the normal "mailing list" fixes, check your firewall status first!
Both Nodes Say the Following:
version: 8.3.8 (api:88/prot........
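For reference, "check your firewall" means something along these lines (7789 is just the port my resource listens on; use whatever your drbd.conf says), or simply stopping iptables while you test:
iptables -I INPUT -p tcp --dport 7789 -j ACCEPT   # allow the DRBD replication port
service iptables save                             # persist the rule on CentOS 5
service iptables stop                             # or temporarily disable the firewall to confirm it is the culprit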
heartbeat is stopped for some reason
Anyway, hnode2 was active and the services are running fine, but I see heartbeat has been stopped somehow.
Here is the last log I see of heartbeat:
[quote]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 9/1762471 ms age 0 [pid16738/MST_CONTROL]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 716/51784021 152624/74519 [pid16738/MST_CONTROL]
Sep 9 17:15:32........
Clustering Links
I thought this might be interesting for people with spare time.
[b]Great clustering article from Linux Mag[/b]
http://www.linux-mag.com/2003-11/clusters_01.html
[b]General Linux cluster information[/b]
http://www.gdargaud.net/Hack/ClusterNotes.html#HighA
http://www.faqs.org/docs/Linux-HOWTO/Cluster-HOWTO.html#s3
http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSys........
The results are still not flattering and are nothing close to native performance. Unless GlusterFS has a "DRBD-like" option to delay writes over the network and to only read from the client side, I don't see how performance can ever improve much more.
After doing some client optimizations I added more to the score:
Start Benchmark Run: Sun Nov 29 00:37:44 PST 2009
00:37:44 up 3 min, 1 user, load average: 0.01........
I've tried to find a good, sensible solution to cluster with. Each technology has its pros and cons, there is no perfect solution, and I've found a lot of exaggeration about the applications, benefits, and performance of these different filesystems.
DRBD
I first started off with DRBD, and I have to say it does live up to the hype and is quite reliable (although it can be annoying to match up the kernel module and the userland tools, since they must match and whe........
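A quick way to check whether the module and the tools line up is to compare what each reports (drbdadm will also warn you itself, as in the "you should upgrade your drbd tools!" message shown earlier):
cat /proc/drbd   # the first line shows the kernel module version
drbdadm -V       # prints the userland (and module) version, if your drbdadm supports it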