heartbeat has stopped for some reason

Anyway, hnode2 was the active node and its services are still running fine, but I see that heartbeat has stopped somehow.

Here are the last heartbeat log entries I see:

[quote:23c84415f5]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 9/1762471 ms age 0 [pid16738/MST_CONTROL]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 716/51784021 152624/74519 [pid16738/MST_CONTROL]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 200276 total malloc bytes. pid [16738/MST_CONTROL]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/14 ms age 405180540 [pid16741/HBFIFO]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 321/581 30772/13815 [pid16741/HBFIFO]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 32600 total malloc bytes. pid [16741/HBFIFO]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373810 [pid16742/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 340/657021 33264/15511 [pid16742/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 42008 total malloc bytes. pid [16742/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373820 [pid16743/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 340/394 25136/11458 [pid16743/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 25220 total malloc bytes. pid [16743/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373820 [pid16744/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 352/657052 34784/16543 [pid16744/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 43528 total malloc bytes. pid [16744/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373820 [pid16745/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 353/1244439 34868/16587 [pid16745/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 35812 total malloc bytes. pid [16745/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373820 [pid16746/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 364/657082 36304/17575 [pid16746/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 44840 total malloc bytes. pid [16746/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373830 [pid16747/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 364/454 28176/13522 [pid16747/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 36472 total malloc bytes. pid [16747/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373850 [pid16748/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 376/657112 37824/18607 [pid16748/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 46360 total malloc bytes. pid [16748/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/0 ms age 603373850 [pid16749/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 376/484 29696/14554 [pid16749/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 37992 total malloc bytes. pid [16749/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/1140417 ms age 40 [pid16750/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 388/30411871 39344/19639 [pid16750/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 51588 total malloc bytes. pid [16750/HBWRITE]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: MSG stats: 0/518348 ms age 30 [pid16751/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: cl_malloc stats: 389/10885817 39428/19683 [pid16751/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: RealMalloc stats: 40928 total malloc bytes. pid [16751/HBREAD]
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: Current arena value: 0
Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: These are nothing to worry about.
[/quote:23c84415f5]
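To narrow down when heartbeat actually went quiet, it can help to scan syslog for the last line a given program wrote. The Python sketch below is a minimal example under a few assumptions: the log uses the classic syslog layout seen in the quote above (month, day, time, host, program), it lives at a distribution-dependent path such as /var/log/messages, and the year has to be supplied by the caller because syslog omits it. `last_entry` is a hypothetical helper, not part of heartbeat.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Classic syslog prefix, e.g. "Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: ..."
# Assumption: your syslog daemon uses this layout; adjust the pattern if not.
SYSLOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "  # month, day, time
    r"(?P<host>\S+) "                                      # host name
    r"(?P<prog>[^:\[\s]+)"                                 # program, stops at ':' or '[pid]'
)


def last_entry(lines: Iterable[str], program: str = "heartbeat",
               year: int = 2008) -> Optional[datetime]:
    """Return the timestamp of the last line written by `program`, or None.

    `year` is an assumed value supplied by the caller, since syslog omits it.
    """
    last = None
    for line in lines:
        m = SYSLOG_RE.match(line)
        if m and m.group("prog") == program:
            # Collapse padded days ("Sep  9") and prepend the missing year.
            ts = " ".join(m.group("ts").split())
            last = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return last
```

Called as `last_entry(open("/var/log/messages"))`, it would report Sep 9 17:15:32 for the log quoted above, provided no later heartbeat lines exist in that file. Passing `program="yum"` instead lets you line up other events, such as a package install, against that time.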


Now when I start heartbeat again I get these messages (I have seen them before). What alternative do I have? The services are running on hnode2, and I can't just unmount and stop OpenVZ in a production environment.

I also never stopped heartbeat myself. It was clearly running fine until early this evening (5:12 PM), and it died at some point between then and now (11:48 PM).
[quote:23c84415f5]Starting High-Availability services:
2008/09/09_23:45:21 CRITICAL: Resource drbddisk::r0 is active, and should not be!
2008/09/09_23:45:21 CRITICAL: Non-idle resources can affect data integrity!
2008/09/09_23:45:21 info: If you don't know what this means, then get help!
2008/09/09_23:45:21 info: Read the docs and/or source to /usr/share/heartbeat/ResourceManager for more details.
CRITICAL: Resource drbddisk::r0 is active, and should not be!
CRITICAL: Non-idle resources can affect data integrity!
info: If you don't know what this means, then get help!
info: Read the docs and/or the source to /usr/share/heartbeat/ResourceManager for more details.
2008/09/09_23:45:21 CRITICAL: Non-idle resources will affect resource takeback!
2008/09/09_23:45:21 CRITICAL: Non-idle resources may affect data integrity!
[ OK ]
[/quote:23c84415f5]

After I restarted heartbeat on hnode2, hnode1 logged these complaints:

[quote:d10093f29f]Sep 9 23:44:50 hnode1 heartbeat: [31055]: info: Heartbeat restart on node hnode2.ca
Sep 9 23:44:50 hnode1 heartbeat: [31055]: info: Status update for node hnode2.ca: status init
Sep 9 23:44:50 hnode1 heartbeat: [31055]: info: Status update for node hnode2.ca: status up
Sep 9 23:44:51 hnode1 heartbeat: [31055]: info: Status update for node hnode2.ca: status active
Sep 9 23:44:51 hnode1 heartbeat: [31055]: ERROR: should_drop_message: attempted replay attack [hnode2.ca]? [gen = 1220406280, curgen = 1220406281]
Sep 9 23:44:51 hnode1 heartbeat: [31055]: info: remote resource transition completed.
Sep 9 23:44:51 hnode1 heartbeat: [31055]: ERROR: No one owns our local resources!
Sep 9 23:44:51 hnode1 heartbeat: [31055]: ERROR: No one owns our local resources!
Sep 9 23:44:51 hnode1 heartbeat: [31055]: ERROR: should_drop_message: attempted replay attack [hnode2.ca]? [gen = 1220406280, curgen = 1220406281]
Sep 9 23:44:55 hnode1 heartbeat: [31055]: ERROR: should_drop_message: attempted replay attack [hnode2.ca]? [gen = 1220406280, curgen = 1220406281][/quote:d10093f29f]
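The `should_drop_message: attempted replay attack` errors compare two numbers: the generation stamped on the incoming message from hnode2.ca (`gen`) and, presumably, the generation hnode1 already has on record for that node (`curgen`). Here the incoming value is one lower than the recorded one. The Python sketch below is a deliberately simplified model of that kind of check, inferred from the log message rather than taken from heartbeat's source; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GenerationTracker:
    """Simplified model of a per-peer generation check (not heartbeat's real code)."""
    curgen: Dict[str, int] = field(default_factory=dict)

    def should_drop(self, node: str, gen: int) -> bool:
        known = self.curgen.get(node)
        if known is None or gen > known:
            # First message from this peer, or the peer restarted and now
            # uses a higher generation: accept it and remember the new value.
            self.curgen[node] = gen
            return False
        # Lower than what we already saw: treat as stale or replayed and drop it.
        # Equal generations are accepted here; a real implementation would
        # go on to check per-message sequence numbers.
        return gen < known
```

Feeding it the numbers from the hnode1 log: once `curgen` for hnode2.ca is 1220406281, a message stamped `gen = 1220406280` makes `should_drop` return True, so that message is discarded.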

On hnode2, shortly after 5:17 PM, I installed and ran tiobench. I wonder whether that did it?

Sep 9 17:15:32 hnode2 heartbeat: [16738]: info: These are nothing to worry about.
Sep 9 17:18:05 hnode2 python: gethostby*.getanswer: asked for "apt.sw.be IN AAAA", got type "SOA"
[b:04bcff7251]Sep 9 20:18:17 hnode2 yum: Installed: tiobench - 0.3.3-1.2.el5.rf.i386
[/b:04bcff7251]

CRITICAL: Resource drbddisk::r0 is active, and should not be!
CRITICAL: Non-idle resources can affect data integrity!
info: If you don't know what this means, then get help!
info: Read the docs and/or the source to /usr/share/heartbeat/ResourceManager for more details.
2008/09/09_23:45:21 CRITICAL: Non-idle resources will affect resource takeback!
2008/09/09_23:45:21 CRITICAL: Non-idle resources may affect data integrity!

The comment in /usr/share/heartbeat/ResourceManager that goes with these messages reads:

[quote:a0972a9e65] # What this means is that if you have a shared disk and it's already mounted
# before you start heartbeat, then you could have it mounted simultaneously
# on both sides. If this happens then your disk data is toast!
# So, this is sometimes VERY BAD INDEED!
#
[/quote:a0972a9e65]
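ResourceManager's warning amounts to asking whether a resource is already in use on this node before heartbeat tries to acquire it. For drbddisk::r0 that roughly means "is the local DRBD device already Primary?", which is the situation on hnode2 since it was running the services. A minimal Python sketch of such a pre-start check follows. It parses the text of /proc/drbd and assumes that r0 is minor 0 (/dev/drbd0) and that the DRBD 8.x line layout applies, where the role pair is labelled `st:` in older releases and `ro:` from 8.3 on. `local_role` and `safe_to_start_heartbeat` are hypothetical helpers, not part of DRBD or heartbeat.

```python
import re
from typing import Optional

# Matches device lines such as " 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---".
# Assumption: older DRBD releases label the roles "st:", 8.3 and later use "ro:".
ROLE_RE = re.compile(r"^\s*(?P<minor>\d+):.*?\b(?:st|ro):(?P<local>\w+)/(?P<peer>\w+)")


def local_role(proc_drbd_text: str, minor: int = 0) -> Optional[str]:
    """Return the local role (e.g. 'Primary') of DRBD device `minor`, or None."""
    for line in proc_drbd_text.splitlines():
        m = ROLE_RE.match(line)
        if m and int(m.group("minor")) == minor:
            return m.group("local")
    return None


def safe_to_start_heartbeat(proc_drbd_text: str, minor: int = 0) -> bool:
    """Hypothetical pre-start check: refuse if this node is already Primary."""
    return local_role(proc_drbd_text, minor) != "Primary"
```

On a live node this would be called as `local_role(open("/proc/drbd").read())`; on hnode2 in the state described above it would presumably return "Primary", which is why the start-up check complains.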

I ran tiobench again, and this time heartbeat did not die.

Worst of all, I checked the logs on hnode1 and it never seemed to notice that heartbeat on hnode2 was down.
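A small local check that does not rely on heartbeat's own peer tracking could at least flag a dead heartbeat process on hnode2. The Python sketch below scans /proc for processes named `heartbeat` by reading the `Name:` field of /proc/<pid>/status. It assumes a Linux procfs, and `find_pids` is a hypothetical helper; how the alert is raised (mail, cron output, a monitoring system) is left open.

```python
import os
from typing import List


def find_pids(name: str, proc_root: str = "/proc") -> List[int]:
    """Return sorted PIDs whose /proc/<pid>/status 'Name:' field equals `name`."""
    pids: List[int] = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no procfs here (not Linux, or unreadable)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        parts = line.split(None, 1)
                        if len(parts) == 2 and parts[1].strip() == name:
                            pids.append(int(entry))
                        break  # only the Name: line matters
        except OSError:
            continue  # process exited (or status unreadable) while scanning
    return sorted(pids)
```

For instance, a cron job could run this every few minutes on each node and complain whenever `find_pids("heartbeat")` returns an empty list.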