So recently I had temperature issues with my server; long story short – the Molex connector on my fan controller fell out and thus my server got rather warm rather quickly – oops!
Problem rectified easily; however, I wanted to add some more depth to the monitoring of my server temperatures with Opsview. To do this, I used lm_sensors to get the temperatures, which I can then turn into service checks (check the site for a blog on how to do this).
The problem I had, however, was that there were two ‘temp1’ sensors, and it wasn’t obvious what each of them was:
root@server:/media# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +36.0°C  (high = +82.0°C, crit = +100.0°C)
Core 1:       +35.0°C  (high = +82.0°C, crit = +100.0°C)
Core 2:       +39.0°C  (high = +82.0°C, crit = +100.0°C)
Core 3:       +34.0°C  (high = +82.0°C, crit = +100.0°C)

it8718-isa-0290
Adapter: ISA adapter
in0:          +1.28 V  (min = +0.00 V, max = +4.08 V)
in1:          +1.86 V  (min = +0.00 V, max = +4.08 V)
in2:          +3.25 V  (min = +0.00 V, max = +4.08 V)
+5V:          +2.88 V  (min = +0.00 V, max = +4.08 V)
in4:          +0.64 V  (min = +0.00 V, max = +4.08 V)
in5:          +0.08 V  (min = +0.00 V, max = +4.08 V)
in6:          +0.11 V  (min = +0.00 V, max = +4.08 V)
in7:          +3.07 V  (min = +0.00 V, max = +4.08 V)
Vbat:         +3.28 V
fan1:        1268 RPM  (min = 0 RPM)
fan2:           0 RPM  (min = 0 RPM)
fan3:        1962 RPM  (min = 10 RPM)
fan4:           0 RPM  (min = 10 RPM)
temp1:        +39.0°C  (low = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:        +29.0°C  (low = +127.0°C, high = +60.0°C)  sensor = thermal diode
temp3:         -2.0°C  (low = +127.0°C, high = +127.0°C)  sensor = thermistor
intrusion0:  ALARM

nouveau-pci-0100
Adapter: PCI adapter
fan1:           0 RPM
temp1:        +63.0°C  (high = +95.0°C, hyst = +3.0°C)
                       (crit = +115.0°C, hyst = +2.0°C)
                       (emerg = +130.0°C, hyst = +10.0°C)
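One quick way to see which chip each ‘temp1’ belongs to is to prefix every temperature reading with its chip name. The snippet below is only a rough sketch: `chip_temps` is a name I made up for illustration, and it assumes the usual `sensors` layout, where a chip header is a single line without a colon and continuation lines are indented.

```shell
# Prefix each temperature reading with the chip it belongs to.
# Reads `sensors` output on stdin. chip_temps is a hypothetical helper,
# not part of lm_sensors.
# Usage: sensors | chip_temps
chip_temps() {
    awk -F':' '
        /^[ \t]/ || $0 == "" { next }   # skip continuation lines and blanks
        NF == 1 { chip = $0; next }     # chip header: a line with no colon
        $1 == "Adapter" { next }        # skip the adapter description
        {
            split($2, parts, " ")       # first token after the label is the reading
            if (parts[1] ~ /C$/)        # keep only readings that end in C
                print chip ": " $1 " = " parts[1]
        }
    '
}
```

Piping the output above through it should list the two ‘temp1’ readings under their own chips, it8718-isa-0290 and nouveau-pci-0100, which makes the clash easy to spot.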
It turns out renaming these sensors is rather easy! First, copy the name of the chip that the sensors belong to. In my case, I wanted to relabel ‘temp1’ and ‘temp2’ on it8718-isa-0290 as my DIMM1 and DIMM2 temperatures. To do this, I added a new file in /etc/sensors.d/ called ‘mobo’ (you can call it anything you like), and in it I added the following lines:
root@server:/media# cat /etc/sensors.d/mobo
chip "it8718-isa-0290"
    label temp1 "DIMM1 Temperature"
    label temp2 "DIMM2 Temperature"
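If you have a few chips to relabel, typing these stanzas by hand gets tedious. Here is a small sketch that prints one for you; `label_stanza` is a hypothetical helper of my own, not something that ships with lm_sensors, and it only generates the `chip` and `label` lines shown above.

```shell
# Print a sensors.d stanza that relabels features on one chip.
# label_stanza is a hypothetical helper, not part of lm_sensors.
# Usage: label_stanza CHIP FEATURE=LABEL [FEATURE=LABEL ...]
# Example (as root):
#   label_stanza it8718-isa-0290 "temp1=DIMM1 Temperature" > /etc/sensors.d/mobo
label_stanza() {
    chip=$1
    shift
    printf 'chip "%s"\n' "$chip"
    for pair in "$@"; do
        feature=${pair%%=*}   # text before the first "="
        label=${pair#*=}      # text after the first "="
        printf '    label %s "%s"\n' "$feature" "$label"
    done
}
```

It writes to stdout on purpose, so you can eyeball the result before redirecting it into /etc/sensors.d/.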
Now, when I run ‘sensors’ I get the new labels in the output:
it8718-isa-0290
Adapter: ISA adapter
in0:                +1.28 V  (min = +0.00 V, max = +4.08 V)
in1:                +1.86 V  (min = +0.00 V, max = +4.08 V)
in2:                +3.25 V  (min = +0.00 V, max = +4.08 V)
+5V:                +2.88 V  (min = +0.00 V, max = +4.08 V)
in4:                +0.64 V  (min = +0.00 V, max = +4.08 V)
in5:                +0.08 V  (min = +0.00 V, max = +4.08 V)
in6:                +0.11 V  (min = +0.00 V, max = +4.08 V)
in7:                +3.07 V  (min = +0.00 V, max = +4.08 V)
Vbat:               +3.28 V
fan1:              1268 RPM  (min = 0 RPM)
fan2:                 0 RPM  (min = 0 RPM)
fan3:              1962 RPM  (min = 10 RPM)
fan4:                 0 RPM  (min = 10 RPM)
DIMM1 Temperature:  +39.0°C  (low = +127.0°C, high = +127.0°C)  sensor = thermistor
DIMM2 Temperature:  +29.0°C  (low = +127.0°C, high = +60.0°C)  sensor = thermal diode
temp3:               -2.0°C  (low = +127.0°C, high = +127.0°C)  sensor = thermistor
And that’s that – now when I run my checks via Opsview I can be sure I’m getting the temperatures from my DIMMs and not from a northbridge sensor or something else:
root@server:/home/sam# sudo /usr/local/nagios/libexec/check_lm_sensors-3.1.1/check_lm_sensors --sanitize --high DIMM1Temperature=70,85
LM_SENSORS OK - DIMM1Temperature=39.0|DIMM1Temperature=39.0;70;85;;
root@server:/home/sam# sudo /usr/local/nagios/libexec/check_lm_sensors-3.1.1/check_lm_sensors --sanitize --high DIMM2Temperature=70,85
LM_SENSORS OK - DIMM2Temperature=29.0|DIMM2Temperature=29.0;70;85;;
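The 70,85 pair is the warning and critical threshold, which is why the performance data after the pipe reads 39.0;70;85;; (value, then warning, then critical). To show how that maps onto the usual Nagios states, here is a rough sketch of the comparison. This is not the plugin’s own code: `temp_status` is a name I made up, and I’m assuming a reading that is equal to a threshold already counts as breaching it.

```shell
# Map a reading onto the standard Nagios plugin states:
# exit 0 = OK, exit 1 = WARNING, exit 2 = CRITICAL.
# temp_status is a hypothetical helper, not part of check_lm_sensors.
# Assumption: value >= threshold counts as a breach.
# Usage: temp_status VALUE WARN CRIT
temp_status() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v >= c) { print "CRITICAL"; exit 2 }
        if (v >= w) { print "WARNING";  exit 1 }
        print "OK"
        exit 0
    }'
}
```

With my DIMM1 reading of 39.0 and thresholds of 70 and 85, this prints OK and returns 0, matching what the check reports above.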
Cool, eh?