"synchronisation lost" errors for Solaris NTP server

solaris_1977 · December 10, 2019, 3:07pm

Hi,

This is a Solaris 9 box that serves as the NTP server for many Unix clients. On the backend, it syncs time with three GPS clocks. For the past few days, I have been noticing time resets of about 1 second. Is this a problem?
I assumed that a network issue or a GPS clock connectivity issue would make it lose sync with only one device. But I see it reporting "synchronisation lost" for all three devices.

ntp-serv10 # ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*sea-gps-clock1. .GPS.            1 u  358 1024  377     1.54   -0.904    0.64
 172.28.42.204   .GPS.            1 u   30  512  270    40.76    0.163 16000.0
+172.28.34.204   .GPS.            1 u 1315 1024  376    77.07   -0.799    7.93
ntp-serv10 # cat /var/adm/messages | grep ntp | tail -10
Dec  8 20:45:46 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.005217 s
Dec  8 20:45:46 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec  8 20:49:24 ntp-serv10 snmptrapd[15131]: [ID 702911 daemon.warning] localhost [UDP: [127.0.0.1]:-31114]: Trap , DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (893346033) 103 days, 9:31:00.33, SNMPv2-MIB::snmpTrapOID.0 = OID: DISMAN-EVENT-MIB::mteTriggerFired, DISMAN-EVENT-MIB::mteHotTrigger = STRING: status exec ntp, DISMAN-EVENT-MIB::mteHotTargetName = STRING: , DISMAN-EVENT-MIB::mteHotContextName = STRING: , DISMAN-EVENT-MIB::mteHotOID = OID: UCD-SNMP-MIB::extResult.5, DISMAN-EVENT-MIB::mteHotValue = INTEGER: 1, UCD-SNMP-MIB::extNames.5 = STRING: ntpcheck, UCD-SNMP-MIB::extOutput.5 = STRING: PROBLEM: NTP is not synchronized to peer
Dec  8 20:50:06 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
Dec  9 13:08:02 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) 0.998904 s
Dec  9 13:08:02 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec  9 13:13:15 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 172.28.34.204, stratum=1
Dec  9 13:13:14 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003379 s
Dec  9 13:13:14 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec  9 13:17:34 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
ntp-serv10 #

Please advise.

Thanks

Neo · December 10, 2019, 9:09pm

I don't think it is a big deal.

But, if I were you, I would install chrony (it just takes a minute or two) and compare the results.

jim_mcnamara · December 10, 2019, 9:11pm

Are these remote devices GNSS? Why you need 3 evades me. Some specialized clocks will drop connections that act to hog resources.

And why do you need to ping those NTP servers so often? That may be why you are getting dropped.... assuming it did work previously.

solaris_1977 · December 11, 2019, 1:35am

These are GPS clocks, some kind of hardware device. I am not sure whether they are GNSS; they are maintained by another team, but I can get more information about them.

So far, I haven't noticed any issue on any server or on our internal devices, and neither the app nor the DB team has reported anything.
The messages in my first post are from /var/adm/messages. The monitoring team has set up ticket creation based on these kinds of alerts, so there is a little noise from management about why synchronization to the GPS devices is lost and why the time drifts back by around 1 second. I am just trying to find these answers.

Another thing I noticed is that poll says 1024. Does that mean 1024 seconds without guidance, and so slow sync, slow adjustments, etc.? Am I understanding it correctly? If yes, would a "minpoll 4 maxpoll 8" entry in ntp.conf for all 3 GPS devices help?

MadeInGermany · December 11, 2019, 12:28pm

No, poll is the poll interval.
When there is a good and reliable peer for a long time then ntpd will double the poll interval.

So a small poll interval means there is high dispersion(=jitter).
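
To make the numbers concrete: the minpoll and maxpoll values in ntp.conf are exponents of two, not seconds. Below is a minimal Python sketch of the conversion. The helper name poll_seconds is made up for illustration, and the defaults of 6 and 10 are those of the reference ntpd; I am assuming, not verifying, that this xntpd v3 build uses the same ones.

```python
def poll_seconds(exponent):
    """Convert an ntp.conf poll exponent to seconds (interval = 2**exponent)."""
    return 2 ** exponent

# Exponents from the question ("minpoll 4 maxpoll 8") plus the assumed defaults.
for label, exponent in [
    ("minpoll 4", 4),
    ("maxpoll 8", 8),
    ("default minpoll 6 (assumed)", 6),
    ("default maxpoll 10 (assumed)", 10),
]:
    print("%s: %d s" % (label, poll_seconds(exponent)))
```

So "minpoll 4 maxpoll 8" would mean polling between every 16 s and every 256 s, and the 1024 in the poll column corresponds to an exponent of 10.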

Your bad device is the 172.28.42.204 that is still at the initial 16000 dispersion.
Please test connectivity with

ping -s 172.28.42.204 1400 100

The default values in ntp.conf are okay.
You should be more worried about security, and add a restriction like

restrict default notrap nomodify nopeer noquery
restrict 127.0.0.1

(As a quick alternative to a replacement of ntpd with chronyd.)

solaris_1977 · December 11, 2019, 1:36pm

Thanks for the explanation.

ntp-serv10 # ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*sea-gps-clock1. .GPS.            1 u  416 1024  377     1.54   -1.297    1.82
 172.28.42.204   .GPS.            1 u  744 1024    0    44.45    2.298 16000.0
+172.28.34.204   .GPS.            1 u  814 1024  277    77.00   -1.162    1.37
ntp-serv10 #
ntp-serv10 # ping -s 172.28.42.204 1400 100
PING 172.28.42.204: 1400 data bytes
1408 bytes from 172.28.42.204: icmp_seq=0. time=56. ms
1408 bytes from 172.28.42.204: icmp_seq=1. time=44. ms
1408 bytes from 172.28.42.204: icmp_seq=2. time=45. ms
1408 bytes from 172.28.42.204: icmp_seq=3. time=44. ms
1408 bytes from 172.28.42.204: icmp_seq=4. time=44. ms
1408 bytes from 172.28.42.204: icmp_seq=5. time=45. ms
^C
----172.28.42.204 PING Statistics----
6 packets transmitted, 6 packets received, 0% packet loss
round-trip (ms) min/avg/max = 44/46/56
ntp-serv10 #

Here is my current conf file :

How do I explain why synchronisation is being lost to all these devices? Is the time dragging back by (approx) 1 second? I see these messages from this morning too:

ntp-serv10 # cat /var/adm/messages | grep ntp | tail -10
Dec 9 13:13:15 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 172.28.34.204, stratum=1
Dec 9 13:13:14 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003379 s
Dec 9 13:13:14 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 9 13:17:34 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
Dec 11 06:19:58 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) 0.999029 s
Dec 11 06:19:58 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 11 06:24:58 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
Dec 11 06:24:57 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003025 s
Dec 11 06:24:57 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 11 06:30:18 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
ntp-serv10 #
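
One rough way to check whether the clock is really losing a second is to pull the step values out of the log and add them up. This is only a sketch: the names STEP_RE and step_offsets are made up for illustration, and the two sample lines are copied from the output above.

```python
import re

# Matches xntpd syslog lines of the form "time reset (step) -1.003025 s".
STEP_RE = re.compile(r"time reset \(step\) (-?\d+\.\d+) s")

def step_offsets(log_text):
    """Return every step offset (in seconds) found in the log text, in order."""
    return [float(m.group(1)) for m in STEP_RE.finditer(log_text)]

# Two lines copied from the /var/adm/messages output above.
sample = """\
Dec 11 06:19:58 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) 0.999029 s
Dec 11 06:24:57 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003025 s
"""

steps = step_offsets(sample)
net = sum(steps)
print("steps:", steps)
print("net change: %.6f s" % net)
```

For this pair the two steps nearly cancel (the net is about -0.004 s), which suggests the clock is stepped forward by a second and then back again about five minutes later, rather than steadily losing a second. That is only a reading of the log, not a diagnosis.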

MadeInGermany · December 11, 2019, 1:58pm

I am not sure how it behaves.
I would disable the bad peer in ntp.conf

server 192.168.70.16 prefer
#bad#server 172.28.42.204
server 172.28.34.204

driftfile /var/ntp/ntp.drift
statsdir /var/ntp/ntpstats/
filegen peerstats file peerstats type day enable
filegen loopstats file loopstats type day enable
filegen clockstats file clockstats type day enable

restrict default notrap nomodify nopeer noquery
restrict 127.0.0.1

Neo · December 12, 2019, 12:48am

You should consider installing chrony and doing a comparative test.

See this thread:

NTP synchronised problem in our Centos 7.6 node

The person having the issue (above) with ntpd decided to move to chrony due to security considerations (the right decision in my view).

In all my servers, I have disabled ntpd for the same reason (security) and I only run chrony on all servers these days.

ntpd has a very bad and buggy track record (see discussion referenced above).

PS: What version of ntpd are you currently running? I went back and reread all the posts in this thread and did not see the version mentioned.

ntpq --version

Seems to me the first question to answer is the version of ntp you are running. Lots of people (I have seen over the years) are running obsolete versions, buggy versions, flawed versions, or all of the above.

solaris_1977 · December 12, 2019, 3:10pm

It is NTP v3.

It is a production NTP server, so I am being a little more cautious before changing anything.

In the last part of the messages below, it reported the same loss again yesterday. Does this mean that the time is dragging behind by approx. 1 second and the NTP service then resets it to bring it back? Or, in the absence of any diagnostic tool (like chrony), is it difficult to say?

ntp-serv10 # pkginfo -l | grep -i ntp
   PKGINST:  SUNWntpr
      NAME:  NTP, (Root)
      DESC:  Network Time Protocol v3, NTP Daemon and Utilities (xntpd)
   PKGINST:  SUNWntpu
      NAME:  NTP, (Usr)
      DESC:  Network Time Protocol v3, NTP Daemon and Utilities (xntpd)
ntp-serv10 # pkginfo -l SUNWntpr
   PKGINST:  SUNWntpr
      NAME:  NTP, (Root)
  CATEGORY:  system
      ARCH:  sparc
   VERSION:  11.9.0,REV=2002.04.06.15.27
   BASEDIR:  /
    VENDOR:  Sun Microsystems, Inc.
      DESC:  Network Time Protocol v3, NTP Daemon and Utilities (xntpd)
    PSTAMP:  crash20020406153653
  INSTDATE:  Sep 20 2006 17:11
   HOTLINE:  Please contact your local service provider
    STATUS:  completely installed
     FILES:       17 installed pathnames
                   8 shared pathnames
                   4 linked files
                  10 directories
                   1 executables
                   9 blocks used (approx)

ntp-serv10 # pkginfo -l SUNWntpu
   PKGINST:  SUNWntpu
      NAME:  NTP, (Usr)
  CATEGORY:  system
      ARCH:  sparc
   VERSION:  11.9.0,REV=2002.04.06.15.27
   BASEDIR:  /
    VENDOR:  Sun Microsystems, Inc.
      DESC:  Network Time Protocol v3, NTP Daemon and Utilities (xntpd)
    PSTAMP:  leo20040603152123
  INSTDATE:  Sep 20 2006 17:11
   HOTLINE:  Please contact your local service provider
    STATUS:  completely installed
     FILES:        9 installed pathnames
                   4 shared pathnames
                   4 directories
                   5 executables
                 938 blocks used (approx)

ntp-serv10 # cat /var/adm/messages | grep -i ntp
Dec 11 11:54:33 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
Dec 11 14:52:01 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) 0.999041 s
Dec 11 14:52:01 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 11 14:56:54 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 172.28.34.204, stratum=1
Dec 11 14:56:53 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003380 s
Dec 11 14:56:53 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 11 15:01:34 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1

Neo · December 13, 2019, 7:41am

That is all the more reason to move to chrony. Production servers should run software that is less vulnerable.

See the many NTP security vulnerabilities here:

https://www.cvedetails.com/vulnerability-list/vendor_id-2153/NTP.html

Having servers in production is not a good reason to run insecure code when you could be running much more secure code that works the same or better.

Also, based on my experience, there are no issues cutting over to chrony from ntpd, especially if your version of ntp is keeping time correctly; and even if it were not, chrony is designed to slowly bring system time into compliance.

See also:

NTP NTP : CVE security vulnerabilities, versions and detailed report

https://www.cvedetails.com/product/3682/NTP-NTP.html?vendor_id=2153

jlliagre · December 13, 2019, 7:31pm

NTP might be the least of the security issues here.

Running such an outdated and unpatched version of Solaris (17 years old!) in production is quite unreasonable. There are certainly hundreds of major vulnerabilities on that server. Moreover, assuming a firewall is protecting the server and NTP is the only visible service, you might have issues compiling a recent version of chrony for Solaris 9 anyway.

solaris_1977 · December 13, 2019, 8:10pm

Yes, this is an internal server, not exposed to the Internet. Only the NTP service is open to the GPS clocks.
I am planning to migrate the NTP service to RHEL 7.8, which offers better capabilities for handling and troubleshooting.
But we are in a change freeze right now, so I can't proceed until the first week of January.
My concern was more of a managerial one. The monitoring team scans the messages and, as soon as they see messages like the ones below, they create a ticket and management panics: "oh, so our NTP server is dragging time by 1 second and it can impact its 100s of clients?"

ntp-serv10 # cat /var/adm/messages | grep -i ntp | tail -10
Dec 12 17:05:55 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003699 s
Dec 12 17:05:55 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 12 17:10:31 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 172.28.34.204, stratum=1
Dec 12 17:11:16 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
Dec 13 01:39:01 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) 0.999076 s
Dec 13 01:39:01 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 13 01:43:54 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
Dec 13 01:43:53 ntp-serv10 xntpd[15247]: [ID 774427 daemon.notice] time reset (step) -1.003393 s
Dec 13 01:43:53 ntp-serv10 xntpd[15247]: [ID 204180 daemon.info] synchronisation lost
Dec 13 01:49:14 ntp-serv10 xntpd[15247]: [ID 854739 daemon.info] synchronized to 192.168.70.16, stratum=1
ntp-serv10 #

BTW, the 172.28.42.204 clock was showing disp as 16000, then it settled to 0.70 by itself, and now I see it at 16000 again:

ntp-serv10 # ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*sea-gps-clock1. .GPS.            1 u  144 1024  377     1.42   -1.026    1.54
 172.28.42.204   .GPS.            1 u  758 1024    0    40.77    0.211 16000.0
+172.28.34.204   .GPS.            1 u  397 1024  375    77.09   -0.831    0.40
ntp-serv10 #
ntp-serv10 # ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
*sea-gps-clock1. .GPS.            1 u   70 1024  377     1.56   -0.568    0.89
+172.28.42.204   .GPS.            1 u  278 1024  377    40.56   -0.500    0.70
+172.28.34.204   .GPS.            1 u  323 1024  367    79.24    0.702    0.60
ntp-serv10 #
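
For reading the reach column in these outputs: it is an 8-bit shift register printed in octal, with one bit per poll and the most recent poll in the lowest bit. A small Python sketch that expands it (the function names decode_reach and answered are made up for illustration; the values are the ones seen in this thread):

```python
def decode_reach(reach_octal):
    """Expand the octal reach value into 8 bits, oldest poll first, newest last."""
    return format(int(reach_octal, 8), "08b")

def answered(reach_octal):
    """Count how many of the last eight polls got a reply."""
    return decode_reach(reach_octal).count("1")

# reach values taken from the ntpq -p outputs in this thread
for reach in ["377", "375", "270", "0"]:
    print(reach, decode_reach(reach), answered(reach), "of 8 answered")
```

So 377 means all of the last eight polls were answered and 0 means none were, which lines up with 172.28.42.204 showing 16000 dispersion whenever its reach is 0.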

Neo · December 13, 2019, 11:36pm

Well stated.

Let me be more to the point.

It is a total waste of time to be replying to anyone who is running a 17 year old OS (with a seriously flawed and out-of-date version of NTP), which could be replaced in a day for free with a modern OS (more secure, more reliable, not seriously flawed, and doing a much better job for an NTP application).

The original poster is wasting our time, showing a lack of concern for our time, by asking us to sort out a problem on a 17 year old operating system (and not telling us beforehand the version(s) they are running), which could be replaced by any "normal" system admin in less than an hour (for free, and doing a better and more reliable job).

This is why I wish everyone here at unix.com would slow down (including myself at times) and stop answering questions from posters until the posters first describe the operating system, version numbers, etc. Some here are good at this, some of us are good at this sometimes and then forget to ask, others seem to like to bypass the "understanding" phase and just post answers without any concern for the user's OS, versions, etc.

Everyone here (including me sometimes, but not often) needs to slow down and ask people who post questions to describe the OS, version, etc. before providing "quick" answers to questions. Jumping to "answers" before having the "right understanding" is not teaching people how to solve problems, it is contributing to the problem (in my view).

Perhaps I need to change the forum rules and make this a posting requirement in 2020?

Editorial Comment:

As a side note, the reason that most computers are hacked with ransomware or other easily acquired malware (easily purchased on the dark web) is that they are running unpatched, antiquated systems and obsolete code. Every system admin, organization and company must keep their computer operating systems up-to-date, fully patched and upgraded to the latest versions. This is very basic. Do not run vulnerable, obsolete code and antiquated operating systems. Update your operating systems, update your apps, make and maintain backups (onsite and offsite). Manage your IT systems, please.

MadeInGermany · December 14, 2019, 2:28am

It was stated in post#1 that the OS is Solaris 9, and we all know it's outdated.
Later it was stated that it is not hooked to the Internet, so there is no direct threat.
It is pointless to further ride that dead horse.

The config is the same for all 3 input devices and only one goes wrong. If the fault were on the Solaris box then all 3 would be wrong, but it's only one.
I keep saying this one input device is wrong.
If there is no alert on other systems then it's perhaps because their ntpd/chronyd is more fault tolerant.

Neo · December 14, 2019, 4:30am

First, not having something "connected" to the Internet is no excuse for running obsolete code (at least to me, maybe to you it is). In my many decades of cybersecurity work, I have never seen the (bad) cybersecurity policy ... "if the host is not connected to the Internet, feel free to never upgrade obsolete code and feel free to call it 'beating a dead horse' if anyone suggests you upgrade".

FWIW, I have servers not connected to the Internet, but I keep them upgraded. Maybe I forgot to read this "it's a dead horse if not connected to the Internet policy"... LOL

So, in my view it is not a "dead horse" to encourage people to secure their systems, upgrade obsolete servers, and not run obsolete code; especially when it is trivial (and basically free) to replace.

You are free to disagree, of course; but I am free to disagree back (and I will push back).

In fact, if you run 17 year old server code and call up any company for support, the first thing they will tell you is "we do not support that version, so please upgrade and call us back when you do".

It's really basic: everyone should run servers and apps with the latest code, and if you have an NTP server which is buggy, the first thing you should do is upgrade it, not the last.

Also, we at unix.com should be encouraging people to run the latest version of all software and to ensure the code they are running is as free of defects as possible.

Feel free to disagree, of course; but don't expect me to agree with this "it's beating a dead horse to encourage people to update buggy 17 year old code" worldview. But of course, you are free to reply with any and all technical approaches you want. It's always good to have many different ideas and approaches.

Additional Info:

The security issues raised by running obsolete software are basically irrelevant to "connected to the Internet or not", as MIG and the OP have mentioned. IT security is defined (in brief) as (1) confidentiality, (2) integrity and (3) availability. You do not need a "hacker from the dark web" to have an IT security issue. Running obsolete software which is known to be buggy is a larger cause of availability issues than "hackers from the web". In fact, in my many years as a leading expert in cybersecurity, the biggest security breaches mostly / always come from "insiders" (not outside hackers). In my view, running 17 year old, known to be buggy software, is a much larger security breach by "insiders" (who permit and encourage this kind of bad configuration management) than worrying about "hackers from the scary Internet".