DX Unified Infrastructure Management

  • 1.  Issue with NAS

    Posted Jun 03, 2014 11:18 PM

    Hi,

     

    Since this morning we have been having a problem with the NAS: it fails very often. Below are the log entries recorded just before the nas goes down, which results in a production outage.

     

    Jun  3 09:27:01:287 [2388] nas: ExecEvent: Rule='windows_email' on nimid='JX44480691-40650' with ACTION:EMAIL, age:121417s, status:OK
    Jun  3 09:27:01:318 [10948] nas: dbBeginTransaction actLogRun, OK - rc:0
    Jun  3 09:27:01:318 [10948] nas: dbCommitTransaction actLogRun, OK - rc:0
    Jun  3 09:27:01:381 [12188] nas: nsLookup ip:rkat1011107sw20 => 'rkat1011107sw20.wsgc.com' in 143ms
    Jun  3 09:27:01:381 [12188] nas: getnameinfo returned hostname = wscorp-olive1v1.wsgc.com
    Jun  3 09:27:01:381 [12188] nas: nsLookup ip:10.3.0.5 => 'wscorp-olive1v1.wsgc.com' in 6ms
    Jun  3 09:27:01:381 [12188] nas: getnameinfo returned hostname = 255.255.255.255
    Jun  3 09:27:01:396 [1352] nas: RREPLY: status=OK(0) <-192.168.41.104/48002  h=37 d=16
    Jun  3 09:27:01:396 [1352] nas: Successfully attached to HUB.
    Jun  3 09:27:01:396 [2976] nas: dbBeginTransaction dbsRun, OK - rc:0
    Jun  3 09:27:01:396 [2976] nas: dbCommitTransaction dbsRun, OK - rc:0
    Jun  3 09:27:01:396 [2976] nas: dbsRun committed 2 requests. 0 remaining in queue...
    Jun  3 09:27:01:412 [12188] nas: nsLookup ip:dlpdbprdrk1 => 'dlpdbprdrk1.wsgc.com' in 17ms
    Jun  3 09:27:01:412 [12188] nas: getnameinfo returned hostname = 255.255.255.255
    Jun  3 09:27:01:428 [2976] nas: dbBeginTransaction dbsRun, OK - rc:0
    Jun  3 09:27:01:428 [2976] nas: dbCommitTransaction dbsRun, OK - rc:0
    Jun  3 09:27:01:428 [2976] nas: dbsRun committed 1 requests. 0 remaining in queue...
    Jun  3 09:27:03:259 [1352] nas: RREQUEST: hubpost <-192.168.41.104/48002  h=244 d=736
    Jun  3 09:27:03:259 [1352] nas: SREQUEST: license_info ->192.168.41.104/48002
    Jun  3 09:27:03:259 [1352] nas: RREPLY: status=OK(0) <-192.168.41.104/48002  h=37 d=208
    Jun  3 09:27:03:259 [1352] nas: sockClose:00000000025DEFE0:192.168.41.104/60703
    Jun  3 09:27:03:259 [1352] nas: SREQUEST: _close ->192.168.41.104/48002
    Jun  3 09:27:03:259 [1352] nas: Device_Approver APPROVED:  dev_id: 'D24891D936FC61A86D7EC58636D8AF2A9' from '/wsgc/HUB2/ecbagentrk1v/cdm'
    Jun  3 09:27:03:259 [1352] nas: dbBeginTransaction subscriber, OK - rc:0
    Jun  3 09:27:03:259 [1352] nas: SqliteExecuteCallback: sqlite3_finalize returned:0
    Jun  3 09:27:03:995 [12188] nas: ptNetIpToHost - getaddrinfo failed for sw-lak-ce500 with return value 11001
    Jun  3 09:27:03:995 [12188] nas: nsLookup ip:sw-lak-ce500 => 'sw-lak-ce500' in 2583ms

     

     

    I would also like to mention that we are having issues with the SQLite DB. Any idea what the return code generated here means? Please advise.

     

     

    Thanks,

    Ananda Guberan K



  • 2.  Re: Issue with NAS

    Posted Jun 04, 2014 10:14 AM

    Does the nas chain-fail? Or does it fail, come back online, and run for a while before failing again?

     

    Certain message field values and lengths can cause the nas to die. In those cases the message isn't removed from the queue, so the nas dies each time it tries to read it.

     

    It's also fine to use an SQLite editor to edit your nas SQLite db file. I've previously used http://sourceforge.net/projects/sqlitebrowser/ for this.
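
    Before editing anything, it may be worth checking whether the file itself is healthy. Here is a minimal sketch using Python's standard sqlite3 module. The function name and the file path in the usage note are hypothetical, and it assumes you run it against a copy of the db file taken while the nas is stopped.

```python
import sqlite3
from pathlib import Path


def check_sqlite_file(db_path):
    """Open a SQLite file read-only; return (integrity rows, table names)."""
    # mode=ro asks SQLite not to modify the file during inspection.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # A healthy file yields a single row containing 'ok'.
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [r[0] for r in rows], [t[0] for t in tables]
```

    For example, check_sqlite_file("nas_copy.db") (a made-up file name) returns the integrity result and the list of tables, which gives you something concrete to look at before opening the file in an editor.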



  • 3.  Re: Issue with NAS

    Posted Jun 04, 2014 04:35 PM
    The nas queue fails. Only once have I seen it fail, reconnect, and then fail again; 99.99% of the time it drops in the first phase. And yes, when I flush the queue using Dr Nimbus, I see alerts from one of our custom probes being flushed out, but they don't appear in the console. How do we get rid of those messages? I guess we will be back in shape once we get rid of those alerts.


  • 4.  Re: Issue with NAS

    Posted Jun 04, 2014 04:38 PM
    I will look at the SQLite editor link, but I am unsure which parameter to modify. I have no idea what the return value 0 means. Please give me a little insight and I will proceed from there. Thanks a lot.


  • 5.  Re: Issue with NAS
    Best Answer

    Posted Jun 05, 2014 12:03 AM

    Usually a return value of 0 indicates that everything is okay, and then other return values indicate an issue and possibly the nature of the issue based on which value was returned. The only error I see in your log is return value 11001 from ptNetIpToHost, which looks like it would be a "normal" error when a DNS lookup from IP to hostname fails.
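
    For what it's worth, 11001 is the Winsock code WSAHOST_NOT_FOUND, which fits a failed name lookup. To check a larger log for anything else that is non-zero, a rough sketch like the following could help. It assumes the line formats shown in the log excerpt in this thread (rc:N, returned:N, return value N, and nsLookup ... in Nms); the function name and the 1000 ms threshold are my own choices, not anything from the product.

```python
import re

# Patterns modeled on the log lines quoted in this thread.
CODE_RE = re.compile(r"(?:\brc:|\breturned:|\breturn value )\s*(-?\d+)")
LOOKUP_RE = re.compile(r"nsLookup ip:(\S+) => '[^']*' in (\d+)ms")


def scan_log(lines, slow_ms=1000):
    """Return (non-zero return codes, slow name lookups) found in lines."""
    nonzero, slow = [], []
    for line in lines:
        m = CODE_RE.search(line)
        if m and int(m.group(1)) != 0:
            nonzero.append((int(m.group(1)), line.strip()))
        m = LOOKUP_RE.search(line)
        if m and int(m.group(2)) >= slow_ms:
            slow.append((m.group(1), int(m.group(2))))
    return nonzero, slow
```

    Fed the excerpt above, this should flag only the getaddrinfo failure and the 2583 ms lookup for sw-lak-ce500; every rc:0 line is skipped.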

     

    If the NAS is crashing without anything reported in the log, it does seem like a bad alarm message in the queue is a likely culprit.



  • 6.  Re: Issue with NAS

    Posted Jun 06, 2014 05:20 PM

    Yes, I noticed that as well. The bad alarm message comes from one of our custom probes, but I simply can't disable it. We need to work on mitigating the messages and on some environment changes.



  • 7.  Re: Issue with NAS

    Posted Jun 06, 2014 07:04 PM

    Any idea what would be making the NAS unhappy with those alarms?

     

    Are any fields very long? Any special characters? Any fields that might contain unexpected data?
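
    If you can dump the custom probe's alarms into something like a dictionary of field names to values, a quick check along these lines might narrow it down. This is only a sketch: the function name, the field names in the example, and the 1024-character limit are all assumptions, since I don't know what limits the nas actually enforces.

```python
import unicodedata


def suspicious_fields(alarm, max_len=1024):
    """List (field, reason) pairs for fields that look unusual."""
    findings = []
    for name, value in alarm.items():
        text = str(value)
        # max_len is an arbitrary threshold, not a documented nas limit.
        if len(text) > max_len:
            findings.append((name, f"length {len(text)} exceeds {max_len}"))
        # Control characters other than tab/newline/carriage return.
        bad = [ch for ch in text
               if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r"]
        if bad:
            findings.append((name, f"{len(bad)} control character(s)"))
        if any(ord(ch) > 127 for ch in text):
            findings.append((name, "non-ASCII characters"))
    return findings
```

    Calling it on an alarm such as {"message": "disk full", "source": "host\x00a"} (made-up fields) would report the embedded NUL in source and nothing for message.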



  • 8.  Re: Issue with NAS

    Posted Jun 06, 2014 11:00 PM
    Hi,



    Any idea what this error means?



    sockClose:0000000002F76570:192.168.41.104/54846



    Is it a network issue?





    Thanks,

    Ananda Guberan K


  • 9.  Re: Issue with NAS

    Posted Jun 09, 2014 05:00 PM

    That might not be an error. A socket can close just because the connection is no longer needed. That is probably the most common reason for sockets to close. It is possible that the sockClose message only appears in the log in the event of an error, but based on the contents of the message, it does not look like it is specifically reporting an error. You probably need to get further information about the sockClose message from someone who has access to the source code.
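
    To illustrate the point that a close is often just the normal end of a conversation, here is a small self-contained sketch using Python's standard socket module. It has nothing to do with the nas internals; the function name is made up, and it only shows what an ordinary close looks like from the other end of a connection.

```python
import socket


def peer_sees_clean_close():
    """Send data, close one end, and return what the other end reads."""
    a, b = socket.socketpair()
    with b:
        a.sendall(b"done")
        a.close()              # sender is finished; nothing went wrong
        first = b.recv(16)     # the data that was sent before the close
        second = b.recv(16)    # empty bytes signal an orderly shutdown
    return first, second
```

    The receiver gets the data and then an empty read, which is how an orderly close is reported. No error is raised anywhere.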



  • 10.  Re: Issue with NAS

    Posted Jun 10, 2014 03:12 PM
    I guess something wrong on the Java end is causing this error. Let me get in touch with the right team. Thanks for the information :)