Hi all!
We are currently having problems with some applications that we have registered in CEM.
We see a high number of transactions with decoding errors:
When we check the TIM log we see the message:
Fri Jan 25 07:19:45 2019 13487 ! Warning: w15: ssl_analyze: No data due to handshake failure [190.***.xx.xx]:49263->[10.2xx.xx.xx]:443
We have ruled out a certificate error because we do manage to decode a very small volume of traffic (something that would not happen if the certificate or the key were wrong).
Yesterday, when we tried to load the certificate and the key in Wireshark to check whether it could decode the traffic, we got the following error:
Does anyone have any idea what may be happening?
Dear Aldo: You may have multiple issues:
1) Non-compliant SSL/TLS Traffic
2) Network data corruption issues causing OOO or dropped packets.
3) An invalid private key. (Are you loading a PEM file? Just the private key, not the public key and certificate?)
I would rather say the SSL key format is wrong. The TIM requires a key in PEM format; PKCS#12 will not work. If you look at the key, you should see something like "-----BEGIN PRIVATE KEY-----" as the first line. If that is not the case, the key has the wrong format and needs to be converted first.
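For anyone who wants to script that check, here is a minimal Python sketch. It lists the PEM block labels found in a file's text and reports whether an unencrypted private key block is present. The function names and the set of accepted labels are illustrative assumptions, not taken from the CEM documentation, and the code only inspects the text framing; it does not validate the key material itself.

```python
import re

# Matches one PEM block: a BEGIN line, a body, and an END line with the same label.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\s*(.*?)\s*-----END \1-----",
    re.DOTALL,
)

# Assumption for illustration: labels treated as "unencrypted private key".
PLAIN_KEY_LABELS = {"PRIVATE KEY", "RSA PRIVATE KEY"}


def pem_labels(text):
    """Return the labels of all PEM blocks found in text, in order."""
    return [match.group(1) for match in PEM_BLOCK.finditer(text)]


def has_unencrypted_private_key(text):
    """True if at least one block carries a plain private-key label."""
    return any(label in PLAIN_KEY_LABELS for label in pem_labels(text))


def text_before_first_block(text):
    """Return whatever precedes the first PEM block (e.g. 'Bag Attributes')."""
    match = PEM_BLOCK.search(text)
    return text[: match.start()].strip() if match else text.strip()
```

Reading the file with `open(path, encoding="ascii")` before calling these functions would also surface a broken character set early, since a PEM file should contain only ASCII.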
Thank you for your response!
The PEM file has a body delimited by BEGIN PRIVATE KEY and END PRIVATE KEY:
Bag Attributes
    Microsoft Local Key set: <No Values>
    localKeyID: 01 00 00 00
    friendlyName: 158E528ADB8E4052839B889955D9754A
    Microsoft CSP Name: Microsoft RSA SChannel Cryptographic Provider
Key Attributes
    X509v3 Key Usage: 10
-----BEGIN PRIVATE KEY-----
MIIEwAIBADANBgkqhkiG9w0BAQEFAASCBKowggSmAgEAAoIBAQD6Vd7cRZxAeKEd
-----END PRIVATE KEY-----
Is there anything else I could validate in the certificate?
Thanks a lot !
That is what the key is supposed to look like. Why is there an error message in Wireshark, then? Are you absolutely sure that the content of the key was not copied and pasted into another file (from another machine or so)? The character set used can break its content.
I generated the PEM (with private key) with the command from the CEM documentation:
openssl pkcs12 -in C:\dbfiles\output.p12 -nodes -out C:\dbfiles\pkcs12out.txt
The strange thing is that I manage to decode a very small part of the traffic, this would seem to indicate that it is not a certificate issue or wrong key.
We have other applications that come in the same network span that work correctly.
The error I got in Wireshark does seem strange to me.
I opened a case with support but the problem could not be solved.
Are you absolutely sure that some communications have been decrypted?
"The TIM needs to see the IKE (Initial Key Exchange) Parameters, to be able to compute the necessary parameters to decrypt the traffic. The IKE Parameters are generally exchanged on a session start between browser and Web-Server. If, and it does not matter how, the TIM does not see these IKE Parameters, it cannot decrypt the traffic. And this means the entire HTTPS Session the user may have with the Web-Server. In that specific case we lose the entire session (HTTPS) and there is nothing we can do about it. If TIM doesn't see the SSL handshake that sets up an SSL session, it can't decrypt any data using that SSL session."
Note also that the TIM will wait a while before marking an SSL session interception as failed. So during normal operation there will always be a difference between the total number of connections and the number of connections with decode failures.
Perhaps you could do us all a favor: open a case with Broadcom (CA), and get the cem-healthcheck fieldpack from https://github.com/CA-APM/cem-healthcheck-scripts
In there, please run the following log collections: SYS and TIM, and provide us with a performance pack using TIMPERF.
You can look through all the logs if you want to remove sensitive information. The TIMPERF output is just a gzipped tar archive of the TIM's collected protocolstats, which lets us see the health at different times of the day.
In case the TIM itself has access to the web server you are monitoring, please collect a cipher compatibility table using the CIPHER call, pointing it to the web server's IP/port. This will generate a list of compatible ciphers the TIM will be able to decrypt.
Then we can discuss the details.
In fact some transactions are captured, and in the CEM interface I see them as defects on port 443:
I had opened a case (01279070) and hired Broadcom services to review the problem, and I ran the check script for TIM / MTP, but I still do not have an exact cause of the problem.
I'm going to ask the client's security team if, when they generate the pem, they copy the contents of the file.
Thank you so much for the answers.
Whether the TIM is actually decrypting data that is really encrypted is something that, for the provided screenshot, you can only confirm by looking at the packet capture and checking that the communication is really encrypted. Sometimes you see https in the header but the content is not encrypted (this happened very often with static image files in the past).
If that is confirmed, the 2 remaining options are - as already mentioned by Hal:
1. Bad traffic quality
2. Unsupported cipher combinations.
That - we'll have to check.
I will have a look at the case file in the meantime. Maybe I'll see something.
When I check the logs of the TIM I see that it manages to decode queries with SHA256 ciphers (webserver 10.254.254.140):
Trace: w5: Version: TLS 1.2 CipherSuite - TLS_RSA_WITH_AES_256_CBC_SHA256 (61) [108.171.***.***]:56356->[10.254.254.140]:443
but I do not see any detail in the queries that fail:
Fri Jan 25 09:22:20 2019 13449 ! Warning: w12: ssl_analyze: No data due to handshake failure [190.232.***.***]:11551->[10.254.254.140]:443
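To get a quick picture of where the failures concentrate, the two log formats shown above can be tallied with a short script. This is only a sketch: the regular expressions are modelled on the two sample lines in this post, the function name is made up for illustration, and other TIM log variants may need different patterns.

```python
import re
from collections import Counter

# Patterns modelled on the sample TIM log lines quoted in this thread.
# Assumption: other TIM versions may format these messages differently.
ENDPOINTS = (
    r"\[(?P<client>[^\]]+)\]:(?P<cport>\d+)->"
    r"\[(?P<server>[^\]]+)\]:(?P<sport>\d+)"
)
FAILURE = re.compile(
    r"ssl_analyze: No data due to handshake failure " + ENDPOINTS
)
CIPHER = re.compile(r"CipherSuite - (?P<suite>\S+) \(\d+\) " + ENDPOINTS)


def summarize_tim_log(lines):
    """Count handshake failures per (server, port) and suites per server."""
    failures = Counter()
    suites = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            failures[(match.group("server"), int(match.group("sport")))] += 1
            continue
        match = CIPHER.search(line)
        if match:
            suites[(match.group("server"), match.group("suite"))] += 1
    return failures, suites
```

Feeding it the lines of a TIM log file (for example `summarize_tim_log(open(path))`, where `path` is wherever the log lives on your system) returns two counters that can be compared per server.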
But if this were an unsupported cipher problem, I would not expect values other than 0 in the SSL server status column of the TIM:
This is very strange!
Thank you for your help!
So - having looked at the case files - the Ciphers used are also: TLS_RSA_WITH_AES_128_CBC_SHA (0x002f) which is fine with TIM. And you are right, it cannot decode the sessions. So we probably have to look elsewhere.
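As general TLS background (not a statement about the TIM's actual support matrix): a passive monitor that holds only the server's RSA private key can recover session keys only when the suite uses plain RSA key exchange, as both suites named in this thread do. Suites with ephemeral Diffie-Hellman (DHE/ECDHE) cannot be decrypted that way. A minimal sketch that classifies IANA-style suite names on that basis follows; the function names are illustrative.

```python
def key_exchange(suite_name):
    """Return the key-exchange part of an IANA-style suite name, or None.

    Names without a "_WITH_" part (for example TLS 1.3 suites such as
    TLS_AES_128_GCM_SHA256) do not encode a key exchange and yield None.
    """
    if not suite_name.startswith("TLS_") or "_WITH_" not in suite_name:
        return None
    return suite_name[len("TLS_"):].split("_WITH_", 1)[0]


def decryptable_with_server_rsa_key(suite_name):
    """True only for plain RSA key exchange (general TLS rule of thumb)."""
    return key_exchange(suite_name) == "RSA"


for suite in (
    "TLS_RSA_WITH_AES_128_CBC_SHA",
    "TLS_RSA_WITH_AES_256_CBC_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
):
    print(suite, decryptable_with_server_rsa_key(suite))
```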
From the received packet capture, I extracted some stats.
The part I don't like is this here:
=== Manual TCP Analysis ! ====================================================
Total segments: 94481
Out of Order segments: 191
TCP Retransmission: 955
Duplicate ACK: 2656
ACKed unseen Segment: 2222
It shows that the traffic is congested, but the "ACKed unseen Segment" count indicates that the ACK for a transmitted packet has been seen while the packet itself has not. In the 83Mb pcap, 2222 packets have not been seen. Note that I ran the test on previously filtered packets, to make sure we have seen the 3-way handshake and the FIN message, so the entire communication should be there.
What I still don't understand is the sheer number of decoding errors. There should be some, but not that many.
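To put those counters in proportion, here is a small sketch that expresses each one as a share of the total segments. The numbers are the ones from the analysis above; the helper name is illustrative. The 2222 unseen segments come to about 2.35% of the 94481 segments in the capture.

```python
TOTAL_SEGMENTS = 94481

# Counters taken from the manual TCP analysis quoted in this post.
COUNTERS = {
    "Out of Order segments": 191,
    "TCP Retransmission": 955,
    "Duplicate ACK": 2656,
    "ACKed unseen Segment": 2222,
}


def share_of_total(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


for name, count in COUNTERS.items():
    print(f"{name:<24}{count:>6}  {share_of_total(count, TOTAL_SEGMENTS):5.2f}%")
```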
I just remember that you had sent me a performance report on that MTP.
Did you guys fix the plugin error? If the TIMs are crashing all the time as they did before, it will be pure luck for a TIM to actually intercept the IKE. Maybe you could provide me a new TIMPERF dataset so we can check what the health of that device is.
I'm going to disable the JSON request and response plugin to see if we see any improvement.
On the other hand we have another application that does not use this plugin (it is web) that works without problems and the volume of capture is quite high.
The other applications in the MTP have decoding errors but in a small volume.
I will disable the plugin and update the case and the community post !
I have uploaded to the 01279070 case the execution of the tim performance scripts and pcap capture after disabling the json request / response plugin.
The results have been the same as those I had before (almost all traffic with decoding problems).
Please can you help us with a review of these files?
I will answer in full in the case (and will add the full reports). But in short, I suspect that you have too many packets coming in.
If you look at the Percentile Display:
You can see that you not only have the maximum number of connections, but also the maximum number of packets forwarded to an MTP over a 1Gbps connection. Red is the number for the "Host for comparison" in that percentile window; orange is your MTP. This results in the TIMs not being able to handle the load if the filters are not set right. Small requests (especially SOAP/XML) can bring down an MTP with as little as 20Mbps of traffic.
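The reason bandwidth alone is misleading can be shown with a little arithmetic: at a fixed bit rate, the packet rate grows as the average packet shrinks. The sketch below is a rough illustration only. The two average packet sizes are assumptions picked for the example (not measurements from this MTP), and framing overhead is ignored.

```python
def packets_per_second(bitrate_bps, avg_packet_bytes):
    """Approximate packet rate for a given bit rate and average packet size."""
    return bitrate_bps / (8 * avg_packet_bytes)


BITRATE_BPS = 20_000_000  # the 20 Mbps figure mentioned above

# Assumed average packet sizes, for illustration only.
for avg_bytes in (100, 1400):
    rate = packets_per_second(BITRATE_BPS, avg_bytes)
    print(f"{avg_bytes:>5} bytes/packet -> {rate:,.0f} packets/s")
```

With these assumed sizes, the same 20 Mbps carries 25,000 packets per second at 100 bytes per packet, against roughly 1,800 at 1400 bytes, which is why a stream of small requests loads the TIMs far more than the bandwidth graph suggests.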
What happens is that the MTP forwards the traffic to the TIM Workers:
The CPU load will spike:
And you see the gray (it shows the max load of the CPUs, here 1600%, as there are 16 CPUs on the MTP at 100% per CPU core), and the TIMs can't keep up. They either crash or are too slow to free the buffer filesystem. The RAM filesystem which serves as buffer on the MTP will fill up, and the nqcap daemon can't write new data to that filesystem:
These packets will be discarded and dropped, and the TIM process itself will not see them. Over time, the TIMs will process old data and eventually only receive partial data. When that data is encrypted, it is really a matter of chance whether the complete communication flow and the IKE parameters are captured.
So - IMHO - if you see very few valid transactions, I am not surprised.
What you can do here, is fine tune the filters, so that the MTP is receiving only the traffic it is supposed to monitor, and nothing else -> hardware filters on the Napatech board. Not the CEM WebServer filters - as this is too late as the packets have hit the MTP already.
Currently we have filters configured in Napatech (by IP ranges and ports):
I see that the bandwidth the MTP receives is below the total that the gigabit interface supports:
What else could be generating so much overhead that the MTP cannot cope, despite our having filtered everything possible?
Did you limit it to TCP traffic only (ICMP and UDP disabled, which according to your filters should not pass)? Any other filters active? Are you also using the MTP for ADA data analysis? Or are you sending the headers only to the TIM too?
What the MTP tells me is that there are way too many very small packets coming in.
Is there traffic being sent to the MTP that is composed of web services? If there is lots of web services traffic going over, these are almost always one-packet requests and one-packet answers. These are killing the TIM. If you don't monitor that (XML), you can disable the XML parser with: (TIM settings UI).
We don't have additional filters.
We don't use ADA (I do not know why the check comes out as if we used it).
I will disable the processing of xml according to the recommendation.
Thanks a lot for your responses !
I disabled the XML parser but I do not see any improvement; we still have a very high number of decoding errors.
The Napatech filters are as restrictive as possible (by range of IPs and ports).
Is there something else that we could review?
The way I would proceed from here is: save the filters and limit the filter set to a single IP address that you know you need to monitor and where HTTPS is known to work. This will tell us that at least the MTP/CEM/APM setup works. The next step will be to identify which server IP address is causing trouble, so we can look at the traffic that causes the issues.
I admit that the performance logs are "weird" - in the way that some things don't make sense.
I have tried the recommendation of leaving only one IP and port in the hardware filter, but I see no improvement in the proportion of decoding errors:
Tim SSL Servers:
The remaining applications that run in our MTP do not present this problem (the decoding errors are close to 10 ~ 15% marked blue):
What else could I do or validate?
If all the rest works, but not that IP - we need to check the SSL happening between that server and the Clients.
I'll send you the instructions through the case file.
I consulted the client and he tells me that he uses Akamai for static content. Could you please indicate what type of changes are required in TIM / MTP to be able to check whether that is the cause of the problem?
This was introduced in 10.3. Not exactly your issue, but good to know:
For the Akamai TCP connection reset scenarios, the current TIM BT reporting logic closes the BT prematurely when the BT is in an incomplete state. This leads to incorrect CEM reports with respect to BT time/size, and to false defects.
The Akamai environment uses persistent connections for content delivery, and the transactions of a BT are served through multiple persistent connections using a unique user session. To handle this premature close in the Akamai connection reset case, we need to add special processing logic (a flag/property based solution) on the TIM side (i.e. in the HTTP Analyzer and Transaction Analyzer modules), so that other customer environments are not impacted by this solution, while other customers can enable the feature if they have this kind of problem with Akamai.
TIM Settings property details:
The TIM setting property below is required to fix the BMO/AKAMAI issues (and the same can be used for other customers if applicable).
Name/Value: EnableCDNConnectionRSTForBT/1 (default is "0", feature is not enabled)
Note: The above setting is a dynamic property and it doesn't require a TIM restart.
Thanks for your response.
I have set the flag for Akamai compatibility but I do not see much improvement in the decoding error rate.
Is there anything else I could validate?