- Sunday ~3AM-6AM PST – Level 3 / CenturyLink goes down. Cloudflare is affected, taking down major services and websites globally. All traffic to Level 3 returns 500 error codes (Missing or Invalid Routes).
- Sunday ~6AM PST – Cloudflare reroutes traffic around CenturyLink. Some services are still affected, particularly those on or near CenturyLink endpoints.
- Sunday ~9 AM PST – Level 3 / CenturyLink comes back online.
- Monday ~6:00 AM PST – We started receiving reports of missing audio data on the Dialer.
- Monday ~10 AM PST – We noticed that it was only affecting certain accounts, making it more difficult to troubleshoot and isolate. We began a session with Amazon to investigate any network troubles that could be causing this.
- Monday at 1:05 PM PST – We identified the issue and began patching Dialer servers. The issue was that Google’s STUN Servers were not identifying our domain names correctly, causing bogus connections when connecting to Dialers via WebRTC. This was traced to a badly cached DNS route at Google caused by the Sunday disturbance at Level 3. We switched to our Telco Provider’s STUN Servers to mitigate further issues and also to have access to the logs and service our Telco can provide that Google will not.
- Monday ~1:45PM PST – The last of the fixes concluded and everyone resumed operations.
Steps to Monitor and Prevent from Occurring Again
- Only use Telnyx STUN and TURN Servers for WebRTC Signaling instead of relying on Google’s Public STUN Service.
- This provides a reliable source of STUN and TURN specifically for our purposes.
- We have a great relationship and quick response from Telnyx, so if any issues or changes occur with their service we are notified and we can investigate logs and traffic data.
- Removed Google and alternative STUN and TURN Servers from the Client Side Webphone.
- Switch to IBM Quad9 DNS (9.9.9.9) internally instead of Cloudflare (1.1.1.1) due to the recent upsets at Cloudflare and their upstream providers.
- If any similar issues arise, investigate the Dialer DNS settings first.
- Change TTL for all Dialer Subdomains back to 300 to prevent Global Outage DNS Cache issues like this from affecting us for longer than 5 minutes.
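The "single provider" rule from the steps above can be sketched as a small allowlist filter over a WebRTC-style ICE server list. This is only an illustration, not our production code: the hostnames `stun.telnyx.com` and `turn.telnyx.com`, the port numbers, and the function names are assumptions made for the example.

```python
# Sketch: keep only STUN/TURN entries whose host is on an allowlist.
# The hostnames below are assumed for illustration; check your provider's docs.
ALLOWED_HOSTS = {"stun.telnyx.com", "turn.telnyx.com"}


def ice_host(url):
    """Extract the host from a URI such as 'stun:host:3478' or 'turn:host:3478?transport=tcp'.

    Bracketed IPv6 literals are not handled in this sketch.
    """
    scheme, _, rest = url.partition(":")
    if scheme not in {"stun", "stuns", "turn", "turns"}:
        raise ValueError(f"not a STUN/TURN URI: {url!r}")
    hostport = rest.split("?", 1)[0]  # drop any '?transport=...' suffix
    host = hostport.rsplit(":", 1)[0] if ":" in hostport else hostport
    return host.lower()


def filter_ice_servers(servers, allowed=ALLOWED_HOSTS):
    """Return a copy of the server list with non-allowlisted URLs removed."""
    kept = []
    for server in servers:
        urls = [u for u in server["urls"] if ice_host(u) in allowed]
        if urls:  # drop entries left with no URLs at all
            kept.append({**server, "urls": urls})
    return kept


if __name__ == "__main__":
    configured = [
        {"urls": ["stun:stun.l.google.com:19302"]},
        {"urls": ["stun:stun.telnyx.com:3478"]},
    ]
    print(filter_ice_servers(configured))
```

Running a filter like this at startup, on both the Webphone configuration and the backend, would make a stray default server show up as a configuration difference rather than as silent audio.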
We were tearing our hair out at the beginning of this fiasco because nothing had changed over the weekend. No development, no updates, and Friday all was normal. Once we started getting reports, we started scouring logs, checking connections, restarting servers, and recommissioning servers to change the hardware zones, and nothing was working. All systems were nominal. HTTP Traffic was perfect, load was perfect, web sockets were connecting without a problem. The only issue was that audio data was lagging out, not connecting, not being received and generally not working.
With over 10 Dialers now inoperable, things were beginning to escalate, but upon checking all accounts we realized that a few were still working fine. We then assumed that it was a networking issue at Amazon with the specific instances, albeit strange that they should all fail at the same time. We have never had such a massive failure on multiple Dialers before, because we have many subsystems to separate concerns and keep the system going should one fail. But without audio, it is hard to run a call center. There was also no indication of errors or problems in any logs on any of the machines. We then contacted Amazon AWS and spent a few hours with their networking technicians and support to no avail. RTP Traffic WAS flowing, it WAS being accepted, it JUST wasn’t working on the Webphone.
In the case of a single server having this issue we would have taken action to install Softphones to circumvent the issue temporarily, but deploying Softphones across offices, during COVID with many working from home, would have taken just as much if not more time than to continue to investigate the issue so we made the decision to keep investigating. Also, Softphones don’t play well with some of our cluster setups that require multiple registrations for load balancing, so this was not an ideal solution.
An Epiphany: It’s Not DNS, There’s No Way it’s DNS, It was DNS, but with a Twist due to the Sunday Event…
DNS was one of the first items on our checklist. We updated servers with different DNS settings, moving from Cloudflare DNS Servers to IBM, thinking maybe the servers were being affected by the issues at Cloudflare and were broadcasting the wrong IP address, had a bad cache, or something strange. This was not the case and didn’t help much, but suddenly one of the servers started working while we were testing this theory. We had no idea why. 30 minutes later another one started working; audio just started flowing. I got to thinking about the Sunday Event and how this could be a ripple effect of it.
I was up at 3AM Sunday enjoying a leisurely video game when everything came crashing down and I could not log in. Multiple major game services and other services all around the country and world were suddenly inaccessible. Most of these services use Cloudflare, and one of Cloudflare’s upstream providers had just thoroughly and globally crashed, causing requests forwarded through Cloudflare to return as if the websites did not exist or were problematic.
I thought: what if servers across the world did a DNS lookup during this event and got a corrupted answer? The cached record would then point to an invalid IP address, or no address, which would affect our subdomains individually. Subdomains have different refresh rates for DNS cache, anywhere from 300 seconds to a full day. And we just so happened to have most of ours set to a full day due to DNS lookup issues our overseas clients were sometimes having due to poor internet.
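To put numbers on that theory, here is a minimal sketch of how long a single bad answer can linger in a cache for the two TTL values mentioned above. The timestamp is a placeholder rather than a value from our logs, and the model is deliberately simple: it assumes a resolver honors the TTL exactly and does not re-cache the bad answer from another cache.

```python
from datetime import datetime, timedelta


def cache_expiry(cached_at, ttl_seconds):
    """Time at which a record cached at `cached_at` stops being served from cache."""
    return cached_at + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    # Placeholder: some moment during an outage when a bad answer was cached.
    cached_at = datetime(2020, 1, 5, 5, 0)
    for ttl in (300, 86400):  # 5 minutes vs. a full day
        print(f"TTL {ttl:>5}s -> bad answer served until {cache_expiry(cached_at, ttl)}")
```

With a 300 second TTL the bad answer ages out five minutes after it was cached; with 86400 seconds it survives until the same time the next day, which is the kind of delayed failure we were seeing.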
We then started looking at the internals of how the Webphone works, because we were able to get the softphone working without issue, so it must have to do with WebRTC, or a Protocol Client-side that is not negotiating properly between the client and server.
The fact that one of our servers came back meant that the DNS cache for that Dialer’s subdomain suddenly expired and refreshed on its own, so it now pointed at the valid IP.
But the question was WHERE?! What service had this invalid cache? Nothing was returning invalid or erroneous values in our server logs. Our Wireshark and SIP Tracing were literally just showing NO data being received by the client even though it was being sent! Everything else was connected and working! The website was accessible, pings worked, and there seemed to be no DNS Cache issue, as we were able to access via HTTP, SSH, and pure RTP without any issues!
The Answer was: Google’s STUN Servers
We found the answer inside the Webphone. The Webphone prefers our telephony company’s STUN Server per our configuration, and then falls back to a Google-operated STUN Server that ships as the default with most telephony systems that work with WebRTC. Google itself had an invalid pointer to our servers. The bad cache was on their end where we couldn’t see it, likely due to the Sunday disturbance that returned error messages for so many websites across the internet. We also found that the Dialers themselves, outside of the client, had Google set as the default STUN Service, again because it ships as the standard default for most WebRTC applications. This made Google the preferred STUN service on the backend, and although our carrier’s (Telnyx) was sometimes checked, that was not always the case. This was the missing key.
The offending Server was: stun.l.google.com:19302
What is STUN and TURN?
- STUN Technical:
- Session Traversal Utilities for NAT (STUN) is a standardized set of methods, including a network protocol, for traversal of network address translator (NAT) gateways in applications of real-time voice, video, messaging, and other interactive communications.
- STUN English:
- STUN determines where to send audio data to your computer when you are on a network at home or at a coffee shop. Since you share an IP Address with everyone on your network, you need to be identified as to where you are to send the audio stream.
- TURN Technical:
- Traversal Using Relays around NAT (TURN) is a protocol that assists in traversal of network address translators (NAT) or firewalls for multimedia applications. It may be used with the Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). It is most useful for clients on networks masqueraded by symmetric NAT devices.
- TURN English:
- Similar to STUN, except when we can’t figure out where exactly you are on your home network due to incorrect or strict settings, TURN is used as a fallback to Relay the Audio information between yourself and a server.
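For the curious, the STUN exchange described above is small enough to sketch by hand. The following is a minimal illustration, not what the Webphone actually runs (browsers and Asterisk do this internally): it builds a STUN Binding Request and decodes the XOR-MAPPED-ADDRESS attribute of a Binding Success Response, following the message layout in RFC 5389. The function names are made up for this example, and only IPv4 is handled.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """Return a 20-byte Binding Request: type, length 0, magic cookie, 12-byte transaction ID."""
    tid = transaction_id if transaction_id is not None else os.urandom(12)
    if len(tid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + tid


def parse_xor_mapped_address(message):
    """Return (ip, port) as seen by the STUN server, from a Binding Success Response."""
    msg_type, length, cookie = struct.unpack("!HHI", message[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding Success Response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS:
            family, xport = struct.unpack("!xBH", value[:4])
            if family != 0x01:
                raise ValueError("only IPv4 is handled in this sketch")
            (xaddr,) = struct.unpack("!I", value[4:8])
            port = xport ^ (MAGIC_COOKIE >> 16)       # port is XORed with the cookie's top 16 bits
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, port
        offset += 4 + attr_len + (-attr_len % 4)      # attribute values are padded to 4 bytes
    raise ValueError("no XOR-MAPPED-ADDRESS attribute found")


def query_stun(host, port, timeout=2.0):
    """Send one Binding Request over UDP and return the (ip, port) the server saw."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_binding_request(), (host, port))
        data, _ = sock.recvfrom(2048)
    return parse_xor_mapped_address(data)
```

Calling `query_stun` with a STUN server's hostname and port from a machine behind NAT returns the public address and port that server sees, which is the piece of information WebRTC needs in order to route audio back to the right device.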
We REMOVED Google’s STUN Server on the Dialer Backend, replacing it with Telnyx’s (which we preferred anyway because we have access, communications and a good relationship with Telnyx to debug things quickly when issues arise) and SUDDENLY audio was working.
We quickly deployed this to two other servers with success, and after a few minutes of manually configuring each Dialer we were back in action.
Likely, this issue would have cleared itself up in a few more hours due to DNS Cache refresh as the 24-hour mark approached for those subdomains configured with the 24-hour TTL (Time to Live). But we would never have known the root cause. Now we know.
As of now we have changed our TTL to 5 minutes for all domains and subdomains to prevent DNS Cache from being an issue in the future.
We have also removed Google and other alternate STUN and TURN Servers from our list of options on both our Webphone Client and Asterisk backend and are sticking with Telnyx.
As an extra level of protection, and due to the notorious and concerning routing issues Cloudflare has been encountering the past few months (see the July 17th Cloudflare Outage caused by a single line of incorrect configuration code), we have also opted to change our servers’ DNS Settings to IBM’s Quad9 (we had already moved away from Google’s DNS long ago due to issues, ping and speed).
Thanks and Mentions
I want to thank the wonderful staff at Amazon Web Services, particularly Joshua K., for staying on the phone with us for hours testing, analyzing, and for looking into the server specs and instances we have with them to find any outliers. Their customer service and cloud service was performing stellar as always.
I also want to thank the good folks at Telnyx, our Telephony Carrier, whom we have been with nearly since their inception, for being there for us and assisting with theories and ideas as to what this problem could have been. It is nice to know that they have our backs and we can reach out as friends and colleagues. Thank you Shreya, Rogelio and Zach for the assistance and conversation and for maintaining a stellar telephony product with amazing prices and awesome customer service.
Conclusion and Appreciation for our Clients
Lastly, thank you, our clients, for bearing with us during this unforeseen event. It’s not every day that a backbone of the internet crashes so hard that it leaves ripples like this. I am glad that the Sunday Event didn’t happen Monday, as it would have been even worse to track down, much less correct.
We strive to make a better and stronger product in our niche and are always here for you guys.
Today, your revenue was lost, your agents are disappointed in us for not being able to make sales, and it was your payroll that was running on the clock as you waited.
Please know we worked our hardest to solve this issue and make the right decisions to get it fixed properly and as timely as we could. We considered stopgap solutions, but it would have taken longer to implement a stopgap than to get to the root cause and correct it.
2020 has been a rough year, and fortunately we have been able to keep up with the demand from people using our system nationally and internationally. We appreciate every one of you, your feedback, and daily conversations in Slack.
We have optimized our services to be able to be used from anywhere, and added tons of features and functionality per your suggestions that have benefited all on the platform.
We will keep building and fighting for you and your business.
I was surprised and warmed with the calls of concern, and how understanding you all have been today; not a single angry email, Slack message or phone call. This makes me proud to be working with you all and in this industry.
I know it is not much, but we will be offering credits for this one day of downtime, applied directly to your current monthly invoices. I know it pales in comparison to the lost revenue of nearly a full day of productivity, but we want to do something to make things right and show our appreciation for you all being our clients.
– Alexander Conroy aka “Geilt the Architect”