The Event – Technical Analysis of Downtime Due to Lack of Audio Data via WebRTC RTP on August 31st, 2020.

General Timeline:

  • Sunday ~3 AM – 6 AM PST – Level 3 / CenturyLink goes down. Cloudflare is affected, taking down major services and websites globally. All traffic to Level 3 returns 500 error codes (missing or invalid routes).
  • Sunday ~6 AM PST – Cloudflare reroutes traffic around CenturyLink; some services are still affected, particularly those on or near CenturyLink endpoints.
  • Sunday ~9 AM PST – Level 3 / CenturyLink comes back online.
  • Monday ~6 AM PST – We started receiving reports of missing audio data on the Dialer.
  • Monday ~10 AM PST – We noticed that it was only affecting certain accounts, making it more difficult to troubleshoot and isolate. We began a session with Amazon to investigate any network trouble that could be causing this.
  • Monday 1:05 PM PST – We identified the issue and began patching Dialer servers. The issue was that Google’s STUN servers were not resolving our domain names correctly, causing bogus connections when connecting to Dialers via WebRTC. This was traced to a badly cached DNS record on Google’s side resulting from the Sunday disturbance at Level 3. We switched to our Telco provider’s STUN servers to mitigate further issues and to gain access to the logs and support our Telco can provide that Google will not.
  • Monday ~1:45 PM PST – The last of the fixes concluded and everyone resumed operations.

Read Cloudflare’s post mortem of the CenturyLink / Level 3 outage.

Steps to Monitor and Prevent This from Occurring Again

  • Only use Telnyx STUN and TURN servers for WebRTC instead of relying on Google’s public STUN service (see the configuration sketch after this list).
    • This provides a reliable source of STUN and TURN specifically for our purposes.
    • We have a great relationship with Telnyx and get quick responses, so if any issues or changes occur with their service we are notified and can investigate logs and traffic data.
  • Remove Google and other alternative STUN and TURN servers from the client-side Webphone.
  • Switch to IBM Quad9 DNS (9.9.9.9) internally instead of Cloudflare (1.1.1.1) due to the recent upsets at Cloudflare and their upstream providers.
    • If any similar issues arise, investigate the Dialer’s DNS settings first.
  • Change the TTL for all Dialer subdomains back to 300 seconds to prevent global DNS cache issues like this from affecting us for longer than 5 minutes.
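
To make the intended end state concrete, here is a minimal sketch of the Webphone’s ICE configuration with only our carrier’s servers in play. The hostnames, port, and credentials are placeholders rather than the real production values; the point is simply that Google’s public STUN server no longer appears anywhere in the list:

```typescript
// Minimal sketch of the client-side Webphone ICE configuration after the change.
// Hostnames, ports, and credentials below are illustrative placeholders.
const iceConfig: RTCConfiguration = {
  iceServers: [
    // Carrier (Telnyx) STUN server, now the only STUN entry.
    { urls: "stun:stun.telnyx.example:3478" },
    // Carrier TURN relay as the fallback when STUN alone cannot punch through.
    {
      urls: "turn:turn.telnyx.example:3478",
      username: "WEBPHONE_TURN_USER",
      credential: "WEBPHONE_TURN_SECRET",
    },
    // Removed: { urls: "stun:stun.l.google.com:19302" }
  ],
};

// The Webphone's peer connections are then built from this configuration.
const pc = new RTCPeerConnection(iceConfig);
```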

The Beginning

We were tearing our hair out at the beginning of this fiasco because nothing had changed over the weekend. No development, no updates, and on Friday all was normal. Once we started getting reports, we started scouring logs, checking connections, restarting servers, and recommissioning servers to change hardware zones, and nothing was working. All systems were nominal. HTTP traffic was perfect, load was perfect, and web sockets were connecting without a problem; the only issue was that audio data was lagging, not connecting, not being received, and generally not working.

Suddenly, Exceptions

With over 10 Dialers now inoperable, things were beginning to escalate, but upon checking all accounts we realized that a few were still working fine. We then assumed it was a networking issue at Amazon with the specific instances, albeit strange that they should all fail at the same time. We had never seen such a massive failure across multiple Dialers before, because we have many subsystems to separate concerns and keep the system going should one fail. But without audio, it is hard to run a call center. There was also no indication of errors or problems in any logs on any of the machines. We then contacted Amazon AWS and spent a few hours with their networking technicians and support, to no avail. RTP traffic WAS flowing, it WAS being accepted, it JUST wasn’t working on the Webphone.

Had a single server been having this issue, we would have installed softphones to circumvent it temporarily, but deploying softphones across offices, during COVID with many working from home, would have taken just as much if not more time than continuing to investigate, so we made the decision to keep investigating. Also, softphones don’t play well with some of our cluster setups that require multiple registrations for load balancing, so this was not an ideal solution.

An Epiphany: It’s Not DNS, There’s No Way it’s DNS, It was DNS, but with a Twist due to the Sunday Event…

DNS was one of the first items on our checklist of possible issues. We updated servers with different DNS settings, moving from Cloudflare’s DNS servers to IBM’s, thinking maybe the servers were being affected by the issues at Cloudflare and were receiving the wrong IP address, a bad cache, or something strange. This was not the case and didn’t help much, but suddenly one of the servers started working while we were testing this theory. We had no idea why. Thirty minutes later another one started working; audio just started flowing. I got to thinking about the Sunday Event and how this could be a ripple effect of it.

I was up at 3 AM Sunday enjoying a leisurely video game when everything came crashing down and I could no longer log in. Multiple major game services and other services around the country and the world were suddenly inaccessible. Most of these services use Cloudflare, and one of Cloudflare’s upstream providers had just thoroughly and globally crashed, causing requests forwarded through Cloudflare to websites to return as if the sites did not exist or were having problems.

I thought… what if servers across the world did a DNS lookup during this event and cached a corrupted answer? They would then point to an invalid IP address, or to no address at all, which would affect our subdomains individually. Subdomains have different refresh rates for their DNS cache, anywhere from 300 seconds to a full day, and we just so happened to have most of ours set to a full day because of DNS lookup issues our overseas clients were sometimes having due to poor internet.
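
If that theory held, the quickest sanity check would be to ask different public resolvers what they were currently returning for a Dialer subdomain and how long they intended to keep the answer. Something along the lines of this Node.js sketch (the subdomain shown is a placeholder):

```typescript
// Illustrative check: ask specific public resolvers what they currently return
// for a subdomain, including the remaining TTL they report for the cached record.
import { Resolver } from "node:dns/promises";

async function checkRecord(hostname: string, resolverIp: string): Promise<void> {
  const resolver = new Resolver();
  resolver.setServers([resolverIp]);
  // { ttl: true } returns each address along with the TTL the resolver reports.
  const records = await resolver.resolve4(hostname, { ttl: true });
  for (const { address, ttl } of records) {
    console.log(`${resolverIp} says ${hostname} -> ${address} (TTL ${ttl}s)`);
  }
}

// Compare answers across resolvers for a (placeholder) Dialer subdomain.
const host = "dialer01.example.com";
checkRecord(host, "9.9.9.9").catch(console.error); // Quad9
checkRecord(host, "1.1.1.1").catch(console.error); // Cloudflare
```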

We then started looking at the internals of how the Webphone works. Since we were able to get a softphone working without issue, the problem had to be with WebRTC, or with a client-side protocol that was not negotiating properly between the client and the server.

The fact that one of our servers came back meant that the DNS cache for that Dialer’s subdomain had suddenly expired and refreshed on its own, so it now pointed at the valid IP.

But the question was WHERE?! What service had this invalid cache? Nothing was returning invalid or erroneous values in our server logs. Our Wireshark and SIP tracing showed literally NO data being received by the client even though it was being sent! Everything else was connected and working! The website was accessible, pings worked, and there seemed to be no DNS cache issue, as we were able to connect via HTTP, SSH, and pure RTP without any problems!

The Answer Was: Google’s STUN Servers

We found the answer inside the Webphone. The Webphone prefers our telephony company’s STUN server per our configuration, and then falls back to a Google-operated STUN server that ships as the default with most telephony systems that work with WebRTC. Google itself had an invalid pointer to our servers; the bad cache was on their end, where we couldn’t see it, likely due to the Sunday disturbance that returned error messages for so many websites across the internet. We also found that the Dialers themselves, outside of the client, had Google set as the default STUN service, which again ships as the standard default for most WebRTC applications. This made Google the preferred STUN service, and although our carrier Telnyx was sometimes checked, it was not always. This was the missing key.

The offending Server was: stun.l.google.com:19302

What Are STUN and TURN?

  • STUN Technical:
    • Session Traversal Utilities for NAT (STUN) is a standardized set of methods, including a network protocol, for traversal of network address translator (NAT) gateways in applications of real-time voice, video, messaging, and other interactive communications.
  • STUN English:
    • STUN determines where to send audio data to your computer when you are on a network at home or at a coffee shop. Since you share a public IP address with everyone on your network, the other end needs to learn exactly where you are in order to send you the audio stream.
  • TURN Technical:
    • Traversal Using Relays around NAT (TURN) is a protocol that assists in traversal of network address translators (NAT) or firewalls for multimedia applications. It may be used with the Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). It is most useful for clients on networks masqueraded by symmetric NAT devices.
  • TURN English:
    • Similar to STUN, except that when we can’t figure out exactly where you are on your home network due to incorrect or strict settings, TURN is used as a fallback to relay the audio between you and a server.
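
With those definitions in mind, one way to sanity-check a STUN server such as stun.l.google.com:19302 from a browser is to hand it to a throwaway RTCPeerConnection and watch for a server-reflexive (“srflx”) candidate, which is the public address the STUN server discovered for us. This is only an illustrative sketch of such a probe, not code from the Webphone itself:

```typescript
// Illustrative probe: does a given STUN server hand us back a server-reflexive
// (public) candidate within a timeout? If not, STUN discovery via that server
// is effectively broken for us, even though everything else may look healthy.
async function probeStunServer(stunUrl: string, timeoutMs = 5000): Promise<boolean> {
  const pc = new RTCPeerConnection({ iceServers: [{ urls: stunUrl }] });
  pc.createDataChannel("probe"); // give ICE something to gather candidates for

  await pc.setLocalDescription(await pc.createOffer());

  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => { pc.close(); resolve(false); }, timeoutMs);
    pc.onicecandidate = (event) => {
      // A "typ srflx" candidate means the STUN server reported our public address.
      if (event.candidate?.candidate.includes("typ srflx")) {
        clearTimeout(timer);
        pc.close();
        resolve(true);
      }
    };
  });
}

// Usage: probeStunServer("stun:stun.l.google.com:19302").then((ok) => console.log(ok));
```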

The Results

We REMOVED Google’s STUN server on the Dialer backend, replacing it with Telnyx’s (which we preferred anyway, because we have access, communication, and a good relationship with Telnyx to debug things quickly when issues arise), and SUDDENLY audio was working.

We quickly deployed this to two other servers with success, and after a few minutes of manually configuring each Dialer we were back in action.

Likely, this issue would have cleared itself up in a few more hours as the DNS cache refreshed and the 24-hour mark approached for those subdomains configured with a 24-hour TTL (Time to Live). But we would never have known the root cause. Now we know.

As of now we have changed our TTL to 5 minutes for all domains and subdomains to prevent DNS caching from being an issue in the future.

We have also removed Google and other alternate STUN and TURN Servers from our list of options on both our Webphone Client and Asterisk backend and are sticking with Telnyx.

As an extra level of protection, and due to the notorious and concerning routing issues Cloudflare has encountered over the past few months (see the July 17th Cloudflare outage caused by a single line of incorrect configuration code), we have also opted to change our servers’ DNS settings to IBM’s Quad9 (we had already moved away from Google’s DNS long ago due to issues with ping and speed).

Thanks and Mentions

I want to thank the wonderful staff at Amazon Web Services, particularly Joshua K., for staying on the phone with us for hours testing, analyzing, and looking into the server specs and instances we have with them to find any outliers. Their customer service and cloud services were stellar, as always.

I also want to thank the good folks at our telephony carrier, Telnyx, whom we have been with nearly since their inception, for being there for us and assisting with theories and ideas about what this problem could have been. It is nice to know that they have our backs and that we can reach out as friends and colleagues. Thank you Shreya, Rogelio, and Zach for the assistance and conversation, and for maintaining a stellar telephony product with amazing prices and awesome customer service.

Conclusion and Appreciation for our Clients

Lastly, thank you, our clients, for bearing with us during this unforeseen event. It’s not every day that a backbone of the internet crashes so hard that it leaves ripples like this; I am glad the Sunday Event didn’t happen on a Monday, as it would have been even harder to track down, much less correct.

We strive to make a better and stronger product in our niche and are always here for you guys.

Today, your revenue was lost, your agents were disappointed in us for not being able to make sales, and it was your payroll that was running on the clock while you waited.

Please know we worked our hardest to solve this issue and make the right decisions to get it fixed properly and as quickly as we could. We considered stopgap solutions, but it would have taken longer to implement a stopgap than to get to the root cause and correct it.

2020 has been a rough year, and fortunately we have been able to keep up with the demand from people using our system nationally and internationally. We appreciate every one of you, your feedback, and our daily conversations in Slack.

We have optimized our services to be usable from anywhere, and added tons of features and functionality per your suggestions that have benefited everyone on the platform.

We will keep building and fighting for you and your business.

I was surprised and warmed by the calls of concern, and by how understanding you all have been today; not a single angry email, Slack message, or phone call. This makes me proud to be working with you all and in this industry.

I know it is not much, but we will be offering credits for this one day of downtime, directly applied to your current monthly invoices. I know it pales in comparison to the lost revenue of nearly a full day of productivity, but we want to do something to make things right and show our appreciation for you all being our clients.

– Alexander Conroy aka “Geilt the Architect”

TLDialer – Vicidial Integration Beta

Our integrated Vici solution is ready to go live and is available for beta testing to any of our clients using our Vici offering.

It should be seamless and require no setup on your end, just shoot us an email if you would like to try it out.

TLDCRM can now control VICIDIAL instances that we manage and maintain. We have written a high-level service plugin for Vicidial that follows all of their standards and takes some liberties with the data they provide. We did a lot of reverse engineering, and you can be sure we will be adding more and more features as time goes on, particularly based on client feedback.

This page will be continuously updated with progress and status notes regarding TLDialer.

TLDCRM Specific Changes

  • Statuses and Custom Statuses can now be mapped to VICI Statuses.
    • When Vici is enabled, you can set custom VICI status codes in the Fields section when editing a status or custom status. This way you can choose the mappings; otherwise it will revert to default statuses we have selected to match our internal CRM statuses. Unknown statuses will default to “Other [O]”.
  • Setting a Status or Callback on leads emits an event that we catch and pass to VICI, so Dialer work does not affect the workflow for non-dialer users. We can use these events later for other dialer integrations if needed! (See the sketch after this list.)
  • When Vici is disabled but a user is enabled via User Settings, that user will load all dialer resources where necessary. This is a good way to only allow certain agents to load the Dialer if you want to play around with it before giving it to everyone else.
  • TLDialer will now show on the Agent Dashboard as a Button Link when enabled for the user or account.
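
As a rough illustration of the event hookup described above (the event name and handler here are hypothetical stand-ins, not our actual internal identifiers), a status change simply emits an event, and the dialer integration only attaches a listener when VICI is enabled:

```typescript
// Simplified sketch of the event-driven integration; event and handler names
// are hypothetical stand-ins for the real internals.
import { EventEmitter } from "node:events";

interface StatusEvent { leadId: number; status: string; }

const crmEvents = new EventEmitter();

// Only accounts/users with VICI enabled register a listener at all,
// so status changes cost nothing extra for non-dialer users.
function registerViciListeners(viciEnabled: boolean): void {
  if (!viciEnabled) return;
  crmEvents.on("lead:status", ({ leadId, status }: StatusEvent) => {
    // Forward the disposition to VICI here (API call omitted in this sketch).
    console.log(`Passing status "${status}" for TLD lead ${leadId} to VICI`);
  });
}

// Elsewhere in the CRM, setting a status just emits the event and moves on.
registerViciListeners(true);
crmEvents.emit("lead:status", { leadId: 1234, status: "SALE" });
```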

Working

  • Exit TLDialer
  • Login / Logout
    • Select Campaign.
    • Automatically Set Ingroups.
  • Show Campaign
    • Show Ingroups
  • Park / Grab.
  • Pause / Resume.
  • Record / Stop Recording.
  • Load Lead on Incoming Call.
  • Update Status on TLD Status Change.
  • Hangup Call.
  • Disposition Vici Lead from TLDCRM Lead.
    • Set Callback Disposition and Schedule from TLDCRM Lead / Callback form.
  • Automatic Hangup on Disposition
  • Queue
    • Show current Queue count
  • Transfers / Conferences
    • Show List of Users and their Statuses
      • Disable Transfers to agents not in READY or PAUSED status
    • Park and Transfer to Queue (Ingroup)
    • Ring and Transfer to Queue (Ingroup)
    • Release and Transfer to Queue (Ingroup)
    • Park and Transfer to Agent
    • Ring and Transfer to Agent
    • Release and Transfer to Agent
    • Hangup All Lines
    • Hangup Conference Line
    • Leave Conference (Remove Self)
  • Manual Dialing
    • Dial current lead from TLD to Vici
      • Clicking the dialpad icon on a phone number in a lead will initiate a call when paused.
      • It will search for a lead in Vici to load based on reference_id (VICI lead_id), then vendor_lead_code (TLD Lead ID), and then phone number. If no reference ID is set in TLD for the lead found, it will update it and connect the VICI lead to the CRM lead (see the sketch after this list).
  • DTMF Tones
    • Dial Pad
    • Send Typed in Digits
      • Supports letters! Converts letters to numbers for you (see the conversion sketch after this list).
        • Example: 1-888-646-WOLF(9653)
  • Update vendor_lead_code with TLDCRM Lead ID on creation of TLDCRM Lead from Vici
    • Currently happens via Dialer Pop, not via direct insert. We may add that option later.
  • Update lead data in VICI when pressing Save in TLDCRM
    • Lead must have both Reference ID (Vici’s Lead ID) and TLD Lead ID.
  • Qualification Time
    • Show qualification time for the particular inbound vendor.
    • Show countdown timer based on time the customer called in.
  • Jump back to Current Call Lead while Navigating CRM.
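
For reference, here is a rough sketch of the manual-dial lead lookup order described above; the ViciApi and TldApi interfaces are hypothetical stand-ins for the real TLDCRM and VICI services:

```typescript
// Rough sketch of the manual-dial lookup order; ViciApi and TldApi are
// hypothetical stand-ins for the real services.
interface ViciLead { lead_id: number; }

interface ViciApi {
  getLeadById(leadId: number): Promise<ViciLead | null>;
  getLeadByVendorCode(tldLeadId: number): Promise<ViciLead | null>;
  getLeadByPhone(phone: string): Promise<ViciLead | null>;
}

interface TldApi {
  setReferenceId(tldLeadId: number, viciLeadId: number): Promise<void>;
}

async function findViciLead(
  vici: ViciApi,
  tld: TldApi,
  lead: { id: number; reference_id?: number; phone: string },
): Promise<ViciLead | null> {
  // 1) reference_id (VICI lead_id), 2) vendor_lead_code (TLD Lead ID), 3) phone number.
  const match =
    (lead.reference_id ? await vici.getLeadById(lead.reference_id) : null) ??
    (await vici.getLeadByVendorCode(lead.id)) ??
    (await vici.getLeadByPhone(lead.phone));

  // If a lead was found but the CRM lead had no reference ID yet, link the two.
  if (match && !lead.reference_id) {
    await tld.setReferenceId(lead.id, match.lead_id);
  }
  return match;
}
```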
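
And the letter-to-digit conversion for vanity numbers is just the standard telephone keypad mapping, roughly like this:

```typescript
// Standard telephone keypad mapping used to convert vanity letters to digits.
const KEYPAD: ReadonlyArray<[letters: string, digit: string]> = [
  ["ABC", "2"], ["DEF", "3"], ["GHI", "4"], ["JKL", "5"],
  ["MNO", "6"], ["PQRS", "7"], ["TUV", "8"], ["WXYZ", "9"],
];

function lettersToDigits(input: string): string {
  return input.toUpperCase().replace(/[A-Z]/g, (ch) => {
    const entry = KEYPAD.find(([letters]) => letters.includes(ch));
    return entry ? entry[1] : ch;
  });
}

// Example from above: "1-888-646-WOLF" becomes "1-888-646-9653".
console.log(lettersToDigits("1-888-646-WOLF"));
```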

Pending

  • Queue
    • Show Queue List, with links to TLD leads where available.
  • Group Aliases
    • Show List of Group Aliases (Outbound Caller ID)
    • Select Group Alias on Transfers / Call
  • Qualification Time
    • Update lead cost when hanging up before Qualification Time

Bugfix

  • 2018-4-10
    • Fixed issue with Disposition Status 0, Trash. Now works.
  • 2018-4-13
    • Fixed issue where refreshing the session disconnected the agent but they couldn’t get back in.
    • Fixed issue when disconnecting the phone; it now ends the session and hangs up any clients.
    • Fixed issue with Pause not showing reliably.
    • Fixed issue with Transfer button to agents not showing based on correct statuses

Issues

  • Working on behavior when you refresh the screen; currently it disconnects like native Vici but sometimes leaves a stuck session.