January 29, 2013
Things have been quiet while we’ve been heads down working, and we are now ready to share some of our progress. One item the CAF team has been working on is improving the eduroam health monitoring infrastructure behind the scenes. This is in response to intermittent reports that eduroam doesn’t work as well as it should. Enhancing the monitoring infrastructure lets us better assess how well (or not) eduroam in Canada is performing and identify improvements that can be made. This helps keep the quality of the eduroam service as good as it has been, or better, as eduroam spreads further across Canada.
What does the end user see?
For the most part, end users experience eduroam as something that either works or doesn’t and, as a rule, things appear to work smoothly. This is sometimes deceptive, but unintentionally so. Some devices mask the number of retries they make while attempting to get online, and all you may see is the checkmark beside the eduroam SSID indicating you are connected. What is not seen is that some devices retry aggressively, anywhere from 5 to hundreds of times in the span of a few minutes, to get online. Multiply this by the number of devices you carry (laptop, phone, tablet, etc.), add a wrong password on one device, and you get a glimpse of what the problem could be if it is not handled well.
While the end user doesn’t see or realize this is happening under the hood, these transactions are visible at the Canadian eduroam servers (only for traffic originating in Canada, of course). This style of activity is taken into consideration in our monitoring practices and the metrics we track. Because the traffic is encrypted, we often don’t have much to go on other than the destination and origin, but that is enough for us to engage the target sites and inform them that something may be going on or has occurred.
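To illustrate how much can be inferred from origin and destination alone, here is a minimal Python sketch that counts attempts per origin and destination pair in fixed five-minute windows and lists the pairs that look like aggressive retries. The event tuples, the function names, the example realms, and the threshold are all assumptions made for this illustration; this is not the actual tooling or log format used on the Canadian eduroam servers.

```python
from collections import Counter
from datetime import datetime, timedelta

# Reference point for bucketing naive timestamps into fixed windows.
EPOCH = datetime(1970, 1, 1)


def count_attempts(events, window=timedelta(minutes=5)):
    """Count attempts per (window, origin, destination).

    `events` is an iterable of (timestamp, origin, destination) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    counts = Counter()
    for ts, origin, destination in events:
        bucket = (ts - EPOCH) // window  # index of the window containing ts
        counts[(bucket, origin, destination)] += 1
    return counts


def noisy_pairs(events, threshold, window=timedelta(minutes=5)):
    """Return sorted (origin, destination) pairs that reach `threshold`
    attempts inside at least one window."""
    counts = count_attempts(events, window)
    return sorted({(o, d) for (_, o, d), n in counts.items() if n >= threshold})


if __name__ == "__main__":
    start = datetime(2013, 1, 29, 12, 0, 0)
    # Six attempts in under a minute from one site, one attempt from another.
    sample = [(start + timedelta(seconds=10 * i), "site-a.example", "site-b.example")
              for i in range(6)]
    sample.append((start, "site-c.example", "site-d.example"))
    print(noisy_pairs(sample, threshold=5))
```

Fixed windows keep the sketch short; a sliding window would catch bursts that straddle a window boundary, at the cost of a little more bookkeeping.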
Analyzing the Data So Far
With over a million successful monthly sign-ons since November 2012, we’ve had a lot of data to analyze! As a starting point, we are looking at requests that result in a ‘No Reply’ response in our logs at our root Canadian eduroam servers, which would indicate that a participant’s RADIUS server is temporarily offline.
Right now the traffic patterns show an overall ‘No Reply’ rate of 10% for RADIUS authentication requests. These requests appear in spikes, as in the above graph of 24 hours of eduroam traffic. This may be an artifact of the UDP-based protocol or of how ‘chatty’ mobile devices can be, but either way our goal is to understand what it means, how we can reduce the problem from current levels, and in turn improve the eduroam service.
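As a rough illustration of the kind of calculation behind a figure like this, the following Python sketch computes an overall ‘No Reply’ rate and a per-hour rate, then flags the hours that stand out as spikes. The record layout of (hour, outcome) pairs, the ‘Accept’ label, the function names, and the spike factor are assumptions for the example; only the ‘No Reply’ label comes from the logs described above, and the sample numbers are made up.

```python
from collections import defaultdict

NO_REPLY = "No Reply"


def overall_rate(records):
    """Fraction of all requests whose outcome is 'No Reply'."""
    if not records:
        return 0.0
    return sum(1 for _, outcome in records if outcome == NO_REPLY) / len(records)


def rate_by_hour(records):
    """Map each hour to its 'No Reply' fraction.

    `records` is a list of (hour, outcome) pairs, a hypothetical layout.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for hour, outcome in records:
        totals[hour] += 1
        if outcome == NO_REPLY:
            missing[hour] += 1
    return {hour: missing[hour] / totals[hour] for hour in sorted(totals)}


def spike_hours(records, factor=1.5):
    """Hours whose rate exceeds `factor` times the overall rate."""
    baseline = overall_rate(records)
    return [h for h, r in rate_by_hour(records).items() if r > factor * baseline]


if __name__ == "__main__":
    # Made-up sample: hour 0 is clean, hour 1 has 2 'No Reply' out of 10.
    sample = [(0, "Accept")] * 10 + [(1, "Accept")] * 8 + [(1, NO_REPLY)] * 2
    print(overall_rate(sample), rate_by_hour(sample), spike_hours(sample))
```

Comparing each hour against the overall rate is only one possible definition of a spike; a real report might compare against the same hour on previous days instead.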
What Canadian eduroam Sites May See Next
CANARIE will be analyzing log files a few times a week and may reach out to individual eduroam site contacts to clarify anomalies as we encounter them. We know time is precious and diagnosing a transient issue is difficult, so if we do contact you we will try to provide a detailed report about the time period in question. We use Splunk, a commercial log analysis tool, with our own custom reports that can pinpoint the issue and timeframe in question to save diagnosis time. Even with tools like Splunk, we still manually assess when to escalate to a site to ensure that it’s worth digging into, and we appreciate your help in going the ‘last mile’ with your local RADIUS and network logs.