APPLICATION AWARE TRIGGERED QUALITY OF SERVICE (AATQoS)
Jared Valentine (hidden at xmission.com)
v1.2 March 13, 2008
The most recent version of this document can be found here: http://www.xmission.com/~hidden/aatqos
BACKGROUND / PROBLEM DESCRIPTION
I live in a multi-computer household. There are desktops, laptops, media centers, and servers on the network. When I'm not out visiting clients, I typically work out of my home office. My only option for broadband is a Qwest 1.5Mbps DSL circuit (1.5Mbps Down / 1.0Mbps Up). My problem, and the part that frustrates me to no end, is our Vonage Voice over IP phone line. Don't get me wrong, I've been quite pleased with the Vonage service. What frustrates me is the contention for bandwidth that causes the Vonage audio quality to suffer.
Some people say "well, don't use the Internet while you use the phone", but that's not realistic. Everyone uses bandwidth in different ways and at different times of the day. All it takes is one person downloading something, anything, and my Vonage quality tanks. Windows Updates, AntiVirus updates, YouTube, inline "video" ads on webpages, e-mail with large attachments, or my daughter watching an online preview of the next "My Little Pony" movie all make callers sound robotic and unintelligible. That doesn't even consider heavier traffic like FTP, NNTP, Bittorrent, online backups, etc. The future holds even more contention for limited bandwidth as media companies rush to make HD movies available for download over the internet and streaming IPTV takes off.
For those of you who are "gamers", feel free to substitute the terms "VoIP" and "phone call" in this article with your favorite online game (World of Warcraft, Quake, Counterstrike, etc). You know what I mean... every millisecond counts and a congested line can be the difference between an enjoyable game and being pwnd by a 14-year-old with a better ping than you. For those of you who are wondering what this looks like, here's a quick chart that shows average latency vs download utilization on my DSL line:
From 0% utilization up to about 800kbps, latency hovers around 50ms. Between 800kbps and 1.1Mbps, the latency slowly rises to 100ms. Then a really interesting thing happens between 1.1Mbps and the full 1.5Mbps: latency shoots through the roof and settles in the 630-700ms range.
I've spent countless hours on Google searching for the perfect traffic prioritization solution. Almost all of the QoS-related search results emphasize that "you can prioritize your outbound traffic, not your inbound." True prioritization requires control of both ends of the connection. Since I don't have control of my ISP's routers, this doesn't help either.
- http://www.faqs.org/docs/Linux-HOWTO/ADSL-Bandwidth-Management-HOWTO.html#AEN149 (see 3.5.1)
- http://vonage.nmhoy.net/qos.html
- http://www.aetherwide.com/articles/voip-pf.html
UNACCEPTABLE SOLUTIONS
I was able to implement one QoS scheme that provided perfect sounding VoIP calls. Unfortunately, it came at a cost that was too much to bear. I applied a heavy-handed traffic policing scheme to my WAN router that limited all incoming TCP traffic to 900kbps, which is a little more than half my 1.5Mbps DSL connection. The Vonage service uses UDP packets so they would be allowed to pass through the rate limit uninhibited. By limiting all TCP traffic to 900k things sounded great.
I tried rate limits in the 1.0/1.1Mbps range, but things sounded just "okay". Capping TCP at 900k and leaving 100k for a Vonage call kept the latency down in the 70ms range, and things sounded perfect! Why is this solution listed as unacceptable? Wasted bandwidth. Even when I wasn't on a call, that rate limit was eating up more than 1/3 of my available bandwidth. (The ADSL-Bandwidth-Management-HOWTO referenced above poses a similar problem and resolution, and recommends against it due to the high cost.) I felt I was on to something here, though, so I continued on.
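The arithmetic behind that trade-off can be sketched in a few lines of Python. This is only an illustration using the figures quoted in the text (1.5Mbps link, 900k TCP cap, roughly 100k per call); the function name and dictionary keys are made up for the example.

```python
# Figures from the text: a 1.5 Mbps downlink, a 900 kbps cap on TCP,
# and roughly 100 kbps for one Vonage call.
LINK_KBPS = 1500
TCP_CAP_KBPS = 900
CALL_KBPS = 100


def budget(link_kbps, tcp_cap_kbps, call_kbps):
    """Summarize what a permanent TCP cap costs on a given link."""
    in_call_load = tcp_cap_kbps + call_kbps
    return {
        "in_call_load_kbps": in_call_load,
        "in_call_utilization": in_call_load / link_kbps,
        "idle_waste_kbps": link_kbps - tcp_cap_kbps,
        "idle_waste_fraction": (link_kbps - tcp_cap_kbps) / link_kbps,
    }


summary = budget(LINK_KBPS, TCP_CAP_KBPS, CALL_KBPS)
for name, value in summary.items():
    print(name, round(value, 2))
```

With these inputs the in-call load is 1000 kbps, which stays inside the 800kbps to 1.1Mbps band where latency was still tolerable, while the idle cost is 600 kbps, or 40% of the link.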
For a short time, I used a one-click script that would telnet to my router and enable the traffic policing. I would launch that shortcut before I made or answered a call, and then launch a different script that would disable the policing when the call was complete. That worked great except for when my computer was off, rebooting, blue-screened, or being tinkered upon. And of course, it didn't help at all when I wasn't in my home office.
I started investigating alternative traffic-shaping products looking for a better solution. I looked at dd-wrt on a Linksys WRT54G and loaded pfSense on an old PC. I read all sorts of IOS manuals and QoS guides for a Cisco 1801 DSL router, the same for a 3Com 3031 DSL router. No matter what I tried or looked at, it always came back to the same paradigm: "you can only prioritize your outbound traffic, not the inbound."
THE HOLY GRAIL - AATQoS
For me, the holy grail of traffic prioritization would limit all incoming data traffic to 900k, _only while I was on the phone_. I would call this perfect scheme "Application Aware Triggered Quality of Service" or "AATQoS" for short. The basic steps for this prioritization scheme include: 1.) detecting the presence of a VoIP call, 2.) reacting to the presence of a VoIP call with rate limiting, 3.) detecting the absence of a VoIP call, and 4.) reacting to the absence of a VoIP call by removing rate limiting.
I have a background in Networking, Security, and VoIP. When I took a step back and spent a little time defining the problem and visualizing the "perfect solution", the answer became readily apparent. Why not take a security application that is generally tasked with looking for "bad" traffic and have it watch for traffic that needs to be prioritized instead? When the important traffic is seen, the previously identified steps can be taken to rate limit all other traffic. This ensures the latency-sensitive traffic's timely and unhindered delivery. Once the latency-sensitive traffic is gone, the temporary limits on all of the other traffic can be removed. This restores full access to all of the bandwidth.
At first I thought this project might take days, but thanks to pfSense and a handful of other open-source projects, this project only took an hour or two.
THE PLATFORM - pfSense
First off, pfSense (www.pfsense.org) is super-easy to install. I downloaded an ISO and installed it on an unused PC. pfSense is a very powerful platform that supports firewalling, traffic shaping, routing, intrusion detection, and transparent webcaching among many other things. Once the base system is configured, pfSense offers one-click options to install additional services. The Snort IDS engine is perfectly suited to detect VoIP calls. I am currently running the 1.2-RELEASE version of pfSense, although this process worked with the pfSense BETA-3 and -4 releases.
THE SENSING MECHANISM - Snort
In the pfSense GUI, go to System, then Packages. From here, you can click on the "+" sign next to Snort, which will launch the automated installation process. After installing Snort, I registered at the Snort website (www.snort.org) and requested an "oinkmaster key". This key allowed me to download pre-packaged Snort rules. I entered this key into the pfSense GUI and pfSense automatically downloaded a bunch of Snort rules. pfSense makes it very easy to enable rule categories and individual Snort rules. I enabled all of the voip.rules to see what would happen. Unfortunately, the built-in rules didn't seem to alert when I made or received calls, most likely because Snort usually looks for malicious traffic, not normal VoIP traffic. Without reliable logs, I couldn't get reliable traffic policing. I ended up writing my own Snort rules.
PACKET CAPTURE - tcpdump
Before I could write Snort rules, I needed to get a packet trace using a sniffer. It is impossible to write a detection rule when you don't know what you're looking for. Lucky for me, pfSense includes the ever-popular tcpdump utility. With tcpdump, I could capture packets on the wire and see the Vonage adapter's communications while making & receiving calls. I ssh'd to the pfSense box, got a shell, and then issued the command "tcpdump -w capture.cap -s 0 udp". I moved the capture.cap file from pfSense to my desktop using SCP and used Ethereal and/or Wireshark (www.wireshark.org) to investigate exactly what happened during phone calls. Please keep in mind that different VoIP providers may use different port numbers and protocols. Most standard SIP traffic uses UDP ports 5060 and 5061. Vonage, however, seems to have moved to UDP 10000 for their signalling (maybe due to the recent lawsuits?). The actual RTP stream seems to use UDP 10001+. Your mileage may vary depending on your provider & underlying VoIP technology.
Note: I initially configured Snort to match SIP keywords INVITE, CANCEL, and BYE. This required multiple snort rules and didn't scale well. It didn't work at all for multiple VoIP lines and would have been extremely difficult to define a Snort rule for all of the applications that people would want to have prioritized. If you're interested in reading more about keying off of start/stop messages, feel free to read a previous version of this webpage: AATQoS v1.1.
After looking at the packet capture, it looked like the Vonage Real-time Transport Protocol (RTP) stream could simply be defined as a UDP session with a source port of 10001+ and a destination port of 10001+.
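As a rough illustration of that definition, here is a small Python sketch that checks a packet's protocol and ports against Snort-style port specs ("any", a single port, "LOW:HIGH", "LOW:" or ":HIGH"). The helper names are hypothetical, and negated specs are deliberately left out to keep it short.

```python
def port_matches(spec, port):
    """Check a port against a Snort-style spec: any, N, LOW:HIGH, LOW:, :HIGH."""
    if spec == "any":
        return True
    if ":" not in spec:
        return port == int(spec)
    low, high = spec.split(":")
    low = int(low) if low else 0
    high = int(high) if high else 65535
    return low <= port <= high


def is_vonage_rtp(proto, src_port, dst_port):
    """Mirror the observation above: UDP with both ports at 10001 or higher."""
    return (proto == "udp"
            and port_matches("10001:", src_port)
            and port_matches("10001:", dst_port))


print(is_vonage_rtp("udp", 10050, 10052))  # media stream
print(is_vonage_rtp("udp", 10000, 10000))  # signalling port, not media
print(is_vonage_rtp("tcp", 10050, 10052))  # wrong protocol
```

Note that a match this broad catches any UDP flow between two high ports, so other applications on the network could trigger it as well.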
CREATING SNORT RULES
At this time, I wasn't interested in using any other Snort rules, so I disabled all of the default rules through the pfSense GUI and created my own category. Creating a new category is as simple as creating a file in the /usr/local/etc/snort/rules directory called 00police.rules. I used the vi (http://en.wikipedia.org/wiki/Vi) text editor to create the file. I used the name "00police.rules" so that it is the first category to show up in the pfSense Snort GUI. The 00police.rules file has the following entry:
/usr/local/etc/snort/rules/00police.rules
alert udp any 10001: -> any 10001: (msg:"VOIP-SIP RTP DETECT"; reference:url,www.ietf.org/rfc/rfc3261.txt; classtype:protocol-command-decode; threshold: type limit, track by_src, count 1, seconds 10; sid:72010; rev:1;)
Note: I don't claim to be a Snort expert. Until today, I'd never written a Snort rule. I'm sure there are more specific and accurate ways to write this rule with offsets, distances, regular expressions, etc. However, the above rule works perfectly for my purposes and I haven't gotten any false-positive matches (yet). If anyone has a more specific rule to detect the presence of Vonage RTP, let me know.
Notice the threshold statement in the Snort rule. This essentially limits Snort to a single alert every 10 seconds for each source address (track by_src). You don't want to log every single RTP packet. That would produce tens or hundreds of alerts per second, causing undue CPU utilization on your pfSense box and quickly filling up your hard drive. One log entry every 10 seconds works great for what we're trying to accomplish.
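To see why the threshold matters, the following Python sketch models a "type limit, track by_src, count 1, seconds 10" threshold in a simplified way: each source gets a 10-second interval that starts at its first packet, and only the first packet in each interval produces an alert. This is an approximation for illustration, not Snort's actual implementation, and the 50 packets-per-second rate and the address are assumed values.

```python
def limit_alerts(packets, count=1, seconds=10):
    """Return the (timestamp, src) packets that would still raise an alert.

    packets must be sorted by timestamp (in seconds).
    """
    intervals = {}  # src -> (interval_start, packets_seen_in_interval)
    alerts = []
    for ts, src in packets:
        start, seen = intervals.get(src, (ts, 0))
        if ts - start >= seconds:
            start, seen = ts, 0      # the previous interval has expired
        if seen < count:
            alerts.append((ts, src))
        intervals[src] = (start, seen + 1)
    return alerts


# 25 seconds of RTP at an assumed 50 packets per second from one source.
stream = [(i / 50, "192.0.2.10") for i in range(25 * 50)]
alerts = limit_alerts(stream)
print(len(stream), "packets ->", len(alerts), "alerts")
```

The point of the exercise is the ratio: over a thousand matching packets collapse into a handful of log entries, which is all the later correlation step needs.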
Now, with Snort running, I issued the command "tail -f /var/log/snort/alert" and made some test calls. While I was on a call, I would see the RTP DETECT message every 10 seconds. Once the call was complete, the RTP DETECT messages stopped. Once Snort was reliably detecting when I was on and off the phone, it was time to move to the next step: automating the policing.
AUTOMATING ACTIONS - Simple Event Correlator
I didn't see anything in the Snort documentation about automated actions based on rule matches, although I have to admit I didn't look for long. I found a log file analysis tool that did exactly what I was looking for: the Simple Event Correlator (SEC). You can download SEC from here: (http://sourceforge.net/projects/simple-evcorr/). SEC Documentation is found here: (http://simple-evcorr.sourceforge.net/)
In a nutshell, SEC can be configured to watch a log file for specific entries and then take customized actions for each match.
While reading the SEC documentation, the "SingleWith2Thresholds" event correlation rule type stood out as the right one to use. From the SEC documentation: "count matching input events during t1 seconds and if a given threshold is exceeded, execute an action list. Then start the counting of matching events again and if their number per t2 seconds drops below the second threshold, execute another action list. Both event correlation windows are sliding." This would work great. The first RTP packet detected will be the start of the rate-limiting time period. When the RTP stream is complete and the DETECT messages haven't been seen for more than 10 seconds, then it would be safe to remove the rate limit.
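The behaviour I was after can be modelled with a short Python sketch. It is a simplified stand-in for SingleWith2Thresholds, not SEC's own logic: it only handles the case where the second threshold is 0, so "disable" fires once no alert has been seen for window2 seconds. The default values are the thresh, window and window2 settings described for conf.txt; the function name and the "enable"/"disable" labels are made up for the example.

```python
from collections import deque


def correlate(event_times, thresh=1, window=1, window2=20):
    """Turn sorted alert timestamps (seconds) into (time, action) pairs."""
    actions = []
    recent = deque()   # alert times inside the first sliding window
    active = False     # True while the rate limit is in place
    last_seen = None
    for t in event_times:
        if active and t - last_seen >= window2:
            # The stream went quiet long enough before this alert arrived.
            actions.append((last_seen + window2, "disable"))
            active = False
            recent.clear()
        if active:
            last_seen = t
            continue
        recent.append(t)
        while recent and t - recent[0] >= window:
            recent.popleft()
        if len(recent) >= thresh:
            actions.append((t, "enable"))
            active = True
            last_seen = t
    if active:
        actions.append((last_seen + window2, "disable"))
    return actions


# One call with alerts every 10 seconds, a pause, then a second call.
print(correlate([0, 10, 20, 30, 100, 110]))
```

Because Snort logs one alert every 10 seconds during a call, a 20-second quiet period gives one missed alert of slack before the limit is lifted.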
I downloaded sec-2.4.1.tar.gz and placed it into the /usr/src directory. I extracted the files with "tar xzfp sec-2.4.1.tar.gz". This created a new directory called "/usr/src/sec-2.4.1", which I ended up using as the working directory for all of the scripts and log files associated with this project. There's probably a better, more appropriate place for all of these files - but this worked for me. Here is the conf.txt configuration file I used for SEC:
/usr/src/sec-2.4.1/conf.txt
type=SingleWith2Thresholds
ptype=RegExp
desc2=No RTP Payload, Disabling Police
Note: you don't tell SEC which log file to watch in the configuration file, you do it as part of the actual sec.pl command line. Details on launching sec.pl are found a little later in this document.
The first half of this rule watches the input file for a single (thresh=1) RTP packet during a 1 second window (window=1) specifically matching "DETECT" from the 00police.rules file. When successfully matched, SEC executes 3 actions:
Note: the SEC documentation says that you can use $d or %d to write the date, but I couldn't get that part to work properly on my pfSense box.
When 0 (thresh2=0) DETECT messages are received in a 20 second window (window2=20), then SEC launches these 3 actions:
AUTOMATED ACTIONS - Perl Telnet Scripts
The Net::Telnet perl module is required in order to create a perl-based telnet script. I don't think Net::Telnet was included with my pfSense distribution. I was unable to use cpan to get a copy of it either. I manually downloaded Net::Telnet from the following location: (http://search.cpan.org/~jrogers/Net-Telnet-3.03/lib/Net/Telnet.pm)
I placed Telnet.pm into /usr/local/lib/perl5/5.8.8/net/. Thankfully, there are plenty of perl/telnet script examples available on the Internet. After some trial and error, I came up with these scripts. They are very simplistic scripts that leverage 3 functions: open a telnet connection, wait for text, send text.
/usr/src/sec-2.4.1/police.pl
use Net::Telnet ();
/usr/src/sec-2.4.1/nopolice.pl
use Net::Telnet ();
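The three functions those scripts rely on (open a connection, wait for text, send text) can be sketched as follows. This is written in Python purely for illustration, not as a replacement for the Perl scripts. The login prompts, the passwords, and the order of the steps are assumptions about a typical IOS telnet session, and the police line is passed in by the caller because this document only shows it in shortened form. The connection object is anything that offers read_until(bytes, timeout) and write(bytes).

```python
def policing_dialog(password, enable_password, config_lines):
    """Build (wait_for, send) steps for an assumed IOS-style telnet login."""
    steps = [
        ("Password:", password),
        (">", "enable"),
        ("Password:", enable_password),
        ("#", "configure terminal"),
    ]
    steps += [("#", line) for line in config_lines]
    steps.append(("#", "end"))
    return steps


def run_dialog(conn, steps, timeout=10):
    """Wait for each prompt, then send the matching line."""
    for wait_for, send in steps:
        conn.read_until(wait_for.encode("ascii"), timeout)
        conn.write(send.encode("ascii") + b"\r\n")


# The text only shows the start of the police command, so this shortened
# form is a placeholder, not the full line from the router config.
POLICE_LINE = "police 900000"
enable_steps = policing_dialog(
    "vty-password", "enable-password",   # hypothetical credentials
    ["policy-map police", "class acgroup110", POLICE_LINE],
)
for wait_for, send in enable_steps:
    print(f"wait for {wait_for!r:12} then send {send!r}")
```

The nopolice variant would use the same steps with the police line negated. Keeping the dialogue as plain data makes it easy to check the sequence without touching a router.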
ROUTER CONFIGURATION - Cisco 1801
The router I used for the test is a Cisco 1801 DSL router, although any rate-limiting router would do the trick. Here are the snippets from the configuration that apply the rate limiting. They could probably be a little more specific and detailed (ie: different limits for inbound vs outbound traffic, further limiting "bulk" applications vs. interactive ones, etc.). I'm sure I'll modify this to be much more specific as time goes on.
class-map match-all acgroup110
policy-map police
access-list 110 permit tcp any any
interface Dialer0
The above police.pl perl script goes into policy-map police, class acgroup110, and adds the "police 900000..." line when I start a call. The nopolice.pl script removes that same police line once I hang up. The rest of the lines in the router configuration remain unchanged during and after VoIP calls.
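For readers who haven't used a policer before, the idea can be modelled with a token bucket. The Python sketch below is a minimal illustration under stated assumptions, not a description of how IOS implements "police": the 900000 bps rate comes from the text, the 15000-byte burst is an assumed value, and only TCP is policed because access-list 110 matches TCP only.

```python
class Policer:
    """Token-bucket policer: packets that exceed the budget are dropped."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = burst_bytes   # start with a full bucket
        self.last_time = 0.0

    def allow(self, now, size_bytes):
        # Refill for the time elapsed, never beyond the burst size.
        elapsed = now - self.last_time
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


def admit(policer, now, proto, size_bytes):
    """Only TCP is policed, mirroring 'access-list 110 permit tcp any any'."""
    if proto != "tcp":
        return True
    return policer.allow(now, size_bytes)


policer = Policer(rate_bps=900_000, burst_bytes=15_000)  # burst is assumed
results = [admit(policer, 0.0, "tcp", 1500) for _ in range(12)]
print(results.count(True), "of 12 back-to-back TCP packets admitted")
print("UDP admitted:", admit(policer, 0.0, "udp", 200))
```

Dropped TCP segments cause the sender to back off, which is what keeps the inbound queue at the ISP short while the UDP voice packets pass untouched.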
TYING IT TOGETHER w/ pfSense
Finally, I wanted SEC to launch each time that pfSense reboots. To do this, I placed a startup script on the pfSense box.
/usr/local/etc/rc.d/sec.sh
#!/bin/sh
I also ran "chmod a+x sec.sh" to make sure it was seen as an executable file by the operating system. I believe pfSense calls the scripts in /usr/local/etc/rc.d once in a while (every 24 hours?) and I didn't want multiple instances of the SEC script running. The "pkill" line effectively kills any running instance of the SEC perl script first and then launches a new copy. I rebooted the pfSense box, ssh'd in and checked the output of "ps aux". From here, I could see Perl running the sec.pl script.
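The kill-then-launch pattern described above can be sketched like this. It is a Python illustration rather than the shell script itself, and several details are assumptions: that sec.pl sits in the working directory, that it reads the Snort alert file tailed earlier, and that pkill is matched against "sec.pl". The -conf, -input, -log and -detach options are standard SEC command-line options.

```python
import subprocess

SEC_DIR = "/usr/src/sec-2.4.1"
ALERT_LOG = "/var/log/snort/alert"


def sec_commands(sec_dir=SEC_DIR, alert_log=ALERT_LOG):
    """Return the (kill, start) argument lists for restarting SEC."""
    kill = ["pkill", "-f", "sec.pl"]            # assumed match pattern
    start = [
        "perl", f"{sec_dir}/sec.pl",            # assumed script location
        f"-conf={sec_dir}/conf.txt",
        f"-input={alert_log}",
        f"-log={sec_dir}/sec.log",
        "-detach",
    ]
    return kill, start


def restart_sec(run=subprocess.run):
    """Stop any running copy first so only one SEC instance is ever active."""
    kill, start = sec_commands()
    run(kill, check=False)   # a non-zero exit just means nothing was running
    run(start, check=True)


# Print the commands instead of running them.
for command in sec_commands():
    print(" ".join(command))
```

Killing first makes the script safe to run repeatedly, which matters if pfSense re-invokes everything in rc.d from time to time.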
MAKING SURE IT ALL WORKS
Finally, to make sure it was all working, I tailed the /usr/src/sec-2.4.1/sec.log file and made a few calls. As soon as the call started ringing, I would see "Enabling Police" and a timestamp. About 20 seconds after hanging up the call, I would see "Disabling Police" and the timestamp. I also made sure that the police statement was in my router while I was on a call. I checked it again once the call was complete and the police statement was gone. A few more tests with multiple HTTP/FTP downloads saturating the link showed that the police statement immediately took effect and the Vonage voice quality was crystal-clear.
OTHER THOUGHTS
Keep in mind that AATQoS doesn't need to be specific to pfSense, although I don't know if it could have been any easier with a different firewall/shaper. This should work great with any appliance/gateway/application that includes Snort and Perl. I fully expect others to be able to do this with IPCop, dd-wrt, openwrt, m0n0wall, smoothwall, etc. I'm sure that router manufacturers could easily add capability like this to their products.
This should work great for online gaming as well. At a minimum, your Snort rule just needs to pick out the game you're playing. Snort rules could be written to match the game in question, and the rest of the AATQoS framework could be followed. I don't play World of Warcraft, but some quick Googling tells me you could probably write a Snort rule that matches a source network of your home network, a source TCP port of 1024-65535, and a destination network of any, destination TCP port 3724. For as long as WoW is running, the limit will be in place. Access to full bandwidth will be restored 20 seconds after disconnecting from the game.
I believe this would be possible to do without Snort, matching off of standard layer 3/4 ACL & firewall matches as opposed to digging deeper into the packet. It would help if the router/firewall being monitored supported logging thresholds like Snort does. It's definitely something to look into and could make AATQoS that much easier to implement.
CAVEATS
THINGS TO DO
FREQUENTLY ASKED QUESTIONS
Q1.) Why didn't you have the scripts modify the built-in QoS in pfSense?
A1.) This is a great idea and I'd love to see someone do it. Personally, I don't use pfSense's built-in QoS because it interferes with (read: rate-limits) responses from the Squid cache. (What's the point of having a web cache that's limited to 1.5Mbps?) For those who don't have a router capable of ACLs and rate-limiting, this would actually be a great way to do it. pfSense prioritizes based on a percentage of available bandwidth. The automated action could re-write the config file that holds the prioritization settings, modify the qlanroot/qwanroot to the point where TCP data is limited to some smaller value, and then restart the shaper process with the new settings. While plenty of headroom will be available for VoIP, the remaining data would still be prioritized within the smaller pipe. Interactive sessions like Telnet could still have a higher priority than HTTP, which would have a higher priority than bulk applications like FTP, BitTorrent, etc. At the end of the call, replace the config file, restart the shaper process, and all of the bandwidth becomes available again.
Q2.) Would this work with broadband technologies that have a "boost" or "burst" feature (10+Mbps for the first few seconds of a download, then dropping off to the standard 5Mbps or 3Mbps)?
A2.) Sure. Just make your rate limiting calculations assume the worst
case scenario, i.e. without the boost technology there. Not that any
connection is "guaranteed", but base your calculations off of your standard
bandwidth number, not the boost number. While you're on the phone, the
boost won't be available, but things will sound crystal-clear. Once
you're off the phone the boost will be ready to go.
Q3.) Why don't you just get more bandwidth?
A3.) Since I'm only a block or two away from the DSLAM, I assume the reason I can only get 1.5Mbps is because it's a T1/copper-fed remote terminal. I wish Qwest would pull some fiber to it, as I know my copper loop supports 7Mbps. Cable is not available in my cul-de-sac and Comcast won't run a cable to me even though my backyard neighbor has it.
Q4.) Why don't you get another phone line from the local Telco?
A4.) I'm not happy about only getting 1.5Mbps and would rather spread my consumer dollars around. Vonage has some cool features too and the price is good.
Q5.) Why don't you just use a cellular telephone instead?
A5.) Cell service in my basement office is dismal.
Q6.) Will this work for IAX VoIP?
A6.) I don't see why not. IAX, like SIP, uses UDP as a transport and
signaling protocol. Where SIP usually uses UDP 5060/5061 for call control, IAX
uses UDP port 4569 for both control & media. Fire up a sniffer and watch the
traffic on UDP 4569. Once you can describe the traffic, write a rule.
Q7.) Why AATQoS?
A7.) All of the shorter acronyms were already taken (Application Aware QoS, Application Triggered QoS, Triggered QoS, etc.) and had plenty of results on search engines. I wanted to be able to watch the search engine results grow over time... maybe I'm a little vain that way. When I started, there were exactly 4 Google matches for AATQoS (and all of them somehow Greek-related).
DISCLAIMER
Use the information in this document at your own risk. I disavow any potential liability for the contents of this document. Use of the concepts, examples, and/or other content of this document is entirely at your own risk. All copyrights are owned by their owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark. Naming of particular products or brands should not be seen as endorsements. You are strongly recommended to take a backup of your system before major installation and backups at regular intervals.
CREDITS
I'd like to publicly thank the authors and contributors of pfSense, tcpdump, Snort, Simple Event Correlator, and FreeBSD for their time and effort in providing us these great tools.
© Copyright 2008 by Jared Valentine (hidden at xmission.com).
v1.0 January 15, 2008 - Initial Release
v1.1 February 28, 2008 - After a power bump, the script stopped working. Further inspection showed that Vonage had changed the SIP Signalling port on my Linksys adapter from 5060 / 5061 to UDP Port 10000. Made modifications to the snort rules for INVITE / CANCEL / BYE and things started working again.
v1.2 March 12, 2008 - Modified to detect presence of a stream instead of sign-in/sign-out messages. Graphics added.
The content presented on this web page is provided for your personal non-commercial use only and may not be republished in whole or in part without the express written or verbal consent of the publisher. All rights are reserved.