Special Look: Face Time (part 2: SIP and Data Streams)


In part 1 of this series we looked at the protocols involved in a Facetime call. The basic outline of the Facetime network exchange is as follows:
  • Unknown TCP protocol starts the conversation (TCP/5223);
  • Unknown UDP traffic between the iPhone and two hosts with similar IP addresses (UDP/16385 and UDP/16386);
  • Certificate validation through an Akamai server (HTTP);
  • HTTPS request to an Apple server;
  • STUN traffic for NAT traversal;
  • SIP traffic for call setup and negotiation;
  • UDP stream data for video/audio.
In this installment of the series we'll look at the last two components: SIP and the UDP stream information.

Examining SIP

SIP is the Session Initiation Protocol, used for controlling the setup and establishment of audio and video calls over TCP or UDP. As a text-based protocol, it looks a lot like HTTP (verbs like INVITE and BYE and numeric response codes), with a little SMTP love thrown in there as well.

Wireshark does a great job of identifying SIP traffic, even on non-standard ports. While SIP is typically done over port 5060 (or 5061 for SIP over TLS), Facetime is using UDP/16402. Wireshark gives us a summary of SIP activity in the packet capture by selecting the Telephony | SIP option, as shown.

Of note here is the absence of the SIP request method REGISTER, which would be used with digest authentication to authenticate the device. This isn't a statement of vulnerability, but it indicates that Apple is not using the standard SIP authentication method, instead relying on an alternate exchange to authenticate the devices.

Also interesting here is the use of the SIP MESSAGE verb. According to RFC 3428, the MESSAGE verb is an extension to SIP intended for instant messaging.

Otherwise, the SIP exchange is straightforward, as follows:

  • INVITE from the initiator to the responder;
  • ACK from the responder;
  • Several MESSAGE frames back and forth;
  • After a few minutes (the duration of the video call), a BYE from the responder to terminate the session.
The SIP exchange is shown in Wireshark packet list form below:

A few IP addresses worth clarifying here:
  • The remote iPhone from Apple's 888-Facetime service;
  • My iPhone's IP address on my open WiFi network;
  • My NAT address from my ISP, previously negotiated with STUN.
In the packet list we see that Apple is using user@address:port for the SIP address (URI). Looking at the detail of the INVITE frame we can gather additional detail. First we'll look at the message header content (content has been omitted to protect the privacy of the remote caller):

More interesting stuff here:
  • The Display component in the To and From fields reveals the cell phone number of both parties. In the first "To:" field shown, my cell phone number is listed "4015242911" followed by an unknown "570". This is interesting since the 888-Facetime caller's phone number was blocked from my phone display, but accessible to me from a packet capture.
  • The User-Agent of the 888-Facetime caller is "Viceroy 1.4/GK", which is similar to the User-Agent used by the iChat video client ("Viceroy 1.3", or "1.2" in older iChat clients).
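
To make the observations above concrete, here is a minimal Python sketch that pulls the display names and the User-Agent out of a SIP request. The sample message is hypothetical: the phone numbers and IP addresses are placeholders, not values from the capture, and only the "Viceroy 1.4/GK" string comes from the trace discussed here. The function names are my own, for illustration only.

```python
import re

# Hypothetical INVITE headers. The numbers and hosts are placeholders
# (documentation address ranges), NOT values taken from the capture.
SAMPLE_INVITE = (
    "INVITE sip:user@192.0.2.10:16402 SIP/2.0\r\n"
    'To: "5550100001570" <sip:user@192.0.2.10:16402>\r\n'
    'From: "5550100002" <sip:user@198.51.100.7:16402>;tag=abc123\r\n'
    "User-Agent: Viceroy 1.4/GK\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_sip_headers(message: str) -> dict:
    """Map lower-cased header names to values for one SIP message."""
    head, _, _body = message.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n")[1:]:  # skip the request/status line
        name, sep, value = line.partition(":")
        if sep:
            # Keep the first occurrence if a header is repeated.
            headers.setdefault(name.strip().lower(), value.strip())
    return headers


def display_name(header_value: str):
    """Return the quoted display name of a To/From value, or None."""
    match = re.match(r'\s*"([^"]*)"', header_value)
    return match.group(1) if match else None


if __name__ == "__main__":
    hdrs = parse_sip_headers(SAMPLE_INVITE)
    print("To display name:  ", display_name(hdrs["to"]))
    print("From display name:", display_name(hdrs["from"]))
    print("User-Agent:       ", hdrs["user-agent"])
```

Since SIP here travels in cleartext over UDP, anything in these headers is readable by whoever can see the packets, which is why the blocked caller number shows up in the capture.
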
Looking at the message body detail reveals more details about the session:

The message body details the Session Description Protocol (SDP) content, including the SDP session owner as "GKVoiceChatService" which is documented in Apple's iPhone SDK. We can also see the RTP Control Protocol (RTCP) negotiated for UDP/16402, as well as multiple negotiated media attributes, essentially reduced to AAC for audio and X-H.264 for video.

Later in the SIP exchange, we see several of the MESSAGE verbs. Although intended for use in instant messaging applications, the MESSAGE verb is used by Facetime to exchange arbitrary data between the two iPhone devices. The MESSAGE verb payload data repeats the "User-Agent: Viceroy 1.4/GK" information, then includes the header "Content-Type: application/ske", similar to an HTTP exchange. Following this header we have a Content-Length header and "SKESeq: 1;0" for the first of the 4 MESSAGE verbs. Each subsequent MESSAGE verb also includes this content, changing the numeric identifier "1" for the successive packets (e.g. "SKESeq: 2;0", "SKESeq: 3;0" and "SKESeq: 4;0").

We can apply the display filter 'sip.Request-Line contains "MESSAGE"' to focus the Wireshark display on these MESSAGE frames, as shown below.

A quick Google search doesn't turn up anything about the SKE protocol, though I'll speculate here that it is some kind of authentication negotiation mechanism. A summary of the 4 payloads following the SKESeq header is as follows:
  • SKESeq 1: A large-ish payload commonly around 785 bytes which appears to include certificate-looking information.
  • SKESeq 2: Always 4 bytes of payload: "61 f4 27 9f" (in one capture)
  • SKESeq 3: A consistent payload length of 170 bytes, no significant ASCII strings.
  • SKESeq 4: Always 4 bytes of payload: "53 a0 8e a3" (in one capture)
This data requires further analysis, possibly representing a proprietary authentication protocol used by Facetime through SIP MESSAGE verbs.  I'll devote further analysis to a later article so we can move on to the good stuff.
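
For anyone who wants to tabulate these payloads across several captures, the following Python sketch splits a raw MESSAGE request into its SKESeq values and payload. It assumes SKESeq appears as an ordinary header line, as it seemed to in the trace; the request line and address in the sample are hypothetical, and the meaning of the second SKESeq field (always 0 above) is unknown.

```python
def parse_ske_message(raw: bytes):
    """Split a SIP MESSAGE into (seq, second_field, payload).

    Assumes "SKESeq: N;M" is a normal header line. The meaning of the
    second field M is not known; it is returned untouched as an int.
    """
    head, _, payload = raw.partition(b"\r\n\r\n")
    seq = second = None
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"skeseq":
            first, _, rest = value.strip().partition(b";")
            seq, second = int(first), int(rest)
    return seq, second, payload


# Hypothetical request: only the header names, the User-Agent string and
# the 4-byte payload are taken from the article; the URI is a placeholder.
SAMPLE_MESSAGE = (
    b"MESSAGE sip:user@192.0.2.10:16402 SIP/2.0\r\n"
    b"User-Agent: Viceroy 1.4/GK\r\n"
    b"Content-Type: application/ske\r\n"
    b"Content-Length: 4\r\n"
    b"SKESeq: 2;0\r\n"
    b"\r\n"
    + bytes.fromhex("61f4279f")
)

if __name__ == "__main__":
    seq, second, payload = parse_ske_message(SAMPLE_MESSAGE)
    print(f"SKESeq {seq};{second}: {len(payload)} bytes, {payload.hex(' ')}")
```
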

Data Streams

Following the SIP exchange we see an RTP exchange over UDP/16402 with a reflexive source port. To evaluate this stream we'll turn to the videosnarf tool by Arjun Sambamoorthy and Jason Ostrom. Videosnarf and the parent tool ucsniff are really impressive, and Jason and Arjun are really cool guys as well.
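
As background for what a tool like videosnarf has to do first, here is a small Python sketch that decodes the 12-byte fixed RTP header defined in RFC 3550. It is an illustration of the header layout only, not videosnarf's code, and the names `RtpHeader` and `parse_rtp_header` are my own.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(datagram: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header at the start of a UDP payload."""
    if len(datagram) < 12:
        raise ValueError("too short for an RTP header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit synchronization source identifier.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", datagram[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )
```

A version field of 2 and a payload type that matches one of the dynamic types negotiated in the SDP body are quick sanity checks that a UDP stream really is RTP.
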

Videosnarf can read from a libpcap file, but the current version of the tool does not properly accommodate wireless packet capture link types other than native 802.11 (e.g. it cannot interpret PPI or Radiotap headers), with the following error:

# videosnarf -i 4g-inbound-888FACETIME-session-1.pcap
Starting videosnarf 0.63
[+]Starting to snarf the media packets
[+] Please wait while decoding pcap file...
[-] Invalid IP header length: 0 bytes
[-] Invalid IP header length: 0 bytes
[-]No RTP media stream found
[+]Snarfing Completed

My packet capture uses the PPI header, so I added support to handle this link type with videosnarf.  Download and apply the patch as shown (against videosnarf 0.63; future versions will hopefully integrate this functionality and not require patching):

# cd videosnarf-0.63
# wget -q http://www.willhackforsushi.com/code/videosnarf-wifi-ppiheader.diff
# patch -p1 <videosnarf-wifi-ppiheader.diff
patching file src/videosnarf.c
patching file src/videosnarf.h
# ./configure && make && make install
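
The patch itself is not reproduced here. As a rough illustration of what handling the PPI link type involves (a Python sketch, not the C code in the patch), the fixed PPI packet header is 8 bytes, little-endian: a version byte, a flags byte, a 16-bit total header length, and a 32-bit DLT for the encapsulated frame. Skipping the header length gets you to the 802.11 frame. The function name `strip_ppi` is my own.

```python
import struct

DLT_IEEE802_11 = 105  # libpcap link type for raw 802.11 frames


def strip_ppi(frame: bytes):
    """Return (inner_dlt, inner_frame) for a packet captured with PPI.

    The 16-bit length covers the whole PPI header, including any
    per-packet field blocks that follow the fixed 8 bytes.
    """
    if len(frame) < 8:
        raise ValueError("too short for a PPI packet header")
    _version, _flags, length, dlt = struct.unpack("<BBHI", frame[:8])
    if length < 8 or length > len(frame):
        raise ValueError("bad PPI header length")
    return dlt, frame[length:]
```

A reader would still need to walk the 802.11 and LLC/SNAP headers that follow before reaching the IP header, which is presumably where an unpatched tool ends up reading an IP header length of 0.
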

Once videosnarf includes the ability to read from wireless packet captures with the PPI header, we can run it against the packet capture again:

# videosnarf -i 4g-inbound-888FACETIME-session-1.pcap
Starting videosnarf 0.63
[+]Starting to snarf the media packets
[+] Please wait while decoding pcap file...
[-] Invalid IP header length: 16 bytes
Protocol: Unsupported
[-] Invalid IP header length: 16 bytes
[-] Invalid IP header length: 16 bytes
[+]Stream saved to file H264-media-1.264
[+]Stream saved to file H264-media-2.264
[+]Stream saved to file H264-media-3.264
[+]Stream saved to file H264-media-4.264
[+]Number of streams found are 4
[+]Snarfing Completed
# ls -l H264-media-*
-rw-r--r-- 1 root root  413160 Jul  5 18:24 H264-media-1.264
-rw-r--r-- 1 root root  272459 Jul  5 18:24 H264-media-2.264
-rw-r--r-- 1 root root 3765017 Jul  5 18:24 H264-media-3.264
-rw-r--r-- 1 root root 1761492 Jul  5 18:24 H264-media-4.264

Videosnarf was able to extract four H.264 data streams, saving them to files.  We can quickly evaluate the contents of the files to determine if the content itself is encrypted using the "ent" tool:

# ent H264-media-3.264
Entropy = 4.509034 bits per byte.

Optimum compression would reduce the size
of this 3765017 byte file by 43 percent.

Chi square distribution for 3765017 samples is 298830527.55, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 55.8586 (127.5 = random).
Monte Carlo value for Pi is 3.626079279 (error 15.42 percent).
Serial correlation coefficient is 0.622531 (totally uncorrelated = 0.0).

Ent applies several tests to evaluate the entropy and randomness of a given file.  In this example, entropy is fairly low at 4.5 bits per byte.  Compare this to a data stream collected from the Linux /dev/urandom device:

# dd if=/dev/urandom of=rand bs=4096 count=1000
1000+0 records in
1000+0 records out
4096000 bytes (4.1 MB) copied, 1.64117 s, 2.5 MB/s
# ent rand
Entropy = 7.999961 bits per byte.

Optimum compression would reduce the size
of this 4096000 byte file by 0 percent.

Chi square distribution for 4096000 samples is 224.01, and randomly
would exceed this value 90.00 percent of the times.

Arithmetic mean value of data bytes is 127.5652 (127.5 = random).
Monte Carlo value for Pi is 3.142760882 (error 0.04 percent).
Serial correlation coefficient is 0.000133 (totally uncorrelated = 0.0).

Or an encrypted file of all 0's:

# dd if=/dev/zero of=zero bs=4096 count=1000
1000+0 records in
1000+0 records out
4096000 bytes (4.1 MB) copied, 0.0211891 s, 193 MB/s
# openssl enc -aes-128-cfb -in zero -out zero.enc
enter aes-128-cfb encryption password:
Verifying - enter aes-128-cfb encryption password:
# ent zero.enc
Entropy = 7.999947 bits per byte.

Optimum compression would reduce the size
of this 4096016 byte file by 0 percent.

Chi square distribution for 4096016 samples is 300.32, and randomly
would exceed this value 5.00 percent of the times.

Arithmetic mean value of data bytes is 127.4714 (127.5 = random).
Monte Carlo value for Pi is 3.137092793 (error 0.14 percent).
Serial correlation coefficient is 0.000067 (totally uncorrelated = 0.0).
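
The "Entropy" line that ent prints is the Shannon entropy of the byte-value distribution, in bits per byte. If you don't have ent handy, the same figure can be approximated with a few lines of Python; this is a minimal sketch of that one measurement only, not a reimplementation of ent's other tests.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    import sys

    # Usage: python entropy.py H264-media-3.264 rand zero.enc
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(f"{path}: {shannon_entropy(fh.read()):.6f} bits per byte")
```

Values close to 8 mean every byte value is about equally likely, which is what random or well-encrypted data looks like; markedly lower values mean some byte values dominate.
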

Clearly the output from the Facetime video stream as extracted by videosnarf is not encrypted. Sadly, it does not appear that the extracted data is viable to play with mplayer:

# mplayer H264-media-3.264 -fps 17
MPlayer 1.0rc2-4.3.2 (C) 2000-2007 MPlayer Team
CPU: Intel(R) Core(TM)2 Duo CPU     L7100  @ 1.20GHz (Family: 6, Model: 15, Stepping: 11)
CPUflags:  MMX: 1 MMX2: 1 3DNow: 0 3DNow2: 0 SSE: 1 SSE2: 1
Compiled with runtime CPU detection.
mplayer: could not connect to socket
mplayer: No such file or directory
Failed to open LIRC support. You will not be able to use your remote control.

Playing H264-media-3.264.
H264-ES file format detected.
xscreensaver_disable: Could not find XScreenSaver window.
Opening video decoder: [ffmpeg] FFmpeg's libavcodec codec family
Selected video codec: [ffh264] vfm: ffmpeg (FFmpeg H.264)
Audio: no sound
FPS forced to be 17.000  (ftime: 0.059).
Starting playback...
[h264 @ 0x896a290]illegal POC type 5
[h264 @ 0x896a290]sps_id out of range
[h264 @ 0x896a290]sps_id out of range
[h264 @ 0x896a290]decode_slice_header error
[h264 @ 0x896a290]concealing 12 DC, 12 AC, 12 MV errors

MPlayer interrupted by signal 11 in module: decode_video
- MPlayer crashed by bad usage of CPU/FPU/RAM.
  Recompile MPlayer with --enable-debug and make a 'gdb' backtrace and
  disassembly. Details in DOCS/HTML/en/bugreports_what.html#bugreports_crash.
- MPlayer crashed. This shouldn't happen.

It appears the reconstructed file is close to an H.264 file, but has some errors preventing it from being played back.  This is still positive from an attack perspective though, since we know the content is not encrypted; hopefully the videosnarf developers will release an updated version soon that can address any problems with reconstructing and saving the H.264 stream.
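
One way to dig further into a file like this is to look at its structure rather than play it. The Python sketch below assumes the extracted file is an H.264 Annex B byte stream (consistent with mplayer's "H264-ES file format detected"), scans for 00 00 01 start codes, and counts the NAL unit types that follow. It does not diagnose the playback failure; it only shows whether the file contains the parameter sets and slices a decoder expects. The function name is my own.

```python
from collections import Counter

# A few NAL unit types from the H.264 specification.
NAL_NAMES = {1: "non-IDR slice", 5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}


def nal_unit_types(stream: bytes) -> Counter:
    """Count nal_unit_type values that follow Annex B start codes.

    Searching for the 3-byte code 00 00 01 also catches the 4-byte
    form 00 00 00 01, since the latter ends with the former.
    """
    counts = Counter()
    pos = stream.find(b"\x00\x00\x01")
    while pos != -1 and pos + 3 < len(stream):
        counts[stream[pos + 3] & 0x1F] += 1  # low 5 bits of the NAL header
        pos = stream.find(b"\x00\x00\x01", pos + 3)
    return counts


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        for nal_type, n in sorted(nal_unit_types(fh.read()).items()):
            print(f"type {nal_type:2d} ({NAL_NAMES.get(nal_type, 'other')}): {n}")
```

A stream with no SPS (type 7) or PPS (type 8) units, or with many unexpected types, would point at how the stream was reassembled rather than at the player.
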


Let's summarize what we learned today:

  • While Facetime uses SIP, it does not use the standard authentication mechanisms;
  • Phone number information is disclosed in the SIP exchange, even if it is blocked on the phone itself;
  • Facetime uses the SIP MESSAGE verb for passing arbitrary data between iPhone devices involved in a Facetime call.  This could be a proprietary authentication mechanism;
  • Videosnarf with a minor patch can extract video and audio stream data;
  • The video and audio content of a Facetime conversation are NOT encrypted, leaving them susceptible to eavesdropping attacks if the underlying WLAN infrastructure is weak or otherwise compromised;
  • Mplayer is unable to play back this stream data today; hopefully fixes can be applied by the videosnarf team to resolve this in the future.

Next time we'll spend some time looking at the initial TCP exchange between the iPhone 4g and the authorization process that initiates the connection.  Comments and questions are welcome, thanks!



  1. Fascinating. Mind you, I don't think anyone really expects phone calls to be safe from eavesdropping anyway, regardless of the medium (being cellular or copper wire).

    You may get better luck with vlc. I tend to find VLC does a better job of bad stream handling than mplayer does. You could also apply a 'skip' of a few seconds, perhaps the video is only bad at the beginning. But, it could just be all wrong in the headers...

    Being that the iPhone is so powerful, I'm not sure why it doesn't encrypt the video. But then your video calls would really be safer than your GSM (non-3g calls) thanks to recent attacks on the GSM encryption.

    I'm glad someone is doing this analysis though. It's important that people (read, the general public) understand the implications of using this technology.

  2. Awesome Josh .. Glad to see you joined the team !

  3. Sadly, vlc does not play the H.264 file either. I believe it could be an issue with how videosnarf is extracting the data; I'm sure Jason and Arjun will have some great things to say about it shortly.

  4. Note that the distribution of symbols in an H.264 coded bitstream would be expected to be almost even; otherwise, the underlying compression algorithm would be leaving too much redundant information.

  5. I tested a bit regarding sending sip invites directly towards an iPhone 4. You can find my results at:

    Searching for FaceTime pcap traces or users to do test calls.

  6. The reason why you can't get the RTP packets to play is because they're encrypted. The Facetime announcement that Apple made said that it's SRTP. Your tests for encryption are not only silly, but the answers you got are what you'd expect from an encrypted SRTP payload.

  7. I may be wrong, but the entropy looks weird to me. Not only is the stream not encrypted, but it also looks like it's not very well compressed. I mean, take any h264+aac stream capture and give it to the best compression algorithm and I doubt you will achieve a 43% size reduction. Doesn't that just mean that captured packets are "stuffed", packetized or interleaved with predictable data ?

  8. SKE most likely stands for symmetric key exchange. The clients are basically passing keys they probably got from the Apple server earlier. The two keys authenticated the clients to each other. Also, in the earlier article you mention port 5223. I think that's simply XMPP with SSL enabled.

  9. Josh,

    Did you ever find out if the reason you couldn't play the file was that it was encrypted? Apple seems adamant that it is even though it isn't one of their "features" of Facetime.


  10. Unknown TCP protocol starts the conversation (TCP/5223); = Apple Push Notification Service

    Unknown UDP traffic between the iPhone and two hosts with similar IP addresses (UDP/16385 and UDP/16386); = RTP, RTCP

  11. thanx for info, but i'd better recommend to use this nice FaceTime calls recorder http://www.imcapture.com/IMCapture-for-FaceTime/, i do enjoy it!)

  12. Is there any common information exchanged between STUN packets in a face time call and initial HTTP session when a user signs in to face time account with an apple ID so as to link the face time call to logged in user ID