Some Frequently Asked Questions about RTP

Is RTP a transport protocol or a kind of application protocol?
RTP has important properties of a transport protocol: it runs on end systems, it provides demultiplexing. It differs from transport protocols like TCP in that it (currently) does not offer any form of reliability or a protocol-defined flow/congestion control. However, it provides the necessary hooks for adding reliability, where appropriate, and flow/congestion control. Some like to refer to this property as application-level framing (see D. Clark and D. Tennenhouse, "Architectural considerations for a new generation of protocols", SIGCOMM'90, Philadelphia). RTP so far has been mostly implemented within applications, but that has no bearing on its role. TCP is still a transport protocol even if it is implemented as part of an application rather than the operating system kernel.

RTP does not ensure real-time delivery. So how come it is called a real-time protocol?
No end-to-end protocol, including RTP, can ensure in-time delivery. This always requires the support of lower layers that actually have control over resources in switches and routers. RTP provides functionality suited for carrying real-time content, e.g., a timestamp and control mechanisms for synchronizing different streams with timing properties.

Is RTP an unreliable protocol? Are there any mechanisms provided for error recovery in RTP?
As currently defined, RTP does not define any mechanisms for recovering from packet loss. Such mechanisms are likely to be highly dependent on the packet content. For example, for audio, it has been suggested to add low-bit-rate redundancy, offset in time. For other applications, retransmission of lost packets may be appropriate. (The H.261 RTP payload definition offers such a mechanism.) This requires no additions to RTP. RTP probably has the necessary header information (like sequence numbers) for some forms of error recovery by retransmission.

Can RTP run over IPv6? ATM?
Yes. RTP contains no specific assumptions about the capabilities of the lower layers, except that they provide framing. It contains no network-layer addresses, so that RTP is not affected by addressing changes. Any additional lower-layer capabilities such as security or quality-of-service guarantees can obviously be used by applications employing RTP. There are several implementations of video tools that run RTP directly over AAL5 (T. Braun) and recent efforts to define the carriage of RTP over AAL2 and AAL5. It should be noted that the RTCP CNAME field is currently based on the assumption that hosts have Internet-style domain names.

Can RTP be used in asymmetric networks?
In asymmetric networks, the bandwidth in one direction, typically from the user to the Internet, is significantly lower than in the other. These networks include ADSL, cable modems and satellite distribution. RTP can be used readily, but it may be necessary to have only data senders send RTCP messages. These RTCP messages are useful to allow inter-media synchronization and identify the content of the media stream.

Why doesn't RTP have a length field?
RTP does not contain a length field, that is, it assumes that framing is performed by the underlying protocol and that only one RTP packet is to be carried in one PDU of the underlying protocol. This is the typical application with UDP (or AAL5) as the underlying protocol. Since most applications currently envisioned do not need framing, it would be a waste of processing and bandwidth to add one. This is covered in detail in the section RTP over Network and Transport Protocols of the spec.
If RTP is used with a protocol that is not message-based (e.g., TCP) or if it is desirable to carry several RTP packets in one lower-layer PDU (e.g., for aggregation of streams), it is trivial to define a profile that prefixes the RTP header by a 16 or 32-bit length field, depending on the desired tradeoff between overhead and maintaining word alignment.
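
The length-prefix idea can be sketched as follows. This is only an illustration, assuming a 16-bit big-endian length field in front of each RTP packet on a byte stream; the function names frame_packet and deframe are made up for this example and are not taken from any profile.

```python
import struct

def frame_packet(rtp_packet: bytes) -> bytes:
    """Prefix one RTP packet with a 16-bit big-endian length (assumed layout)."""
    if len(rtp_packet) > 0xFFFF:
        raise ValueError("packet too large for a 16-bit length field")
    return struct.pack("!H", len(rtp_packet)) + rtp_packet

def deframe(stream: bytes):
    """Split a byte stream into RTP packets; return (packets, leftover bytes)."""
    packets = []
    offset = 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        if offset + 2 + length > len(stream):
            break  # incomplete packet; wait for more data
        packets.append(stream[offset + 2 : offset + 2 + length])
        offset += 2 + length
    return packets, stream[offset:]
```

A receiver on a stream transport would append newly read bytes to the leftover and call deframe again. A 32-bit field would keep the RTP header word-aligned at the cost of two more bytes per packet.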

Does RTP have a fixed packetization interval?
Some implementations assume that packet audio is sent with a particular packetization interval, e.g., 20 ms. This is wrong. While RFC 1890 recommends certain values and SDP allows a preference to be expressed, implementations need to be able to handle all reasonable values. There is no constraint that G.711 or other sample-based formats be conveyed in multiples of a certain unit. Thus, an RTP packet with 123 samples of G.711 is perfectly legitimate and needs to be handled appropriately.

Are all these fields really needed?
Periodically, it is suggested to create an "RTP lite" version with a header shorter than 12 bytes. It is argued that, in particular, packet voice does not require all the RTP header fields and is particularly sensitive to packet header overhead due to the short payloads.

In general, the best compression is accomplished using RTP header compression, as it can compress the IP/UDP/RTP headers from 40 to one or two bytes. However, it only works for short-delay unicast connections on a single link.

For wide-area links that see a lot of voice traffic, e.g., for PBX interconnect, RTP muxing is far more efficient, since it avoids the overhead of IP and UDP packet headers, as well as featuring shorter RTP headers. Using RTP muxing, the overhead can be reduced to about two bytes per "channel", with one UDP/IP header for up to several dozen channels.

A minimal version of RTP would likely contain a sequence number (SN) and a payload type (PT), with a minimum combined size of two bytes. Unfortunately, such a choice would have a number of disadvantages:

How does padding work?
Since the underlying transport unit defines the end of the packet, the application can always locate the last byte of the (say, UDP) packet and look there for the number of padding bytes.
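
As a rough sketch of that lookup: the padding (P) bit is the third bit of the first header byte, and when it is set the last byte of the packet holds the number of padding bytes, counting itself. The function name strip_padding is illustrative, and the bounds check only accounts for the 12-byte fixed header, not for CSRC entries or a header extension.

```python
def strip_padding(packet: bytes) -> bytes:
    """Return the packet without trailing padding bytes.

    The P bit in the returned header is left as it was; a real
    implementation would usually just remember the shorter length.
    """
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    if not packet[0] & 0x20:          # P bit clear: nothing to remove
        return packet
    pad = packet[-1]                  # count includes this last byte
    if pad == 0 or pad > len(packet) - 12:
        raise ValueError("invalid padding count")
    return packet[:-pad]
```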

Practically speaking, how is the timestamp computed?
For audio, the timestamp is incremented by the packetization interval times the sampling rate. For example, for audio packets containing 20 ms of audio sampled at 8,000 Hz, the timestamp for each block of audio increases by 160, even if the block is not sent due to silence suppression. Also, note that the actual sampling rate will differ slightly from this nominal rate, but the sender typically has no reliable way to measure this divergence.
For video, the timestamp clock rate is fixed at 90 kHz. The timestamps generated depend on whether the application can determine the frame number or not. If it can, or if it can be sure that it is transmitting every frame with a fixed frame rate, the timestamp is governed by the nominal frame rate. Thus, for a 30 f/s video, timestamps would increase by 3,000 for each frame, and for a 25 f/s video by 3,600 for each frame. If a frame is transmitted as several RTP packets, these packets would all bear the same timestamp. If the frame number cannot be determined or if frames are sampled aperiodically, as is typically the case for software codecs, the timestamp has to be computed from the system clock (e.g., gettimeofday()).
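
The arithmetic above can be written out as a small sketch. The function names are illustrative; the only assumptions beyond the text are that the RTP timestamp is a 32-bit field that wraps around and that rounding to the nearest integer is acceptable for rates that do not divide evenly.

```python
VIDEO_CLOCK_HZ = 90_000  # fixed video timestamp clock rate

def audio_ts_increment(packet_ms: float, sampling_rate_hz: int) -> int:
    """Timestamp units covered by one audio packet."""
    return round(packet_ms * sampling_rate_hz / 1000)

def video_ts_increment(frames_per_second: float) -> int:
    """Timestamp units between consecutive frames at a fixed frame rate."""
    return round(VIDEO_CLOCK_HZ / frames_per_second)

def next_timestamp(ts: int, increment: int) -> int:
    """Advance a 32-bit RTP timestamp, wrapping around at 2**32."""
    return (ts + increment) % 2**32
```

For audio, next_timestamp is applied once per block even when the block is suppressed as silence, so the timestamp keeps tracking the sampling clock.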

In a multimedia conference, are the initial timestamp values related?
No, initial timestamp values are picked randomly and independently for each RTP stream. (This is more or less unavoidable if different media types are generated by independent applications, whether these applications reside on the same host or not.) Synchronization (such as lip sync) between different media is performed by receivers through the NTP timestamps in the RTCP sender reports. This timestamp provides a common time reference that associates a media-specific RTP timestamp with the common "wallclock" time shared across media. RTP does not prescribe how end systems synchronize different media; however, a workable approach is to periodically exchange messages between applications to indicate what delay each application would impose on the stream (including any media decoding delays) if it were not to synchronize, and then have all applications choose the maximum of these delays.

What are the roles of the RTP timestamp and sequence numbers?
The timestamp is used to place the incoming audio and video packets in the correct timing order (playout delay compensation). The sequence number is mainly used to detect losses. Sequence numbers increase by one for each RTP packet transmitted, while timestamps increase by the time "covered" by a packet. For video formats where a video frame is split across several RTP packets, several packets may have the same timestamp. In some cases such as carrying DTMF (touch tone) data (RFC 2833), RTP timestamps may not be monotonic.
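
A minimal sketch of loss detection from sequence numbers, assuming only that the sequence number is a 16-bit field that wraps around. The function names are illustrative, and a real receiver would also have to cope with duplicates and restarts.

```python
def seq_gap(prev_seq: int, seq: int) -> int:
    """Signed distance from prev_seq to seq, modulo 2**16."""
    diff = (seq - prev_seq) & 0xFFFF
    return diff - 0x10000 if diff >= 0x8000 else diff

def packets_lost_between(prev_seq: int, seq: int) -> int:
    """Packets missing between two received packets (0 if reordered)."""
    return max(seq_gap(prev_seq, seq) - 1, 0)
```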

What are the different clocks and how are they synchronized?
RFC 3550 specifies one media timestamp in the RTP data header and a mapping between that timestamp and a globally synchronized clock, carried as RTCP timestamp mappings.

The NTP timestamps in the SR are assumed to be synchronized between all media senders within a single session. If the media sources come from the same network source, this is obviously not an issue. Receiver(s) synchronize to the sender, the only solution feasible for multicast.

Experience has shown that all other cross-media, cross-host schemes end up doing clock synchronization, usually inferior to NTP and application-specific.

What's the marker bit good for?
For voice packets, the marker bit indicates the beginning of a talkspurt. The beginnings of talkspurts are good opportunities to adjust the playout delay at the receiver to compensate for differences between the sender and receiver clock rates as well as changes in the network delay jitter. Packets during a talkspurt need to be played out continuously, while listeners generally are not sensitive to slight variations in the duration of a pause.

The marker bit is a hint; the beginning of a talkspurt can also be computed by comparing the difference in timestamps and sequence numbers between two packets, assuming the timestamp clock rate is known.

Packets may arrive out of order, so that the packet with the marker bit is received after the second packet in the talkspurt. As long as the playout delay is longer than this reordering, the receiver can still perform delay adaptation. If not, it simply has to wait for the next talkspurt.
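
One way to infer the start of a talkspurt without the marker bit is sketched below. It assumes the receiver knows how many timestamp units the previous packet covered (for sample-based audio, its sample count); the function name and parameters are illustrative. If the two packets are consecutive in sequence number but the timestamp jumped by more than the previous packet covered, blocks were skipped for silence.

```python
def starts_talkspurt(prev_seq: int, prev_ts: int, prev_samples: int,
                     seq: int, ts: int) -> bool:
    """Guess whether the packet (seq, ts) begins a new talkspurt."""
    if (seq - prev_seq) & 0xFFFF != 1:
        return False  # loss or reordering in between: cannot tell
    ts_delta = (ts - prev_ts) & 0xFFFFFFFF
    return ts_delta > prev_samples
```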

What is the sender packet count and byte count used for?
They are not needed for loss computation; the sequence number fields are used for that to avoid round-off errors. They may be used to compute the sender packet and byte rate.

What is the RTP timestamp in the RTCP sender report used for?
The RTP timestamp and NTP timestamp form a pair that identifies the absolute time of a particular sample in the stream. For example, if the RTCP sender report contains an RTP timestamp of 1234 and an NTP timestamp indicating February 3, 10:14:15, it means that sample 1234 in the media stream occurred exactly on February 3, 10:14:15.
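
Given one such pair, a receiver can place any other RTP timestamp of the same stream on the wallclock axis. The following is a sketch under stated assumptions: the NTP timestamp is taken as a 64-bit value with seconds in the upper 32 bits and a binary fraction in the lower 32 bits, the RTP timestamp is 32 bits wide, and the function names are made up for this example.

```python
def ntp_to_seconds(ntp64: int) -> float:
    """Convert a 64-bit NTP timestamp to seconds (with fraction)."""
    return (ntp64 >> 32) + (ntp64 & 0xFFFFFFFF) / 2**32

def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int,
                     sr_wallclock_s: float, clock_rate_hz: int) -> float:
    """Wallclock time of rtp_ts, given the (RTP, wallclock) pair from an SR."""
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:       # rtp_ts lies before the SR reference
        delta -= 0x100000000
    return sr_wallclock_s + delta / clock_rate_hz
```

With an 8,000 Hz clock, a packet stamped 8,000 units after the reported RTP timestamp was sampled one second after the reported wallclock time.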

How is the jitter computed?
If several packets, say, within a video frame, bear the same timestamp, it is advisable to only use the first packet in a frame to compute the jitter. (This issue may be addressed in a future version of the specification.)

Jitter is computed in timestamp units. For example, for an audio stream sampled at 8,000 Hz, the arrival time measured with the local clock is converted by multiplying the seconds by 8,000.
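
A sketch of the running jitter estimate, following the smoothing rule in RFC 3550 (each new transit-time difference is folded in with a gain of 1/16). The class name is illustrative, and for brevity the sketch ignores timestamp wraparound and uses floating point where an implementation would normally use integers.

```python
class JitterEstimator:
    """Running interarrival jitter, in RTP timestamp units."""

    def __init__(self, clock_rate_hz: int):
        self.clock_rate_hz = clock_rate_hz
        self.prev_transit = None
        self.jitter = 0.0

    def update(self, arrival_s: float, rtp_ts: int) -> float:
        # Convert the local arrival time to timestamp units.
        arrival_units = arrival_s * self.clock_rate_hz
        transit = arrival_units - rtp_ts
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16
        self.prev_transit = transit
        return self.jitter
```

Only differences of transit times enter the estimate, so the unknown offset between the sender's and receiver's clocks cancels out.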

Steve Casner wrote:

For encodings such as MPEG that transmit data in a different order than it was sampled, this adds noise into the jitter calculation. I have heard handwavy arguments that this factor can be calculated out given that you know the shape of the noise, but my math isn't strong enough for that.

In many of the cases that we care about, the jitter introduced by MPEG will be small enough that when the network jitter is of the same order we don't have a problem anyway.

There is another problem for video in that all of the packets of a frame have the same timestamp because the whole frame is sampled at once. However, the dispersion in time of those packets really is all part of the network transfer process that the receiver must accommodate with its buffer.

It has been suggested that jitter be calculated only on the first packet of a video frame, or only on "I" frames for MPEG. However, that may color the results also because those packets may see transit delays different than the following packets see.

The main point to remember is that the primary function of the RTP timestamp is to represent the inherent notion of real time associated with the media. It also turns out to be useful for the jitter measure, but that is a secondary function.

The jitter value is not expected to be useful as an absolute value. It is more useful as a means of comparing the reception quality at two receivers or comparing the reception quality 5 minutes ago to now.

What is the session bandwidth?
First, it is most certainly not the link bandwidth. This would not scale, as then a large number of sessions could saturate the link with RTCP traffic, even if each used just 5% of the link bandwidth for RTCP. Secondly, the concept of link bandwidth is ill-defined in a heterogeneous network.

The session bandwidth is the nominal data bandwidth plus the IP, UDP and RTP headers (40 bytes). For example, for 64 kb/s PCM audio packetized in 20 ms increments, the session bandwidth would be (160 + 40) / 0.02 bytes/second or 80 kb/s. If there are multiple senders, the sum of their individual bandwidths is used.
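
The example calculation can be reproduced with a few lines. This is a sketch only: the function names are illustrative, the senders parameter assumes all senders use the same bandwidth, and the 5% RTCP share is the figure mentioned elsewhere in this answer.

```python
HEADER_BYTES = 40  # IP (20) + UDP (8) + RTP (12)

def session_bandwidth_bps(payload_bytes: int, packet_interval_s: float,
                          senders: int = 1) -> float:
    """Session bandwidth in bits per second, headers included."""
    return senders * (payload_bytes + HEADER_BYTES) * 8 / packet_interval_s

def rtcp_bandwidth_bps(session_bps: float, fraction: float = 0.05) -> float:
    """Share of the session bandwidth set aside for RTCP."""
    return session_bps * fraction
```

For 160-byte PCM payloads every 20 ms, session_bandwidth_bps(160, 0.02) gives about 80,000 b/s, of which roughly 4,000 b/s would go to RTCP.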

The session bandwidth is typically defined out-of-band, e.g., in a session announcement protocol, based on reasonable estimates of the number of concurrent senders and their average bandwidth. Distributed and consistent on-line estimation of the session bandwidth may be hard as the number of senders and their bandwidth change. The absolute value is less important than that all participants agree on a common value. (After all, there is nothing special about choosing the RTCP bandwidth to be 5% of the session bandwidth; it just has to be agreed upon by all participants to avoid timing out members prematurely.)

What is the use of RTCP for two-party calls?
Since the cost of sending RTCP is minimal (about one packet every 5 seconds), it makes sense to send RTCP even for point-to-point connections:

How do I register an RTP payload type?
See the description, drawn from RFC 1890 (with some practical comments).

What is the current list of RTP payload types?
See the current version of the RTP profile or the list maintained by IANA at http://www.iana.org/assignments/rtp-parameters.

What are dynamic payload types?
Dynamic payload types are described in the RTP A/V Profile. Unlike static payload types, dynamic payload types are not assigned in the RTP A/V Profile or by IANA. They map an RTP payload type to an audio and video encoding for the duration of a session. Different members of a session could, but typically do not, use different mappings. Dynamic payload types use the range 96 to 127. They are assigned by means outside of the RTP profile or protocol specification, including

Note that a number of encodings are described in the RTP A/V profile which do not have a static (permanent) payload type. The RTP A/V Profile defines names for encodings which may be used by SDP or other mechanisms to specify the mapping. Encodings may also be identified by object identifiers or other names.

Since the space for payload types is limited, only very common encodings should be assigned static types. These are typically audio and video encodings "blessed" by international standardization bodies, such as the G. series of ITU-T audio encodings. The RTP A/V Profile defines a set of criteria for making static assignments.

If I'm using H.323 or other set-up protocol, can I ignore the RTP payload type (PT) field?
An application must never just play a packet without inspecting its payload type, even if a single payload type has been negotiated via H.245 or similar protocols. New mechanisms may conveniently use the PT to indicate special packets, which an end application can ignore, if desired, ensuring backward compatibility. But this assumption is violated if an application blindly plays back all packets regardless of PT.

Also, in multicast environments, it is unlikely that every sender will use the same payload type.
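
A minimal sketch of such a check, assuming RTP version 2 and the standard header layout in which the payload type occupies the low 7 bits of the second byte. The function names and the idea of passing a set of negotiated payload types are illustrative.

```python
def payload_type(packet: bytes):
    """Return the 7-bit payload type, or None if this is not RTP version 2."""
    if len(packet) < 12 or packet[0] >> 6 != 2:
        return None
    return packet[1] & 0x7F  # the top bit of this byte is the marker bit

def should_play(packet: bytes, negotiated_pts) -> bool:
    """Play only packets whose PT was negotiated; ignore everything else."""
    return payload_type(packet) in negotiated_pts
```

A receiver that negotiated only payload type 0 would call should_play(packet, {0}) and quietly drop packets carrying, say, a dynamic payload type it does not understand.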

Should the RTP payload type (PT) field be used for multiplexing different streams?
It has been suggested that in some environments (such as RTP over AAL5) that lack lower-layer muxing abilities, the RTP payload type (PT) field be used to differentiate streams originating from different sources. This is a fundamentally bad idea and violates the letter and intent of the specification. It makes use of multiple PTs in a single stream difficult (see previous question). It is also unnecessary, as the SSRC was designed for distinguishing several sources.

Should the RTP SSRC be used for demultiplexing different streams for the same RTP session?
The RTP SSRC is meant to label streams from different sources, that is, each sender in a conference has its own SSRC. It has been suggested to have a single source, using the same RTP session (identified by source and destination addresses and ports), send different media, such as an audio and video stream, using different SSRCs. This is generally a bad idea for the following reasons: (contributed by Steve Casner)

Do receivers need their own SSRC identifiers?
Yes, all participants in an RTP session have SSRC values, since they are needed in receiver reports.

Why can't we just use TCP for audio and video?
For delivering audio and video for playback, TCP may be appropriate. Also, with sufficiently long buffering and adequate average throughput, near-real-time delivery using TCP can be successful, as practiced by the Netscape WWW browser. TCP may often run over highly lossy networks (e.g., the German X.25 network) with acceptable throughput, even though the uncompensated losses would make audio or video communication impossible.

However, for real-time delivery of audio and video, TCP and other reliable transport protocols such as XTP are inappropriate. The three main reasons are:

An additional small disadvantage is that the TCP and XTP headers are larger than a UDP header (40 bytes for TCP and XTP 3.6, 32 bytes for XTP 4.0, compared to 8 bytes). Also, these reliable transport protocols do not contain the necessary timestamp and encoding information needed by the receiving application, so that they cannot replace RTP. (They would not need the sequence number as these protocols assure that no losses or reordering takes place.)

While LANs often have sufficient bandwidth and low enough losses not to trigger these problems, TCP does not offer any advantages in that scenario either, except for the recovery from rare packet losses. Even in a LAN with no losses, the TCP slow start mechanism would limit the initial rate of the source for the first few round-trip times.

Can't we just use XTP?
Many of the arguments parallel those in the previous section. The question of the relationship of RTP and XTP appears to arise frequently. (This may simply be due to the word 'transport' in both protocol names.) However, XTP and RTP are not replacements for each other. XTP is designed as a general, configurable network and transport protocol for both reliable and unreliable data communications. RTP has no reliability mechanisms (although these could be added if desired for specific applications) and no flow control like the rate control in XTP. RTP is not intended for regular, reliable data transfer (where TCP or XTP might be used instead). For real-time data, where retransmission is usually not possible due to timing constraints, XTP would have to disable retransmission. Flow/congestion control for real-time data is most likely inappropriate as the rate of such sources is inherently given and not modifiable on the time-scale of transport-protocol flow control, as explained in the previous section. It should be noted that RTP supports mechanisms that allow a form of congestion control on longer time scales, e.g., by modifying the source encoder if network congestion is detected.

RTP has no protocol state by itself and can thus be used over either connection-less networks, such as IP/UDP, or connection-oriented networks, such as XTP, ST-II or ATM (AAL3/4 or AAL5). Many real-time multimedia applications use multicast with a large fan-out, e.g., several hundred to thousands for a lecture or concert. Connection-oriented protocols like XTP have difficulty scaling to such a large number of receivers.

XTP does not offer timing or content type (media) information, and thus would need these services, as offered by RTP. XTP provides no RTP-like direct feedback of the received quality-of-service, and thus, again, would have to "import" these from another protocol. Looking at existing applications using XTP for real-time services confirms that they need to add a layer similar in content to the RTP data part "between" XTP and the actual media.

How should RTP sessions be played back?
Since RTCP packets contain absolute time information, a recorded session cannot simply be played back by time-shifting the whole recorded session. One approach plays back the data packets with their original time stamps, with re-normalized timing. SDES information other than NOTE items can be gathered for each source and regenerated as in a "live" session. NOTE SDES items need to be inserted at the appropriate instant in the playback as they are allowed to change.

Is there an RTP library or kernel implementation?
RTP (in particular, the data part) is tightly coupled to the application, so that a kernel implementation makes little sense. A number of people have developed libraries that implement RTP and RTCP (see listing). The sources to NeVoT, rtpdump, vat, rat and vic also contain RTP and RTCP processing modules which should be usable in other applications with minor modifications. Note also that the specification itself contains numerous code fragments. (Most of the other applications are using older versions of RTP and thus should not be relied upon for developments.)

The Java Media Framework (JMF), a Java API, also supports RTP and RTCP.

There is no standard API for RTP.

What are some of the differences between the VAT protocol and RTP?
The VAT protocol was originally implemented in the VAT audio tool and subsequently also in other audio tools such as NeVoT. The VAT protocol is now obsolete and should not be used or implemented.

The VAT header format is only described in header files. (See the VAT and NeVoT sources for details.) Many aspects of RTP and the VAT protocol are similar, but RTP improves upon the VAT protocol in a number of ways:

What are the differences between RTP version 1 and 2?
Version 1 is of historical interest only. Applications should not be written for it. RTP version 2 is not backwards compatible with version 1. If you care, you can find a definition of version 1 in an old Internet draft.

Are there specific ports assigned to RTP?
No, as explained in the section Port Assignment of the RTP profile:

As specified in the RTP protocol definition, RTP data is to be carried on an even UDP port number and the corresponding RTCP packets are to be carried on the next higher (odd) port number.

Applications operating under this profile may use any such UDP port pair. For example, the port pair may be allocated randomly by a session management program. A single fixed port number pair cannot be required because multiple applications using this profile are likely to run on the same host, and there are some operating systems that do not allow multiple processes to use the same UDP port with different multicast addresses.

However, port numbers 5004 and 5005 have been registered for use with this profile for those applications that choose to use them as the default pair. Applications that operate under multiple profiles may use this port pair as an indication to select this profile if they are not subject to the constraint of the previous paragraph. Applications need not have a default and may require that the port pair be explicitly specified. The particular port numbers were chosen to lie in the range above 5000 to accommodate port number allocation practice within the Unix operating system, where port numbers below 1024 can only be used by privileged processes and port numbers between 1024 and 5000 are automatically assigned by the operating system.

Also, the multicast (version 3.5 and later) kernel sources use the following port ranges:
from    to      application     priority
0       16383   unclassified    lowest
16384   32767   audio           highest
32768   49151   whiteboard      medium
49152   65535   video           low

Note: The port ranges in question do not make any difference unless the traffic traverses an interface or tunnel where the multicast traffic rate exceeds the configured mrouted rate-limiter.

If RTP is used within the H.323 framework, port assignment is done by the H.225.0 signaling messages. In SDP and SIP, the conference controller or inviting party picks the port numbers.

How are ports assigned for bidirectional unicast RTP sessions?
Each side in a bidirectional RTP session assigns their source ports independently, i.e., there is no assumption that if Alice sends audio to Bob on port 5000 (and RTCP on 5001), Alice also has to receive audio on port 5000. (Imposing such a restriction on ports would make it difficult for a host to participate in several independent RTP sessions using different tools.) Each side in a unicast session simply indicates to the other side where it wants to receive RTP packets, e.g., using SDP.

Note that the SSRC values used for each source are always different.

What about firewalls?

Ports used:
H.323   TCP     1720
H.235   TCP     ephemeral, > 1024

What is the quality of audio codec X?
See separate summary with audio samples.

Are all audio codecs patented?
Most older, higher-bitrate codecs are not subject to patent protection. However, G.723, G.729.1 and GSM are covered by various patents. For example, U.S. patent 4,752,956, Digital speech coder with baseband residual coding modifies coding using short term fine structure speech data produced by analysers within encoder-multiplexers applies to GSM and is assigned to Philips.

Is there a way for a router to tell apart RTP packets?
No, if the router doesn't have access to the protocol that established a session, there is no way to tell by looking at a single packet that it is an RTP packet. However, if the router maintains state, it can inspect the sequence number and, with high probability, determine that a particular UDP port pair carries RTP if the sequence number increases by one (or a small number) for each packet. In addition, the first two bits of every packet will be the same, namely the RTP version identifier. It is also likely that the payload type will stay constant from packet to packet.
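
The stateful heuristic can be sketched as follows for a single UDP flow. Everything here is an assumption made for illustration: the class name, the threshold of consecutive matching packets, and the limit of 3 on what counts as a "small" sequence number step. It can only ever yield a guess, not a proof.

```python
class RtpFlowGuess:
    """Guess whether one UDP flow carries RTP by watching its packets."""

    def __init__(self, needed: int = 5, max_step: int = 3):
        self.needed = needed        # matching packets required (assumed)
        self.max_step = max_step    # largest "small" sequence step (assumed)
        self._reset()

    def _reset(self):
        self.prev_seq = None
        self.prev_pt = None
        self.streak = 0

    def observe(self, udp_payload: bytes) -> bool:
        """Feed one UDP payload; return True once the flow looks like RTP."""
        if len(udp_payload) < 12 or udp_payload[0] >> 6 != 2:
            self._reset()           # version bits are not 2
            return False
        pt = udp_payload[1] & 0x7F
        seq = int.from_bytes(udp_payload[2:4], "big")
        if self.prev_seq is not None:
            step = (seq - self.prev_seq) & 0xFFFF
            if 1 <= step <= self.max_step and pt == self.prev_pt:
                self.streak += 1
            else:
                self.streak = 0
        self.prev_seq, self.prev_pt = seq, pt
        return self.streak >= self.needed
```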

Are there related ITU efforts?
Audio and video media formats and encodings:
Name                                        Type   Algorithm                        Sampling frequency (kHz)  Bit rate
MPEG L3                                     audio                                   22.05, 44.1               48..128
G.711                                       audio  mu-law, A-law                    8.0                       64 kb/s
G.721 (subsumed by G.726)                   audio  ADPCM                            8.0                       32 kb/s
G.722                                       audio                                   16.0 (7 kHz spectrum)     64 kb/s
G.723 (recommendation no longer in force!)  audio                                   8.0                       24 kb/s
G.723.1                                     audio  ACELP and MQ-CLP                 8.0                       5.3 and 6.3 kb/s
G.726                                       audio  ADPCM                            8.0                       16, 24, 32, 40 kb/s
G.728                                       audio  low-delay CELP                   8.0                       16 kb/s
G.729                                       audio  CS-ACELP                         8.0                       8 kb/s
H.261                                       video  DCT
H.263                                       video  DCT (improved version of H.261)

H.324:
Audio and video over POTS at less than 20 kb/s.

For conferencing over ISDN:

H.221:
Frame structure for a 64 to 1920 kbit/s channel in audiovisual teleservices.

H.320:
Framework for transmitting audio and video over circuit-switched digital networks (primarily ISDN).

H.323:
H.320 over LAN.

For conference control, application and data sharing, there are a number of recommendations:

T.120:
Introduction to the audiographics and audiovisual conferencing recommendations.

T.121:
Generic application template.

T.122:
Multipoint communication service for audiographics and audiovisual conferencing service definition

T.123:
Protocol stack for audiographics and audiovisual teleconference applications.

T.124:
Generic conference control.

T.125:
Multipoint communication service protocol specification.

T.126:
Still image protocol specification.

T.127:
Multipoint binary file transfer protocol.

mbus
A protocol for coordinating multimedia applications.

The comp.speech FAQ contains many additional references, including a good summary of how different algorithms work.

Are there other efforts in using the Internet for real-time audio and video?

Too many, some may say. vat versions 3.4 and earlier, one of the early (recent) Internet audio applications, uses mostly the same audio encodings as specified in the RTP profile, but a different protocol. There are also a number of Internet telephony applications that usually only operate on PCs and in unicast mode. There are initial efforts to interconnect the public switched telephone network and the Internet.

CuSeeMe (for Windows PC and the Macintosh) is a combined audio and video tool using reflectors rather than IP-level multicast.

The Internet Telephony Consortium maintains a listing of standards and related efforts.


Last updated by Henning Schulzrinne