US20110234200A1

US20110234200A1 - Adaptive slip double buffer

Info

Publication number: US20110234200A1
Application number: US13/065,583
Authority: US
Inventors: Kishan Shenoi
Original assignee: Floreat Inc
Current assignee: Floreat Inc
Priority date: 2010-03-24
Filing date: 2011-03-24
Publication date: 2011-09-29
Also published as: US8379151B2; US20110235500A1; US20110234902A1

Abstract

A method includes monitoring a fill in an adaptive slip buffer of a digital to analog convertor; adjusting a number of samples that are read from the adaptive slip buffer per page as a function of the fill; and reading the number of samples from the adaptive slip buffer. An apparatus includes a digital to analog convertor including an adaptive slip buffer and a read address generator coupled to the adaptive slip buffer, wherein the read address generator includes an increment control that adjusts a number of samples that are read from the adaptive slip buffer per page as a function of fill of the adaptive slip buffer.

Description

CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims a benefit of priority under 35 U.S.C. 119(e) from copending provisional patent applications U.S. Ser. No. 61/340,923, filed Mar. 24, 2010, U.S. Ser. No. 61/340,922, filed Mar. 24, 2010 and U.S. Ser. No. 61/340,906, filed Mar. 24, 2010, the entire contents of all of which are hereby expressly incorporated herein by reference for all purposes.

BACKGROUND INFORMATION

1. Field of the Invention
Embodiments of the invention relate generally to the field of digital networking communications. More particularly, an embodiment of the invention relates to methods and systems for packet (and/or frame) switched networking that include an adaptive slip double buffer.
2. Discussion of the Related Art
With the advent of Internet Protocol (“IP”), packet-based transmission and routing schemes are becoming ever more popular. It is well accepted that Next Generation Networks (“NGN”s) will be built upon these principles. However, several services, such as real-time voice and voice-band communication, that are well suited for circuit-switched (“TDM”) transmission and switching, have to be supported by this new architecture. VoIP (“voice over IP”) is one such example. The underlying premise of VoIP is that speech, after conversion from analog to digital format, can be packetized and several protocols such as RTP and RTCP (see Ref. [1,2]) have been developed to support the ability of IP networks to provide such real-time services.
One of the premises of NGNs is that the Quality of Experience (QoE) should be at least as good as, or even better than, that provided by the legacy circuit-switched network or PSTN (Public Switched Telephone Network). It is clear that delay is an important parameter in determining the QoE. It is well known that one-way delays that are very large (of the order of 400 ms or larger) are extremely detrimental from the view of subjective quality, making regular full-duplex conversation difficult. At lower one-way delays, the impact of echo is important. The Quality of Experience, for a given level of Echo Return Loss (ERL), drops rapidly with increasing delay.
The overall delay has four principal components. The process of packetization involves buffering information to fill the packet payload and thus introduces delay. The encoding and decoding algorithms, especially in the case of source codecs, require buffering as well. These two delays are often known quantities. The third component is the delay through the network. This delay is difficult to predict a priori since it depends on the physical distance, the number of intermediate packet switches involved in the end-to-end transport of a packet, and the bandwidth of the links between switches (routers). However, for two given end-points there is, in principle, a minimal network delay corresponding to the transit time of the fastest possible packet transmission. Considering that in a pure IP network the transmission path could be different for different packets, and the queuing delay in intermediate nodes is a function of congestion, the delay experienced by packets will be variable, ranging from the minimal delay to infinity (a packet lost in the network is construed as an instance of infinite delay). Some maximum delay threshold must be determined, and packets with delay greater than this maximum are discarded. Received packets are stored in a buffer whose size corresponds to the difference between minimum and maximum delays and so, practically speaking, fast packets are delayed so that the packets can be decoded and converted back to analog signals in a smooth fashion. The notion of play-out, or dejittering, whereby some delay is introduced via a jitter buffer, constitutes the fourth delay component. Clearly, in order to maximize the subjective quality of the call, the play-out buffer, also referred to as the jitter buffer, should be as small as possible.
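By way of nonlimiting illustration, the play-out rule described above (discard packets later than the maximum delay, and hold faster packets so that all packets play at a uniform offset) may be sketched as follows. The function name playout_schedule, its parameters and the use of None to model a lost packet are assumptions made for this sketch only.

```python
def playout_schedule(send_times_ms, delays_ms, min_delay_ms, max_delay_ms):
    """Return (buffer_depth_ms, play_times_ms, dropped_indices).

    A delay of None models a packet lost in the network (infinite delay).
    """
    # The jitter buffer spans the difference between maximum and minimum delay.
    buffer_depth_ms = max_delay_ms - min_delay_ms
    play_times_ms = []
    dropped_indices = []
    for index, (sent, delay) in enumerate(zip(send_times_ms, delays_ms)):
        if delay is None or delay > max_delay_ms:
            # Lost, or later than the maximum delay threshold: discard.
            dropped_indices.append(index)
            continue
        # Every accepted packet plays at the same offset from its send time,
        # so a fast packet waits longer in the buffer than a slow one.
        play_times_ms.append(sent + max_delay_ms)
    return buffer_depth_ms, play_times_ms, dropped_indices


depth, plays, dropped = playout_schedule(
    send_times_ms=[0, 10, 20, 30],
    delays_ms=[20, 45, None, 70],
    min_delay_ms=20,
    max_delay_ms=60,
)
```

In this example the buffer depth is 40 ms, the first two packets play at 60 ms and 70 ms, and the third (lost) and fourth (too late) packets are discarded.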

SUMMARY OF THE INVENTION

There is a need for the following embodiments of the invention. Of course, the invention is not limited to these embodiments.
According to an embodiment of the invention, a process comprises: monitoring a fill in an adaptive slip buffer of a digital to analog convertor; adjusting a number of samples that are read from the adaptive slip buffer per page as a function of the fill; and reading the number of samples from the adaptive slip buffer. According to another embodiment of the invention, a machine comprises: a digital to analog convertor including an adaptive slip buffer and a read address generator coupled to the adaptive slip buffer, wherein the read address generator includes an increment control that adjusts a number of samples that are read from the adaptive slip buffer per page as a function of fill of the adaptive slip buffer.
These, and other, embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating various embodiments of the invention and numerous specific details thereof, is given for the purpose of illustration and does not imply limitation. Many substitutions, modifications, additions and/or rearrangements may be made within the scope of an embodiment of the invention without departing from the spirit thereof, and embodiments of the invention include all such substitutions, modifications, additions and/or rearrangements.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings accompanying and forming part of this specification are included to depict certain embodiments of the invention. A clearer concept of embodiments of the invention, and of components combinable with embodiments of the invention, and operation of systems provided with embodiments of the invention, will be readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings (wherein identical reference numerals (if they occur in more than one view) designate the same elements). Embodiments of the invention may be better understood by reference to one or more of these drawings in combination with the following description presented herein. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.

FIG. 1 is a functional block view of a simplified depiction of a VoIP call (only one direction shown), appropriately labeled “PRIOR ART.”

FIG. 2 is a functional block view of a circular buffering action separating ADC and DAC clocks, appropriately labeled “PRIOR ART.”

FIG. 3 is a functional block view of a simplified model of VoIP over an IP network, appropriately labeled “PRIOR ART.”

FIG. 4 is a functional block view of transmission of voice-band signals over a packet network, appropriately labeled “PRIOR ART.”

FIG. 5 is a functional block view depicting the functions involved in generating the received speech signal, representing an embodiment of the invention.

FIG. 6 is a functional block view of an underlying principle of a retiming FIFO buffer (play-out buffer), representing an embodiment of the invention.

FIG. 7 is a functional block view of a double buffer arrangement for delivering samples to the DAC, representing an embodiment of the invention.

FIG. 8 is a functional block view of a simplified circular buffer arrangement, representing an embodiment of the invention.

FIG. 9 is a functional block view in more detail of “Read Add. Gen.” (433 in FIG. 8), representing an embodiment of the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

Embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well known starting materials, processing techniques, components and equipment are omitted so as not to unnecessarily obscure the embodiments of the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
Within this application several publications are referenced by Arabic numerals, or principal author's name followed by year of publication, within parentheses or brackets. Full citations for these, and other, publications may be found at the end of the specification immediately preceding the claims after the section heading References. The disclosures of all these publications in their entireties are hereby expressly incorporated by reference herein for the purpose of indicating the background of embodiments of the invention and illustrating the state of the art.
The invention described herein provides a novel approach to the play-out buffer, including a method to maintain optimal performance even in situations where the analog-to-digital converter (ADC) and digital-to-analog converter (DAC) have different underlying time-bases. In particular, a method based on controlled slips, a technique that is well known as being efficient in TDM architectures for addressing clock offset, is presented. The invention is an extension of controlled slip behavior. In particular, the slip mechanism is invoked primarily when the speech segment represents a synthetic signal, such as during periods of silence, or if the characteristics of the speech segment are such that the repetition/deletion of a speech sample will cause minimal subjective annoyance. It will be seen that an adaptive play-out buffer of the manner described here can form an integral part of an adaptive jitter buffer mechanism. Extensions of the invention include methods to implement adaptive clock control with minimal impact on subjective quality.

1. The Inherent Need for SYNCHRONIZATION

Strictly speaking, the term synchronization applies to alignment of time and the term syntonization applies to alignment of frequency, but in the telecommunication environment we often use the term synchronization to refer to either time-alignment, or frequency-alignment, or both. It is generally clear from the context which meaning is appropriate. All real-time communication carried over a digital network requires synchronization to some degree. This can be illustrated by considering the example of delivering a real-time voice signal between two geographically disparate points across a network.
The situation is depicted in FIG. 1. The analog source is converted into digital format by an analog-to-digital converter (ADC or A/D) operating at a sampling clock rate of nominally 8 kHz. Each sample is, conventionally, quantized to 8 bits so that the digital stream carrying the voice information is 8 kilo-octets-per-second or 64 kbps (see ITU-T Rec. G.711, Ref. [3], and Ref. [4]). This is regarded as a DS0 and represents “uncompressed” voice. In a conventional circuit-switched or TDM (Time Division Multiplexed) architecture, this DS0 is delivered “as is” to the destination for conversion back to analog format. In a packet-switched environment, exemplified by Voice-over-IP (VoIP), the DS0 is, possibly, compressed and organized into packets. These packets are delivered to the destination where the expansion to DS0 format is performed prior to conversion back to analog. Whereas the schemes described here are applicable regardless of the word-length employed for A/D conversion or D/A conversion, we shall henceforth assume here that these are done with a word-length of 8 bits (1 octet) (representative of μ-law and A-law formats provided in ITU-T Recommendation G.711) for specificity.
It is important to recognize that at each end the digital-to-analog converter (DAC or D/A) and analog-to-digital converter (ADC or A/D) are usually in the same integrated circuit chip or on the same circuit board and thus the same clock is used for both functions at any one end. In the event that the (digital) signal processing includes echo cancellation, it is mandatory that the same clock be used for both functions else the echo canceller will exhibit sub-par performance and there will be instances of echo leakage and other phenomena that negatively impact the quality of experience. In FIG. 1 we show a single direction of transmission solely for convenience in representation and explanation.
The rate at which packets are generated (in the encoder) is determined by the A/D clock, shown as f_A in FIG. 1. In most VoIP schemes, one packet is generated for every 80 samples from the A/D converter. That is, using the conventional sampling rate of 8 kHz (nominal), each packet represents 10 ms (ms=millisecond) of speech (there are variants that use block sizes other than 10 ms, such as 5 ms, 20 ms, 30 ms, etc., but the principles described here are applicable in all cases). The nominal word-length associated with each sample is 8 bits, following G.711 (see Ref. [3]), so the “uncompressed” signal represents a bit-rate of 64 kbps (or DS0). Compression algorithms are employed to reduce the effective bit-rate. For example, ADPCM (adaptive differential pulse code modulation) following ITU-T Recommendation G.726 (see Ref. [5]) reduces the word-length associated with each sample to 4 bits, effectively reducing the data rate to 32 kbps. ITU-T Recommendation G.727 (see Ref. [5]) describes methods for reducing the bits/sample from 8 down to 5 or 4 or 3 or 2, corresponding to bit-rates of 40, 32, 24, and 16 kbps, respectively. More sophisticated schemes, such as those described in ITU-T Recommendation G.723 and G.729 (see Ref. [5]), are even more effective in reducing the bit-rate. The notion of a “10-ms-packet” is the collection of information produced by the coder that permits the decoder at the far end to synthesize a 10-ms block of speech. Depending on the coding algorithm it is possible that information from previous packets is necessary as well. At the receiving end the decoder recreates the appropriate digital signal (DS0) for conversion back into analog format. The D/A clock is shown as f_D in FIG. 1.
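The arithmetic relating word-length, bit-rate and packet size in the preceding paragraph may be checked with a short sketch. The names bit_rate_kbps and samples_per_packet are illustrative assumptions; only the nominal 8 kHz sampling rate and the word-lengths are taken from the description.

```python
SAMPLE_RATE_HZ = 8000  # nominal telephony sampling rate


def bit_rate_kbps(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit-rate of a sample stream, in kbps, for a given word-length."""
    return bits_per_sample * sample_rate_hz / 1000


def samples_per_packet(packet_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples represented by one packet of the given duration."""
    return packet_ms * sample_rate_hz // 1000


# Word-lengths of 8 bits (G.711) and 5, 4, 3 and 2 bits (embedded ADPCM).
rates = [bit_rate_kbps(bits) for bits in (8, 5, 4, 3, 2)]
packet_sizes = [samples_per_packet(ms) for ms in (5, 10, 20, 30)]
```

Here rates evaluates to 64, 40, 32, 24 and 16 kbps, and a 10-ms packet corresponds to 80 samples.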
If the frequencies of the A/D clock (f_A) and the D/A clock (f_D) are not equal, then slips will occur. The notion of a slip is simple. If f_A > f_D then the DAC will experience a surfeit of samples; if f_A < f_D then the DAC will experience a shortage of samples. Rate-adaptation then requires that samples be deleted or inserted. In the circuit-switched architecture of the legacy PSTN, every transmission boundary element is required to extract DS0s from an incoming digital signal (typically a DS1) and reinsert the information into an outgoing digital signal (typically a DS1) that may, potentially, have a different time-base. Therefore slip buffers are very common. To minimize the occurrence of slips, the circuit-switched network is well synchronized and this approach to network synchronization has the derivative benefit that the clock offset between the end points is minimized. In an NGN, where asynchronous transport is employed, there is no guarantee that the clock offset between the end points is negligible.
This phenomenon is not necessarily catastrophic, but the DAC would have to either insert or delete a sample to account for the difference in sampling rates. This insertion or deletion of a block of information, such as a sample, is referred to as a slip. Note that a slip is the result of the difference in sampling rates and is independent of the word length associated with the quantization and compression. The degradation of perceptual quality caused by slips is in addition to any degradation caused by other factors. In conventional circuit-switched telephony, the unit of information inserted or deleted is one sample (or octet). Considering the nominal sampling rate is 8 kHz (one sample every 125 μs), a slip occurs when the accumulated phase difference, expressed in time units, caused by the aforementioned frequency difference, crosses 125 μs. In a packetized scenario, the unit could be as large as a block of speech, typically of duration 20 ms, and thus slips would have an impact similar to packet loss. Note that 20-ms slips occur much less frequently than 125-μs slips but have a greater impact each time they occur. The thrust of the current invention is to get the benefits of single-octet (single-sample) slips in a packet environment. Furthermore, the thrust of the current invention is to get the benefits of a single-octet slip in low-cost implementations such as in customer-premises-equipment (CPE) integrated-access-devices (IADs) and residential gateways.
In the following table we provide the slip rate assuming that the D/A conversion clock uses a free-running oscillator and that the A/D clock is accurate (relative to a Primary Reference Source). Also provided is the typical technology used for that accuracy and a budgetary estimate (order of magnitude) of the cost of the oscillator. The last two columns provide an approximate time between slip occurrences for different block sizes. In generating this table it was assumed that the transmission link between the A/D and D/A is equivalent to a “null” link that adds no impairments such as excessive time-delay variation or transmission errors. The intent is to lay the baseline for the minimum impairment that is introduced by the lack of synchronization between the end-points.
With regard to Table 1 as shown below, the terminology used includes: XO: Crystal Oscillator, TCXO: Temperature-Compensated Crystal Oscillator and OCXO: Oven-Controlled Crystal Oscillator.

TABLE 1

Relationship between frequency offset and interval
between buffer overflow/underflow events

Accuracy	Technology	Cost	125-μs slip	10-ms slip
1 × 10⁻¹⁰	Rubidium	~$1000	1.25 × 10⁶ sec. (14.5 days)	1 × 10⁸ sec. (3.2 years)
50 × 10⁻⁹ (50 ppb)	Hi-Quality OCXO	~$500	2.5 × 10³ sec. (41.7 min)	2 × 10⁵ sec. (2.3 days)
5 × 10⁻⁶ (5 ppm)	OCXO	~$50	25 sec.	2 × 10³ sec. (33.3 min)
50 × 10⁻⁶ (50 ppm)	TCXO	~$10	2.5 sec.	200 sec.
1 × 10⁻³ (0.1%)	XO	~$1	0.125 sec. (8 per sec.)	10 sec.
1 × 10⁻² (1%)	XO	~$0.1	12.5 msec. (80 per sec.)	1 sec.

It should be noted that in carrier-grade equipment such as that used in large telecom service provider networks, the higher quality clock sources (oscillators) are appropriate. For customer-premise equipment, including cases where the application runs on a personal computer, the quality of the oscillator is likely to be of the XO or, at best, TCXO class.
The perceptual degradation in quality caused by slips is very subjective. The impact of an isolated slip in conventional telephony using uncompressed signals (G.711) is typically a “click” that could well be imperceptible, especially if it occurs during a silent interval. However, the perceived quality degrades rapidly as the slip-rate increases. The various digital switches in the PSTN are all provided a PRS (Primary Reference Source) traceable reference and thus have an absolute accuracy of better than 1×10⁻¹¹. A call traversing two distinct timing domains may experience slips corresponding to a worst-case frequency difference of 2×10⁻¹¹. Considering that this equates to one slip every 72 days, we can, for all practical purposes, ignore the phenomenon of slips in the traditional circuit-switched network. In VoIP applications, the end points are quite cost sensitive and therefore it is likely that the quality of oscillator deployed will be represented by one of the last three rows of Table 1 and clearly slips may play an important role in determining the quality of experience (or lack thereof).
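The slip intervals discussed above follow from dividing the duration of the slipped block by the fractional frequency offset between the two clocks. A minimal sketch of that calculation is given below; the function name slip_interval_s is an assumption made for illustration.

```python
SAMPLE_PERIOD_S = 125e-6  # one sample period at the nominal 8 kHz rate


def slip_interval_s(block_s, fractional_offset):
    """Seconds for the accumulated time error to reach one block duration."""
    if fractional_offset <= 0:
        raise ValueError("fractional_offset must be positive")
    return block_s / fractional_offset


# 5 ppm offset: single-sample slips and 10-ms block slips.
ocxo_sample_slip_s = slip_interval_s(SAMPLE_PERIOD_S, 5e-6)
ocxo_block_slip_s = slip_interval_s(10e-3, 5e-6)

# Two PRS-traceable timing domains, worst-case difference of 2 parts in 10^11.
prs_slip_days = slip_interval_s(SAMPLE_PERIOD_S, 2e-11) / 86400
```

With these inputs the 5 ppm case gives roughly 25 seconds between single-sample slips and roughly 2000 seconds between 10-ms slips, and the PRS-traceable case gives roughly 72 days between single-sample slips.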
Most studies for evaluating the perceptual quality of compressed voice are done in a controlled environment and consider only a single compression/expansion. Additional study is required to assess the impact of tandem connections wherein there may be multiple conversions of format. Furthermore, the impact of an isolated slip may have a different perceptual effect on synthetic speech, such as that inherent in CELP (Code Excited Linear Prediction) methods for compression, such as G.729 (see Ref. [5]). However, it is quite well accepted that the controlled slip method, where one sample (octet) is deleted/inserted in an “uncompressed” stream, works very well provided that slips do not manifest themselves too often.
If the size of the buffer is large, then the relative frequency of occurrence of buffer overflow/underflow events will be small. However, large buffers imply the introduction of delay and the decrease in quality of experience. Nevertheless, even with large buffers deployed to mitigate the occurrence of buffer overflow/underflow, there are other impairments that arise because of a difference in clock between the end-points. Note that if there is a long-term-average difference in the clock (frequency) at the two end-points then buffer overflow/underflow will occur—the size of the buffer will just determine the interval between these catastrophic events.
The analog signal from the source enters the network and is converted into a digital signal by the analog-to-digital converter (ADC). The network acts as a pipe for these digital words (samples) that are delivered to the far-end digital-to-analog converter (DAC) for conversion back to analog. The conversion points could be in equipment, such as a customer-premise located IAD or PBX or even a Class-5 switch operated by the local telephone company. It is important to recognize that the time-base governing the A/D clock could be different from the time-base governing the D/A clock and thus there could be a difference in the sampling rates associated with these two conversions. That is, in every digital network there is the potential of encountering the pitch modification effect. The frequency difference could be small, of the order of 2 parts in 10¹¹, if the conversion clocks are traceable to a Stratum-1 source (or sources); the frequency difference could be significant, of the order of 64 parts in 10⁶ (64 parts per million or 64 ppm), if the only guarantee given is that the conversion clocks are Stratum-4 quality (Stratum-4 implies an accuracy of no worse than ±32 ppm). {The notions of clock strata and the frequency accuracy of different classes of clocks are available in Ref. [6,7].}
Clearly, if the conversion rates are different, then the DAC will experience a surfeit of samples if the ADC clock is higher than the DAC clock, or a dearth of samples if the situation is reversed. In fact, such a phenomenon could be manifested at multiple places in the network where there is a connection between two Network Elements with different clock references. Clock offsets of this type are accommodated by the use of slip-buffers. Whereas buffers are always required to compensate for accumulated jitter and wander, it is the effect of a frequency offset that is the primary focus here.
Again for simplicity, we shall assume that there is just one buffer, and that this buffer is associated with the DAC. This buffer will be of a FIFO (first-in-first-out) nature where the data is written into the buffer under control of the ADC clock and read out of the buffer under control of the DAC clock. Clearly, if there is a frequency offset between the two clocks, the buffer will, eventually, either overflow (ADC clock is higher) or underflow (DAC clock is higher). In practice the buffering method is called “double buffering” wherein there are two pages, say A and B, and while data is being written into page A, data is being read out of page B. If there is no frequency offset, then the opposite-page nature of read and write will, for the most part, be preserved. Such a buffer needs to be just big enough to accommodate any relative wander or jitter between the two clocks. It is convenient to describe the size of the buffer in terms of time. For example, if each page is “10 ms”, then each page has 80 octets, assuming a nominal sampling rate of 8 kHz and one octet per sample (e.g. G.711; see Ref. [3] or [4]). The overall buffer is then 20 ms deep, introduces a nominal delay of 10 ms and can accommodate ±10 ms of wander.
A good way of visualizing the double-buffer action is to consider a circular buffer as depicted in FIG. 2. The memory is organized in a circular manner with address calculations done Modulo-2N, where 2N is the total number of memory locations. From the viewpoint of the DS0 channel under consideration, each location holds one octet (corresponding to one octet per sample), the buffer has a “length” of (2N/8) ms, introduces a nominal delay of (N/8) ms, and can accommodate ±(N/8) ms of wander. The operation is quite simple. With each write operation the write pointer moves one location counter-clockwise and likewise the read pointer moves one location counter-clockwise with each read operation. If the relative time error between the read and write clocks is zero, then the pointers remain a fixed distance apart. A frequency offset will result in one pointer catching up to the other, resulting in an overflow/underflow. The reset position is when the pointers access diametrically opposite locations. When an overflow/underflow occurs, one pointer is forcibly moved to be diametrically opposite to the other. This action causes data corruption in the sense that N octets will be either lost or repeated. It should be emphasized that allowing large buffers to overflow/underflow results in losses of large amounts of data when such events occur and this could have a much more deleterious impact on end-user (human) quality of experience than losses of small amounts of data that may occur more frequently.
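The circular-buffer behavior described above may be sketched as a small simulation. The class name CircularSlipBuffer and its methods are assumptions made for illustration, as is the choice to always move the read pointer (rather than the write pointer) when a reset occurs.

```python
class CircularSlipBuffer:
    """Circular buffer of 2N one-octet locations, addressed modulo 2N."""

    def __init__(self, n):
        self.n = n
        self.size = 2 * n
        self.mem = [0] * self.size
        self.read_ptr = 0
        self.write_ptr = n  # reset position: diametrically opposite
        self.resets = 0     # number of overflow/underflow events so far

    def fill(self):
        """Octets written but not yet read."""
        return (self.write_ptr - self.read_ptr) % self.size

    def depth_ms(self, sample_rate_hz=8000):
        """Total buffer length expressed in time, (2N/8) ms at 8 kHz."""
        return self.size * 1000 / sample_rate_hz

    def _reset(self):
        # Force the read pointer diametrically opposite the write pointer.
        # On overflow this skips N octets; on underflow it repeats N octets.
        self.read_ptr = (self.write_ptr + self.n) % self.size
        self.resets += 1

    def write(self, octet):
        self.mem[self.write_ptr] = octet
        self.write_ptr = (self.write_ptr + 1) % self.size
        if self.write_ptr == self.read_ptr:  # writer caught the reader
            self._reset()

    def read(self):
        octet = self.mem[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.size
        if self.read_ptr == self.write_ptr:  # reader caught the writer
            self._reset()
        return octet


buf = CircularSlipBuffer(80)  # two 10-ms pages at 8 kHz, one octet per sample
```

With N = 80 the buffer is 20 ms deep and starts with a fill of 80 octets (a nominal delay of 10 ms). Writing 80 octets without any intervening read triggers one reset, after which the fill is again N.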
One special case is when the buffer is 250 μs deep. This is the notion of a conventional slip buffer. Considering the sampling rate is 8 kHz (125 μs period), a slip buffer has two octets and the overflow/underflow results in either the deletion of an octet or the repetition of an octet. This is called a controlled slip. A slip occurs when the relative time interval error between read and write clocks exceeds 125 μs. For example, if the relative frequency offset between the two clocks is 64 ppm, then a slip will occur approximately every 2 seconds.
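The effect of controlled slips on a sample stream may be illustrated by the following sketch, which deletes or repeats a single octet each time the accumulated time error reaches one sample period. The function name apply_controlled_slips and the convention that the offset is expressed as (f_A - f_D)/f_D are assumptions made for this example.

```python
def apply_controlled_slips(samples, fractional_offset):
    """Re-time samples for a DAC clock that differs from the ADC clock.

    fractional_offset > 0 models a surfeit of samples (octets are deleted);
    fractional_offset < 0 models a shortage (octets are repeated).
    Returns (output_samples, slip_count).
    """
    output = []
    slips = 0
    phase = 0.0  # accumulated time error, in units of one sample period
    i = 0
    while i < len(samples):
        output.append(samples[i])
        phase += fractional_offset
        if phase >= 1.0:
            # One full sample of surplus: skip (delete) the next octet.
            i += 2
            phase -= 1.0
            slips += 1
        elif phase <= -1.0:
            # One full sample of shortage: do not advance, so the same
            # octet is read again (repeated) on the next iteration.
            phase += 1.0
            slips += 1
        else:
            i += 1
    return output, slips


# Two seconds of samples at 8 kHz with a 64 ppm surplus.
out, count = apply_controlled_slips(list(range(16000)), 64e-6)
```

For the two-second, 64 ppm example a single octet is deleted, consistent with a slip approximately every 2 seconds.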
In packet-switched networks the delay through the network is not steady as is the case of circuit-switched networks. Therefore, even if the rates of the ADC and DAC are equal, the write clock may, on a short-term basis, appear to be faster (or slower) than the read clock. This requires the use of a buffer that is called a jitter buffer because the term used in the industry for variable transit delay is “jitter”.
Now suppose that the buffer is 200 ms deep. The buffer will overflow (underflow) when the relative time interval error between the two clocks exceeds 100 ms. A 64 ppm offset will thus result in overflows (underflows) approximately every 1600 seconds. Considering that a telephone call rarely lasts that long (about 26 minutes), it is clear that overflows (underflows) that are a result of a clock offset may be ignored for all practical purposes. This is one of the (incorrect) arguments advanced by proponents of IP networks for why frequency synchronization is not required: free-running clocks can supposedly support VoIP because buffer overflows and underflows can be made rare by increasing the size of the buffer.
It should be recognized that:

- if a frequency offset exists then there will be occurrences of buffer overflow/underflow.
- the relative rate of such events will be smaller for larger buffer sizes.
- the larger the buffer size the greater is the loss of data when such an event occurs.
- the larger the buffer size the larger the delay (important for human quality of experience).
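The trade-off summarized in the list above may be tabulated with a short sketch. It assumes the circular-buffer model in which the pointers are reset to diametrically opposite positions, so that each event costs half the buffer depth and the nominal delay is also half the depth; the function name buffer_tradeoff is hypothetical.

```python
def buffer_tradeoff(depth_ms, offset_ppm):
    """Delay, data lost per event, and interval between events for a buffer."""
    half_ms = depth_ms / 2  # distance between diametrically opposite pointers
    interval_s = (half_ms / 1000) / (offset_ppm * 1e-6)
    return {
        "nominal_delay_ms": half_ms,
        "data_lost_per_event_ms": half_ms,
        "interval_between_events_s": interval_s,
    }


# A two-octet slip buffer, a 20-ms double buffer and a 200-ms jitter buffer,
# all with the same 5 ppm clock offset.
rows = [buffer_tradeoff(depth, 5) for depth in (0.25, 20, 200)]
```

At 5 ppm the three buffers give events roughly every 25, 2000 and 20000 seconds, while the delay and the data lost per event grow from 0.125 ms to 10 ms to 100 ms.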

The thrust of this invention is to use multiple buffers. One buffer is similar to a traditional jitter buffer. The incoming packets are written into the jitter buffer upon arrival. Note that this write operation is tied, effectively, to the ADC clock (of the far end) with additional jitter introduced by the packet delay variation in the network. The packets are extracted (read out) from the jitter buffer by the DSP block (explained later) at a nominally uniform rate. The rate of packet extraction by the DSP block is determined by the rate of the DAC clock. The second buffer is a double buffer whose size is altered occasionally to adjust the rate at which the jitter buffer data is extracted by the DSP block.
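By way of nonlimiting illustration, one possible increment control that adjusts the number of samples read per page as a function of fill is sketched below. The function name samples_to_read, the threshold values low and high, the single-sample step and the slip_allowed flag (standing in for the condition that a slip would cause minimal subjective annoyance, such as during silence) are all assumptions of this sketch and are not taken from the figures.

```python
def samples_to_read(fill, slip_allowed, nominal=80, low=60, high=100):
    """Choose how many samples to read from the slip buffer for the next page.

    fill          -- octets currently waiting in the adaptive slip buffer
    slip_allowed  -- True when a one-sample slip is judged to be inaudible
    nominal       -- samples per page with no adjustment (80 = 10 ms at 8 kHz)
    low, high     -- hypothetical fill thresholds that trigger an adjustment
    """
    if not slip_allowed:
        return nominal          # defer any slip to a more favorable segment
    if fill > high:
        return nominal + 1      # buffer filling: consume one extra sample
    if fill < low:
        return nominal - 1      # buffer draining: consume one fewer sample
    return nominal


# Three consecutive pages: steady fill, rising fill during speech, then
# rising fill during silence (where the slip is permitted).
plan = [
    samples_to_read(80, slip_allowed=True),
    samples_to_read(105, slip_allowed=False),
    samples_to_read(105, slip_allowed=True),
]
```

In this sketch the page size changes by at most one sample at a time, so the adjustment has the character of a single-octet controlled slip rather than a block-sized discontinuity.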

2. A Simplified Model for a Next Generation Network (VoIP Environment)

A network based on packet switching and transmission can be quite complex, but the simple model depicted in FIG. 3 is sufficient to illustrate how synchronization and adaptive play-out buffers play a role. We consider an IAD (Integrated Access Device) at the customer premises as the traffic aggregator. All the various services are provided from the IAD to which all the customer equipment is connected. To allow for attachment of legacy devices such as telephones and Fax machines, the IAD will provide an FXS port to which the Fax machine (telephone) is connected. To the Fax machine (telephone), the FXS port appears, for all intents and purposes, as the line circuit of a traditional Class-5 switch. The IAD contains the codec where the conversion between analog and digital is accomplished. The information, however, is not transported as a conventional DS0 would be in a TDM (time division multiplexed) or circuit-switched scenario. The data is packetized and encapsulated in the appropriate “wrappers” for transmission over the packet network.
In terms of the important processes involved after call set-up, a simple, though accurate, view is depicted in FIG. 4. For convenience only one direction of transmission is shown. The analog signal from the source Fax machine or telephone (“srce”) is converted into digital format using an A/D converter. It is quite conventional to use a telephony codec that uses a sampling rate of 8 kHz and encodes the sample value in an octet (G.711 coding), though there are implementations described in the literature where a higher sampling rate and a higher word-length are used for improved fidelity. These samples are assembled into packets. For speech applications there may be some signal processing involved for purposes of echo cancellation and data compression; for Fax calls the samples are generally used without modification. The packets are delivered to the destination by the packet network.
Speech implementations also allow for voice activity detection (VAD) whereby intervals of silence are detected and transmission bandwidth conserved by just transmitting an indication of silence rather than (encoded) speech sample information. At the receiving end intervals of silence are synthesized using comfort noise.
Whereas packet architectures are superior to circuit-switched architectures in terms of efficiency of bandwidth utilization (because of statistical multiplexing), they have some drawbacks, comparatively speaking. Packet architectures tend to increase latency (average delay) and introduce time delay variations. In order to accommodate time delay variations, jitter buffers are required. That is, buffers of an “elastic” nature are used to account for the burstiness of the packet arrival pattern. In order to avoid loss of data the depth of these buffers must be large enough to span the peak-to-peak time delay variation over the network. Put another way, the size (depth) of the jitter buffer determines the peak-to-peak time delay variation that is allowed for the network and a variation greater than this maximum value will result in packets being lost or used incorrectly.
If the jitter buffer is too small, time delay variation can be the primary cause of packet loss. For normal voice (speech) calls, packet loss concealment (“PLC”) algorithms are available to mitigate the impact of lost packets. However, it should be emphasized that the mitigation of the deleterious impact does not mean that the problem is eliminated. In Ref. [8] a general picture of the impact of packet loss on Quality of Experience is provided. One way to reduce packet loss is to increase the size of the jitter buffer. However, this approach, too, has its drawbacks since the increase in delay caused by increasing the depth of the jitter buffer has a negative impact on the Quality of Experience for voice calls for several reasons (see Ref. [8]). Consequently most prior art VoIP implementations utilize what is referred to as an adaptive jitter buffer: algorithms have been developed to make the jitter buffer size dynamic, the intent being to keep the buffer just large enough such that the loss of packets due to time delay variation is within an acceptable limit, which the ITU-T Recommendations specify as 0.05%. However, adaptive jitter buffer operation in the prior art has a major problem because the proponents of VoIP and adaptive jitter buffers have ignored the effects of lack of clock synchronization.
With the jitter buffer set at its “optimum” size, and provided adequate traffic engineering is in place to give the real-time services (such as VoIP) the appropriate priority, it is assumed that time delay variation will not cause packet loss except in situations of high traffic congestion. However, the frequency offset between source and destination has two deleterious effects. One is the pitch modification effect that has been described elsewhere (see Ref. [12], for example) and while important, is not the thrust of this invention. The other is a “buffer shrink” effect. If the DAC clock is faster than the ADC clock, the jitter buffer will empty faster than it is being filled. Suppose for example the buffer size is 200 ms. Then, whereas at the start of the call a 200 ms buffer will, theoretically, allow a ±100 ms time delay variation, the emptying of the buffer will affect the lower threshold. Similarly, if the ADC clock is faster than the DAC clock, the buffer will fill faster than it is being emptied and this will affect the upper threshold. For example, a frequency difference of 50 ppm will cause a threshold reduction (either the upper or the lower) of 50 μs every second or 1 ms every 20 seconds. Therefore, whereas the probability of losing a packet due to time delay variation may have been small to nonexistent at the start of the call, the probability increases with the duration of the call and, for calls of long duration, could become appreciable.
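The arithmetic of the “buffer shrink” effect can be checked with a short sketch. The function names below are illustrative and not part of the disclosure, and a constant frequency offset over the call is assumed.

```python
# Illustrative sketch of the "buffer shrink" arithmetic.
# Function names are hypothetical; a constant clock offset is assumed.


def threshold_drift_ms(offset_ppm: float, elapsed_s: float) -> float:
    """Jitter-buffer margin (in ms) consumed by a constant clock offset."""
    return offset_ppm * 1e-6 * elapsed_s * 1000.0


def seconds_until_margin_gone(offset_ppm: float, margin_ms: float) -> float:
    """Call duration after which a one-sided margin is fully consumed."""
    return margin_ms / threshold_drift_ms(offset_ppm, 1.0)


if __name__ == "__main__":
    print(threshold_drift_ms(50, 1))            # drift after 1 s at 50 ppm
    print(threshold_drift_ms(50, 20))           # drift after 20 s at 50 ppm
    print(seconds_until_margin_gone(50, 100))   # one-sided 100 ms margin
```

Under these assumptions a 50 ppm offset consumes about 0.05 ms of margin per second, so the one-sided 100 ms margin of a 200 ms buffer would be used up after roughly 2000 seconds of call time.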
For voice calls there have been several methods described in the literature to handle such problems. The notion of an adaptive jitter buffer is to modify the size of the jitter buffer to match the existing time-delay variation condition being experienced. Silence-stretching and silence-compressing algorithms have been proposed to delete or expand sections (sub-intervals) of silence. Packet loss concealment algorithms have been developed to insert or delete sections of “non-silence” in such a manner as to reduce (subjectively) any annoying effects of packet loss. The interested reader is pointed to Ref. [9,10] for further information on these methods.
In the context of this invention, silence-manipulation and packet loss concealment will be designated as extreme measures. Such measures are necessary because the general behaviour of IP networks is such that packets will be lost in the network for a variety of reasons, including excessive time-delay variation that could lead to jitter buffer overflow or underflow. In the context of this invention, the block labeled “Depacketization, Jitter Buffer, and Signal Processing” in FIG. 4 will be, logically, split into multiple entities:
a. Depacketization. The packets received from the IP network are processed and the information content required for synthesis of the speech signal extracted. As part of the depacketization process, the protocol wrappers are examined to detect whether a packet was lost in the network. If a packet is detected as “lost”, then the packet loss concealment algorithm must be invoked. The current invention does not relate in particular to depacketization algorithms and implementations and most methods prevalent in the state-of-the-art can be employed. Packets contain both time-stamps and sequence numbers (also called frame numbers) and between these two it is straightforward to decide whether there was a missing packet or whether the apparent missing packet was actually a “no_transmission” corresponding to a silence packet. Basically the block labeled “Extract Frames” in FIG. 4 extracts the (encoded) speech frames from the packets. Note that each IP packet may contain more than one speech frame. That is, each IP packet may contain the information for multiple (1 or more) blocks of speech. For example, if the block size used is 10 ms, the IP packet may contain 20 ms or two blocks worth of speech information in encoded form. For convenience, the unit of storage in the jitter buffer (see below) is the speech frame since this is the most convenient and useful unit of storage and can be either in the form of encoded speech or even decoded speech (see the notion of processing, below).
b. Jitter Buffer. The jitter buffer in prior art VoIP decoders comprised a first-in first-out (FIFO) buffer that was large enough to accommodate the time delay variation encountered by packets as they traverse the IP network from source (encoder/packetization) to the destination decoder. In one possible first implementation, the incoming packets are written in as they arrive and read out by the signal processing entity at the play-out rate. That is, the jitter buffer contains the actual received packets with, possibly, the protocol wrappers removed. In a second possible implementation, the incoming packets are treated by the signal processing entity as they arrive and the synthesized speech samples written into the FIFO. In this second implementation the FIFO contains actual speech samples destined for the DAC and is emptied based on the clock of the DAC. The invention disclosed herein applies to both modes of operation. The reason for the first mode of operation is that the jitter buffer module includes the logic required to handle missing packets as well as “silence” when there are really no packets available and the missing packets are synthesized as “silence” based on other information such as time-stamps available in the packets. Specifically, if the sequence numbers of consecutive packets are in correct sequence but the time-stamps indicate a time gap greater than the unit (frames or packets) then it is deemed that there were silent frames/packets between the two in-correct-order-sequence-number packets. In the second mode of operation there must be logic to determine silence packets. The invention described here is applicable to both implementations though, for specificity, the first implementation scheme is assumed.
c. Signal Processing. The information extracted from the received packet is processed with the appropriate algorithms to generate the speech segment. This includes the codec function, echo treatment (if any), comfort noise generation to synthesize silence, and packet loss concealment. The current invention does not relate in particular to the signal processing algorithms and implementation and just about any methods prevalent in the state-of-the-art can be employed.
There is one additional (though optional) requirement on the signal processing implementation arising from the current invention. That is, a flag is associated with each sample (octet) of speech signal recreated/synthesized. This flag is asserted (“true”) if the speech sample generated was part of a silence segment or a segment of signal artificially created via the packet loss concealment algorithm or had some particular characteristic as will be described later. The intent of this flag is to indicate that the sample is “actionable”, meaning that deleting (or repeating) the sample as part of the adaptive slip double buffer that is the crux of the invention disclosed herein will cause minimal subjective annoyance. If the signal processing entity is incapable of providing such a flag for any reason, then the play-out buffer will, in essence, ignore the flag and assume that all samples are “actionable”.
The notion of “actionable” is that the frame of speech is either representative of silence or is representative of a synthetic frame of speech used for packet loss concealment. In the case where the speech is compressed, the nominal short-term power of the speech is computed by the encoding function (at the analog-to-digital converter side) and communicated to the decoding side (the digital to analog converter side). In the case where there is no compression, the decoding side must compute the short-term power of the signal and invoke suitable algorithms to determine whether the current decoded speech is part of a silence interval. Implementing slips introduces degradation but the degradation is much less consequential if the slip is invoked during periods of silence.
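One way the decoding side might derive the “actionable” flag for uncompressed speech is sketched below. This is only a sketch under stated assumptions: the samples are taken to be linear PCM values, the silence decision is a plain power threshold, and the threshold value and all names are hypothetical rather than taken from the disclosure.

```python
# Hypothetical sketch: derive the "actionable" flag for one decoded frame.
# Assumes linear PCM samples; the power threshold is an arbitrary placeholder.


def short_term_power(frame):
    """Mean squared sample value over one frame."""
    return sum(s * s for s in frame) / len(frame)


def is_actionable(frame, concealed, silence_power=100.0):
    """True if the frame is synthetic (concealment) or judged to be silence."""
    return concealed or short_term_power(frame) < silence_power


if __name__ == "__main__":
    quiet = [0] * 80
    loud = [1000] * 80
    print(is_actionable(quiet, concealed=False))
    print(is_actionable(loud, concealed=False))
    print(is_actionable(loud, concealed=True))
```

A practical silence detector would likely be more elaborate; the point of the sketch is only that the flag combines a silence decision with the knowledge that a frame was synthesized for concealment.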

3. The Adaptive Play-Out Buffer

The invention disclosed here deals with an adaptive play-out buffer that is also called an adaptive slip double buffer. This is described below by considering the fundamentals of prior-art and the extensions that comprise the invention.

3.1 The Play-Out Buffer Viewed as a Retimer

The underlying principle of retiming is quite straightforward. The play-out buffer can be viewed as a retimer as described here. Fundamentally, the data (speech samples or octets) as well as a clock (“recovered clock”) are recovered from the incoming packet stream. The “recovered clock” is used to write the incoming packets into a buffer that is operated in a FIFO (“first-in-first-out”) mode. The recovered clock in this scenario is a burst mode clock corresponding to packet arrival instants. The data is read out of the buffer using, effectively, the DAC clock (the retiming function generally involves inserting the “reference” clock), and then packets read out from the FIFO can be applied to the signal processing function to generate the digital speech samples for the DAC. The function of “retiming” is illustrated in FIG. 6.
Referring to FIG. 6, a FIFO buffer 412 is coupled to a depacketization block 411. A digital signal processor 413 is coupled to the FIFO buffer 412.
In FIG. 6, the block labeled “DeP” refers to the circuitry used to implement the depacketization functions. The block labeled DSP represents the DSP functions that generate the speech samples for handing off to the digital-to-analog converter (DAC). The FIFO buffer represents the jitter buffer. The DSP block reads out of the jitter buffer based on the DAC clock. The DeP writes into the jitter buffer when packets arrive and this can be viewed as a jittered version of the encoder clock from the far end.
For illustrative purposes, the FIFO can be viewed as a “pipe” with the receive data that is written into the FIFO viewed as being pushed into the pipe. The transmit data that is read out of the FIFO is viewed as being pulled out of the pipe. The arrow designated as “fill position” indicates where the next frame/packet that must be read out is located within the pipe. The action of “write” moves the fill position to the right and each read operation moves the fill position to the left. At the beginning or “reset” situation, the fill position, arbitrarily, points to the middle of the FIFO buffer. With such an arrangement, if the size of the FIFO buffer is 2N units (typically frames), short-term frequency variations, referred to as wander, can be accommodated without loss of data. In particular, up to N unit intervals (“UI”) of time-delay variation in the packet network (2N UI, peak-to-peak) can be absorbed (1 UI is equivalent to 1 frame-time, 10 ms for a frame size of 80 samples if the underlying sample rate is 8 kHz). Needless to say, the arrangement adds transmission delay of, on the average, N UI. A FIFO of this nature can serve as a jitter buffer accommodating up to ±N UI of time-delay variation. For reference, if N is 10, up to ±100 ms of time-delay variation (wander) can be absorbed.
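The sizing relationships in the paragraph above can be written out as a small sketch. The function name and the returned field names are illustrative only.

```python
# Illustrative sizing sketch for a FIFO of 2N frames centered at reset.
# Names are hypothetical; values follow the 80-sample, 8 kHz example.


def fifo_budget(n_frames, samples_per_frame=80, sample_rate_hz=8000):
    """Return frame time, FIFO size, and the delay variation it can absorb."""
    frame_ms = 1000.0 * samples_per_frame / sample_rate_hz  # 1 UI in ms
    return {
        "frame_ms": frame_ms,
        "fifo_frames": 2 * n_frames,          # total FIFO depth in frames
        "absorbed_ms": n_frames * frame_ms,   # +/- variation absorbed
        "mean_delay_ms": n_frames * frame_ms, # average added delay
    }


if __name__ == "__main__":
    print(fifo_budget(10))
```

With N set to 10 the sketch reproduces the figures in the text: a 10 ms frame time, a 20-frame FIFO, and ±100 ms of absorbed time-delay variation at the cost of about 100 ms of average added delay.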
If the (long-term) average frequencies of the write clock and read clock are different, then the buffer will either overflow or underflow. With respect to FIG. 6, the fill position will move all the way to the right if the write clock is high or all the way to the left if the write clock is low. In this situation data will be corrupted; either some data is lost (“overflow”), or some “garbage” data must be inserted (“underflow”). In a generic retiming application, the appropriate way to handle such frequency offsets is to force the fill position to the center (the equivalent of “reset”) whenever the fill position rails at either extreme. In such a situation, either N frames are discarded (“lost”) or N frames are repeated (“garbage”). In a VoIP scenario, where the signal processing entity is capable of packet loss concealment, the advent of underflow can be anticipated and instead of “garbage”, speech segments can be synthesized that have much less subjective annoyance. Likewise, the advent of overflow can be detected and packet loss concealment methods applied to “delete” packets in a manner that is not arbitrary but introduces less impairment from a subjective standpoint.
One key element of the disclosed invention is the anticipation of overflow/underflow events. This will be described shortly.
Another key element of the disclosed invention is the manner in which the clock used by the DSP to read frames out of the jitter buffer is derived from the DAC and adjusted to minimize the impact of clock offset between the local DAC and the far-end ADC. This is described next.

3.2 Double Buffer Arrangement for Delivering the Samples to the DAC

The arrangement for delivering samples to the digital-to-analog converter generally involves a double buffer arrangement. The reason for this buffering is that the actual conversion is done on a sample by sample basis using a “continuous” clock. The DSP unit will usually generate the samples as a block of samples. Thus while the DSP unit generates the correct number of samples per unit time on the average, it generates the samples in bursts.
The most common arrangement for implementing the double-buffer function involves the use of two buffers of equal size, say N octets, and referred to as “Page-A” and “Page-B”. One of the sides (we shall assume the “write” side for specificity and ease of explanation) accesses the buffer(s) sequentially. That is, the write operation first fills buffer Page-A, moves to buffer Page-B, fills it, and returns to filling buffer Page-A. The read operation empties the buffers. Under “normal” conditions, the read side is accessing buffer Page-B while the write side is accessing buffer Page-A, and vice-versa. If the average (long-term) frequencies of the read and write operations are equal, then the accesses will, substantially, remain in opposite buffers. This arrangement is sometimes referred to as a linear buffer arrangement to distinguish it from a circular buffer arrangement. The advantage of a linear buffer arrangement is that the memory allocation for the buffer can be slightly more than the actual page size.
In FIG. 7 a simplified depiction of a double buffer arrangement for implementing the interface to the DAC is shown. A first buffer 421 (Page-A) is coupled to a second buffer 422 (Page-B). The actual DAC converts samples that are read out of the appropriate page. The two buffers are often referred to as Page-A and Page-B. The trajectory of the write pointer (“WP”) (the address to which the next write operation will pertain) is shown. In particular, after filling Page-A, the pointer moves to the bottom of Page-B and commences filling Page-B. The trajectory of the read pointer (“RP”) follows the same principle and is implied. At the beginning (or “reset”), the WP and RP point to different pages. It is especially pertinent to make the page size equal to one frame. For example, in implementations using a 10 ms frame with an 8 kHz sampling rate the frame size is nominally 80 samples. Also, in this situation the DSP writes into the buffer the entire frame (nominally 80 samples) almost instantaneously. That is, it computes the appropriate sample values and fills the buffer in one “write statement”. The pseudo code for this operation will appear as:


	Get initial value for write_pointer (establish whether it is page A or page B)
	For l = 0, 1, 2, ..., N1
	{
		Write X[l] into address defined by write_pointer   [write instruction “W1”]
		Increment write_pointer
	}
	Write X[N1] into address defined by write_pointer   [write instruction “W2”]
	Switch page designation for next block (from page A to page B and vice versa)

In the above block of code N1 is 79 if the block size is 80 since the range of the index l starts at 0. It is assumed that the DSP has computed the requisite sample values and these are available in the array {X[j]; j=0, 1, 2, . . . , N1}. At the start write_pointer identifies the memory address of the first element in the appropriate page (page A or page B). The instruction following the loop (write instruction “W2”) is important. What this achieves is the replication of sample value X[N1] into page-A/B location N1 as well as (N1+1). Thus for the case of an 80-sample frame, the same value is placed in the 80th as well as the 81st location of the buffer. Note that this approach is suitable for a linear buffer arrangement; slight modifications are required for circular buffer operation.
Note that the speed of the write operation is determined by the speed at which the DSP operates and not by the rate of the DAC clock. Generally speaking the machine-cycle time of the DSP will be very small and the entire process of writing 81 samples will be a very small fraction of the 10 ms frame duration.
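The write-side pseudo code above can be rendered as a runnable sketch. The class and method names are hypothetical, and Python lists stand in for the two memory pages; each page is given one slot more than the frame size so that the replicated last sample has somewhere to go.

```python
# Hypothetical sketch of the write side of the linear double buffer.
# Each page holds frame_size + 1 slots; the extra slot repeats the last sample.


class DoubleBuffer:
    def __init__(self, frame_size=80):
        self.frame_size = frame_size
        self.pages = {
            "A": [0] * (frame_size + 1),
            "B": [0] * (frame_size + 1),
        }
        self.write_page = "A"  # page that the next frame will be written to

    def write_frame(self, x):
        """Write one frame X[0..N1], replicate X[N1], then switch pages."""
        if len(x) != self.frame_size:
            raise ValueError("frame length must equal frame_size")
        page = self.pages[self.write_page]
        n1 = self.frame_size - 1
        write_pointer = 0
        for l in range(n1 + 1):
            page[write_pointer] = x[l]      # write instruction W1
            write_pointer += 1
        page[write_pointer] = x[n1]         # write instruction W2
        filled = self.write_page
        self.write_page = "B" if filled == "A" else "A"
        return filled


if __name__ == "__main__":
    buf = DoubleBuffer()
    print(buf.write_frame(list(range(80))))   # page that was filled
    print(buf.pages["A"][79], buf.pages["A"][80])
```

In the usage shown, the 80th and 81st locations of the filled page hold the same value, which is what later allows the DAC to repeat a sample without reading invalid data.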
In common implementations in customer-premises equipment such as the Integrated Access Device (IAD), the DAC clock is locally generated and may or may not be locked to a network reference. That is, it may be derived from a free-running oscillator. In either case it is not controlled by the DSP module that is reading out from the jitter buffer because implementing a clock synchronization method based on jitter-buffer fill (also referred to as adaptive clock recovery) requires expensive oscillators to smooth out the jitter introduced by packet network that can be quite large (see Ref. [11] for example).
A key aspect of the invention is to allow the DAC clock to run asynchronously with respect to the far-end ADC clock and yet account for the frequency offset using a slip mechanism that is based on single-sample slips while simulating clock synchronization as applied to the jitter buffer read/write. That is, the intent of this simulated synchronization is to avoid the “buffer shrink” effect. Keeping the data corruption due to a slip small (one sample) minimizes the deleterious effect on end-user quality of experience.
The typical manner in which the “read clock” (the “DAC-derived clock” in FIG. 6) for the DSP unit is generated is based on the premise that the DAC unit will provide a marker (generally implemented using the notion of a “software interrupt”) every 80 samples. In most implementations the DAC will empty one page, say Page-A, and provide a software interrupt signal to initiate the DSP unit operation so that the DSP will read one frame's worth of information from the jitter buffer and complete the signal processing required to fill Page-A while the DAC is reading from Page-B. The operation of the DAC unit will involve a counter that starts from 0 for each page and is incremented by 1 when a sample is extracted from that page. When the counter reaches a “final value” the page is deemed to have been emptied and the DAC unit flags the DSP unit that one frame interval has transpired. If this “final value” is 80 then the frame interval is 10 ms (assuming that the sampling rate is nominally 8 kHz) according to the DAC clock. One key aspect of the invention is to allow the controlling entity to adjust the “final value”. Thus if the final value is set at 79, the DAC will interrupt the DSP unit in less than 10 ms (10 ms minus 125 μs) and if the final value is set at 81, the DAC will interrupt the DSP unit in more than 10 ms (10 ms plus 125 μs) where these time intervals are based on the DAC clock.
That is, the method of changing the “final value” provides the means to either shorten or lengthen the apparent frame interval corresponding to an apparent increase or decrease of the apparent DAC clock frequency from the viewpoint of the read clock.
Some important points associated with this method:
a. If the final value (N1) is set at 78 then the DAC will extract only 79 out of the 80 valid samples from the page. That is, effectively we have deleted one sample.
b. If the final value (N1) is set at 80 then the DAC will attempt to extract 81 samples from the page though there are only 80 valid samples. To ensure that this is done in a reasonable manner, the buffer size should be 81 and when the DSP writes 80 samples into the buffer it repeats the last sample to get the 81st sample. That is, samples 80 and 81 are the same. Consequently the DAC is repeating one sample.
c. The controlling entity should change the final value occasionally, and only when necessary. At all other times it should be left at the nominal value of 80 (N1 set to 79). Note that the example cited above assumed a frame size of 10 ms and a sampling rate of 8 kHz. The same technique is applicable for different frame sizes and different sampling rates though the specific values such as 79, 80, and 81 for the “final value” will depend on the sampling rate and chosen frame size.
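The effect of the adjustable final value on the read side can be sketched as follows. The sketch uses the index convention of items a to c above, in which N1 is the last index read (nominally 79 for an 80-sample frame); the function names are hypothetical.

```python
# Hypothetical sketch of the DAC read side with an adjustable final value.
# final_index follows the N1 convention: the DAC reads indices 0..final_index.


def dac_read_page(page, final_index):
    """Samples the DAC plays out of one page for a given final value."""
    return page[: final_index + 1]


def frame_interval_ms(final_index, sample_rate_hz=8000):
    """Apparent frame interval, measured by the DAC clock."""
    return 1000.0 * (final_index + 1) / sample_rate_hz


if __name__ == "__main__":
    # An 81-slot page: 80 valid samples plus the replicated last sample.
    page = list(range(80)) + [79]
    for n1 in (78, 79, 80):
        print(n1, len(dac_read_page(page, n1)), frame_interval_ms(n1))
```

With N1 set to 78 one sample is dropped and the interval shortens by 125 μs; with N1 set to 80 the last sample is played twice and the interval lengthens by 125 μs.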

3.3 Circular Buffer Implementation for the Packet Jitter Buffer

The overall adaptive jitter double buffer arrangement can be viewed as a combination of the linear double buffer between the DSP block and the DAC and a “traditional” jitter buffer that stores packets between the depacketization block and the DSP block (as depicted by the FIFO in FIG. 6). The FIFO is advantageously implemented as a circular buffer.
A simplified view of the circular buffer arrangement is depicted in FIG. 8. A buffer 432 is coupled to a write address generator 431. A read address generator 433 is coupled to the buffer 432. A page control block 434 is coupled between the write address generator 431 and the read address generator 433. A control signals block 435 is coupled to the read address generator 433. The data written into the buffer comprises the packets extracted from the IP stream by the depacketization block. The size of the circular buffer is 2N “locations”, each “location” containing the data associated with the packet. The data read out of the buffer comprises the packet data that is used by the DSP to extract the speech samples. As mentioned before, based on each read access the DSP block gets enough information to synthesize one block/frame worth of samples that will be fed to the DAC. In this implementation it is assumed that the nuances of the method are implemented in the “Read Add. Gen.” block and thus the “Write Add. Gen.” block where the write address (“WR_ADD”) is generated can be quite simple. The block labeled “Δ” generates the difference between the read and write addresses [“RD_ADD”-“WR_ADD”] where the B-bit numbers are interpreted as 2's-complement represented integers. The block labeled “Control Signals” represents the circuitry implementing the logic associated with the control signals required by the “Read Add. Gen.” block. The functions associated with the various blocks are elaborated upon next. These functions have a direct counterpart for a linear buffer arrangement.
The “Write Add. Gen.” block is quite straightforward. The starting address is provided as the initial value of the write_pointer and then for every write operation the write_pointer is incremented. Since a circular buffer operation is used, modulo-2N arithmetic provides the wrap-around feature. When the write instruction is asserted (see write instruction W1 in pseudo-code; this applies for the jitter buffer as well), the input data is written into the buffer in the location pointed to by the counter contents, “WR_ADD”, and the write_pointer incremented by one. In the case of a linear buffer arrangement software instructions are needed to determine the suitable memory address of the start of the page.
The “page ctrl” block represents a function that monitors whether the read operation and the write operation are happening in the same “location”. If so then the buffer has overflowed/underflowed and the correct action is to forcibly move one or the other side to the opposite part of the circular buffer. This is achieved by adding “N” (modulo-2N) to the write_address or to the read_address (depending on which is to be forcibly moved to the other page). Minor modifications are required in the case of a linear buffer arrangement.
The block labeled “Δ” generates the difference [“RD_ADD”-“WR_ADD”]=Δn. This difference is done modulo-2N; when the memory addresses are at diametrically opposite parts of the circular buffer the difference will be N; when the addresses are close to each other the difference is small in magnitude; when they coincide the difference is zero. Considering the circular nature of the buffer, defining which is “ahead” is somewhat moot. For our purposes, if Δn is positive the write pointer is “catching up” to the read pointer; if Δn is negative the read pointer is catching up to the write pointer.
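A minimal sketch of the “Δ” and “page ctrl” computations follows. It assumes the 2N locations are addressed 0 to 2N−1 and that the signed interpretation maps the difference into the range −N to N−1, so that diametrically opposite pointers give a difference of magnitude N. The function names are hypothetical.

```python
# Hypothetical sketch of the "Δ" block and the page-control re-centering.
# Addresses run 0..2N-1; the signed difference lies in the range -N..N-1.


def delta_n(rd_add, wr_add, n):
    """Signed modulo-2N difference RD_ADD - WR_ADD."""
    return ((rd_add - wr_add + n) % (2 * n)) - n


def opposite_address(addr, n):
    """Address diametrically opposite addr in a 2N-location circular buffer."""
    return (addr + n) % (2 * n)


if __name__ == "__main__":
    n = 8
    print(delta_n(3, 3, n))          # coincident pointers
    print(abs(delta_n(0, 8, n)))     # diametrically opposite pointers
    print(delta_n(1, 15, n))         # difference taken across the wrap-around
    print(opposite_address(12, n))
```

Only the magnitude is meaningful at the diametrically opposite point, since +N and −N denote the same location in modulo-2N arithmetic.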
Assigning appropriate actions based on the value of Δn is a key aspect of the invention.
To this end, three “threshold values”, T₃>T₂>T₁, are predetermined. Suitable choices for these thresholds and the underlying rationale are provided later. Comparison of Δn with these determines the “state” of the adaptive play-out buffer; the state then determines the appropriate action.
a. If |Δn−N|≦T₁, the state is “green”. The implication of the “green” state is that the read and write pointers are far apart and no special action is taken. Note that the furthest they can be apart is, essentially, N, implying that the read and write operations are occurring in diametrically opposite parts of the circular buffer. The “increment” applied to the read address pointer (discussed shortly) is unity implying the read function operates in a normal manner.
b. If T₂>|Δn−N|≧T₁, the state is “yellow”. The implication of the “yellow” state is that the read and write pointers are possibly coming closer and some action is required. This takes the form of a controlled slip provided some other conditions are met. A controlled slip involves repeating or deleting one signal sample by changing the final_value in the linear double-buffer arrangement between the DSP and the DAC.
This is achieved by modifying the final_value to (N1+1) as described earlier. As described before, this implies that we are essentially repeating a sample. This is done if Δn is negative (read catching up with write). What this accomplishes is artificially increasing the duration of a “frame” from the viewpoint of accessing the jitter buffer, slowing down the rate at which the read is catching up with the write.
Making the final_value equal to (N1−1) means the read address reads one less location from the page, essentially deleting a sample. This is done if Δn is positive (write catching up with read). What this accomplishes is artificially decreasing the duration of a “frame” from the viewpoint of accessing the jitter buffer, slowing down the rate at which the write is catching up with the read.
The aforementioned conditions for allowing a slip operation to take place are the following:
1) The flag associated with the current read data should be true. The flag will be set true by the signal processing block if the sample is part of an “actionable” signal segment.
2) The timer has expired. The timer is essentially a counter that is reset (to zero) when a slip event (repetition/deletion) has occurred. The timer counter is incremented by the DAC clock and saturates at a (pre-determined) maximum value. Until it reaches this maximum count, slip events are inhibited. The intent is to ensure that slip events are not allowed to occur too close together.
c. If T₃>|Δn−N|≧T₂, the state is “orange”. The implication of the “orange” state is that the read and write pointers are very likely coming closer and some action is definitely required. This takes the form of a controlled slip provided some other conditions are met. This is similar to the yellow state with relaxed conditions. In particular, the flag is ignored. The timer constraint is the same as for the yellow state.
d. If |Δn−N|>T₃, the state is “red”. The implication of the “red” state is that the read and write pointers are very close to each other and some extreme action is required. This takes the form of a controlled slip provided the timer constraint is met (as in the orange state) as well as a request to the signal processing entity that packet loss concealment must be initiated. If Δn is negative a segment of synthetic speech must be inserted; if Δn is positive a segment of speech must be deleted. In the red state we invoke not just effective change of frame duration by 1 DAC sample interval, but an entire frame in addition.
Traditional “adaptive” jitter buffers adjust the size of the jitter buffer to mitigate the occurrence of such overflow/underflow events. That is, the size of the jitter buffer is increased if the trend is seen to be towards such overflow/underflow events. Traditional adaptive algorithms for jitter buffers malfunction because they make no distinction between overflow/underflow that is the result of packet delay variation and the result of a clock offset. The slip function implemented in this algorithm addresses the clock offset issue and therefore if overflow/underflow does occur it is because the jitter buffer is not large enough to accommodate the packet delay variation in the network. Consequently the invention disclosed here will improve and enhance the behavior of conventional adaptive jitter buffer algorithms.
e. If Δn=0, the state is “catastrophic” implying that the write pointer and read pointer are coincident. This requires very drastic action, which is achieved by re-centering the jitter buffer. That is, the read pointer is “reset” to be diametrically opposite to the write pointer. N packets will be lost or repeated by this action, which is equivalent to jitter buffer overflow/underflow.
Suitable values for the thresholds are T₃=(¾)N; T₂=(½)N; T₁=(¼)N, where the size of the overall jitter buffer is 2N. If the packet loss concealment algorithm is not very sophisticated and thus should be minimally invoked, an alternate set of threshold values is T₃=(⅞)N; T₂=(¾)N; T₁=(⅛)N. These choices are well suited for efficient implementation and it is unlikely that “optimum” values for these thresholds, derived by any sophisticated means, will provide an efficacy so much greater than that of this particular set as to warrant an increase in implementation complexity. The value for N, the buffer size, depends on the expected time-delay variation. If we assume a packet size of 10 ms (80 speech samples) a “typical” time-delay variation will be ±10 ms, corresponding to ±0.5 packet duration.
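The state assignment of items a to e can be sketched as below. Two points are assumptions of the sketch rather than statements of the disclosure: the quantity compared with the thresholds is taken to be the distance of |Δn| from N (that is, N − |Δn|, with Δn in the signed range −N to N−1), and a value falling exactly on a threshold is assigned to the more urgent state except at T₁, where it is treated as “green”. The function name is hypothetical.

```python
# Hypothetical sketch of the state decision for the adaptive play-out buffer.
# Assumption: the compared quantity is d = N - |Δn|, the distance of the
# pointers from the diametrically opposite ("centered") condition.


def classify_state(delta, n, t1, t2, t3):
    """Map the signed pointer difference onto the five states."""
    if delta == 0:
        return "catastrophic"      # pointers coincide: re-center
    d = n - abs(delta)
    if d <= t1:
        return "green"             # pointers far apart: no action
    if d < t2:
        return "yellow"            # slip if flag is true and timer expired
    if d < t3:
        return "orange"            # slip if timer expired (flag ignored)
    return "red"                   # slip plus packet loss concealment


if __name__ == "__main__":
    n = 16
    t1, t2, t3 = n // 4, n // 2, 3 * n // 4   # the first suggested set
    for delta in (-16, 12, 10, -6, 2, 0):
        print(delta, classify_state(delta, n, t1, t2, t3))
```

With N equal to 16 the first suggested set gives thresholds of 4, 8, and 12 locations, which keeps every comparison an integer comparison.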
A suitable value for the timer is the closest power of 2 less than the packet size and in this case is 64. With this choice of timer, the slip events will be constrained to no more than twice per packet duration.
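By way of illustration only, the threshold and timer choices described above can be sketched in Python. The function name slip_parameters and its arguments are hypothetical and do not appear in the figures. The sketch assumes that N is expressed in packets, that the timer is expressed in samples, and that integer division is acceptable for the thresholds.

```python
def slip_parameters(n_half, packet_samples, minimal_plc=False):
    """Return (T1, T2, T3, timer) for a jitter buffer of overall size 2*n_half.

    n_half         -- the value N (assumed here to be in packets)
    packet_samples -- packet size in samples (e.g. 80 for 10 ms at 8 kHz)
    minimal_plc    -- select the alternate threshold set quoted in the text
    """
    if minimal_plc:
        # Alternate set: T3 = (7/8)N, T2 = (3/4)N, T1 = (1/8)N
        t3, t2, t1 = 7 * n_half // 8, 3 * n_half // 4, n_half // 8
    else:
        # Default set: T3 = (3/4)N, T2 = (1/2)N, T1 = (1/4)N
        t3, t2, t1 = 3 * n_half // 4, n_half // 2, n_half // 4
    # Closest power of two strictly less than the packet size.
    timer = 1 << ((packet_samples - 1).bit_length() - 1)
    return t1, t2, t3, timer


def max_slips_per_packet(packet_samples, timer):
    """Upper bound on slip events in one packet duration (ceiling division)."""
    return -(-packet_samples // timer)
```

For N=8 and an 80-sample packet, this sketch yields T₁=2, T₂=4, T₃=6 and a timer of 64 samples, which bounds the slip events at two per packet duration.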
The block labeled “Read Add. Gen.” is important since it is a key aspect of the invention. A simplified view of this block is shown in FIG. 9. A timer 447 is coupled to an increment control block 443. An increment generator 442 is coupled to the increment control 443 and generates a final_value 441. The increment generator 442 is coupled to an adder block 444, which in turn is coupled to a select block 445, which in turn is coupled to a register block 446, which in turn is coupled to a read address block 448.
The entity M-WR_ADD represents the WR_ADD modified to represent the address diametrically opposite the current location that is being written into. If Δn=0, the drastic action taken is to make the select control choose M-WR_ADD to load into the read address register (see item “e” above). The read address counter is implemented as an accumulator that is updated based on the DAC-derived clock (“Read_Clock”). Under normal operation the increment is one unit (corresponding to packet size). That is, the read operation will sequence through the jitter buffer in a normal manner. The adjustment of the “Read_Clock” interval based on the slip buffer mechanism between DSP and DAC will account for frequency offset between DAC and far-end ADC clock. If the condition is “red” (see item “d” above) then the increment is either 0 units (the packet loss concealment algorithm is invoked) or 2 units (one packet is effectively deleted).
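The accumulator and select behavior described above may be sketched as follows. This is a minimal illustration and not a description of the circuit of FIG. 9: the class and method names are hypothetical, addresses are assumed to be in units of packets, and M-WR_ADD is assumed to equal (WR_ADD + N) modulo 2N.

```python
class ReadAddressGenerator:
    """Sketch of the read address accumulator for a jitter buffer of size 2N."""

    def __init__(self, n_half):
        self.n_half = n_half          # N
        self.size = 2 * n_half        # overall jitter buffer size, 2N
        self.rd_add = 0               # contents of the read address register

    def clock(self, wr_add, increment=1, reset=False):
        """Advance on one Read_Clock event and return the new read address.

        increment -- 1 in normal operation, 0 or 2 in the red state
        reset     -- True when the state is catastrophic (select M-WR_ADD)
        """
        if reset:
            # Assumed form of M-WR_ADD: the address diametrically opposite
            # the location currently being written.
            self.rd_add = (wr_add + self.n_half) % self.size
        else:
            # Adder path: accumulate the increment, wrapping around the buffer.
            self.rd_add = (self.rd_add + increment) % self.size
        return self.rd_add
```

In this sketch an increment of 0 leaves the read address unchanged, so the same packet position is presented again, while an increment of 2 skips one packet position.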
“Final_value” is the control value for the double buffer between the DSP block and the DAC. The nominal value will be called “N” in the following. (N−1) and (N+1) are the values for Final_value that will delete or repeat a (DAC) sample, respectively.
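The effect of Final_value on one page of the double buffer can be illustrated with a short Python sketch. The function name read_page is hypothetical, and the sketch assumes, purely for illustration, that the deleted or repeated sample is the last sample of the page; the text does not specify which sample is affected.

```python
def read_page(page, final_value):
    """Return the samples delivered to the DAC for one page.

    page        -- list of samples written by the DSP for this page
    final_value -- number of samples the DAC reads from the page
    """
    n = len(page)
    if final_value <= n:
        # Nominal read, or a controlled slip that deletes trailing sample(s).
        return page[:final_value]
    # Controlled slip that repeats the last sample (assumed choice of sample).
    return page + page[-1:] * (final_value - n)
```

With a page of four samples, a Final_value of 3 delivers three samples (one deleted), and a Final_value of 5 delivers five samples (the last one repeated).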
The block labeled “Increment Control” is one aspect of the invention of the adaptive play-out buffer. The actions have been described before but are summarized here for completeness. Based on the various state conditions this block controls the generation of the increment used by the read address counter:
1. If State is catastrophic (Δn=0):
i. Assert reset (forcing read pointer to be diametrically opposite to write pointer)
ii. Reset timer. This is optional. Included for specificity.
iii. Set increment to one unit. This is optional since counter action is overridden by reset action. Set Final_value to “N”.

2. If State is red:

i. Deliver message to signal processing entity that packet loss concealment (deletion or synthesis, based on sign of Δn) is required. FIG. 9 does not show this control signal explicitly but it is implied. Set increment to 0 or 2 units.
ii. If timer has not expired, set Final_value to “N”.
iii. If timer has expired, set Final_value to (N−1) or (N+1) depending on sign of Δn and reset timer.
3. If State is orange:
i. If timer has not expired, set Final_value to “N”.
ii. If timer has expired, set Final_value to (N−1) or (N+1) depending on sign of Δn and reset timer.
4. If State is yellow:
i. If timer has not expired, or flag is false, set Final_value to “N”.
ii. If timer has expired, and flag is true, set Final_value to (N−1) or (N+1) depending on sign of Δn and reset timer.
iii. Note: If the signal processing entity does not provide the flag it is deemed to be always true.
5. If State is green:
i. Set Final_value to “N”. (Normal slip buffer operation)
Note: In states orange, yellow, and green the increment for the read address for the jitter buffer (i.e. RD_ADD in FIG. 9) is set to one unit.
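The state-dependent actions listed above can be summarized in a short Python sketch. It is offered as one possible reading and not as the implementation of FIG. 9. The function name, argument names, and return tuple are hypothetical. The text states only that the choice depends on the sign of Δn; the sketch assumes that a positive Δn means the buffer is draining (repeat a sample or conceal a packet) and a negative Δn means the buffer is filling (delete a sample or a packet).

```python
def increment_control(state, delta_n, timer_expired, flag=True, nominal=80):
    """Return (increment, final_value, reset_read_pointer, reset_timer).

    state         -- "catastrophic", "red", "orange", "yellow" or "green"
    delta_n       -- signed pointer difference (sign convention is assumed)
    timer_expired -- True if the minimum interval since the last slip elapsed
    flag          -- "actionable" flag; deemed always true if not provided
    nominal       -- nominal Final_value, called "N" in the text
    """
    if state not in ("catastrophic", "red", "orange", "yellow", "green"):
        raise ValueError("unknown state: %r" % (state,))

    draining = delta_n > 0            # assumed meaning of the sign of delta_n
    increment, final_value = 1, nominal
    reset_pointer = reset_timer = False

    if state == "catastrophic":
        reset_pointer = True          # re-center the read pointer
        reset_timer = True            # optional per the text
    elif state != "green":
        if state == "red":
            # 0 units: concealment is invoked; 2 units: one packet is deleted.
            increment = 0 if draining else 2
        # The flag is consulted only in the yellow state.
        if timer_expired and (flag or state != "yellow"):
            final_value = nominal + 1 if draining else nominal - 1
            reset_timer = True
    return increment, final_value, reset_pointer, reset_timer
```

For example, under the assumed sign convention, increment_control("orange", 2, True) returns an increment of one unit and a Final_value of 81, and requests a timer reset.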

SUMMARY

One of the problems associated with communication of real-time information over packet networks is the time-delay variation introduced. A second problem is that the transport is asynchronous and therefore the receiving end may be operating at a different timing-base from the sending end. The packetized nature of VoIP necessitates the use of a jitter buffer and, possibly, a second buffer to interface to the actual digital to analog converter (DAC). The invention described herein deals with simple and efficient methods to address the jitter buffer and clock offset issues.
Salient points of the invention are:
1) The DAC double buffer is made adaptive in the sense that controlled slips are implemented.
2) The signal-processing entity can flag samples from segments of speech that are considered “actionable”.
3) The slip action can, optionally, be inhibited if the sample affected has been flagged as “nonactionable”.
4) The controlled slip action is instantiated by monitoring the fill of the jitter buffer.
5) The jitter buffer FIFO is implemented as a circular buffer and the difference between the read and write pointers used as a measure of buffer fill.
6) A timer is used to ensure that slip events do not occur too close to each other.
7) A timer is used to ensure that the frequency control is not too rapid.
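As an illustration of point 5, the pointer difference of a circular buffer of size 2N can be computed as sketched below. The function name is hypothetical, and the sign convention and range are assumptions made for this sketch: the difference is taken as write address minus read address, wrapped into the range (−N, N], so that coincident pointers give 0 and diametrically opposite pointers give N.

```python
def pointer_difference(wr_add, rd_add, n_half):
    """Signed, wrapped difference between write and read pointers.

    Returns a value in (-n_half, n_half] for a circular buffer of size
    2*n_half. A result of 0 means the pointers are coincident.
    """
    size = 2 * n_half
    diff = (wr_add - rd_add) % size   # unsigned distance in [0, 2N)
    # Fold the upper half of the range onto negative values.
    return diff - size if diff > n_half else diff
```

Under this convention a small positive result indicates that the read pointer is catching up with the write pointer, and a small negative result indicates that the write pointer is about to overrun the read pointer.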

DEFINITIONS

The term program and/or the phrase computer program are intended to mean a sequence of instructions designed for execution on a computer system (e.g., a program and/or computer program, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer or computer system).
The term substantially is intended to mean largely but not necessarily wholly that which is specified. The term approximately is intended to mean at least close to a given value (e.g., within 10% of). The term generally is intended to mean at least approaching a given state. The term coupled is intended to mean connected, although not necessarily directly, and not necessarily mechanically. The term proximate, as used herein, is intended to mean close, near adjacent and/or coincident; and includes spatial situations where specified functions and/or results (if any) can be carried out and/or achieved. The term distal, as used herein, is intended to mean far, away, spaced apart from and/or non-coincident, and includes spatial situations where specified functions and/or results (if any) can be carried out and/or achieved. The term deploying is intended to mean designing, building, shipping, installing and/or operating.
The terms first or one, and the phrases at least a first or at least one, are intended to mean the singular or the plural unless it is clear from the intrinsic text of this document that it is meant otherwise. The terms second or another, and the phrases at least a second or at least another, are intended to mean the singular or the plural unless it is clear from the intrinsic text of this document that it is meant otherwise. Unless expressly stated to the contrary in the intrinsic text of this document, the term or is intended to mean an inclusive or and not an exclusive or. Specifically, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present). The terms a and/or an are employed for grammatical style and merely for convenience.
The term plurality is intended to mean two or more than two. The term any is intended to mean all applicable members of a set or at least a subset of all applicable members of the set. The phrase any integer derivable therein is intended to mean an integer between the corresponding numbers recited in the specification. The phrase any range derivable therein is intended to mean any range within such corresponding numbers. The term means, when followed by the term “for” is intended to mean hardware, firmware and/or software for achieving a result. The term step, when followed by the term “for” is intended to mean a (sub)method, (sub)process and/or (sub)routine for achieving the recited result. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. In case of conflict, the present specification, including definitions, will control.

CONCLUSION

The described embodiments and examples are illustrative only and not intended to be limiting. Although embodiments of the invention can be implemented separately, embodiments of the invention may be integrated into the system(s) with which they are associated. All the embodiments of the invention disclosed herein can be made and used without undue experimentation in light of the disclosure. Although the best mode of the invention contemplated by the inventor(s) is disclosed, embodiments of the invention are not limited thereto. Embodiments of the invention are not limited by theoretical statements (if any) recited herein. The individual steps of embodiments of the invention need not be performed in the disclosed manner, or combined in the disclosed sequences, but may be performed in any and all manner and/or combined in any and all sequences.
Various substitutions, modifications, additions and/or rearrangements of the features of embodiments of the invention may be made without deviating from the spirit and/or scope of the underlying inventive concept. All the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive. The spirit and/or scope of the underlying inventive concept as defined by the appended claims and their equivalents cover all such substitutions, modifications, additions and/or rearrangements.
The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase(s) “means for” and/or “step for.” Subgeneric embodiments of the invention are delineated by the appended independent claims and their equivalents. Specific embodiments of the invention are differentiated by the appended dependent claims and their equivalents.

REFERENCES

[1] RFC 3550, RTP: A Transport Protocol for Real-Time Applications, Internet Engineering Task Force Request for Comment.
[2] RFC 3551, RTP Profile for Audio and Video Conferences with Minimal Control, Internet Engineering Task Force Request for Comment.
[3] ITU-T Recommendation G.711, Pulse Code Modulation (PCM) of Voice Frequencies, Geneva, 1989.
[4] Kishan Shenoi, Digital Signal Processing in Telecommunications, Prentice-Hall, 1995. ISBN 0-13-096751-3.
[5] ITU-T Recommendations series G, Transmission systems and media, digital systems and networks.
[6] Stefano Bregni, Synchronization of Digital Telecommunications Networks, John Wiley & Sons, 2002. ISBN 0 471 61550 1.
[7] P. K. Bhatnagar, Engineering Networks for Synchronization, CCS 7, and ISDN, IEEE Press, 1997. ISBN 0-7803-1158-2.
[8] Danny De Vleeschauwer and Jan Janssen, Voice Performance over packet-based networks, An Alcatel White Paper.
[9] Ramachandran Ramjee, Jim Kurose, Don Towsley, and Henning Schulzrinne, Adaptive playout mechanisms for packetized audio applications in wide-area networks, Proceedings of the Conference on Computer Communication (IEEE INFOCOM), Toronto, Canada, June 1994.
[10] Aman Kansal and Abhay Karandikar, Jitter-free audio playout over Best Effort packet networks, in ATM Forum—International Symposium on Broadband Communication in the New Millenium, August 2001.
[11] Kishan Shenoi, Synchronization implications of providing Circuit Emulation Services in an IP Network, NFOEC/OFC, Anaheim, Calif., March 2005.
[12] Kishan Shenoi, Synchronization Implications in VoIP, NIST-ATIS Workshop on Synchronization in Telecommunications Systems (WSTS), February 2004.

Claims

1. A method, comprising:

monitoring a fill in an adaptive slip buffer of a digital to analog convertor;

adjusting a number of samples that are read from the adaptive slip buffer per page as a function of the fill; and

reading the number of samples from the adaptive slip buffer.

2. The method of claim 1, wherein the number of samples defines an apparent frame interval as a function of a clock frequency of the digital to analog convertor.

3. The method of claim 2, wherein the number of samples is increased when the fill is decreasing and the number of samples is decreased when the fill is increasing.

4. The method of claim 3, wherein the number of samples is decreased when a sample is flagged as actionable.

5. The method of claim 3, wherein the number of samples is changed when a minimum slip time interval has been exceeded.

6. The method of claim 3, wherein the number of samples is changed when an apparent frequency change threshold has not been exceeded.

7. A computer program, comprising computer or machine readable program elements translatable for implementing the method of claim 1.

8. A machine readable medium, comprising a program for performing the method of claim 1.

9. An apparatus, comprising: a digital to analog convertor including an adaptive slip buffer and a read address generator coupled to the adaptive slip buffer, wherein the read address generator includes an increment control that adjusts a number of samples that are read from the adaptive slip buffer per page as a function of fill of the adaptive slip buffer.

10. The apparatus of claim 9, wherein the number of samples controls an apparent frame interval as a function of a clock frequency of the digital to analog convertor.

11. The apparatus of claim 9, wherein the adaptive slip buffer includes a circular buffer.

12. The apparatus of claim 9, wherein the adaptive slip buffer includes a double buffer.

13. The apparatus of claim 9, wherein the adaptive slip buffer includes a linear buffer.

14. A digital switched network integrated access device, comprising the apparatus of claim 11.