The OP's question was about perceived transmit/receive delay, which in turn raises the issue of whether cell phones are full duplex or half duplex.
A traditional landline phone is full duplex – you can talk and receive at the same time, which facilitates natural conversational interaction. There is no transmit/receive delay since it’s both transmitting and receiving audio continuously.
Previous statements about packet delay don’t fully explain poor cellular audio quality. 3G cellular systems carried voice over circuit-switched, NOT packet-switched, connections, and they also had poor audio.
Whether cell phones are full duplex or half duplex is complex. From an RF standpoint, a 4G cell phone is functionally full duplex. Using either Frequency Division Duplexing (FDD) or Time Division Duplexing (TDD), there are effectively two separate RF channels available for transmit and receive. However, how this is presented to the user at the audio level can vary.
In an attempt to conserve RF and system bandwidth, it appears that cellular carriers often create a half duplex or “walkie-talkie” behavior at the handset. Maintaining a full-time bidirectional (full duplex) link would tie up two separate audio channels, with the resulting RF and network bandwidth consumption. Since only one party is talking most of the time, they apparently often don’t continuously maintain the receive channel for the talking party. It’s a cheap way to cut bandwidth consumption but unfortunately also produces a cheap walkie-talkie feel.
They also apparently have some kind of cutoff threshold for transmitting non-voice audio. IOW if the DSP and vocoder logic doesn’t detect voice-type audio, why spend the bandwidth sending background sounds? However, this further degrades the natural conversational feel, since when one party is quiet it feels like the call went dead. People often ask “are you still there?”. You never had to ask that on a plain old telephone landline.
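To make the cutoff-threshold idea concrete, here is a minimal sketch of an energy-based transmit gate. It is not how any particular carrier or handset does it; the function names, the threshold of 0.02, and the 2-frame hangover are all made-up illustrative values, and real voice detectors look at much more than loudness.

```python
import math

def frame_rms(frame):
    """Root-mean-square level of one frame of samples (floats in -1.0..1.0)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def transmit_decisions(frames, threshold=0.02, hangover=2):
    """Return one bool per frame: True = send the frame, False = suppress it.

    threshold and hangover are hypothetical values chosen for illustration,
    not taken from any cellular standard.
    """
    decisions = []
    quiet_run = hangover  # start out in the "silent" state
    for frame in frames:
        if frame_rms(frame) >= threshold:
            quiet_run = 0       # loud enough: treat as voice
        else:
            quiet_run += 1      # count consecutive quiet frames
        # Keep sending for `hangover` frames after the level drops,
        # so the ends of words are not clipped.
        decisions.append(quiet_run <= hangover)
    return decisions

# Hypothetical 20 ms frames at 8 kHz (160 samples each)
loud = [0.5] * 160
quiet = [0.001] * 160
print(transmit_decisions([quiet, loud, quiet, quiet, quiet, quiet]))
```

Every frame marked False is one the far end never hears, which is the "did the call go dead?" effect described above.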
On top of this there is vocoder compression, which further degrades audio quality. The vocoder (voice codec) itself in a cellular application is designed to shoehorn the voice audio bandwidth into the smallest possible data bandwidth, which in turn consumes less RF and network bandwidth. IOW it is highly compressed, and apparently they crank up the compression when the system is oversubscribed, which is most of the time.
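For a rough sense of scale, the sketch below compares the bit rate of classic landline-style digital audio (G.711, 8000 samples per second at 8 bits each) with a few modes of the AMR narrowband codec. AMR is only an assumed example here, since no specific codec is named above, and the sketch says nothing about whether or when a carrier actually drops to the lower modes.

```python
# Landline-style digital voice (G.711): 8000 samples/s x 8 bits per sample
G711_KBPS = 8000 * 8 / 1000  # 64.0 kbit/s

# A few AMR narrowband modes in kbit/s (lowest, a middle one, highest).
# Assumed here as a representative cellular vocoder.
AMR_NB_MODES_KBPS = [4.75, 7.40, 12.2]

def compression_ratio(codec_kbps, reference_kbps=G711_KBPS):
    """How many times smaller the codec's bit rate is than the reference."""
    return reference_kbps / codec_kbps

for mode in AMR_NB_MODES_KBPS:
    print(f"{mode:5.2f} kbit/s is {compression_ratio(mode):.1f}x smaller than G.711")
```

Even the highest mode listed uses roughly a fifth of the landline bit rate, and the lowest uses well under a tenth, which is where the audible loss comes from.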
Yet VoIP can sound very good given enough bandwidth and network resources, e.g., computer-to-computer Skype calling on a broadband network with good equipment. So it’s not inevitable that cellular calls have poor audio because of being packet switched or using vocoders. But the decision of cellular carriers to sacrifice quality for more in-use channels has resulted in this.
Here are several good discussions of the overall issues:
“Why Mobile Voice Quality Still Stinks—and How to Fix It” – IEEE Spectrum, 2014
“What happened to the DUPLEX telephone call???” – Electronic Design, 2013: http://electronicdesign.com/forums/analog-mixed-signal/what-happened-duplex-telephone-call
“Why Is Cell Phone Call Quality So Terrible?” – Scientific American, 2015: https://www.scientificamerican.com/article/why-is-cell-phone-call-quality-so-terrible/