To provide a better understanding of why this happens:
When sound was added to film, it was recorded on a separate medium. Clapper boards were used at the start of each shot so there was a distinct mark to line up the audio and visual material. For the final release, it was logical to store the audio alongside each frame, so each 1/25th-of-a-second visual frame carried the matching 1/25th-of-a-second slice of audio, and the images and sound were naturally synchronized: each frame contained a visual image and its associated audio.
Originally, digital media files were built the same way - fixed-bitrate, uncompressed video and audio. But they were enormous, slow to access, and impossible to deliver over the low-bandwidth internet of the time.
Modern digital media is different - both the visual and the audio data are compressed, and the algorithms for each are quite different. The work needed to decompress and display/play the material is also uneven - some material does not compress well, or decompresses slowly - so you have to embed explicit timing points in both the video and audio streams to keep playback accurately synchronized.
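To make that concrete, here is a minimal sketch (in Python, with invented packet contents - no real player works exactly like this) of what those timing points do: every decoded unit carries a presentation timestamp, and the player presents whichever unit is due next, no matter how long each one took to decompress.

```python
# Hypothetical decoded packets: (presentation time in seconds, stream, payload).
# The timestamps are the "timing points" - they come from the container,
# not from counting how fast we happened to decode.
packets = [
    (0.00, "audio", "samples 0-1763"),   # 1/25 s of 44.1 kHz audio
    (0.00, "video", "frame 0"),
    (0.04, "audio", "samples 1764-3527"),
    (0.04, "video", "frame 1"),
]

# Present everything in timestamp order, keeping the streams in lockstep.
for pts, stream, payload in sorted(packets):
    print(f"t={pts:.2f}s  present {stream}: {payload}")
```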
This is the job of a “container” format - it holds both compressed video and compressed audio data, the information about how the streams have been compressed and should be played back, and timing information.
A naïve container format might start with a header saying there is 100 seconds of H.264 video and 100 seconds of MP3 audio, and then simply append the whole video stream followed by the whole audio stream.
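In code, that naive layout might look something like this - a Python sketch where the “NAIV” magic and the header fields are invented purely for illustration:

```python
import struct

def write_naive_container(path, video_bytes, audio_bytes, duration_s=100):
    """Write a hypothetical header, then the two streams back to back."""
    with open(path, "wb") as f:
        f.write(b"NAIV")  # made-up magic number identifying the format
        # duration, then the byte length of each stream so a reader
        # knows where the video ends and the audio begins
        f.write(struct.pack("<Iqq", duration_s, len(video_bytes), len(audio_bytes)))
        f.write(video_bytes)  # all 100 s of H.264 data...
        f.write(audio_bytes)  # ...followed by all 100 s of MP3 data
```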
But to play that, you would have to read the whole file, separate out the two streams, decompress them, and then play the video and audio together. And it would be really easy for them to drift out of sync - if the MP3 audio plays back even slightly faster than the video, the error accumulates over the length of the file. It would also be really bad for internet streaming, since nothing can play until the entire file has arrived.
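The drift is easy to put a number on: with no shared timestamps, even a tiny clock mismatch accumulates steadily. A hypothetical audio clock running just 0.5% fast gives:

```python
video_pos = 100.0           # seconds of video shown after 100 s of wall time
audio_pos = 100.0 * 1.005   # audio played in the same time, 0.5% too fast
print(f"desync after 100 s: {audio_pos - video_pos:.2f} s")  # -> 0.50 s
```

Half a second of lip-sync error is glaring, and nothing in the naive format lets the player detect or correct it.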
A better “container” would split the video and audio into short chunks (say, 1/25th of a second each) and interleave those chunks of video and audio. That is great for synchronized playback and internet streaming, but it breaks the data up at really inconvenient points for both the video and audio compression - losing a chunk or two will produce really bad artifacts/distortion. Also, the per-chunk overhead of the container format itself makes the final file quite a bit bigger, which is not optimal for internet streaming.
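Here is the same kind of sketch for the interleaved layout (the 1-byte tag plus timestamp-and-length chunk header is invented; real formats like MP4 or MKV are more elaborate, but the principle is the same). Note the per-chunk header bytes - that is the overhead that fattens the file:

```python
import struct

CHUNK_S = 1 / 25  # each chunk covers one frame's worth of time

def write_interleaved(f, video_chunks, audio_chunks):
    """video_chunks[i] and audio_chunks[i] cover the same 1/25 s of time."""
    for i, (v, a) in enumerate(zip(video_chunks, audio_chunks)):
        ts_ms = int(i * CHUNK_S * 1000)
        for tag, payload in ((b"V", v), (b"A", a)):
            # 13 bytes of overhead per chunk: tag + timestamp + length
            f.write(tag + struct.pack("<qI", ts_ms, len(payload)))
            f.write(payload)
```

A player can now read the file front to back, playing each pair of chunks as it arrives, and the shared timestamps keep audio and video within 1/25th of a second of each other.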
So container formats are a balancing act between synchronization and file size/streaming capability.
Then you get Quality of Service. Both the player and the content provider's servers monitor the rate at which data actually arrives. If the connection cannot keep up, the player asks the content provider for a less detailed stream that requires less bandwidth. So a good streaming container allows for bandwidth changes midstream without interruption or desynchronization.
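Conceptually, the switch looks like this (the bitrate ladder is made up, and real HLS/DASH players use much smarter estimation, but the idea is the same): measure recent throughput, then request the best variant the connection can sustain.

```python
VARIANTS_KBPS = [4000, 2000, 800, 300]  # available encodings, best first

def pick_variant(measured_kbps, headroom=0.8):
    """Pick the highest-bitrate stream that fits, with safety headroom."""
    budget = measured_kbps * headroom  # leave slack so a small dip won't stall
    for bitrate in VARIANTS_KBPS:
        if bitrate <= budget:
            return bitrate
    return VARIANTS_KBPS[-1]  # worst case: take the lowest-bitrate stream

print(pick_variant(2600))  # -> 2000: drop from the 4 Mbit/s to the 2 Mbit/s stream
```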
The process of converting video to a different container or video/audio format is called “transcoding”, and doing it well - getting the best overall combination of quality, size and synchronization - is computationally expensive. The YouTube videos you are having issues with were probably poorly containerized when uploaded, but in a container format that looks OK to YouTube, so it doesn't transcode them into better containers. If you then hit a bandwidth or QoS issue during playback, switching streams part-way through may cause the audio desynchronization you are experiencing.
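If you have the (real) ffmpeg tool installed you can see the two cases yourself - file names below are placeholders. Changing only the container (“remuxing”) just copies the compressed streams and is nearly instant; a true transcode re-encodes them and is the expensive operation:

```python
import subprocess

# Remux: copy the compressed streams into a new container - cheap and fast
subprocess.run(["ffmpeg", "-i", "input.avi", "-c", "copy", "remuxed.mp4"],
               check=True)

# Transcode: actually re-encode the video and audio - computationally expensive
subprocess.run(["ffmpeg", "-i", "input.avi", "-c:v", "libx264", "-c:a", "aac",
                "transcoded.mp4"], check=True)
```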