If I'm recalling a piece of music, it's just audio clips with no visuals, and the recollection is "weighted" toward the parts of the music that stand out more to me. For example, a song with drums will still have the "feeling" of the presence of drums, but I can't really concentrate on what the drums sound like because they don't receive enough of my attention while I'm actually listening to the song for my subconscious mind to accurately reconstruct them. The same goes for most of the supporting instruments, but I will have a pretty accurate representation of the lead, which includes both the melody and the voice / timbre of the singer / instrument. I don't pay much attention to lyrics, though, and I can't even tell you what most songs are about, even songs that I just finished listening to. Consequently, vocals in my memory are typically just the melody and the sound of the person's voice, while the words are like a conversation heard through a wall. I can sort of make out words or parts of words here and there, but I don't recognize any sentences.
If I'm recalling an event or a conversation, I see just about everything in third person, like I'm watching a movie... sometimes with myself excluded, as if I'm only there "in spirit." I see everyone else the way they appear to me, but without details, so while they are probably wearing clothes in the memory, my mind just kind of reconstructs a generic outfit that I couldn't possibly remember or describe if you asked me about it. The image also won't include any details about jewelry, eye color, acne, moles, freckles, hairstyle, or anything like that, even though the person definitely does still have eyes and hair and maybe even freckles.
I just see an undetailed reconstruction of that person's image (face, hair, skin color, body shape, rough approximation of dress style) situated in an equally undetailed approximation of the environment, or a plausible environment if I don't remember where we were.
The sound of their voice is pretty accurate, but their speech patterns, vocabulary, arguments, opinions, and mannerisms may be skewed by my personal biases (about prescriptive speech and things like that) as well as by my opinion of them.
As for whether I'm excluded, it depends on what's happening. If I'm remembering martial arts practice, I see myself doing everything in third person, exactly the same way I see everyone else. If it's a conversation, I also see everything in third person, but I'm represented only by my own disembodied voice. It's not a first-person perspective, because I'm not seeing the situation from where I was relative to them at the time, and I also don't see my own hands in front of me. It's like I'm seeing a conversation between characters in a movie, but my own character's presence is evident only from the sound of my own voice.
In all circumstances, my voice sounds the way it sounds to me, which apparently is considerably softer and maybe an octave higher than it sounds to everyone else. When I'm speaking, I hear myself as somewhere between tenor and baritone, but in recordings I am very obviously a bass.