Does anyone have any experience with hooking up multiple sound inputs to a computer? In particular, I’d like to have multiple microphones statically mounted around a room (for speech recognition), all feeding into a single computer. Do I need multiple sound cards (one mic per card)? Is there a “microphone hub” that I could purchase that might take a lot of pain out of this? What costs can I expect (time, effort, or monetary) to get this to work?
Although this is GQ, since this is fairly advanced computer work, I expect some advice and/or recommendations will be proffered also (and would be greatly appreciated!). In that spirit, some other questions: Any particular hardware recommendations? Software for sound processing (open source is highly desirable)?
I think it might be helpful to provide some background information: I’m a graduate student who has written the core networking/distribution/control software for voice controlled, interactive robots (generally Java on Linux, although minimally tested on Windows and Solaris also). For actual use, we’ve relied on a head-mounted, wireless mic for speech input (although during development and testing, I use the stick microphone on my desk); single, fixed location microphones have proven to be really bad for our purposes (directionality, background noise, general unintelligibility have all been issues).
My knowledge of computer sound equipment is pretty much limited to “plug in mic, start speech processing software, fix minor glitches (e.g., wrong serial port, adjust mixer settings, etc.)”, so any information would be greatly appreciated.
It would be trivial to feed any number of microphones into a mixer, and then into your sound card, but that’s NEVER going to work. The more mics you have, the more random uncorrelated noise you are going to have. Get enough mics, and all you will pick up is noise. If I were going to do this, I would run each mic independently, and run the voice-recognition software on each input independently, then vote on the results…
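A minimal sketch of the "recognize each input independently, then vote" idea, in Python. It assumes each microphone's recognizer hands back a plain string (or nothing); the function name `vote` and the `min_agree` parameter are made up for illustration and are not part of any speech package.

```python
from collections import Counter

def vote(hypotheses, min_agree=2):
    """Return the command most recognizers agree on, or None if no consensus.

    hypotheses: one recognized string (or None/empty) per microphone.
    min_agree:  how many recognizers must return the same text (assumed value).
    """
    # Ignore mics whose recognizer returned nothing, and normalize the rest.
    heard = [h.strip().lower() for h in hypotheses if h]
    if not heard:
        return None
    command, count = Counter(heard).most_common(1)[0]
    return command if count >= min_agree else None
```

Requiring at least two recognizers to agree means a single mic that mishears cannot trigger a command on its own, at the cost of ignoring a command that only one mic picked up.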
That should work, as long as you don’t need more than 4 inputs.
In my experience, the amount of noise picked up by 4 live mics shouldn’t be a serious problem. With 12 mics, it would be.
If you have a need for that many, you might have to either: record to a multi-track system and mix later, or use gates on each mic. A gate will clamp down on the input unless there is significant level; anything below the threshold level is cut to zero. Sometimes tricky to adjust, though.
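For the software-minded, here is a rough sketch of what a gate does, assuming the audio is a list of integer samples. The `hold` parameter is my own addition for illustration (real gates have attack and release settings that are more elaborate than this).

```python
def gate(samples, threshold, hold=0):
    """Zero out samples whose magnitude is below threshold.

    hold: number of samples to keep the gate open after the last loud sample,
    so that quiet tails of words are not chopped off.
    """
    out = []
    remaining = 0  # samples left before the gate closes again
    for s in samples:
        if abs(s) >= threshold:
            remaining = hold  # loud sample: pass it and re-arm the hold timer
            out.append(s)
        elif remaining > 0:
            remaining -= 1    # quiet sample, but the gate is still held open
            out.append(s)
        else:
            out.append(0)     # gate closed: output silence
    return out
```

With `hold=0` this is the bare "below threshold means zero" behavior; a small nonzero hold is one reason adjustment is tricky, since too short clips words and too long lets noise through.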
Speech recognition has been the bane of our lab’s human-robot interaction. We did do something like what you mention at one point – one head-mount mic (wireless to a desktop) and one stick mic (wired to a laptop on the robot), each running different voice recognition software (one package was better at word recognition, the other at phrases). Reconciling between them was difficult, to say the least. And that was assuming we got recognizable speech to begin with. Am I right in assuming I’d need two sound cards to do this on one computer?
Ugh. I was hoping this was a problem with at least a somewhat known solution, maybe using a set of directional mics (I know of, though I’m not familiar with, some work done on localizing a sound source using an array of 8 mics). I was also hoping that having fixed location mics would be easier to work with, figuring it’d be easier to manipulate the room’s acoustics for non-moving mics.
You mentioned a mixer…do you mean a piece of hardware? (I realize I used the term; I was referring to ALSA settings which are obviously very different.) What might that do for me? What available options are there? Would it work if, say, there were only two microphones at opposite ends of the room (in case someone is facing away from one of them)?
I seriously doubt that mixing multiple room mics together is going to work. When you have microphones in “free space” they pick up all kinds of echoes. Your brain filters this out, but speech recognition software isn’t going to like this type of input. Now, if you have multiple mics in free space, they are all going to have slightly different echoes, so I think it’s just going to make the software’s job more difficult. I’d love you to prove me wrong on this - it would sure be great if you could walk into a room and give commands to a computer reliably…
A “mixer” is just a box that accepts multiple mic inputs, and combines them into a single output - one of the earlier posts had a link to one.
Kind of out of my field so I’m just brainstorming. Are there no microphones with built in squelch circuitry to filter out ambient noise? I know that cell phones and PC mikes do this, but I’m pretty sure that’s a function of the phone/software.
Squelch and gates are not the same thing, although the principle is similar. Squelch, commonly used in taxicabs, cuts out suddenly and completely and has little adjustment available. A gate is a much more sophisticated device/circuit and can be adjusted as to sensitivity, threshold, “knee” which is sort of a slope control, trigger time & release delay and other settings like compression. A gate can be used in a high-fidelity environment where its action is desired to NOT be obvious.
It sounds like for Digital Stimulus’ application, fidelity might be an important consideration.
And yes, mics do come with gates built in, some adjustable. The ones in a studio console or standalone are higher quality and have more adjustments. A badly adjusted gate is worse than none at all.
Former semi-pro sound engineer here. We need a little more information. Technically, a (hardware) mixer of the type in Projammer’s link would seem to answer the OP’s questions, but I have a feeling the actual situation is more complicated.
What exactly are you trying to do with all these mikes? Is the idea that you want to be able to pick up (and process) speech from multiple people around a room, or from the same person at many different places? Will all these inputs be controlling a single device, or do you want each to control a different device?
I’m no expert on voice recognition systems, but a head-mounted mike is the simplest way to obtain high quality audio while rejecting extraneous environmental sounds. Is there a reason you can’t use one?
Beowulff and Musicat have raised some of the issues and suggested some of the solutions that may come into play, but we’ll need a fuller picture of exactly what you’re trying to do before we can help you narrow them down.
I think your problem is that your mics are crap. The $10-20 stick mics you get at the computer store are terrible. 4 stick mics aren’t going to sound any better. Get a professional mic and preamp and your troubles will be gone.
From your second post (#5) I’m getting the impression that you need to keep the sound from the various microphones separate to run on different software. A mixer will NOT help you in this scenario, as it will combine the sounds from all the mics into one audio stream.
You could possibly do what you are asking with multiple sound cards, but it would depend on the software being able to select the correct source.
An easier, but slightly more expensive option would be to run each microphone on a separate computer.
Thanks all. I’m going to need a little time to check out the information thus far, so I’ll post some more questions later today (and with a cursory reading, I know I have some). A friend I haven’t seen for about a decade rolled into town yesterday, so I’m entertaining. But I did want to respond to this:
At the most basic level, I’d like to be able to pick up speech commands given by a single person in a single room, anywhere in that room. With the limited experience I have, I realize what a problem sound processing/speech recognition is, especially in an uncontrolled environment. So, I figure that the issues of multiple people, varying levels of ambient noise, filtering and/or separating conversation participants, etc. are out of my league right now (but would be interesting to work on at a later time).
As far as controlling devices: I’ve written the software that coordinates, controls, and communicates with a variety of sensors and effectors (e.g., cameras, motors, speech recognition/production, laser range finders, etc.), at least making use of supplied APIs and/or outside software. So long as I have a hardware connection (serial port, 1394, etc.) and network connectivity, control isn’t an issue. I’m currently looking into wireless sensor network hardware (e.g., Zigbee) and possibly venturing into DIY territory (busting out a soldering iron to hook up custom sensors). Up to now, I’ve been pretty much a software guy, only tackling hardware on an as-needed basis, so this is fairly new to me.
If it comes to that, I’ll have to. But it would be so much nicer to not have to worry about dealing with and wearing a personal mic, instead just having the room wired for sound.
I hope that narrowed it down some. One question I have is: what options do I have in adjusting a room’s acoustics? Not soundproofing, mind you, but just improving the environment (perhaps just reducing echo and/or deadening reflection).
No, no. If you need to keep each input as separate as possible, use multi-tracking software, not separate computers. Syncing two computers would be difficult.
This escalates the cost, but it may be the only way to go. Investigate ProTools, by Digidesign. You might contact Sweetwater Sound in Fort Wayne, IN; they carry all that stuff and I’ve found them very helpful.
Basically, you would need an input box that sends each mic to a separate virtual track and the software to handle it. It’s exactly what musicians use to record and edit nowadays. The number of simultaneous tracks is limited only by the input hardware and the power/storage of your computer.
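On the software side, multi-channel input typically arrives as one interleaved stream that has to be split back into per-mic tracks. A minimal sketch, assuming 16-bit little-endian PCM (a common format, but check what your hardware actually delivers); `split_channels` is a made-up helper name.

```python
import struct

def split_channels(raw, channels):
    """Split interleaved 16-bit little-endian PCM bytes into per-channel lists."""
    frame_bytes = 2 * channels                    # one 2-byte sample per channel
    usable = len(raw) - (len(raw) % frame_bytes)  # drop any partial trailing frame
    count = usable // 2
    samples = struct.unpack("<%dh" % count, raw[:usable])
    # Sample i belongs to channel i % channels.
    return [list(samples[ch::channels]) for ch in range(channels)]
```

Each returned list could then be fed to its own recognizer instance, which keeps the mics separate without needing a second computer.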
If you can mic each mouth closely, the room acoustics matter less. If you can’t, deaden the room as much as you can and try to avoid fans and anything that generates noise. Echoes reduce clarity; you can always add them later for effect, but you can’t take them out easily.
If people are moving around, a mic attached to each person is vastly superior to a room pickup. With speech recognition software, you need all the help you can get.
Just to be clear, you want one person to be able to interact with a computer by issuing voice commands from anywhere in a room, right? Not multiple people, not multiple independent tasks depending on the location? If so, I heartily second Musicat’s last post. Use a headset mike.
However, I have a few more questions:
- Will this interaction be the person’s primary activity, like a game or an office task, or will it be an occasional thing done along with other things not involving the voice recognition system?
- Does the user have to be able to be picked up absolutely anywhere in the room, or are there a few specific places where he/she is most likely to be: the desk, the table, the sofa? Could the user come to one of a few mike placements to give commands?
- Could he/she press a button when giving commands, or does this have to be a hands-free operation?
- Is this a task that will be carried out only in one specific room, or is it intended to be done in many different kinds of places? Obviously, there are things that can be done in a one-off situation that won’t work if you’re developing a product that is to be used in thousands of customers’ homes or offices.
- What other sound sources are in the room? Other people, machinery, radios/TVs, etc.?
- What are the objections to a (possibly wireless) headset mike?
There is a lot that can be done with acoustics. So much, in fact, that it is sometimes nearly impossible to know what to do, and deadening the room isn’t always the best solution. Acoustics is one of those black arts that are so complicated and difficult to analyze that success is often a matter of blind luck. Ask the designers of concert halls. Suffice it to say that just putting down some carpet or sticking up some drapes is unlikely to solve the problems caused by having multiple microphones open in a single room.
In short, if my assumptions about the situation are correct, the problems other posters have mentioned related to noise picked up by multiple mikes will, IMHO, vastly outweigh any inconvenience involved in using a headset mike. Unless your project has an unlimited budget, use the headset.
We did a robot in Japan that used three built-in microphones connected to a cheap controller with a Windows driver. The robot did voice recognition; there’s a video somewhere on the 'Net of me demonstrating it during a cable TV show. Despite all of the background noise in the studio and my bad Japanese, the robot worked well.
We used basic algorithms to have the robot turn towards the speaker, and if necessary approach, to improve voice recognition. It was a useful trick that meant that we could focus speech-recognition accuracy tuning on a smaller subset of audio conditions.
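To give a flavor of how "turn towards the speaker" can work, here is a rough sketch of one common approach: estimate the arrival-time difference between two mics by cross-correlation, then convert it to an angle. This is not necessarily what the robot above used; the function names, the two-mic layout, and the far-field assumption are all mine.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly, at room temperature

def best_lag(left, right, max_lag):
    """Lag (in samples) at which `right` best lines up with `left`.

    Positive means the sound reached the right mic later than the left one.
    """
    def score(lag):
        # Cross-correlation at this lag, skipping indices that fall off the end.
        return sum(left[i] * right[i + lag]
                   for i in range(len(left))
                   if 0 <= i + lag < len(right))
    return max(range(-max_lag, max_lag + 1), key=score)

def bearing_degrees(lag, sample_rate, mic_spacing):
    """Angle off straight ahead; positive means toward the left mic.

    Far-field approximation: assumes the speaker is much farther away
    than the spacing (in meters) between the two mics.
    """
    delay = lag / sample_rate
    # Clamp so noise or rounding cannot push asin outside its domain.
    x = max(-1.0, min(1.0, delay * SPEED_OF_SOUND / mic_spacing))
    return math.degrees(math.asin(x))
```

In a reverberant room the correlation peak gets smeared by echoes, so treat the angle as a hint for which way to turn rather than a precise measurement.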
I’ll dig out the audio controller information tonight, and see if they have Linux drivers.
AFAIK, for this purpose a digital audio workstation such as ProTools isn’t going to be able to do anything a cheap mixer can’t do. Only one channel of voice commands can be processed by the voice recognition software, correct? DAWs can be finicky beasts with their own issues anyway.
For simplicity and effectiveness I can’t think of anything better than a headset or maybe a lapel mic attached to the user.
I imagine there’s some sort of Bluetooth solution out there as well.
Yow, too much information to mull over. I’ve got a couple minutes, so I’ll hit what I can.
While I’ll google terms later, is there a site you can supply that explains these terms (perhaps even with some recommended equipment)? And I am on a limited budget, being a non-resident student (thus sacrificing my stipend) while finishing up.
It is, I suppose, but just enough to get decent recognition accuracy. Thus far, one thing we’ve done is to (radically) restrict the dictionary; a well-chosen set of commands does wonders. Ultimately, of course, an unrestricted dictionary is desired…but damn is natural language a hard problem. I’m just thankful packages like Sphinx exist and are open source.
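To illustrate why a small command set helps so much, here is a toy sketch in Python. Note this is a post-processing stand-in, not how Sphinx itself restricts its vocabulary (that happens in the recognizer's grammar or language model); the command list, function name, and cutoff value are invented for the example.

```python
import difflib

COMMANDS = ["lights on", "lights off", "stop", "come here"]  # made-up example set

def snap_to_command(heard, commands=COMMANDS, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(heard.lower(), commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With only a handful of valid outputs, a slightly garbled hypothesis still lands on the right command, and anything far from every command gets rejected instead of acted on.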
This may very well be. I followed the thread from a week or two ago about inexpensive mics with much interest, and may pick up one (or more) that was recommended there. Assuming I don’t just go with a wireless mic, which may end up being my best choice.
I’ve got time for one more response before dinner; I figured this one would be the most productive.
As a stopgap, one person, one room. Since I’m neither a sound technician nor speech researcher, it’ll do, although not quite what I’m looking for. More:
An occasional thing. However, I’d (eventually) like to monitor the activity in the room and keep track of context (e.g., episodic memory, scene/gesture interpretation, emotive detection, etc.). Again, although some of this is done (in rudimentary form), that’s a big eventually – at this point, I’d be happy with a glorified “Clapper”.
Strategic placement would help; I hadn’t thought that far into it. Optimizing for the common case might be good enough, I suspect. More precisely, having adequate sound pickup places where people sit, with a mic easily accessible that someone could walk to when necessary (or at least talk at) might gloss over the issue just enough.
What might I expect from a directional mic? (Am I using the correct term there?)
Well, much of my research concerns fault-tolerance (although this particular part is a narrow part of it); so, having backup options is OK. In other words, multiple forms of input (e.g., speech isn’t working, use the keyboard) are an option. Of course, it’s incredibly frustrating to have to switch modes, and it would be nice to avoid it whenever possible.
At this point, very specific and very custom. With that in hand, hopefully it would be possible to branch out to bigger and better things. It’s just that, ya know, the “garbage in, garbage out” adage is a hugely limiting factor in human-robot interaction. Lessons learned in this situation can surely be built upon in the future.
Well, right now, just an office – noise can be artificially kept to a minimum. So, the computer (and its fans). The computer/robot can/will respond in natural language also; right now, the setup is to turn the mic off while speaking so that it doesn’t “hear” itself. Not a great solution, but it circumvents (at least) one glaring problem.
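The "mic off while speaking" setup boils down to something like the following sketch (in Python for brevity rather than the Java the real system uses). The class and method names are hypothetical, and the real setup may switch the mic at the mixer level instead of dropping frames in software.

```python
import threading
from contextlib import contextmanager

class HalfDuplexGuard:
    """Tracks whether the system is talking so the listener can drop audio."""

    def __init__(self):
        self._speaking = threading.Event()

    @contextmanager
    def speaking(self):
        """Wrap speech output in this block to mute the input side."""
        self._speaking.set()
        try:
            yield
        finally:
            self._speaking.clear()  # always re-open the mic, even on error

    def accept(self, frame):
        """Return the frame if we should listen to it, else None."""
        return None if self._speaking.is_set() else frame
```

The speech-output code would run inside `with guard.speaking():`, while the capture thread passes every frame through `guard.accept()` before handing it to the recognizer.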
Until now, I’ve had to avoid using sonar (while moving, for obstacle detection) in many scenarios because it made speech totally unintelligible. If there are any effective ways to filter out sounds (either fairly consistent, like sonar pings, varied noise, like TV, or more general background noise), I’d be very interested in any or all of them.
Specifically, I’d like to not be tethered to the computer. Wireless gets around that, but one still has to don the headgear, which is a pain in the butt. Of course, a functional system with a minor annoyance is vastly better than something that works sporadically (if at all).
Getting deeper into this makes me appreciate more and more just how remarkable our auditory system is.
Well that’s not very encouraging. (That’s more of a lament than anything else; it always amazes me that as advanced as technology is, there’s so much that’s still so difficult.) Maybe sticking to the KISS principle should be my guide, as it’s looking like I might be biting off more than I can chew at this point. Nonetheless, down the road I expect to have to address some of these issues, and this is all very educational. Would you happen to be able to recommend something (book, website, etc.) on acoustic design?