Advice on database structure design (survey tracking)

Hello,

I hope I can get a bit of advice regarding database design. I’m a complete amateur at this, but being the only one with some skills, it’s fallen to me to do the implementation. Hopefully I can explain my question in an organized fashion:

Information to track
There are five (though there will soon be more) surveys. Each survey has some slight variation. In short, the survey questions can be divided into three categories:
[ul]
[li] Biographical information (name, address, etc.)[/li][li]Survey questions (e.g. “do you like filling out surveys?”, and “do you own any pets?”)[/li][li]Item checkboxes (e.g. "Which of the following pets do you own Cat__ Dog__ Llama __ Platypus __)[/li][/ul]

The biographical information fields will never change and are identical across surveys.

The survey questions may change in the future, but right now half are all identical across all surveys and half vary between surveys, though there is some overlap.

The item checkboxes will definitely change over time. There is some overlap between surveys (i.e. “cat” appears on all surveys at the moment, but some have “cat” and “dog” while others have “cat” and “llama”).

How it’s set up now
There is a main table to collect the common biographical information, and a sub-table for each individual survey.
[ul]
[li]tbl_main has all biographical fields and all questions that are identical on all surveys[/li][li]tbl_survey123N have fields for those questions that vary and for the checkboxes of the respective survey. There is also a foreign key that relates the record to tbl_main. [/li][/ul]

This seems somehow inelegant, but I’m not quite proficient enough to say why. It feels wrong to have several fields repeating across sub-tables. What if I want a query that, for example, pulls all people in California who don’t like surveys and have cats or llamas? I think the problems with that are somewhat self evident: I’d need to know/remember the fields of all tables in order to know which to include in the query.

But I’m not sure what the best way to redesign the structure would be (hence, why I’m posting here).

I’m leaning towards just two tables: One table with the unchanging, permanent biographical information in it. Then a second table with a field for each question, regardless of whether the question repeats across surveys. However… wouldn’t doing it this way mean that over time, the second table will grow and grow and grow in the number of fields?

Or should each question get its own table? That is, a tbl_cats with the only field being the primary key of tbl_main to relate it to a responder. Each one-field table would get populated by a responder-ID. However… wouldn’t that mean I’ll eventually end up with hundreds of tables? Some complicated queries might be a bit, well, extra complicated, no?

So… are any of the above methods the “proper” way to design this structure? Is there another, better way I should go about this?

Oh, if it makes a difference, this database is built with PHP and MYSQL, interacting through the web.

Thanks for any help you may have—even if it’s just an encouraging word!

Rhythm

I would suggest:
Tbl_Main (biographical stuff)
Tbl_Questions (one record per question)
Tbl_Survey - with QuestionID, PersonID as foreign keys, and a field for the answer

If your questions are multi-choice and not the same range of choices per question, then I think you need Tbl_Responses, with QuestionID as a foreign key, each row describing one possible response to the question. (in this case, the answer field in Tbl_Survey also becomes a foreign key to ResponseID.

That way, when you add a new question, you only need create one new record in Tbl_Questions and a set of records in Tbl_Responses defining each possible answer - you don’t have to create any new tables/structures, nor do you have to modify your application to deal with newly-added tables/structures.

I am not sure I follow exactly but this is what I would suggest.

  1. Users table to track respondants. ID, Address, other “biographical information.” Also date and perhaps IP data.
  2. Survey table. ID, Survery name only. Relate it to Users table with an ID (user fills out survey 1, row is added to Users table with SurveyID = 1)
  3. Questions table. One row per question. Give it an Answers column and put multiple-choice answers in the field if available with a separator you can use later, in your form. (Cat,Llama,Dog - use as an array in PHP later). Also give it a SurveyID and perhaps a SortOrder column. So you can dynamically make your forms.
  4. UsersQuestions table, to hold responses. SurveyID, UserID, QuestionID, Answer. When you tabulate your data for reporting, use joins to get everything to display nice for you.

The nice thing about this structure: very dynamic and very clean (using FK’s). Easy as heck to add on to without disrupting data and lets you re-use questions.

The bad thing about this structure: Multiple choice answers will show up like (1,2,3,‘cat,dog’) and you will need to do some string-based querying to extract how many people answered that they have a cat, even if some say “cat,llama” and some say “cat,dog”

But…it’s one idea.

I think it’s a really bad idea to store multiple pieces of data in a single field. Maybe it is common practice with PHP and makes it easy to construct an array, but I’m pretty sure it’s a big no-no as far as formal standards of database normalization are concerned.

You don’t even necessarily need Tbl_Survey. You can dedicate a field in Tbl_Questions, and one in Tbl_Responses, to a concatenation of the unique identifier in Tbl_Main + TimeStamp (Date & Time Taken).

You haven’t described the process by which the possibly-different valuelist set of checkbox responses, and the half-different set of questions, are determined; I’m assuming via a script? Anyway, just assign as a temporary variable the TimeStamp string and then auto-enter that in conjunction with the Tbl_Main keyfield and that would suffice to define the “survey” without really needing a table for it. (It’s not like it’s a Physics Exam where you assign grades or otherwise attach a set of variables to the survey as survey; it really only consists of an association between the other tables)

To clarify what I said above, I’d do it like this

If you do it that way, it’s not even in 1NF, and that throws off a lot of optimization/error-proofing techniques. I can’t recommend that.

My solution is very similiar to Mangetout’s structure, but there are a couple differences:

USERS( pk_USER_ID, BIOGRAPHICAL_FIELD_1, …, BIOGRAPHICAL_FIELD_N );
SURVEYS( pk_SURVEY_ID, TITLE, etc. )
SURVEY_USER_COMBINATIONS( pk_SURVEY_USER_COMBO_ID, fk_SURVEY_ID, fk_USER_ID, TIMESTAMP );
MULTIPLE_CHOICE_ANSWERS( pk_ANSWER_ID, fk_SURVEY_ID, QUESTION_ID, ANSWER_VALUE );
SURVEY_MULTIPLE_CHOICE_ANSWERS( pk_RECORD_ID, fk_SURVEY_USER_COMBO_ID, fk_ANSWER_ID );
SURVEY_CHECKBOX_ANSWERS( pk_RECORD_ID, fk_SURVEY_USER_COMBO_ID, BITMASK );

I hope the relationships are clear from the foreign keys. The basic idea is that you have one table to identify surveys and one to identify users. You also have a table to identify which users have taken which surveys, which is where most of your data will be joined to. Each user’s answers to each survey are split among two tables because the datatypes for multiple choice and checkbox questions are different, so you’ll have to do some coding to get data in and out of the table. I’ve used a bitmask so that you only have one field per answer, which means you can use a union query to get raw data.