Database design questions.

I’m a member of a database message board, and i may ask this question there later, but i know people here better, and i know there are people here with database experience, so i thought i’d give the Dope a try first.

I’m putting together a little database of websites related to history (which is my field of study and teaching). In particular, i’m focusing on sites that have primary sources in history (documents, letters, manuscripts, video, audio, etc.). I would like this database to be searchable in a variety of ways, using a variety of different categories.

I may come back later with questions about how to implement the more complex stuff, but right now i want to start with more basic questions of setting up the tables, and of organizing the data properly.

So, first of all, the table containing the websites themselves will look something like this:

Sites

SiteID
SiteName
SiteURL
SiteHost
StartYear
EndYear

I’m not an expert, but it seems to me that this conforms with the ideals of 1NF. (ETA: actually, SiteHost should probably be moved off to a table by itself, because a particular university or company might host more than 1 site)

The SiteID will be an autoincrement, and will also be the Primary Key for the table. I could, theoretically, make the URL the PK (as URLs are unique), but the length and complexity of some URLs would make that too messy, i think.

The StartYear and EndYear functions will contain the range of dates covered by each website. So, for example, a site that covers the Great Depression, New Deal and recovery might have StartYear=1929 and EndYear=1945. I figure that this will make it easy for the user (me) to search for sites within a particular date range.

Anyway, in addition to the main table, i want to have other tables that break each site into categories for classification and searching. Categories i would like to have include:
[ul]
[li]Geographical area: which region, nation, and states/provinces does the site cover?[/li][li]Time periods: these would be named time periods (in contrast to the StartYear and EndYear values), such as “Revolutionary,” “Antebellum,” “Progressive Era,” etc.[/li][li]Topic areas or themes: using standard historians’ terms, such as “Social History,” “Sex/Gender,” “Race,” “Economic History,” “Historiography,” etc.[/li][li]Document types: a selection like the list i gave in the introduction, including Letters, Maps, Images, etc., etc.[/li][/ul]

There may (and probably will) be other things, but you get the idea.

A key problem i’m confronting here is that the relationships between the main Sites table and these category tables will have to be many-to-many. After all, a site might cover more than one geographical region, more than one time period, etc., etc.

I’m cognizant of what needs to be done to set up a many-to-many relationship. I need an intermediate table with foreign keys that correspond to the primary keys of the two main tables. So, to link my Sites table in a many-to-many relationship with, say, a Nations table, i would have:



Sites		SiteNation	Nations

SiteID		SiteID (fk)	NationID
SiteName	NationID (fk)	NationName
SiteURL
SiteHost
StartYear
EndYear


My first issue here is one that i can’t seem to find a definitive answer on anywhere. It’s the question of whether the linking table, SiteNation, needs to have its own auto-increment primary key, or whether it’s sufficient to just use the two columns themselves to form a compound primary key. Some sites and books i’ve read insist that each table should always have a single-column primary key, while others say that a composite one is fine, especially for tables like this. Anyone have any advice on this issue?

Second, and much more difficult, is the problem i’m having working out how to deal with the question of geographic locations in my database. I’d like to identify, where possible, which nations AND which states/provinces a website covers. So, for example, a site about relations between Europeans and Indians in the Great Lakes region might cover the nations France, England, Canada, and the United States, as well as the States/Provinces of Ontario, Quebec, Michigan, Illinois, etc., etc. You get the picture.

What i’m wondering is how to deal with this geographical information in tables. I could, theoretically, have a single table with all geographic entities: nations, states, etc. It would be a single long list, and would have a many-to-many relationship with the Sites table.

But one problem with this is that it doesn’t establish any relationship between the states and the nations. It seems to me that the data should describe, somewhere, the fact that Michigan is not only a geographic entity in its own right, but that it is a part of the United States of America. Of course, i could have a table with nations, and then a bunch of other tables, each containing the states of one nation, and each nation could them be linked to its sub-table. But i don’t know how to do this, and also to connect those sub-tables back to the main Sites table.

Am i being clear about what i’m trying to do here, and what my problems are? If not, please ask for clarification, and i’ll try to explain further.

I’ll leave it there for now. If anyone can offer advice, or point me to somewhere that i can find some answers, i’d be grateful.

Autoincrement probably best, URL could be messy as you say, and it may change and name may change.

I’m sure you’ve already thought about this, but you may have sites that cover non-contiguous years, you may want to move the years into the topic table.

Sometimes it makes sense to use a single-column auto-increment primary key, but not always. In the case of the SiteNation table you have a natural compound key, adding an additional number doesn’t gain you anything.

You’re on the right track. You already have the SiteNation table that provides a relationship from a site to many nations.

You can then create a StateProvince master table with all available/valid state provinces (linked to their respective nation via fk), and a SiteStateProvince table to provide a relationship from a site to the StateProvinces.

You will most likely have situations where a site has rows in the SiteNation table but not in the SiteStateProvince table, so having 2 independent tables makes the most sense. If you always knew you would have the lower level detail for any site that has any geographic info, then you could just have the low level detail table and you could determine which nations for the site based on the StateProvinces (due to the fact that they have a relationship to the nation table).

Adding in another auto-increment key (I have been taught that this is called a surrogate key) helps with flexibility in design, IMHO. For instance, you may want to include mirrors next time - and mirrors may be across different countries. If using the compound key, you either have to duplicate the sites, or something else, but you can’t touch SiteNation. With a surrogate key, extension is not a problem.

Though the use of the compound key, if my university professor is anything to go by, is the academically correct method.

Is this a good design decision? Unless you are going to provide a tree which lists all possible regions, I think the idea of using tags to identify the region helps.

I think the relationship between county -> state -> country should be derived in code, not store in the database. I would suggest treating the relationship as a graph or tree. Imagine folders, and sub-folders and etc. The locations all are a node - be it state, country and etc. A node can contain multiple nodes. Hence, if you start a node and trace its descendants, you find all locations which are “contained” with it. If you trace to the parent, you will find what it is contained within.

In short:

  1. Have a generic LocationID and a table to store its outgoing edges.
  2. Have tables for different levels of classification, and use LocationID to classify it
  3. When it comes to deducing relationship, do a BFS or DFS and follow the edges as from the first table in #1

Hope this helps!

On the last point …

geographic reagions are hierarchical, but not neat. Different areas have different hierarchies. e.g. Country->state->county->city works well for most of the US, but not all of it, much less most of the rest of teh world. So any structure which tries to have different tables for differnt types of areqas is a loser.

The best way is a self-referential table of generic regions which forms a tree with “world” as the root node. But you’ll quickly discover that recursive structures are one thing SQL does especcially badly. So you’re buying into a crapload of extra work.

Something else to consider about geo-regions. They are not constant over time. Which “country” entries do you put in the Austro-Hungarian empire? Does the USA region include the Hawaii region? Either yes or no will be wrong depending on before or after 1959.

Bottom line, I think a truly accurate data model for regions is waay too hard for what you’re doing. better to just tag each entry with some geo-facts and leave understanding the hierarchy to both the person setting the tags & the person searching.

Auto-increment key (synthetic key, etc.) is the way to go for you main table.

Compound key is OK for your linking table. There might be some efficiency advantage in a simple auto-increment key if you had tens of millions of records but probably not in your situation.

I would have one table with countries, then another table with the next level in the hierarchy that links back to countries. The intermediate linking table would link the main table to the lowest level in the hierarchy, and you would have to navigate back to get the country. This is an academic solution that does not into account the practical problems of trying to map areas by their political geography. You might have to identify each lowest-level entity in the hierarchy by time period it existed, like East Berlin.

Thanks everyone, for all the advice. I’ll take your points in order.

Actually, i hadn’t thought about that. It’s a very good point, although after you posted i thought about it and couldn’t think of a single site in my collection where the issue of non-contiguous time periods would really apply. The websites that i’ve used nearly all tend to focus on particular, continuous periods. That’s not to say that a site with non-contiguous years could never emerge, but historians, especially those designing for a broader audience on the web, tend to deal in particular, continuous time periods.

I think that the site i use that poses the most problems for a StartDate / EndDate sort of format is Doug Linder’s Famous Trials website, which covers the years 399 B.C.E. all the way up to 2006 C.E., and which has a whole bunch of discrete years rather than a particular time period. But that’s a rather unusual case.

I’m not sure what you mean by suggesting that i move the years to the topic table. Do you mean, put a whole bunch of individual years in the topic table?

OK, i’m still reasonably new to this. I think i know what you’re suggesting, but maybe you can look at an image and let me know if i understood you correctly. Here’s something i threw together quickly in Access, based on what you said. Is this what you meant?

Link

(I only used Access because it has readily-available table relationship diagramming. I’ll probably be doing my database in MySQL.)

That’s two suggestions for tags. I had actually considered this earlier, and it was going to be one of my later questions.

So how would a tag system work? Would it be simply a single, long table with a whole shitload of tag possibilities, linked to the Sites table in a many-to-many relationship?

And, looking ahead, would it be better to simply use tags for ALL of my categories (geographical, time period, document types, etc.). And if so, would a single massive tag table work, or should i have separate tag tables for each category?

Down the road, i’d like to make my database accessible to my students through a web browser, probably using PHP. I’d like them to be able to search in a variety of categories. Does this change the way i need to model the database, or can it be dealt with in writing the PHP? (which is something i’m stil an absolute beginner at)

OK, now i’m really starting to get out of my depth. I’m not sure what BFS or DFS is, and i’m unfamiliar with the term “outgoing edges.”

Yeah, i had thought about that. I had thought about trying to shoehorn all nations into a schema that went something like:

Nation > Next administrative unit (state, province, canton, prefecture, etc.) > City/Town. I wasn’t going to bother about counties, and for most (US) sites that would probably work, because the state would be specific enough for the purposes of the search, but there might be some sites where counties would be important.

Given that i’m not even sure i understand what you’re saying here, i think i’ll take your word that it would be too much trouble.

I’m still debating how to deal with this issue, because it seems to me that there are certain advantages, in terms of consistency, to simply using the modern geographic entity names.

For example, the Puritans of Massachusetts Bay are generally considered to be part of United States history, despite the fact that they arrived in Massachusetts about 150 years before there was even a country called the United States. For simplicity’s sake, i could probably argue a similar case for making Hawaii part of US history, even though this presents some historical dilemmas.

A supra-national entity like the Austro-Hungarian empire, or the Ottoman empire, presents slightly different problems, as you recognize.

I’m going to talk to some of my history professor friends about this, and see what they think. What i’m seeking, really, is a system of nomenclature that is relatively consistent, and that will have meaning for people in the field of history. I’m pretty confident about how to go about it for the United States (my own area of expertise), but i’ll need to talk with experts in other regions, or do some reading, to learn more about geographic breakdowns.

Is your suggstion here similar to the table layout i provided in the image, above? Or something different? It’s not quite clear to me.

Thanks again to everyone for all your help. I really appreciate it.

I agree. I am in the camp that says each table should have autoID column. I almost always make that the primary key for the table.

If you trust the knowledge of whoever can add sites to be correct, then I think LSLGuy is right about the geography. Tagging is probably the easiest way to go. If you are are the only person that will add sites, then it will probably work out fine.

However, if you want to have a lot of control over geography labels, or you don’t trust people adding to the sites, then you might want to consider a more formal structure. This would be a LOT harder to design, implement, and maintain. You would lose a lot of flexibility going this route, but you would gain in data control. But you would have to come up with a set of descriptors that would work for every instance. If you can’t, then you almost need the frlexability of using tags.

That would probably work. You might want to have tag types (geo tags, people tags, etc.) but they could all be stored in a table linked to another TagType table.

This is a good question. Geo tags were suggested because of the issues surrounding modeling the geography over time. You might not have these issues surrounding Doc types, for instance. How many doc types are there? If there are 10 doc types and what constitues a doc type is pretty clear, tagging may not be the best. On the other hand, it might be nice to have the same structure for all your descriptors.

Yes.

Advice: Start simple and try it out. You will learn more about the data, exceptions, relationships, etc. as you go. Don’t try to make it perfect at the start, just make a reasonable effort to create something workable. If you are prepared to re-work from the beginning, and you do it before too much has been created around it, you will have the benefit of experience to fine tune your model.

For a database, each tag is has a TagID and a name. When a tag is suggested, you have to check if the tag exists. If not, you have to insert it in. Another table, probably TagSite, which stores the relationship between tags and site, will probably be just a 2 attribute table ( <TagID>, <SiteID>), though you could add fancy stuff like date tagged, first tagged by who and etc.

It would work, but it would be better to separate the tags by each type of category. I think some certain criteria can be fixed - types of documents for one.

Yes, not a problem at all. You can use PHP to take in search terms via textbox and uses the LIKE command in SQL for basic searches.

It basically something dealing with Graph Theory. A simpler solution could be easier to implement - tags seem to be easier to implement, but could be harder to maintain in the long run.

Thanks again for the further helpful advice.

I think i might move forward and start putting the database together, and then populate it with a relatively small number (50 or so) of websites, and see how it goes.

Once i’ve done that, i might come back and inflict more of my problems on you, and/or start getting some advice about getting the database online using PHP.

Cheers.

Something else to consider. With modern search technology, you don’t really need to use a database & SQL querying until you get to millions of records, if then.

Your “tags” could simply be a text field into which people type unformatted keywords.

SQL & querying is designed around the idea you have a neat data model & you need to be able to retreive EXACTLY the items that fit PRECISE criteria. If what you really want is to be able to find almost all the needles in the haystack & don’t mind if 10% of what you retrieve is really hay, well that’s a completely different problem.

The search paradigm is much more the latter, and is now very familiar to your users. Approaching the problem that way may help you a bunch.

In modern Knowledge Management systems, thre is very little structure & lots of tags & searches. That lets the stupid computer do the laborious, but stupid, part of finding most of the needles & a little hay, while leaving the humans with a much richer (but still impure) pile to go through using human, slow, but non-stupid, pattern recognition.

Here is how you could create a (simple) hierarchy.



Sites      SiteCity    Cities   States    Nations
-----      --------    ------   ------    -------
SiteID     SiteID (fk) CityID   StateID   NationID
SiteName   CityID (fk) CityName StateName NationName
SiteURL                StateID  NationID
SiteHost
StartYear
EndYear

However, the more I think about it, the less practical this seems, due to the shifts over times plus lack of consistency in how this is managed worldwide. It could be done but would get rather complex, and my example just scratches the surface because you would have to deal with provinces, counties, republics–what do you call Russia? A republic of the Soviet Union a while back, or a country today?

That’s some very good food for thought there.

Even if this database goes well beyond my current collection of websites, and expands to include many countries and regions, i really couldn’t see it ever going over a few thousand records. Maybe 10,000 at the absolute outside, which is small for a MySQL database.

In fact, before i started this thread, one thing i contemplated, instead of beginning with a MySQL database and then adding PHP myself, was going with a Content Management System of some sort. I know that many of those are frontends for MySQL/PHP setups anyway, and that they can deal with things like keywords.

I was just worried that a CMS might not give me enough flexibility in terms of searching. This is especially true in terms of things like particular years. Despite the caveat offered by RaftPeople in post #2, i would still like to have a Start Date and End Date for each site, allowing a user to plug in his/her own start and end dates into a search.

I would like, for example, to be able to search for all sites that cover the period 1800 to 1860, in the United States, dealing with the topic of women. While a keyword search could probably do that also, it just seemed to me that something more precise might be of help.

I will add, though, that i’m basically a complete novice at PHP, and have plenty of reading to do, so i don’t know exactly how easy it will be to get what i want. I also admit to not being very knowledgeable about exactly how customizable a CMS would be in terms of searches, etc.

CookingWithGas, thanks for the schema. I do agree with you, though, about the difficulty of implementing the geographic areas, and i think that keywords might be the way to go, without worrying about the hierarchy bit.

At least in SQL server, you can mix traditional precise query language with more google-like stuff.

SELECT * FROM Sites WHERE StartDate THROUGH EndDate OVERLAP WITH @InputStartDate THROUGH @InputEndDate AND Tags MATCHES @InputTags ABOVE 70% WITH ENGLISH STEMMING

That’s not the actual syntax, but it captures the flavor of what you can do in one line of modern SQL with embedded search.