Calling All Librarians: Can I get a downloaded soft copy of LCSH anywhere?

DISCLAIMER: I’m not asking anyone to incur trouble at work or to violate any license arrangements they may have, so please use your own best judgment in replying. (The reason for the disclaimer will become evident later.)

Now then, to my issue.

I’m looking for a soft copy of the LCSH in a format that allows searching and manipulation on a field by field basis. Therefore, it would optimally be in some Windows-friendly format like Excel or Access; a PDF dump would be useless. In the system I am building, I will need to be able to perform searches to identify See and Use For references. I can go into further details of the project, if you want, but it’s not essential to frame my question.

You would think that the subject headings list of our national library would be freely available online, but this is very evidently not the case. What I have found, on the LOC website itself, is a download that is utterly unlike any format I have ever seen. I have also seen something called ClassWeb, which seems to offer the type of data I need–but at a cost of hundreds of dollars.

Naturally I’ve contacted my university library with this plea and explained why I want this data and how I plan to use it, but nothing stops me from checking out other possible avenues while I wait for them to get back to me.

In short, then, I’m wondering if anyone can point me in the right direction. Maybe there’s some obscure website that provides this data, although I have not been able to find it. Maybe there’s a place where I can get the contents of a prior edition, which would be fine. I don’t need to have the most current version. Also, I only need the ‘F’s. If you don’t know where I can go to get this for free, or for very little money, I’m hoping that someone has an old copy of this data on a disc someplace, and would be able to upload it and send it to me. If you have a ClassWeb account, and your license would allow you to do this for me with the current data, that would be ideal.

Well, there is authorities.loc.gov - it isn’t what you’re looking for, but it IS a searchable (poorly) list of the subject headings. Doesn’t give you most of the rest of the information you need, of course - I use it with the big red books at hand.

Just being able to search the list isn’t sufficient; I would need to have the list in its entirety, or at least those headings that correspond to call numbers beginning with ‘F’.

Although, now that I come to think of it, if there’s a way to do a back-end query of one of these websites, that could work too. I’ve done that before using Java code and an NT script, but it’s been a long time, and the web page this code was running against was a good deal simpler.

Come to think of it, it doesn’t even have to be LCSH, but can be any comparable SH list. As long as I use the same set consistently, it doesn’t matter which it is.

authorities.loc might have a webmaster listed that you could contact…

Otherwise, I’m not sure that there ARE any freely-available LCSHs in cell-format, although you’d think that would be really useful. (I needed one desperately last term for classes, and ended up camping in my library’s TS department for a week or so during finals with the Red Books stacked around me like a fort.)

If you’re not limited to LCSH, you might have better luck, but I don’t know enough about those to advise.

As a total and unnecessary point of curiosity - what exactly are you building, and why?

Not a librarian, so I’m not sure if these are what you’re looking for, but there are two large downloads at id.loc.gov which look like they have the LCSH data, in two different formats.

Sorry, I don’t think you are going to get LCSH in a download. From their site (bolding mine):

  1. Are authority records available free of charge?

Yes. All authority information in Library of Congress Authorities is available free of charge via this Web site (authorities.loc.gov). Users do not have to register or request permission to search, save, print, or email the LC authority records. The only limitation is that authority records may only be saved, printed or emailed one at a time.

Most libraries pay for a subscription for authority records which include subject headings, so I don’t know that you will find a good set for free download.

Depending on what you are doing, however, LOC does (did, at least), make subject headings for other materials, like images, available for download.

I could send you what you want, but then I’d have to…

Sorry, I haven’t got the foggiest what you’re talking about but I couldn’t resist.

I think you can download LCSH here ( http://id.loc.gov/download/ ). However, you might need to do a bit of research to find out exactly what it is that you are downloading, both the format, and what data is encoded in that format.

This is an outside possibility, but there was a brief time when LCSH was available on CD-ROM, or so I’ve been told. If you could get ahold of one of those…?

I did find this, but couldn’t make head or tail of it. Well, at least so far; I need to take another look at it.

Yes, I know what you mean. I work with people who understand this stuff, but I don’t really understand the nuts and bolts of these formats myself. When I work with LCSH, it’s in databases that someone else has set up.

It’s not too hard to understand the basic correspondence with the printed form. The N-triples file is a plain text file, with three fields per line. The first field is the entry tag, the second gives the type of property described on this line, and the third field is the actual property. (No doubt this is not the right terminology–I’m not a librarian.) For most of the file (all but the header records), the first field just looks like


<http://id.loc.gov/authorities/sh85028738#concept>

where the number represents the main entry that this line describes–here “Columbia Plateau”. The entire entry for “Columbia Plateau” looks like this (I’ve stripped out the “http://…/” prefixes since they are mostly distractions):


<sh85028738#concept> <created> "1986-02-11T00:00:00-04:00"^^<XMLSchema#dateTime>.
<sh85028738#concept> <core#broader> <sh85103284#concept>.
<sh85028738#concept> <core#broader> <sh85103258#concept>.
<sh85028738#concept> <core#broader> <sh85103270#concept>.
<sh85028738#concept> <core#inScheme> <authorities#conceptScheme>.
<sh85028738#concept> <core#inScheme> <authorities#geographicNames>.
<sh85028738#concept> <22-rdf-syntax-ns#type> <core#Concept>.
<sh85028738#concept> <core#prefLabel> "Columbia Plateau"@en.
<sh85028738#concept> <core#altLabel> "Columbia and Snake River Plateau"@en.
<sh85028738#concept> <core#altLabel> "Columbia River Plateau"@en.
<sh85028738#concept> <core#altLabel> "Columbian Plateau"@en.
<sh85028738#concept> <core#altLabel> "Scabland, Channeled"@en.
<sh85028738#concept> <core#altLabel> "Channeled Scabland"@en.
<sh85028738#concept> <owl#sameAs> <info:lc/authorities/sh85028738>.
<sh85028738#concept> <modified> "2010-12-03T15:34:57-04:00"^^<XMLSchema#dateTime>.

Compare this to the LCSH entry here. The “#prefLabel” line about halfway down is the name of the entry, and the “#altLabel” lines just below that are alternate names (“UF”) lines. (There are no separate lines in this file for “USE” entries; those are just represented by the presence of these #altLabel lines.) The “#broader” lines give the sh<#> labels for the “BT” lines; so for example sh85103284 is the entry for “Plateaus–Washington (State)”. Some entries similarly have #narrower lines for “NT”.
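In case it helps, here is a rough Python sketch of that mapping: it reads N-triples lines, collects the #prefLabel, #altLabel and #broader lines per entry, and then inverts the #altLabel lines into a "USE" lookup. The function names (`load_headings`, `use_index`) are made up for illustration, the sample lines are copied from the stripped-down entry above, and the parsing is a simple regular expression that does not decode escape sequences, so treat it as a starting point rather than a full N-triples parser.

```python
import re
from collections import defaultdict

# One triple per line: <subject> <predicate> object .
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')
# A quoted literal such as "Columbia Plateau"@en (escapes are NOT decoded)
LITERAL = re.compile(r'^"((?:[^"\\]|\\.)*)"')

def load_headings(lines):
    """Collect prefLabel, altLabel (UF) and broader (BT) data per entry."""
    pref = {}
    alt = defaultdict(list)
    broader = defaultdict(list)
    for line in lines:
        m = TRIPLE.match(line.strip())
        if not m:
            continue  # blank or unrecognized line
        subj, pred, obj = m.groups()
        kind = pred.rsplit('#', 1)[-1]  # works with or without the http prefix
        if kind in ('prefLabel', 'altLabel'):
            lit = LITERAL.match(obj)
            if lit is None:
                continue
            if kind == 'prefLabel':
                pref[subj] = lit.group(1)
            else:
                alt[subj].append(lit.group(1))
        elif kind == 'broader':
            broader[subj].append(obj.strip('<>'))
    return pref, alt, broader

def use_index(pref, alt):
    """Invert the UF lines: alternate label -> heading(s) to USE instead."""
    index = defaultdict(list)
    for subj, labels in alt.items():
        for label in labels:
            index[label].append(pref.get(subj, subj))
    return index

SAMPLE = '''\
<sh85028738#concept> <core#broader> <sh85103284#concept>.
<sh85028738#concept> <core#prefLabel> "Columbia Plateau"@en.
<sh85028738#concept> <core#altLabel> "Columbia River Plateau"@en.
<sh85028738#concept> <core#altLabel> "Channeled Scabland"@en.
'''

pref, alt, broader = load_headings(SAMPLE.splitlines())
print(use_index(pref, alt)['Channeled Scabland'])
```

For the real download you would pass an open file object instead of `SAMPLE.splitlines()`, so the file is read one line at a time.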

Yes, that is them. Learn something new every day.

Spectre, can you use that? If not, I may be able to provide a large enough set for your needs.

It helps to see it laid out like this, which I couldn’t very well do with what I downloaded. The XML file I downloaded also contains the entire LCSH, so that can work also.

Now it’s just a question of gearing up my perl skills to parse it. As far as I was able to tell, it’s next to impossible to search for angle brackets in MS apps, because those seem to be the beginning/end-of-word markers.

Does the NT file correspond to the XML file on a one-to-one basis? It would be easier to use the XML file once I have extracted the elements that I need. The difficulty I’ve had to this point is that the XML file is too large to be loaded into any kind of editor or XML parser. This means that I can’t easily get a top-down look at how the file is structured, so I don’t know what exact start and end tags I should be looking for.

Once I’ve isolated the USE references, though, I think my system should be able to handle it. I’ve been boning up on perl and it looks like it should be pretty easy once I know the file structure.

The NT and RDF/XML files do appear to correspond (at least, each has the same number of main entries, 404643). [You probably want to use a utility like “more” (under DOS or Unix) to view the file a screen at a time, rather than try to read the whole file into a text editor.]
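If you would rather do the peeking and counting from a script, something along these lines should work without ever holding the whole file in memory. It is only a sketch: the helper names `head` and `count_matching` are my own, the sample lines are toy data, and the count assumes that each main entry carries exactly one prefLabel line, which I have not verified for every record.

```python
import io
from itertools import islice

def head(lines, n=20):
    """Return the first n lines without reading the rest of the input."""
    return list(islice(lines, n))

def count_matching(lines, marker):
    """Count lines that contain marker, reading one line at a time."""
    return sum(1 for line in lines if marker in line)

# Toy stand-in for the real file; with the download you would use
# open(<your file name>, encoding='utf-8') instead of io.StringIO.
sample = io.StringIO(
    '<sh85028738#concept> <core#prefLabel> "Columbia Plateau"@en.\n'
    '<sh85028738#concept> <core#altLabel> "Channeled Scabland"@en.\n'
    '<sh85028738#concept> <core#altLabel> "Columbian Plateau"@en.\n'
)

print(head(sample, 2))            # first two lines only
sample.seek(0)                    # rewind before counting
print(count_matching(sample, '#prefLabel>'))
```

For the XML file the corresponding marker would be `<skos:prefLabel`, assuming each element sits on its own line as in the excerpts in this thread.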

Here’s a top-level view of the XML file, with all of the <rdf:Description> tags collapsed:


<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <rdf:Description rdf:about="http://id.loc.gov/authorities#conceptScheme"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#personalNames"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#personalNamesChildren"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#corporateNames"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#corporateNamesChildren"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#meetings"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#meetingsChildren"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#uniformTitles"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#uniformTitlesChildren"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#chronologicalTerms"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#topicalTerms"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#geographicNames"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#geographicNamesChildren"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#genreFormTerms"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#generalSubdivision"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#geographicSubdivision"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#chronologicalSubdivision"/>
    <rdf:Description rdf:about="http://id.loc.gov/authorities#formSubdivision"/>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85018933#concept"/>
    [...404641 <rdf:Description #concept> tags omitted...]
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85136949#concept"/>
</rdf:RDF>

And here’s the (complete, uncollapsed) entry for “Columbia Plateau” (compare the NT version in my earlier post):


  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85028738#concept">
    <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1986-02-11T00:00:00-04:00</dcterms:created>
    <skos:broader rdf:resource="http://id.loc.gov/authorities/sh85103284#concept"/>
    <skos:broader rdf:resource="http://id.loc.gov/authorities/sh85103258#concept"/>
    <skos:broader rdf:resource="http://id.loc.gov/authorities/sh85103270#concept"/>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#conceptScheme"/>
    <skos:inScheme rdf:resource="http://id.loc.gov/authorities#geographicNames"/>
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel xml:lang="en">Columbia Plateau</skos:prefLabel>
    <skos:altLabel xml:lang="en">Columbia and Snake River Plateau</skos:altLabel>
    <skos:altLabel xml:lang="en">Columbia River Plateau</skos:altLabel>
    <skos:altLabel xml:lang="en">Columbian Plateau</skos:altLabel>
    <skos:altLabel xml:lang="en">Scabland, Channeled</skos:altLabel>
    <skos:altLabel xml:lang="en">Channeled Scabland</skos:altLabel>
    <owl:sameAs rdf:resource="info:lc/authorities/sh85028738"/>
    <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2010-12-03T15:34:57-04:00</dcterms:modified>
  </rdf:Description>
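Since the file is too big for an editor or a tree-building parser, one option is to stream it. Below is a minimal sketch using `iterparse` from Python's standard `xml.etree.ElementTree`, which hands you each `<rdf:Description>` as it is completed and lets you discard it afterwards. The generator name `iter_concepts` is hypothetical, the sample is a cut-down copy of the entry above, and I am assuming every entry is a direct child of `<rdf:RDF>`, as the top-level view suggests.

```python
import io
import xml.etree.ElementTree as ET

RDF = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}'
SKOS = '{http://www.w3.org/2004/02/skos/core#}'

def iter_concepts(source):
    """Yield (uri, prefLabel, altLabels, broader URIs) one entry at a time."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # first event is the start of <rdf:RDF>
    for event, elem in context:
        if event == 'end' and elem.tag == RDF + 'Description':
            pref = elem.findtext(SKOS + 'prefLabel')
            if pref is not None:  # skip descriptions without a heading
                alts = [e.text for e in elem.findall(SKOS + 'altLabel')]
                broader = [e.get(RDF + 'resource')
                           for e in elem.findall(SKOS + 'broader')]
                yield elem.get(RDF + 'about'), pref, alts, broader
            root.clear()  # drop finished entries so memory stays flat

SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://id.loc.gov/authorities#conceptScheme"/>
  <rdf:Description rdf:about="http://id.loc.gov/authorities/sh85028738#concept">
    <skos:broader rdf:resource="http://id.loc.gov/authorities/sh85103284#concept"/>
    <skos:prefLabel xml:lang="en">Columbia Plateau</skos:prefLabel>
    <skos:altLabel xml:lang="en">Columbia River Plateau</skos:altLabel>
    <skos:altLabel xml:lang="en">Channeled Scabland</skos:altLabel>
  </rdf:Description>
</rdf:RDF>
"""

for uri, pref, alts, broader in iter_concepts(io.BytesIO(SAMPLE_XML)):
    print(pref, alts, broader)
```

With the real download you would pass the file name (or an open binary file) to `iter_concepts` in place of the `io.BytesIO` sample, and write out only the fields you need as each entry goes by.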