SAC Subcommittee on Metadata and Subject Analysis

Minutes for June 29, 1998

Washington, D.C.



Present: Casey, Culbertson, Dede, El-Hoshey, Geever, Glassel, Greenberg, Harken, Layne, Sandburg-Fox, Trumble, Wool



Diane Dates Casey, chair, opened the meeting with introductions of members and guests. She thanked Pat Kuhr and Bruce Trumble for compiling a metadata bibliography, which is available at the SAC Metadata Web site (http://www.govst.edu/users/gddcasey/sac/metadata.htm).



Sara Shatford Layne and Greg Wool gave a presentation, "Theoretical Exploration on Similarities & Differences Between Subject Analysis in General & Subject Analysis of Digital Resources." Wool began by distinguishing metadata as data about data from metadata as digital information about digital information, embedded metadata from third-party metadata, and web author-created metadata from cataloger-created metadata. Then, Layne posited that subject analysis of digital resources is different, but not fundamentally different, from traditional subject analysis. The purpose and process of subject analysis remain the same in either environment: 1) analyze and abstract a summary of the subject of an item; 2) choose appropriate subject headings with the appropriate level of specificity; 3) use a structured vocabulary; 4) establish preferred terms or collocate synonyms; 5) establish a syndetic structure for controlled vocabulary; 6) design systems so that the user retrieves relevant materials rather than large result sets in which many of the items are not germane to the search.



Wool responded with a discussion of new data elements, functionalities, and paradigms in the digital world. Under new data elements he noted increased use of multiple subject vocabularies at the same time, a greater need for mapping and crosswalks between subject vocabularies, as well as the use of data mining, and an explosion of searches using uncontrolled vocabulary or keywords.

His examination of new functionalities included the use of readily recognizable form subdivisions, subject hierarchies especially suited to the Internet, and software for automated text analysis and assignment of subject headings. Finally, he predicted that paradigm shifts will occur in the areas of an increased use of uncontrolled subject vocabulary, more web authors or publishers doing their own subject analysis, an expansion in analytics as a result of the modular nature of web documents, and a movement away from specificity towards broader subject headings. In contrast, Layne depicted these "new" elements as variations of the old, "new" functionalities as the same but with new tools, and "new" paradigms as shifts in emphasis rather than totally new. She foresaw increased integration of library and non-library subject analysis, as well as integration of human and automated subject analysis; new tools will implement the traditional functions of subject analysis.



Jane Greenberg and Shannon Hoffman spoke on "Subject Analysis and the Untrained Cataloger/Web Author." Hoffman began by emphasizing that subject analysis of Internet resources is important, because Internet users need to retrieve information relevant to their queries. However, the challenge is finding ways for web authors to provide subject analysis for their own web documents. Greenberg noted that, in the past, the primary tools for subject analysis have been subject heading lists, thesauri and classificatory schemes. She argued that a place remains for traditional tools of subject analysis. Additionally, the developing variety of metadata standards parallels the ISBD with data elements for subject, keyword and description. Hoffman discussed potential obstacles to web author-created subject analysis: a multiplicity of foreign languages on the Web; subject heading language becoming obsolete or politically incorrect; the complexity of traditional tools for subject analysis, such as LCSH; mistakes in subject analysis and the challenge of providing quality control; and the difficulty of nurturing cooperation among web authors and professional catalogers and indexers. Next, Greenberg highlighted the hopeful possibilities for embedded subject analysis. For example, web authors are already adding metatags which reflect subject analysis; most web authors could assign broad subject headings and upper-level class numbers; the success of PCC, SACO, and NACO demonstrates that cooperation is possible; templates and registries for subject analysis, such as the Nordic Project, are being developed; and, finally, mistakes are lessened by software for spell checking and grammar correction. Hoffman then shared a template which she developed to assist web authors in providing their documents with subject analysis. Her template would be part of an HTML editor which included the Dublin Core. As web authors prepared to move the web document to the server, they would be provided with a script which would allow them to add metatags containing controlled vocabulary as well as a classificatory number.
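
As an illustration only (the minutes do not reproduce Hoffman's template or script), the following Python sketch shows how such a script might emit metatags carrying controlled vocabulary and a class number. The function name dc_meta_tags, the "DC.Subject" naming with a "scheme" attribute, and the sample heading and class number are all assumptions chosen for the example, not details taken from the presentation.

```python
from html import escape


def dc_meta_tags(subjects, class_number=None, class_scheme="DDC"):
    """Return HTML <meta> lines for Dublin Core subject data.

    subjects: list of (scheme, term) pairs; scheme is None for an
    uncontrolled keyword. The attribute layout is an assumed convention.
    """
    lines = []
    for scheme, term in subjects:
        # Only controlled terms carry a scheme attribute.
        scheme_attr = f' scheme="{escape(scheme, quote=True)}"' if scheme else ""
        lines.append(
            f'<meta name="DC.Subject"{scheme_attr} '
            f'content="{escape(term, quote=True)}">'
        )
    if class_number:
        # The class number is recorded as subject data under its own scheme.
        lines.append(
            f'<meta name="DC.Subject" scheme="{escape(class_scheme, quote=True)}" '
            f'content="{escape(class_number, quote=True)}">'
        )
    return lines


# Hypothetical sample values, for illustration only.
tags = dc_meta_tags(
    [("LCSH", "Metadata"), (None, "web cataloging")],
    class_number="025.3",
)
print("\n".join(tags))
```

A web author would paste the printed lines into the head of the HTML document; escaping the content keeps quotation marks and ampersands in a heading from breaking the tag.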



This presentation was followed by a lively discussion on the role that librarians might play in training and developing tools for web authors. One of the challenges will be to nurture web authors in using the most specific subject headings to describe their documents. Additionally, librarians need to target the web authors whose material includes substantive information. Discussion also focused on the need to develop software which will map from search terms to a variety of controlled vocabularies. While LCSH terms may be viewed as complex in isolation, studies found that the subject headings were less troublesome in context with the LCSH syndetic structure. Librarians could promote cooperation in the task of applying subject analysis to web documents by having each institutional library take responsibility for cataloging the web documents published by its institution. Also, metatags need to be integrated into HTML editors.
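
The kind of mapping software raised in the discussion could, in its simplest form, be a lookup from a normalized search term to preferred terms in several vocabularies. The Python sketch below is a hypothetical illustration; the table CROSSWALK, the helper names, and the sample entries are assumptions made for the example and have not been checked against any authority file.

```python
# Hypothetical crosswalk: user search term -> preferred term per vocabulary.
# Entries are illustrative only.
CROSSWALK = {
    "heart attack": {"LCSH": "Myocardial infarction", "MeSH": "Myocardial Infarction"},
    "cars": {"LCSH": "Automobiles", "MeSH": "Automobiles"},
}


def normalize(term):
    """Lower-case the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def map_term(term, crosswalk=CROSSWALK):
    """Return {vocabulary: preferred term} for a search term, or {} if unknown."""
    return dict(crosswalk.get(normalize(term), {}))


print(map_term("  Heart   Attack "))
print(map_term("bicycles"))  # not in the sample table, so nothing is mapped
```

A production tool would draw on the vocabularies' own cross-references rather than a hand-built table; the sketch shows only the lookup step.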



Aimee Glassel spoke on "Projects using the Dublin Core for Subject Access." She mentioned three metadata models: creator-described metadata, trusted third-party metadata, and automatically generated metadata. Her presentation focused on the first two models. Information about Dublin Core projects can be found at the Dublin Core Web site (http://purl.org/metadata/dublin_core/). Glassel discussed the use of Dublin Core elements relevant to subject analysis at the Scout Report Signpost (http://www.signpost.org/), Florida International University Digital Library (http://www.fiu.edu/~diglib/) and Jonas Hallgrimsson: Selected Poetry and Prose (http://www.library.wisc.edu/etext/Jonas/). She noted that the final site was an example of creator-described metadata. To date, she had been unable to find an example of automatically generated metadata. Additionally, she stated that the AltaVista, HotBot and InfoSeek search engines are indexing metadata to provide another avenue of subject access for their users.



"Subject Data in the Dublin Core Metadata Record" was addressed by Lois Mai Chan. She began by noting the elements in the Dublin Core which are related to subject: content description -- subject and keywords in Element 3 ("Subject") and description in Element 4 ("Description"); form data -- resource type in Element 8 ("Type") and format in Element 9 ("Format"); language data in Element 12 ("Language"); and spatial or temporal data in Element 14 ("Coverage"). The purpose of Element 3, Chan noted, is to provide embedded data to enhance subject access to web documents. Subject data could include free-text or controlled vocabulary and classificatory numbers. In terms of verbal representation, she posited three possibilities: keyword, controlled vocabulary, or a combination of keyword and controlled vocabulary. Chan added that, while controlled vocabulary alone is theoretically a possibility, she felt it is not realistic. Both keywords and controlled vocabulary present problems. Keywords lack synonym and homonym control and fail to indicate subject term relationships. Chan raised the following questions concerning controlled vocabulary: "Are existing schemes suitable for use?"; "What needs to be modified in existing schemes and who will do it?"; "Who will develop new schemes?"; Is a "one size fits all" vocabulary a possibility? If so, which one?; "How do we harmonize terms from different vocabularies?"; "What about metathesauri, like Unified Medical Language System?"

She highlighted the following structural issues in controlled vocabulary: "What level of specificity is most desirable and suitable?"; Should syntax be "full string or single-concept descriptors?"; How will synonyms and homonyms be accommodated?; How will subject term relationships be noted? Additionally, Chan raised the following application issues of controlled vocabulary: consistency, summarization versus exhaustive indexing, placement of form, chronological and geographical data, and precoordination versus postcoordination. She identified the following strategies for implementing controlled vocabulary in the Dublin Core: "1) define the functional requirements of subject data in the metadata record; 2) examine and determine the suitability of existing controlled vocabularies for use in metadata records; 3) consider the possibility of providing online display of controlled vocabulary terms, similar to the lists that have been or are being developed for Resource type, Format, and Language lists; and 4) consider the possibility of designing interfaces that enable assigning subject terms by 'clicking and dragging' from a subject term list and linking author-input terms to controlled vocabulary." Then she raised the following questions about classificatory notation: "Should we encourage users to adopt, adapt, or modify existing schemes or develop new ones?"; "How suitable are existing schemes for use in metadata records?"; "Do we need close classification or will broad classification serve the purpose just as well?"; "Should class numbers and/or captions be included in the metadata record?"; What about "cross classification?"

Next, she suggested the following strategies for implementing classification data: "1) examine and determine the suitability of existing classification schemes for use in metadata records; 2) study the feasibility of assigning classification data through online display and interface; 3) encourage the development of mechanisms for automatically mapping controlled vocabulary to classification scheme." The overriding principles of the Dublin Core Record, Chan emphasized, are simplicity, semantic interoperability and flexibility. In conclusion, she proposed a "distribution of effort" for applying subject analysis to web documents among the web author/creator, who would "assign subject data" as embedded metadata; the information professional, who would "develop controlled vocabularies, set application policies, prepare guidelines, develop software for effective vocabulary control and retrieval, and train web authors/creators"; and the computer, which would perform "programmable and repetitive tasks."



A brief discussion of subject in the Dublin Core record took place. The Dublin Core will not be the only metadata standard reviewed by the subcommittee; however, it is designed to map to other metadata standards. At Midwinter 1999 the subcommittee members will develop a list of suggestions to define the subject field (data element 3) of the Dublin Core record. Between now and Midwinter in Philadelphia, members of the subcommittee will investigate the strategies enumerated by Chan. Additional areas to consider include the use of punctuation common to subject data, the methods by which web search engines search and index, both currently and in the future, and the proposals being made for other elements of the Dublin Core record.



Respectfully submitted,



Diane Dates Casey, Chair