
All data loaded into ScienceServer is converted from the publisher's native data format to an XML format specific to ScienceServer.  This internal format is defined in the scserv.dtd file and has gone through 2 versions.  The current version is 2.3.  All XML files in ScienceServer use UTF-8 encoding.  The 2.3 DTD supports metadata records, issue TOCs, journal TOCs, and full-text articles.  Data in MarkLogic will be converted to the NIH Journal Publishing and Archiving DTD.  Wherever possible, we will convert our current FTP-oriented download scripts to OAI harvesting.

We convert data from the following publisher formats:

Academic Press

-    AC Press files are in SGML format following the APJA 2.4 and 2.5 DTDs
-    not sure if the datasets have now been converted to the Elsevier SDOS format
-    not sure how AC Press backfiles have been shipped to us
-    there seems to be a mix of AC Press SGML files and Elsevier SGML files
-    we receive metadata and PDF files but not full-text XML

American Psychological Association

-    subscription was moved to CSA and Illumina in 2006
-    but we continue to receive copies from the Elsevier warehouse in SDOS format
-    we receive metadata and PDF but not full-text XML

American Chemical Society

-    datasets are in a simple SGML format but there is no DTD referenced in the SGML files
-    there is unhandled publisher markup in some journals
    o    <FNR>
    o    <BI>
-    the records are very brief with no cited references, no PDFs, and no full-text XML
-    we also have hundreds of PDFs for older material not currently loaded
-   can we get metadata for these backfiles?
-   the current data is metadata only -- no PDFs or XML full-text
-   how could we supplement the stored data? can we grab the HTML page for each article from the ACS site and strip data elements from there?


-   SGML files (by extension) are actually XML files with ISO-8859-1 encoding
-   the XML files do not reference a DTD
-   we receive metadata and PDFs
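
Since all XML in ScienceServer is UTF-8, files like these would need transcoding before conversion. The following is a minimal sketch, assuming each file is ISO-8859-1 and may carry an XML declaration naming that encoding. The function name `latin1_xml_to_utf8` is made up for illustration and is not part of the existing loader.

```python
import re

# Matches the encoding attribute inside a leading XML declaration, if present.
_DECL = re.compile(r'^(<\?xml[^>]*?encoding=["\'])([^"\']+)(["\'])')


def latin1_xml_to_utf8(raw: bytes) -> bytes:
    """Transcode ISO-8859-1 XML bytes to UTF-8 (hypothetical helper).

    Assumes the input really is ISO-8859-1; every byte sequence decodes
    under that codec, so a wrong assumption fails silently.
    """
    text = raw.decode("iso-8859-1")
    # Rewrite the declared encoding so it matches the new byte encoding.
    text = _DECL.sub(r"\g<1>UTF-8\g<3>", text, count=1)
    return text.encode("utf-8")
```

Files without an XML declaration pass through with only the byte-level transcoding applied.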


Blackwell

-    data is in XML format, encoded as UTF-8, and uses the Blackwell DTD (bpg4-0.dtd) and stylesheet (bgp4-0.xsl)
-    many journals have incomplete holdings, but the issues that are present look complete
-    some author names have spaces in the last name
-    <formula> tags show through in some titles
-    Canadian Geographer has French articles but the language encoding is not correct
-    <external_link> tag is not handled in Child and Adolescent Mental Health
-    not sure if the data format will be switching to Wiley after the recent acquisition
-    we receive metadata and PDFs but suppress the display of any journals with incomplete holdings, sending the user to the Blackwell Synergy site instead

Cambridge University Press

-    articles are mostly in SGML format using v 2.1 of the Cambridge Journal DTD (cupjnl2_1.dtd)
-    encoding of the SGML files is unknown
-    titles are being migrated to new XML format one by one
-    the new XML format is the NIH Journal Publishing DTD v 2.2
-    we are not loading XML titles until we can modify the data loader to handle them
-    we receive metadata and PDF
-    Cambridge SGML DTD

Emerald Publishing

-    Source data is SGML using mcb.dtd (no version numbering)
-    Lots of entities
-    the BDY tag holds the full text
-    unhandled <IT>, <UP>, and <DN> tags in citations and titles
-    abstract headings tagged with <b> but not <p>
-    many unhandled entities in affiliation field of Journal of Investment Compliance
-    conversion is pretty good overall
-    but there is also full text to convert, so it may be best to convert from source
-    we receive metadata, PDFs, and full-text markup
-    Emerald MCB DTD


Elsevier

-    we receive Effect v 4.1 data in SDOS format (current version is 3.0)
-    a dataset consists of a dataset.toc file, PDFs, and full-text SGML or XML articles with associated image files
-    the SGML articles follow the Elsevier FLA DTD 4.5 and the XML articles (introduced in March 2004) follow the FLA DTD 5.0 and use UTF-8 encoding
-    some older issues have PDF "extra pages" not associated with any article metadata
-    manifest types have varied over time, including TIFF, raw, and PDF variants
-    we need to figure out how to convert TIFF page images to PDF
-    we load the XML articles (after they are converted to the ScienceServer 2.3 DTD) and render them as HTML using a stylesheet developed for this purpose
-   we receive metadata, PDFs, and full-text XML
-   Elsevier SDOS Information Page


IEEE

-    until 2007, data was supplied in a tagged format with 3-digit numeric tags, e.g. (002) Journal of xxx
-    we receive different files for journals, conferences, and standards (ieescnf, ieeestd, ieeejrn)
-    we are only loading journals and only those with ISSNs since ScienceServer cannot handle other formats
-    IEEE (and INSPEC) are converting to XML using the ieee_idams_exchange.dtd
    o    Lots of use of CDATA sections
-    lots of TeX tags for math equations in titles and abstracts, which should be converted to MathML
-    look into texvc for conversion of LaTeX AMS math to HTML, MathML, or PNG
-   we have stopped loading IEEE titles until we can modify the loader to handle the new XML format
-   we receive metadata and PDFs
-   About IEEE XML DTD
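
For reference, reading the older 3-digit tagged format could look roughly like the sketch below. Only the `(002)` example comes from the notes above. The other tag numbers, the continuation-line rule, and the function name `parse_tagged_record` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# A field line starts with a three-digit tag in parentheses, e.g. "(002) Journal of xxx".
TAG_LINE = re.compile(r"^\((\d{3})\)\s*(.*)$")


def parse_tagged_record(text):
    """Group one record's lines by tag (hypothetical helper).

    Assumes untagged, non-blank lines continue the previous field, and
    that a tag may repeat (e.g. one line per author).
    """
    fields = defaultdict(list)
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            current = match.group(1)
            fields[current].append(match.group(2).strip())
        elif current is not None and line.strip():
            # Continuation line: append to the most recent value of the current tag.
            fields[current][-1] += " " + line.strip()
    return dict(fields)
```

Returning a list per tag keeps repeated fields intact, so the conversion to the ScienceServer DTD can decide later which tags are single-valued.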

Kluwer Academic and Plenum

-   not sure what the history is here since Kluwer has gone through some organizational changes over the years
-   e.g. the split-off of Kluwer Law and the purchase of Kluwer by Springer
-   some files are in SGML format with no DTD but a tag that indicates the format as "oases version 3"
-   lately, files seem to be coming from Springer in XML format using their A++ DTD v 4.2 with UTF-8 encoding
-   we receive metadata and PDFs

Kluwer Law

-    continues to use OASES but seems to have converted it to XML
-    uses oases-xml.dtd - lots of entities
-    we receive metadata and PDFs

Oxford University Press

-   until 2007 used SGML format and OUP Article Header DTD
-   has shipped us NIH XML content recently for backfiles which have been loaded into ML
-   has stopped shipping new content until they convert everything to NIH XML format
-   we receive metadata and PDFs
-   see this page for detailed information on Oxford's DTDs and on their OAI harvesting interface provided through HighWire
-   About new OUP XML DTD

Project MUSE

-    source data seems to be SGML because it lacks an XML header
-    does not reference a DTD or specify an encoding, though it seems to include UTF-8 encoded Unicode codepoints or ISO-8859-1 encoded high-ASCII characters
-    see Wei for a paper copy of the Muse DTD
-    lots of character conversion problems in some journals (e.g. Canadian Historical Association)
-   we receive only metadata from MUSE
-    MUSE is testing a new OAI harvesting interface
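
Because the MUSE files declare no encoding and appear to mix UTF-8 and ISO-8859-1, one common approach is to try strict UTF-8 first and fall back to ISO-8859-1. This is only a sketch of that heuristic, not what the loader currently does, and `decode_guessing` is a hypothetical name.

```python
def decode_guessing(raw: bytes):
    """Return (text, encoding_used) for bytes of unknown encoding.

    Heuristic: text containing ISO-8859-1 high characters is rarely valid
    UTF-8, so a strict UTF-8 decode that succeeds is trusted. ISO-8859-1
    accepts any byte sequence, so the fallback never raises.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"
```

Logging the encoding that was chosen per file would make it easier to see which journals account for the conversion problems.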

Royal Society of Chemistry

-    datasets are in XML format using UTF-8 encoding
-    uses rscart37.dtd with inline entity definitions for figures to be displayed as images
-    <fn> and <url> tags in title field not handled by the loader -- need to fix this
-    footnotes are handled properly in the display (see Chemical Communications 2007(24): 2485)
-    we receive metadata and PDFs
-    Entities File
-    About RSC DTD


SAGE

-    datasets are in XML using ISO-8859-1 encoding
-    uses SAGE_meta.dtd  (no version number)
-    includes full-text in metadata record which should be converted to XML but isn't
-   we receive metadata, PDFs, and full-text markup


Springer

-    oldest stuff uses MAJOUR SGML format
-    newer stuff includes DOCTYPE def and references Springer DTD Header v 1.2
-    files include entity references, e.g. &auml;
-    most recent files follow the Springer A++ XML DTD
-    Kluwer articles are showing up under Springer directory now
-    some math articles have "formula not shown" - need to handle these better
-    no cited references
-   we receive metadata and PDFs

Taylor and Francis

-    until recently datasets have been in SGML format
-    no DTD reference in header
-    UTF-8 seems to be the encoding format but isn't being handled properly by the loader
-    began receiving XML formatted documents in 2007 with the switch to InformaWorld
-    have stopped loading new content until loader can be rewritten
-    we receive metadata and PDFs
-    Taylor and Francis XML DTD


Wiley

-    currently uses SGML format but has plans to convert to XML
-    references the JWSART v 3.4 Wiley Journal Format DTD in header
-    lots of entities used
-    some entities show through unconverted, e.g. &lpar; and &hyphen;
-    some tags not handled by loader e.g. <TOGGLE>
-   we receive metadata and PDFs
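
Entities that show through could be caught with a lookup table applied before display. The sketch below is illustrative only: the three mappings follow the standard ISO entity names, but the table is nowhere near complete, and `replace_entities` is a hypothetical helper rather than part of the loader. It treats the trailing semicolon as optional, on the assumption that some SGML sources omit it.

```python
import re

# Tiny illustrative table; a real one would cover the full ISO entity sets.
ENTITY_MAP = {"lpar": "(", "rpar": ")", "hyphen": "-"}

# Entity name, with the closing semicolon optional (assumed to be omitted in some SGML).
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def replace_entities(text, table=ENTITY_MAP):
    """Replace known entity references and leave unknown ones untouched."""
    def substitute(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)
```

Leaving unknown references in place, rather than dropping them, keeps unhandled entities visible so they can be added to the table later.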

New Publishers not Yet Being Loaded

American Institute of Physics

-    uses the SPIN DTD and SGML files
-    is converting to XML in 2008
-    provides full-text XML articles


- can't figure out if this is article level data or not 

Abstract and Index Databases on Scholars Portal Search


- MARC records


- XML metadata and full text in APA's PIX format

Web of Science

- 2-letter tagged codes


- has recently implemented a new XML format
- have stopped updating this file until we can convert the loader
- provides the INSPEC thesaurus in XML format too


- MARC records


- XML records but not referencing a particular DTD


- MARC records


- 2-letter tagged codes, but switching to XML in late 2007
- will wait until then before loading this one

Thomson Gale

- XML using ISO-8859-1 encoding


- will load this one next
- GEOBASE Technical Site


- we have data in an MCSR directory on ScienceServer and have instructions on downloading via OAI
- have not figured out how to load this one yet


- full-text XML articles can be downloaded via OAI
- have tested but we are not downloading records yet
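
For the OAI downloads, a harvest request is just a URL with a few standard OAI-PMH arguments. The sketch below builds a `ListRecords` request. The base URL in the usage example is a placeholder, `oai_dc` is only the default metadata prefix every OAI-PMH repository must support (a full-text format would need a different, provider-specific prefix), and `list_records_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper).

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    given, no other argument except the verb may be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Example with a placeholder endpoint:
# list_records_url("https://example.org/oai", from_date="2007-01-01")
```

A harvester would fetch this URL, process the returned records, and repeat with the `resumptionToken` from each response until none is returned.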
