Jeff,
Going back to basics: you are presently bulk-importing from flat-format files, and you said you hoped to minimize re-coding of existing systems, so let’s loop back to that concept.
Keep in mind that a big motivation for using XML is to allow transmitting complete text data (e.g., physician’s notes) instead of truncating it to a couple of thousand characters, so you don’t want to use the current NAACCR undelimited format. (These data items are identifiable in the NAACCR XML dictionary by allowUnlimitedText="true".) Also note that the plan is to eliminate the “starting position” attribute of NAACCR data items in the not-too-distant future, so positioning in a flat file will be something you’ll have to maintain with your own “record layout”. Boy, this is getting ugly really fast, isn’t it?
When importing from flat ASCII, do delimit the fields so that you can know where the text fields begin and end. But I recommend not delimiting with the pipe character (‘|’). In fact, don’t use any keyboard character because somebody is bound to embed it in a text field. I like the guillemet for this purpose (‘»’, typed from the keyboard with Alt+0187) because it is really unlikely to be typed during data entry. And you’ll need to flatten the CRLFs in text data; I think SQL Server understands “~” to be a linefeed, but you’d have to look into what Oracle uses.
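To make the delimiting and flattening idea concrete, here is a minimal Python sketch. The function names are made up for illustration, and the "~" line-break placeholder is only an assumption carried over from the guess above; swap in whatever your loader actually expects.

```python
DELIM = "\u00bb"      # the guillemet, »
NEWLINE_TOKEN = "~"   # assumed placeholder for embedded line breaks

def flatten_text(value):
    """Replace CRLF, lone CR, and lone LF with the placeholder token."""
    return (value.replace("\r\n", NEWLINE_TOKEN)
                 .replace("\r", NEWLINE_TOKEN)
                 .replace("\n", NEWLINE_TOKEN))

def build_record(values):
    """Join field values into one delimited line (hypothetical helper)."""
    cleaned = []
    for value in values:
        # Refuse to write a record if the delimiter is already in the data.
        if DELIM in value:
            raise ValueError("delimiter found inside a field: %r" % value)
        cleaned.append(flatten_text(value))
    return DELIM.join(cleaned)

def parse_record(line):
    """Split one delimited line back into its field values."""
    return line.split(DELIM)
```

Note that a pipe inside a text field passes through untouched, which is the point of choosing an unusual delimiter. The sketch does not guard against a "~" that is already present in the text, so that case would need its own rule.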
You’ll have to write the converter yourself; the freebie IMS and NPCR tools only perform conversions between XML and the traditional undelimited NAACCR flat record, not a delimited one.
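A home-grown converter could start out something like the Python sketch below. It rests on several assumptions: that the XML nests Patient and Tumor elements holding Item elements with a naaccrId attribute, that one output line per tumor is what you want, and that LAYOUT is your own record layout (the three IDs listed are placeholders for illustration). Items stored directly under the root element are ignored here, and a real file of any size would call for streaming rather than loading everything at once.

```python
import xml.etree.ElementTree as ET

DELIM = "\u00bb"  # the guillemet, »
# Hypothetical record layout: the field order you maintain yourself.
LAYOUT = ["patientIdNumber", "primarySite", "textDxProcPe"]

def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def items_of(element):
    """Map naaccrId -> text for the Item children of one element."""
    return {child.get("naaccrId"): (child.text or "")
            for child in element if local_name(child.tag) == "Item"}

def xml_to_delimited(xml_text, layout=LAYOUT):
    """Return one delimited line per Tumor, repeating the Patient items."""
    root = ET.fromstring(xml_text)
    lines = []
    for patient in root:
        if local_name(patient.tag) != "Patient":
            continue
        patient_items = items_of(patient)
        for tumor in patient:
            if local_name(tumor.tag) != "Tumor":
                continue
            merged = {**patient_items, **items_of(tumor)}
            # "~" as the line-break placeholder is an assumption.
            fields = [merged.get(name, "").replace("\r\n", "~").replace("\n", "~")
                      for name in layout]
            lines.append(DELIM.join(fields))
    return lines
```

Fields missing from a record come out as empty strings, so every line keeps the same number of delimiters and the bulk loader's column mapping stays stable.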
BTW, when you import data now are you loading into temporary tables, or are you loading straight into your production tables? And how long does it take you to load 100 thousand data files of Incident records each day? Again, I’m asking simply because I am finding this whole topic really interesting… so if you get bored with amusing me, feel free to ignore me!
Kathleen