Sunday, January 26, 2014

Parsing Word Document XML via MsXml in VBA

An old friend ping me because while having some success loading word documents saved as xml in his VBA routines in word, he was having massive problems getting XPATH to work. I resolved the basic issue and then asked him to give me his hardest issue for me to show some example code for. I think that the last time that I did this was a few years ago, MsXml was (and is) a bit lame compared to later implementations, so his troubles were very reasonable to have.

 

Declaring name spaces

You can’t use XPath on a document with namespaces without providing the name spaces. The process is pretty simple, just convert the namespaces declare at the top of the XML document to a VB String. There happen to be many of them in a word xml document..

 

Dim domPrefix As String
domPrefix = "xmlns:wpc='http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas' "
domPrefix = domPrefix + "xmlns:mc='http://schemas.openxmlformats.org/markup-compatibility/2006' "
domPrefix = domPrefix + "xmlns:o='urn:schemas-microsoft-com:office:office' "
domPrefix = domPrefix + "xmlns:r='http://schemas.openxmlformats.org/officeDocument/2006/relationships' "
domPrefix = domPrefix + "xmlns:m='http://schemas.openxmlformats.org/officeDocument/2006/math' "
domPrefix = domPrefix + "xmlns:v='urn:schemas-microsoft-com:vml' "
domPrefix = domPrefix + "xmlns:wp14='http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing' "
domPrefix = domPrefix + "xmlns:wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' "
domPrefix = domPrefix + "xmlns:w10='urn:schemas-microsoft-com:office:word' "
domPrefix = domPrefix + "xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
domPrefix = domPrefix + "xmlns:w14='http://schemas.microsoft.com/office/word/2010/wordml' "
domPrefix = domPrefix + "xmlns:w15='http://schemas.microsoft.com/office/word/2012/wordml' "
domPrefix = domPrefix + "xmlns:wpg='http://schemas.microsoft.com/office/word/2010/wordprocessingGroup' "
domPrefix = domPrefix + "xmlns:wpi='http://schemas.microsoft.com/office/word/2010/wordprocessingInk' "
domPrefix = domPrefix + "xmlns:wne='http://schemas.microsoft.com/office/word/2006/wordml' "
domPrefix = domPrefix + "xmlns:wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape'"
The next item is just loading the xml file, this was his existing code and worked fine
    Set xmlDoc = New MSXML2.DOMDocument60
    xmlDoc.validateOnParse = False
    xmlDoc.async = False
    xmlDoc.Load sFile    
    If Not xmlDoc.Load(sFile) Then  'strXML is the string with XML'
        sErr = xmlDoc.parseError.ErrorCode & "  " & xmlDoc.parseError.reason
        GoTo errXML
    End If
The next item was just setting two properties on the xmlDoc (unlike later implementations, there was no XmlNameSpaceManager needed). The key one is using the domPrefix string from above.
    xmlDoc.setProperty "SelectionNamespaces", domPrefix
    xmlDoc.setProperty "SelectionLanguage", "XPath"

So if he wanted to find all of the tables, it was a simple:

Dim foundNodes As MSXML2.IXMLDOMNodeList  
Set foundNodes = xmlDoc.DocumentElement.SelectNodes("//w:tbl")

His tough problem

He wanted to find all images in the document, obtain their dimensions and the associated file name, plus some related stuff. The code was pretty clean to write with xPath, as shown below

 ' Find the  actual drawing nodswe
    Set foundNodes = xmlDoc.DocumentElement.SelectNodes("//w:drawing")
        Debug.Print foundNodes.Length & " Drawings found"
        iNode2 = 1
    For Each node In foundNodes
    Debug.Print "Image #" & iNode2
    'We search from this node
    Set node2 = node.SelectSingleNode(".//wp:extent")
    Debug.Print node2.Attributes(0).BaseName & "=" & node2.Attributes(0).NodeValue
    Debug.Print node2.Attributes(1).BaseName & "=" & node2.Attributes(1).NodeValue
    Set node2 = node.SelectSingleNode(".//wp:docPr")
        Debug.Print node2.Attributes(2).BaseName & "=" & node2.Attributes(2).NodeValue
    'Directly finding parent p node does not appear to be supported so..
    Set node3 = node.ParentNode
    Do While node3.BaseName <> "p"
        Set node3 = node3.ParentNode
    Loop
    
    Set node2 = node3.SelectSingleNode(".//w:pStyle")
        Debug.Print node2.Attributes(0).BaseName & "=" & node2.Attributes(0).NodeValue
     iNode2 = iNode2 + 1
  Next node
There were a few hiccups such as finding that ancestor: and descendent: did not appear to work, but there were easy work arounds for that.
The output is shown below
output
Friend helped, some memories refreshed, and a blog post that will hopefully help someone else!

2 comments: