Parsing Word Document XML via MsXml in VBA
An old friend ping me because while having some success loading word documents saved as xml in his VBA routines in word, he was having massive problems getting XPATH to work. I resolved the basic issue and then asked him to give me his hardest issue for me to show some example code for. I think that the last time that I did this was a few years ago, MsXml was (and is) a bit lame compared to later implementations, so his troubles were very reasonable to have.
Declaring name spaces
You can’t use XPath on a document with namespaces without providing the name spaces. The process is pretty simple, just convert the namespaces declare at the top of the XML document to a VB String. There happen to be many of them in a word xml document..
Dim domPrefix As String
domPrefix = "xmlns:wpc='http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas' " domPrefix = domPrefix + "xmlns:mc='http://schemas.openxmlformats.org/markup-compatibility/2006' " domPrefix = domPrefix + "xmlns:o='urn:schemas-microsoft-com:office:office' " domPrefix = domPrefix + "xmlns:r='http://schemas.openxmlformats.org/officeDocument/2006/relationships' " domPrefix = domPrefix + "xmlns:m='http://schemas.openxmlformats.org/officeDocument/2006/math' " domPrefix = domPrefix + "xmlns:v='urn:schemas-microsoft-com:vml' " domPrefix = domPrefix + "xmlns:wp14='http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing' " domPrefix = domPrefix + "xmlns:wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' " domPrefix = domPrefix + "xmlns:w10='urn:schemas-microsoft-com:office:word' " domPrefix = domPrefix + "xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' " domPrefix = domPrefix + "xmlns:w14='http://schemas.microsoft.com/office/word/2010/wordml' " domPrefix = domPrefix + "xmlns:w15='http://schemas.microsoft.com/office/word/2012/wordml' " domPrefix = domPrefix + "xmlns:wpg='http://schemas.microsoft.com/office/word/2010/wordprocessingGroup' " domPrefix = domPrefix + "xmlns:wpi='http://schemas.microsoft.com/office/word/2010/wordprocessingInk' " domPrefix = domPrefix + "xmlns:wne='http://schemas.microsoft.com/office/word/2006/wordml' " domPrefix = domPrefix + "xmlns:wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape'"
The next item is just loading the xml file, this was his existing code and worked fine
Set xmlDoc = New MSXML2.DOMDocument60 xmlDoc.validateOnParse = False xmlDoc.async = False xmlDoc.Load sFile If Not xmlDoc.Load(sFile) Then 'strXML is the string with XML' sErr = xmlDoc.parseError.ErrorCode & " " & xmlDoc.parseError.reason GoTo errXML End If
The next item was just setting two properties on the xmlDoc (unlike later implementations, there was no XmlNameSpaceManager needed). The key one is using the domPrefix string from above.
xmlDoc.setProperty "SelectionNamespaces", domPrefix xmlDoc.setProperty "SelectionLanguage", "XPath"
So if he wanted to find all of the tables, it was a simple:
Dim foundNodes As MSXML2.IXMLDOMNodeList Set foundNodes = xmlDoc.DocumentElement.SelectNodes("//w:tbl")
His tough problem
He wanted to find all images in the document, obtain their dimensions and the associated file name, plus some related stuff. The code was pretty clean to write with xPath, as shown below
' Find the actual drawing nodswe Set foundNodes = xmlDoc.DocumentElement.SelectNodes("//w:drawing") Debug.Print foundNodes.Length & " Drawings found" iNode2 = 1 For Each node In foundNodes Debug.Print "Image #" & iNode2 'We search from this node Set node2 = node.SelectSingleNode(".//wp:extent") Debug.Print node2.Attributes(0).BaseName & "=" & node2.Attributes(0).NodeValue Debug.Print node2.Attributes(1).BaseName & "=" & node2.Attributes(1).NodeValue Set node2 = node.SelectSingleNode(".//wp:docPr") Debug.Print node2.Attributes(2).BaseName & "=" & node2.Attributes(2).NodeValue 'Directly finding parent p node does not appear to be supported so.. Set node3 = node.ParentNode Do While node3.BaseName <> "p" Set node3 = node3.ParentNode Loop Set node2 = node3.SelectSingleNode(".//w:pStyle") Debug.Print node2.Attributes(0).BaseName & "=" & node2.Attributes(0).NodeValue iNode2 = iNode2 + 1 Next node
There were a few hiccups such as finding that ancestor: and descendent: did not appear to work, but there were easy work arounds for that.
The output is shown below
Friend helped, some memories refreshed, and a blog post that will hopefully help someone else!
Comments
Post a Comment