MergeDocx

MergeDocx

INTRODUCTION

BASIC USAGE

Concatenating several entire docx

Concatenating parts of several docx

Inserting in a table cell

Resolving altChunk

Deleting part of a docx

OpenDoPE

MANY DOCUMENTS

SETTINGS

Page breaks

Controlling Headers and Footers

Page Numbering

Macros

Interaction between ODD_PAGE and Page Number restart

Bullets & Numbering

Styles

EVENT MONITORING

ADVANCED TOPICS

Document Defaults

overrideTableStyleFontSizeAndJustification

Paragraph spacing exception

Editing Document Defaults

 

INTRODUCTION

This chapter explains how to use the MergeDocx functionality, which appends (concatenates) docx files to create a single docx file.  For example, you can place a cover letter and a contract into a single docx file, without changing the look and feel of either document.

BASIC USAGE

Concatenating several entire docx

 

A BlockRange is essentially a WordprocessingMLPackage, or a range of content in a WordprocessingMLPackage, plus config settings.

To merge docx files, you invoke DocumentBuilder with List<BlockRange>:

 

        List<BlockRange> blockRanges = new ArrayList<BlockRange>();

        blockRanges.add( new BlockRange( wordMLPkg1 ) );

        blockRanges.add( new BlockRange( wordMLPkg2 ) );

        // etc

       

        // Perform the actual merge

        DocumentBuilder documentBuilder = new DocumentBuilder();

        WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);

 

You can fine tune the merge process by configuring individual block ranges, or the DocumentBuilder object, as described in the SETTINGS section below.

 

The samples directory contains an example called MergeWholeDocumentsUsingBlockRange which you can use as a starting point.

 

Alternatively, there is a webapp which can generate code for you, based on your chosen configuration.

 

Note: there is also a static method you can use to merge a List<WordprocessingMLPackage>, but that is not recommended since it precludes user config of DocumentBuilder and individual BlockRanges. 

 

If you invoke DocumentBuilder with List<BlockRange>, obviously all your BlockRanges are in memory at once.   DocumentBuilderIncremental is a more memory efficient approach which avoids this.  See MANY DOCUMENTS further below.

 

Concatenating parts of several docx

If you wish to use only a certain part of each document, you need to invoke DocumentBuilder with a List<BlockRange> in which each BlockRange specifies the range you want.

 

BlockRange associates a range with a WordprocessingMLPackage. 

The org.docx4j.wml.Body element has a method:

    public List<Object> getEGBlockLevelElts()

 

which returns the "block-level" document content (paragraphs, tables etc).[1]

BlockRange constructors let you say you want the contents starting from the nth element onwards:

    /**

     * Specify the source package, from "n" (0-based index) to the end of the document **/

    public BlockRange(WordprocessingMLPackage wordmlPkg, int n)

 

or count elements from the nth element:

    /**

     * Specify the source package, from "n" (0-based index) and include "count"

     * block-level (paragraph, table etc) elements. **/

    public BlockRange(WordprocessingMLPackage wordmlPkg, int n, int count)

 

or the entire docx:

    /**

     * Specify the entire source package. **/

    public BlockRange(WordprocessingMLPackage wordmlPkg)

   

 

For example:

              List<BlockRange> blockRanges = new ArrayList<BlockRange>();

              blockRanges.add(new BlockRange(wmlPkgIn));       // add all

              blockRanges.add(new BlockRange(wmlPkgIn, 0, 6)); // paras 0-5

              blockRanges.add(new BlockRange(wmlPkgIn, 6));    // paras 6 onwards

             

              DocumentBuilder documentBuilder = new DocumentBuilder();

              WordprocessingMLPackage output =

                            documentBuilder.buildOpenDocument(blockRanges);

 

The result is a new WordprocessingMLPackage containing the specified portions of the source documents.
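To make the index arithmetic concrete, here is a minimal sketch which models the range constructors on an ordinary Java list.  It does not use MergeDocx at all: the class and method names (RangeSketch, fromN, fromNCount) are hypothetical, and plain strings stand in for block-level elements.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the ranges selected by the BlockRange constructors.
// Plain strings stand in for paragraphs and tables; no MergeDocx classes are used.
class RangeSketch {

    // Models BlockRange(pkg, n): element n (0-based) to the end of the document.
    static <T> List<T> fromN(List<T> blocks, int n) {
        return blocks.subList(n, blocks.size());
    }

    // Models BlockRange(pkg, n, count): "count" elements starting at element n.
    static <T> List<T> fromNCount(List<T> blocks, int n, int count) {
        return blocks.subList(n, n + count);
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7");
        System.out.println(fromNCount(blocks, 0, 6)); // elements 0-5
        System.out.println(fromN(blocks, 6));         // elements 6 onwards
    }
}
```

Note that the second argument of the three-argument form is a count, not an end index.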

The samples directory contains an example called MergeBlockRangeFixedN.

Where you want to use the nth element constructors, how do you determine n?  See Determining the nth element below.

You may use the one WordprocessingMLPackage in more than one BlockRange.  For example:

              List<BlockRange> blockRanges = new ArrayList<BlockRange>();

              blockRanges.add(new BlockRange(wmlpkg1, 12));

              blockRanges.add(new BlockRange(wmlpkg2, 3, 3));

              blockRanges.add(new BlockRange(wmlpkg1));  // Use wmlpkg1 again

 

You must not, however, use the same BlockRange object twice.  For example, the following usage is incorrect:

              BlockRange blockRange1 = new BlockRange(wmlpkg1, 12);

              List<BlockRange> blockRanges = new ArrayList<BlockRange>();

              blockRanges.add(blockRange1);

              blockRanges.add(new BlockRange(wmlpkg2, 3, 3));

              blockRanges.add(blockRange1);  // Incorrect
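If your list is assembled in several places, you may want to guard against this mistake before merging.  The following is a minimal sketch of such a guard; the class NoReuseCheck is hypothetical and not part of MergeDocx.  It accepts any list, so it could be applied to a List<BlockRange>.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical guard: reports whether the same object instance appears twice in a list.
class NoReuseCheck {

    static boolean hasRepeatedInstance(List<?> items) {
        // Identity-based set: membership is decided with ==, not equals()
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        for (Object item : items) {
            if (!seen.add(item)) {
                return true; // this exact instance is already in the list
            }
        }
        return false;
    }
}
```

Two distinct BlockRange objects over the same package pass this check, which matches the rule above: it is the BlockRange object, not the package, which must not be repeated.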

 

Determining the nth element


As explained above, BlockRange constructors let you say you want the contents starting from the nth element onwards:

    /**

     * Specify the source package, from "n" (0-based index) to the end of the document **/

    public BlockRange(WordprocessingMLPackage wordmlPkg, int n)

 

or count elements from the nth element:

    /**

     * Specify the source package, from "n" (0-based index) and include "count"

     * block-level (paragraph, table etc) elements. **/

    public BlockRange(WordprocessingMLPackage wordmlPkg, int n, int count)

 

The question arises as to how to work out these numbers.

There are three approaches for finding the relevant block:

·         manually

·         via XPath

·         via TraversalUtil

TraversalUtil is the recommended approach.  This is mainly because there is a limitation to using XPath in JAXB (see below).

Explanations of the three approaches follow.

Common to all of them however, is the question of how to identify what you are looking for. 

·         Paragraphs don't have IDs, so you might search for a particular string.

·         Or you might search for the first paragraph following a section break.

·         A good approach is to use content controls (which can have IDs), and to search for your content control by ID, title or tag.

The examples provided show how to do each of these.  They can be readily adapted for other cases, such as before or after a table or image.  If you have any difficulties with your particular case, please do not hesitate to ask for support.

Manual approach

The manual approach is to iterate through the block-level elements in the document yourself, looking for the paragraph, table or content control which matches your criteria.  To do this, use the org.docx4j.wml.Body method:

    public List<Object> getEGBlockLevelElts()
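Since that method returns an ordinary List<Object>, finding n amounts to a linear search.  Here is a minimal sketch, assuming you express your criteria as a predicate; the class FindN and its method are hypothetical, and strings stand in for the paragraph and table objects of a real document (where elements may also be wrapped, for example in a JAXBElement, so a real predicate may need to unwrap them first).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for the manual approach: a linear search over the
// block-level list. The predicate encodes your own matching criteria.
class FindN {

    // Returns the 0-based index of the first matching element, or -1 if none matches.
    static int indexOfFirst(List<Object> blocks, Predicate<Object> matches) {
        for (int i = 0; i < blocks.size(); i++) {
            if (matches.test(blocks.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Strings stand in for the objects found in a real document body.
        List<Object> blocks = Arrays.<Object>asList("Cover letter", "Dear Sir", "CONTRACT", "Clause 1");
        int n = indexOfFirst(blocks, b -> b.toString().contains("CONTRACT"));
        System.out.println("n = " + n);
    }
}
```

The index returned is the n you would pass to a BlockRange constructor; check for -1 before doing so.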

 

XPath approach

Underlying this approach is the use of XPath to select JAXB nodes:

        MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

        String xpath = "//w:p";        

        List<Object> list = documentPart.getJAXBNodesViaXPath(xpath, false);


You then find the index of the returned node in EGBlockLevelElts.

Beware, there is a limitation to using XPath in JAXB: the xpath expressions are evaluated against the XML document as it was when first opened in docx4j.  You can update the associated XML document once only, by passing true into getJAXBNodesViaXPath. Updating it again (with current JAXB 2.1.x or 2.2.x) will cause an error. So you need to be a bit careful!

TraversalUtil approach

TraversalUtil is a general approach for traversing the JAXB object tree in the main document part.  TraversalUtil has an interface Callback, which you use to specify how you want to traverse the nodes, and what you want to do to them.

TraversalUtil can be used to find a node; you then get the index of the returned node in EGBlockLevelElts.

Examples are in the samples directory, named as follows:

 

Search for        Manually                                  via XPath                                via TraversalUtil

String            MergeBlockRangeNViaManualString           MergeBlockRangeNViaXPathString           MergeBlockRangeNViaTraversalUtilsString

SectPr            MergeBlockRangeNViaManualSectPr           MergeBlockRangeNViaXPathSectPr           MergeBlockRangeNViaTraversalUtilsSectPr

Content control   MergeBlockRangeNViaManualContentControl   MergeBlockRangeNViaXPathContentControl   MergeBlockRangeNViaTraversalUtilsContentControl

 

Inserting in a table cell


The approach described above doesn’t allow you to insert contents into a table cell.

To do this, you can either use the class ProcessAltChunk as described in Resolving altChunk below, or you can use a placeholder to indicate where you want a BlockRange to be inserted.

The placeholder is a content control containing a tag such as:

 

       <w:tag w:val="MergeDocx:BlockRangeIDREF=myTableContent"/>

 

in this case referencing a BlockRange having ID “myTableContent”. 

 

Inside a table cell, the complete placeholder would look something like this:

 

        <w:tc>

          <w:sdt>

            <w:sdtPr>

              <w:tag w:val="MergeDocx:BlockRangeIDREF=myTableContent"/>

            </w:sdtPr>

            <w:sdtContent>

              <w:p>

                <w:r>

                  <w:t>My placeholder7</w:t>

                </w:r>

              </w:p>

            </w:sdtContent>

          </w:sdt>

        </w:tc>

 

 

The BlockRange which will be placed at this location, is given a matching ID:

 

       blockrange.setID("myTableContent");

 

You may then invoke DocumentBuilder in the usual way.  The result will be that the contents of the table cell are replaced with the contents of the block range.
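Because the link between placeholder and BlockRange is just a string convention, a typo in either place is easy to make.  The sketch below shows one way you might extract the ID from a tag value in your own code, for example to check that every placeholder has a matching BlockRange before merging.  The class PlaceholderTag is hypothetical; it is not how MergeDocx itself is implemented.

```java
// Hypothetical helper: extracts the BlockRange ID from a content control tag value
// of the form "MergeDocx:BlockRangeIDREF=<id>". Not part of MergeDocx.
class PlaceholderTag {

    static final String PREFIX = "MergeDocx:BlockRangeIDREF=";

    // Returns the referenced ID, or null if the tag is not a MergeDocx placeholder.
    static String blockRangeIdRef(String tagValue) {
        if (tagValue == null || !tagValue.startsWith(PREFIX)) {
            return null;
        }
        String id = tagValue.substring(PREFIX.length());
        return id.isEmpty() ? null : id;
    }
}
```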

 

This works in a similar way to AltChunk processing (see Resolving altChunk below); in both cases you can insert the block range at locations where block/paragraph-level content is allowed.

For the best practice approach, please see the end of this section.  The interim content works up to that by describing alternatives.

 

The simplest approach is to add your ID’d block ranges to the blockRanges list before the ‘real’ documents:

 

                List<BlockRange> blockRanges = new ArrayList<BlockRange>();

                BlockRange block;

               

                // Define insertions

                block = new BlockRange(insertionDocx,1,1);

                block.setID("MySourceId");

                blockRanges.add( block );

               

                // Now add inputDocx1 proper

                block = new BlockRange(inputDocx1);

                blockRanges.add( block );

 

                // Perform the actual merge

The reason for this is that if instead the block range(s) being moved is/are last, then after it is/they are moved, the sectPr at the end of the previous block range is left untouched, and is now adjacent to the document level sectPr.  (The step of moving things around is the very last step in the MergeDocx process).

The downside of having your ID’d block ranges at the start of the blockRanges list, is that certain document wide defaults come from there.

If you have them at the end of the blockRanges list, you’ll get two sectPr elements at the end of the document (the first belonging to the immediately prior block range, and the document level one an artifact from the block range which was moved).  For example:

        <w:p>

            <w:pPr>

                <w:sectPr w:rsidR="00D37ADB">

                    <w:pgSz w:h="16838" w:w="11906"/>

                    <w:pgMar w:gutter="0" w:footer="708" w:header="708" w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>

                    <w:cols w:space="708"/>

                    <w:docGrid w:linePitch="360"/>

                </w:sectPr>

            </w:pPr>

        </w:p>

        <w:sectPr>

            <w:pgSz w:h="16838" w:w="11906"/>

            <w:pgMar w:gutter="0" w:footer="708" w:header="708" w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>

            <w:cols w:space="708"/>

            <w:docGrid w:linePitch="360"/>

        </w:sectPr>

This is harmless enough, but if you wanted to fix it, you could in your own code programmatically delete the document level one, and you could also promote the sectPr from the last paragraph.

You wouldn’t want to setSectionBreakBefore(SectionBreakBefore.NONE) on the block range not being moved, since although you’ll end up with only one sectPr, it is the wrong one!

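Finally, here is the best practice: add the main document first with SectionBreakBefore.NONE, then the ID’d block range, and lastly an empty block range taken from the main document again, which supplies the body level sectPr.  Like so: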

                WordprocessingMLPackage pkg1 = WordprocessingMLPackage.load(new File(file1));

                BlockRange source1 = new BlockRange(pkg1);

                source1.setSectionBreakBefore(SectionBreakBefore.NONE); // note this

               

                BlockRange tableContent = new BlockRange(WordprocessingMLPackage.load(new File(file2)));

                tableContent.setID("myTableContent");

               

                List<BlockRange> sources = new ArrayList<BlockRange>();

                sources.add(source1);

                sources.add(tableContent);

               

                // Add pkg1 again for our body level sectPr

                BlockRange emptyBR = new BlockRange(pkg1, 0, 0);  // none of the contents - just sectPr

                sources.add(emptyBR);

Resolving altChunk

altChunk is a way of telling a consuming application that certain content is to be included in the document.

For further details, please see http://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx

Word 2007 understands what to do with an altChunk.

docx4j doesn't, unless you use the MergeDocx utility (or write your own code).  If your docx contains altChunks, it is important to be able to resolve them if you want to generate HTML or PDF output using docx4j.

MergeDocx handles altChunk of type docx, as opposed to html or plain text.  Support for altChunk of type xhtml is available in the docx4j-ImportXHTML jar.

The class ProcessAltChunk contains a method:

 

       public static WordprocessingMLPackage process(WordprocessingMLPackage srcPackage) throws Docx4JException

 

which will process docx altChunks in the Main Document Part (document.xml).

There is also the option to specify  how styles are handled:

        /**

         * Process srcPackage, replacing all alt chunks of type docx (as

         * opposed to HTML etc), with proper document content.

         *

         * @param srcPackage

         * @param styleHandler StyleHandler.USE_EARLIER or RENAME_RETAIN

         * @return

         * @throws Docx4JException

         * @since 3.2

         */

        public static WordprocessingMLPackage process(WordprocessingMLPackage srcPackage,

                        StyleHandler styleHandler) throws Docx4JException

Limitations/recommendations:

·         We recommend you avoid setting headers/footers in your altChunk.

Microsoft Word does strange things when an altChunk contains headers/footers; currently, MergeDocx does not attempt to duplicate this behaviour. 

·         altChunk elements in parts other than the Main Document Part (eg headers/footers, footnotes/endnotes and comments) are not converted.

Any comments and footnotes/endnotes in the altChunk should get added OK.

Deleting part of a docx

If you want to delete part of a docx, including the parts it references but which will no longer be used, you can use the constructor:

    /**

     * Specify the source package, from "n" (0-based index) and include "count"

     * block-level (paragraph, table etc) elements. **/

    public BlockRange(WordprocessingMLPackage wordmlPkg, int n, int count)

 

twice on the one input document, adding the bit before the stuff to be deleted, and the bit after it.
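The two (n, count) pairs follow directly from the position and size of the span to be deleted.  A minimal sketch of the arithmetic, assuming you already know the total number of block-level elements; the class KeepRanges is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: given the total number of block-level elements and the span
// to delete, returns the {n, count} pairs to pass to BlockRange(pkg, n, count).
class KeepRanges {

    static List<int[]> around(int total, int deleteFrom, int deleteCount) {
        if (deleteFrom < 0 || deleteCount < 0 || deleteFrom + deleteCount > total) {
            throw new IllegalArgumentException("span to delete is outside the document");
        }
        List<int[]> keep = new ArrayList<int[]>();
        if (deleteFrom > 0) {
            keep.add(new int[] {0, deleteFrom});        // the bit before
        }
        int after = deleteFrom + deleteCount;
        if (after < total) {
            keep.add(new int[] {after, total - after}); // the bit after
        }
        return keep;
    }
}
```

For a document of 10 elements from which elements 3 and 4 are to be deleted, this yields (0, 3) and (5, 5).  Each pair would become its own BlockRange over the same input package; you would probably also want SectionBreakBefore.NONE on the second so that no section break is introduced at the join.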

OpenDoPE

Background

The Open XML specification includes a technology called “Custom XML data binding”, which can be used in document automation and reporting scenarios to automatically inject data from an XML document of your choosing into your docx.

If a content control has an XPath, that XPath is used to retrieve the matching element from your XML document.

OpenDoPE (Open Document Processing Ecosystem) is a set of conventions for tagging a content control to enable:

·         conditional content

·         repeating content (eg rows of a table, or a bulleted or numbered list)

docx4j is the reference implementation of OpenDoPE.

MergeDocx support for OpenDoPE

You can use MergeDocx and OpenDoPE together.  Support for combining these technologies was significantly improved in MergeDocx v1.5.0.

You can use MergeDocx first, and then docx4j’s OpenDoPEHandler.

Or you can use docx4j’s OpenDoPEHandler first, then MergeDocx.

Either order is supported, but it is probably more efficient to use MergeDocx first, followed by OpenDoPEHandler.  If you plan to use MergeDocx first, and your documents include compound conditions (ie and|or|not operators), you must use docx4j 3.0.

MergeDocx is designed to ensure that each of the input docx uses its own OpenDoPE parts and XML answers, without interfering with the other input docx.

There are two approaches to supplying the XML answer files.

The first approach is to inject an appropriate answer file into each input docx before invoking MergeDocx and OpenDoPEHandler.  This is the approach which would be familiar to OpenDoPE users.

A second approach is to tell MergeDocx a Map of W3C DOM Documents containing answers which are to be used across the input documents.  The map is keyed by root element QName.  With this approach, you can skip the preliminary step of injecting real XML data into each input docx.

For example, suppose you were merging 3 documents, of which 2 used an answer file with root element <supplier> and one used an answer file with root element <specification>.

With:

        Map<QName, org.w3c.dom.Document> answerDomDocs


you can set:

 

        documentBuilder.setOpenDoPEAnswers(answerDomDocs);

and the values supplied will be used in preference to whatever XML part (with corresponding root element QName) is in the input docx.

There is a helper class OpenDoPeRegistration, which adds an InputStream representation of your XML, to the Map<QName, org.w3c.dom.Document>.

        Map<QName, org.w3c.dom.Document> answerDomDocs = new HashMap<QName, org.w3c.dom.Document>();

        InputStream is = FileUtils.openInputStream(new File("supplier.xml"));

        OpenDoPeRegistration.register(answerDomDocs, is);

        is = FileUtils.openInputStream(new File("specification.xml"));

        OpenDoPeRegistration.register(answerDomDocs, is);

       

        documentBuilder.setOpenDoPEAnswers(answerDomDocs);
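To illustrate what "keyed by root element QName" means, here is a minimal sketch which builds such a map using only the JDK's XML classes.  The class AnswerMapSketch is a hypothetical stand-in written for this explanation; it is not the implementation of OpenDoPeRegistration, which may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical stand-in: keys each answer document by the QName of its root element.
// javax.xml.parsers is the JDK's XML parser, unrelated to MergeDocx's DocumentBuilder.
class AnswerMapSketch {

    static QName register(Map<QName, Document> answers, InputStream xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for getNamespaceURI()/getLocalName()
            Document doc = dbf.newDocumentBuilder().parse(xml);
            Element root = doc.getDocumentElement();
            QName key = new QName(root.getNamespaceURI(), root.getLocalName());
            answers.put(key, doc);
            return key;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse answer XML", e);
        }
    }

    // Convenience for trying the sketch with inline XML instead of files.
    static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }
}
```

With the example above, the map would end up with two entries, one for supplier and one for specification; registering a second document with the same root element replaces the first.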

OpenDoPE processing of rich text fragments

OpenDoPE also allows you to bind to an XML node containing:

In both cases, docx4j will convert that to docx content.

In the Flat OPC XML case, it converts it to an AltChunk (see previous section).  MergeDocx can then convert the AltChunk to native document content.

MANY DOCUMENTS

If you invoke DocumentBuilder with List<BlockRange>, obviously all your BlockRanges are in memory at once.

If you are merging many documents, or even a smaller number of large documents, you may run out of memory.

DocumentBuilderIncremental is intended to help in this situation.  It allows you to work with a single BlockRange at a time.

Example of usage:

        DocumentBuilderIncremental dbi = new DocumentBuilderIncremental();

 

        for (int i = 0; i < MAX; i++) {

 

            BlockRange block = getBlockRange(i);  // Your method

 

            block.setSectionBreakBefore(BlockRange.SectionBreakBefore.NEXT_PAGE);

            if (i==0) {

                block.setHeaderBehaviour(BlockRange.HfBehaviour.DEFAULT);

                block.setFooterBehaviour(BlockRange.HfBehaviour.DEFAULT);

            } else {

                // Avoid creating unnecessary additional header/footer parts

                block.setHeaderBehaviour(BlockRange.HfBehaviour.INHERIT);

                block.setFooterBehaviour(BlockRange.HfBehaviour.INHERIT);      

            }

           

            System.out.println(i);

            dbi.addBlockRange(block, i==(MAX-1) );  // 2nd param is whether this is your last docx

        }

       

        WordprocessingMLPackage output = dbi.finish();// Get the output docx

In the example above, the headers/footers are taken from the first document only.  This avoids creating potentially thousands of header/footer parts, where just a couple suffice.

 

 

SETTINGS

Page breaks

MergeDocx ensures that each document is separated by a section properties element.  The relevant properties are actually contained in the first sectPr element in the second of any two BlockRanges. 

In other words, if 3 documents are concatenated, and each is just a single section, the resulting document will contain 3 sections.

By default each section starts on a new page.

If you want to avoid the page break, use BlockRange's setSectionBreakBefore method:

              BlockRange blockRange1 = ...

              BlockRange blockRange2 = ...

             

              // avoid page break

              blockRange2.setSectionBreakBefore(SectionBreakBefore.CONTINUOUS);

             

              // etc.


The MergeBlockRangeFixedN sample utilizes this.

Your choices for the SectionBreakBefore property are:

·         NONE

·         NONE_MERGE_PARAGRAPH

·         NEXT_PAGE

·         NEXT_COLUMN

·         CONTINUOUS

·         EVEN_PAGE

·         ODD_PAGE

With the exception of "NONE" and "NONE_MERGE_PARAGRAPH" these mirror values available in Word.

Since what happens between documents is controlled by the first sectPr in the second of the two documents, MergeDocx will set the first sectPr in the second document with the value specified.  If there is no sectPr, it will add one at the end of the BlockRange and set that.

"NONE" is a bit different.  In this case, no sectPr will be added, and nor will any existing sectPr be altered.  So you can think of it as "unspecified".   NONE can be useful if you want to manipulate sectPr values in your own code.

"NONE_MERGE_PARAGRAPH" will attempt to merge the last paragraph of the previous block range with the first paragraph of this one.

 

If you leave the property unset, MergeDocx will add a sectPr if one is not present.  MergeDocx will not set its type.  If the type is not set, the default is NEXT PAGE, according to the OpenXML spec.

Note:  in Word, by default, ODD_PAGE is not honoured if you have set page numbering to restart.  Please see the section after Page Numbering below for details as to how to control this behaviour.

Also, Word will ignore a “continuous” setting, and insert a page break, if it detects that the page sizes of the two contiguous sections are different.  This can produce unexpected results where, for example, both page sizes are intended to be A4 portrait, but specified in units which differ (for whatever reason) by a few mm.  The sample NormalizePageSizes contains code which demonstrates how to address this issue.
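The idea behind normalising page sizes can be sketched as follows.  This is not the code of the NormalizePageSizes sample; the class PageSizeSketch, the snap method and the tolerance value are assumptions made for illustration.  Sizes are in twentieths of a point, as in the w:pgSz element.

```java
// Hypothetical sketch of page size normalisation; not the NormalizePageSizes sample.
// Sizes are in twentieths of a point, as in the w:w and w:h attributes of w:pgSz.
class PageSizeSketch {

    static final int A4_WIDTH = 11906;   // w:pgSz w:w for A4 portrait
    static final int A4_HEIGHT = 16838;  // w:pgSz w:h for A4 portrait

    // Snaps a dimension to the reference value when it is within the tolerance.
    static int snap(int value, int reference, int tolerance) {
        return Math.abs(value - reference) <= tolerance ? reference : value;
    }

    public static void main(String[] args) {
        int tolerance = 170; // assumed tolerance, roughly 3 mm (1 mm is about 56.7 units)
        System.out.println(snap(11900, A4_WIDTH, tolerance));  // close enough: treated as A4
        System.out.println(snap(12240, A4_WIDTH, tolerance));  // US Letter width: left alone
    }
}
```

Applying the same snapping to the page sizes of both contiguous sections, before merging, makes them identical so that Word can honour the continuous break.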

Controlling Headers and Footers

Suppose you are merging docx1 and docx2.

The default behaviour is as follows:

·         If docx1 has a header, and docx2 does as well, then by default both sets of headers will be used.

·         If docx1 has a header, but docx2 doesn't, then by default the pages from docx2 will be shown using headers from docx1.

You can override this behaviour:

·         if you want no headers defined in the first section of docx2:

              blockRange2.setHeaderBehaviour(HfBehaviour.NONE);

·         if docx2 has headers defined in its first sectPr, but you want to ignore them and use the headers from docx1:

              blockRange2.setHeaderBehaviour(HfBehaviour.INHERIT);

There is a similar method for controlling footer behaviour, called setFooterBehaviour.

Page Numbering

Suppose you are merging docx1 and docx2, and showing page numbers or cross referencing to page numbers.

Unless docx2 explicitly restarts page numbering, the numbers will continue on from those in docx1.

You can make the page numbering restart with:

              blockRange2.setRestartPageNumbering(true);

If you are using page numbering of the form "page n of <total pages>" and you want <total pages> to reflect the number of pages in the relevant original document (rather than the number of pages in the resulting merged document), you should change your source documents so that they refer to <Total Number of Pages in Section>.  See further http://support.microsoft.com/kb/191029

This will work provided each source docx has a single section.  If the source documents have multiple sections, you will need to put a bookmark on the last page of each, and use a reference to that as the total number of pages.

If you have front matter you wish to exclude from the number of pages, you need to do a calculation[2]:

·         If you know the number of pages in the front matter (and it will not change), then you can use Page { Page } of { = { NumPages } - x }, where x is the number of pages in the front matter.  For example:


        (toggle field codes to see)

·         If not, then you insert a bookmark on the last page of the document and use a PageRef field to reference the page number of that bookmark instead of the NumPages field.

 

Macros

The default behaviour of MergeDocx is to produce an output docx which contains no macros.

You can configure DocumentBuilder to retain the macros present in one of the source documents.  To do this, you need to be using the BlockRange approach.

DocumentBuilder contains:

       /**

        * With this setting, you can embed macros from one of the input documents, in the output docx.

        * Without it, macros will simply be ignored.

        * The macros come from the docm or dotm underlying the specified BlockRange.

        * The setting will be ignored if a docx or dotx underlies the specified BlockRange.

        * @param br

        */

       public void setRetainMacros(BlockRange br)

So you can do something like:         

        documentBuilder.setRetainMacros(blockRanges.get(2));

to keep the macros from docm/dotm underlying the 3rd BlockRange.

If MergeDocx finds macros in that block range, the resulting output document will be set to be of the same type (ie docm or dotm).  It is your responsibility, when saving your output WordprocessingMLPackage, to save it with the correct filename extension.  If a docm is saved with a docx extension, Word 2010 will report an error when you try to open it.

So you need to ensure you use the correct filename extension.
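One simple way to do that is to derive the output extension from the file name of the macro source.  The sketch below is a hypothetical helper (MacroExtension is not part of MergeDocx) which works on file names only, and assumes that a source named .docm or .dotm really does contain macros.

```java
// Hypothetical helper: gives the output file the same extension as the macro-enabled
// source (docm or dotm). It works on names only; nothing is read from disk.
class MacroExtension {

    static String matchExtension(String outputName, String macroSourceName) {
        String source = macroSourceName.toLowerCase();
        String ext;
        if (source.endsWith(".docm")) {
            ext = ".docm";
        } else if (source.endsWith(".dotm")) {
            ext = ".dotm";
        } else {
            return outputName; // docx/dotx source: no macros retained, keep the name
        }
        // Only treat a dot as the extension separator if it is in the file name itself
        int slash = Math.max(outputName.lastIndexOf('/'), outputName.lastIndexOf('\\'));
        int dot = outputName.lastIndexOf('.');
        String base = dot > slash ? outputName.substring(0, dot) : outputName;
        return base + ext;
    }
}
```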

Interaction between ODD_PAGE and Page Number restart

With MergeDocx, you can use the settings described above to have each new document start on the right (recto) page, with numbering starting again from one:

        block.setSectionBreakBefore(SectionBreakBefore.ODD_PAGE);

        block.setRestartPageNumbering(true);

 

Microsoft Word will not, however, honour this combination, unless the docx is “tweaked” to make it do so.

There are two different ways MergeDocx can tweak the output docx in order to have Word behave as expected.  You’ll need to experiment with both approaches; this is best done by physically printing the output from Word to your printer or to PDF.  (You can print 4 pages per side to save paper, and still see what is going on.)

The first is:               

        documentBuilder.setSectionBreak_ODD_PAGE(
                        BEHAVIOUR_SectionBreak_ODD_PAGE.MIRROR_MARGINS);

This is the cleanest approach, and should be used where possible.  For it to work, you need to ensure your first docx being merged has a document settings part (since the mirror margins setting is stored in that part, and MergeDocx gets that part from the first docx).

The second is:               

        documentBuilder.setSectionBreak_ODD_PAGE(
                        BEHAVIOUR_SectionBreak_ODD_PAGE.FIELD_IF_MOD);

If you use this approach, MergeDocx will insert an arcane field into your docx before appropriate sections (hit Shift F9 to see field codes):


 

The table below summarises the advantages and disadvantages of each approach:

MIRROR_MARGINS

+ doesn’t introduce fields into the docx

- may not work if documents contain both portrait and landscape pages; see http://support.microsoft.com/kb/185528

- first docx must have a document settings part for this to work (you can add one with docx4j if it doesn’t)

- single setting per docx (though the other approach is the same in practice)

FIELD_IF_MOD

+ suited to a mixture of portrait and landscape pages
+ could in principle control each docx separately (contact Plutext if you need this)

+ can be adjusted to include “this page intentionally left blank”

- PDF output systems (other than Word) are less likely to support

 

Bullets & Numbering

When documents using the "same" numbering are merged, by default, the numbering will continue, not restart.

This is useful if you are merging chapters of a book, or sections of a contract, and you want the numbering to continue.

Sometimes however, you may want to force the numbering to restart.  To do this, you instruct MergeDocx to add new lists, rather than re-using existing lists.

To do this, set NumberingHandler to ADD_NEW_LIST:

              BlockRange blockRange1 = ...

              BlockRange blockRange2 = ...

 

              blockRange2.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);

 

The default is USE_EARLIER_IFF_SAME.  "same" means the formatting definition is the same (ie they look the same), and the list is based on the same abstract numbering definition identifier (nsid).

There is a third option, USE_EARLIER, which will use a list with the same nsid from an earlier BlockRange, irrespective of whether it looks the same.  The numbering will continue, not restart.  For example if the numbering of the list in the first BlockRange was decimal, and the second BlockRange contained a list with the same nsid but roman numbering, applying the USE_EARLIER to the second BlockRange would cause its numbering to be decimal (rather than roman).
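The three options can be summarised as a small decision rule.  The sketch below is a model of the behaviour as described in this section, written for explanation only; the class NumberingRules and its enum are hypothetical and are not MergeDocx code.

```java
// Hypothetical model of the three NumberingHandler options described above.
class NumberingRules {

    enum Handler { ADD_NEW_LIST, USE_EARLIER_IFF_SAME, USE_EARLIER }

    // Returns true if a list from an earlier BlockRange is re-used (numbering continues),
    // false if a new list is added (numbering restarts).
    static boolean reusesEarlierList(Handler handler, boolean sameNsid, boolean sameFormatting) {
        switch (handler) {
            case ADD_NEW_LIST:
                return false;                       // always a new list
            case USE_EARLIER:
                return sameNsid;                    // appearance is not considered
            default: // USE_EARLIER_IFF_SAME
                return sameNsid && sameFormatting;  // must also look the same
        }
    }
}
```

In the decimal versus roman example, sameNsid is true and sameFormatting is false, so the earlier list is re-used under USE_EARLIER but not under the default.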

Styles

 By default, if a style is encountered which is already defined in an earlier BlockRange, that earlier definition will be used.  If the definition is different, this will cause the appearance of text using this style to change. 

If the documents you are merging were styled independently, you will probably want them to retain their individual look.  This can be accomplished by importing the styles (and renaming them so they don't collide).

To do this, set StyleHandler to RENAME_RETAIN:

              BlockRange blockRange1 = ...

              BlockRange blockRange2 = ...

 

              blockRange2.setStyleHandler(StyleHandler.RENAME_RETAIN);
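The general idea of renaming on import so that names don't collide can be sketched as follows.  The numeric suffix scheme is an assumption made purely for this illustration; the names MergeDocx actually generates may well differ, and the class StyleRenameSketch is hypothetical.

```java
import java.util.Set;

// Hypothetical illustration of collision-free renaming. The suffix scheme is an
// assumption made for this sketch; MergeDocx's actual naming scheme may differ.
class StyleRenameSketch {

    // Returns styleId unchanged if it is free, otherwise the first of
    // styleId_1, styleId_2, ... which is free. The result is recorded as taken.
    static String importStyle(Set<String> taken, String styleId) {
        String candidate = styleId;
        int suffix = 1;
        while (taken.contains(candidate)) {
            candidate = styleId + "_" + suffix++;
        }
        taken.add(candidate);
        return candidate;
    }
}
```

Whatever the scheme, content which used the imported style has to be updated to refer to the new name, which is why anything referring to the old name by other means can be left behind.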

 

Known limitation regarding Table of Contents: consider a style which will be renamed.  A TOC field which refers to that style will not be updated to use the new name.  This means entries in the table of contents will go missing.

If the document contains numbering, you'll also want to set:

 

              blockRange2.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);

 

(The default NumberingHandler option is USE_EARLIER_IFF_SAME, as described under Bullets & Numbering above.)

EVENT MONITORING

Since merging documents can take some time (depending on the number and complexity of the documents), the possibility exists (new in 3.1.0) of performing the merge in the background, and receiving notification when the job is complete.

See the MergeDocxProgress sample for an example of usage.

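As per that example, you need to create a message bus, register your listener(s) with it, and tell Docx4jEvent to use that bus for notifications.

This is done as follows: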

        // Creation of message bus

        MBassador<Docx4jEvent> bus = new MBassador<Docx4jEvent>(

                        BusConfiguration.Default());

        //  and registration of listeners

        ListeningBean listener = new ListeningBean();

        bus.subscribe(listener);       

        // tell Docx4jEvent to use your message bus for notifications

        Docx4jEvent.setEventNotifier(bus);

 

The sample class contains an example ListeningBean.  Note the @Handler annotation.

Docx4j’s approach to event monitoring relies on the MBassador library; for more information, see https://github.com/bennidi/mbassador

For another example of monitoring events (docx load, save), please see https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/EventMonitoringDemo.java

ADVANCED TOPICS

Document Defaults

The styles part of a docx contains an element called w:docDefaults.  Example contents:

  <w:docDefaults>

    <w:rPrDefault>

      <w:rPr>

        <w:rFonts w:asciiTheme="minorHAnsi" w:eastAsiaTheme="minorEastAsia" w:hAnsiTheme="minorHAnsi" w:cstheme="minorBidi"/>

        <w:sz w:val="22"/>

        <w:szCs w:val="22"/>

        <w:lang w:val="en-US" w:eastAsia="ko-KR" w:bidi="ar-SA"/>

      </w:rPr>

    </w:rPrDefault>

    <w:pPrDefault>

      <w:pPr>

        <w:spacing w:after="200" w:line="276" w:lineRule="auto"/>

      </w:pPr>

    </w:pPrDefault>

  </w:docDefaults>

 

These are the basic/root settings, on which the formatting/appearance is based.  See further below for tips on viewing and editing w:docDefaults.
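The values in w:docDefaults are easier to read once converted to familiar units.  In WordprocessingML, w:sz is expressed in half-points, w:spacing/@w:after in twentieths of a point, and w:line with lineRule="auto" in 240ths of a line.  The following is a minimal sketch using only the JDK's XML parser (not docx4j); the class and method names are illustrative, and the XML is a cut-down version of the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DocDefaultsUnits {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses an XML string with namespace awareness switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the given w: attribute of the first matching w: element, or null. */
    static String attr(Document doc, String element, String attribute) {
        NodeList nodes = doc.getElementsByTagNameNS(W_NS, element);
        if (nodes.getLength() == 0) return null;
        Element first = (Element) nodes.item(0);
        return first.hasAttributeNS(W_NS, attribute)
                ? first.getAttributeNS(W_NS, attribute) : null;
    }

    static double halfPointsToPoints(int halfPoints) { return halfPoints / 2.0; }

    static double twentiethsToPoints(int twentieths) { return twentieths / 20.0; }

    static double autoLineToMultiple(int line) { return line / 240.0; }

    public static void main(String[] args) {
        String xml = "<w:docDefaults xmlns:w=\"" + W_NS + "\">"
                + "<w:rPrDefault><w:rPr><w:sz w:val=\"22\"/></w:rPr></w:rPrDefault>"
                + "<w:pPrDefault><w:pPr>"
                + "<w:spacing w:after=\"200\" w:line=\"276\" w:lineRule=\"auto\"/>"
                + "</w:pPr></w:pPrDefault></w:docDefaults>";
        Document doc = parse(xml);
        System.out.println("font size (pt): "
                + halfPointsToPoints(Integer.parseInt(attr(doc, "sz", "val"))));
        System.out.println("space after (pt): "
                + twentiethsToPoints(Integer.parseInt(attr(doc, "spacing", "after"))));
        System.out.println("line spacing (multiple): "
                + autoLineToMultiple(Integer.parseInt(attr(doc, "spacing", "line"))));
    }
}
```

So the example defaults amount to an 11 point font, 10 points of space after each paragraph, and line spacing of 1.15 lines.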

When documents are merged, there can only be one w:docDefaults element.

If one or more BlockRanges have StyleHandler.RENAME_RETAIN (that is, you want to retain the existing look of each individual document), or incremental processing is being used, we merge the properties in doc defaults into the styles (except for paragraph spacing; see further below).

 

overrideTableStyleFontSizeAndJustification

In a document created in Word, the settings part contains by default:

    <w:compatSetting w:name="overrideTableStyleFontSizeAndJustification" .. w:val="1"/>

 

but this may vary by input document.

Where it is false, anything in a table where a font size of 11/12 or a jc of left comes from the Normal style is ignored (in favour of whatever the table style specifies).

In the output docx, this is always set, so paragraph styles do override table styles.

Where that wasn’t true in a particular input document, appropriate adjustments are made.

Paragraph spacing exception

Consider the above example, where w:docDefaults contains a setting for w:spacing:

    <w:pPrDefault>

      <w:pPr>

        <w:spacing w:after="200" w:line="276" w:lineRule="auto"/>

      </w:pPr>

    </w:pPrDefault>

This is a special case, because if this is merged into a style used in a table, it will affect table row heights: Word applies different layout rules inside a table cell, depending on whether this setting is in w:docDefaults or in a paragraph style.

So w:spacing is not copied from w:docDefaults into the styles.

Only the value from the first BlockRange is used; if subsequent input documents have different values, that information is lost.

So for best results, you should ensure each input document uses the same w:spacing setting in its w:docDefaults (no setting for w:spacing is a good option).
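One way to check this before merging is to compare the w:spacing found under w:pPrDefault in each input document's styles part.  The following is a minimal sketch using only the JDK's XML parser; it assumes you already have each styles part as an XML string (for example via getXML() on the style definitions part), and the class and method names are illustrative rather than part of MergeDocx.

```java
import java.io.StringReader;
import java.util.List;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DocDefaultsSpacingCheck {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the doc default w:spacing attributes as a sorted string, or "" if none. */
    static String spacingSignature(String stylesXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stylesXml)))
                    .getDocumentElement();
            // w:pPrDefault only occurs inside w:docDefaults, so style-level spacing is ignored.
            NodeList defaults = root.getElementsByTagNameNS(W_NS, "pPrDefault");
            if (defaults.getLength() == 0) return "";
            NodeList spacing = ((Element) defaults.item(0))
                    .getElementsByTagNameNS(W_NS, "spacing");
            if (spacing.getLength() == 0) return "";
            NamedNodeMap attrs = spacing.item(0).getAttributes();
            TreeMap<String, String> sorted = new TreeMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                sorted.put(attrs.item(i).getLocalName(), attrs.item(i).getNodeValue());
            }
            return sorted.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse styles XML", e);
        }
    }

    /** True if every document has the same doc default spacing as the first one. */
    static boolean allMatchFirst(List<String> stylesXmlPerDocument) {
        if (stylesXmlPerDocument.isEmpty()) return true;
        String first = spacingSignature(stylesXmlPerDocument.get(0));
        for (String xml : stylesXmlPerDocument) {
            if (!first.equals(spacingSignature(xml))) return false;
        }
        return true;
    }

    /** Builds a tiny styles part for demonstration; pass "" for no w:spacing. */
    static String sampleStyles(String spacingElement) {
        return "<w:styles xmlns:w=\"" + W_NS + "\"><w:docDefaults><w:pPrDefault><w:pPr>"
                + spacingElement
                + "</w:pPr></w:pPrDefault></w:docDefaults></w:styles>";
    }

    public static void main(String[] args) {
        String a = sampleStyles("<w:spacing w:after=\"200\" w:line=\"276\" w:lineRule=\"auto\"/>");
        String b = sampleStyles("<w:spacing w:after=\"0\" w:line=\"240\" w:lineRule=\"auto\"/>");
        System.out.println("a vs a: " + allMatchFirst(List.of(a, a)));
        System.out.println("a vs b: " + allMatchFirst(List.of(a, b)));
    }
}
```

A false result indicates that at least one input document's doc default spacing differs from the first BlockRange's, and so would be lost in the merge.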

Editing Document Defaults

Microsoft Word provides ways to edit your document defaults, but no easy way to be sure what the settings are, since the Word interface conflates the default paragraph style (eg Normal) with DocDefaults/pPrDefault.

To see the actual settings, we recommend looking at the raw XML.  There are a few different ways to do this:

In Java

            // Given WordprocessingMLPackage

            org.docx4j.wml.Styles styles = (org.docx4j.wml.Styles)wmlPkg.getMainDocumentPart().getStyleDefinitionsPart().getJaxbElement();

            System.out.println(

                        org.docx4j.XmlUtils.marshaltoString(styles.getDocDefaults()));

 

or just:

            System.out.println(

                        wmlPkg.getMainDocumentPart().getStyleDefinitionsPart().getXML() );

 

The w:docDefaults will be at the top of the output.

or use the Docx4j Helper Word Addin (v3.3)

Clicking that, you’ll see your w:docDefaults in an editor window:

If you edit the XML then click the apply button, the result will be a new docx containing your new settings.

or, unzip the docx, then open styles.xml

 

or use the webapp, to navigate to the styles part

 

or, if you have Visual Studio,

use the Open XML Package Editor for Visual Studio:  https://visualstudiogallery.msdn.microsoft.com/450a00e3-5a7d-4776-be2c-8aa8cec2a75b

With that you can drag your docx onto Visual Studio, then navigate the tree to the styles part.

You can edit and save your changes.
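The unzip approach mentioned above can also be done from Java without docx4j, since a docx is a zip archive.  The following is a minimal sketch using only the JDK; it assumes the styles part is stored at the usual location word/styles.xml and is UTF-8 encoded, which is typical but not guaranteed.  The class name is illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class StylesPartReader {

    /** Returns the text of word/styles.xml in the given docx, or null if it is absent. */
    static String readStylesXml(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/styles.xml"); // assumed part location
            if (entry == null) return null;
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java StylesPartReader.java <file.docx>");
            return;
        }
        // w:docDefaults, if present, appears near the top of the printed XML.
        System.out.println(readStylesXml(Paths.get(args[0])));
    }
}
```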

 

With some of the above approaches, you can edit your w:docDefaults.

Alternatively, you can do this in Word:

·         To set paragraph level doc default properties, right click then choose “Paragraph” from the context menu. 

You should see:

The key is the "set as default" button.

·         To set run level doc default properties, right click then choose “Font” from the context menu. 

Again, when you have things set as you wish, click the "set as default" button.



[1] Since docx4j 2.7.0, you can also use the ContentAccessor interface (which is supported by various objects):

    public List<Object> getContent()

[2] http://www.eggheadcafe.com/microsoft/Word-Page-Layout/35979216/total-page-number-minus-number-of-pages-in-front-matter.aspx
http://wordribbon.tips.net/T010604_Field_Reference_to_Number_of_Prior_Pages.html