16 December, 2011

Comparing two XML documents in Java

How many times have you, for example, modified some XML file but wanted to make sure that only certain parts were changed?  Well it did occurred to me, a number of times.  Actually the problem is this:  I had a small application which modifies XML files in a graphical way and I wanted to ensure that nothing got messed up when I saved the file.  By "messed up" I mean tags changing their values or positions.

So what I did was build a simple program which will take in two files (XML, naturally) and then map all the nodes and compare their paths and values.  Quite a simple task.


So first, we'll start off with the usual Utilities class.  Static methods such as those reading a file should be placed here, ideally.


package com.jf.xmlcomp.util;




import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.Scanner;


import javax.xml.parsers.DocumentBuilderFactory;


import org.w3c.dom.Document;


public class Utils
{
public static String readStringFromFile(File file)
{
StringBuilder ds_text = new StringBuilder();
String str_sep = System.getProperty("line.separator");
try
{
Scanner ds_scan = new Scanner(new FileInputStream(file), "UTF-8");
while (ds_scan.hasNextLine())
ds_text.append(ds_scan.nextLine() + str_sep);
ds_scan.close();
}
catch (Exception e) 
{
System.err.println(file.getPath() + " :: Not read");
}

return ds_text.toString();
}

public static Document stringToDocument(String str_doc)
{
Document ds_doc = null;
try
{
ds_doc = DocumentBuilderFactory
.newInstance()
.newDocumentBuilder()
.parse(new ByteArrayInputStream(str_doc.getBytes()));
}
catch (Exception e) 
{
System.err.println("String could not be converted to Document");
}

return ds_doc;
}
}


We only have two simple methods here; one which reads in the contents of a file and one which converts an XML String to a W3C Document object.  I won't be detailing much about reading a file or converting a String as it is not the scope of this post.

Next thing to do is create a very simple implementation of a Map, sort of.  Now those that actually know what a Map is will be frothing at their mouths, since what we will be doing is a simple ArrayList mapping and we won't even be using Generics, hashes or even collision tables.  The reason is that this is a simple post about creating a simple XML Comparator.

So, to the SimpleMap thing...


package com.jf.xmlcomp.util;


import java.util.ArrayList;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;


public class SimpleMap
{
private ArrayList<String> keys;
private ArrayList<String> values;


public SimpleMap()
{
keys = new ArrayList<String>();
values = new ArrayList<String>();
}


public void put(String s, String v)
{
keys.add(s);
values.add(v);
}


public ArrayList<String> get(String k)
{
ArrayList<String> result = new ArrayList<String>();
for (int c = 0; c < keys.size(); c++)
if (keys.get(c).equals(k)) result.add(values.get(c));


return result;
}


As you can see, we created the usual methods you'd find in a Map: the put and get.  The rest should be pretty easy; declaring the ArrayLists containing the keys and values and then initialising them in the constructor.

The put method will simply add the new key and value Strings in the respective ArrayList while get, gets a little over-done.  In the get method we first set up a results ArrayList and then loop through the keys list and store every matching key.  The key won't be stored though, as they are of no use really; if you ask for door which is opened by a particular key, you will want to find a door.  So we get the index at which the key was found and then store the value found at the same index in the values list in the results list.  Finally the results list is returned.  Probably you now know that we can have the same key appearing more than once in this map.


public ArrayList<String> keys()
{
return keys;
}

public ArrayList<String> getCompleteItemList()
{
ArrayList<String> res = new ArrayList<String>();
for (String key : uniqueKeys())
for (String val : get(key))
res.add(key + " :: " + val);
return res;
}

Following the mapping details are two other simple methods.  One simply returns the list of keys while the other will return all the items we have as a String of key :: value.  At first it might look like an over kill.  Getting only the unique keys only to loop over the values obtained by every key.  What happens if you use the normal key list and loop over it, is a repetitive amount of values.

Let's say there are two keys named "a" and "b".  If we loop over the key list we get "a", "a", "b".  Now let's include the part which gets the values obtained from each key.  We will get two values for each key named "a", so eventually we'll end up with the values of a key named "a", twice.

So to "fix" this, we will simple get each key once and then simply add the resulting values of each.  That way we can be totally sure that no key-value pair is repeated.

Naturally we'll be needing the uniqueKeys method, so here it comes.




public ArrayList<String> uniqueKeys()

{
Set<String> s = new TreeSet<String>(new Comparator<String>()
{
@Override
public int compare(String o1, String o2)
{
if (o1.equals(o2))
return 0;
return 1;
}
});
s.addAll(keys);

ArrayList<String> result = new ArrayList<String>();
result.addAll(s);

return result;
}


This method will create a Set (which will not accept duplicates) and give it a String Comparator.  The compare method is overridden so basically you will have to write it ("you", as in "the programmer").  As these are Strings we are comparing, the equals method is needed (Don't even think about using "==" when comparing Strings.  Really, don't.)  Next, the key list is added to the set and it will automatically store each value once, so no duplicates.  Finally a new ArrayList is declared and the new modified set of unique keys are added to the list, and then it's returned.

This is a rather quick and dirty way to get unique values from a List object.  This is basically our new Map, which should be used only in this project.  And nowhere else.


package com.jf.xmlcomp;


import java.io.File;
import java.util.ArrayList;


import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;


import com.jf.xmlcomp.util.SimpleMap;
import com.jf.xmlcomp.util.Utils;


public class XMLValueComp
{
private File xfile1;
private File xfile2;

private SimpleMap f1v;
private SimpleMap f2v;

public static void main(String[] args)
{
if (args.length < 2)
printHelp();

XMLValueComp xmlComp = new XMLValueComp(args[0], args[1]);

xmlComp.compare();
}

public static void printHelp()
{
System.out.println("Usage: xmlcomp file1 file2");
System.exit(0);
}

This is the main class for this project.  There are the two files which shall be compared along with the SimpleMaps for each of them.  The main class will accept two arguments; the two files.  If not enough arguments are passed, a help message is printed out, and the program exits.


public XMLValueComp(String file1, String file2)
{
f1v = new SimpleMap();
f2v = new SimpleMap();
File f1 = new File(file1);
File f2 = new File(file2);

if (!f1.exists() || !f2.exists())
{
System.out.println("One of the files does not exist!");
System.exit(0);
}
xfile1 = f1;
xfile2 = f2;
}


Now we have the class constructor.  Doesn't do much except for initialisation and ensuring we are not comparing nothingness.  If the files do exist, then we will initilise them.  If they don't, the application will exit.


public void compare()
{
Document doc1 = Utils.stringToDocument(
Utils.readStringFromFile(xfile1));

Document doc2 = Utils.stringToDocument(
Utils.readStringFromFile(xfile2));

if (doc1 == null || doc2 == null)
{
System.err.println("Could not build docs");
System.exit(0);
}

buildMap(f1v, doc1);
buildMap(f2v, doc2);


ArrayList<String> uniques = new ArrayList<String>();


uniques.addAll(f1v.getCompleteItemList());
uniques.removeAll(f2v.getCompleteItemList());


System.out.println("Unique to F1");
for (String v : uniques)
System.out.println(v);




uniques = new ArrayList<String>();


uniques.addAll(f2v.getCompleteItemList());
uniques.removeAll(f1v.getCompleteItemList());

System.out.println("\n\nUnique to F2");
for (String v : uniques)
System.out.println(v);
}

That the core of the application, where we shall be dealing with the real comparator part.  So first off, the documents are created from the String we got from the files.  After that, we shall perform the usual checking to make sure we don't have corrupted files, for example, in which case we'll report the problem and stop.  Then the Maps for each Document are built and a list of unique pairs is created.

The duplicates from each document are removed by filling the unique list with the values from one document and then removing the ones which are found in the other document.  It is a quick way to remove the common nodes and getting only the ones that don't appear in both documents.  The same process is repeated for both documents, so eventually a list of unique items in document one and document two are printed separately.  If you compare a file with itself you can easily deduce that all the common nodes will be removed and the list of unique values will always be empty.


private void buildMap(SimpleMap map, Document doc)
{
Element e = doc.getDocumentElement();


mapNodes(map, e, "/" + e.getNodeName());
}


private void mapNodes(SimpleMap map, Element elem, String parent)
{
for (int x = 0; x < elem.getChildNodes().getLength(); x++)
{
Node n = elem.getChildNodes().item(x);
String path = parent + "/" + n.getNodeName(); 

if (n.getChildNodes().getLength() == 1)
map.put(path, n.getTextContent());

if (n.getNodeType() == Node.ELEMENT_NODE)
mapNodes(map, (Element) n, path);
}
}


The buildMap method is simply preparing some stuff for the mapNodes method.  The document element is passed as the root element to the mapNodes so that it can use recursion to obtain all nodes.

The mapNodes will loop the current child nodes and store the value and path.  Every time an Element Node is encountered it will call itself and the children of that node are stored.  That will go on until all the nodes in the document have been traversed.

You can now try it out by running it and passing in two paths that point to XML documents.  You should try it with the same document to ensure that it actually does not detect any differences.  Then you can go on to see if a modified XML document has kept its integrity :)

As usual, the full code is available here.