
16 December, 2011

Comparing two XML documents in Java

How many times have you modified some XML file but wanted to make sure that only certain parts were changed?  Well, it has happened to me a number of times.  The problem is this:  I had a small application which modifies XML files in a graphical way, and I wanted to ensure that nothing got messed up when I saved the file.  By "messed up" I mean tags changing their values or positions.

So what I did was build a simple program which will take in two files (XML, naturally) and then map all the nodes and compare their paths and values.  Quite a simple task.


So first, we'll start off with the usual Utilities class.  Static methods, such as the ones that read files, ideally belong here.


package com.jf.xmlcomp.util;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.Scanner;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class Utils
{
    public static String readStringFromFile(File file)
    {
        StringBuilder ds_text = new StringBuilder();
        String str_sep = System.getProperty("line.separator");
        try
        {
            Scanner ds_scan = new Scanner(new FileInputStream(file), "UTF-8");
            while (ds_scan.hasNextLine())
                ds_text.append(ds_scan.nextLine() + str_sep);
            ds_scan.close();
        }
        catch (Exception e)
        {
            System.err.println(file.getPath() + " :: Not read");
        }

        return ds_text.toString();
    }

    public static Document stringToDocument(String str_doc)
    {
        Document ds_doc = null;
        try
        {
            ds_doc = DocumentBuilderFactory
                    .newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(str_doc.getBytes()));
        }
        catch (Exception e)
        {
            System.err.println("String could not be converted to Document");
        }

        return ds_doc;
    }
}


We only have two simple methods here: one which reads in the contents of a file and one which converts an XML String to a W3C Document object.  I won't go into much detail about reading a file or converting a String, as that is outside the scope of this post.

The next thing to do is create a very simple implementation of a Map, sort of.  Now, those that actually know what a Map is will be frothing at their mouths, since what we will be doing is a simple ArrayList-based mapping, without hashing, collision tables or even the generic Map interface.  The reason is that this is a simple post about building a simple XML comparator.

So, to the SimpleMap thing...


package com.jf.xmlcomp.util;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class SimpleMap
{
    private ArrayList<String> keys;
    private ArrayList<String> values;

    public SimpleMap()
    {
        keys = new ArrayList<String>();
        values = new ArrayList<String>();
    }

    public void put(String s, String v)
    {
        keys.add(s);
        values.add(v);
    }

    public ArrayList<String> get(String k)
    {
        ArrayList<String> result = new ArrayList<String>();
        for (int c = 0; c < keys.size(); c++)
            if (keys.get(c).equals(k)) result.add(values.get(c));

        return result;
    }


As you can see, we created the usual methods you'd find in a Map: put and get.  The rest should be pretty easy: we declare the ArrayLists containing the keys and values and then initialise them in the constructor.

The put method simply adds the new key and value Strings to the respective ArrayLists, while get is a little more involved.  In the get method we first set up a results ArrayList and then loop through the keys list, looking for every matching key.  The key itself is not stored in the results, as it is of no real use; if you ask which door a particular key opens, you want to be handed a door.  So we take the index at which the key was found, look up the value at the same index in the values list, and add that value to the results.  Finally, the results list is returned.  You have probably noticed by now that the same key can appear more than once in this map.
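To make that concrete, here is a quick, made-up usage sketch (the keys and values are just examples) showing how put and get behave when the same key is added twice:

SimpleMap map = new SimpleMap();
map.put("/config/server", "alpha");
map.put("/config/server", "beta");   // same key again, both values are kept
map.put("/config/port", "8080");

// get returns every value stored under that key, in insertion order
System.out.println(map.get("/config/server")); // [alpha, beta]
System.out.println(map.get("/config/port"));   // [8080]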


    public ArrayList<String> keys()
    {
        return keys;
    }

    public ArrayList<String> getCompleteItemList()
    {
        ArrayList<String> res = new ArrayList<String>();
        for (String key : uniqueKeys())
            for (String val : get(key))
                res.add(key + " :: " + val);
        return res;
    }

Following the mapping details are two other simple methods.  One simply returns the list of keys, while the other returns every item we have as a String of the form key :: value.  At first it might look like overkill: fetching only the unique keys, only to loop over the values obtained for each key.  But if you loop over the normal key list instead, you end up with repeated values.

Let's say the key list contains "a", "a", "b", because the key "a" was added twice.  Looping over the key list visits "a" twice.  Now add the part which fetches the values for each key: get("a") returns both values stored under "a", and it is called once per occurrence, so we end up with the values of "a" listed twice.

So to "fix" this, we simply take each key once and then add the values found for it.  That way we can be sure that no key-value pair is repeated.
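Here is a tiny, made-up illustration of the difference:

SimpleMap map = new SimpleMap();
map.put("a", "1");
map.put("a", "2");
map.put("b", "3");

// Looping over keys() visits "a" twice, so every "a" pair is printed twice
for (String key : map.keys())
    for (String val : map.get(key))
        System.out.println(key + " :: " + val);

// getCompleteItemList() loops over uniqueKeys() instead,
// so each pair comes out exactly once
for (String item : map.getCompleteItemList())
    System.out.println(item);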

Naturally we'll be needing the uniqueKeys method, so here it comes.




    public ArrayList<String> uniqueKeys()
    {
        Set<String> s = new TreeSet<String>(new Comparator<String>()
        {
            @Override
            public int compare(String o1, String o2)
            {
                // delegate to String's own ordering so equal keys
                // collapse into a single entry
                return o1.compareTo(o2);
            }
        });
        s.addAll(keys);

        ArrayList<String> result = new ArrayList<String>();
        result.addAll(s);

        return result;
    }


This method creates a Set (which will not accept duplicates) and gives it a String Comparator.  The compare method is overridden, so you have to write it yourself ("you", as in "the programmer"); here it simply delegates to String's compareTo, which returns 0 when the two Strings are equal (don't even think about using "==" to compare Strings.  Really, don't.).  Next, the key list is added to the set, which automatically stores each key only once, so no duplicates.  Finally, a new ArrayList is declared, the unique keys from the set are added to it, and the list is returned.

This is a rather quick and dirty way to get unique values from a List object.  And that's our makeshift Map done; it should be used only in this project.  And nowhere else.  Next comes the class that ties everything together.


package com.jf.xmlcomp;

import java.io.File;
import java.util.ArrayList;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

import com.jf.xmlcomp.util.SimpleMap;
import com.jf.xmlcomp.util.Utils;

public class XMLValueComp
{
    private File xfile1;
    private File xfile2;

    private SimpleMap f1v;
    private SimpleMap f2v;

    public static void main(String[] args)
    {
        if (args.length < 2)
            printHelp();

        XMLValueComp xmlComp = new XMLValueComp(args[0], args[1]);

        xmlComp.compare();
    }

    public static void printHelp()
    {
        System.out.println("Usage: xmlcomp file1 file2");
        System.exit(0);
    }

This is the main class of the project.  It holds the two files which will be compared, along with a SimpleMap for each of them.  The main method accepts two arguments: the two files.  If not enough arguments are passed, a help message is printed and the program exits.


    public XMLValueComp(String file1, String file2)
    {
        f1v = new SimpleMap();
        f2v = new SimpleMap();
        File f1 = new File(file1);
        File f2 = new File(file2);

        if (!f1.exists() || !f2.exists())
        {
            System.out.println("One of the files does not exist!");
            System.exit(0);
        }
        xfile1 = f1;
        xfile2 = f2;
    }


Now we have the class constructor.  It doesn't do much except initialisation and ensuring we are not comparing nothingness.  If both files exist, the fields are initialised; if they don't, the application exits.


    public void compare()
    {
        Document doc1 = Utils.stringToDocument(
                Utils.readStringFromFile(xfile1));

        Document doc2 = Utils.stringToDocument(
                Utils.readStringFromFile(xfile2));

        if (doc1 == null || doc2 == null)
        {
            System.err.println("Could not build docs");
            System.exit(0);
        }

        buildMap(f1v, doc1);
        buildMap(f2v, doc2);

        ArrayList<String> uniques = new ArrayList<String>();

        uniques.addAll(f1v.getCompleteItemList());
        uniques.removeAll(f2v.getCompleteItemList());

        System.out.println("Unique to F1");
        for (String v : uniques)
            System.out.println(v);

        uniques = new ArrayList<String>();

        uniques.addAll(f2v.getCompleteItemList());
        uniques.removeAll(f1v.getCompleteItemList());

        System.out.println("\n\nUnique to F2");
        for (String v : uniques)
            System.out.println(v);
    }

That's the core of the application, where the real comparison happens.  First, the Documents are created from the Strings read from the files.  After that, the usual checking is performed to make sure we don't have corrupted files, for example, in which case the problem is reported and we stop.  Then the map for each Document is built and a list of unique pairs is created.

The duplicates are removed by filling the unique list with the pairs from one document and then removing the ones which are also found in the other document.  It is a quick way to drop the common nodes and keep only the ones that don't appear in both documents.  The same process is repeated in both directions, so eventually the items unique to document one and those unique to document two are printed separately.  If you compare a file with itself, you can easily deduce that all the common nodes will be removed and both lists of unique values will be empty.
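To see the removeAll trick on its own, here is a minimal sketch with plain ArrayLists and made-up entries (assuming java.util.ArrayList and java.util.Arrays are imported):

ArrayList<String> f1 = new ArrayList<String>(
        Arrays.asList("/root/a :: 1", "/root/b :: 2", "/root/c :: 3"));
ArrayList<String> f2 = new ArrayList<String>(
        Arrays.asList("/root/a :: 1", "/root/b :: 99"));

ArrayList<String> uniques = new ArrayList<String>(f1);
uniques.removeAll(f2);        // drops every pair that also appears in f2
System.out.println(uniques);  // [/root/b :: 2, /root/c :: 3]

With the comparison out of the way, the only thing left is building the maps themselves.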


    private void buildMap(SimpleMap map, Document doc)
    {
        Element e = doc.getDocumentElement();

        mapNodes(map, e, "/" + e.getNodeName());
    }

    private void mapNodes(SimpleMap map, Element elem, String parent)
    {
        for (int x = 0; x < elem.getChildNodes().getLength(); x++)
        {
            Node n = elem.getChildNodes().item(x);
            String path = parent + "/" + n.getNodeName();

            if (n.getChildNodes().getLength() == 1)
                map.put(path, n.getTextContent());

            if (n.getNodeType() == Node.ELEMENT_NODE)
                mapNodes(map, (Element) n, path);
        }
    }


The buildMap method simply prepares things for the mapNodes method.  The document element is passed to mapNodes as the root, so that recursion can be used to reach every node.

mapNodes loops over the current node's children and stores the path and value of each.  Every time an element node is encountered, the method calls itself so that the children of that node are stored too.  That goes on until all the nodes in the document have been traversed.

You can now try it out by running it and passing in two paths that point to XML documents.  You should try it with the same document to ensure that it actually does not detect any differences.  Then you can go on to see if a modified XML document has kept its integrity :)
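For example, suppose we feed it two made-up files like these:

file1.xml:
<config>
  <host>alpha</host>
  <port>8080</port>
</config>

file2.xml:
<config>
  <host>alpha</host>
  <port>9090</port>
</config>

Running java com.jf.xmlcomp.XMLValueComp file1.xml file2.xml should then print something along these lines:

Unique to F1
/config/port :: 8080


Unique to F2
/config/port :: 9090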

As usual, the full code is available here.

07 December, 2011

Curious JCIFS Behaviour

For those of you that do not know what JCIFS is, it is a Java library which lets you deal with SMB files in a way very similar to handling normal local File objects.  You can list files, open input streams, delete, rename and perform some other operations.

Recently I was working with this library, specifically building a small search tool which crawls the directories you specify and then returns the results.  It turned out I was having problems with something that many people tend to overlook, or forget about: handles originating from my Java program never seemed to close.

As you can see, many applications hold a few hundred handles, including the JVM I have running (see screenshot).  The other day I was testing memory and CPU usage, and they were pretty much under control, until I had a look at the handles column.  I began by testing small directories, containing fewer than 30 or so items in total.  The handle count rose, but only negligibly, from maybe 50 to seventy-something.  Since a number of handles might be opened by the JVM rather than my code, I ignored it.

After the initial tests, I thought a heavier directory should be crawled, so I pointed it at the desktop, which contains thousands of files (within sub-directories, obviously; not even a 100" TV could fit 1000 icons on the desktop :P).  So again I fired up the program and bam!  I had over 15000 handles, which is absolutely unacceptable.
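For reference, the crawl itself is nothing fancy; stripped down, it boils down to roughly this (the share URL here is made up, and the real tool feeds a search index rather than printing paths):

import jcifs.smb.SmbFile;

public class SmbCrawl
{
    public static void crawl(SmbFile dir) throws Exception
    {
        // listFiles() talks to the server, but SmbFile itself
        // offers no close() I could call afterwards
        for (SmbFile f : dir.listFiles())
        {
            System.out.println(f.getPath());
            if (f.isDirectory())
                crawl(f);
        }
    }

    public static void main(String[] args) throws Exception
    {
        crawl(new SmbFile("smb://server/share/"));
    }
}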

So far I have found no solution to this.  I would really like to know if anyone has had this problem, because as far as I can tell there is no method that closes the connection.  You can close streams, and that's OK, but what am I supposed to close when I am only calling list() or listFiles()?

06 December, 2011

Basic bench-marking in Java

No, not that basic, don't worry
Benchmarking is useful when dealing with applications where performance is vital.  There are a million ways to benchmark your methods: profilers, tools, libraries, etc.  But honestly, sometimes it's too much of a hassle when all you want is a rough estimate of how much time a method, a loop, or anything else is taking.

So what I did was write a simple class which logs the start time and the end time, and prints the difference if necessary; nothing outlandish.

The idea is to call the static method start() just before entering the code to be tested, and then call stop() or stopAndPrint() when you want to, well, stop profiling.  The Timer class basically holds an array of "start times"; each time start() is called the pointer moves up, and each time stop() is called it moves back down.

PS: The complete code for this post is available in the downloads section, under Code, named Timer.java.  As this is the first time, I think you should know that the files are hosted on Google Docs, so just go there and hit Download at the top right of the screen :)

Let's start then:

package com.jf.utils;


public class Timer
{
    private static final int MAX_TIMERS = 50;
    private static long[] startTimes = new long[MAX_TIMERS];
    private static long stopTime;
    private static long time;
    private static int pointer = -1;

So here we have:
  • MAX_TIMERS, a constant defining the maximum number of nested timers.  This limits the number of consecutive starts we can have without stops;
  • startTimes, an array of long which keeps track of all the starting times;
  • stopTime, the time of the latest stop;
  • time, the time between the last start and stop;
  • pointer, which points to the current timer in the array, that is, the most recently started timer.

public static void start()
{
    pointer++;
    if (pointer >= MAX_TIMERS)
    {
        System.err.println("The maximum timer count limit has been reached." +
        " Close some timers first before attempting to open a new one.");
        // roll the pointer back so the array is never indexed out of bounds
        pointer--;
        return;
    }
    startTimes[pointer] = System.currentTimeMillis();
}

So this will start "profiling".  First, it moves the pointer up, which puts us at the next free location.  Next it checks whether we have reached the limit; if so, we just get a simple message and the extra start is ignored, since we want a lightweight class and will not be throwing any exceptions.  Finally it records the starting time.  As you can see, this is not a highly accurate way to measure performance: a little time is "wasted" just incrementing the pointer and checking the limit.

public static void stop()
{
    stopTime = System.currentTimeMillis();
    time = stopTime - startTimes[pointer];
    pointer--;
}

stop will, quite obviously, stop the timer.  What it does is rather simple: it immediately records the current time, so no time is wasted on other work first; that is why we have a variable storing only the latest stop.  It then calculates the total time since the last start, which is pointed to by the current value of pointer.  The last operation is to move the pointer back to the previous timer.  So, as you can see, timers can only be nested, and every call to stop simply stops the last timer that was started.

public static void stopAndPrint()
{
    stop();
    System.out.print("Timer in ");
    System.out.print(Thread.currentThread().getStackTrace()[2]);
    System.out.println(" clocked approx. " + getTimeFromMs(time));
}

This is a rather simple and convenient method.  Here we stop the timer and print out the result.  Note that calling this method might produce slightly less accurate results, as a little time is spent on the extra call to stop.  We could sacrifice code cleanliness by copying the body of stop into this method, so that the stop time is recorded without going through another call, as sketched below.
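If we went down that route, stopAndPrint would look something like this (same behaviour, just without the extra call):

public static void stopAndPrint()
{
    // record the stop time straight away instead of calling stop()
    stopTime = System.currentTimeMillis();
    time = stopTime - startTimes[pointer];
    pointer--;

    System.out.print("Timer in ");
    System.out.print(Thread.currentThread().getStackTrace()[2]);
    System.out.println(" clocked approx. " + getTimeFromMs(time));
}

Either way, the last piece of the class is the little formatting helper used above.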

public static String getTimeFromMs(long ms)
{
    if (ms < 1000)
        return ms + "ms";
    else if (ms >= 1000 && ms < 60000)
        return (ms/1000) + "s " + getTimeFromMs(ms - ((ms/1000)*1000));
    else if (ms >=60000 && ms < 3600000)
        return (ms/60000) + "m " + getTimeFromMs(ms - ((ms/60000)*60000));
    else
        return (ms/3600000) + "h " + getTimeFromMs(ms - ((ms/3600000)*3600000));
}

This is purely a convenience method which takes a time in milliseconds and formats it neatly as hours, minutes, seconds and milliseconds.  It recurses over itself, peeling off the largest unit each time and grouping the remainder into the smaller units.
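For instance, a call like this one:

System.out.println(Timer.getTimeFromMs(3723456));

should print 1h 2m 3s 456ms.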

So there you go.  A simple and ultra-basic way to test your code's performance.  The thing is that this does not require any libraries (not even imports), special IDEs or tools.  Just place this somewhere in your project and call start() and stop() or stopAndPrint() whenever you need to quickly get a rough idea.

The following code will give you an idea:
public static void main(String[] args)
{
    //Start profiling the whole main method
    Timer.start();
    int x = 10;
    int y = 5;
    //Start profiling the mul(x,y) method
    Timer.start();
    mul(x, y);
    //Stop profiling the mul(x,y) method and print result
    Timer.stopAndPrint();

    //Stop profiling the whole main method and print this one too
    Timer.stopAndPrint();
}

public static void mul(int x, int y)
{
    System.out.println(x * y);
}

Happy coding :D !