The Java Explorer

Tips and insights on Java


Comparators – part 3

Posted by Eyal Schneider on April 30, 2012

In this post, as in the previous two comparator posts, I illustrate how a bad implementation of comparators can cause annoying bugs.
Lately I saw a strange-looking Comparator implementation:

class WeirdComparator implements Comparator<Long> {
    @Override
    public int compare(Long a, Long b) {
        if (a == null || b == null)
            return 0;
        if (a > b)
            return 1;
        if (b > a)
            return -1;
        return 0;
    }
}

The implementor decided not to rely on the natural ordering of Long, since it doesn’t handle null values, which are acceptable in his use-case.

The first two lines of the method caught my eye: they are obviously wrong in terms of the Comparator contract, but it is not immediately clear what kind of damage they can cause. The contract of Comparator/Comparable states that the implementation must define a total order on all values of the type it handles. That means, among other things, the transitivity property: if compare(b,a)>=0 and compare(c,b)>=0, then compare(c,a)>=0. With the current implementation, compare(null, 2L)>=0 and compare(1L, null)>=0, but compare(1L, 2L)<0.

What happens if we sort using this comparator?

    Long[] arr = {100L, 101L, 102L, 103L,
                  50L , null,  51L,  52L};
    Arrays.sort(arr,new WeirdComparator());
    System.out.println(Arrays.toString(arr));

The output in this case is a mess: "[50, 100, 101, 102, 103, null, 51, 52]".

Java uses the Mergesort algorithm for sorting arrays of objects, and a variation of the Quicksort algorithm for sorting arrays of primitive types. Java's Mergesort implementation has an optimization that defines a threshold of 7, such that any array range smaller than that is not sorted recursively with Mergesort, but rather with Insertion sort. (As a side note, Java 7 replaced this implementation with TimSort, which may instead throw an IllegalArgumentException when it detects a comparator contract violation.)
The array size in the example is 8, so the two halves are sorted with Insertion Sort (leaving them unchanged), and then they are merged. In the merge phase, the value 50 is chosen first, and then the new head of the second half is null, which is defined to be equal to any value. This causes all the contents of the first half to be copied to the output.

The faulty comparator is also a disaster for sorted sets:

    SortedSet<Long> set = new TreeSet<Long>(new WeirdComparator());
    set.add(null);
    set.add(10L);
    set.add(20L);
    System.out.println(set);

Although we add three values, only the null value remains. As explained in the post Comparators and sorted sets, TreeSet respects the Set invariant which prohibits duplicate values, and it uses the comparator to determine equality. Therefore, the values 10L and 20L are considered duplicates of null and are not added to the tree.
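
For reference, one way to repair WeirdComparator is to give null a real position in the order. Here is a sketch that places null before all non-null values (whether nulls-first is the right policy depends on the use case):

class NullFirstComparator implements Comparator<Long> {
    @Override
    public int compare(Long a, Long b) {
        if (a == null)
            return (b == null) ? 0 : -1; // null precedes any non-null value
        if (b == null)
            return 1;
        return a.compareTo(b); // delegate to the natural ordering of Long
    }
}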


Generating DOM from XML preserving line numbers

Posted by Eyal Schneider on November 30, 2010

Lately I was implementing a simple scripting language based on XML. One of the goals was to emit informative and detailed error messages for syntax errors and script execution errors. In order to achieve this, I was required to include the relevant script line number in all error messages.  Default error messages for malformed XML or XML Schema validation failures do include line numbers, but they cover a limited set of possible errors. Scripting languages often require further validations, which are beyond the expressive power of standard XML schemas.

Parsing an XML document into a corresponding DOM structure is straightforward when using the javax.xml.parsers.DocumentBuilder class:

public static Document readXML(InputStream is) throws IOException, SAXException{
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        return docBuilder.parse(is);
    } catch (ParserConfigurationException e) {
        throw new RuntimeException("Can't create DOM builder.", e);
    }
}
 



The problem is that once we have a DOM tree, the association between elements and their line number is lost. I couldn’t find any way of tweaking the DOM parser to be aware of line numbers. Therefore I tried using a SAX parser instead. SAX parsers traverse a document in its textual order from beginning to end, triggering callback methods whenever XML building blocks are encountered or a syntax error occurs. A SAX parser can be supplied with a DefaultHandler that uses a Locator. The latter is used for tracking the position of XML entities in the input document.

Following is a utility method that converts an XML given in an InputStream into a DOM structure, by using a SAX parser. Instead of keeping the line numbers in a new data structure, the DOM is enriched with a new attribute per element, indicating the line number of the element in the input document.


public static Document readXML(InputStream is, final String lineNumAttribName) throws IOException, SAXException {
    final Document doc;
    SAXParser parser;
    try {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        parser = factory.newSAXParser();
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        doc = docBuilder.newDocument();           
    } catch(ParserConfigurationException e){
        throw new RuntimeException("Can't create SAX parser / DOM builder.", e);
    }

    final Stack<Element> elementStack = new Stack<Element>();
    final StringBuilder textBuffer = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
        private Locator locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator; //Save the locator, so that it can be used later for line tracking when traversing nodes.
        }
       
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {               
            addTextIfNeeded();
            Element el = doc.createElement(qName);
            for(int i = 0;i < attributes.getLength(); i++)
                el.setAttribute(attributes.getQName(i), attributes.getValue(i));
            el.setAttribute(lineNumAttribName, String.valueOf(locator.getLineNumber()));
            elementStack.push(el);               
        }
       
        @Override
        public void endElement(String uri, String localName, String qName){
            addTextIfNeeded();
            Element closedEl = elementStack.pop();
            if (elementStack.isEmpty()) { // Is this the root element?
                doc.appendChild(closedEl);
            } else {
                Element parentEl = elementStack.peek();
                parentEl.appendChild(closedEl);                   
            }
        }
       
        @Override
        public void characters (char ch[], int start, int length) throws SAXException {
            textBuffer.append(ch, start, length);
        }
       
        // Outputs text accumulated under the current node
        private void addTextIfNeeded() {
            if (textBuffer.length() > 0) {
                Element el = elementStack.peek();
                Node textNode = doc.createTextNode(textBuffer.toString());
                el.appendChild(textNode);
                textBuffer.delete(0, textBuffer.length());
            }
        }           
    };
    parser.parse(is, handler);
   
    return doc;
}   

 

In order to compose a hierarchical structure from the linear traversal, a stack is used. The stack contains at any moment of the traversal the “path” to the current location. Whenever an element start is encountered we build the element and add it to the stack, and later when it closes we remove it from the stack and attach it to the parent node.

Note that the characters(..) method implementation only appends the new text to a text buffer. This is because a single call is not guaranteed to provide the full piece of text; the parser may deliver it in chunks.

Finally, note that while this implementation is fine for my purposes, it is not prepared to deal with every kind of XML node. Comments and processing instructions (such as <?xml version="1.0" encoding="UTF-8" standalone="yes"?>) are ignored. In addition, CDATA sections are unwrapped and escaped instead of being copied as is (although there is no semantic difference between the two).
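
As a usage sketch (with a hypothetical XML snippet, assuming the readXML method above and a context that propagates its declared exceptions):

String xml = "<script>\n  <print msg=\"hello\"/>\n  <print msg=\"world\"/>\n</script>";
Document doc = readXML(new ByteArrayInputStream(xml.getBytes("UTF-8")), "lineNum");
NodeList nodes = doc.getElementsByTagName("print");
for (int i = 0; i < nodes.getLength(); i++) {
    Element el = (Element) nodes.item(i);
    System.out.println(el.getAttribute("msg") + " is declared at line " + el.getAttribute("lineNum"));
}

This should print that hello and world are declared at lines 2 and 3, respectively.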


A utility for multithreaded unit tests

Posted by Eyal Schneider on July 13, 2010

Classes designed to run in a multithreaded environment are difficult to test. In a standard unit test, the test space may be already huge, since we try to cover all combinations of the object’s state and legal operations on that state. For a class designed to run in a multithreaded environment, the test space increases dramatically, due to the possible interleaving of the operations performed by the different threads on the object.

Besides the theoretical challenge, we have a practical one, which arises when simulating multiple caller threads in the test class and trying to perform assertions in those threads. This can be demonstrated by the following numeric counter use case. Suppose that we want to implement a thread-safe counter. We will provide two implementations, so we start by defining the counter interface, consisting of a single atomic operation:


public interface Counter {
    public int incrementAndGet();
}

Following are two implementations. The first works fine only for a single threaded scenario:


public class DumbCounter implements Counter {
    private int counter;
    @Override
    public int incrementAndGet() {
        return ++counter;
    }
}

The second one is based on AtomicInteger, so it is completely thread safe:

public class AtomicIntCounter implements Counter {
    private AtomicInteger atomicInt = new AtomicInteger();
    @Override
    public int incrementAndGet() {
        return atomicInt.incrementAndGet();
    }
}

Now we want to test these implementations using JUnit. The following test simulates 5 threads operating on the same counter. Each thread increments the counter 1,000,000 times, and expects to see strictly increasing values of the counter:

import static org.junit.Assert.assertTrue;

import java.util.concurrent.CountDownLatch;

import org.junit.Before;
import org.junit.Test;
public class TestCounters {
    private static final int THREADS_COUNT = 5;
    private Counter counter;

    @Before
    public void resetCounter() {
        counter = new DumbCounter();
    }

    @Test
    public void test() throws InterruptedException {
        final CountDownLatch latch = new CountDownLatch(THREADS_COUNT);
        for(int i = 0;i < THREADS_COUNT; i++) {
            new Thread(new Runnable(){

                @Override
                public void run() {
                    try{
                        int last = counter.incrementAndGet();
                        for (int i = 0; i < 1000000; i++) {
                            int value = counter.incrementAndGet();
                            assertTrue (value > last);
                            last = value;
                        }
                    } finally {
                        latch.countDown();
                    }
                }
            }).start();
        }

        latch.await();
    }

}

The CountDownLatch is used to wait for all threads to terminate before leaving the test method. The method resetCounter() initializes the counter before each test method is executed; it also chooses the particular implementation we want to test. In this case we test the non-thread-safe DumbCounter, so we expect the test to fail. However, when we run it we observe a strange behavior: JUnit reports that the test passed successfully, but we see the following error message printed to the error stream by each one of the 5 threads:

Exception in thread "Thread-{N}" java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:71)
at org.junit.Assert.assertTrue(Assert.java:34)
at org.junit.Assert.assertTrue(Assert.java:43)
at jnunit.TestCounters$1.run(TestCounters.java:31)
at java.lang.Thread.run(Thread.java:619)

The reason for this behavior is the usage of assertions outside of JUnit’s main thread. JUnit detects assertion failures by catching exceptions of type AssertionError thrown in its main thread. The framework is completely unaware of assertion errors being thrown in new threads spawned by the test methods. Consequently, these exceptions simply terminate the threads where they are thrown, and activate the default uncaught exception handler, which prints their details to the error stream. The main thread detects no errors, so it reports success.

In order to solve this situation, we need an asynchronous test runner that delegates any unchecked exceptions to the main thread. Here is a possible solution:


public class AsynchTester{
    private Thread thread;
    private volatile Error error;
    private volatile RuntimeException runtimeExc;

    public AsynchTester(final Runnable runnable) {
        thread = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } catch (Error e) {
                    error = e;
                } catch (RuntimeException e) {
                    runtimeExc = e;
                }
            }
        });
    }

    public void start() {
        thread.start();
    }

    public void test() throws InterruptedException {
        thread.join();
        if (error != null)
            throw error;
        if (runtimeExc != null)
            throw runtimeExc;
    }
}

The constructor receives the task to be executed, and start() triggers the asynchronous execution. The method test() waits for the task to terminate, and throws unchecked exceptions if needed: any RuntimeException or Error thrown by the test thread is re-thrown by test(), in order to reflect the actual termination status of the inner thread.

Now we can re-write the test() method of our test class as follows:

@Test
public void test() throws InterruptedException {
    AsynchTester[] testers = new AsynchTester[THREADS_COUNT];
    for(int i = 0;i < THREADS_COUNT; i++) {
        testers[i] = new AsynchTester(new Runnable() {

            @Override
            public void run() {
                int last = counter.incrementAndGet();
                for (int i = 0; i < 1000000; i++) {
                    int value = counter.incrementAndGet();
                    assertTrue (value > last);
                    last = value;
                }
            }
        });
        testers[i].start();
    }

    for(AsynchTester tester : testers)
        tester.test();
}

The method structure is very similar to the original one, and it is now easy to synchronize on the threads' termination. When testing DumbCounter with this code, we finally get the right behavior: the test fails, and points us to the position in the code where the assertion failed.

It is important to note that the solution presented here addresses only one aspect of multithreaded tests; I chose it because it is both easy to implement and uses a nice technique. However, there are test frameworks available on the web that are suitable for concurrent tests and cover other aspects of the problem. See for example the GroboUtils JUnit extension, and also ConTest, which uses an interesting approach for increasing the coverage of multithreaded tests.


Comparators – part 2

Posted by Eyal Schneider on May 19, 2010

In the post Comparators and sorted sets I discuss the “consistency with equals” rule, and the importance of following it when working with sorted sets.  Lately I saw a few cases of another faulty comparator implementation.  

As usual, I will demonstrate it with an example. Suppose we have a bank account class that encapsulates, among other things, the account id and the balance, in cents:  

 

public class Account {
    private int accountId;
    private int balanceCents;
 
    public Account(int accountId, int cents) {
        this.accountId = accountId;
        this.balanceCents = cents;
    }
 
    public int getBalanceCents() {
        return balanceCents;
    }
 
    public int getAccountId() {
        return accountId;
    }
 
    public String toString() {
        return accountId + ":" + balanceCents;
    }
}

We want to be able to sort accounts in ascending balance order, so we implement a comparator for that purpose:  

 

private class BalanceComparator implements Comparator<Account> {
    public int compare(Account o1, Account o2) {
        return o1.getBalanceCents() - o2.getBalanceCents();
    }
}

Since the compare method should return a negative integer, zero, or a positive integer as the first argument is less than, equal to, or greater than the second, we simply subtract the balances in order to determine which one is greater. 

Let’s test this implementation: 

 

Account[] items = {new Account(1,100),new Account(2,-200), new Account(3,1000),new Account(4,0)};
Arrays.sort(items, new BalanceComparator());
System.out.println(Arrays.toString(items));

The output is ordered correctly:

[2:-200, 4:0, 1:100, 3:1000]

However, there are some cases where Arrays.sort() fails to sort the array correctly. This happens with the following input array for example:


Account[] items = {new Account(1,2000000000),new Account(2,-2000000000), new Account(3,1000),new Account(4,0)};

Now the output is not as expected:

[1:2000000000, 2:-2000000000, 4:0, 3:1000]

What did we do wrong? The flaw lies in the comparator implementation. Although it is very tempting to use the elegant one-liner instead of conditional statements, it turns out to be wrong in our case. Like all numeric primitives, ints have a limited range (-2^31 to 2^31-1), so subtracting two values may result in an arithmetic overflow. When this happens, no runtime exception is thrown, and the returned value has the wrong magnitude and sign. For example, if we subtract the int -2147483648 from the int 2147483647 we get -1 (whose unsigned interpretation, 2^32-1, is the correct result). In the code above, a wrong sign means wrong decisions by the sorting algorithm.
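
The wrap-around is easy to reproduce directly:

int a = Integer.MAX_VALUE;  // 2147483647
int b = Integer.MIN_VALUE;  // -2147483648
System.out.println(a - b);  // prints -1; the true difference, 2^32-1, does not fit in an int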

If we know for sure that an int overflow is impossible (e.g. if the numbers being compared are always positive ints, or if we compare bytes), then we can use the one line comparison technique. In all other cases, we must use explicit conditions:

public int compare(Account o1, Account o2) {
    int balance1 = o1.getBalanceCents();
    int balance2 = o2.getBalanceCents();   
    if (balance1 > balance2)
        return 1;
    if (balance1 < balance2)
        return -1;
    return 0; 
}
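
As a side note for readers on Java 7 or later: Integer.compare(x, y) performs exactly this overflow-free comparison, so the body can safely shrink back to a one-liner:

public int compare(Account o1, Account o2) {
    return Integer.compare(o1.getBalanceCents(), o2.getBalanceCents()); // no subtraction, no overflow
}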


Taking Random Samples

Posted by Eyal Schneider on April 1, 2010

Sometimes we need to take a random sample from a given collection of objects. This is useful in simulations, games, and statistical applications.  In this post I discuss some efficient algorithms for choosing m items out of n available, such that the m items are chosen at random, with equal probability for all possible subsets of size m. For each algorithm I present the code in Java, as well as a formal analysis of its performance and correctness, where needed. As you may have already noticed from the introduction, this post is more theoretic than usual…

A somewhat related problem is that of generating a random permutation of a given collection. While Java provides an efficient utility method for that purpose (see Collections.shuffle(..), based on the Knuth shuffle algorithm), there is no built-in utility for generating random subsets.

Before we start discussing some known algorithms, it is important to note that there are many variations of the problem, depending  on the following parameters:

  1. Is the collection random access (e.g. ArrayList)?
  2. Is the collection read only?
  3. Do we allow random running time?
  4. Is N known at invocation time, or are we dealing with a stream of items of unknown size?
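
One note before diving into the code: the snippets in this post reference a field rnd that none of them declares; presumably it is a shared random generator along the lines of:

private static final Random rnd = new Random(); // java.util.Random, shared by all samplers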

Trial and Error

We will start with the simplest approach. Assuming  that the collection to take the sample from is random access, we can repeatedly add random items from the collection to our result set, until the set contains m different items. As an optimization, when m > n/2, we can choose (n-m) items instead of m, and then return the rest:

public static <T> Set<T> randomSample1(List<T> items, int m){
    HashSet<T> res = new HashSet<T>();
    int n = items.size();
    if (m > n/2){ // The optimization
        Set<T> negativeSet = randomSample1(items, n-m);
        for(T item : items){
            if (!negativeSet.contains(item))
                res.add(item);
        }
    }else{ // The main loop
        while(res.size() < m){
            int randPos = rnd.nextInt(n);
            res.add(items.get(randPos));
        }
    }
    return res;
}

Clearly, the number of iterations in the main loop is not bounded. However, we can compute the expected number of iterations, which makes this algorithm a Las Vegas algorithm. The iterations can be split into m sequences (numbered 0 to m-1), where sequence i consists of the attempts needed to add the (i+1)-th item to the result set. By observing that the length of sequence i has a geometric distribution with p=(n-i)/n, we can calculate the expected number of iterations as follows:

E(m,n) = \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + ... + \frac{n}{n-m+1}
Let us also define E(0,n)=0.

If we examine E(m,n) as a function of m only, in the interval m=0 to m=\lfloor\frac{n}{2}\rfloor, it can be easily verified that the function is increasing and convex. Therefore we have:
E(m,n) \leq \frac{E(\lfloor\frac{n}{2}\rfloor,n)}{\lfloor\frac{n}{2}\rfloor} \cdot m=\frac{n\cdot(\frac{1}{n} + \frac{1}{n-1} + ... + \frac{1}{n-\lfloor\frac{n}{2}\rfloor+1})}{\lfloor\frac{n}{2}\rfloor} \cdot m \leq \frac{n\cdot\lfloor\frac{n}{2}\rfloor\cdot\frac{1}{\lfloor\frac{n}{2}\rfloor}}{\lfloor\frac{n}{2}\rfloor} \cdot m \leq 3m

In other words the expected time complexity is linear in m, which is optimal. This is a little surprising, given the naive nature of the algorithm, and it results from our optimization.

Swapping

If our collection is random access and its items can be freely reordered, then we can efficiently draw random items one by one from a candidate set containing only the items not chosen so far. By swapping items, we can guarantee that the candidate set is always contiguous. As a matter of fact, this algorithm is a bounded version of the Knuth shuffle algorithm for generating random permutations, and its correctness is trivial.

public static <T> List<T> randomSample2(List<T> items, int m){
    for(int i=0;i<m;i++){
        int pos = i + rnd.nextInt(items.size() - i);
        T tmp = items.get(pos);
        items.set(pos, items.get(i));
        items.set(i, tmp);
    }
    return items.subList(0, m);
}

Full Scan

Sometimes our collection is not random access, so in the worst case we have no choice but to traverse it completely. Following is an elegant solution that iterates once through the items, selecting each item with probability (#remaining to select)/(#remaining to scan):

public static <T> List<T> randomSample3(List<T> items, int m){
    ArrayList<T> res = new ArrayList<T>(m);
    int visited = 0;
    Iterator<T> it = items.iterator();
    while (m > 0){
        T item = it.next();
        if (rnd.nextDouble() < ((double)m)/(items.size() - visited)){
            res.add(item);
            m--;
        }
        visited++;
    }
    return res;
}

Clearly, the running time is O(n), which is optimal given the constraints.
It is left to prove that for any collection C and any subset S, the chances of generating S are {1 \over {|C| \choose |S|}}. We can do it
by induction on the size of C.
For |C|=1 and |S| between 0 and 1 this is trivial. Now, let C be an ordered collection and let S be a subset of C. When the algorithm starts traversing C, it first encounters item v. If v \in S, we should choose it and then proceed by choosing items S-{v} from the collection C-{v}. We can use our induction assumption to calculate the probability of this:
p(C,S)={|S|\over|C|}\cdot{1\over{|C|-1\choose |S|-1}}={1\over{|C|\choose |S|}}

If on the other hand v \notin S, then we should reject v and proceed by choosing items S from the collection C-{v}. This happens with probability:
 
p(C,S)={|C|-|S|\over|C|}\cdot{1\over{|C|-1\choose |S|}}={1\over{|C|\choose |S|}}

In either case we get the required probability.

Floyd’s Algorithm

What happens if we don't want the time complexity to depend on N, and the collection is read only?
In this case, assuming the collection is random access, we can use Floyd's algorithm, which is both brilliant and easy to implement. It iterates with a variable i over the last m indexes of the collection, and on each iteration adds a single item from the range 0..i, with a non-uniform distribution:


public static <T> Set<T> randomSample4(List<T> items, int m){
    HashSet<T> res = new HashSet<T>(m);
    int n = items.size();
    for(int i=n-m;i<n;i++){
        int pos = rnd.nextInt(i+1);
        T item = items.get(pos);
        if (res.contains(item))
            res.add(items.get(i));
        else
            res.add(item);
    }
    return res;
}

The time complexity is O(m) on average, but we can bound the worst case by O(m log(m)) if we use a balanced tree instead of a hash set.
The correctness follows by proving the following loop invariant: after completing the iteration with i=j, the set res contains a uniformly distributed random subset of size j-n+m+1 of the items in the range 0..j. We prove this by induction on i. For the initial value of i (i=n-m), we simply choose a random position in the range 0..i, so res is indeed a random subset of size 1 of that range. Now, let i=j (>n-m), and let S be any subset of size j-n+m+1 of the range 0..j. If items[j] is in S, then the previous iteration must have completed with res=S-{items[j]}, and then either position j or any previously selected position must be chosen. Using the induction assumption, we can compute the probability of obtaining S as follows:

p_1 = {1 \over {j\choose |S|-1}}\cdot{|S|\over j+1}={1 \over {j+1 \choose |S|}}

If on the other hand items[j] is not in S, then we have many options of selecting |S|-1 items in the previous iterations, and then we should choose the remaining index:

p_2 = {|S| \choose |S|-1}{1 \over {j\choose |S|-1}}\cdot{1\over j+1}={1 \over {j+1 \choose |S|}}

In both cases we have a uniform probability for choosing |S| items from j+1 candidates, and that completes the proof.

Stream of items


Sometimes we don't know the collection size in advance; we can only iterate through it, as if it were a data stream of unknown size. The following algorithm (known as reservoir sampling) performs only one pass over the input stream. While iterating, it maintains a set of items that represents a random subset (of size m) of the items visited so far. It starts by selecting the first m items, and then selects the k-th item with probability m/k. If an item is selected, it replaces a randomly chosen item among the previously selected ones:

public static <T> List<T> reservoirSample(Iterable<T> items, int m){
    ArrayList<T> res = new ArrayList<T>(m);
    int count = 0;
    for(T item : items){
        count++;
        if (count <= m)
            res.add(item);
        else{
            int r = rnd.nextInt(count);
            if (r < m)
                res.set(r, item);
        }
    }
    return res;
}

Each iteration takes constant time; therefore the algorithm runs in O(n) time, where n is the stream length.
The correctness can be proved by induction on the stream size. We want to prove that for every stream T and every value of m (m<=|T|), all subsets of T of size m are equally likely to be returned. The base case is a stream of size m. In this case we have only one possible subset, and the algorithm always returns it. Now assume that we have a given stream T, and we know that the induction property holds for stream lengths below |T|. Let S be a subset of T. We shall inspect the last item v in the stream.
If v is not in S, then we must have chosen all items of S by the time we completed the previous iteration. We should also reject the last item:
p_1 = {1 \over {|T|-1 \choose |S|}}\cdot{|T|-|S|\over |T|}={1 \over {|T|\choose|S|}}

If on the other hand v is in S, then we should have the other |S|-1 items of the sample in the previous iteration already (plus an extra item among the |T|-|S| possible ones), and then we should choose v and make it replace the extra item:
p_2 = {|T|-|S| \over {|T| \choose |S|}}\cdot{|S|\over |T|}\cdot{1\over |S|}={1 \over {|T|\choose|S|}}

In both cases we get the required probability.

A note about random number generators

The algorithms above assume that the random number generator behaves like a true random source, but this is rarely the case. The class java.util.Random uses a very popular pseudo-random number generation approach, called a Linear Congruential Generator. Because every generated number is determined by the previously generated one (or, initially, the seed), there is a bounded number of sequences that can be produced. Java's Random class uses a state record of 48 bits, so we can never exceed 2^{48} different sequences. This has a huge impact when generating random constructs from a large combinatorial space, such as subsets or permutations. For example, if we are interested in generating a random subset of size 20 from a set of size 60, we have {60 \choose 20} options, which exceeds 2^{48}. Therefore, the results of any algorithm that uses java.util.Random are biased, because there are many possible subsets that will never be returned. In fact, we can cover only about 7% of the subset space! For many applications this behavior is good enough. For others, which require more accuracy, we should consider a different random source than java.util.Random. After searching the web a little, I found the RngPack library, which seems to have some powerful random number generator implementations.
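
For example, since java.security.SecureRandom extends java.util.Random, it can serve as a drop-in replacement for the shared rnd field above (a sketch; it is slower, but is not limited to a 48-bit state):

private static final Random rnd = new SecureRandom(); // requires the java.security.SecureRandom import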

Sources

http://delivery.acm.org/10.1145/320000/315746/p754-bentley.pdf?key1=315746&key2=0487028621&coll=GUIDE&dl=GUIDE&CFID=79451466&CFTOKEN=34515112
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
http://en.wikipedia.org/wiki/Linear_congruential_generator
http://comjnl.oxfordjournals.org/cgi/reprint/25/1/45.pdf
http://comjnl.oxfordjournals.org/cgi/reprint/28/4/412.pdf
http://www.jstor.org/stable/pdfplus/2347297.pdf
http://stackoverflow.com/questions/48087/select-a-random-n-elements-from-listt-in-c#
http://stackoverflow.com/questions/136474/best-way-to-pick-a-random-subset-from-a-collection


Parameters and returned values

Posted by Eyal Schneider on February 11, 2010

When designing a class, we often deal with the question of how to protect against sloppy/malicious clients of our API. One of the major risks comes in the form of method parameters and returned values. References to these two offer a “backdoor” by which the client can mess with the class’ state in an uncontrolled manner.

The problem

To illustrate the problem, consider the following example, consisting of a movie collection managed by a video library.
A movie is represented as a simple bean:

public class Movie {
    private String title;
    private Genre genre;

    public Movie(){
    }

    public Movie(String title, Genre genre) {
        this.title = title;
        this.genre = genre;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Genre getGenre() {
        return genre;
    }

    public void setGenre(Genre genre) {
        this.genre = genre;
    }

    public int hashCode(){
        return (title == null? 0 : title.hashCode()) ^ (genre == null? 0 : genre.hashCode());
    }

    public boolean equals(Object other){
        if (!(other instanceof Movie))
            return false;
        Movie otherMovie = (Movie)other;
        return (title != null? title.equals(otherMovie.title) : otherMovie.title == null) &&
                genre == otherMovie.genre;
    }
}

The video library class manages a set of movies, allowing efficient retrieval of all movies for a given genre:

public class VideoLibrary {
    private Set<Movie> movies = new HashSet<Movie>();
    private Map<Genre,Set<Movie>> moviesByGenre = new HashMap<Genre,Set<Movie>>();

    public void addMovie(Movie movie){
        if (movies.add(movie)){
            Genre genre = movie.getGenre();
            Set<Movie> genreMovies = moviesByGenre.get(genre);
            if (genreMovies == null){
                genreMovies = new HashSet<Movie>();
                moviesByGenre.put(genre, genreMovies);
            }
            genreMovies.add(movie);
        }
    }

    public Set<Movie> getByGenre(Genre genre){
        Set<Movie> movies = moviesByGenre.get(genre);
        if (movies == null)
            movies = Collections.emptySet();
        return movies;
    }
}

For simplicity, the implementation assumes a single genre per movie.
If we inspect the two data members and the logic associated with them, we can clearly see that they follow an “internal contract” (or class invariant):
1) The set movies and the union of all movie sets inside moviesByGenre are identical (i.e. contain the same set of movies).
2) For every pair (K,V) in moviesByGenre, all movies in V have the same genre K.
When this invariant is broken, the object is in an inconsistent state, and can’t respect the contract with its clients anymore.

Now, consider the following usage of the class:

    VideoLibrary lib = new VideoLibrary();
    Movie pulpFiction = new Movie("Pulp Fiction",Genre.CRIME);
    lib.addMovie(pulpFiction);
    pulpFiction.setGenre(Genre.COMEDY);
    //What will the next call return?
    Set<Movie> crimeMovies = lib.getByGenre(Genre.CRIME);

The fact that the caller has references to mutable objects inside lib allows him to modify the internal state of the library. In this example we violated item (2) of the invariant, ending up with an inconsistent library, where a movie tagged as a comedy is returned for a query on crime movies. Furthermore, we have also corrupted the movies data member, because we modified items in the HashSet after adding them (i.e. the HashSet invariant regarding the hash code of entries was also broken).
We can also cause damage by manipulating a returned value instead of a method parameter. In the following example we break both components of the invariant:

    Set<Movie> crimeMovies = lib.getByGenre(Genre.CRIME);
    crimeMovies.add(new Movie("The Big Lebowski",Genre.COMEDY));
    //What will the next call return?
    crimeMovies = lib.getByGenre(Genre.CRIME);

Note that in both examples we managed to ruin the library’s state without even using its API. We never violated its methods’ preconditions, and yet we have damaged it.
Generally speaking, changing an object’s state by actions external to its API is not necessarily a problem. When we use an ArrayList for example, we can freely mutate the state of the stored objects. So how do the ArrayList and VideoLibrary differ? They differ in their invariants. While ArrayList is indifferent to the contents of the objects added to it, VideoLibrary can’t tolerate changes to Movie objects stored in it, since they may violate the consistency of its data members.

Summing up, any class that exposes internal references through method parameters or returned values has a data integrity risk if the following two are true:
1) The exposed references belong to mutable classes
2) Mutation of these objects can break the invariant of the class

In C++, it is common to solve this problem using the const qualifier on parameters/returned values.  If we use it wisely, we can share an internal reference with the API user, without putting the data integrity at risk. Of course, the user of the API can still do the damage if he explicitly performs a cast to the non-const equivalent data type, hence the solution is not perfect.
Java’s final qualifier is merely equivalent to a constant pointer in C++ (e.g. char * const); unfortunately there is no Java equivalent to C++’s constant referent (e.g. const char *).
It is interesting to note that Java does have the const qualifier as a reserved keyword, but its use is illegal. I guess that the language designers wanted to have the option to introduce some const mechanism into a future version of Java…
Meanwhile, we have a variety of other techniques for protecting the data members of our classes. Following is a list of some of these techniques, with some examples from the video library use case.

Documentation

Maybe the easiest way of dealing with the problem is to return the responsibility back to the client of our class. This requires providing method documentation that clearly states what the client can’t do with the parameters/returned values.
The class java.util.HashMap for example takes this approach. When a client adds a pair to the map, he should make sure that the key is not altered as long as the pair is in the map. Otherwise, the hashcode of the key may change, and the map would not be able to guarantee correctness anymore. Unfortunately, the method HashMap.put(..) does not make this clear in its documentation…

In the case of the video library, we should add the proper documentation to VideoLibrary.addMovie(..) and VideoLibrary.getByGenre(..).

While this approach requires no coding and has no performance cost, it has two major disadvantages:

1) It requires programmers to be aware of “object ownership”. In our example, once we add a movie object to the library, the library becomes the “owner” of the movie object. The caller transfers the ownership and it is not allowed to mutate the movie anymore. Furthermore, the caller is not allowed to pass the movie to a method that assumes ownership on the parameter, and so on.
The ownership awareness is very similar to the way we manage object disposal in languages such as C++, where no garbage collector is available. The ownership in that case specifies who can dispose the object, rather than who has write permissions on it.

2) It is very easy for a programmer to mistakenly ignore the ownership rules, and once the “unauthorized” write occurs at runtime, the problem may manifest itself only in much later stage, possibly in unrelated parts of the code. This makes these kind of bugs very difficult to analyze.

Defensive copies

A very common approach is to hide internal references completely from the API user, by copying parameters right after receiving them, and copying returned values just before returning them. Cloneable objects make things easier. We will start by making Movie cloneable:

public class Movie implements Cloneable{
    ....
    public Movie clone(){
        try {
            return (Movie)super.clone(); // The default clone in our case is enough
        } catch (CloneNotSupportedException e) {
            // Unreachable
            return null;
        }
    }
}

Next we can fix VideoLibrary.addMovie(..) as follows:

    public void addMovie(Movie movie){
        movie = movie.clone();
        ....
    }

The method getByGenre has to perform a deep clone. We want to protect both the collection and the items in it:

public Set<Movie> getByGenre(Genre genre){
    Set<Movie> movies = moviesByGenre.get(genre);
    if (movies == null)
        return Collections.emptySet();

    Set<Movie> res = new HashSet<Movie>();
    for (Movie movie : movies)
        res.add(movie.clone());
    return res;
}

Note that when returning an array reference from a method, there is no other choice but to perform a defensive copy. The other techniques described below are useless in the case of arrays.
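
For example, if the library kept a plain array internally, a safe accessor would look like this sketch (titles is a hypothetical field):

public String[] getTitles() {
    return Arrays.copyOf(titles, titles.length); // hand out a copy, never the internal array
}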

While defensive copying guarantees a safe API, it carries a performance penalty, especially when working with large collections.

Immutability

When possible, we can make the classes of the parameters/returned values immutable. The main class still shares references with the outside world, but no one can modify their state, so we are safe. If there is a need to mutate such objects,  we simply derive new ones from existing ones (see java.lang.String or java.math.BigInteger for example).
In our example, an immutable version of Movie would be:

public final class Movie {
    private final String title;
    private final Genre genre;

    public Movie(String title, Genre genre) {
        this.title = title;
        this.genre = genre;
    }

    public String getTitle() {
        return title;
    }

    public Genre getGenre() {
        return genre;
    }

    public int hashCode(){
        ....
    }

    public boolean equals(Object other){
        ....
    }
}

Once Movie is immutable, we can leave addMovie(..) implementation unchanged. However, getByGenre(..) requires more attention – the concrete type being returned (HashSet) is mutable. Here we can do a shallow defensive copy:

public Set<Movie> getByGenre(Genre genre){
    Set<Movie> movies = moviesByGenre.get(genre);
    if (movies == null)
        movies = Collections.emptySet();
    return new HashSet<Movie>(movies);
}

Sometimes we don’t really need full immutability of the class used as parameter/returned value. We only need to make immutable the parts involved in our main class’ invariant. However, by making the class immutable we enjoy other benefits of immutable objects:

  • They are easy to analyze and debug
  • There is no need to implement clone()
  • They are always thread safe
  • Their hash code can be cached for improved performance

As a matter of fact, Joshua Bloch, in his famous book Effective Java, recommends always making classes immutable, unless there is a good reason not to.

Decorator

The Decorator design pattern can be useful for the case of returned values. Instead of returning an internal reference, we return a wrapper that implements the same interface, but slightly modifies the behavior in order to protect our main class’ invariant.
One option is a “restrictive decorator”, which disables all modifier methods by throwing an exception whenever they are invoked.
The class java.util.Collections provides such decorators for sets, maps, lists, sorted maps, sorted sets and general collections. Assuming that we use the immutable version of Movie, the method getByGenre(..) can be re-written as follows:

public Set<Movie> getByGenre(Genre genre){
    Set<Movie> movies = moviesByGenre.get(genre);
    if (movies == null)
        movies = Collections.emptySet();
    return Collections.unmodifiableSet(movies); // Returns a read only view of movies
}

When applicable, we should prefer immutability, since restrictive decorators provide somewhat weaker protection: violations are caught at runtime rather than at compile time. However, in comparison to the defensive copy technique, which works in time proportional to the collection size, decoration runs in constant time and is therefore a very efficient technique for collections as returned values.

As an alternative to a restrictive decorator, one can implement a decorator that allows write access, but performs the required data adjustments behind the scenes. In the case of Movie, we can implement an inner class for that purpose:

private class SafeMovie extends Movie{
    private Movie movie;

    public SafeMovie(Movie movie) {
        this.movie = movie;
    }
    public String getTitle(){
        return movie.getTitle();
    }
    public void setTitle(String title) {
        boolean removed = removeMovie(movie);
        movie.setTitle(title);
        if (removed)
            addMovie(movie);
    }
    public Genre getGenre() {
        return movie.getGenre();
    }
    public void setGenre(Genre genre) {
        boolean removed = removeMovie(movie);
        movie.setGenre(genre);
        if (removed)
            addMovie(movie);
    }
}

This code assumes the existence of a private method removeMovie(Movie), which completely removes a movie from the collection and returns a flag indicating whether the movie existed in the first place.
This wrapper can be used whenever returning movie objects from our public methods.

What about enumerations and iterators as returned values? While the obsolete Enumeration type is read only and safe to use as a return type, Iterator has the optional ability to remove items (and ListIterator adds the ability to set and add items as well). Therefore, decoration is also relevant for iterators. The Apache Commons project includes a decorator for that purpose: org.apache.commons.collections.iterators.UnmodifiableIterator.
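
If adding a dependency is undesirable, such a decorator is also easy to write by hand (a minimal sketch):

public static <T> Iterator<T> unmodifiableIterator(final Iterator<T> it) {
    return new Iterator<T>() {
        public boolean hasNext() { return it.hasNext(); }
        public T next() { return it.next(); }
        public void remove() { throw new UnsupportedOperationException("read-only iterator"); }
    };
}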

Read only interfaces

Sometimes when designing a class we separate the modifier and query operations into separate interfaces. Such classes are “friendly” as returned values, because we can simply return the read-only interface instead of the class itself. While this is not completely safe, because a malicious client can still cast the object to its modifiable type, it is a reasonable way of stating intent and preventing unintentional modifications by the client.
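
In the video library example, the query side could be captured by an interface like the following sketch (ReadableMovie is a hypothetical name); Movie would implement it, and methods such as getByGenre(..) would expose it in their return types:

public interface ReadableMovie {
    String getTitle(); // query operations only; the setters stay on Movie
    Genre getGenre();
}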

Listeners

In some cases, we want to share an internal reference with the client, and not restrict the possible actions on it. Instead, we want to make sure that whenever the referenced object is altered, our main class is adjusted accordingly, in order to maintain its invariant. To accomplish this, we can follow the Observer design pattern. The main class registers listeners in all internal objects it wants to share, and they in turn notify the main class whenever they change their state.
In the Java libraries, this technique is very common in Swing. The class JTable for example is a visual component based on a tabular data model (TableModel). This duality allows a clear separation between the data itself and its visual representation. The table model is a data member of JTable, and it is being shared with the outside world both as a returned value and as a constructor parameter. At construction time, the JTable registers itself as a listener for data changes in the model. Therefore, any changes done by the client to the data model itself will be reflected in the UI.

While this technique can be appropriate in some cases, it is not a generic solution, because it is not applicable in many cases:

  • While a mutation of the shared object maintains its own invariant, it is not necessarily a legal operation from the main class point of view; the listener can not always “fix” the invariant of the main class for any change of the shared object.
  • The main class has to be extra careful when it mutates the shared objects by itself, since this triggers the listeners as well. If it does so in the middle of a transaction, the class may be in a completely inconsistent state when the listeners are executed. This complicates the implementation of the listeners.

Summary

Protecting the state of a class becomes relevant only if it shares mutable references with its clients (through parameters/returned values), and mutations of these referenced objects have the potential of breaking the class invariant.
However, we should not always apply protection even under these circumstances. If we trust our clients, and the performance impact of any of the applicable protection techniques is unacceptable, we may want to leave the class unprotected, and add detailed documentation regarding the client’s responsibility.
Remember that API documentation is important anyway, even if we apply active protection. If we perform a defensive copy, it is nice to let the client know it and avoid another defensive copy on his part.  Similarly, if we return a decorated object, the client should be aware of the consequences of using the modifier methods.

What is the price of not protecting a class from reference-inflicted damage? An unprotected class is never a module by itself. It cannot be tested as a stand-alone unit, and any bug in other classes can easily “infect” it. Systems built of such classes are difficult to debug, since it is harder to isolate problems and analyze them.

Lastly, it’s worth noting that a client can always ruin our class invariant if he really wants to. Although Java does not allow pointer manipulations which can easily break data integrity, it does provide the ability to cause a similar damage by means of reflection. Another way of breaking invariants can work on classes which are not thread safe – race conditions are likely to cause data corruption.

I would like to thank Dr. David Faitelson, my friend and tutor, for inspiring this blog post with his work on the theory of modularity.



Exceptional exceptions

Posted by Eyal Schneider on December 27, 2009

From time to time, I find myself wondering what happens when an exception is thrown from some exotic place in the code. It is not always trivial how these exceptions are handled, and what are the consequences. In this post I summarize some of these cases.

Exception in static initialization code

Static initialization occurs when a class is loaded by a class loader. Its code can either come in the form of a static block (i.e. static {...}), or as a call to a static method that initializes a static data member. In both cases, checked exceptions are not allowed by the compiler. When an unchecked exception occurs, on the other hand, it is wrapped in an ExceptionInInitializerError, which is then thrown in the context of the thread that triggered the class loading. The following code snippet produces this scenario:

 
class Useless {
    static {
        if (true) // Bypasses the compiler check for obvious exceptions in initializers
            throw new RuntimeException();
    }
}  

public class TestUselessClass {
    public static void main(String[] args) {
        try {
            new Useless(); // Triggers class loading
        } catch (ExceptionInInitializerError e) {
            e.printStackTrace();
        }
        new Useless();
    }
} 

 

Since class loading is lazy, the class gets loaded and initialized only when we first try to create an instance of it. The class loading fails because initialization can't complete, and the ExceptionInInitializerError is thrown. The second new Useless() statement tries to use the class anyway. The class loading will not be attempted again, and this time we get a NoClassDefFoundError. This error is more severe than ClassNotFoundException, and it indicates in this case that we are trying to use a class that is in an erroneous state due to a failed initialization.

It is important to note that in some cases the exception is not wrapped in an ExceptionInInitializerError. If the static block above threw an Error instead of a RuntimeException, the JVM would propagate the original exception without wrapping it.

Summing up, we should be careful with complex static initializers, and be aware that an exception during initialization makes the class (and possibly the whole application) useless. Unless we want to report a fatal situation, we should not throw a runtime exception from a static initializer.

Exceptions in constructors

 

Exceptions in constructors are not rare at all. A checked exception can be used to indicate a “legitimate” problem when trying to create an instance, while an unchecked exception typically indicates a bug either in the client code or in the constructor itself. In both cases, the object is actually allocated in the heap, but a reference to it is not returned. The object remains in a partially initialized state until it gets garbage collected. We learn from these facts that publishing a reference to the object from the constructor itself (via the “this” reference) is risky, since we may give access to an object in an invalid state.

Another interesting thing to note about exceptions in constructors is related to reflection. When we need to invoke the empty constructor using a class object clzz, we sometimes use the method clzz.newInstance(). Reading its documentation reveals something unusual: any exception thrown by the constructor is propagated without change. In other words, the newInstance method may throw checked exceptions that it does not even declare(!). We can exploit this flaw to bypass the compiler's exception checking:

 
public class UnsafeThrower {
    private static Exception toThrow; 

    private UnsafeThrower() throws Exception{
        throw toThrow;
    } 

    public static synchronized void throwExc(Exception e){
        toThrow = e;
        try {
            UnsafeThrower.class.newInstance();
        } catch (InstantiationException e1) {
            throw new IllegalArgumentException("Can't deal with InstantiationException");
        } catch (IllegalAccessException e1) {
            throw new IllegalArgumentException("Can't deal with IllegalAccessException");
        } finally{
            toThrow = null;
        }
    }
} 

 

Since newInstance declares two checked exceptions (InstantiationException and IllegalAccessException), the trick cannot work with these two exceptions, and we get an IllegalArgumentException instead.

To summarize, we should avoid using Class.newInstance, since it does not handle constructor exceptions correctly. Instead, we should use Constructor.newInstance, which wraps any constructor exception in an InvocationTargetException.
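
A sketch of the safer variant, using a hypothetical Failing class whose constructor throws (the reflective calls declare checked exceptions, so this is assumed to run in a context that propagates them):

class Failing {
    Failing() throws Exception {
        throw new Exception("thrown from the constructor");
    }
}

Constructor<Failing> ctor = Failing.class.getDeclaredConstructor();
try {
    ctor.newInstance();
} catch (InvocationTargetException e) {
    System.out.println("Constructor threw: " + e.getCause()); // the original exception, properly wrapped
}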

Exceptions in a finally block 

 

A finally block is executed immediately after the try block finishes executing, either gracefully or with an exception. A return value or exception produced by the finally block overrides any return value or exception that previously ended the try block.

The following flawed method implementation illustrates such a scenario:  

 
public static int countLines(String filePath) throws IOException {
    int count = 0;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(filePath));
        while (br.readLine() != null)
            count++;
    } finally {
        br.close(); // May cause a NullPointerException!
    }
    return count;
}

 

The method countLines is supposed to throw an IOException in all cases where an IO-related problem is encountered. However, when the file is not found we get a NullPointerException instead, which hides the original IOException: the FileReader constructor throws before br is assigned, so br is still null when the finally block runs. The solution is to add a null check on br before closing it.
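
A corrected sketch:

public static int countLines(String filePath) throws IOException {
    int count = 0;
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(filePath));
        while (br.readLine() != null)
            count++;
    } finally {
        if (br != null) // the FileReader constructor may have thrown before br was assigned
            br.close();
    }
    return count;
}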

Exceptions in finalizers

Finalizers are intended for disposing of system resources or doing other kind of cleanup. We implement a finalizer by overriding Object.finalize().
What happens when we throw an exception from within an implementation of the finalize() method?  

The documentation of Object.finalize() states that:  

“Any exception thrown by the finalize method causes the finalization of this object to be halted, but is otherwise ignored.”   

 In other words, the thrown exception is ignored, and the object will eventually be garbage collected (assuming that no finalizer made this object reachable again).  

Let’s put this statement to the test:  

   
public class FinalizerTester {
    public void finalize() {
        throw new RuntimeException("Halting the finalizer...");
    }  

    public static void main(String[] args) {
        while (true)
            new FinalizerTester();
    }
}  

 

We would expect this code to run indefinitely, performing GC continuously. However, running it on my machine ends with an OutOfMemoryError. Does this prove that the documentation is wrong, and that an exception in the finalizer does prevent the object from being garbage collected after all?

Inspecting the error's stack trace supplies a clue about what really happens:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.ref.Finalizer.register(Finalizer.java:72)
    at java.lang.Object.<init>(Object.java:20)
    at exceptions.FinalizerTester.<init>(FinalizerTester.java:3)
    at exceptions.FinalizerTester.main(FinalizerTester.java:10)
  

Whenever an object with a non-empty finalizer is created, the JVM registers the object in the finalizer queue. The registration does not actually add the object to the queue; it only makes sure that the object enters the queue later, if and when it is no longer reachable. A dedicated thread (called “Finalizer” in Sun’s JVM) works in the background, removes objects from the queue, and executes their finalizers. Sometimes the rate at which objects become ready for finalization is greater than the rate at which the finalizer thread empties the queue. As a consequence, the queue may grow without bound and cause an OutOfMemoryError. In our case, this is exactly what happens: we slowed down the dequeue rate by throwing an exception, which is a relatively expensive operation. In fact, any other time-consuming task inside finalize() can have the same effect, as illustrated below.
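For instance, this hypothetical variant replaces the exception with a sleep, and should fail the same way:

public class SlowFinalizer {
    @Override
    protected void finalize() throws InterruptedException {
        Thread.sleep(10); // any time-consuming work delays the entire finalizer queue
    }

    public static void main(String[] args) {
        while (true)
            new SlowFinalizer(); // objects queue up faster than they are finalized
    }
}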

We can conclude that:

1) An exception in an object’s finalizer may cause the cleanup to be incomplete, but it does not make the object non-collectable by the GC.
2) There are many good reasons for avoiding finalizers. One of them is that time-consuming finalizers delay the recycling of dead objects, possibly causing an OutOfMemoryError.

 

Posted in java, JDK packages | Tagged: , , , , , , , , , , , , , , | Leave a Comment »

Comparators and sorted sets

Posted by Eyal Schneider on November 23, 2009

Java offers two techniques for defining an ordering on a collection of objects. The first is implementing the Comparator interface, as a stand-alone comparison mechanism. The second is making the objects comparable, by having their class(es) implement the Comparable interface.

When the ordering is inherent to the class nature, as with class Integer, implementing the Comparable interface makes sense. In other cases, we don’t want to add specific ordering logic to the class implementation. Instead, we implement comparators as the need arises. We may also implement multiple comparators for the same class, thus obtaining multiple ordering schemes.

Regardless of the interface we choose to implement (Comparable/Comparator), there are some important rules to follow. Suppose that we need a class for maintaining a set of points in the plane, ordered by their distance from a fixed point. Let’s first define a class for representing a point:    

public class Point{
    private int x;
    private int y;
    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    public boolean equals(Object other){
        if (!(other instanceof Point))
            return false;
        Point otherPoint = (Point) other;
        return x==otherPoint.x && y==otherPoint.y;
    }

    public int hashCode(){
        return x ^ y;
    }

    public double distanceFrom(Point other){
        int dx = x-other.x;
        int dy = y-other.y;
        return Math.sqrt(dx*dx+dy*dy);
    }

    public String toString(){
        return "("+x+","+y+")";
    }

 }

And now let’s define the OrderedPointSet class, which uses a TreeSet with a comparator for maintaining the sorted set of points efficiently: 


public class OrderedPointSet {
    private SortedSet<Point> points;
    private Point refPoint;

    public OrderedPointSet(final Point refPoint){
        this.refPoint = refPoint;
        points = new TreeSet<Point>(new Comparator<Point>(){
            public int compare(Point p1, Point p2) {
                double dist1 = refPoint.distanceFrom(p1);
                double dist2 = refPoint.distanceFrom(p2);
                if (dist1 > dist2)
                    return 1;
                if (dist2 > dist1)
                    return -1;
                return 0;
            }
        });
    }

    public void add(Point p){
        points.add(p);
    }

    //returns the reference point, followed by the points in ascending order of distance from it
    public String toString(){
        return refPoint + ":" + points.toString();
    }

    //Other methods
    ....
}

We can now test our implementation: 

OrderedPointSet ps = new OrderedPointSet(new Point(0,0));
ps.add(new Point(1,5));
ps.add(new Point(10,10));
ps.add(new Point(5,20));
ps.add(new Point(5,5));
System.out.println(ps);

The output contains all of the points, in the right order:

(0,0):[(1,5), (5,5), (10,10), (5,20)]
 

Apparently the job is done. However, the implementation has a major flaw that can be detected with the following test scenario:

OrderedPointSet ps = new OrderedPointSet(new Point(0,0));
ps.add(new Point(5,10));
ps.add(new Point(2,11));
ps.add(new Point(5,5));
System.out.println(ps);

We expect to see 3 points in the output, but we see only two:   

(0,0):[(5,5), (5,10)] 

How did this happen? The first two points, though not equal, have the same distance from the reference point (the square root of 125). The TreeSet implementation (like other SortedSet implementations) never invokes the equals() method on its items; it relies solely on the comparator for determining equality. From its point of view, therefore, points (5,10) and (2,11) are equal. When trying to add the latter, the collection stays unchanged, because it obeys the contract of Set.add(), which leaves the set untouched when an “equal” element is already present. As a consequence, another part of Set’s contract is broken: we add two distinct items (according to the equals method), and yet one of them is rejected.
The problem originates from the comparator implementation above, which doesn’t define an ordering consistent with equals, as required by the SortedSet documentation. The formal definition is as follows:  

The ordering imposed by a comparator c on a set of elements S is said to be consistent with equals if and only if c.compare(e1, e2)==0 has the same boolean value as e1.equals(e2) for every e1 and e2 in S.        

Actually, for solving the problem described above it suffices to enforce only one direction of this rule: if two items are equal according to the comparator, they must be equal according to the equals() method. Failing to comply with the other direction results in a different violation of the Set interface – the collection may accept two items even though they are equal according to equals().
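To illustrate that other direction, consider a hypothetical comparator that orders points by object identity rather than by value (a sketch of my own; identity hash collisions, though possible, are ignored here):

SortedSet<Point> set = new TreeSet<Point>(new Comparator<Point>() {
    public int compare(Point p1, Point p2) {
        // Orders by object identity, so two value-equal points compare as different
        return System.identityHashCode(p1) - System.identityHashCode(p2);
    }
});
set.add(new Point(1, 1));
set.add(new Point(1, 1)); // equals() returns true, yet both are accepted
System.out.println(set.size()); // almost certainly prints 2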
Returning to our example, the fixed version of the comparator enforces consistency with equals by breaking distance ties of distinct points arbitrarily:

public int compare(Point p1, Point p2) {
    double dist1 = refPoint.distanceFrom(p1);
    double dist2 = refPoint.distanceFrom(p2);
    if (dist1 > dist2)
        return 1;
    if (dist2 > dist1)
        return -1;

    // Distance tie: order distinct points consistently by their coordinates
    // (assuming coordinates are small enough for the subtractions not to overflow)
    int dx = p1.getX() - p2.getX();
    if (dx != 0)
        return dx;
    return p1.getY() - p2.getY();
}
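With this fix, the failing scenario above retains all three points; the two equidistant points are now ordered by their x coordinates, so the output becomes (0,0):[(5,5), (2,11), (5,10)].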

Proving that the consistency rule holds for this code is easy:

Let p1,p2 be points.

(Direction 1) Assume that p1.equals(p2). Then p1.x=p2.x and p1.y=p2.y. In this case, when running compare(p1,p2), both distance comparisons fail and dx equals 0. The returned value is then p1.y-p2.y, which is 0, as required.
(Direction 2) Assume that compare(p1,p2)=0. The method must have terminated at the final return statement (none of the other return statements can return 0), so p1.y=p2.y. We also know that the dx condition was not met, hence p1.x=p2.x. In other words, p1 and p2 are identical, so p1.equals(p2) is true, as required.

Summing up: when implementing the Comparator (or Comparable) interface for use in a SortedSet implementation (or for SortedMap keys), make sure the comparator complies with the consistency-with-equals rule. This guarantees that the collection respects the Set contract and does not exhibit weird behaviours. After implementing the comparator, it may be wise to prove that the consistency property really holds.

Posted in java, JDK packages | Tagged: , , , , , , | 5 Comments »

Strings and Memory leaks

Posted by Eyal Schneider on October 27, 2009

java.lang.String is one of the most heavily used classes in Java. Interestingly, it is important to know some of its implementation details in order to avoid memory leaks.
Consider the following seemingly innocent piece of code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

 

public class StrangeStringThing {
    private static Random rnd = new Random();
    private static final int MEGA = 1024*1024;
 
    private static String createRandomStr(int length){
        StringBuilder sb = new StringBuilder(length);
        for(int i=0;i<length;i++)
            sb.append((char)('a'+rnd.nextInt('z'-'a'+1)));
        return sb.toString();
    }
 
    public static void main(String[] args){
        List<String> substrings = new ArrayList<String>();
        for(int i=0;i<100;i++){
            String randomStr = createRandomStr(MEGA);
            String subStr = randomStr.substring(1000,1004);
            substrings.add(subStr);
        }
    }
}

This code generates a very long random string, and then keeps only a 4 character long substring of it. This process is repeated 100 times. We expect to end this program with 100 strings of length 4.
Surprisingly, if we try to run this code with the heap size limited to 100MB (-Xmx100m), we get an OutOfMemoryError. How did that happen? All temporary large strings should have been cleared by the garbage collector, so the memory footprint should be really small. Furthermore, the documentation of java.lang.String does not indicate any misuse of the API.

How are Java Strings stored?

The Java language specification defines precisely the API of class String, its Unicode details, its immutability, and the way String literals should be handled. Naturally, it does not define how strings should be represented internally, nor the required time/space complexities of the different operations on them; these aren’t functional requirements. As a consequence, different JVM providers are free to choose their preferred design for class String. Sun’s implementation (and IBM’s as well) represents a string with a combination of 3 data members:

  • value – A reference to a char array
  • offset – A start index, inside the array
  • count – The actual length of the string.

When allocating a new string, the char array contains exactly the string content. Offset is set to 0, and count is set to the char array length. Then, when calling the method substring(..) upon it, the new string being returned contains a reference to the same char array, but its offset and count members are modified, reflecting the requested subsequence of chars. The sharing of the same char array by multiple String instances is possible, since strings are immutable. There are two benefits of this implementation approach in comparison to a substring implementation based on copying:
1) Memory usage is usually reduced, especially in cases where many substrings of the same string are taken, or where the substrings are long
2) substring(..) runs in constant time, instead of linear time
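To make the sharing concrete, here is a conceptual model of this representation (a sketch only, not the actual JDK source; the field names are the ones listed above):

// Conceptual model of the shared-array representation (not the real JDK code)
class SharedString {
    final char[] value;  // the char array, possibly shared with other strings
    final int offset;    // start index inside value
    final int count;     // length of this string

    SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // substring() just narrows the window; no characters are copied
    SharedString substring(int from, int to) {
        return new SharedString(value, offset + from, to - from);
    }
}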

Obviously there is a tradeoff here – if the string utilization pattern is not as described in (1), we may suffer from excessive memory consumption. The code above demonstrates this edge case: we take very small substrings of very large temporary strings. The temporary strings (referenced by randomStr) are indeed collected by the garbage collector during the loop. However, their internal char arrays cannot be collected, since they are shared with the substrings. The very long char arrays are therefore kept in memory unintentionally, and that constitutes a memory leak.

The solution is simple – forcing the substrings to be copied into a new, compact char array. This is done by replacing the substring line above with:

  String subStr = new String(randomStr.substring(1000,1004));

It looks like the API designers were aware of the possibility of substring being implemented as described, so they added a copy constructor specifically for avoiding the memory leak. Note that otherwise this constructor is useless, because we are dealing with an immutable class.

Is this a Java bug?

There is currently an open bug report in Sun’s bug database regarding this issue. It doesn’t look like it will be fixed soon, because doing so may introduce performance regressions in many existing programs. Besides, there is a very simple workaround (described above), which makes the priority of this bug relatively low.

Summary

Java developers should be aware of the prevalent String.substring(..) implementation and its potential for memory leaks. Whenever we know that the original string we take the substring from has a shorter life span than the substring itself, we should consider using the copy constructor to avoid sharing the underlying char array.

Posted in java, JDK packages | Tagged: , , , , , , , | 4 Comments »

Java Quiz

Posted by Eyal Schneider on September 17, 2009

This time I posted a 20-question quiz, developed for testing general Java skills.
Most of the questions address fundamental topics in Java 5, but the test is not a trivial one. The right answers, with their explanations, can be found at the bottom. You are welcome to add your comments.

Enjoy, and good luck!

Question 1

public class CrazyCount {
    private static volatile int count = 0;

    public static void main(String[] args) {
        for(int i=0;i<1000;i++)
            new Thread(new Runnable(){
                public void run() {
                    count++;
                }
            }).start();

        //.....
    }
}

What will be the value of count after all 1000 threads finish their work?

(a) Garbage, because the writes are not synchronized
(b) A value between 1 and 1000
(c) Exactly 1000
(d) At least 1000
Answer

Question 2

Consider the following multi threaded producer/consumer model:

 public class ProducerConsumer {
    private Job currentJob = null;

    public synchronized void addJob(Job job) throws InterruptedException{
        while (currentJob!=null)
            this.wait();
        currentJob=job;
    }

    public synchronized Job consumeJob(){
        Job res = currentJob;
        res.process();
        currentJob=null;
        this.notify();
        return res;
    }
}

Which of the following is correct?

(a) addJob will wait till the current job is executed, but consumeJob will never block.
(b) Jobs will be executed in the exact order in which they are added with addJob.
(c) This code may not work because of deadlocks
(d) ‘while’ in addJob can be replaced by ‘if’, without changing the behavior.
Answer

Question 3

Consider the following Singleton design pattern implementation:

public class ConnectionsMgr {
    private static ConnectionsMgr instance = null;

    public static ConnectionsMgr getInstance(){
        if (instance==null){
            synchronized(ConnectionsMgr.class){
                if (instance==null)
                    instance = new ConnectionsMgr();
            }
        }
        return instance;
    }

    //...
}

Why is this implementation problematic?

(a) An uninitialized ConnectionsMgr object may be returned
(b) Synchronizing on ConnectionsMgr.class is not allowed
(c) The instance may be created more than once
Answer

Question 4

Which of the following statements regarding the final reserved word is correct?

(a) A final method cannot be overridden by subclasses, and a final class cannot be subclassed
(b) A final data member must be given a value in its declaration line
(c) A local variable declared as final can still be mutated using setter methods
(d) (a) and (c) are correct
Answer

Question 5

public static void main(String[] args) {
    try{
        String s = null;
        s=s.substring(10);
    }
    catch(Exception e){
        System.out.println("Caught exception!");
    }
    catch(RuntimeException e){
        System.out.println("Caught a runtime exception!");
    }
}

This code:

(a) Displays “Caught exception!” when executed
(b) Displays “Caught runtime exception!” when executed
(c) Runs without throwing an exception
(d) Does not compile
Answer

Question 6

Which of the following statements regarding exceptions is correct?

(a) IOException can be thrown without the need to declare it in the method header
(b) catch(Exception) will catch OutOfMemoryError
(c) ClassCastException can be thrown without the need to declare it in the method header
(d) (b) and (c) are correct
Answer

Question 7

Which of the following is the preferred way to stop a thread?

(a) Calling the Thread’s interrupt() method
(b) Calling interrupt() and then stop()
(c) Calling the stop() method
(d) Calling join() to wait for the thread to end
Answer

Question 8

public class Swap {
    public static void swapStrings(String x, String y){
        String temp = x;
        x=y;
        y=temp;
    }

    public static void main(String[] args) {
        String a = "1";
        String b = "2";
        swapStrings(a, b);
        System.out.println("a="+a+" ,b="+b);
    }
}

What will be the output when executing this main?

(a) An exception will be thrown
(b) “a=2 ,b=1”
(c) “a=1 ,b=2” because strings are immutable
(d) “a=1 ,b=2” because Java passes parameters by value
Answer

Question 9

public static void main(String[] args) {
    ArrayList<Point> list = new ArrayList<Point>();
    list.add(new Point(100,200));

    ArrayList<Point> listCopy = (ArrayList<Point>) list.clone();

    listCopy.get(0).setLocation(150,250);

    System.out.println(list);
    System.out.println(listCopy);
}

What will be the output of this main?

(a) [java.awt.Point[x=100,y=200]]
[java.awt.Point[x=100,y=200]]

(b) [java.awt.Point[x=150,y=250]]
[java.awt.Point[x=100,y=200]]

(c) [java.awt.Point[x=100,y=200]]
[java.awt.Point[x=150,y=250]]

(d) [java.awt.Point[x=150,y=250]]
[java.awt.Point[x=150,y=250]]
Answer

Question 10

Assume that class B extends class A.
Consider a method that is declared as follows:  public void callMe(List<A> l)

Which of the following data types can be passed as parameters to callMe(..)?
(a) List<B>
(b) ArrayList<B>
(c) ArrayList<A>
(d) All answers are correct
Answer

Question 11

Assume that class B extends class A.

public void callMe(List<? extends A> l){
    l.add(new A());
    l.add(new B());
}

This method:

(a) Compiles, but with a warning
(b) Does not compile
(c) Runs correctly
(d) May cause ClassCastException
Answer

Question 12

What will be the output of the following code?

class Base {
    int i = 10;

    public Base(){
        System.out.println(get());
    }

    public int get(){
        return i;
    }

}
public class Derived extends Base{
    int j = 20;

    public int get(){
        return j;
    }

    public static void main(String argv[]){
        Base b = new Derived();
    }
}

(a) 0
(b) 10
(c) 20
(d) None of the above
Answer

Question 13

What will be the output of the following code?

class Base {
    protected int i = 10;

    public int get(){
        return i;
    }
}

public class Derived extends Base{
    protected int i = 20;

    public int get(){
        return i;
    }

    public static void main(String argv[])
    {
        Base b = new Derived();
        System.out.println(b.i);
        System.out.println(b.get());
    }
}

(a) 10
10

(b) 10
20

(c) 20
10

(d)
20
20
Answer

Question 14

What is the output of the following code?

String s1 = new String("Test");
String s2 = "Test";
if (s1==s2)
    System.out.println("Same");
if (s1.equals(s2))
    System.out.println("Equals");

(a) Same
(b) Equals
(c) Same
Equals
(d) No output
Answer

Question 15

What is the output of the following code?

class Parent{
    private void f(){
        System.out.println("Parent.f()");
    }

    public void g(){
        System.out.println("Parent.g()");
        f();
    }
}

public class Child extends Parent{
    public void f(){
        System.out.println("Child.f()");
    }

    public static void main(String args[]){
        Parent p = new Child();
        p.g();
    }
}

(a) Parent.g()
Child.f()
(b) Parent.g()
Parent.f()
(c) The code does not compile
(d) An exception is thrown
Answer

Question 16

Consider the following overloading scenario:

public class Overloader{
    public static void read(String s){
        System.out.println("read(String)");
    }

    public static void read(Integer i){
        System.out.println("read(Integer)");
    }

    public static void read(Object o){
        System.out.println("read(Object)");
    }

    public static void main(String args[]){
        Object s = new String("Java");
        Integer i = 10;
//      Overloader.read(s);
//      Overloader.read(i);
//      Overloader.read(null);
    }
}

What would be the output of each of the 3 commented-out read calls?

(a) read(s) prints “read(String)”
read(i) prints “read(Integer)”
read(null) prints “read(Object)”

(b) read(s) prints “read(Object)”
read(i) prints “read(Integer)”
read(null) prints “read(Object)”

(c) read(s) prints “read(Object)”
read(i) prints “read(Integer)”
read(null) does not compile

(d) None of the above
Answer

Question 17

Which of the following statements is correct?

(a) A daemon thread is terminated when all non-daemon threads terminate
(b) A thread is created as a daemon thread if and only if it is created by a daemon thread
(c) Non daemon threads may keep the application from shutting down, even if the main thread is terminated.
(d) All of the above are correct
Answer

Question 18

class Parent{
    protected void x(){}
    public void y(){}
}

public class Child extends Parent{
    public void x(){}
    protected void y(){}
}

When compiling this code:
(a) It compiles successfully
(b) Compilation error – x can’t have its visibility increased
(c) Compilation error – y can not have its visibility reduced
(d) Compilation error – neither x nor y can have their visibility changed
Answer

Question 19

String s = new String("Wanna live forever!");
WeakReference<String> r1 = new WeakReference<String>(s);
SoftReference<String> r2 = new SoftReference<String>(s);
String s2 = r1.get();
s = null;

Assuming that no garbage collection took place, what is the reference strength of the string created on the first line, after these lines of code?
(a) Not referenced
(b) Weakly referenced
(c) Softly referenced
(d) Strongly referenced
Answer

Question 20

int x=0;
int y=0;
while(true){
    if ((++x==10) && (++y==10))
        return;
}

When executed, this code:
(a) Performs 10 iterations and ends
(b) Performs more than 10 iterations and ends
(c) Performs less than 10 iterations and ends
(d) Enters an infinite loop
Answer

Answers

1) (b)
A race condition between threads may cause one write to cancel out others, so multiple concurrent increments may result in a single increment.

2) (a)
(b) is incorrect, since there is no guarantee about the order in which waiting threads are awakened using Object.notify().
(d) is incorrect because when a producer thread is awakened, there is no guarantee that another producer thread was not awakened just before, setting the current job to another value.
Besides, the Java documentation recommends always using ‘while’ loops around wait() rather than ‘if’.

3) (a)
The line “instance = new ConnectionsMgr()” is not atomic. It results in multiple machine instructions, one of which is the reference assignment itself. Under the Java memory model, this specific instruction may be reordered with respect to the other instructions, so the assignment can take effect before the object is fully initialized. Another thread calling getInstance() may therefore receive a non-null reference to an uninitialized object. See this article for more details.
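One standard fix (assuming Java 5 or later, where the revised memory model applies) is to declare the field volatile, which forbids the problematic reordering; a sketch:

public class ConnectionsMgr {
    // volatile establishes a happens-before edge between the write and subsequent reads
    private static volatile ConnectionsMgr instance = null;

    public static ConnectionsMgr getInstance(){
        if (instance==null){
            synchronized(ConnectionsMgr.class){
                if (instance==null)
                    instance = new ConnectionsMgr();
            }
        }
        return instance;
    }
}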

4) (d)
A final data member or local variable of a reference type cannot be reassigned to refer to a different object. The referenced object itself, however, can be freely mutated.
(b) is not true because the member can also be initialized in the constructor(s).
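A one-line illustration of (c):

final List<String> names = new ArrayList<String>();
names.add("Java"); // legal: the reference is final, but the object is still mutable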

5) (d)
RuntimeException derives from Exception, so the ordering of the catch blocks makes the second unreachable.

6) (c)
ClassCastException is an unchecked exception, so it does not need to be declared. IOException, however, is a checked exception, and you must either catch it or declare the method as “throws IOException”. OutOfMemoryError is a Throwable, but it derives from Error and not from Exception.

7)  (a)
Thread.stop() is a deprecated method that is not recommended for use. It stops a thread violently without guaranteeing anything about the integrity of the data manipulated by that thread. This may cause corrupt data to be viewed by other threads. Furthermore, there are some cases in which Thread.stop() will not terminate a thread.
Thread.interrupt() is preferred, but it relies on the code run by the thread being cooperative and reacting to the interruption.
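A sketch of such a cooperative loop:

Runnable worker = new Runnable() {
    public void run() {
        // Check the interruption flag regularly and exit promptly when it is set
        while (!Thread.currentThread().isInterrupted()) {
            // ... do one unit of work ...
        }
    }
};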

8) (d)
Since Java passes parameters by value (not by reference!), there is no way to implement a swap method in Java. In C++ for example, it is possible, using references (&).

9) (d)
Collections such as java.util.ArrayList do not perform deep copying when cloned. Item references stay the same.

10) (c)
(a) and (b) are incorrect because, if they were legal, they would break type safety. The method callMe could add an instance of type A, and the caller would end up with a collection holding both A’s and B’s, while assuming it holds only elements of type B. See my previous post for more details.

11) (b)
Assume that C descends from B. The method callMe may receive as a parameter an object of type List<C>. If the additions were considered legal, this would result in a list declared as List<C> containing instances of super classes such as B and A. That breaks generics type safety: the collection would no longer be protected from an unexpected ClassCastException. See my previous post for more details.

12) (a)
Overriding methods that are called directly or indirectly by the superclass’ constructor should be done with care: if the overridden method uses data members belonging to the subclass, those members are not yet initialized at that point! In this example, the data member is of type int, so default initialization gives it the value 0.
For more explanations and a workaround for this problem, see posts 86 and 86b of the Java Specialists’ Newsletter.

13) (b)
Methods are invoked virtually – the exact method to execute is determined at runtime, according to the actual class of the object on which it was called. Data members, however, do not really override each other. Each class has its own member i, and the one to be retrieved is determined at compile time, according to the declared class of the reference (in this case Base).

14) (b)
Using new String(…) forces Java to create a new string instance. Strings defined as literals (e.g. String x=”hello”), however, are interned, and will always be represented by the same object reference at runtime.

15) (b)

16) (c)
The method signature to be invoked is determined at compile time, so it depends on the declared type, not the runtime type, of the argument(s).
Regarding the null parameter – both read(String) and read(Integer) are applicable, and neither is more specific than the other, so the compiler cannot choose between them and reports the call as ambiguous.
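The ambiguity can be resolved with an explicit cast:

Overloader.read((String) null); // prints "read(String)"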

17) (d)

18) (c)
Reducing visibility of methods when overriding is not permitted.

19) (d)
s2 still strongly references the string.

20) (b)
Tricky thing… The Java language specification states that the second operand of && is not evaluated when the first operand evaluates to false (see this section). Consequently, y becomes 1 only the first time x reaches 10. After that, x must wrap around the entire int range and reach 10 again before y is incremented once more, and so on, until both counters reach 10 and the loop stops.

Posted in java, Language | Tagged: , , | 10 Comments »

 