The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem.
When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. This is probably because most companies are still in the technology adoption phase, evaluating this Big Data platform; with time, the number of machines in their Hadoop clusters will likely grow into the 3- or even 4-digit range.
Development on Hadoop is becoming more agile, with shorter execution cycles: Apache Tez, Cloudera's Impala, and Databricks' Spark are some of the technologies that help along the way.
While all those tools are great, there are plenty of situations and use cases where all you need is a simple ad hoc application created with a specific end in mind, one that does not require those big guns or a cluster of machines.
In this blog I will show you how to quickly put together very simple Java applications suitable for some typical data processing tasks. There is always a trade-off between the time required to create such applications, their execution time, and the data set sizes they can crunch efficiently. The type of applications I am going to show you are very fast and can be developed in a matter of minutes; however, they are not suited to truly large data sets, only those that can fit into your computer's RAM. That is much less of a limiting factor these days, with 64-bit OSes, 64-bit JDKs, and double-digit gigabytes of RAM readily available.
Basically, the Java applications I am going to walk you through leverage the subset of set algebra operations as supported in Java via the java.util.Set interface.
To make things a bit simpler for you to follow, I removed the data loading part and put the data directly in the data structures that I use to illustrate the set algebra algorithms in suitable use cases.
Dev machine specs
In my development, I used the 32-bit JDK 1.7.0_25 running on 64-bit Windows 7 with 16GB of RAM and a quad-core i5-3350P CPU @ 3.10GHz; data processing was done on a single thread, so the number of CPU cores was not a big factor here.
Heap size settings
In order to process the test data set I generated, I needed to set the maximum Java heap size to 1024 MB; I did so with the JVM argument -Xmx1024m.
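If you want to double-check what heap ceiling the JVM actually picked up, a tiny sanity check like this one does the trick (the class name here is arbitrary, and the reported figure varies slightly between JVMs):

// Prints the effective maximum heap size; launched with -Xmx1024m it should
// report roughly 1024 MB
public class HeapCheck {
    public static void main(String[] args) {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxHeapBytes / (1024 * 1024)) + " MB");
    }
}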
Logging
In the code I used the SLF4J logging package, which you can replace with your favorite logging library or the proven System.out.println() logging facility.
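For completeness, here is the kind of minimal SLF4J setup the snippets below assume; the class name is just a placeholder, and you need the slf4j-api jar plus a binding such as Logback on the classpath:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SetAlgebraDemo { // placeholder class name

    // The logger instance referenced by the procedures below
    private static final Logger logger = LoggerFactory.getLogger(SetAlgebraDemo.class);
}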
The processing logic
Now, the main idea underpinning the applications' logic is as follows: use set algebra to find the intersection of two data sets, an operation which is, in essence, the equivalent of SQL's INNER JOIN.
You can also use this approach to find the differences between sets.
With the java.util.Set class you can also remove duplicates from a given data set (a trivial task, not reviewed here).
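Just to illustrate that point (a quick sketch, since the post does not cover it further), deduplication boils down to dumping the collection into a Set; a LinkedHashSet keeps the original encounter order:

// Removing duplicates from a list; LinkedHashSet preserves insertion order
List<Integer> withDuplicates = Arrays.asList(1, 2, 2, 3, 3, 3, 4);
Set<Integer> deduplicated = new LinkedHashSet<>(withDuplicates);
System.out.println(deduplicated); // [1, 2, 3, 4]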
First, I will walk you through a prototype solution to highlight the main points, and then we will see what mileage we can get out of this.
The basics of Set operations
When dealing with sets, you usually have tasks formulated as follows:
- In two (or more) data sets find all the elements that are contained in both sets (the intersection operation).
- In two (or more) data sets find all the elements that are different (the asymmetric set difference of the data sets).
Let’s see how the above tasks can be solved.
The set intersection procedure
private static void testBasicSetIntersection() {
    List<Integer> listOf2x = Arrays.asList(new Integer[] { 2, 4, 6, 8, 10, 12 });
    List<Integer> listOf3x = Arrays.asList(new Integer[] { 3, 6, 9, 12, 15 });

    Set<Integer> keySetOf2x = new HashSet<>(listOf2x);
    Set<Integer> keySetOf3x = new HashSet<>(listOf3x);

    logger.debug("The original key set: {}", listOf2x);
    // OUTPUT: The original key set: [2, 4, 6, 8, 10, 12]

    keySetOf2x.retainAll(keySetOf3x);

    logger.debug("Retained keys in the even set: {}", keySetOf2x);
    // OUTPUT: Retained keys in the even set: [6, 12]
}
The result of this procedure is the set [6, 12], whose elements are found in both data sets.
The asymmetric set difference procedure
private static void testBasicSetDifference() {
    List<Integer> listOf2x = Arrays.asList(new Integer[] { 2, 4, 6, 8, 10, 12 });
    List<Integer> listOf3x = Arrays.asList(new Integer[] { 3, 6, 9, 12, 15 });

    Set<Integer> keySetOf2x = new HashSet<>(listOf2x);
    Set<Integer> keySetOf3x = new HashSet<>(listOf3x);

    logger.debug("The original key set: {}", listOf2x);
    // OUTPUT: The original key set: [2, 4, 6, 8, 10, 12]

    keySetOf2x.removeAll(keySetOf3x);

    logger.debug("Retained keys in the even set: {}", keySetOf2x);
    // OUTPUT: Retained keys in the even set: [2, 4, 8, 10]
}
The resulting set [2, 4, 8, 10] contains all the elements from the 2x set that are not found in the 3x set.
Now the stress test.
Joining (intersecting) two data sets procedure
In this procedure I use two data sets backed by the Map type; you can extend this logic to support more data sets (see the three-way sketch after the walkthrough below).
private static void testDataSetJoining() {
    int INITIAL_DATA_SET_SIZE = 10_000_000; // just a loop limit for generating two data sets

    Map<String, Integer> mapOf2x = new HashMap<>();
    Map<String, Integer> mapOf3x = new HashMap<>();

    // Preparing the data sets
    for (int i = 0; i < INITIAL_DATA_SET_SIZE; i++) {
        if (i % 2 == 0) {
            mapOf2x.put(String.valueOf(i), i); // multiples of 2 (2x)
        }
        if (i % 3 == 0) {
            mapOf3x.put(String.valueOf(i), i); // multiples of 3 (3x)
        }
    }

    logger.debug("Size of mapOf2x: {}; size of mapOf3x: {}", mapOf2x.size(), mapOf3x.size());
    // OUTPUT: Size of mapOf2x: 5000000; size of mapOf3x: 3333334

    Set<String> keySetOf2x = mapOf2x.keySet();
    Set<String> keySetOf3x = mapOf3x.keySet();

    keySetOf2x.retainAll(keySetOf3x); // Takes about 630 ms on my machine

    logger.debug("Retained keys (2x = 3x) size: {}", keySetOf2x.size());
    // OUTPUT: Retained keys (2x = 3x) size: 1666667

    int SELECT_SIZE = 10; // take a sample of the output
    String[] keys = keySetOf2x.toArray(new String[0]);

    logger.debug("\n\nRandom {} records of the intersected data set (INNER JOIN):", SELECT_SIZE);
    // OUTPUT: Random 10 records of the intersected data set (INNER JOIN):
    for (int i = 0; i < SELECT_SIZE; i++) {
        logger.debug("{} -- {}", mapOf2x.get(keys[i]), mapOf3x.get(keys[i]));
    }
    /* Part of the OUTPUT:
       ...
       2031894 -- 2031894
       8392644 -- 8392644
       4852206 -- 4852206
       0 -- 0
       ...
    */
}
In the testDataSetJoining procedure we store the input data in two Map-backed data sets. A map, as we know, contains a set of key-value pairs; one of our maps uses multiples of 2 as keys, with each value equal to its key (of course, the value could be any data); the other map uses multiples of 3 as its keys.
The core logic of the procedure is centered on the keySetOf2x.retainAll(keySetOf3x); statement, which performs the required intersection of the maps' key sets.
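As mentioned above, the same approach scales to more than two data sets by simply chaining retainAll() calls. Here is a rough sketch of a three-way join with a smaller, made-up 2x/3x/5x example; note that I copy the first key set so the source map is left untouched:

private static void testThreeWayJoin() {
    Map<String, Integer> mapOf2x = new HashMap<>();
    Map<String, Integer> mapOf3x = new HashMap<>();
    Map<String, Integer> mapOf5x = new HashMap<>();

    for (int i = 0; i < 1_000; i++) {
        if (i % 2 == 0) mapOf2x.put(String.valueOf(i), i);
        if (i % 3 == 0) mapOf3x.put(String.valueOf(i), i);
        if (i % 5 == 0) mapOf5x.put(String.valueOf(i), i);
    }

    // Copy the first key set so retainAll() does not modify mapOf2x in place
    Set<String> joinedKeys = new HashSet<>(mapOf2x.keySet());
    joinedKeys.retainAll(mapOf3x.keySet());
    joinedKeys.retainAll(mapOf5x.keySet());

    // joinedKeys now holds only the keys present in all three maps (multiples of 30)
    logger.debug("Three-way join size: {}", joinedKeys.size());
    // OUTPUT: Three-way join size: 34
}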
Results
I expected the intersection process to be fast, but the result (pleasantly) exceeded my expectations:
Intersecting two data sets with sizes of 5,000,000 and 3,333,334 to produce a resulting data set of 1,666,667 records took only about 600 milliseconds. Changing the key type from String to Integer cut the execution time roughly threefold, from about 600 ms down to about 200 ms. This is due to the much simpler hash function of the Integer type (which simply returns the primitive int value represented by the Integer object).
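In case you want to try that comparison yourself, the Integer-keyed variant looks roughly like this (a trimmed-down sketch without the record sampling); since retainAll() on the keySet() view also shrinks the backing map, the map size afterwards equals the intersection size:

private static void testDataSetJoiningWithIntegerKeys() {
    int INITIAL_DATA_SET_SIZE = 10_000_000;

    Map<Integer, Integer> mapOf2x = new HashMap<>();
    Map<Integer, Integer> mapOf3x = new HashMap<>();

    for (int i = 0; i < INITIAL_DATA_SET_SIZE; i++) {
        if (i % 2 == 0) mapOf2x.put(i, i); // multiples of 2 (2x)
        if (i % 3 == 0) mapOf3x.put(i, i); // multiples of 3 (3x)
    }

    long start = System.currentTimeMillis();
    mapOf2x.keySet().retainAll(mapOf3x.keySet());
    long elapsedMs = System.currentTimeMillis() - start;

    // Integer.hashCode() just returns the wrapped int value, which makes hashing
    // much cheaper than hashing String keys character by character
    logger.debug("Retained keys (2x = 3x) size: {} in {} ms", mapOf2x.size(), elapsedMs);
}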
Conclusions
As you can see, writing set-based applications in Java is a simple, fast, and effective way to solve some common data processing problems; I believe this technique deserves a place in every Big Data specialist's tool set.
Feel free to use the above code to see it in action on your machine.
And that's all I wanted to say in this blog. Until next time!