Data parser structure & initial design / by Levko Ivanchuk

At this stage, the less code produced, the better. At the moment I am running some performance evaluations on large files and testing the file handling capabilities of the framework I am using as a wrapper for OpenGL (see the documentation: http://openframeworks.cc/documentation/utils/ofFile.html). Based on my observations so far, the system allocates a buffer sized for the largest file that has been loaded. However, there isn't an obvious way to free that buffer - you can release the file and close it, yet the buffer remains allocated. This shouldn't be a problem, though: the buffer is grown automatically if a bigger file is loaded, and if a smaller file is loaded there is simply some redundant, unused memory. It is also possible that the buffer is deallocated and shrunk when a smaller file is loaded - I haven't tested that yet, but it is not important here. 
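For reference, the kind of check involved is roughly sketched below using ofFile and ofBuffer. The path and the helper name are made up for illustration; this is not the project's actual loading code.

    #include "ofMain.h"

    // A sketch of loading a data file and inspecting sizes with ofFile / ofBuffer.
    // The path passed in is hypothetical.
    void inspectFile(const std::string& path) {
        ofFile file(path);
        if (!file.exists()) {
            ofLogError() << "missing file: " << path;
            return;
        }
        ofBuffer buffer = ofBufferFromFile(path);
        ofLogNotice() << path << " is " << file.getSize() << " bytes, "
                      << "buffer holds " << buffer.size() << " bytes";
    }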

What is important is how to handle the file once it has been loaded and, even more importantly, how to streamline the process of shape generation and allow the rules of shape generation to be customized easily. For instance, how do we efficiently switch from a full representation to one that shows only association rules, as proposed at the meeting? What if we decide to cluster by item type? What if we want to show only Maximal and Closed Itemsets? Each of these rules requires manual coding and thus a new shape generation strategy. I propose the following method. At this stage, I will talk in detail about the Processor / Parser part. I will just briefly mention that it is very easy to save a shape into a file, as such functionality is built into the framework I am using. In other words, all that is required from the parser (and me as a developer) is to turn data into a Mesh, following the modifiers (pictured here as Shape strategy and Shape parameters).

Initial project structure. 

 Following this figure, a user will need to specify three parameters: 

  1. A data file (a single folder is auto-scanned for files at the moment; a file search box might be supported if time permits) 
  2. Shape strategy (a unique way of treating the file, tailored to the requirements of the user)
  3. Shape parameters (a list of adjustable parameters of the shape, which depends on the strategy) 

All of these parameters will be specified in a GUI, where an additional preview window will show what the shape looks like. However, no interaction code will actually live in the parser - it will only have a preview option. 
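To make this concrete, here is a rough sketch of how those three inputs could be represented in code. All of the names here (ShapeParameter, ShapeStrategy, GenerationRequest) are placeholders for illustration, not the final API:

    #include <string>
    #include <vector>

    // One adjustable knob exposed by a strategy (e.g. point spacing, level height).
    struct ShapeParameter {
        std::string name;
        float value = 0.0f;
        float min = 0.0f, max = 1.0f;   // range used when the GUI slider is generated
    };

    // A strategy bundles its parameters with its own way of treating the data file.
    class ShapeStrategy {
    public:
        virtual ~ShapeStrategy() = default;
        virtual std::string name() const = 0;
        virtual std::vector<ShapeParameter>& parameters() = 0;   // drives GUI generation
        virtual void processLine(const std::string& line) = 0;   // consume one data line
    };

    // The three user-facing inputs from the figure above.
    struct GenerationRequest {
        std::string dataFile;        // 1. data file, picked from the auto-scanned folder
        ShapeStrategy* strategy;     // 2. shape strategy
                                     // 3. shape parameters live inside the strategy itself
    };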

Notes on Data Parser

Data Parser is a class that will manage the data files. It will hand out only one line at a time, and will allow for fileLoad and fileClose. The Processor class will in turn hold a currentShapeStrategy. A Shape Strategy will have a list of adjustable parameters, along with its own instructions for generating a GUI to adjust those parameters. Once the Shape Strategy and Shape Parameters are selected and the 'generate' button is pressed, the Processor will start a simple while loop: 

while (there is another line in the file) 
    process that line with current strategy
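
In C++ terms, that loop might look something like the sketch below, reusing the ShapeStrategy interface sketched earlier. The DataParser method names are assumptions on my part, and the real implementation may well use ofFile / ofBuffer instead of the standard library:

    #include <fstream>
    #include <string>

    // Hypothetical DataParser: hands out one line at a time, supports load and close.
    class DataParser {
    public:
        bool fileLoad(const std::string& path) { stream.open(path); return stream.is_open(); }
        void fileClose()                       { stream.close(); }
        bool getNextLine(std::string& line)    { return static_cast<bool>(std::getline(stream, line)); }
    private:
        std::ifstream stream;
    };

    // Inside the Processor, pressing 'generate' boils down to the loop above.
    void generate(DataParser& parser, ShapeStrategy& strategy) {
        std::string line;
        while (parser.getNextLine(line)) {
            strategy.processLine(line);   // the current strategy decides what the line becomes
        }
    }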

Here is the tricky bit - how do you minimize the amount of data that is cached while this process is running? Essentially, the system needs to maintain the information about all the itemsets, on a per-level basis, all at once. 

Hypothetically, it is possible to take a single itemset, analyze its line from the text file, determine its level, simply add a point to the mesh, and repeat. However, that does not allow for 'smart' things like clustering, connection display and data filtering. To make those decisions, the system needs to keep all the information at its disposal. In other words, the task is far too complex for a simple state machine. 

So, a data structure is needed. Initially, I considered a kd-tree as a candidate. However, the more I looked into it, the more I realized that there might be a better way. 

In the simplest strategy that displays all the relations between the itemsets, I see two cases: 

  1. A Singleton itemset is added. No further action required (i.e. no connections are produced). A single point is added to the Mesh. 
  2. A k-level itemset is added. It needs to be connected, according to the strategy, to N itemsets in the (k-1) level. A single point is added to the Mesh, plus a line between two points in the mesh (see the sketch below). 
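
In openFrameworks terms, the two cases might reduce to something like this. It assumes a single ofMesh drawn with OF_PRIMITIVE_LINES for the connections; the helper names and the idea of returning the vertex index are placeholders of mine, not decided design:

    #include "ofMain.h"
    #include <vector>

    // Case 1: a singleton itemset - just add a point, no connections.
    ofIndexType addSingleton(ofMesh& mesh, const glm::vec3& pos) {
        mesh.addVertex(pos);
        return mesh.getNumVertices() - 1;     // index of the new point
    }

    // Case 2: a k-level itemset - add a point, then one line per parent at level k-1.
    // With OF_PRIMITIVE_LINES, each pair of indices becomes one line segment.
    ofIndexType addItemset(ofMesh& mesh, const glm::vec3& pos,
                           const std::vector<ofIndexType>& parents) {
        mesh.addVertex(pos);
        ofIndexType self = mesh.getNumVertices() - 1;
        for (ofIndexType parent : parents) {
            mesh.addIndex(self);              // line from the new itemset...
            mesh.addIndex(parent);            // ...to an itemset one level up
        }
        return self;
    }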

Case 2 has a tricky problem - how do you find the itemset(s) in the level above that you need to connect to? Since you only ever need to search one level up, it makes more sense to maintain a linked list (or, better, a dictionary keyed on itemset name) for EACH level. Then, instead of searching through all itemsets in one huge tree, we only need to search a single level.  
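
A minimal sketch of that per-level structure, assuming itemsets are identified by a name string and that the mesh vertex index is enough to record where an itemset lives (both assumptions on my part):

    #include <string>
    #include <unordered_map>
    #include <vector>

    // levels[k] maps an itemset's name to its vertex index in the mesh,
    // so finding a parent at level k-1 is a single hash lookup.
    using LevelIndex = std::unordered_map<std::string, unsigned int>;
    std::vector<LevelIndex> levels;

    // Register a newly added itemset at its level.
    void registerItemset(int level, const std::string& name, unsigned int vertexIndex) {
        if (levels.size() <= static_cast<size_t>(level)) levels.resize(level + 1);
        levels[level][name] = vertexIndex;
    }

    // Look up a parent one level up; returns true and fills 'vertexIndex' if found.
    bool findParent(int level, const std::string& parentName, unsigned int& vertexIndex) {
        if (level <= 0 || static_cast<size_t>(level - 1) >= levels.size()) return false;
        const LevelIndex& oneUp = levels[level - 1];
        auto it = oneUp.find(parentName);
        if (it == oneUp.end()) return false;
        vertexIndex = it->second;
        return true;
    }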

Of course, as data files get bigger, this will become less efficient. However, it will still be faster than searching through all the itemsets at once, even if such a global search achieved its best case of O(log n), because a lookup only ever touches one level - the number of candidates to search through is always significantly smaller than the total number of itemsets. 

Overall, there is more work to be done, but the initial layout is there. Hopefully, it will prove to be an efficient and smart way of managing all of this data. I just need to see how it goes...