Research

Data parser structure & initial design by Levko Ivanchuk

At this stage, the less code produced, the better. At the moment I am running some performance evaluations on large files and testing the file handling capabilities of the framework I am using as a wrapper for OpenGL. See the documentation: http://openframeworks.cc/documentation/utils/ofFile.html. Based on my observations so far, the system allocates a buffer sized for the largest file that has been loaded. However, there is no way to free that buffer - you can release the file and close it, yet the buffer remains allocated. This shouldn't be a problem, though: the buffer is automatically grown if a bigger file is loaded, and if a smaller file is loaded there is simply some redundant, unused memory. It is also possible that the buffer gets deallocated and shrunk when a smaller file is loaded; I haven't tested that yet, but it is not important.
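For reference, here is a minimal sketch of the kind of test being run - load a whole file into an ofBuffer and walk it line by line. The file name is just a placeholder, not an actual dataset file:

    #include "ofMain.h"

    // Minimal sketch: load a whole file into an ofBuffer and walk its lines.
    // "transactions.txt" is a placeholder name for a data file.
    void inspectFile() {
        ofBuffer buffer = ofBufferFromFile("transactions.txt");
        ofLogNotice() << "loaded " << buffer.size() << " bytes";

        std::size_t lines = 0;
        for (auto line : buffer.getLines()) {
            ++lines;   // each line is one record from the data file
        }
        ofLogNotice() << lines << " lines";
    }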

What is important is how to handle the file once it has been loaded and, even more importantly, how to streamline the process of shape generation and allow the rules of shape generation to be customized easily. For instance, how do we efficiently switch from a full representation to one that shows only association rules, as proposed at the meeting? What if we decided to cluster by item type? What if we want to show only maximal and closed itemsets? Each of these requires manual coding and thus a new shape-generation strategy. I propose the following method. At this stage, I will talk in detail about the Processor / Parser part. Briefly, I will just mention that it is very easy to save a shape into a file, as such functionality is built into the framework I am using. In other words, all that is required from the parser (and me as a developer) is to turn data into a Mesh, following the modifiers (pictured here as Shape strategy and Shape parameters).

Initial project structure. 

 Following this figure, a user will need to specify three parameters: 

  1. A data file (a single folder is auto-scanned for files at the moment; a file search box might be supported if time permits) 
  2. Shape strategy (in other words, a strategy is a unique way of treating the file that is tailored to the requirements of the user)
  3. Shape parameters (a list of adjustable parameters of the shape, which depends on the strategy) 

All of these parameters will be specified in a GUI, where an additional preview window will show the look of the shape. However, no interaction code will actually live in the parser - it will only have a preview option. 
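As a rough sketch of what the strategy part of this design could look like in C++ - all names here (ShapeStrategy, parameters, processLine) are hypothetical placeholders, not final code:

    #include "ofMain.h"
    #include <map>
    #include <string>

    // Hypothetical interface: each strategy has its own rules for turning
    // parsed lines into a mesh, plus the adjustable parameters the GUI exposes.
    class ShapeStrategy {
    public:
        virtual ~ShapeStrategy() {}

        // Adjustable parameters; the GUI builds its controls from this list.
        virtual std::map<std::string, float> parameters() const = 0;
        virtual void setParameter(const std::string & name, float value) = 0;

        // Consume one line of the data file and update the mesh accordingly.
        virtual void processLine(const std::string & line, ofMesh & mesh) = 0;
    };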

Notes on Data Parser

Data Parser is a class that will manage the data files. It will serve only one line at a time, and allow for fileLoad and fileClose. The Processor class will in turn hold a currentShapeStrategy object. A Shape Strategy will have a list of adjustable parameters, along with its own instructions for generating a GUI to adjust those parameters. Once the Shape Strategy and Shape Parameters are selected and the 'generate' button is pressed, it will start a simple while loop: 

while (there is another line in the file) 
    process that line with current strategy
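In C++ terms, the loop could look roughly like the sketch below. It assumes the hypothetical ShapeStrategy interface sketched earlier, and a DataParser whose hasNextLine / nextLine method names are placeholders (only fileLoad and fileClose are fixed in the design above):

    // Sketch of the generate step, tying the parser and strategy together.
    ofMesh generate(DataParser & parser, ShapeStrategy & strategy,
                    const std::string & path) {
        ofMesh mesh;
        parser.fileLoad(path);
        while (parser.hasNextLine()) {
            // the current strategy decides what each line contributes
            strategy.processLine(parser.nextLine(), mesh);
        }
        parser.fileClose();
        return mesh;   // ready to preview, or save via the framework
    }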

Here is the tricky bit - how do you minimize the amount of data that is cached while this process is running? Essentially, the system needs to maintain all the information about all the itemsets, on a per-level basis, all at once. 

Hypothetically, it is possible to take a single itemset, analyze the line from the text file, determine its level, simply add a point to the mesh, and repeat. However, that does not allow for 'smart' things like clustering, connection display and data filtering. In order to make these decisions, the system needs to keep all the information at its disposal. In other words, the problem is far too complex for a simple state machine. 

So, a data structure is needed. Initially, I thought about a kd-tree as a candidate. However, the more I looked into it, the more I realized that there might be a better way. 

In the simplest strategy, which displays all the relations between the itemsets, I see two cases: 

  1. A singleton itemset is added. No further action is required (i.e. no connections are produced). A single point is added to the Mesh. 
  2. A k-level itemset is added. It needs to be connected, according to the strategy, to N itemsets in the (k-1) level. A single point is added to the Mesh, plus a line between two points in the mesh. 

Case 2 has a tricky problem - how do you find the itemset(s) in the level above that you need to connect to? Since you only ever need to search one level up, it makes more sense to maintain a linked list (or better, a dictionary keyed on itemset name) for EACH level. So, instead of searching through all itemsets in a huge tree, we only need to search a single level.  
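As a rough sketch of this idea (all names are illustrative), each level keeps a dictionary from itemset name to the index of its vertex in the mesh, so connecting a k-level itemset only ever touches the (k-1)-level dictionary:

    #include "ofMain.h"
    #include <string>
    #include <unordered_map>
    #include <vector>

    // One dictionary per level: itemset name -> index of its vertex in the mesh.
    static std::vector<std::unordered_map<std::string, ofIndexType>> levelIndex;

    // Add a k-level itemset (k = 1 for singletons) at the given position,
    // connecting it to its already-inserted parents one level up.
    void addItemset(ofMesh & mesh, const std::string & name, int k,
                    const ofVec3f & pos,
                    const std::vector<std::string> & parents) {
        if ((int) levelIndex.size() < k) {
            levelIndex.resize(k);
        }

        ofIndexType idx = (ofIndexType) mesh.getNumVertices();
        mesh.addVertex(pos);            // both cases: add the point itself
        levelIndex[k - 1][name] = idx;

        if (k == 1) return;             // case 1: singleton, no connections

        // Case 2: look up each parent in the (k-1)-level dictionary only.
        auto & above = levelIndex[k - 2];
        for (const auto & parentName : parents) {
            auto it = above.find(parentName);
            if (it != above.end()) {
                mesh.addIndex(it->second);  // endpoints of one connecting line,
                mesh.addIndex(idx);         // assuming OF_PRIMITIVE_LINES mode
            }
        }
    }

A hash-based dictionary keeps each parent lookup close to constant time, which is what makes the per-level approach attractive in the first place.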

Of course, as data files get bigger, this will become less efficient. However, it will always be faster than searching through all the itemsets, even if that full search always achieved its best case of O(log n). By searching one level at a time, the number of candidates to examine is always significantly smaller.  

Overall, there is some more work to be done, but the initial layout is there. Hopefully, it will prove to be an efficient and smart way of managing all of this data. I just need to see how it goes...

First steps & goals by Levko Ivanchuk

In this post, I would like to outline some of the progress that has been made since the last post - that is, over the last 6 days - and to organize the suggestions & ideas for this project, to give a bit more focus to the initial stage, which is crucial. 

I would also like to thank Dr. Carson Leung for gathering his entire Database and Data Mining Lab team for a meeting to discuss my project. I truly appreciate all the suggestions and ideas expressed by Terry Jiang, Richard MacKinnon, Aaron Peddle, Vanessa Reimer and, of course, Dr. Leung himself. 

Things to do right now

  1. At this point, it is crucial to spend time developing an efficient data parser. Regardless of the visualization technique we choose, it is critical to have fast and efficient storage of data. At least for the processing part, all the data needs to be loaded into memory and then processed. In later stages, it might be useful to implement level-by-level data loading & processing once the datasets exceed 500 MB in size. However, for now, it is sufficient to develop efficient storage for datasets of at least a million records, which will be the size of the test dataset. Some performance evaluation will follow. 
  2. Explore Unity. At this point, it might sound strange, but there are actually a lot of optimizations in Unity that I will have neither the time nor the skills to implement myself. If, upon exploring, I discover that it is highly efficient at generating and rendering large sets of distinct graphical objects, and provides tools for interacting with them, I will consider switching to Unity. For now, this is less important than the point expressed in 1).

Things to do in the future

  1. Take the suggestions offered at the aforementioned meeting and put them into effect. My initial impression is that it is not always necessary to display all the information (as there is just too much of it); instead, the focus should be on something that strikes a good balance between clarity and the amount of information presented.
  2. Perhaps build a quick and easy prototype to evaluate the feel and appeal of the design proposed at the meeting. I will be writing shortly about some designs that were proposed, outlining their pros and cons, giving details and my own thoughts.
  3. Think about clustering. That is as much as I have on this at the moment.

Overall, there is a lot of work ahead. The next post is planned in a week, and it will go into more detail on the various designs and ideas that were discussed during the meeting. Some sketches will be created as well to better document the look of the prototype visualization. 

ConeViz: project start & background information by Levko Ivanchuk

Screenshot of the preliminary prototype visualizer.

This is the first blog post about a project that has been on my mind for a while. Essentially, I have always looked for technically challenging areas of Comp. Sci. - things like low-level, hard-core coding - as well as something more artistic: something graphical, interactive, and intuitive to everyone the moment they see it. In other words, I looked for a project that is very tricky inside, but looks nice and simple on the outside. 

I think I might have found an idea that just might satisfy these two requirements. 

While taking a Data Mining course at the University of Manitoba, I found it peculiar how people deal with the results of data mining algorithms. No doubt, throughout that course, I learned how data can be mined and what one can discover, but, at least to me, the missing link was the ability for everyone to see that, after performing all of these calculations, we can in fact observe X or Y or Z, etc. By everyone I mean everyone - even my mom, who knows nothing about computers, should be able to understand what is going on. With these thoughts in mind, I decided to give it a try as a short course project.

The goal of the project is to visualize frequent patterns in 3D. 

Eventually, I realized that this idea might have more potential - not only the visualization itself, but also the challenges it contains 'under the hood': simply processing 4+ million records is an interesting challenge. 

Here is a short list of what has been achieved so far: 

  • A prototype system has been developed. It allowed me to identify potential problems, explore tricky areas and clarify the direction of the project, its scope, and the amount of work required to build a truly good visualization. 
  • Further project directions have been given some thought. A list of potential solutions to the existing problems in the prototype has been proposed, although it is still quite possible that the entire project will be rebuilt in Unity instead of a conventional OpenGL framework. That said, there is nothing wrong with the framework that has been used.
  • Since this research will be part of a course project at the University of Manitoba, a project proposal has been worked on, and some minor literature review has been done on the topic.

Next steps: 

  • I need to narrow down the list of things that must be done at the initial stages. It is always hard to find the focus of a project early on, yet that is always a critical part. 
  • Decide on the platform and/or framework. That is also tricky, but much easier than the first point. 
  • Get all the formal stuff out of the way. The major step here was the proposal; with that done, it is time to narrow the focus and just go.
  • Just enjoy it. This step is easy - can't wait to get down to coding & seeing this thing come to life.