To be clear, I am not actually employing data mining (not yet, anyway). But I am sifting through lots and lots of data trying to find ways to 1) make sense of it all; 2) write it up in a way that makes sense to others; and 3) make it all look pretty. In this blog post, I will talk about my experiences with these three concepts in relation to my pilot test data. Pilot tests usually involve small sample sizes and seek to establish some fundamental groundwork relevant to the overall research question. In my case, the pilot consisted of creating 2 sets of 3D models from 5 crania and 5 mandibles, using individuals from a teaching collection at the University of Leicester.
My workspace, featuring three screens, two mice, and one big (refillable) cup of coffee.
Making Sense of Data
The point of my pilot test was to determine the accuracy and reproducibility of using a structured light scanner to create 3D models of skulls. Each 3D model (or point cloud, to be exact) consisted of 5,000,000 – 7,000,000 points for the mandibles and 12,000,000 – 15,000,000 points for the crania, which is a lot of data to check for accuracy/reproducibility! I had to subsample each point cloud to reduce the number of points I was working with, so that I wasn't running programs for hours just to get results. It's not as simple as just deciding to subsample – it's important to think about how you want to sample, how many points you want to use, and whether there are good reasons to justify the entire procedure.
Left: Original 3D model, viewed in Cloud Compare, consisting of several million points. Right: The subsampled point cloud consisting of ~100,000 points. The geometry of the cranium is still preserved, and 100,000 data points are much more manageable than millions!
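The idea behind subsampling is simple enough to sketch in a few lines. Here's a minimal illustration of random subsampling with numpy — the point counts are made up for illustration, and note that CloudCompare also offers spatially regular and octree-based subsampling, which preserve geometry more evenly than a purely random draw:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def subsample(points: np.ndarray, n_keep: int) -> np.ndarray:
    """Randomly keep n_keep points from an (N, 3) point cloud."""
    idx = rng.choice(len(points), size=n_keep, replace=False)
    return points[idx]

# A stand-in "scan" of one million random points (a real cranium
# scan would be several million points with actual geometry).
cloud = rng.random((1_000_000, 3))
small = subsample(cloud, 100_000)
print(small.shape)  # (100000, 3)
```

With enough points retained, the overall shape of the cranium survives the cull, which is why analyses on ~100,000 points can stand in for analyses on millions.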
Before doing my pilot test, I already had a good idea of what analyses I wanted to do, but it became necessary to re-evaluate how I was going to carry them out. The deeper I delved into the data, the more I realized what limitations there were on how I could interpret the results. It was therefore necessary to decide what other analyses were possible to run, which ones I should actually choose to do, and why. Going in, I thought I was going to do one type of analysis – in the end, I chose to perform four different ones. It doesn't sound like much, but when each analysis needed to be repeated ten times (since I have 5 crania and 5 mandibles), each time using hundreds of thousands of data points, it took me a lot of time to process the data just to get results!
Making Data Understandable to Others
Once I had all my analyses figured out and knew how I could interpret my results, I had to write it all up in a way that was understandable. This can be hard to do if your thought process when working through the analyses, results, and interpretations was complex and convoluted. What was hardest for me was that, as my reasoning developed, the terminology I was using became more defined and rigorous. While this is a good thing, it also meant I had to go back and re-edit everything I had written earlier to ensure that I was using the correct terms in the correct way.
It’s worth noting that these terms will likely become even more refined as my PhD progresses, meaning I’ll probably have to go back and re-update everything.
Making Data Look Pretty
Being a very visual person, I find this extremely important, and I have spent quite a lot of time thinking about how best to display results in my thesis chapter. This meant picking colours for graphs; choosing which types of graphs best visualize my data; and deciding which angles of the skulls to capture in the screenshots I included. Personally, I would have loved to embed 3D models into my PDF chapter. Unfortunately, I doubt that my supervisor would appreciate it if I sent her a document several gigabytes in size. And that would just be my pilot test chapter.
Here’s an example from my pilot test. I created two 3D models (point clouds) of the same cranium and then compared the two point clouds to each other to see how different they are. Theoretically, there should be no difference. But using Cloud Compare’s colour ramp, I can show that there actually is a difference, and where these differences are. The colours correspond to the computed distance between the two clouds at each point: blue means there is little difference; red means there is a lot!
Given how long it took me to get results for just five skulls, and the fact that my Alienware gaming laptop kept freezing even with the GPU in use, it is fairly evident that I’ll need a lot more computing power for my actual PhD data (150 skulls x 4 skeletal collections). My next step is to look into high performance computing (HPC) so that I can remotely access a much more powerful computer to run my data analysis programs.