Wednesday, June 3, 2009

Data Visualisation

NOTE: This post relates to project two. A copy of the visualisation can be obtained from here.

Introduction
The idea of these graphs is to indicate the kind of content found within a given book of the Bible. A user is able to discern the dominant theme of a book by analysing the growth and decline of word frequencies. Furthermore, the graphs present interesting information on many layers, one of which is the shift in frequencies between the Old and New Testament. From a historical perspective, this suggests that the Old and New Testament were written in different periods of history, each defined by the religious beliefs and attitudes of society at that time. The following subheadings discuss the challenges and processes involved in designing the graphs and the reasons behind the different decisions.

Challenges
Choosing A Project
The initial challenge was to define the scope of the project. This required me to assess the specifications and benefits of each project option. Among the many specifications, time management was a leading factor. Not thoroughly analysing what each project required, and the skill set needed to fulfil those requirements, had a real impact on the productivity and time management of the project. Initially, I intended to choose the option that required me to create a geo-narrative using Google Maps. My idea was to develop a photographic tour of Canberra, but due to transportation constraints and technical constraints with the Google Maps application programming interface (API), this was not achievable. As a result I was forced to reassess the situation and consider a different option. Data visualisation appeared reasonable, but it was not as straightforward as I first assumed.

Gathering Data
The challenge in gathering data was ensuring that the data was accurate and reliable. For an observer, this would guarantee that information extracted from the visualisation was genuine. Secondly, I wanted to choose a data set that was unique and discrete. The type of visualisation I anticipated relied on data that wasn't continuously changing over time. I figured that data gathered algorithmically would eliminate accidental errors and the need for human interaction. My first attempt at gathering data proved problematic. I intended to develop a program in the C language to analyse an electronic version of the Bible (King James version). However, I didn't take into account the functionality needed before starting out. As a result, I ran into difficulties maintaining the code as new features were added. I resorted to an open source analysis tool called TextSTAT-2. The only drawback is that I had limited knowledge of how the word frequencies were gathered and was obliged to rely on TextSTAT-2's judgment.

Visualising Data
The initial challenge was to choose an effective visualisation tool. Having only limited knowledge of Excel, I decided to create the visualisation in MS-Paint. The drawback with using MS-Paint is that all the house-keeping is done by the user. I found drawing the grid, calibrating the data and connecting the lines to be very tedious, and the result was visually difficult to comprehend. After perfecting the graphs in MS-Paint I decided to use a web-based visualisation service called ManyEyes instead. ManyEyes, unlike MS-Paint, took care of the internal house-keeping and provided useful features that allowed users to interact with the data. Unfortunately, ManyEyes failed to recognise my data set after many upload attempts, despite massaging the data and working through detailed tutorials. With only one option left I resorted to using Excel. All the computations for normalising the data were done automatically and Excel had no trouble associating the data with its visualisation.

Processes
Choosing The Themes
The process of choosing a theme required looking at key words and determining whether they related to each other. I looked through each book of the Bible and wrote out a list of key words for each particular theme. Then I did a rough analysis of the frequencies of the chosen words and omitted words that did not produce reliable data sets. I was particularly interested in data sets that gave a varying average in word frequencies throughout all the books, so that an observer could analyse the frequencies of different books and easily compare them.

Extracting The Frequencies
A soft copy of the Bible I used for this project can be obtained from here. Unfortunately TextSTAT-2 didn't allow analysis of multiple sources. This constraint required me to separate each of the books into different text files. Once partitioned, I had to open each book with TextSTAT-2 and individually query every word frequency. With 15 queries in each book and a total of 66 books, this process did take some time. To eliminate potential inaccuracies I entered each word twice. Furthermore, I had to decide whether or not I would accept variants of the same word, e.g. love, Love and love's. I decided to include the variants, as only the grammatical form changes and not the meaning.
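The counting step described above can be sketched in a few lines of Python. This is only an illustration of the idea and not how TextSTAT-2 works internally, which I don't know. The function names, the sample text and the rule for variants (ignore case, accept a trailing 's) are all my own assumptions.

```python
import re
from collections import Counter


def count_variants(text, word):
    """Count `word` in `text`, ignoring case and including the possessive form.

    Assumption: "love", "Love" and "love's" are treated as the same word.
    Longer words that merely contain it (e.g. "beloved") are not counted.
    """
    # Split the lower-cased text into words, keeping a trailing 's attached.
    tokens = re.findall(r"[a-z]+(?:'s)?", text.lower())
    counts = Counter(tokens)
    return counts[word] + counts[word + "'s"]


def frequencies_per_book(books, words):
    """Build a {book: {word: count}} table, one row per book."""
    return {
        title: {w: count_variants(text, w) for w in words}
        for title, text in books.items()
    }


# Hypothetical stand-ins for the 66 per-book text files.
sample_books = {
    "Book A": "Love is patient. love's labour; the LOVE of many.",
    "Book B": "Anger rests; glove and beloved are not love.",
}

table = frequencies_per_book(sample_books, ["love", "anger"])
for title, row in table.items():
    print(title, row)
```

In practice each entry of sample_books would be the contents of one of the separated text files, and the word list would hold the 15 key words queried per book.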

Visualising The Data
After the data had been collected I needed to input it into Excel. A copy of this spreadsheet can be obtained from here. Before visualising I needed to ensure the data was normalised. This would allow users to compare relative word frequencies without the frequencies being dependent on the size of a given book. This process involved taking the frequency of a chosen word, dividing it by that book's length and multiplying the result by the size of the largest book. Once normalised, it was a matter of choosing a visualisation. Excel provided an excellent selection of visualisations and options for customising them. For this type of data set, the line graph was the most appropriate.
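As a rough sketch of the normalisation step, the same calculation can be written in Python. The book names, lengths and counts below are made-up numbers for illustration only, not values from my spreadsheet, and the function name is my own.

```python
def normalise(counts, book_lengths):
    """Scale each book's raw count to the size of the largest book.

    normalised = count / book length * length of largest book
    (the multiplication is done first here, which gives the same result).
    """
    largest = max(book_lengths.values())
    return {
        book: counts[book] * largest / book_lengths[book]
        for book in counts
    }


# Hypothetical lengths (in words) and raw counts of one key word.
book_lengths = {"Book A": 1000, "Book B": 4000}
love_counts = {"Book A": 10, "Book B": 20}

normalised = normalise(love_counts, book_lengths)
print(normalised)
```

Here Book B has the higher raw count, but after normalising Book A comes out higher because the word makes up a larger share of a much shorter book. This is the effect the normalisation is meant to expose.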

Analysis
The gradient of the lines between each book of each theme suggests a change in either content, message or mood. It is difficult, however, to predict the author's exact intention or intended meaning without understanding the context in which the word is used. Even so, the graphs represent valuable analytic information. The following conjectures are based entirely on my interpretation of these graphs, which is likely to differ from someone else's.

The themes of Love and Will follow similar trends in the growth and decline of word frequencies. Both graphs present a smooth transition between each book, with momentum only increasing for a small number of books. This suggests that the overall theme is consistent, with emphasis only given to a select few books. On a different level, the frequency of words could imply that the authors of different books had similar insights. Comparing the theme of Love with Will and Anger illustrates that love is the dominant theme, with the word “heaven” peaking at 1200 words in the book of John1.

The theme of Anger, unlike Love or Will, presents different analytic information. There are radical changes in word frequencies between all books. This could suggest that authors at that time had varying interpretations of anger. From a biblical perspective, however, this could imply God's guidance, which would fit well with the key principles of the biblical teachings.

Cultural ideas are continually changing as society conforms to the social expectations placed upon it through certain historical events. In particular, the life, death and resurrection of Jesus revolutionised people's interpretation of biblical teachings. Distinct shifts in frequencies between the Old and New Testament support this view. In conclusion, these graphs represent information in an easily interpreted format; a similar analysis would be difficult to extract by examining the raw data source.
