by Pavel Simakov on 2007-05-07
The Problem
I planned to attend a BarCamp Toronto meeting and was reviewing the list of attendees. The list was long, over 150 people, and I did not know most of them. Each registered person had a link back to a home page or a blog. I tried reading a couple of pages from each person's blog or web site, hoping to get a sense of that person's interests and areas of expertise. By the 5th or 6th site I began thinking "there has got to be a better way": it all took too long and I was getting nowhere.
What if there was a software tool that could answer these questions:
- What is a blog or a web site about?
- What are the personal interests of a blogger?
- What are the hot topics covered by a web site?
The Solution
The most obvious way to figure out what a web site or blog is all about is to simply look at its title. But wait a minute, I get e-mails daily with the subject "You have just won a huge pile of money! Click here to claim!" and I have never trusted them. Why should I trust a web site or blog title after all? I might, if I knew the authors to begin with, but I don't know the authors. And does the title reflect the content? Here is the title of Slashdot: "News for nerds, stuff that matters". You decide... It would be better to have a software tool crawl the site and summarize it. I have seen tools for text analysis, but none quite right for the job at hand...
Google comes to mind now. Google has to perform a similar analysis to make its AdSense advertising work. For advertising to be relevant, Google has to summarize and classify all crawled web pages by their relevance to specific ad keywords. Google works well at the moment and it would be a great tool for data mining but, unfortunately, there is no way today to use its crawled pages for non-Google purposes. The Alexa Web Search Platform might be of use here, but I had trouble estimating the CPU cost and, consequently, the price tag of my research.
Naive Bayes classifiers work very well for spam filtering. They can also work well enough for forum moderation and document classification, as we can gauge by the success story of the spam-free "Joel On Software" discussion boards. I could use a Bayes classifier to classify the blogs and web pages of the meeting attendees. But first, I would have to train the classifier by showing it which pages belong to which categories. Since I do not know the people attending or any of their work, training is difficult. Self-clustered Bayes classifiers are probably out there, but that's a whole other story.
So instead I wrote a small software tool that I call WordMatrix, which turned out to do what I needed. For given groups of web pages or blogs, WordMatrix creates an HTML report containing two lists: the list of words common to all groups and the list of words common to only one group. Each word is color coded so that the highest frequency words are shown in full intensity blue, fading to white as word frequency decreases. Each word or phrase in the report is followed by a subscript number showing its score in some difficult-to-explain units, with 100 being the highest and 0 being the lowest frequency of occurrence in the text.
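The score and color scale can be illustrated with a minimal sketch. The article does not give the exact formula, so this assumes a plain linear scaling against the largest count in a list and a linear blend from white to blue. The class name, method names and sample counts are hypothetical, not taken from WordMatrix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not the original WordMatrix code.
class ScoreColors {

    // Assumption: scale raw counts linearly so the largest count maps to 100.
    static Map<String, Integer> scale(Map<String, Integer> counts) {
        int max = 0;
        for (int c : counts.values()) {
            max = Math.max(max, c);
        }
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int score = max == 0 ? 0 : (int) Math.round(100.0 * e.getValue() / max);
            scores.put(e.getKey(), score);
        }
        return scores;
    }

    // Assumption: blend linearly from white (score 0) to pure blue (score 100).
    static String color(int score) {
        int clamped = Math.max(0, Math.min(100, score));
        int fade = (int) Math.round(255 * (100 - clamped) / 100.0);
        return String.format("#%02x%02xff", fade, fade);
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("window", 40);
        counts.put("software", 28);
        counts.put("code", 10);
        for (Map.Entry<String, Integer> e : scale(counts).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue() + " " + color(e.getValue()));
        }
    }
}
```

With this scheme the top word always scores 100 and is rendered as #0000ff, while a word that never occurs would score 0 and be rendered white.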
In a process similar to the single word analysis, phrase analysis is performed for phrases of different lengths; the most useful phrase length turned out to be 2. I call the reports the Frequent Words Report and the Frequent Phrases Report respectively. Overall, the processing requires large amounts of memory, but is fast. It took only seconds to analyze the web pages and blogs for this article.
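Phrase counting of this kind can be sketched as a sliding window over the token list. The sketch assumes the tokens have already been lowercased and stripped of stop words; phrases in the reports such as "tens thousands" suggest that stop words are dropped before phrases are formed, but that is an inference, not something the article states. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the original WordMatrix code.
class PhraseCounter {

    // Counts every run of n consecutive tokens as one phrase.
    static Map<String, Integer> countPhrases(List<String> tokens, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String phrase = String.join(" ", tokens.subList(i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("power", "law", "distribution", "power", "law");
        // Prints {power law=2, law distribution=1, distribution power=1}
        System.out.println(countPhrases(tokens, 2));
    }
}
```

Calling countPhrases with n set to 1 reproduces plain word counts, so the same routine can serve both reports.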
The Subjects
Now that we have the programming issues out of the way, let me show you how I used WordMatrix to summarize some web sites and to discover what their focus areas and hot topics are. I have selected some well known online properties for analysis. Their authors are among the best writers and respected authorities (in their respective fields) in the world. Their writing touches on software and social engineering subjects. I also happen to read most of them, which made it easier for me to check whether the WordMatrix reports supported my personal observations. Here are the sites I chose:
- Martin Fowler's Bliki
- Clay Shirky's articles
- Dr. Phil's web site
- Noam Chomsky's writing
- Joel Spolsky's blog
- "Joel On Software" discussion boards
Pretend for a moment that you do not know who these people are or what they write about. Let's see if WordMatrix helps us find out.
Comparing Interests of Several Authors
In this exercise we try to find the unique interests that each author has compared to all the other authors. As a side effect we will find the common vocabulary that all authors prefer to use. For this test I have selected 5-10 articles from each author as a representative set for WordMatrix to process. The 10 most common words per author are:
- Martin Fowler: dsl100 |inject88 |workbench85 |software82 |agile79 |mock77 |use74 |language72 |representation65
- Joel Spolsky: window100 |hungarian71 |software70 |code57 |microsoft56 |bug52 |use49 |string45 |charge38 |joel37
- Noam Chomsky: terror100 |state39 |terrorist39 |israeli34 |bomb30 |us29 |military27 |israel26 |mai25 |war25
- Clay Shirky: web100 |group94 |user92 |software90 |weblog80 |social78 |pattern67 |people54 |because52 |semantic42
- Dr. Phil: family100 |rituals55 |factor55 |rhythm55 |crisis45 |child36 |parent36 |children27 |meaningful18 |step-by-step18
The 5 most common word phrases per author are:
- Martin Fowler: language workbench100 |software developer57 |service locator55 |language oriented45 |agile method44
- Joel Spolsky: hungarian notation100 |windows api100 |coding convention89 |operating system88 |joel spolsky78
- Noam Chomsky: international terror100 |united states84 |human rights56 |domestic constituencies35 |terrorist act32
- Clay Shirky: semantic web100 |social software80 |power law66 |web school60 |law distribution40
- Dr. Phil: family life100 |promote rhythm100 |traditions family100 |crisis will100 |children ears50
We have just discovered that Martin Fowler writes about programming languages and agile software development, Joel Spolsky worries about coding conventions and the Windows API, Noam Chomsky focuses on international terrorism and human rights, Clay Shirky covers the semantic web and social software, and Dr. Phil is all about family and child-parent relations.
Here are the complete WordMatrix analysis reports for your own review: Frequent Words Report + Frequent Phrases Report
Discovering Change in Person's Interests over Time
In this exercise we will try to observe how the interests of one author change over time. In our fast-paced and quickly changing world, several authors have managed to keep their past articles in good order. I collected their articles and grouped them by year for WordMatrix to process.
I have quickly learned that Noam Chomsky covers:
- in 2006: kamm100 |m-w65 |medical assistance100 |oil natural100 |energy sector100
- in 2005: language100 |chavez64 |orleans57 |social security100 |intelligent design95 |internal language85
- in 2004: haiti100 |palestinian92 |moral values100 |chemical warfare96 |war terror89
- in 2003: iraq100 |saddam79 |preventive war100 |war terror71 |grand strategy48
- in 2002: taliban100 |afghan79 |war terror100 |bin laden93 |international terror83
- in 2001: voter100 |disenfranchised57 |permanent interests100 |neoliberal reforms72 |capital mobility72
- in 2000: kosovo100 |serb86 |albanian63 |nato58 |colombia plan100 |nato bombs79 |bombing campaign66
- in 1999: kosovo100 |fbi93 |east timor100
Some things do not change for Noam Chomsky, however. For all these years his persistent interests, according to WordMatrix, are: state100 |us94 |world93 |war91 |united states100 |human right91 |york times89 |state department83 |years ago83 |security council80 |united nations80 |international law78 |national security78 |tens thousands78.
Over the same time period, but in the software engineering part of the world, Joel Spolsky covers:
- in 2006: ajax calendar100 |pointers recursion100 |functional program100 |cs degree67 |wiki37 |ajax37
- in 2005: project aardvark100 |usability test100 |hungarian notation98 |coding convention93 |hiring top82
- in 2004: social interface100 |rosh gadol100 |rosh katan100 |windows api61 |demand curve43 |social software37
- in 2003: aol100 |lease100 |landlord89 |code point100 |tenant broker64 |office space56 |character set60
- in 2002: dave100 |groove100 |commodity94 |product vision100 |leaky abstraction79 |asp net79 |vnc68 |open source67
- in 2001: citydesk100 |tile floor100 |task switch100 |citydesk beta86 |pascal string75 |usability test73 |dog food71
- in 2000: netscape100 |spam spam100 |work86 |bonus77 |stock option100 |program manager96
- in 1999: sabbatical100 |next big100 |last job86
Some things do not change for Joel Spolsky either. For all these years his persistent interests, according to WordMatrix, are: software100 |people93 |thing91 |company88 |fog creek100 |joel software96 |bug tracking94 |software developer92 |creek copilot89 |joel spolsky88 |citydesk fogbugz88 |york city83 |writing software82
Here are the corresponding WordMatrix analysis reports for your own review:
- Noam Chomsky writing during 1967-2006: Frequent Words Report + Frequent Phrases Report
- Joel Spolsky "Joel On Software" writing during 1999-2006: Full Length Articles Frequent Words Report + Frequent Phrases Report, Old Front Pages Frequent Words Report + Frequent Phrases Report
Discovering Hot Topics in a Website or Blog
The last exercise in the series is about finding hot topics in a web site or blog. Larger blogs and web sites are organized into sections, categories and so on. It is quite challenging to give these sections or categories meaningful titles. The titles are short and might not properly reflect the content collected under them, especially when blogs and web sites evolve over time. What kind of content do you think the categories "General", "Recent", "Interviews", "Design", "Leisure" or "Tools" have? What topics do they cover? What is the web site about?
Using WordMatrix I quickly discovered that:
- "Leisure" category for Martin Fowler's Bliki focuses on: board game100 |music74 |saba47 |film43 |dive40 |us jazz38
- "Interviews" category for Clay Shirky's writings means: micropayment100 |media89 |news82 |media outlet100 |good design69 |recording industry67 |music industry58 |cable channels54
At the same time there seems to be no dominant theme on any of the "Joel On Software" discussion boards, with the possible exception of the Tech. Interviews board, where members talk a lot about C code that manipulates char*. In my experience this is quite typical of a forum, where many individual contributors post short pieces of content with varying styles and purposes.
Here are the corresponding WordMatrix analysis reports for your own review:
- Martin Fowler's Bliki Frequent Words Report + Frequent Phrases Report
- Clay Shirky's Writings About the Internet Single Word Report + Word Pair Report
- "Joel On Software" Discussion Boards Single Word Report + Word Pair Report
Final Word
The results of WordMatrix are quite encouraging. You might suspect this work to be a self-fulfilling prophecy, because I have taken articles by people I already knew. But it is not. The approach and the tool are applicable to any blog and any web site. These days I always run WordMatrix to learn about any new person I plan to meet or communicate with. It helps me communicate better with other people. It helps me use the right words, so to speak.
Just remember that we have entered a new world where every word you say, print, blog, SMS, draw or click can be recorded. It can be further analyzed, classified, processed, translated, stemmed, and cross-correlated with anything, including your mother's maiden name, the color of your eyes, the time you took to comprehend the page you have just read, and the IP address you use to connect to your favorite web pages, potentially including this one...
The Implementation Details
After a couple of attempts to use the Classifier4J code base and to process OPML and RSS feeds from Bloglines and Java Blogs, I gave up on the pure Bayesian classifier approach. What is needed is a smart summarizer, a smart filter, a correlation finder, not a classifier.
In WordMatrix, as in the Naive Bayesian approach, blogs and web pages are modeled as sets of words with independent probabilities. But instead of computing a document score, the analysis is based on the scores of individual words and phrases. The words, not the documents, are classified as likely or not likely to be used by a specific author or set of authors. If a given word has a high probability of occurring in the articles of all authors, it is classified as "common to all authors". But if a given word has a high probability of occurring only in the articles of one author, it is classified as "specific to particular author". This approach works without modification for any pairwise comparison, and for measuring the similarity between any pair of authors or any grouping of authors.
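The word-level classification can be sketched as follows. The article does not say how "high probability" is decided, so this assumes a single fixed threshold on each word's relative frequency per author. The threshold value, the class name and the sample frequencies are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not the original WordMatrix code.
class WordClassifier {

    static final String COMMON = "common to all authors";

    // Labels each word by how many authors use it at or above the threshold.
    // Words that are frequent for some but not all authors get no label.
    static Map<String, String> classify(Map<String, Map<String, Double>> freqByAuthor,
                                        double threshold) {
        Set<String> vocabulary = new TreeSet<>();
        for (Map<String, Double> freqs : freqByAuthor.values()) {
            vocabulary.addAll(freqs.keySet());
        }
        Map<String, String> labels = new LinkedHashMap<>();
        for (String word : vocabulary) {
            int hits = 0;
            String lastAuthor = null;
            for (Map.Entry<String, Map<String, Double>> e : freqByAuthor.entrySet()) {
                if (e.getValue().getOrDefault(word, 0.0) >= threshold) {
                    hits++;
                    lastAuthor = e.getKey();
                }
            }
            if (hits == freqByAuthor.size()) {
                labels.put(word, COMMON);
            } else if (hits == 1) {
                labels.put(word, "specific to " + lastAuthor);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up relative frequencies, for illustration only.
        Map<String, Double> fowler = new LinkedHashMap<>();
        fowler.put("software", 0.05);
        fowler.put("dsl", 0.04);
        Map<String, Double> spolsky = new LinkedHashMap<>();
        spolsky.put("software", 0.06);
        spolsky.put("hungarian", 0.03);
        Map<String, Map<String, Double>> freqs = new LinkedHashMap<>();
        freqs.put("fowler", fowler);
        freqs.put("spolsky", spolsky);
        System.out.println(classify(freqs, 0.01));
    }
}
```

Because the labels depend only on per-group frequencies, the same routine works whether a group is one author, one year of an author's writing, or one category of a site.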
I wrote the WordMatrix analyzer in Java. The analyzer is a command line tool; it takes a single XML configuration file as input and produces a report. The input file contains the list of web pages to crawl, the list of web page groups and the assignment of each page to a group. The various processing options include the use of stemming, a list of common words to ignore, etc.
For each page: the page is fetched over HTTP and converted to plain text using the Tidy HTML parsing library and a custom XML DOM processor. The XML DOM processor makes it possible to selectively include or exclude parts of the HTML document in the resulting text. Thus we can filter out <script>, <head>, and the HTML header/footer that are identical on all pages of a specific web site.
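The selective extraction step can be sketched with the DOM parser that ships with the JDK. The sketch assumes the page has already been cleaned into well-formed XHTML (the job Tidy does in the article) and skips only a fixed set of element names; detecting headers and footers shared across a site is not shown. The class name and the excluded set are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch; not the original WordMatrix code.
class TextExtractor {

    // Assumption: these elements never contribute readable text.
    static final Set<String> EXCLUDED =
            new HashSet<>(Arrays.asList("script", "style", "head"));

    // Returns the visible text of a well-formed XHTML document.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Walks the tree depth-first, skipping excluded elements entirely.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && EXCLUDED.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>T</title></head>"
                + "<body><p>Hello <b>world</b></p><script>alert(1)</script></body></html>";
        System.out.println(extract(page)); // prints: Hello world
    }
}
```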
The resulting text is tokenized and stemmed using the Snowball stemmer with its default set of stop words. The word frequencies are computed by counting the word occurrences in the source text. For each page group: the word frequencies in the group are computed by aggregating the word frequencies of the individual pages in the group, with weights proportional to the page size.
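The counting and aggregation can be sketched as below, under a few stated assumptions: tokens are split on anything that is not a letter or digit, the stop list is a tiny stand-in for Snowball's default one, stemming is left out, and page size is measured in counted words. Under that last assumption, summing the raw counts of all pages and dividing by the total gives the same result as averaging each page's relative frequencies with weights proportional to page size. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the original WordMatrix code.
class GroupFrequencies {

    // Tiny illustrative stop list; the article uses Snowball's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "is", "to"));

    // Counts word occurrences in one page of plain text (no stemming here).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Relative word frequencies for a group of pages, weighted by page size.
    static Map<String, Double> groupFrequencies(List<String> pages) {
        Map<String, Integer> totals = new HashMap<>();
        int totalWords = 0;
        for (String page : pages) {
            for (Map.Entry<String, Integer> e : countWords(page).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
                totalWords += e.getValue();
            }
        }
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) totalWords);
        }
        return freqs;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("The agile method", "Agile software, agile code");
        System.out.println(groupFrequencies(pages).get("agile")); // prints 0.5
    }
}
```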
When all pages in all groups are processed, the similarity of groups and the word scores are computed using simple linear algebra. The various correlation reports are produced thereafter.
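The article does not say which linear algebra is used for group similarity. One common choice, shown here purely as an assumption, is the cosine of the angle between two groups' word-frequency vectors: 1 means identical word usage, 0 means no words in common. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; cosine similarity is an assumed choice of measure.
class GroupSimilarity {

    // Cosine similarity of two sparse word-frequency vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty group is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up frequencies, for illustration only.
        Map<String, Double> fowler = new HashMap<>();
        fowler.put("software", 0.05);
        fowler.put("agile", 0.04);
        Map<String, Double> chomsky = new HashMap<>();
        chomsky.put("war", 0.06);
        chomsky.put("state", 0.05);
        System.out.println(cosine(fowler, fowler));
        System.out.println(cosine(fowler, chomsky));
    }
}
```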
Related Projects
While I was working on this article I found that Nilesh Bansal has developed BlogScope. BlogScope is a very cool product that analyzes the "blogosphere" and finds blogs correlated on the basis of the terms they use. It visualizes the popularity and correlation of query terms as a function of time. Additionally, it displays a list of keywords closely associated with the query terms over the selected time window, hence providing an exploratory navigation system.
I am not familiar with the details of the BlogScope implementation, but I still think it can be adapted to conduct various forms of automated text analysis similar to the ones I am illustrating in this article. It's not clear, however, whether the BlogScope indexing mechanisms will be able to work on a subset of all blogs and feeds for a focused correlation.