Realtime classification of news
Written by Yannik Messerli - 17 November 2014
How can one compare news information? We read news every day, but what kind? Are we influenced by it? It would be good to have a way to measure our news channels in order to better understand their orientation and their impact. I therefore built an index for my channel, the Swiss RTS news channel (in French).
My goal is to build news indexes for different news outlets and to be able to compare them. The first index shows the number of news items per defined category:
- Death of people: everything about crimes, accidents, natural disasters, wars and conflicts.
- Economy and politics
- Culture and science: everything about new discoveries, technological advances, museums, artist and writer announcements, art, festivals and music.
- Other: news that does not fall into any of the categories above.
The second one shows the number of positive and negative news items:
How can one compare news information?
For the last few months I've collected the body and title of the RTS online news. The first step in mining the news is to transform it so that items can be compared and grouped by category. I'm using the Vector Space Model (VSM) to embed the texts into a Euclidean space where they can easily be related by a linear relation. This technique is based on the vocabulary of the texts: I count the occurrences of each English word in the news, i.e. I build a vector whose dimension is the size of the English dictionary. Then I divide it by the number of terms in the document. (This is called the term frequency, more on it here.) Although this technique misses many characteristics of the texts (punctuation is ignored, and grammar and semantic information is discarded), it is really simple and efficient to work with.
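To make the term frequency concrete, here is a minimal sketch using only the Python standard library. It is not the code behind the index: the function name `term_frequency` and the tokenizing regular expression are assumptions chosen for illustration, and the vector is stored sparsely as a dictionary instead of one slot per dictionary word.

```python
import re
from collections import Counter

def term_frequency(text):
    """Map each word to its count divided by the total number of words."""
    # Lowercase and keep only runs of letters, so punctuation is ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: n / total for word, n in counts.items()}

# Toy sentence, made up for illustration.
tf = term_frequency("The storm hit the coast. The coast is closed.")
```

Words that do not appear in a text simply have no entry, which is equivalent to a zero in the full vector.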
Because the dimension of our vectors is so high, potentially thousands of words, I was experiencing a phenomenon called the "curse of dimensionality". To tackle it, I reduced the dimension of the space using preprocessing steps, e.g. I reduced words to their base form, removed words without any meaning, etc.
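The two preprocessing steps mentioned above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the stop-word list `STOP_WORDS` is deliberately tiny, and `crude_stem` is a naive suffix stripper standing in for a real stemmer, not the routine used for the index.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to"}

# Naive suffix stripping, standing in for a real stemmer (illustration only).
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and reduce the remaining words to a base form."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

tokens = preprocess("The markets are falling and investors reacted")
# tokens == ["market", "fall", "investor", "react"]
```

Both steps shrink the vocabulary, and therefore the dimension of the vectors: inflected forms collapse onto one entry, and very common words disappear altogether.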
Finally, for each of the categories, I manually picked ten relevant news items. Because comparing two news items is as simple as computing their Euclidean distance, the simplest classification becomes the following problem: which label minimizes the distance from the new document's vector to any of the manually labelled document vectors of that label's category?
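This nearest-neighbour rule can be sketched in a few lines of plain Python. The names `euclidean_distance` and `classify`, and the toy vectors, are assumptions for illustration; the vectors are sparse dictionaries mapping a word to its term frequency.

```python
import math

def euclidean_distance(u, v):
    """Distance between two sparse vectors stored as {word: weight} dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def classify(vector, labelled):
    """Return the label whose closest hand-labelled example is nearest."""
    best_label, best_distance = None, float("inf")
    for label, examples in labelled.items():
        for example in examples:
            d = euclidean_distance(vector, example)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label

# Toy term-frequency vectors, made up for illustration.
labelled = {
    "economy": [{"market": 0.5, "bank": 0.5}],
    "culture": [{"museum": 0.5, "festival": 0.5}],
}
label = classify({"bank": 0.6, "rate": 0.4}, labelled)
```

Here the new document shares the word "bank" with the economy example and nothing with the culture example, so it ends up closer to the former.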
I built the above index by grouping and counting news items per week. I implemented the routines in Python with the excellent framework pattern. The overall code, while simple and short, works quite well. However, the models built from the selected news items are quite weak. The models would benefit from being selected by hand.
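A minimal sketch of the weekly grouping, assuming each classified item is available as a `(date, category)` pair; the sample items and function names are made up for illustration. It also turns the counts into per-week shares, which makes weeks with very different volumes comparable.

```python
from collections import Counter, defaultdict
from datetime import date

def weekly_counts(items):
    """Count items per (ISO year, ISO week) and per category."""
    counts = defaultdict(Counter)
    for day, category in items:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(iso_year, iso_week)][category] += 1
    return counts

def weekly_shares(counts):
    """Turn each week's counts into fractions that sum to 1."""
    shares = {}
    for week, per_category in counts.items():
        total = sum(per_category.values())
        shares[week] = {cat: n / total for cat, n in per_category.items()}
    return shares

# Made-up sample: three items in one week, one in the next.
items = [
    (date(2014, 11, 10), "economy"),
    (date(2014, 11, 11), "death"),
    (date(2014, 11, 12), "economy"),
    (date(2014, 11, 17), "culture"),
]
counts = weekly_counts(items)
shares = weekly_shares(counts)
```

Keying on the ISO year as well as the ISO week keeps weeks from different years apart.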
Surprisingly, the number of news items per week fluctuates a lot. However, a normalized view shows that the distribution of the different categories, i.e. what the outlet is mostly talking about, is quite even: