It has been a pretty busy month at Frrole. We're working on putting together the technology infrastructure that should make the user experience at Frrole more productive and relevant.
Our major technological challenge was detecting "duplicate tweets", that is, tweets which are paraphrased versions of each other. We wanted a mechanism to cluster tweets that talk about the same topics or facts in similar contexts. The problem is harder than it appears. A stopgap solution would be a simple keyword-based match after tokenizing each tweet's text, followed by ranking the results. However, this is unreliable because tweets are full of custom hashtags, user handles, and other grammatical idiosyncrasies. The situation is exacerbated by the fact that any trivial clustering algorithm takes time quadratic in the number of tweets, which means we cannot tag hundreds of thousands of tweets effectively.
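To make the baseline concrete, here is a minimal sketch of the keyword-match approach described above. The function names and the overlap threshold are illustrative, not part of our system; the point is that every pair of tweets must be compared, which is O(n²) over n tweets.

```python
import re

def tokens(text):
    """Naive tokenizer: lowercase word tokens, # and @ prefixes dropped."""
    return set(re.findall(r"\w+", text.lower()))

def naive_similar(t1, t2, threshold=0.5):
    """Flag two tweets as similar if enough tokens overlap.

    Comparing every pair this way is quadratic in the number of
    tweets, which is exactly why this approach does not scale.
    """
    a, b = tokens(t1), tokens(t2)
    return len(a & b) >= threshold * min(len(a), len(b))
```

Hashtags and handles also leak into the token sets here, which is one of the idiosyncrasies that makes plain keyword matching fragile.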
After careful consideration, we came up with a workable solution. We decided to produce a computationally feasible signature for each tweet, which we can compare cheaply to obtain an effective measure of similarity. Although finding an ideal signature is very difficult, we approximate it with the entities that appear in the tweet. For us, the set of entities in a tweet provides a fair basis for its uniqueness. Hence, our first concern was extracting entities from tweets.
Entity extraction, or NER (Named Entity Recognition) as it is formally referred to in academic literature, is a technique to extract real-world entities from a piece of text. It is a fairly standard technique which serves as a pre-requisite in many text mining applications. However, performing entity extraction on tweets is challenging because of the limited number of words and characters in a tweet. To get a fair idea of the problem, we decided to pursue the research leads of two groups currently working on entity extraction from Twitter data: the ARK group from Carnegie Mellon University and a group from the University of Washington. Both groups are pushing the state of the art on this problem, so we decided to take their code-base and build on that.
We fixed the overall architecture of the project as follows:
- Model Implementation: Python 2.7
- Web Framework: Bottle.py with Cherrypy
- Persistence: MongoDB 2.2+
- Core Library: NLTK 2.0+
We chose these specific technologies for their simplicity of use and deployment. Python, being a very versatile language, has a rich ecosystem of libraries that makes our task easier and more productive. After deciding on the architecture, we started with the main design of our approach. The overall pipeline of the clustering process is as follows:
New tweets arrive in the system in batches. We refer to this set as newTweets. All tweets that are exact duplicates of any tweet in our main_db (Main Database) are deleted.
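A simple way to implement this exact-duplicate filter is to hash a normalized form of each tweet's text and check it against a hash index over main_db. The sketch below is illustrative; the in-memory `known_hashes` set stands in for that index.

```python
import hashlib

def normalize(text):
    """Collapse whitespace and lowercase so trivial variations hash alike."""
    return " ".join(text.lower().split())

def drop_exact_duplicates(new_tweets, known_hashes):
    """Keep only tweets whose normalized text has not been seen before.

    known_hashes is a stand-in for a hash index over main_db.
    """
    kept = []
    for tweet in new_tweets:
        digest = hashlib.sha1(normalize(tweet["text"]).encode("utf-8")).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)
            kept.append(tweet)
    return kept
```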
All the newTweets are pushed into a database called stage_db and a sessionID is generated for this particular session.
All the tweets are processed sequentially, and an entitySet is computed for each tweet (i.e. the set of entities in the tweet as reported by the NER engine).
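Our actual NER engine builds on the CMU ARK and University of Washington tools mentioned above; as a rough stand-in, the sketch below collects hashtags, user handles, and capitalized tokens as candidate entities. It is deliberately simplistic and only meant to show what an entitySet looks like.

```python
import re

def extract_entities(text):
    """Toy stand-in for the NER engine: gather hashtags, user handles,
    and capitalized words as candidate entities, lowercased into a set."""
    hashtags = re.findall(r"#(\w+)", text)
    handles = re.findall(r"@(\w+)", text)
    # Capitalized tokens; a real extractor would skip sentence-initial words
    capitalized = re.findall(r"\b([A-Z][a-z]+)\b", text)
    return {e.lower() for e in hashtags + handles + capitalized}
```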
Next, we find the neighborhood of the tweet with the help of its entitySet. The neighborhood of a tweet consists of all tweets that share at least 2 entities with it; any such pair qualifies a tweet for membership in the other's neighborhood.
After finding the neighborhood of a tweet, we process each tweet in the neighborhood to compute its Jaccard Coefficient (JC) with the original tweet. The Jaccard Coefficient is a standard measure that quantifies the overlap between two sets; in our case the sets are the respective entitySets of the two tweets under consideration. In our system, the Jaccard Coefficient thus approximates the similarity between two tweets.
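Concretely, the Jaccard Coefficient of two sets A and B is |A ∩ B| / |A ∪ B|, a value between 0 (disjoint) and 1 (identical). Over entitySets it is a one-liner:

```python
def jaccard(a, b):
    """Jaccard Coefficient of two sets: |A ∩ B| / |A ∪ B|, in [0, 1]."""
    if not a and not b:
        return 0.0  # convention: two empty sets have no measurable overlap
    return len(a & b) / float(len(a | b))
```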
Since our ultimate objective is to cluster tweets, at any given point every tweet in our system belongs to at least one cluster. Cluster assignment happens when the tweet enters the system for the first time. If no neighborhood can be found for a tweet, it spawns its own new cluster. Moreover, a tweet can participate in multiple clusters with varying degrees of participation, where the degree of participation is quantified by a floating-point value called participationCoeff.
After we have calculated the Jaccard Coefficients of a tweet w.r.t. its neighbors, we need to assign the tweet to the appropriate clusters. For this, we follow a recursive approach in which we find all possible paths from the new tweet to reachable clusters, with neighbors as intermediary nodes. Naturally, some clusters will not be reachable at all, while others will be reached by multiple paths. For each cluster reached by one or more paths, we assign the new tweet to it with a participationCoeff equal to the average of the path weights, where a path's weight is the product of the Jaccard Coefficient between the two tweets and the neighbor's own participationCoeff in that cluster.
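The assignment step can be sketched as follows. The data shapes (`jc` and `memberships` dictionaries) and function name are illustrative assumptions, not our actual schema; the weight-and-average logic mirrors the description above.

```python
from collections import defaultdict

def assign_clusters(new_id, neighbors, jc, memberships):
    """Assign a new tweet to clusters reachable through its neighbors.

    jc[n]          -- Jaccard Coefficient between the new tweet and neighbor n
    memberships[n] -- dict mapping cluster id -> participationCoeff of n
    Each path new_tweet -> n -> cluster has weight jc[n] * participationCoeff;
    the new tweet's participationCoeff in a cluster is the average over paths.
    """
    path_weights = defaultdict(list)
    for n in neighbors:
        for cluster, coeff in memberships.get(n, {}).items():
            path_weights[cluster].append(jc[n] * coeff)
    if not path_weights:
        # No neighborhood: the tweet spawns its own new cluster
        return {("cluster", new_id): 1.0}
    return {c: sum(ws) / len(ws) for c, ws in path_weights.items()}
```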
After we have found all the clusters a tweet belongs to, we assign these clusters as a list to an attribute of the tweet.
Next, we delete the tweet from stage_db and insert it into commit_db.
We select only the relevant attributes of the new tweet and persist it into main_db. This is done to save space and reduce our memory footprint, since we do not want to store unnecessary attributes of the tweets.
Apart from maintaining clusters, we also perform other book-keeping that helps expedite the whole clustering process. This includes (a) updating entity_db, which holds information about all the entities in the DB, and (b) updating index_db, which helps us find neighborhoods efficiently.
Finally, we pass back to the user all the tweets in commit_db and delete them simultaneously. This ends a successful batch session.
Even though our pipeline is straightforward and working as per our expectations, there is still room for optimization. Currently, the clustering process suffers from too many DB queries. We plan to minimize this overhead very soon and scale up to clustering hundreds of tweets per second. That is the next challenge we're going to work on. How we do it, I'll let you know in my next post. Till then, enjoy Frroling and let us know what you think.