👋 I'm Bhavik, a data scientist and developer. Under stage 4 lockdown in Melbourne, Australia, I built this website to better track, understand and share scientific opinions on the Covid-19 pandemic. My hope is that this website will provide a source of Covid-19 news and opinion that people can place a high degree of trust in. Below I've documented why and how I built it.
Get in touch with me on Twitter with any feedback.
Developments around Covid-19 are occurring fast. As a layman, it's hard to know what information is important and to find opinions you can trust. Early in the pandemic, I turned to Twitter to find experts tweeting about Covid-19. There really is no other place I know of to quickly find leading experts in any field sharing their opinions on current news.
I started building a list of experts, which was a good start, but lists on Twitter are not easy to navigate and check regularly. It's also impossible to stay on top of all the tweets from a large group of users, or to easily find all the tweets around a particular news issue.
Using the Twitter API, I first expanded the list of scientists and experts I was tracking. I experimented with various heuristics to help surface relevant users. One of the most reliable signals was finding users who are followed by a large percentage of the current list and who also follow back a high percentage of the current list users. I then iterated on this process, feeding each round of additions back into the list. To ensure only relevant users were being added, I manually checked every suggested profile before adding it to the list.
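As a rough illustration of the follow-overlap signal described above, here is a minimal sketch. It assumes the follow relationships have already been fetched into a plain dictionary; the function name, the `follows` structure and the two threshold values are hypothetical and are not taken from the site's actual code.

```python
from typing import Dict, List, Set, Tuple


def rank_candidates(
    follows: Dict[str, Set[str]],
    current_list: Set[str],
    min_followed_by: float = 0.3,  # hypothetical threshold
    min_follow_back: float = 0.3,  # hypothetical threshold
) -> List[Tuple[str, float, float]]:
    """Suggest accounts outside the list that overlap strongly with it.

    `follows` maps each user to the set of accounts that user follows.
    Returns (candidate, followed_by_share, follow_back_share) tuples,
    strongest combined signal first.
    """
    n = len(current_list)
    if n == 0:
        return []

    # Count how many list members follow each account outside the list.
    followed_by_counts: Dict[str, int] = {}
    for member in current_list:
        for target in follows.get(member, set()):
            if target not in current_list:
                followed_by_counts[target] = followed_by_counts.get(target, 0) + 1

    ranked = []
    for candidate, count in followed_by_counts.items():
        followed_by = count / n
        # Share of list members that the candidate follows in return.
        follow_back = len(follows.get(candidate, set()) & current_list) / n
        if followed_by >= min_followed_by and follow_back >= min_follow_back:
            ranked.append((candidate, followed_by, follow_back))

    # Highest combined share first; ties broken alphabetically.
    ranked.sort(key=lambda r: (-(r[1] + r[2]), r[0]))
    return ranked
```

Each suggestion would still go through the manual check before being added, and the function can be re-run on the enlarged list to iterate.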
I also searched for profiles that mentioned relevant fields of expertise, including but not limited to: epidemiology, biostatistics, virology, immunology, vaccinology, infectious diseases and public health.
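A profile search like this can be approximated with a simple keyword match over bio text. The sketch below is only an assumption about how such a filter might look: the `FIELDS` list copies the fields named above, while `matched_fields` and the sample bio are made up for illustration.

```python
import re

# Fields of expertise named in the text; the real list was longer.
FIELDS = [
    "epidemiology",
    "biostatistics",
    "virology",
    "immunology",
    "vaccinology",
    "infectious diseases",
    "public health",
]

# One case-insensitive pattern that matches any field as a whole phrase.
FIELD_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(f) for f in FIELDS) + r")\b",
    re.IGNORECASE,
)


def matched_fields(bio: str) -> list:
    """Return the sorted, lower-cased fields mentioned in a profile bio."""
    return sorted({m.group(1).lower() for m in FIELD_PATTERN.finditer(bio)})


print(matched_fields("Professor of Virology and Public Health. Views my own."))
```

Exact phrase matching misses variants such as "epidemiologist", so a real filter would likely need stemming or a broader keyword list.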
The list currently has 604 different scientists and experts. You can subscribe to the list on Twitter. This is by no means exhaustive! I'm continually adding more users.
To form a digest of Covid-related tweets from the list, I experimented with various aggregations. First, the raw list of tweets, retweets, quoted tweets and replies from the list has to be joined together into threaded groups with code (unfortunately the Twitter API just returns a flat list of tweets from users). These tweet 'groups' are further aggregated into larger groups that have mentioned the same link. With some basic Natural Language Processing, these groups are then filtered for relevance to Covid-19 and tagged with sub-issues. I plan to make many improvements to this (such as clustering similar tweet groups and better tagging of sub-issues), but this provided a good start.
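One way to turn a flat list of tweets into groups is a union-find (disjoint set) structure: link each tweet to the tweets it replies to, quotes or retweets, then link tweets that share a URL. The sketch below assumes each tweet has already been reduced to a dictionary with hypothetical `id`, `refs` and `urls` keys, which are not the Twitter API's own field names. It also collapses the threading and link-merging stages into a single pass for brevity.

```python
from collections import defaultdict


def group_tweets(tweets):
    """Group tweets connected by references or by a shared link.

    Each tweet is a dict with hypothetical keys:
      "id"   - tweet id
      "refs" - ids of tweets it replies to, quotes or retweets
      "urls" - links mentioned in the tweet
    Returns a sorted list of sorted id lists, one per group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Stage 1: join replies, quotes and retweets into threads.
    for tweet in tweets:
        find(tweet["id"])
        for ref in tweet.get("refs", []):
            union(tweet["id"], ref)

    # Stage 2: merge threads that mention the same link.
    first_seen = {}
    for tweet in tweets:
        for url in tweet.get("urls", []):
            if url in first_seen:
                union(first_seen[url], tweet["id"])
            else:
                first_seen[url] = tweet["id"]

    # Collect only tweets we actually hold (referenced ids may be missing).
    groups = defaultdict(list)
    for tweet in tweets:
        groups[find(tweet["id"])].append(tweet["id"])
    return sorted(sorted(ids) for ids in groups.values())
```

Relevance filtering and sub-issue tagging would then run on the text of each resulting group.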
Tweet groups are scored by counting the expert interactions (tweets, retweets, quote tweets, replies) and aggregated over two time periods, the last 48 hours and the last 7 days (not including the last 48 hours). This seems to surface a good digest of the most important and recent issues.
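The two-window scoring can be sketched as a pair of counters. This is an illustrative assumption rather than the site's actual code: it takes interactions as `(group_id, timestamp)` pairs, and it reads "the last 7 days (not including the last 48 hours)" as the span from 7 days ago up to 48 hours ago.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def score_groups(interactions, now):
    """Count expert interactions per tweet group in two time windows.

    `interactions` is an iterable of (group_id, timestamp) pairs, one per
    tweet, retweet, quote tweet or reply by someone on the list.
    Returns (recent, earlier) lists of (group_id, count), highest first.
    """
    recent_cutoff = now - timedelta(hours=48)
    week_cutoff = now - timedelta(days=7)

    recent, earlier = Counter(), Counter()
    for group_id, ts in interactions:
        if recent_cutoff <= ts <= now:
            recent[group_id] += 1   # last 48 hours
        elif week_cutoff <= ts < recent_cutoff:
            earlier[group_id] += 1  # 7 days ago up to 48 hours ago
        # anything older than 7 days is ignored
    return recent.most_common(), earlier.most_common()


now = datetime(2020, 9, 1, tzinfo=timezone.utc)
sample = [
    ("a", now - timedelta(hours=1)),
    ("a", now - timedelta(hours=47)),
    ("b", now - timedelta(days=3)),
    ("a", now - timedelta(days=5)),
    ("b", now - timedelta(days=6)),
    ("c", now - timedelta(days=8)),
]
recent, earlier = score_groups(sample, now)
```

Keeping the windows disjoint means a group that is busy right now does not also dominate the weekly view.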
The current website represents an MVP of the idea I had. In order of priority the current plan is to build out the following features and improvements:
- Geo-tag filtering
- Chronological view of all Covid tweets from the list that can be sorted (on date, interaction counts), filtered (on sub-issues) and expanded (on wider date ranges)
- Full text search of all list tweets
- Better clustering of links and tweets talking about the same sub-issues
- Better classifiers for sub-issue filters
A quick rundown of the tech stack I used.
Exploration and Prototyping: Jupyter Notebooks, FastAPI, Django
Data engineering and ETL: Python, Twitter API, Apache Airflow, MongoDB
NLP and ML: spaCy, NLTK, scikit-learn, Hugging Face Transformers
Frontend: Netlify, Jekyll, Bulma (CSS), jQuery