Data science challenge: classifying millions of Twitter users based on gender

I couldn't have joined Qualogy as its new data scientist at a better time. Together with several co-workers, I was tasked with classifying millions of Twitter users by gender for UN Global Pulse. We summarized the results in an interactive tool, developed in collaboration with Leiden University's Centre for Innovation.

United Nations Global Pulse is an initiative of the United Nations that uses big data to improve UN policies. It uses mobile phone data, for instance, to determine how much damage a natural disaster has caused and which areas need the most help.

Twitter as the voice of the people

Twitter is another important source of information for the UN and can be seen as the voice of the people. UN Global Pulse collects a lot of Twitter data on various subjects, most of which relate to the Millennium Development Goals, such as healthcare, education and discrimination.

This data provides the UN with information on subjects deemed important by people around the world and helps them improve regional policymaking. For example: if inadequate healthcare proves to be a popular Twitter topic in Germany, the UN can focus its attention on this issue and carry out research to determine what exactly is going on.

Me at the office

Our assignment

An important piece of information that was missing was the gender of Twitter users. We identified this for millions of Twitter users around the world.

Gender is extremely important for UN Global Pulse, especially when Twitter data is used to inform policy. If inadequate healthcare is a topic tweeted about mostly by women, this may indicate problems with pregnancy-related healthcare, a topic that is probably less pressing for men. This information can then be used to conduct more in-depth research. In short: knowing the gender of Twitter users makes it possible to conduct more targeted research.

Data collection using the Twitter API

We used the Twitter API to collect data on Twitter profiles. The API also helped us "teach" our computers to identify the gender of Twitter users. To classify the users, we used several machine-learning libraries for Python, a language widely used in data science.
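
To give an idea of what "teaching" a computer to recognize gender can look like, here is a minimal sketch. It is not the model we built for UN Global Pulse: it assumes that the only input is a user's first name, it uses a small hand-written naive Bayes classifier from the Python standard library instead of a full machine-learning library, and the class name, function names and training names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n=2):
    """Split a lowercased name into overlapping character n-grams.

    '^' and '$' mark the start and end of the name, so that typical
    first and last letters become features as well.
    """
    padded = "^" + name.lower().strip() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NameGenderClassifier:
    """Multinomial naive Bayes over character bigrams, with add-one smoothing."""

    def fit(self, names, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        self.vocab = set()
        for name, label in zip(names, labels):
            grams = char_ngrams(name)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # Start from the log prior, then add the log likelihood per bigram.
            score = math.log(count / total)
            denom = sum(self.ngram_counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(name):
                score += math.log((self.ngram_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labelled training data; a real project needs far more.
TRAIN_NAMES = ["anna", "maria", "sophia", "julia", "emma",
               "peter", "john", "mark", "david", "thomas"]
TRAIN_LABELS = ["female"] * 5 + ["male"] * 5

model = NameGenderClassifier().fit(TRAIN_NAMES, TRAIN_LABELS)
print(model.predict("maria"))  # female
print(model.predict("john"))   # male
```

With only ten training names this toy model is easy to fool. It is meant to show the principle of learning from labelled examples, not to suggest the accuracy you would need for millions of real profiles.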

Team meeting!

Data visualization

For projects like these, it's extremely important to inform clients of the results (in this case the number of male and female Twitter users). This was even more important in our case because the policymakers wanted to see the relationship between Twitter activity and gender in different countries.

Interactive tool

That's why we developed an interactive tool using D3.js, a JavaScript library for making web-based visualizations. Our tool shows the male-female ratio of Twitter users per country. It also shows how Twitter activity has developed over a specific period. The visualization was well received and our tool was projected on a big screen during the General Assembly of the United Nations! The tool will soon be added to the UN website.

Our interactive tool
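
The male-female ratio per country comes down to a simple aggregation over the classified users. The sketch below shows one way to compute it in Python. It assumes that every user has already been reduced to a (country, gender) pair; the function name and the sample data are hypothetical and not taken from our actual pipeline.

```python
from collections import Counter, defaultdict


def gender_share_by_country(users):
    """Return {country: {"female": share, "male": share}} from (country, gender) pairs."""
    counts = defaultdict(Counter)
    for country, gender in users:
        counts[country][gender] += 1

    shares = {}
    for country, by_gender in counts.items():
        total = by_gender["female"] + by_gender["male"]
        if total == 0:
            # Skip countries without any classified users.
            continue
        shares[country] = {
            "female": by_gender["female"] / total,
            "male": by_gender["male"] / total,
        }
    return shares


# Hypothetical classified users as (country code, predicted gender).
SAMPLE_USERS = [
    ("DE", "female"), ("DE", "male"), ("DE", "male"), ("DE", "male"),
    ("NL", "female"), ("NL", "male"),
]

shares = gender_share_by_country(SAMPLE_USERS)
print(shares["DE"])  # {'female': 0.25, 'male': 0.75}
```

A dictionary like this can be written out as JSON, which is a convenient input format for a web-based visualization.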

Next step: age classification

While we've rounded off this assignment for the time being, I wouldn't be surprised if Leiden University incorporated it into a follow-up project. This would be a great step towards classifying other demographic data, like age, which can prove extremely valuable when developing new policies. After all, creating policies for youth education requires a different perspective than creating policies for young adult education. I look forward to seeing how this develops.

Gerard: “Our tool was shown during the General Assembly of the United Nations”