Ascertaining Gender of GitHub Users to Determine PR Comment Sentiment, or How to Spend Money in the Cloud and Pretend You're a Data Scientist

This is not so much an SRE post, be ye warned, but it does discuss solving problems creatively.

WARNING: Blindly following this will incur costs from cloud providers, potentially quite a lot. It’s not my fault if you don’t calculate how much your task will cost. As an example, had I used AWS Comprehend for my sentiment analysis, I calculated the cost at ~$15,000.

Why on earth would you need to know the gender of GitHub users, you ask? If you’re conducting research on bias in pull request acceptance rates, that’s why. I’ll leave the background and meta-analysis of existing literature for, you know, the actual report. The tl;dr is that research suggests women’s pull requests are declined at a higher rate than men’s when their avatars identify them as women, despite their code being, statistically, objectively better. The tl;dr of my findings is that I failed to reject the null hypothesis, with the heavy caveats that I used a small subset of the data, had only a single semester of experience with machine learning, and trained the model on only a subset of that subset. YMMV, and I in no way want to express finality on my findings.

The report and all scripts are here. You should follow along in the report, Chapter 3, if you want detailed instructions on how this all works together. ...
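Since the warning above hinges on doing the arithmetic before you start, here is a minimal sketch of that kind of back-of-the-envelope estimate. It assumes a service that bills per fixed-size block of characters with a minimum charge per request; the function names, the block size, the minimum, and the price are all placeholders I made up for illustration, not any provider's actual rates. Look up the current pricing page and plug in the real numbers.

```python
import math


def estimate_units(texts, unit_chars=100, min_units=3):
    """Count billable units for a batch of texts.

    Assumes (hypothetically) that each request is billed in blocks of
    `unit_chars` characters, rounded up, with at least `min_units` per request.
    """
    total = 0
    for text in texts:
        units = math.ceil(len(text) / unit_chars)
        total += max(units, min_units)
    return total


def estimate_cost(texts, price_per_unit, unit_chars=100, min_units=3):
    """Estimated bill for sending every text in `texts` as its own request."""
    return estimate_units(texts, unit_chars, min_units) * price_per_unit


def extrapolate_cost(sample_texts, total_count, price_per_unit, **kwargs):
    """Scale the average cost of a sample up to the full dataset size."""
    if not sample_texts:
        raise ValueError("need at least one sample text")
    per_text = estimate_cost(sample_texts, price_per_unit, **kwargs) / len(sample_texts)
    return per_text * total_count


if __name__ == "__main__":
    # Placeholder inputs: a tiny sample of comments, a made-up dataset size,
    # and a made-up price. None of these come from the report.
    sample = ["LGTM", "Please add a test for the empty-input case before merging."]
    print(extrapolate_cost(sample, total_count=1_000_000, price_per_unit=0.0001))
```

Running this against a few thousand real comments before submitting the full job is cheap, and it makes a five-figure surprise much less likely.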

2020-11-10 · 7 min · Stephan Garland