A Sankey diagram showing user clicks starting from the English Wikipedia page on Alan Turing.

Last month I applied to Outreachy, a paid internship for non-students from underrepresented groups in technology. Instead of interviewing based on your resume, part of the application involves making a contribution to the open source project you're hoping to work for. Since the internship employers tend to be big names in open source (Apache, Red Hat, the Linux Foundation, Mozilla, Wikimedia), this can be a bit intimidating. The process is very well documented, however, and volunteer mentors are available to help you as you work.

I chose to submit two contributions to the Wikimedia Foundation, for two different projects of theirs. I appreciated that they carefully scaffolded the contribution process, asking every applicant to complete the same task, instead of just setting us free in the bug tracker. For large projects with many current and aspiring contributors, it can be tough to find a bug that's appropriate for a new programmer and claim it before someone else does, much less complete the work in a reasonable amount of time while following local conventions for contributors. I looked for a suitable bug in other projects and ultimately didn't find one within the application timeframe. Luckily, the Wikimedia application tasks were well-specified, the mentors were quite responsive, and the tasks themselves were right up my alley: writing Python to analyze data and teach others.

One contribution I made was part of a larger project to write tutorials and tools for users to explore Wikipedia datasets. The task required both writing Python to analyze and visualize the data, and strong technical writing to explain the problem and potential solutions. We started with a partial outline and filled in the gaps, resulting in a tutorial aimed at users who want to investigate trends in Wikipedia edits.

Another contribution was for a project to build a tool that analyzes how users navigate through Wikipedia, based on clickstream data. This task only required writing Python, although I ran into a couple of challenges. First, the clickstream dataset is quite large, and using the idiomatic "if x in y" was prohibitively slow, so I wrote a little implementation of binary search to improve performance. Second, I initially couldn't figure out how to create a nice diagram showing clicks to and from an article. I started with networkx but was dissatisfied with the results. Then a comment from another contributor turned me on to Sankey diagrams and suggested Plotly as a way to graph them. I couldn't get the Plotly output to show up, so I tried holoviews, and that worked great! I also used pandas a bit in this project, collecting the Top 10 sources and destinations in English and German for a specific page, Alan_Turing. Check out the results.
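
For the curious, here's a minimal sketch of the binary search idea. It isn't the code from the notebook: it assumes the page titles already sit in a sorted Python list, it leans on the standard library's bisect module instead of a hand-written search, and the names contains and titles (plus the sample data) are made up for illustration.

```python
from bisect import bisect_left


def contains(sorted_titles, title):
    """Return True if title is in sorted_titles, which must be sorted ascending."""
    # bisect_left finds the leftmost position where title could be inserted
    # while keeping the list sorted, using O(log n) comparisons.
    i = bisect_left(sorted_titles, title)
    # The title is present only if that position holds an equal element.
    return i < len(sorted_titles) and sorted_titles[i] == title


# Hypothetical sample data; the real clickstream dataset is far larger.
titles = sorted(["Enigma_machine", "Alan_Turing", "Turing_test", "Bletchley_Park"])

print(contains(titles, "Alan_Turing"))
print(contains(titles, "Ada_Lovelace"))
```

The list has to be sorted once up front. After that, each lookup takes a number of comparisons that grows with the logarithm of the list length, where a plain "in" check on a list may have to scan every element.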

Working on these notebooks was pretty fun, although I spent more time than I should've. I don't anticipate budgeting that much time for other job app take-homes, for example. But I wanted to do a good job and give it my best shot, which turns out to be pretty decent. Hopefully the mentors will agree with me and I'll get an internship!