We’re working on a few different papers at the moment, but you can read this earlier paper from some of our team about a related topic: documentation in FOSS:
Many open-source software projects have become foundational components for a wide range of stakeholders and are now widely used behind the scenes to support activities across academia, the tech industry, government, journalism, and activism. OSS projects are often initially created by volunteers and provide immense benefits for society, but their maintainers can struggle with how to sustain and support their projects, particularly when those projects are widely used in increasingly critical contexts. Most OSS projects are maintained by only a handful of individuals, and community members often talk about how their projects might collapse if only one or two key individuals leave. Project leaders and maintainers must do far more than just write code to ensure a project’s long-term success: they resolve conflicts, perform community outreach, write documentation, review others’ code, mentor newcomers, coordinate with other projects, and more. However, many OSS project leaders and maintainers have publicly discussed the effects of burnout as they find themselves doing unexpected and sometimes thankless work.
The one-year research project — The Visible and Invisible Work of Maintaining Open-Source Digital Infrastructure — will study these issues in various software projects, including software libraries, collaboration platforms, and discussion platforms that have come to be used as critical digital infrastructure. The researchers will conduct interviews with project maintainers and contributors from a wide variety of projects, as well as analyze projects’ code repositories and communication platforms. The goal of the research is to better understand what project maintainers do, the challenges they face, and how their work can be better supported and sustained. This research on the invisible work of maintenance will help maintainers, contributors, users, and funders better understand the complexities within such projects, helping set expectations, develop training programs, and formulate evaluations.
This post is a summary of the first BIDS Best Practices lunch, in which we bring people together from across the Berkeley campus and beyond to discuss a particular challenge or issue in doing data-intensive research. The goal of the series is to informally share experiences and ideas on how to do data science well (or at least better) from many disciplines and contexts. The topic for this week was doing data-intensive research in teams, labs, and other groups. For this first meeting, we focused on just identifying and diagnosing the many different kinds of challenges. In future meetings, we will dive deeper into some of these specific issues and try to identify best practices for dealing with them.
We began planning for this series by reviewing many of the published papers and series around “best practices” in scientific computing (e.g. Wilson et al, 2014), “good enough practices” (Wilson et al, 2017), and PLOS Computational Biology’s “ten simple rules” series (e.g. Sandve et al, 2013; Goodman et al, 2014). We also see this series as an intellectual successor to the collection of case studies in reproducible research published by several BIDS fellows (Kitzes, Turek, and Deniz, 2018). One reason we chose to focus on the issues of doing data science in teams and groups is that many of us felt we understood how to practice data-intensive research well as individuals, but struggled with how to do so in teams and groups.
Some of the major challenges in doing data-intensive research in teams are around technology use, particularly getting everyone to use the same tools. Today’s computational researchers have an overwhelming number of options to choose from in terms of programming languages, software libraries, data formats, operating systems, compute infrastructures, version control systems, collaboration platforms, and more. One of the major challenges we discussed was that members of a team have often been trained to work with different technologies, which also often come with their own ways of working on a problem. Getting everyone on the same technical stack often takes far more time than anticipated, and new members can spend much time learning to work in a new stack.
One of the biggest divides our group had experienced was in the choice of programming language, as many of us were more comfortable with either R or Python. These languages have their own extensive software libraries, like the tidyverse versus the numpy/pandas/matplotlib stack. There are also many different software environments to choose from at various layers of the stack, from development environments like Jupyter notebooks versus RStudio and RMarkdown to the many options for package and dependency management. While most of the people in the room were committed to open-source languages and environments, many people are trained to use proprietary software like MATLAB or SPSS, which raises an additional challenge in teams and groups.
Another major issue is where the actual computing and data storage will take place. Members of a team often come in knowing how to run code on their own laptops, but there are many options for where groups can work, including a lab’s own shared physical server, campus clusters, national grid/supercomputer infrastructures, corporate cloud services, and more.
Getting everyone to use an interoperable software and hardware environment is as much of a social challenge as it is a technical one, and we had a great discussion about whether a group leader should (or could) require members to use the same language, environment, or infrastructure. One of the technical solutions to this issue — working in staged data analysis pipelines — comes with its own set of challenges. With staged pipelines, data processing and analysis work is separated into modular tasks that each individual can solve in their own way, with the results written to a standardized file that the next stage of the pipeline takes as input.
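To make this concrete, here is a minimal sketch of what one stage of such a pipeline might look like. The file paths, column names, and stage name are hypothetical; the point is only the contract of reading an agreed-upon input and writing an agreed-upon output.

```python
# clean_stage.py -- a hypothetical first stage of a staged analysis pipeline.
# It reads raw data, cleans it however its author prefers internally, and
# writes a standardized CSV that the next stage takes as input.
import pandas as pd

RAW_PATH = "data/raw/survey_responses.csv"        # produced upstream (assumed name)
CLEAN_PATH = "data/interim/responses_clean.csv"   # the contract with the next stage

def run():
    raw = pd.read_csv(RAW_PATH)
    # Each analyst can do their piece in their own way...
    clean = raw.dropna(subset=["respondent_id"]).rename(columns=str.lower)
    # ...as long as the output matches the agreed-upon schema.
    clean.to_csv(CLEAN_PATH, index=False, columns=["respondent_id", "age", "score"])

if __name__ == "__main__":
    run()
```

The next stage would read responses_clean.csv without needing to know anything about how this stage did its cleaning.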
The ideal end goal is often imagined to be a fully automated (or ‘one click’) data processing and analysis pipeline, but this is difficult to achieve and maintain in practice. Several people in our group said they personally spend substantial amounts of time setting up these pipelines and making sure that each person’s piece works with everyone else’s. Even in groups that had formalized detailed data management plans, a common theme was that someone had to constantly make sure that team members were actually following these standards so that the pipeline keeps running.
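For illustration, a minimal ‘one click’ driver might look like the sketch below, assuming hypothetical stage scripts like the one sketched above. Real groups typically reach for a workflow tool (Make, Snakemake, and the like) rather than a hand-rolled runner, precisely because keeping such a pipeline running is the hard part.

```python
# run_all.py -- a toy "one click" driver that runs each (hypothetical) pipeline
# stage in order and stops loudly if any stage breaks its part of the contract.
import subprocess
import sys

STAGES = ["clean_stage.py", "merge_stage.py", "analyze_stage.py", "figures_stage.py"]

for stage in STAGES:
    print(f"Running {stage} ...")
    result = subprocess.run([sys.executable, stage])
    if result.returncode != 0:
        sys.exit(f"{stage} failed; fix it before re-running the pipeline.")
print("Pipeline finished.")
```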
Many of the research projects we discussed involved not only handoffs between members of the team, but also handoffs between the team and external groups. The “raw” data a team begins with is often the final output of another research team, government agency, or company. In these cases, our group discussed issues that ranged from technical to social, from data formats that are technically difficult to integrate at scale (like Excel spreadsheets) to not having adequate documentation to be able to interpret what the data actually means. Similarly, teams often must deliver data to external partners, who may have very different needs, expectations, and standards than the team has for itself. Finally, some teams have sensitive data privacy issues and requirements, which makes collaboration even more difficult. How can these external relationships be managed in mutually beneficial ways?
Beyond technical challenges, a number of management issues face research groups aspiring to implement best practices for data-intensive research. Our discussion highlighted the difficulties of composing a well-balanced team, of dealing with fluid membership, and of fostering generative coordination and communication among group members.
Data-intensive research groups require a team with varied expertise. A consequence of varied expertise is varied capabilities and end goals, so project leads must devote attention to managing team composition. Whereas one or two members might be capable of carrying out tasks across the various stages of research, others might specialize in a particular area. How, then, can research groups ensure that the departure of any one member would not collapse the project, and that the team holds the expertise necessary to accomplish the shared research goal? Furthermore, some members may participate simply to acquire skills, while others seek to establish or build an academic track record. How might groups achieve alignment between personal and team goals?
A practical management problem also relates to the quasi-voluntary and fluid nature of research groups. Research groups rely extensively on students and postdocs, with an expectation that they join the team temporarily to gain new skills and experience, then leave. Turnover becomes a problem when processes, practices, and tacit institutional knowledge are difficult to standardize or document. What strategies might project leads employ to alleviate the difficulties associated with voluntary, fluid membership?
The issues of team composition and voluntary or fluid membership raise a third challenge: fostering open communication among group members. Previous research and guidelines for managing teams (Edmondson, 1999; Google re:Work, 2017) emphasize the vital role of psychological safety in ensuring that team members share knowledge and collaborate effectively. Adequate psychological safety ensures that team members are comfortable speaking up about their ideas and welcoming of others’ feedback. Yet fostering psychological safety is a difficult task when research groups comprise members with various levels of expertise, career experience, and, increasingly, communities of practice (as in the case of data scientists working with domain experts). How can projects establish avenues for open communication between diverse members?
One of the major issues that resonated across our group was the tendency for a team to stop following various best practices when deadlines rapidly approach. In the rush to do everything that is needed to get a publication submitted, it is easy to accrue what software engineers call “technical debt.” For example, substantial “collaboration debt” or “reproducibility debt” can be foisted on a team when a member works outside of the established workflow to produce a figure or fails to document their changes to analysis code. These stressful moments can also be difficult for the team’s psychological safety, particularly if there is an expectation to work late hours to make the deadline.
At the conclusion of our first substantive meeting, we began to evaluate topics for future discussions that might help us identify potential solutions to the challenges faced by data-intensive research groups. In doing so, we were quickly confronted with the diversity of technologies, research agendas, disciplinary norms, team compositions, governance structures, and other factors that characterize scientific research groups. Are solutions that work for large teams appropriate for smaller teams? Do cross-institutional or interdisciplinary teams face different problems than those working in the same institution or discipline? Are solutions that work in astronomy or physics appropriate for ecology or the social sciences? Dealing with such diversity and contextuality, then, might require adjusting our line of inquiry to the following question: At what level should we attempt to generalize best practices?
The differences within and between research groups are meaningful and deserve adequate attention, but commonalities do exist. This semester, our group will aggregate and develop input from a diverse community of practitioners to construct sets of thoughtful, grounded recommendations. For example, we’ll aim to provide recommendations on issues such as how to build and maintain pipelines and workflows, as well as strategies for achieving diversity and inclusion in teams. In our next post, we’ll offer some insights on how to manage the common problem of perpetual turnover in team membership. On all topics, we welcome feedback and recommendations.
Finally, many people who attended told us afterwards how positive and valuable it was to share these kinds of issues and experiences, particularly for combatting the “impostor syndrome” that many of us often feel. We typically only present the final end-product of research. Even sharing one’s final code and data in perfectly reproducible pipelines can still hide all the messy, complex, and challenging work that goes into the research process. People deeply appreciated hearing others talk openly about the difficulties and challenges that come with doing data-intensive research and how they tried to deal with them. The format of sharing challenges followed by strategies for dealing with those challenges may be a meta-level best practice for this kind of work, versus the more standard approach of listing more abstract rules and principles. Through these kinds of conversations, we hope to continue to shed light on the doing of data science in ways that will be constructive and generative across the many fields, areas, and contexts in which we all work.
In our institutions, we employ multidisciplinary research staff who work with colleagues across many research fields to use and create software to understand and exploit research data. These researchers collaborate with others across the academy to create software and models to understand, predict and classify data not just as a service to advance the research of others, but also as scholars with opinions about computational research as a field, making supportive interventions to advance the practice of science.
In some of our institutions we use the term “data scientist” to refer to these team members, in others we use “research software engineer” (RSE), and in some we use both. Where both terms are used, the difference seems to be that data scientists in an academic context focus more on using software to understand data, while research software engineers more often make software libraries for others to use. However, in some places, one or the other term is used to cover both, according to local tradition.
Regardless of job title, we hold in common many of the skills involved and the goal of driving the use of open and reproducible research practices.
Shared skill focuses include:
Shared attitudes and approaches to work are also important commonalities:
The very close relationship thus seen between the two professional titles is not an accident. In different places, different tactics have been tried to resolve a common set of frustrations seen as scholars struggle to make effective use of information technology.
In the UK, the RSE Groups have tried to move computational research forward by embracing a service culture while retaining participation in the academic community, sometimes described as being both a “craftsperson and a scholar”, or science-as-a-service. We believe we make a real difference to computational research as a discipline by helping individual research groups use and create software more effectively for research, and that this helps us to create genuine value for researchers rather than to build and publish tools that are not used by researchers to do research.
The Moore-Sloan Data Science Environments (MSDSE) in the US are working to establish data science as a new interdisciplinary academic field, bringing together researchers from domain and methodology fields to collectively develop best practices and software for academic research. While these institutes also facilitate collaboration across academia, their funding models are less based on a service model than those of UK RSE groups, and more based on bringing graduate students, postdocs, research staff, and faculty from across academia together in a shared environment.
Although these approaches differ strongly, we nevertheless see that the skills, behaviours and attitudes used by the people struggling to make this work are very similar. Both movements are tackling similar issues, but in different institutional contexts. We took diverging paths from a common starting point, but now find ourselves envisaging a shared future.
The Alan Turing Institute in the UK straddles the two models, with both a Research Engineering Group following a science-as-a-service model and comprising both Data Scientists and RSEs, and a wider collaborative academic data science engagement across eleven partner universities.
Observing this convergence, we recommend:
Data and software have enmeshed themselves in the academic world, and are a growing force in most academic disciplines (many of which are not traditionally seen as “data-intensive”). Many universities wish to improve their ability to create software tools, enable efficient data-intensive collaborations, and spread the use of “data science” methods in the academic community.
The fundamentally cross-disciplinary nature of such activities has led to a common model: the creation of institutes or organisations not bound to a particular department or discipline, focusing on the skills and tools that are common across the academic world. However, creating institutes with a cross-university mandate and non-standard academic practices is challenging. These organisations often do not fit into the “traditional” academic model of institutes or departments, and involve work that is not incentivised or rewarded under traditional academic metrics. To add to this challenge, the combination of quantitative and qualitative skills needed is also highly in-demand in non-academic sectors. This raises the question: how do you create such institutes so that they attract top-notch candidates, sustain themselves over time, and provide value both to members of the group as well as the broader university community?
In recent years many universities have experimented with organisational structures aimed at achieving this goal. They focus on combining research software, data analytics, and training for the broader academic world, and intentionally cut across scientific disciplines. Two such groups are the Moore-Sloan Data Science Environments based in the USA and the Research Software Engineer groups based in the UK. Representatives from both countries recently met at the Alan Turing Institute in London for the RSE4DataScience18 Workshop to discuss their collective experiences in creating successful data science and research software institutes.
This article synthesises the collective experience of these groups, with a focus on challenges and solutions around the topic of sustainability. To put it bluntly: a sustainable institute depends on sustaining the people within it. This article focuses on three topics that have proven crucial.
We’ll discuss each of these points below, and provide some suggestions, tips, and lessons-learned in accomplishing each.
The terms Research Software Engineer (i.e. RSE; most often used by UK partners) and Data Scientist (most often used by USA partners) have slightly different connotations, but we will not dwell on those aspects here (see Research Software Engineers and Data Scientists: More in Common for some more thoughts on this). In the current document, we will mostly use the terms RSE and Data Scientist interchangeably, to denote the broad range of positions that focus on software-intensive and data-intensive research within academia. In practice, we find that most people flexibly operate in both worlds simultaneously.
How can institutions find the financial support to run an RSE program?
The primary challenge for sustainability of this type of program is often financial: how do you raise the funding necessary to hire data scientists and support their research? While this doesn’t require paying industry-leading rates for similar work, it does require resources to compensate people comfortably. In practice, institutions have come at this from a number of angles:
Private Funding: Funding from private philanthropic organisations has been instrumental in getting some of these programs off the ground: for example, the Moore-Sloan Data Science Initiative funded these types of programs for five years at the University of Washington (UW), UC Berkeley, and New York University (NYU). This is probably best viewed as seed funding to help the institutions get on their feet, with the goal of seeking other funding sources for the long term.
Organisational Grants: Many granting organisations (such as the NSF or the UK Research Councils) have seen the importance of software to research, and are beginning to make funding available specifically for cross-disciplinary software-related and data science efforts. Examples are the Alan Turing Institute, mainly funded by the UK Engineering and Physical Sciences Research Council (EPSRC), and the NSF IGERT grant awarded to UW, which funded the interdisciplinary graduate program centered on the data science institute there.
Project-based Grants: There are also opportunities to gain funding for the development of software or to carry out scientific work that requires creating new tools. For example, several members of UC Berkeley were awarded a grant from the Sloan Foundation to hire developers for the NumPy software project. The grant provided enough funding to pay competitive wages with the broader tech community in the Bay Area.
Individual Grants: For organisations that give their RSEs principal investigator status, grants to individuals’ research programs can be a route to sustainable funding, particularly as granting organisations become more aware of and attuned to the importance of software in science. In the UK, the EPSRC has run two rounds of Research Software Engineer Fellowships, supporting leaders in the research software field for a period of five years to establish their RSE groups. Another example of a small grant for individuals promoting and supporting RSE activities is the Software Sustainability Institute fellowship.
Paid Consulting: Some RSE organisations have adopted a paid consulting model, in which they fund their institute by consulting with groups both inside and outside the university. This requires finding common goals with non-academic organisations, and agreeing to create open tools in order to accomplish those goals. An example is at Manchester, where as part of their role in research IT, RSEs provide paid on-demand technical research consulting services for members of the University community. Having a group of experts on campus able to do this sort of work is broadly beneficial to the University as a whole.
University Funding: Universities generally spend part of their budget on in-house services for students and researchers; a prime example is IT departments. When RSE institutes establish themselves as providing a benefit to the University community, the University administration may see fit to support those efforts: this has been the case at UW, where the University funds faculty positions within the data science institute. In addition, several RSE groups provide on-demand training sessions for research groups on campus in exchange for proceeds from research grants.
Information Technology (IT) Connections: IT organisations in universities are generally well-funded, and their present-day role is often far removed from their original mission of supporting computational research. One vision for sustainability is to reimagine RSE programs as the “research wing” of university IT, to make use of the relatively large IT funding stream to help enable more efficient computational research. This model has been implemented at the University of Manchester, where Research IT sits directly within the Division of IT Services. Some baseline funding is provided to support things like research application support and training, and RSE projects are funded via cost recovery.
Professors of Practice: Many U.S. universities have the notion of “professors of practice” or “clinical professors,” which often exist in professional schools like medicine, public policy, business, and law. In these positions, experts in specialised fields are recruited as faculty for their experience outside of traditional academic research. Such positions are typically salaried, but not tenure-track, with these faculty evaluated on different qualities than traditional faculty. Professors of practice are typically able to teach specialised courses, advise students, influence the direction of their departments, and get institutional support for various projects. Such a model could be applied to support academic data science efforts, perhaps by adopting the “professor of practice” pattern within computational science departments.
Research Librarians: We also see similarities in how academic libraries have supported stable, long-term career paths for their staff. Many academic librarians are experts in both a particular domain specialty and in library science, and spend much of their time helping members of the community with their research. At some universities, librarians have tenure-track positions equivalent to those in academic departments, while at others, librarians occupy a distinct administrative or staff track that often comes with substantial long-term job security and career progression. These types of institutions and positions provide a precedent for the kinds of flexible, yet stable, academic careers that our data science institutes support.
How to create a successful environment where people feel valued?
From our experience, there are four main points that help create an enjoyable and successful environment and make people feel valued in their role.
Physical Space. The physical space that hosts the group plays an important role in creating an enjoyable working environment. In most cases there will be a lot of collaboration going on between people within the group, but also with people from other departments within the university. Having facilities (e.g. meeting spaces) that support collaborative work on software projects will be a big facilitator of successful outputs.
Get Started Early. Another important aspect of creating a successful environment is to connect the group to other researchers within the university early on. It is important to inform people about the tasks and services the group provides, and to involve people early on who are well connected and respected within the university so that they can promote and champion the group. This helps get the efforts off the ground early, spread the word, and bring in further opportunities.
Celebrate Each Other’s Work. While it may not be possible to convince the broader academic community to treat software as first-class research output, data science organisations should explicitly recognise many forms of scientific output, including tools and software, analytics workflows, or non-standard written communication. This is especially true for projects where there is no “owner”, such as major open-source projects. Just because your name isn’t “first” doesn’t mean you can’t make a valuable contribution to science. Creating a culture that celebrates these efforts makes individuals feel that their work is valued.
Allow Free Headspace. The roles of individuals should (i) enable them to work in collaboration with researchers from other domains (e.g., in a support role on their research projects) and (ii) allow them to explore their own ‘research’ ideas. Involvement in research projects not only helps these projects develop reliable and reproducible results, but can also be an important way to identify areas and tasks that are currently poorly supported by existing research software. Having free headspace allows individuals to further pursue ideas that help solve the identified tasks. There are many examples of successful open-source software projects that started as small side projects.
How do we establish career trajectories that value people’s skills and experience in this new inter-disciplinary domain?
The final dimension that we consider is that of the career progression of data scientists. Their career path generally differs from the traditional academic progression, and the traditional academic incentives and assessment criteria do not necessarily apply to the work they perform.
Professional Development. A data science institute should prepare its staff both in technical skills (such as software development best practices and data-intensive activities) and in soft skills (such as teamwork and communication) that will allow them to be ready for their next career step in multiple interdisciplinary settings. Whether in academia or industry, data science is inherently collaborative, and requires working with a team composed of diverse skillsets.
Where Next. Most individuals will not spend their entire careers within a data science institute, which means their time there must be seen as adequately preparing them for their next step. We envision that a data scientist could progress in their career either by staying in academia or by moving to an industry position. For the former, career progression might involve moving into new supervisory roles, attaining PI status, or building research groups. For the latter, the acquired technical and soft skills are valuable in industrial settings and should allow for a smooth transition. Members should be encouraged to collaborate or communicate with industry partners in order to understand the roles that data analytics and software play in those organisations.
The Revolving Door. The career trajectory from academia to industry has traditionally been mostly a one-way street, with academic researchers and industry engineers living in different worlds. However, the value of data analytic methods cuts across both groups, and offers opportunities to learn from one another. We believe a Data Science Institute should encourage strong collaborations and a bi-directional and fluid interchange between academic and industrial endeavours. This will enable a more rapid spread of tools and best-practices, and support the intermixing of career paths between research and industry. We see the institute as ‘the revolving door’ with movement of personnel between different research and commercial roles, rather than a one-time commitment where members must choose one or the other.
Though these efforts are still young, we have already seen the dividends of supporting RSEs and Data Scientists within our institutions in the USA and the UK. We hope this document can provide a roadmap for other institutions to develop sustainable programs in support of cross-disciplinary software and research.
We recently held a workshop at the ETHOS Lab and the Data as Relation project at ITU Copenhagen, as part of Stuart Geiger’s seminar talk on “Computational Ethnography and the Ethnography of Computation: The Case for Context” on 26 March 2018. Tapping into his valuable experience and his position as a staff ethnographer at the Berkeley Institute for Data Science, we wanted to think together about the role that computational methods could play in ethnographic and interpretivist research. Over the past decade, computational methods have exploded in popularity across academia, including in the humanities and interpretive social sciences. Stuart’s talk made an argument for a broad, collaborative, and pluralistic approach to the intersection of computation and ethnography, arguing that ethnography has many roles to play in what is often called “data science.”
Based on Stuart’s talk the previous day, we began the workshop with three different distinctions about how ethnographers can work with computation and computational data: First, the “ethnography of computation” involves using traditional qualitative methods to study the social, organizational, and epistemic life of computation in a particular context: how do people build, produce, work with, and relate to systems of computation in their everyday life and work? Ethnographers have been doing such ethnographies of computation for some time, and many frameworks — from actor-network theory (Callon 1986, Law 1992) to “technography” (Jansen and Vellema 2011, Bucher 2012) — have been useful to think about how to put computation at the center of these research projects.
Second, “computational ethnography” involves extending the traditional qualitative toolkit to include the computational analysis of data from a fieldsite, particularly when working with trace or archival data that ethnographers have not generated themselves. Computational ethnography does not replace methods like interviews and participant-observation with such analyses, but supplements them. Frameworks like “trace ethnography” (Geiger and Ribes 2010) and “computational grounded theory” (Nelson 2017) have been useful ways of thinking about how to integrate these new methods alongside traditional qualitative methods, while upholding the particular epistemological commitments that make ethnography a rich, holistic, situated, iterative, and inductive method. Stuart walked through a few Jupyter notebooks from a recent paper (Geiger and Halfaker, 2017) in which they replicated and extended a previously published study about bots in Wikipedia. In this project, they found computational methods quite useful in identifying cases for qualitative inquiry, and they also used ethnographic methods to inform a set of computational analyses in ways that were more specific to Wikipedians’ local understandings of conflict and cooperation than previous research.
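As a rough illustration of how trace data can surface cases for qualitative inquiry (not a reproduction of the actual analysis in that paper), a first pass might look something like the sketch below; the file name and column names are assumptions.

```python
# A minimal sketch of using trace data to surface cases for qualitative
# follow-up, in the spirit of trace ethnography. File and columns are assumed.
import pandas as pd

edits = pd.read_csv("bot_revisions.csv", parse_dates=["timestamp"])

# Keep only revisions where one bot reverted another bot's edit.
bot_reverts = edits[(edits["editor_is_bot"]) & (edits["reverted_editor_is_bot"])]

# Count revert interactions per pair of bots, then inspect the most frequent
# pairs qualitatively: read the talk pages, task approvals, and source code
# to ask whether these are genuine conflicts or routine, benign housekeeping.
pairs = (bot_reverts.groupby(["editor", "reverted_editor"])
         .size().sort_values(ascending=False))
print(pairs.head(10))
```

The output is not a finding in itself; it is a list of places to go look, read, and ask questions.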
Finally, the “computation of ethnography” (thanks to Mace for this phrasing) involves applying computational methods to the qualitative data that ethnographers generate themselves, like interview transcripts or typed fieldnotes. Qualitative researchers have long used software tools like NVivo, Atlas.TI, or MaxQDA to assist in the storage and analysis of data, but what are the possibilities and pitfalls of storing and analyzing our qualitative data in various computational ways? Even ethnographers who use more standard word processing tools like Google Docs or Scrivener for fieldnotes and interviews can use computational methods to organize, index, tag, annotate, aggregate and analyze their data. From topic modeling of text data to semantic tagging of concepts to network analyses of people and objects mentioned, there are many possibilities. As multi-sited and collaborative ethnography are also growing, what tools let us collect, store, and analyze data from multiple ethnographers around the world? Finally, how should ethnographers deal with the documents and software code that circulate in their fieldsites, which often need to be linked to their interviews, fieldnotes, memos, and manuscripts?
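For instance, a “computation of ethnography” pass over one’s own typed fieldnotes or transcripts might look something like this sketch, which topic-models a folder of text files. The directory name and parameter choices are assumptions, and the resulting topics are prompts for interpretation rather than findings.

```python
# A rough sketch of topic modeling typed fieldnotes or interview transcripts
# as an aid to (not a replacement for) interpretive analysis.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [p.read_text() for p in Path("fieldnotes/").glob("*.txt")]

vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=8, random_state=0)
lda.fit(counts)

# Print the top terms per topic as a starting point for closer reading.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-8:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```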
These are not hard-and-fast distinctions, but should instead be seen as sensitizing concepts that draw our attention to different aspects of the computation/ethnography intersection. In many cases, we spoke about doing all three (or wanting to do all three) in our own projects. Like all definitions, they blur as we look closer at them, but this does not mean we should abandon the distinctions. For example, computation of ethnography can also strongly overlap with computational ethnography, particularly when thinking about how to analyze unstructured qualitative data, as in Nelson’s computational grounded theory. Yet it was productive to have different terms to refer to particular scopings: our discussion of using topic modeling of interview transcripts to help identify common themes was different from our discussion of analyzing activity logs to see how prevalent a particular phenomenon was, which was in turn different from our discussion of a situated investigation of the invisible work of code and data maintenance.
We then worked through these issues in the specific context of two cases from the ETHOS Lab and the Data as Relation project, where Bastian and Michael are both studying public-sector organizations in Denmark that work with vast quantities and qualities of data and are often seeking to become more “data-driven.” In the Danish tax administration (SKAT) and the Municipality of Copenhagen’s Department of Cultural and Recreational Activities, there are many projects attempting to leverage data further in various ways. For Michael, the challenge is to be able to trace how method assemblages and sociotechnical imaginaries of data travel from private organisations and sites to public organisations, and influence the way data is worked with and what possibilities data are associated with. While doing participant-observation, Michael suggested that a “computation of ethnography” approach might make it easier to trace connections between disparate sites and actors.
In one group, we explored the idea of the Perfect Information Organisation, or PIO, in which there are traces available of all workplace activity. This nightmarish panopticon construction would include video and audio surveillance of every meeting and interaction, detailed traces of every activity online, and detailed minutes on meetings and decisions. All of this would be available for the ethnographer, as she went about her work.
The PIO is of course a thought experiment designed to provoke the common desire or fantasy for more data. This is something we all often feel in our fieldwork, but we felt this raised many implicit risks if one combined and extended the three types of ethnography detailed earlier on. By thinking about the PIO, ludicrous though it might be, we would challenge ourselves to look at what sort of questions we could and should ask in such a situation. We came up with the following questions, although there are bound to be many more:
What the list shows is that although the PIO may come off as a wet dream of the data-obsessed or fetishistic researcher, even it has limits as a hypothetical thought experiment. Information is always situated in a context, often defined in relation to where and what information is not available. Yet as we often see in our own fieldwork (and constantly in the public sphere), the fantasies of total or perfect information persist for powerful reasons. Our suggestion was that such a thought experiment would be a good initial exercise for a researcher about to embark on a mixed-methods/ANT/trace ethnography inspired research approach in a site heavily infused with many data sources. The challenge of what topics and questions to ask in ethnography is always as difficult as asking what kind of data to work with, even if we put computational methods and trace data aside. We brought up many tradeoffs in our own fieldwork, such as when getting access to archival data means that the ethnographer is not spending as much time in interviews or participant observation.
This also touches on some of the central questions which the workshop provoked but didn’t answer: what is the phenomenon we are studying, in any given situation? Is it the social life of an organisation, that life as distributed across a platform and “real life” social interactions, or the platform’s affordances and traces themselves? While there is always a risk of making problematic methodological trade-offs in trying to get both digital and more classic ethnographic traces, there is also, perhaps, a methodological necessity in paying attention to the many different types of traces available when the phenomenon we are interested in takes place both online, at the bar, and elsewhere. We concluded that ethnography’s intentionally iterative, inductive, and flexible approach to research applies to these methodological tradeoffs as well: as you get access to new data (either through traditional fieldwork or digitized data), ask what you are not focusing on as you see something new.
In the end, these reflections bear a distinct risk of indulging in fantasy: the belief that we can ever achieve a full view (the view from nowhere), or a holistic or even total view of social life in all its myriad forms, whether digital or analog. The principles of ethnography are most certainly not about exhausting the phenomenon, so we do well to remain wary of this fantasy. Today, ethnography is often theorized as documentation of an encounter between an ethnographer and people in a particular context, with the partial perspectives to be embraced. However, we do believe that it is productive to think through the PIO and to not write off in advance traces which do not correspond with an orthodox view of what ethnography might consider proper material or data.
The second group’s conversation originated from the wish of an ethnographer to gain access to a document-sharing platform in the organization where the ethnographer is doing fieldwork. Of course, it is not just one platform, but a loose collection of platforms in various stages of construction, adoption, and acceptance. As we know, ethnographers are careful not only about the wishes of others but also about their own wishes — how would it change their ethnography if they had access to countless internal documents, records, archives, and logs? So rather than “just doing (something)”, the ethnographer took a step back and became puzzled over wanting such a strange thing in the first place.
In the group, we speculated about what would happen if the ethnographer got their wish for access to as much data as possible from the field. Would a “Google Street View” of the site recorded from head-mounted 360° cameras be too much? Probably. On highly mediated sites — Wikipedia serving as an example during the workshop — plenty of traces are publicly left by design. Such archival completeness is a property of some media in some organizations, but not others. In ethnographies of computation, the wish for total access brings some particular problems (or opportunities), as a plenitude of traces and documents are being shared on digital platforms. We talked about three potential problems, the first and most obvious being that the ethnographer drowns in the available data. A second problem is for the ethnographer to believe that getting more access will provide them with a more “whole” or full picture of the situation. The final problem we discussed was whether the ethnographer would end up replicating the problems of the people in the organization they are studying, namely working out how to deal with a multitude of heterogeneous data in their work.
Beyond these problems, we also asked why the ethnographer would want access to the many documents and traces in the first place. What ideas of ethnography and epistemology does such a desire imply? Would the ethnographer want to “power up” their analysis by mimicking the rhetoric of “the more data the better”? Would the ethnographer add their own data (in the form of field notes and pictures) and, through visualisations, show a different perspective on the situation? Even though we reject the notion of a panoptic view on various grounds, we are still left with the question of how much data we need or should want as ethnographers. Imagine that we are puzzled by a particular discussion: would we benefit from having access to a large pile of documents or logs that we could computationally search through for further information? Or would more traditional ethnographic methods like interviews actually be better for the goals of ethnography?
“Bringing data home” is an idea and phrase that originates from the fieldsite and captures something about the intentions that are playing out. One must wonder what is implied by that idea, and what the idea does. A straightforward reading would be that it describes a strategic and managerial struggle to cut off a particular data intermediary — a middleman — and restore a more direct data relationship between the agency and the actors using the data they provide. A product/design struggle, so to say. Pushing the speculations further, what might that homecoming, that completion of the re-redesign of data products, be like? As ethnographers, and participants in the events we write about, when do we say “come home, data”, or “go home, data”? What ethnography or computation will be left to do when data has arrived home? In all, we found a common theme in ethnographic fieldwork — that our own positionalities and situations often reflect those of the people in our fieldsites.
It is interesting that our two groups did not explicitly coordinate our topics – we split up and independently arrived at very similar thought experiments and provocations. We reflected that this is likely because all of us attending the workshop were in similar kinds of situations, as we are all struggling with the dual problem of studying computation as an object and working with computation as a method. We found that these kinds of speculative thought experiments were useful in helping us define what we mean by ethnography. What are the principles, practices, and procedures that we mean when we use this term, as opposed to any number of others that we could also use to describe this kind of work? We did not want to do too much boundary work or policing what is and isn’t “real” ethnography, but we did want to reflect on how our positionality as ethnographers is different than, say, digital humanities or computational social science.
We left with no single, simple answers, but more questions — as is probably appropriate. Where do contributions of ethnography of computation, computational ethnography, or computation of ethnography go in the future? We instead offer a few next steps:
Of all the various fields and disciplines that have taken up ethnography in a computational context, what are their various theories, methods, approaches, commitments, and tools? For example, how is work that has more of a home in STS different from that in CSCW or anthropology? Should ethnographies of computation, computational ethnography, and computation of ethnography look the same across fields and disciplines, or different?
Of all the various ethnographies of computation taking place in different contexts, what are we finding about the ways in which people relate to computation? Ethnography is good at coming up with case studies, but we often struggle (or hesitate) to generalize across cases. Our workshop brought together a diverse group of people who were studying different kinds of topics, cases, sites, peoples, and doing so from different disciplines, methods, and epistemologies. Not everyone at the workshop primarily identified as an ethnographer, which was also productive. We found this mixed group was a great way to force us to make our assumptions explicit, in ways we often get away with when we work closer to home.
Of computational ethnography, can we propose new, operationalizable mathematical approaches to working with trace data in context? How much should the analysis of trace data depend on the ethnographer’s personal intuition about how to collect and analyze data? How much should computational ethnography involve the integration of interviews and fieldnotes alongside computational analyses?
Of computation of ethnography, what does “tooling up” involve? What do our current tools do well, and what do we struggle to do with them? How do their affordances shape the expectations and epistemologies we have of ethnography? How can we decouple the interfaces from their data, such as exporting the back-end database used by a more standard QDA program and analyzing it programmatically using text analysis packages, and find useful cuts to intervene in, in an ethnographic fashion, without engineering everything from some set of first principles? What skills would be useful in doing so?
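As a speculative example of such decoupling, one might export a QDA project’s back-end database and query the coded segments directly. The file, table, and column names below are invented; any real export would need to be inspected first, since each package stores its codes differently.

```python
# A speculative sketch of "decoupling the interface from its data": reading a
# back-end database exported from a QDA package and re-analyzing coded segments
# programmatically. Database file, table, and column names are assumptions.
import sqlite3
import pandas as pd

con = sqlite3.connect("project_export.sqlite")
segments = pd.read_sql_query(
    "SELECT document, code, segment_text FROM coded_segments", con)

# Which codes co-occur within the same document? A simple cut that can point
# back to passages worth re-reading closely in the QDA tool itself.
per_doc = segments.groupby("document")["code"].apply(set)
print(per_doc.head())
```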
We’re pleased to be organizing one of the open panels at the 2018 Meeting of the Society for the Social Studies of Science (4S). Please submit an abstract!
In this continuation of the previous Critical Data Studies / Studying Data Critically tracks at 4S (see also Dalton and Thatcher 2014; Iliadis and Russo 2016), we invite papers that address the organizational, social, cultural, ethical, and otherwise human impacts of data science applications in areas like science, education, consumer products, labor and workforce management, bureaucracies and administration, media platforms, or families. Ethnographies, case studies, and theoretical works that take a situated approach to data work, practices, politics, and/or infrastructures in specific contexts are all welcome.
Datafication and autonomous computational systems and practices are producing significant transformations in our analytical and deontological frameworks, sometimes with objectionable consequences (O’Neil 2016; Barocas, Bradley, Honavar, and Provost 2017). Whether we’re looking at the ways in which new artefacts are constructed or at their social consequences, questions of value and valuation or objectivity and operationalization are indissociable from the processes of innovation and the principles of fairness, reliability, usability, privacy, social justice, and harm avoidance (Campolo, Sanfilippo, Whittaker, and Crawford, 2017).
By reflecting on situated, unintended, and objectionable consequences, we will gather a collection of works that illuminate one or several aspects of the unfolding of controversies and ethical challenges posed by these new systems and practices. We’re specifically interested in pieces that provide innovative theoretical insights about ethics and controversies, fieldwork, and reflexivity about the researcher’s positionality and her own ethical practices. We also encourage submissions from practitioners and educators who have worked to infuse ethical questions and concerns into a workflow, pedagogical strategy, collaboration, or intervention.
I use the case of Wikipedia's unusually open algorithmic systems to rethink the "black box" metaphor, which has become a standard way to think about ethical, social, and political issues around artificial intelligence, machine learning, expert systems, and other automated, data-driven decision-making processes. Entire conferences are being held on these topics, like Fairness, Accountability, and Transparency in Machine Learning (FATML) and Governing Algorithms. In much current scholarship and policy advocacy, there is often an assumption that we are after some internal logic embedded in the codebase (or "the algorithm") itself, which has been hidden from us for reasons of corporate or state secrecy. Many times this is indeed the right goal, but scholars are increasingly raising broader and more complex issues around algorithmic systems, such as work from Nick Seaver (PDF), Tarleton Gillespie (PDF), Kate Crawford (link), and Jenna Burrell (link), which I build on in the case of Wikipedia.
In the article, I discuss these algorithmic systems as being a part of Wikipedia's particular organizational culture, focusing on how becoming and being a Wikipedian involves not just learning traditional cultural norms, but also becoming familiar with various algorithmic systems that operate across the site. In Wikipedia's unique setting, we see how the questions of algorithmic transparency and accountability subtly shift away from asking if such systems are open to an abstract, aggregate "public." Based on my experiences in Wikipedia, I instead ask: For whom are these systems open, transparent, understandable, interpretable, negotiable, and contestable? And for whom are they as opaque, inexplicable, rigid, bureaucratic, and even invisible as the jargon, rules, routines, relationships, and ideological principles of any large-scale, complex organization? Like all cultures, Wikipedian culture can be quite opaque, hard to navigate, difficult to fully explain, constantly changing, and has implicit biases – even before we consider the role of algorithmic systems. In looking to approaches to understanding culture from the humanities and the interpretive social sciences, we get a different perspective on what it means for algorithmic systems to be open, transparent, accountable, fair, and explainable.
I should say that I'm a huge fan and advocate of work on "opening the black box" in the more traditional information-theoretic sense, which tries to audit and/or reverse engineer how Google search results are ranked, how Facebook news feeds are filtered, how Twitter's trending topics are identified, or similar kinds of systems that are making (or helping make) decisions about setting bail for a criminal trial, who gets a loan, or who is a potential terrorist threat. So many of these systems that make decisions about the public are opaque to the public, protected as trade secrets or for reasons of state security. There is a huge risk that such systems have deeply problematic biases built in (unintentionally or otherwise), and many people are trying to reverse engineer or otherwise audit such systems, as well as looking at issues like biases in the underlying training data used for machine learning. For more on this topic, definitely look through the proceedings of FATML, read books like Frank Pasquale's The Black Box Society and Cathy O'Neil's Weapons of Math Destruction, and check out the Critical Algorithms Studies reading list.
Yet when I read this kind of work and hear these kinds of conversations, I often feel strangely out of place. I've spent many years investigating the role of highly-automated algorithmic systems in Wikipedia, whose community has strong commitments to openness and transparency. And now I'm in the Berkeley Institute for Data Science, an interdisciplinary academic research institute where open source, open science, and reproducibility are not only core values many people individually hold, but also a major focus area for the institute's work.
So I'm not sure how to make sense of my own position in the "algorithms studies" sub-field when I hear of heroic (and sometimes tragic) efforts to try and pry open corporations and governmental institutions that are increasingly relying on new forms of data-driven, automated decision-making and classification. If anything, I have the opposite problem: in the spaces I tend to spend time in, the sheer amount of code and data I can examine can be so open that it is overwhelming to navigate. There are so many people in academic research and the open source / free culture movements who want a fresh pair of eyes on the work they've done, which often uses many of the same fundamental approaches and technologies that concern us when hidden away by corporations and governments.
Wikipedia has received very little attention from those who focus on issues around algorithmic opacity and interpretability (even less so than scientific research, but that's a different topic). Like almost all the major user-generated content platforms, Wikipedia deeply relies on automated systems for reviewing and moderating the massive number of contributions made to Wikipedia articles every day. Yet almost all of the code and most of the data keeping Wikipedia running is open sourced, including the state-of-the-art machine learning classifiers trained to distinguish good contributions from bad ones (for different definitions of good and bad).
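As one concrete example of that openness, these classifiers (the ORES scoring service) can be queried over a public web API. The sketch below is a rough illustration of scoring a single revision; the endpoint path and response structure are from my recollection of the v3 API rather than a checked reference, and should be verified against the current documentation.

```python
# A rough sketch of querying a public edit-quality classifier for Wikipedia.
# The endpoint path and response structure follow my recollection of the ORES
# v3 API and may have changed; treat them as assumptions, not a reference.
import requests

rev_id = 123456789  # a hypothetical revision ID
url = f"https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/damaging"

resp = requests.get(url, timeout=10)
resp.raise_for_status()

# Expected (assumed) shape: {"enwiki": {"scores": {"<rev_id>": {"damaging": {"score": ...}}}}}
score = resp.json()["enwiki"]["scores"][str(rev_id)]["damaging"]["score"]
print(score["prediction"], score["probability"])
```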
The design, development, deployment, and discussion of such systems generally takes place in public forums, including wikis, mailing lists, chat rooms, code repositories, and issue/bug trackers. And this is not just a one-way mirror into the organization, as volunteers can and do participate in these debates and discussions. In fact, the people who are paid staff at the Wikimedia Foundation tasked with developing and maintaining these systems are often recruiting volunteers to help, since the Foundation is a non-profit that doesn't have the resources that a large company or even a smaller startup has.
From all this, Wikipedia may appear to be the utopia of algorithmic transparency and accountability that many scholars, policymakers, and even some industry practitioners are calling for in other major platforms and institutions. So for those of us who are concerned with black-boxed algorithmic systems, I ask: is open source, open data, and open process the solution to all our problems? Or more constructively, when those artificial constraints on secrecy are not merely removed by some external fiat, but something that people designing, developing, and deploying such systems strongly oppose on ideological grounds, what will our next challenge be?
In trying to work through my understanding of this issue, I argue we need to take an expanded micro-sociological view of algorithmic systems as deeply entwined with particular facets of culture. We need to look at algorithmic systems not just in terms of how they make decisions or recommendations by transforming inputs into outputs, but also ask how they transform what it means to participate in a particular socio-technical space. Wikipedia is a great place to study that, and many Wikipedia researchers have focused on related topics. For example, newcomers to Wikipedia must learn that in order to properly participate in the community, they have to directly and indirectly interact with various automated systems, such as by tagging requests with machine-readable codes so that they are properly circulated to others in the community. And in terms of newcomer socialization, it probably isn't wise to tell someone how to properly use these machine-readable templates and then just send them to the code repository for the bot that parses those templates, expecting that to help with the task at hand.
It certainly makes sense that newcomers to a place like Wikipedia have to learn its organizational culture to fully participate. I'm not arguing that these barriers to entry are inherently bad and should be dismantled as a matter of principle. Over time, Wikipedians have developed a specific organizational culture through various norms, jargon, rules, processes, standards, communication platforms beyond the wiki, routinized co-located events, as well as bots, semi-automated tools, browser extensions, dashboards, scripted templates, and code directly built into the platform. This is a serious accomplishment and it is a crucial part of the story about how Wikipedia became one of the most widely consulted sources of knowledge today, rather than the frequently-ridiculed curiosity I remember it being in the early 2000s. And it is an even greater accomplishment that virtually all of this work is done in ways that are, in principle, accessible to the general public.
But what does that openness of code and development mean in practice? Who can meaningfully make use of what, even to a long-time Wikipedian like me, often feels like an overwhelming amount of openness? My argument isn't that open source, open code, and open process somehow don't make a difference. They clearly do, in many different ways, but Wikipedia shows us that we should be asking: when, where, and for whom does openness make more or less of a difference? Openness is not equally distributed, because taking advantage of it requires certain kinds of work, expertise, self-efficacy, time, and autonomy, as Nate Tkacz has noted about Wikipedia in general. For example, I reference Eszter Hargittai's work on digital divides, in which she argues that just giving people access to the Internet isn't enough; we also have to teach people how to use and take advantage of the Internet, and these "second-level digital divides" are often where demographic gaps widen even more.
There is also an analogy here with Jo Freeman's famous piece The Tyranny of Structurelessness, in which she argues that documented, formalized rules and structures can be far more inclusive than informal, unwritten rules and structures. Newcomers can more easily learn what is openly documented and formalized, while it is often only possible to learn the informal, unwritten rules and structures by either having a connection to an insider or accidentally breaking them and being sanctioned. But there is also a problem with the other extreme, when the rules and structures grow so large and complex that they become a bureaucratic labyrinth that is just as hard for the newcomer to learn and navigate.
So for veteran Wikipedians, highly-automated workflows like speedy deletion can be a powerful way to navigate and act within Wikipedia at scale, in much the same way that Wikipedia's dozens of policies make it easy for veterans to speak volumes just by saying that an article is a CSD#A7, for example. For its intended users, it sinks into the background and becomes second nature, like all good infrastructure does. The veteran can also foreground the infrastructure and participate in complex conversations and collective decisions about how these tools should change based on various ideas about how Wikipedia should change – as Wikipedians frequently do. But for the newcomer, the exact same system – which is in principle almost completely open and contestable to anyone who opens up a ticket on Phabricator – can look and feel quite different. And just knowing "how to code" in the abstract isn't enough, as newcomers must learn how code operates in Wikipedia's unique organizational culture, which differs in many ways from other large-scale open source software projects.
So this article might seem on the surface to be a critique of Wikipedia, but it is more a critique of my wonderful, brilliant, dedicated colleagues who are doing important work to try and open up (or at least look inside) the proprietary algorithmic systems that are playing important roles in major platforms and institutions. Make no mistake: despite my critiques of the information theory metaphor of the black box, their work within this paradigm is crucial, because there can be many serious biases and inequalities that are intentionally or unintentionally embedded in and/or reinforced through such systems.
However, we must also do research in the tradition of the interpretive social sciences to understand the broader cultural dynamics around how people learn, navigate, and interpret algorithmic systems, alongside all of the other cultural phenomena that remain as "black boxed" as the norms, discourses, practices, procedures, and ideological principles present in all cultures. I'm not the first to raise these kinds of concerns, and I also want to highlight work like that of Motahhare Eslami et al (PDF1, PDF2) on people's various "folk theories" of opaque algorithmic systems in social media sites. The case of Wikipedia shows that when such systems are quite open, it is perhaps even more important to understand how these differences make a difference.
This track brings together Science and Technology Studies scholars who are investigating data-driven techniques in academic research and analytic industries. Computational methods with large datasets are becoming more common across disciplines in academia (including the social sciences) and analytic industries. However, the sprawling and ambiguous boundaries of "big data" make it difficult to research. The papers in this track investigate the relationship between theories, instruments, methods, practices, and infrastructures in data science research. How are such practices transforming the processes of knowledge creation and validation, as well as our understanding of empiricism and the scientific method?
Many of the papers in this track are case studies that focus on one particular fieldsite where data-intensive research is taking place. Other papers explore connections between emerging theory, machinery, methods, and practices. These papers examine a wide variety of data-collection instruments, software, inscription devices, packages, algorithms, disciplines, and institutions, and many focus on how a broad sociotechnical system is used to produce, analyze, share, and validate knowledge. In looking at the way these knowledge forms are objectified, classified, imagined, and contested, this track looks critically at the maturing practices of quantification and their historical, social, cultural, political, ideological, economic, scientific, and ecological impacts.
When we say “critical,” we are drawing on a long lineage from Immanuel Kant to critical theory, investigating the conditions in which thinking and reasoning takes place. To take a critical approach to a field like data science is not to universally disapprove or reject it; it is more about looking at a broad range of social factors and impacts in and around data science. The papers in this track ask questions such as: How are new practices and approaches changing the way science is done? What does the organization of “big science” look like in an era of “big data”? What are the historical antecedents of today’s cutting-edge technologies and practices? How are institutions like hospitals, governments, schools, and cultural industries using data-driven practices to change how they operate? How are labor and management practices changing as data-intensive research is increasingly a standard part of major organizations? What are the conditions in which people are able to sufficiently understand and contest someone else’s data analysis? What happens when data analysts and data scientists are put in the position of keeping their colleagues accountable to various metrics, discovering what music genres are ‘hot’, or evaluating the impacts of public policy proposals? And how ought we change our own concepts, theories, approaches, and methods in Science and Technology Studies given these changes we are witnessing?
Sat Sept 3rd, 09:00-10:30am; Room 116
Chairs: Charlotte Cabasse-Mazel and Stuart Geiger
Sat Sept 3rd, 11:00am-12:30pm; Room 116
Chair: Nick Seaver
Sat Sept 3rd, 14:00-15:30; Room 116
Chair: TBD
Sat Sept 3rd, 16:00-17:30; Room 116
Chairs: Stuart Geiger and Charlotte Cabasse-Mazel
Sat Sept 3rd, 09:00-10:30am; Room 116
Chairs: Charlotte Cabasse-Mazel and Stuart Geiger
Irene Pasquetto (UCLA) and Ashley E. Sands (UCLA)
Openness of publicly funded scientific data is enforced by policy, and its benefits are normally taken for granted: increasing scientific trustworthiness, enabling replication and reproducibility, and preventing duplication of efforts.
However, when public data are made open, a series of social costs arise. In some fields, such as biomedicine, scientific data have great economic value, and new business models based on the reuse of public data are emerging. In this session we critically analyze the relationship between the potential benefits and social costs of opening scientific data, which translate into changes in the workforce and challenges for current science funding models. We conducted two case studies, one medium-scale collaboration in biomedicine (FaceBase II Consortium) and one large-scale collaboration in astronomy (Sloan Digital Sky Survey). We have conducted ethnographic participant observations and semi-structured interviews of SDSS since 2010 and FaceBase since 2015. Analyzing two domains sharpened our focus on each by enabling comparisons and contrasts. The discussion is also based on extensive document analysis.
Our goal is to unpack open data rhetoric by highlighting its relation to the emergence of new mixed private and public funding models for science and changes in workforce dynamics. We show (1) how open data are made open “in practice” and by whom; (2) how public data are reused in private industry; (3) who benefits from their reuse and how. This paper contributes to the Critical Data Studies field for its analysis of the connections between big data approaches to science, social power structures, and the policy rhetoric of open data.
Martina Merz (Alpen‐Adria‐University Klagenfurt / Wien / Graz)
Contemporary experimental particle physics is amongst the most data-intensive sciences and thus provides an interesting test case for critical data studies. Approximately 30 petabytes of data produced at CERN’s Large Hadron Collider (LHC) annually need to be controlled and processed in multiple ways before physicists are ready to claim novel results: data are filtered, stored, distributed, analyzed, reconstructed, synthesized, etc. involving collaborations of 3000 scientists and heavily distributed work. Adopting a science-as-practice approach, this paper focuses on the associated challenges of data analysis using as an example the recent Higgs search at the LHC, based on a long-term qualitative study. In particle physics, data analysis relies on statistical reasoning. Physicists thus use a variety of standard and advanced statistical tools and procedures.
I will emphasize that, and show how, the computational practice of data analysis is inextricably tied to the production and use of specific visual representations. These "statistical images" constitute "the Higgs" (or its absence) in the sense of making it "observable" and intelligible. The paper puts forward two main theses: (1) that images are constitutive of the prime analysis results due to the direct visual grasp of the data that they afford within large-scale collaborations and (2) that data analysis decisively relies on the computational and pictorial juxtaposition of "real" and "simulated data", based on multiple models of different kinds. In data-intensive sciences such as particle physics, images thus become essential sites for evidential exploration and debate through procedures of black-boxing, synthesis, and contrasting.
Samir Passi (Cornell University)
This paper conceptualizes data analytics as a situated process: one that necessitates iterative decisions to adapt prior knowledge, code, contingent data, and algorithmic output to each other. Learning to master such forms of iteration, adaptation, and discretion, then, is an integral part of being a data analyst.
In this paper, I focus on the pedagogy of data analytics to demonstrate how students learn to make sense of algorithmic output in relation to underlying data and algorithmic code. While data analysis is often understood as the work of mechanized tools, I focus instead on the discretionary human work required to organize and interpret the world algorithmically, explicitly drawing out the relation between human and machine understandings of numbers, especially in the ways in which this relationship is enacted through class exercises, examples, and demonstrations. In a learning environment, there is an explicit focus on demonstrating established methods, tools, and theories to students. Focusing on data analytic pedagogy, then, helps us not only to better understand foundational data analytic practices, but also to explore how and why certain forms of standardized data sensemaking processes come to be.
To make my argument, I draw on two sets of empirics: participant-observation of (a) two semester-long senior/graduate-level data analytic courses, and (b) a series of three data analytic training workshops taught/organized at a major U.S. East Coast university. Conceptually, this paper draws on research in STS on social studies of algorithms, sociology of scientific knowledge, sociology of numbers, and professional vision.
Michael Castelle (University of Chicago)
Presently existing theorizations of "big data" practices conflate observed aspects of both "volume" and "velocity" (Kitchin 2014). The practical management of these two qualities, however, has a comparably disjunct, if interwoven, computational history: on one side, the use of large (relational and non-relational) database systems, and on the other, the handling of real-time flows (the world of dataflow languages, stream and event processing, and message queues). While the commercial data practices of the late 20th century were predicated on an assumption of comparably static archival (the site-specific "mining" of data "warehouses"), much of the novelty and value of contemporary "big data" sociotechnics is in fact predicated on harnessing and processing the vast flows of events generated by the conceptually-centralized / physically-distributed datastores of Google, Facebook, LinkedIn, etc.
These latter processes—which I refer to as “big codata”—have their origins in IBM’s mainframe updating of teletype message switching, were adapted for Wall Street trading firms in the 1980s, and have a contemporary manifestation in distributed “streaming” databases and message queues like Kafka and StormMQ, in which one differentially “subscribes” to brokered event streams for real-time visualization and analysis. Through ethnographic interviews with data science practitioners in various commercial startup and academic environments, I will contrast these technologies and techniques with those of traditional social-scientific methods—which may begin with empirically observed and transcribed “codata”, but typically subject the resultant inert “dataset” to a far less real-time sequence of material and textual transformations (Latour 1987).
Sat Sept 3rd, 11:00am-12:30pm; Room 116
Chair: Nick Seaver
Brittany Fiore‐Gartland (University of Washington) and Anissa Tanweer (University of Washington)
Reproducibility has long been considered integral to scientific research and increasingly must be adapted to highly computational, data-intensive practices. Central to reproducibility is the sharing of data across varied settings. Many scholars note that reproducible research necessitates thorough documentation and communication of the context in which scientific data and code are generated and transformed. Yet there has been some pushback against the generic use of the term context (Nicolini, 2012); for, as Seaver puts it, “the nice thing about context is everyone has it” (2015). Dourish (2004) articulates two approaches to context: representational and interactional. The representational perspective sees context as stable, delineable information; in terms of reproducibility, this is the sort of context that can be captured and communicated with metadata, such as location, time, and size. An interactional perspective, on the other hand, views context not as static information but as a relational and dynamic property arising from activity; something that is much harder to capture and convey using metadata or any other technological fix.
In two years of ethnographic research with scientists negotiating reproducibility in their own data-intensive work, we found “context” being marshalled in multiple ways to mean different things within scientific practice and discourses of reproducibility advocates. Finding gaps in perspectives on context across stakeholders, we reframe reproducibility as a scientific communication problem, a move that recognizes the limits of representational context for the purpose of reproducible research and underscores the importance of developing cultures and practices for conveying interactional context.
Kathleen Pine (ASU)
This paper examines the implementation and consequences of data science in a specific domain: evaluation and regulation of healthcare delivery. Recent iterations of data-driven management expand the dimensions along which organizations are evaluated and utilize a growing array of non-financial measures to audit performance (i.e. adherence to best practices). Abstract values such as "quality" and "effectiveness" are operationalized through the design and implementation of certain performance measurements—it is not just the outcomes that demonstrate the quality of service provision, but also the particular practices engaged in during service delivery.
Recent years have seen the growth of a controversial new form of data-driven accountability in healthcare: application of performance measurements to the work of individual clinicians. Fine-grained performance measurements of individual providers were once far too resource intensive to undertake, but expanded digital capacities have made provider-level analyses feasible. Such measurements are being deployed as part of larger efforts to move from "volume-based" to "value-based" or "pay for performance" payment models.
Evaluating individual providers, and deploying pay for performance at the individual (rather than the organizational) level is a controversial idea. Critics argue that the measurements reflect a tiny sliver of any clinician’s “quality,” and that such algorithmic management schemes will lead professionals to focus on only a small number of measured activities. Despite these and other concerns, such measurements are on the horizon. I will discuss early ethnographic findings on implementation of provider-level cesarean section measurements, describing tensions between professional discretion and accountability and rising stakes of data quality in healthcare.
Daan Kolkman (University of Surrey)
The rapid development and dissemination of data science methods, tools, and libraries allows for the development of ever more intricate models and algorithms. Such digital objects are simultaneously the vehicle and outcome of quantification practices and may embody a particular world-view with associated norms and values. More often than not, a set of specific technical skills is required to create, use, or interpret these digital objects. As a result, the mechanics of the model or algorithm may be virtually incomprehensible to non-experts.
This is of consequence for the process of knowledge creation because it may introduce power asymmetries and because successful implementation of models and algorithms in an organizational context requires that all those involved have faith in the model or algorithm. This paper contributes to the sociology of quantification by exploring the practices through which non-experts ascertain the quality and credibility of digital objects as myths or fictions. By considering digital objects as myths or fictions, the codified nature of these objects comes into focus.
This permits the illustration of the practices through which experts and non-experts develop, maintain, question or contest such myths. The paper draws on fieldwork conducted in government and analytic industry in the form of interviews, observations and documents to illustrate and contrast the practices which are available to non-experts and experts in bringing about the credibility or incredibility of such myths or fictions. It presents a detailed account of how digital objects become embedded in the organisations that use them.
Howard Rosenbaum (Indiana University)
There are no big data without algorithms. Algorithms are sociotechnical constructions and reflect the social, cultural, technical, and other values embedded in their contexts of design, development, and use. The utopian "mythology" (boyd and Crawford 2011) about big data rests, in part, on the depiction of algorithms as objective and unbiased tools operating quietly in the background. As reliable technical participants in the routines of life, their impartiality provides legitimacy for the results of their work. This becomes more significant as algorithms become more deeply entangled in our online and offline lives, where we generate the data they analyze. They create "algorithmic identities," profiles of us based on our digital traces that are "shadow bodies," emphasizing some aspects and ignoring others (Gillespie 2012). They are powerful tools that use these identities to dynamically shape the information flows on which we depend, in response both to our actions and to decisions made by their owners.
Because this perspective tends to dominate the discourse about big data, thereby shaping public and scientific understandings of the phenomenon, it is necessary to subject it to critical review as an instance of critical data studies. This paper interrogates algorithms as human constructions and products of choices that have a range of consequences for their users and owners. Issues explored include: the epistemological implications of big data algorithms; the impacts of these algorithms on our social and organizational lives; the extent to which they encode power and the ways in which this power is exercised; and the possibility of algorithmic accountability.
Sat Sept 3rd, 14:00-15:30; Room 116
Chair: TBD
Klara Benda (IT University of Copenhagen)
The Digital methods approach seeks the strategic appropriation of digital resources on the web for social research. I apply grounded theory to theorize how data practices in Digital methods are entangled with the web as a socio-technical phenomenon. My account draws on public sources of Digital methods and ethnographic research of semester-long student projects based on observations, interviews, and project reports. It is inspired by Hutchins's call for understanding how people "create their cognitive powers by creating the environments in which they exercise those powers". The analysis draws on the lens of infrastructuring to show that making environments for creativity in Digital methods is a distributed process, which takes place on local and community levels with distinct temporalities. Digital methods is predicated on creating its local knowledge space for social analysis by pulling together digital data and tools from the web, and this quick local infrastructuring is supported by layers of slower community infrastructures, which mediate the digital resources of the web for a Digital methods style of analysis by means of translation and curation.
Overall, the socially distributed, infrastructural style of data practice is made possible by the web as a socio-technical phenomenon predicated on openness, sharing and reuse. On the web, new digital resources are readily available to be incorporated into the local knowledge space, making way for an iterative, exploratory style of analysis, which oscillates between infrastructuring and inhabiting a local knowledge space. The web also serves as a socio-technical platform for community practices of infrastructuring.
Charlotte Mazel‐Cabasse (University of California, Berkeley)
Scientific computing, or e-science, has enabled the development of large data-driven scientific initiatives. A significant part of these projects relies on the software infrastructures and tool stacks that make it possible to collect, clean, and compute over very large data sets.
Based on anthropological research among a community of open-source developers and/or scientists contributing to SciPy, the open source Python library used by scientists to enable the development of technologies for big data, the research focuses on the socio-technical conditions of the development of free and reproducible computational scientific tools and the system of values that supports them.
Entering the SciPy community for the first time is entering a community of learners: people who are convinced that for each problem there is a function (and if there is not, one should actually create one), who think that everybody can (and probably should) code, and who have been living between at least two worlds (sometimes more) for a long time: academia and the open software community, and, for some, different versions of the corporate world.
Looking at the personal trajectories of these scientists turned open-source software developers, this paper will investigate the way in which a relatively small group of dedicated people has been advancing a new agenda for science, defined as open and reproducible, through carefully designed data infrastructures, workflows, and pipelines.
Jeremy Knox (The University of Edinburgh)
Education has become an important site for computational data analysis, and the burgeoning field of 'learning analytics' is gaining significant traction, motivated by the proliferation of online courses and large enrolment numbers. However, while this 'big data' and its analysis continue to be hyped across academic, government, and corporate research agendas, critical and interdisciplinary approaches to educational data analysis are in short supply. Driven by narrow disciplinary areas in computer science, learning analytics is not only 'blackboxed' – in other words, there is a propensity to 'focus only on its inputs and outputs and not on its internal complexity' (Latour 1999, p. 304) – but also abstracted and distanced from the activities of education itself. This methodological estrangement may be particularly problematic in an educational context where the fostering of critical awareness is valued.
The first half of this paper will describe three ways in which we can understand this ‘distancing’, and how it is implicated in enactments of power within the material conditions of education: the institutional surveilling of student activity; the mythologizing of empirical objectivity; and the privileging of prediction. The second half of the paper will describe the development of a small scale and experimental learning analytics project undertaken at the University of Edinburgh that sought to explore some of these issues. Entitled the Learning Analytics Report Card (LARC), the project investigated playful ways of offering student choice in the analytics process, and the fostering of critical awareness of issues related to data analysis in education.
Cathryn Carson (University of California, Berkeley)
Inside universities, data science is practically co-located with science studies. How can we use that proximity to shape how data science gets done? Drawing on theorizations of collaboration as a research strategy, embedded ethnography, critical technical practice, and design intervention, this paper reports on experiments in data science research and organizational/strategic design. It presents intellectual tools for working on data science (conceptual distinctions such as data science as specialty, platform, and surround; temporal narratives that capture practitioners’ conjoint sense of prospect and dread) and explores modes of using these tools in ways that get uptake and do work. Finally, it draws out possible consequences of the by now sometimes well-anchored situation of science studies/STS inside universities, including having science studies scholars in positions of institutional leverage.
Sat Sept 3rd, 16:00-17:30; Room 116
Chairs: Stuart Geiger and Charlotte Cabasse-Mazel
Yanni Loukissas (Georgia Tech); Matt Ratto (University of Toronto); Gabby Resch (University of Toronto)
Big Data has been described as a death knell for the scientific method (Anderson, 2008), a catalyst for new epistemologies (Floridi, 2012), a harbinger for the death of politics (Morozov, 2014), and "a disruptor that waits for no one" (Maycotte, 2014). Contending with Big Data, as well as the platitudes that surround it, necessitates a new kind of data literacy. Current pedagogical models, exemplified by data science and data visualization, too often introduce students to data through sanitized examples, black-boxed algorithms, and standardized templates for graphical display (Tufte, 2001; Fry, 2008; Heer, 2011). Meanwhile, these models overlook the social and political implications of data in areas like healthcare, journalism, and city governance. Scholarship in critical data studies (boyd and Crawford, 2012; Dalton and Thatcher, 2014) and critical visualization (Hall, 2008; Drucker 2011) has established the necessary foundations for an alternative to purely technical approaches to data literacy.
In this paper, we explain a pedagogical model grounded in interpretive learning experiences: collecting data from messy sources, processing data with an eye towards what algorithms occlude, and presenting data through creative forms like narrative and sculpture. Building on earlier work by the authors in the area of ‘critical making’ (Ratto), this approach—which we call critical information practice—offers a counterpoint for students seeking reflexive and materially-engaged modes of learning about the phenomenon of Big Data.
Tommaso Venturini (King’s College); Anders Kristian Munk (University of Aalborg); Mathieu Jacomy (Sciences Po)
In the last few decades, the idea of 'network' has slowly but steadily colonized broad strands of STS research. This colonization started with the advent of actor-network theory, which provided a convenient set of notions to describe the construction of socio-technical phenomena. Then came network analysis, and scholars who imported into STS the techniques of investigation and visualization developed in the tradition of social network analysis and scientometrics. Finally, with the increasing 'computerization' of STS, scholars turned their attention to digital networks as a way of tracing collective life.
Many researchers have more or less explicitly tried to link these three movements in one coherent set of digital methods for STS, betting on the idea that actor-network theory can be operationalized through network analysis thanks to the data provided by digital networks. Yet, to be honest, little proves the continuity among these three objects besides the homonymy of the word ‘network’. Are we sure that we are talking about the same networks?
Nicholas Seaver (Tufts University)
Data scientists summon space into existence. Through gestures in the air, visualizations on screen, and loops in code, they locate data in spaces amenable to navigation. Typically, these spaces embody a Euro-American common sense: things near each other are similar to each other. This principle is evident in the work of algorithmic recommendation, for instance, where users are imagined to navigate a landscape composed of items arranged by similarity. If you like this hill, you might like the adjacent valley. Yet the topographies conceived by data scientists also pose challenges to this spatial common sense. They are constantly reconfigured by new data and the whims of their minders, subject to dramatic tectonic shifts, and they can be more than 3-dimensional. In highly dimensional spaces, data scientists encounter the “curse of dimensionality,” by which human intuitions about distance fail as dimensions accumulate. Work in critical data studies has conventionally focused on the biases that shape these spaces.
In this paper, I propose that critical data studies should not only attend to how representative data spaces are, but also to the techniques data scientists use to navigate them. Drawing on fieldwork with the developers of algorithmic music recommender systems, I describe a set of navigational practices that negotiate with the shifting, biased topographies of data space. Recalling a classic archetype from STS and anthropology, these practices complicate the image of the data scientist as rationalizing, European map-maker, resembling more closely the situated interactions of the ideal-typical Micronesian navigator.
When I was an M.A. student back in 2009, I was trying to explain various things about how Wikipedia worked to my then-advisor David Ribes. I had been ethnographically studying the cultures of collaboration in the encyclopedia project, and I had gotten to the point where I could look through the metadata documenting changes to Wikipedia and know quite a bit about the context of whatever activity was taking place. I was able to do this because Wikipedians do this: they leave publicly accessible trace data in particular ways, in order to make their actions and intentions visible to other Wikipedians. However, this was practically illegible to David, who had not done this kind of participant-observation in Wikipedia and had therefore not gained this kind of socio-technical competency.
For example, if I added {{db-a7}} to the top of an article, a big red notice would be automatically added to the page, saying that the page has been nominated for "speedy deletion." Tagging the article in this way would also put it into various information flows where Wikipedia administrators would review it. If any of Wikipedia's administrators agreed that the article met speedy deletion criteria A7, then they would be empowered to unilaterally delete it without further discussion. If I was not the article's creator, I could remove the trace from the article to take it out of the speedy deletion process, which means the person who nominated it for deletion would have to go through the standard deletion process. However, if I was the article's creator, it would not be proper for me to remove that tag — and if I did, others would find out and put it back. If someone added the {{db-a7}} trace to an article I created, I could add {{hangon}} below it in order to inhibit this process a bit — although a hangon is just a request, it does not prevent an administrator from deleting the article.
I knew all of this both because Wikipedians told me and because this was something I experienced again and again as a participant observer. Wikipedians had documented this documentary practice in many different places on Wikipedia's meta pages. I had first-hand experience with these trace data, first on the receiving end with one of my own articles. Then later, I became someone who nominated others' articles for deletion. When I was learning how to participate in the project as a Wikipedian (which I now consider myself to be), I started to use these kinds of trace data practices and conventions to signify my own actions and intentions to others. This made things far easier for me as a Wikipedian, in the same way that learning my university's arcane budgeting and human resource codes helps me navigate that bureaucracy far more easily.
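All of these traces are publicly readable in each article's revision history, and they can also be pulled programmatically. The snippet below is only an illustrative sketch, assuming the standard MediaWiki query API; the article title is a placeholder.

```python
# Minimal sketch: read the trace data (timestamps, usernames, edit summaries)
# left in an article's revision history via the MediaWiki query API.
import json
import urllib.parse
import urllib.request

TITLE = "Example"  # hypothetical article title

params = urllib.parse.urlencode({
    "action": "query",
    "prop": "revisions",
    "titles": TITLE,
    "rvprop": "timestamp|user|comment",
    "rvlimit": 10,
    "format": "json",
})
request = urllib.request.Request(
    f"https://en.wikipedia.org/w/api.php?{params}",
    headers={"User-Agent": "trace-reading-example/0.1"},
)

with urllib.request.urlopen(request) as response:
    data = json.load(response)

# Edit summaries often carry the conventions and machine-readable markers
# discussed above, which is what makes them legible to other Wikipedians.
for page in data["query"]["pages"].values():
    for rev in page.get("revisions", []):
        print(rev["timestamp"], rev["user"], rev.get("comment", ""))
```

Of course, pulling these records is the easy part; knowing what a given summary or template signifies is exactly the literacy I had to learn as a participant observer.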
This "trace ethnography" emerged out of a realization that people in mediated communities and organizations increasingly rely on these kinds of techniques to render their own activities and intentions legible to each other. I should note that this was not my and David's original insight — it is one that can be found across the fields of history, communication studies, micro-sociology, ethnomethodology, organizational studies, science and technology studies, computer-supported cooperative work, and more. As we say in the paper, we merely "assemble their various solutions" to the problem of how to qualitatively study interaction at scale and at a distance. There are jargons, conventions, and grammars learned as a condition of membership in any group, and people learn how to interact with others by learning these techniques.
The affordances of mediated platforms are increasingly being used by participants themselves to manage collaboration and context at massive scales and asynchronous latencies. Part of the trace ethnography approach involves coming to understand why these kinds of systems were developed in the way that they were. For me and Wikipedia’s deletion process, it went from being strange and obtuse to something that I expected and anticipated. I got frustrated when newcomers didn’t have the proper literacy to communicate their intentions in a way that I and other Wikipedians would understand. I am now at the point where I can even morally defend this trace-based process as Wikipedians do. I can list reason after reason why this particular process ought to unfold in the way that it does, independent of my own views on this process. I understand the values that are embedded in and assumed by this process, and they cohere with other values I have found among Wikipedians. And I’ve also met Wikipedians who are massive critics of this process and think that we should be using a far different way to deal with inappropriate articles. I’ve even helped redesign it a bit.
Trace ethnography is based in the realization that these practices around metadata are learned literacies and constitute a crucial part of what it means to participate in many communities and organizations. It turns our attention to an ethnographic understanding of these practices as they make sense for the people who rely on them. In this approach, reading through log data can be seen as a form of participation, not just observation — if and only if this is how members themselves spend their time. However, it is crucial that this approach is distinguished from more passive forms of ethnography (such as “lurker ethnography”), as trace ethnography involves an ethnographer’s socialization into a group prior to the ability to decode and interpret trace data. If trace data is simply being automatically generated without it being integrated into people’s practices of participation, if people in a community don’t regularly rely on following traces in their everyday practices, then the “ethnography” label is likely not appropriate.
Of all kinds of online communities and mediated organizations, Wikipedia's deletion process might appear to be among the most arcane and out of the ordinary. However, modes of participation are increasingly linked to the encoding and decoding of trace data, whether in a global scientific collaboration, an open source software project, a guild of role-playing gamers, an activist network, a news organization, a governmental agency, and so on. Computer programmers frequently rely on GitHub to collaborate, and they have their own ways of using things like issues, commit comments, and pull requests to interact with each other. Without being on GitHub, it's hard for an ethnographer who studies software development to be a fully-immersed participant-observer, because they would be missing a substantial amount of activity — even if they are constantly in the same room as the programmers.
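Much of that GitHub activity is itself available as structured trace data. As an illustrative sketch only (the repository name is a placeholder), one might pull recent issue and pull request activity through the public GitHub REST API:

```python
# Minimal sketch: list recent issue and pull request activity in a repository,
# one slice of the trace data developers leave for each other on GitHub.
import json
import urllib.request

REPO = "octocat/Hello-World"  # hypothetical repository

request = urllib.request.Request(
    f"https://api.github.com/repos/{REPO}/issues?state=all&per_page=10",
    headers={
        "User-Agent": "trace-reading-example/0.1",
        "Accept": "application/vnd.github+json",
    },
)

with urllib.request.urlopen(request) as response:
    issues = json.load(response)

# Issues and pull requests share this endpoint; pull requests carry an
# extra "pull_request" key in their records.
for issue in issues:
    kind = "PR" if "pull_request" in issue else "issue"
    print(kind, issue["number"], issue["user"]["login"], issue["title"])
```

As with Wikipedia's revision histories, the records are easy to retrieve; interpreting what a particular issue label, review comment, or merge means to the people involved is where the ethnographic work begins.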
More about trace ethnography
If you want to read more about "trace ethnography," we first used this term in "The Work of Sustaining Order in Wikipedia: The Banning of a Vandal," which I co-authored with my then-advisor David Ribes in the proceedings of the CSCW 2010 conference. We then wrote a followup paper in the proceedings of HICSS 2011 to give a more general introduction to this method, in which we 'inverted' the CSCW 2010 paper, explaining more of the methods we used. We also held a workshop at the 2015 iConference with Amelia Acker and Matt Burton — the details of that workshop (and the collaborative notes) can be found at http://trace-ethnography.github.io.
Some examples of projects employing this method:
Ford, H. and Geiger, R.S. “Writing up rather than writing down: Becoming Wikipedia literate.” Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. ACM, 2012. http://www.stuartgeiger.com/writing-up-wikisym.pdf
Ribes, D., Jackson, S., Geiger, R.S., Burton, M., & Finholt, T. (2013). Artifacts that organize: Delegation in the distributed organization. Information and Organization, 23(1), 1-14. http://www.stuartgeiger.com/artifacts-that-organize.pdf
Mugar, G., Østerlund, C., Hassman, K. D., Crowston, K., & Jackson, C. B. (2014). Planet hunters and seafloor explorers: legitimate peripheral participation through practice proxies in online citizen science. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (pp. 109-119). ACM. http://dl.acm.org/citation.cfm?id=2531721
Howison, J., & Crowston, K. (2014). Collaboration Through Open Superposition: A Theory of the Open Source Way. MIS Quarterly, 38(1), 29-50. http://aisel.aisnet.org/cgi/viewcontent.cgi?article=3156&context=misq
Burton, M. (2015). Blogs as Infrastructure for Scholarly Communication. Doctoral Dissertation, University of Michigan. http://deepblue.lib.umich.edu/bitstream/handle/2027.42/111592/mcburton_1.pdf