Thinking Through the Strava Data

Lots of people are talking about the announcement that the Oregon Department of Transportation (ODOT) is using a Strava dataset to conduct a research study (click here for the article). As with everything in this world, people have a range of views on this project.

I am a researcher. Spending an afternoon analyzing data sounds fun to me. I read about methodology for kicks. I love long conversations (okay…debates) about epistemology and philosophy.

20140502-153959.jpg

My weekend plans? Oh, you know, just catching up on some feminist methodology before I dig into one of my Stephen King books.

Why? 

The first question that should ever be asked about a research project is Why. Why do it, what’s the point? Unfortunately, too many researchers get confused by this simple question. I’ve had a lot of experiences that go something like this:

research.001

Usually, it’s because they’re doing what’s called “Basic Research” — studying something just to learn something. 

Personally, I am not even interested in doing a research project unless it will address social inequity.

With such limited resources to go around, it seems both wasteful and harmful for any social science project to be about anything besides addressing inequity. So, that’s an important question to ask about this study. 

The main justification for the study is this:

The problem for many transportation agencies today is that, while bicycling is on the rise (for both transportation and recreation), there remains a major lack of data. This gap in data makes it much harder to justify bicycle investments, plan for future bicycle traffic growth, illustrate the benefits of bike infrastructure investments, and so on. It also makes non-auto use of roads very easy for agencies to overlook. And while ODOT and many cities do bike counts already, they only measure one location for a short period of time. Most importantly, current bicycle count methods don’t provide any context about how people actually ride. It’s this element of “bicycle travel behavior” that ODOT is most excited about.  (emphasis added)

It sounds like the purpose of this study is to make data-informed decisions that will increase cycling and improve the infrastructure for transportation and recreational cycling.  Cool. 

Who is Represented? 

Most of the buzz is about how Strava users don’t represent non-Strava users well. This makes sense, because we all use the road in different ways and choose different routes for varying reasons. 

research.002

Not to mention that this can even change day-to-day for the same person.

Nonrepresentative samples are nothing new. Ideally, research projects use a representative sample, but you might be surprised to know that most do not. But that doesn’t mean the study is necessarily worthless. It just means that it’s critically important to be transparent about this, justify decisions and choices about sampling, and use the results responsibly.

Unfortunately, it is too common for researchers to state the sample limitation and then move on with the data as though the limitation didn’t really exist. Worse, one of the main justifications is: the sample was convenient. The data were there or were easy to get. That’s what this Strava dataset is–it’s about convenience. Not good enough.

That’s like buying customer data from Whole Foods and using it to understand the grocery shopping behaviors of everyone in the city, all because the data were there. 

Researchers can use a nonrepresentative convenience sample and still get useful data that can address inequity. Sometimes, they learn something about the people who aren’t represented in the study. For example, are the people being excluded more likely to have lower incomes, be people of color, or be women?

On the one hand, now you know and can understand the results within that limited, specific context. On the other hand…maybe that means you should question your methodology.

But I digress. 

So it looks like they kind of tried to do that, checking how well their sample represents commuters:

The Strava Workgroup has done some analysis of trips using Portland’s Hawthorne Bridge bike counter. When they compared those numbers with Strava data of the same day and time, they found 2.5% of the trips were made by Strava users. Given that the Hawthorne Bridge is primarily a route for bicycle commuters, Bradway feels it offers a conservative sample size. “In other areas, like Skyline or Rock Creek Road [both of which are popular training routes], it would be much higher.”

Strava users made only 2.5% of the trips on a primarily commuter route!? That’s a glaring red flag that Strava data could have some major flaws when trying to apply it to people who commute by bike. I really hope that the investigators are doing additional work on this.

After all, the stated purpose of this study was to inform policy and infrastructure. So, no, the limitation is likely not that it’s a “small sample size.”  The limitation is that it’s probably an inappropriate sample to address the project goals. 
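
To make that distinction concrete, here is a minimal sketch in Python. Every number in it is invented for illustration (the group shares, the app adoption rates, and the route preferences are assumptions, not figures from ODOT or Strava). It compares the share of trips on a hypothetical commuter route among all riders with the share you would see if you only looked at app users.

```python
# All numbers below are made up for illustration; none come from ODOT or Strava.
GROUPS = {
    # group: (share of all riders,
    #         share of that group using the app,
    #         share of that group's trips on the commuter route)
    "commuters": (0.80, 0.01, 0.70),
    "training riders": (0.20, 0.30, 0.10),
}


def commuter_route_share(groups, app_users_only):
    """Share of trips on the commuter route, for all riders or app users only."""
    total = 0.0
    on_route = 0.0
    for population_share, app_share, route_share in groups.values():
        # Weight each group by how many of its riders end up in the data.
        weight = population_share * (app_share if app_users_only else 1.0)
        total += weight
        on_route += weight * route_share
    return on_route / total


true_share = commuter_route_share(GROUPS, app_users_only=False)
app_share = commuter_route_share(GROUPS, app_users_only=True)

print(f"All riders: {true_share:.0%} of trips on the commuter route")
print(f"App users:  {app_share:.0%} of trips on the commuter route")
```

With these made-up inputs, all riders put 58% of their trips on the commuter route, but the app users put only about 17% there. Collecting more app data would not close that gap, because the gap comes from who is in the sample, not from how many people are in it.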

research.001

How Research is Used 

But even with all that said, it didn’t have to be so bad. Every project has to start somewhere, and pilots have many limitations. $20,000.00 is nothing for a sample size that large (compare it with paying each participant $25.00 for their participation), and this could be a great pilot to test how to go about studying cyclists’ behavior using GPS, both in terms of its strengths and limitations as an approach.
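
For a rough sense of scale, the arithmetic in that parenthetical can be written out. This sketch uses only the two figures above ($20,000.00 and $25.00 per participant); the variable names are my own, and it ignores every other cost a real study would have.

```python
# Figures from the post: a $20,000.00 budget and a $25.00 payment per participant.
# Working in cents keeps the arithmetic in whole numbers.
BUDGET_CENTS = 20_000 * 100
PAYMENT_CENTS = 25 * 100

# How many participant payments the same budget would cover.
participants_funded = BUDGET_CENTS // PAYMENT_CENTS
print(participants_funded)  # 800
```

So the same money would cover 800 participant payments in a conventional study, before recruitment, equipment, or staff time.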

But, the problem is that this (most likely) systematically flawed sample is already being applied to real-life important changes:

The third, and most interesting task for ODOT’s Strava Workgroup is to explore pilot projects where the data can inform policy and project decisions. And Bradway says, that work has already begun.

So far, based on the Strava data they have changed where they do in-person bicycle counts and where to install rumble strips on the highway. 

This is a problem.

First, the counter locations have moved so that they can track more Strava cyclists. In this world, you only matter if you’re “counted” by the policy makers. If they are now even more likely to count Strava users and maybe not commuters, how will this improve infrastructure for everyday commuters?

Second, they are adding safety features to the roads where Strava users go. Are all or most of the infrastructure improvements going to privilege Strava users over everyday commuters? 

Inequity in, and as a result of, research 

Research and science have a horrible reputation for favoring the privileged, and this is still a major problem today.   It’s usually the privileged groups that get to do the research, that get heard in the research, and benefit from the research. So far, my discomfort with this project (as currently described) is that it doesn’t seem like it will do much to change this historical problem. 

If the researchers aren’t careful and aren’t paying special attention to equity, then they might just end up using all that “big data” to make improvements for a small subgroup of cyclists, rather than the cyclist community as a whole.

research.001

And For Next Time?

This is why I am a fan of community-based participatory research. I think it’s better to work with communities and have them lead research projects from the start, rather than the other way around. Otherwise, people do projects that they find geeky-cool but that, at best, don’t address any real concerns that most people have or, at worst, just reinforce the inequity problems we already have without anyone even realizing it.

{{{ hopefully the researchers prove me wrong, eh? }}}


 

 


35 thoughts on “Thinking Through the Strava Data”

  1. Thanks so much for posting this! I’m getting ready to prepare a presentation soon on how tools such as Strava affect different riding practices and the cultural habits of bicyclists, and this will be a great example to integrate. It will mostly be based on my research on mountain biking, but I wanted to have some clear tie-ins with everyday cycling as well. Part of what I want to move towards is a historical and contemporary understanding of the connections and disconnects between cycling for sport/recreation and transportation cycling. I want to work toward breaking down this division (or at least looking at it more complexly), while also attending to the inequalities across riding practices. I’ll definitely give you a shout-out in the talk!


    • Hi Sarah,

      So glad you liked it! And that’s really interesting — looking at how technology changes cycling. Very cool question, I think! And I’m totally with you–I bike to commute, for exercise, and for relaxation/to connect with nature. I am a strong proponent for multiple types of cycling. As much as I talk about these “differences”, I’m really an advocate for blurring the lines and crossing boundaries! But, just like you said — always addressing the inequity. That takes center stage for me!

      best of luck on your talk! :D


  2. Wouldn’t it be interesting if some national cycling organization conducted an on-going survey of household bicycle use? Then you could collect demographic data on cyclists and non-cyclists, usage data, and attitudinal data. Then they could release the dataset and let geeks loose on it.


    • Hi root chopper, thanks for the comment!
      Apparently the American Community Survey does collect some (or all) of that data, so I think there is some of the information you talked about. And, what’s nice is they use random sampling and collect other information so there aren’t the same sampling issues as with this study. And it’s available online, too. I have it on my to-do list to check out once I graduate, too! :P Data geek, is me!


  3. There is no dispute that Strava provides only a sample of all cyclists in Oregon. Regardless, this is believed to be the first data set where planners can see actual routes on streets. Previously data sets provided single points along a route. Knowing the route taken is a significant benefit to ODOT.

    Cycling citizens of Oregon can improve the data set and be counted simply by downloading the Strava app to their cell phone and using it when they ride. Of course even this will not provide a complete data set since some percentage of riders don’t own or don’t ride with cell phones.

    Lastly, as eloquently stated, “Researchers can use a nonrepresentative convenience sample and still get useful data that can address inequity.” That is exactly what the ODOT Strava data is. ODOT has now indicated they are using Strava data for bicycling infrastructure improvements and that citizens can be counted simply by using Strava.

    Keep on cycling…


    • Hi cycling bob, this is going to sound a bit snarky, but based on your comment it appears you either didn’t read or fully understand my post. So, I’m not really sure how to respond or what to say.

      For one, I’m not critiquing the fact that it’s a sample. That much should be painstakingly clear both from the text and the graphics I created. Do you know of research that doesn’t use a sample? I sure don’t.

      Also, doesn’t that article link to a study by Dr. Dill where she used GPS to track routes of cyclists? So, then this isn’t the “first data set.” And there are other ways to get route data from a better sample. See the study by Dr. Dill.

      The rest of your comment is embedded with tons of assumptions, especially regarding class privilege. How many “cycling citizens of Oregon” have a cell phone that can do what you suggested? The ones with the most money, most likely. So you still have the same problems with the sampling technique.

      Not sure why you referenced “complete data” again. Again, my post should make it clear that I’m not looking for “complete data” or data from the entire population. Not only is that unrealistic, but it’s also unnecessary. A representative sample would do just fine.

      The last part of your comment is the biggest signal to me that you either just skimmed my post or didn’t understand it (or you don’t understand what equity is). So I’m not even going to respond to that. My blog isn’t the place to explain that 101 kind of stuff. This might sound mean, but really it’s just that I don’t have the time or energy, especially when other people have written about this much more in depth. The information about equity is out there if you search for it.


      • Echo,

        You are right, it does sound snarky, but you use snarky to attack an opinion you disagree with. Why not call out everyone who commented on your post by making assumptions, for example Jean who assumes every Strava user is a fitness junkie who uses a heart rate monitor. I am not a fitness junkie and don’t own an HRM.

        If you are going to post for public view and comment then you should be open to public discourse that just might include a viewpoint different than your own. You can call it snark but it is clearly an attack on an opinion you disagree with. See also your response to EnergyAnd Infrastructure.

        Please consider my opinion as different from yours, but still valid.

        Cycling Bob


      • Just because something is an opinion does not actually make it valid. I disagree with that, and I actually think the main reason we have so many problems in society is that too many people accept the opinions of others just because they are opinions. More is required than just the thoughts and feelings from inside your head.

        Even so, you’re wrong to say I’m just disagreeing with your opinion. My main point is that you either didn’t read my post or didn’t understand it. My point is: do those first and THEN post a reasonable, information-based comment. THEN, we can have a discussion.

        Also, again, you’re not reading. Further proves my point. I don’t like to say the same thing OVER and OVER again. Maybe take a look at all my comments and see how I talked about my partner and how he hates Strava? So yea, I already did what you claim I didn’t do.

        My post to Energyandinfrastructure (someone I’m friends with on twitter, by the way) is a question. I didn’t understand his post so I’m asking for clarification.


  4. Yes! Great post. From the first line, my first – and ongoing – thought was, but I don’t use Strava. There are lots of people like me who ride bikes but don’t use Strava. People who use Strava are generally the ones who care about how far they’ve gone and how fast they’ve gone. I just care about getting to my final destination safely. So, I agree, this is just lazy research and doesn’t represent cyclists as a whole.

    If they want to use Strava to do their research, they’d be better off inviting people who cycle regularly – preferably commuters – and then giving them all Strava accounts just for logging their participation in the survey. You’d still have selection bias, but short of sticking GPS devices onto random bicycle commuters I’m not sure how you’d get around that.


    • Hey there!!

      Thank you for commenting, glad you like the post!

      I agree there is some kind of systematic difference. But we really don’t know what that is–there are a LOT of thoughts about it, and many people have ideas, but I don’t know.

      For example, my partner is kind of obsessed with how far he’s gone and how fast he went. He likes to train and all that kind of stuff. Yet, he hates Strava and doesn’t use it. So there’s something else. Maybe it’s the social component? Like, it’s people who like to compete socially? I dunno :P

      and you NAILED IT. That’s exactly right, and I think the article links to a study by Dr. Dill and she kind of did that. FIRST you find the people, THEN you get the GPS data. It’s a smaller number of people, but if you do a better job sampling, that won’t even matter as much.

      Oh, and there’s a way to do it. There is something called respondent driven sampling. It’s pretty awesome, and it ends up looking very similar to a random sample. I think it’d be perfect for studies on cycling!


  5. Use of Strava data from cyclists will be VERY limiting for research sampling and extrapolation. Just wrong. I know a lot of regular cyclists who don’t use Strava at all. They’re not interested in tracking, numbers. They just want to FEEL healthy and look healthy by cycling daily as part of their lifestyle and for recreation.

    Strava is only for fitness junkies or techie gadget oriented folks who would use a heart monitor, etc.


    • Hi there, thanks for commenting! :)

      I agree. I think that one of the main limitations of using strava is that the people you talked about just now are the most likely to get left out and not even represented. Or people who just bike to get where they need to go and they don’t consider biking as part of their identity AT ALL. They still count, and should be counted and the focus of improvements.


      • The local cycling coordinator, which is a paid full-time position at the City of Calgary, was asked if we would ever use Strava data. His diplomatic response: it’s a cool app, but would not be representative.

        Thank goodness. Otherwise it would be wrongful use of taxpayers’ money. Already there has been a huge ruckus over our $9 million cycle track plan over the next 4 yrs.


  6. Wrong app but right concept.

    My glance at the Strava data is that it works for advanced cyclists looking for popular road routes, locations of the bike shops where they hang out (often seen as a bright cluster) and mountain bike trails. I also found lines for multi use trails and shortcuts through suburban neighborhoods.

    There are some planning apps such as cyclephilly.org that are administered by gov’t planners and attempt to provide more detailed information about each trip. Ideally such an app that uses smart phone accelerometers to calculate bike trips without manual input could yield a good sample size through all income levels (if you can sell it to all income levels).


    • Thanks for the comment, John! Interesting info, especially about the bike shops — I didn’t really think of that. I like how that could show shops as important social hubs, or even making an economic argument for biking!

      Though, I’m not sure calling Strava cyclists “advanced” is fully accurate or desirable. There’s a judgmental (in terms of an evaluation sense, not in the sense of condescending tone) quality to it–placing higher value on things that are “advanced.”

      I think there are quite skilled riders out there who don’t use strava and have more experience, better handling ability than many strava users. For example, I can tell you that riding around with all the giant cars and buses sure builds agility skills!

      But the general point you make is one I agree with: there is most likely a systematic (for lack of a better word) difference that makes people who use strava different from those who don’t. I don’t mean that in a negative sense either, I’m not saying it’s good or bad to use strava or not.

      Thanks again for commenting!


      • My experience with the Chicago data, at least the heat maps, is that it shows the arterial routes where there is cycling infrastructure, such as defined trails, and routes that are generally efficient for traveling/commuting, but also for training.
        In terms of suggesting improvements, I would propose looking for empty portions that are not due to impassable geographic barriers (e.g. O’Hare) and treating those spots as opportunities. Also, to provide connectors between important routes.

        Planning improvements based on the Strava usage density alone is like expanding the expressways because everyone uses them, while ignoring the neighborhood streets.


      • Thanks for coming back and elaborating! :) I think that’s a good point too. Even if a strava data set were representative, there still are issues/ things to think about in terms of how to apply what is learned. Expand areas in use? Focus on increasing use in areas not in use? Both? How to allocate resources relatively and so on.


  7. I have just reread your post. And, the BikePortland.org article. I am not a scientist but have been responsible for much analytical work over the years. The Strava data is inappropriate. Strava is used mostly by racers and fitness hounds. I am surprised that ODOT didn’t recognize this at the outset. Your arguments are bang on and enlightening. Thanks for starting this discussion before other municipalities make the same mistake.


    • haha thank you! and I appreciate you taking so much time to read the materials. it’s so easy to just skim and comment (or read the title and then comment…as the NPR april fools joke showed! haha)

      My guess is that they did know it. But it’s really cool technology and a sample size that most researchers only dream about (in my line of work, a sample size of 200 is considered AMAZING, and it costs WAY more than that to actually get that many people).

      But people like big sample sizes and statistical power. Unfortunately, I bet if they run any analyses they’ll have TOO MUCH power, but that’s another discussion (seriously, if they start publishing articles about statistical significance without effect sizes too, I might faint). Okay, now I’m really getting on a tangent… haha

      Thanks for the kind words, too. Means a lot!


      • The article from bike Portland addressed this issue from the very beginning. ODOT is completely aware of the sample bias. The data is not intended to be used to determine where we need more capacity etc. It’s intended to help them understand how cyclists move through available roads and paths, interact with traffic signals, etc. It’s up for debate how differently different cyclists move through traffic, but I think that problem, if it exists, is much smaller than the reaction I’ve seen to this development suggests. I feel like the underlying issue here is a sort of class divide between ‘fitness junkies’ and other bike commuters which is being represented as a class divide between the wealthy and the poor. Just be careful that you aren’t viewing this minority-minority group (that is a subgroup of the already small group, cyclists) in the same prejudicial way that created the inequity you’ve devoted your life to extinguishing.


      • I agree that the data were not originally intended to be used to determine where more capacity is needed. That supports one of my main arguments. That’s not the purpose of the data, and yet that’s what they wanted to use it for:

        “…justify bicycle investments, plan for future bicycle traffic growth, illustrate the benefits of bike infrastructure investments, and so on”


  8. I’m a regular commuter and sometimes racer. I’ve never used Strava or any other GPS system. I’ve often said (to anyone who cares to listen) that roads like Skyline *should* have bike lanes, or other safety improvements. The Strava dataset is important because it reveals a certain kind of use.

    It’s not just a question of where the roads lead, which is important, but also a question of where people already ride. Sure, Skyline and Rock Creek don’t have grocery stores or elementary schools, but the fact that the Strava data shows so many riders gives proof that citizens ride there quite regularly. I’d argue that making the roads that people ride safer is as important as making destinations safer to access. Sure, the city creates bike traffic by creating safer routes to specific destinations, but it might as well also take a look where people already ride (some people, at least) and make those places safer, too.

    The Strava data makes that quite apparent. Here’s a road that sees a lot of use. It might not access an artisanal bakery or low-income school, but apparently people ride there a lot. Do those people riding on that road not deserve some accounting of their traffic patterns? Does it really seem to you that ODOT does not understand the difference between a sample of cargo bikers with kids on back and a sample of people with GPS on their bike? I argue that the people who ride Skyline do deserve some counting, and that ODOT does understand who they’re counting.


    • Even though you don’t, I think that it’s important to address social inequities when making infrastructure improvements. That’s the stance of my post, and the general way in which I approach things.


    • Thanks for posting! Quickly glanced at it. What I like about it is that right from the beginning it’s pretty clear about the purpose: they want to use it for planning. It’s not taking something meant for one thing, and then ad hoc using it for another.

      Still, the same problems can happen without careful attention–specific groups could be left out of consideration for improvements because of who they are getting their data from. Doesn’t mean it’s not useful and super cool, just means extra work needs to be done just to be sure that they aren’t neglecting already-marginalized populations.

      Thanks again for sharing!


  9. Can you point me to the part of your argument where you disproved the null hypothesis? The null hypothesis being that for planning purposes, the Strava-user database does not differ from other tools used to assemble data on cyclist behavior?
    I agree strongly with the sentiments expressed in the “research-0013.jpg” cartoon, but I can envision a number of different methods that share the same problems. In New York City, the authorities do “screenline” counts, where counters are positioned along certain high-traffic bike routes leading to midtown Manhattan. This is great for finding out how many people are traveling to midtown, but in my opinion it is unlikely to lead to improvements to bicycle infrastructure along routes that do not lead to midtown Manhattan. My point being, the city authorities didn’t need to buy a Strava data pack to get data that would have similar biases.
    If the goal of cycling promotion is to get people onto bikes, the overall problem with all types of collection of cyclist trip data is that they only measure trips taken by people who are actually cycling during the study period.
    My understanding of bicycling promotion market research is that transportation planners devote a great deal of attention to encouraging the “interested but concerned” folks who are not currently riding bikes because they feel it’s not safe. These people’s biking experiences are not going to be reflected in any kind of data collection project because they are not currently biking.


    • Just because other techniques have the same biases, doesn’t mean it’s ok and doesn’t mean they shouldn’t be critiqued. I’m not comparing this data set to current or other techniques.

      But, I agree with you about the other methods. I have an earlier post kind of along the same lines, about automated bike counters.

      There should be a mix, I think. Improvements in areas that are already in high use but also specific attention to areas that would see more cycling with changes and improvements. Your point is solid, and is something I focus heavily on–who is being left out of the research?

      Must pay attention to that to ensure already marginalized groups aren’t being left out even more.
      :)


  10. I consider the key issue for improving cycle facilities to be effectively targeting people who don’t cycle. This is where the research effort is needed.


  11. If only ODOT had all the money in the world, they could do everything perfectly. But they are trying to do what they can with the budget they have. And they actually do use community groups when they are working on projects. The Rose Quarter project had large community involvement and resulted in major bike improvement plans to be built in a few years. Their upcoming Outer Powell project has more emphasis on community involvement than anything else. But tackling a metro-wide community involvement plan to determine bicycle improvements seems like it should be a task for PBOT, METRO, and more than just ODOT. ODOT owns a minor portion of the roadways compared to other agencies. Maybe the judgment should be on the other agencies for not getting that done, not on ODOT.


    • You don’t need “all the money in the world” to put in the effort to get a representative sample. Otherwise, there wouldn’t be any studies at all. You don’t need a “perfect” project. Otherwise there wouldn’t be any studies at all. At no point did I suggest that a study needs to be outrageously expensive or perfect.

      Anyone who does a research study gets “judged” on their research design. That’s the way it is, and ultimately makes research stronger. If one cannot defend the choices made and gets upset over being “judged” then they are in the wrong field.


  12. You may not have said that the study needs to be outrageously expensive or perfect, but a community study of the size you are suggesting is expensive, so it was implied. I guess expensive is relative… But community-based participatory research costs on the order of 20 times more than the Strava data. You don’t just have to get the community involved, but you also have to sort through all that data and figure out what it means afterwards. Not to mention you typically get only a certain group of people to get involved in this. So then there is more money spent to reach out to make sure you are really getting the “community” and not just the people that scream the loudest.

    I agree all research is “judged” and that that is a great thing. I don’t think the Strava data is the solution to all of the METRO bike solutions, but it is one method ODOT uses among many others. One of their great methods is applying project-specific community-based participatory research. What is great about that is that they are addressing something they are working on and get it built. It’s not just studying things and thinking about them, it’s getting it done. They own far fewer roadways than the other agencies, so I am not sure how it is appropriate for them to do it on a bigger scale.

    If they are getting upset over being “judged” on the data set I agree that is wrong, but I didn’t get that from your article. To me it seemed like they stated the known limitations and that they are just excited to have one more tool in their pocket to help them provide a better traffic network for all users.


    • I didn’t suggest a size for a CBPR study, you are making assumptions here about that.

      You get what you pay for; this applies to research. Getting a ton of data cheaply doesn’t mean it’s high quality.

      Getting quality data and the right people to ensure social equity seems worth it to me.

      As I said in my post: yes, they stated the limitations, but there is some indication that they still act on their findings as though those limitations don’t really exist.
