
Conflict Data Additions #11

Open
ajnafa opened this issue Oct 18, 2021 · 7 comments
Labels
enhancement New feature or request

Comments


ajnafa commented Oct 18, 2021

Let me start by saying this is an awesome package that has been extremely useful for constructing duration data. Over the past decade peace science (comparative politics/civil conflict) has moved in a subnational direction (there's only so much we can learn from the country-year), and I'm wondering how much work it would be to add support for the major event-level data projects, namely UCDP's Georeferenced Event Dataset (GED) Global version 21.1 and the Armed Conflict Location & Event Data Project (ACLED).

Due to size limitations it may not be feasible to include them by default, but this could be handled via some basic functions that, when called, download and import the data. In the case of ACLED there may be some licensing issues that complicate things. The most straightforward solution I can think of there would be to use their recently added API and have users specify their own key, though it may be necessary to seek guidance from ACLED on how best to handle that, since an API key is now required even to download their curated data files. I may have some time to work on some of the API stuff next month if you're interested in adding that functionality.

@ajnafa ajnafa changed the title Conflict Data Editions Conflict Data Additions Oct 18, 2021
@svmiller svmiller added the enhancement New feature or request label Oct 18, 2021
svmiller commented Oct 18, 2021

Remember, you're dealing with an old school Correlates of War-era peace scientist here who is more knowledgeable about inter-state stuff than intra-state stuff, so I might ask some dumb questions. Importantly, is there a type of "seminal" analysis on this front? Think of articles like Fearon and Laitin (2003) [state-year] and Bremer (1992) [dyad-year] as illustrative. If you do state-level analyses of civil conflict, or dyadic analyses of dispute onset, you've seen these 100 times before, and the basic information in there is mimicked in every similar analysis that adjusts the set of covariates or the sampling frame. When it comes to different levels of analysis, I'm less knowledgeable and could best see the value if I knew there was an exemplar analysis that serves as a template to copy.

I downloaded the GED and, 50 MBs into the download as I type, I already know that ain't happening. 😛 I could see some workarounds, as you mentioned. Perhaps the data can be stored remotely, and loaded in by the user. I'm trying to keep those remote data sets to a bare minimum for a variety of CRAN-related reasons, but that's an option. That said, here's where I think this would be optimal.

For one, the UCDP/PRIO people have such a killer suite of data sets for researchers doing civil conflict that I really think they need their own API for these things, and perhaps their own R package around it. Perhaps I'm being too vain or misremembering things, but I seem to recall some UCDP/PRIO people poking around this package late last year (when it first went on CRAN) and talking about how they really need something like that with their own data. I think they do, and I think {peacesciencer} would work better with that as a suggested package. An R package like the one I offer here has some flexibility and some limitations vis-a-vis standalone programs. Flexibility: not everything needs to be one centralized program. Limitation: not everything can be part of one centralized program.

Second, and this is invariably going to happen soon (perhaps really soon if it's urgent): {peacesciencer} is getting a helper function related to this issue. It'll be named something like declare_datatype(), where the user can "declare" their data to be a certain type in {peacesciencer}. Look at the code underneath the functions and you'll see the create_* functions assign data-type attributes and the other functions depend on those. If you do something like a group_by() and summarize(), you'll overwrite those attributes and need to put them back (add_peace_years() is the worst offender here). What this "declare" function would do is allow the user to declare their data to be a certain type, and that would allow the built-in functions to work.
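To show what I mean about losing the attributes, here's a minimal illustration; the attribute name "ps_data_type" is just a stand-in for whatever the create_* functions actually attach:

library(dplyr)

# Toy state-year data with a custom attribute attached, the way a
# create_*() function might do it (the attribute name is illustrative)
dat <- tibble(ccode = c(2, 2, 20), year = c(1950, 1951, 1950), x = rnorm(3))
attr(dat, "ps_data_type") <- "state_year"

attr(dat, "ps_data_type")
#> [1] "state_year"

# A grouped summarize returns a new tibble, so the custom attribute is gone
# and would have to be re-declared before the add_*() functions can use it
dat %>%
  group_by(ccode) %>%
  summarize(x = mean(x)) %>%
  attr("ps_data_type")
#> NULL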

In a case like this, I can see this working as follows. You have subnational units, but those units are nested in states. Perhaps there are covariates at the subnational level that I don't know about, and perhaps I just need to be shown why these are important as a different level of analysis (beyond state-year, dyad-year, and soon: leader-levels). There are also state-level covariates that will allow you to make more cross-sectional comparisons at a higher order than the subnational-level (e.g. state-level GDP per capita). The "declare" function would allow you to load the GED data, "declare" them to be 1) state-level and 2) using the Gleditsch-Ward system of states and that would allow you to use a function like add_sdp_gdp() (which depends on those attributes).
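Concretely, I imagine the workflow looking something like this from the user's side. The declare_datatype() call and its arguments are hypothetical (the function doesn't exist yet), and I'm going from memory on the GED column names, but add_sdp_gdp() is already in the package:

library(peacesciencer)
library(dplyr)

# Hypothetical sketch: collapse GED events to Gleditsch-Ward state-years,
# "declare" the result to be state-year data in the G-W system, and then
# lean on the existing add_*() functions. declare_datatype() and its
# arguments are placeholders; country_id and best are GED columns as I
# remember them.
ged %>%
  group_by(gwcode = country_id, year) %>%
  summarize(events = n(),
            battle_deaths = sum(best, na.rm = TRUE),
            .groups = "drop") %>%
  declare_datatype(data_type = "state_year", system = "gw") %>%
  add_sdp_gdp()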

What do you think of that?

ajnafa commented Oct 19, 2021

Canonical isn't really my style so I hope you'll settle for interesting instead.

Zhou, Yang-Yang, and Andrew Shaver. 2021. “Reexamining the Effect of Refugees on Civil Conflict: A Global Subnational Analysis.” American Political Science Review: 1–22. https://doi.org/10.1017/S0003055421000502

Bove, Vincenzo, Jessica Di Salvatore, and Leandro Elia. 2021. “UN Peacekeeping and Households' Well-Being in Civil Wars.” American Journal of Political Science. https://doi.org/10.1111/ajps.12644

Hammond, Jesse. 2018. “Maps of Mayhem: Strategic Location and Deadly Violence in Civil War.” Journal of Peace Research 55(1): 32–46. https://doi.org/10.1177/0022343317702956

Regarding UCDP, the file is definitely large, and my thought was a function that downloads it from the internet, since as far as I know the direct link generally doesn't change. As far as structure goes, I have a pretty detailed script that turns the UCDP event data into duration format somewhere in my Dropbox that may provide a good place to start in terms of building a backend. I've always found it absolutely insane that, given how massive their database is, UCDP doesn't have its own API.
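Something like the sketch below would probably be enough for the download-and-import part. The URL here follows what I believe is the current pattern for the GED 21.1 CSV release, so treat it as illustrative rather than guaranteed:

# Rough sketch of a GED import helper using only base R. The URL pattern is
# an assumption about how UCDP structures its direct download links.
download_ucdp_ged <- function(version = "211",
                              url = paste0("https://ucdp.uu.se/downloads/ged/ged",
                                           version, "-csv.zip")) {
  tmp <- tempfile(fileext = ".zip")
  utils::download.file(url, tmp, mode = "wb")
  # Read the first (and only) CSV inside the zip archive
  csv_name <- utils::unzip(tmp, list = TRUE)$Name[1]
  out <- utils::read.csv(unz(tmp, csv_name), stringsAsFactors = FALSE)
  unlink(tmp)
  out
}

# ged <- download_ucdp_ged()

That would keep the package itself small while still putting the full event data one function call away.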

The structure of ACLED is basically identical, with the main difference being that UCDP consists of "conflict episodes" while ACLED strictly codes events. Unfortunately, this requires two slightly different approaches to expanding the data, with UCDP being a bit more tricky than ACLED (see the toy example below). As far as subnational covariates are concerned, the College of William & Mary's AidData project is the single best source for those and has them at multiple levels, but honestly it's probably way too much work to integrate that. As long as the standard GADM identifiers are included, which I believe they are in both UCDP GED and ACLED, I think it's best to leave the choice of subnational covariates to the end user, because selection is hard to generalize and requires specific knowledge of the question at hand. If there were a pre-compiled global data source with the nighttime lights data at various administrative levels, I think that would be worth including, but my main thought here was focused on the addition of the conflict data itself.
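To illustrate what I mean by the two approaches: ACLED rows are already discrete, dated events, so they just need to be counted, whereas anything episode-like with a start and end date has to be expanded into the panel's time unit first. A toy version of that expansion, with made-up column names:

library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

# Expand each episode into one row per month between its start and end
# dates; conflict_id, start_date, and end_date are made-up names here.
episodes <- tibble(
  conflict_id = c(1, 2),
  start_date  = ymd(c("1999-01-15", "2000-06-01")),
  end_date    = ymd(c("1999-04-02", "2000-07-20"))
)

episode_months <- episodes %>%
  mutate(month = map2(start_date, end_date,
                      ~ seq(floor_date(.x, "month"),
                            floor_date(.y, "month"),
                            by = "month"))) %>%
  unnest(month) %>%
  select(conflict_id, month)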

Honestly, I think the declare function idea is going to be the easiest way to make it as generalizable as possible, though I think it would be best to have users declare both the type of data they want as output (e.g., state-rebel dyad, time-to-event, aggregate counts) and, in the case of event-level data, the geographic unit, since both UCDP and ACLED have first, second, and third admin-level geocoding. Structuring the data is the biggest headache most people face, and automating that overcomes such a huge barrier that everything else seems easy; that's the functionality that I think integrates best alongside the existing functionality of {peacesciencer}.

The downside to downloading the data from the internet, of course, is that you'd have to add something like {RCurl} as a dependency if it isn't one already.

ajnafa commented Oct 20, 2021

I had a chance to look at the way you've got things structured, and I think the best place to start is going to be a generalization of the create_statedays() function called create_districtmonths(). The logic behind using the month as the time unit is that, as far as I recall, it is the lowest level at which UCDP guarantees the accuracy of the GED events, and that going more fine-grained than that quickly makes memory requirements an issue and forces you to tackle the problem of differential accuracy.

As far as the geographic level is concerned, Weidmann's (2015) analysis suggests that 95% of conflict events, in the GED in particular, fall within 50km of the reported coordinates, so the function should probably support the first and second administrative levels (the equivalents of states/provinces and counties, respectively, probably via an additional user-specified character argument) but not the third, because that introduces complications related to spatial error (and apparently some countries just don't have a third admin level at all).

This also helps resolve the problem of file sizes, since dealing only with a semi-aggregated version of the data drastically reduces the amount of space required and eliminates the potential headache of introducing non-API-based web dependencies.
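For a sense of what that semi-aggregation might look like, the step below collapses the raw events to admin1-month counts and death totals. The GED column names (country_id, adm_1, date_start, best) are from memory, so they should be checked against the codebook:

library(dplyr)
library(lubridate)

# Collapse GED events to event counts and best-estimate battle deaths by
# country, first-order administrative unit, and month. Column names are
# assumptions based on my recollection of the GED codebook.
ged_adm1_months <- ged %>%
  mutate(month = floor_date(as.Date(date_start), "month")) %>%
  group_by(country_id, adm_1, month) %>%
  summarize(events = n(),
            battle_deaths = sum(best, na.rm = TRUE),
            .groups = "drop")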

Any thoughts on this?

svmiller commented:

I hope you don't mind me saying this is something I might have to return to. Don't misinterpret silence on my end as lack of interest. It's just that this R package/manuscript was R&Red at an IR journal, and the major addition the reviewers/editor wanted was leader stuff. That's what I'm knee-deep in right now.

Re: time. Time is no issue; {lubridate} and the {tidyverse} make dates a cinch. Re: districts, though. Do we know, and know well, the full universe of districts/subnational units? Do they change/have they changed? In the American context, for example, we still stupidly have the 50 states, and I don't think any new counties have formed within states at any point in the past forever. I can see, however, that possibly being an issue in other countries, certainly war-torn countries, that impose wholesale changes to their borders (or parts of them) during or after a conflict.

Btw, I see you forked the GitHub repository. If you're wanting to propose an addition via GitHub, feel free. It might make more sense to drop whatever code additions you have in the data-raw directory. Writing documentation is a chore but I'm happy to do that.

ajnafa commented Oct 21, 2021

I absolutely understand. The GED stuff, at least in terms of the functions that would need to be written, happens to be closely related to some things I have to do for a manuscript I'm revising for ISA Midwest in late November, so I figured I might as well get started on them in the process. I probably should have led by saying that I already have most of the code written to do what I've described above for the GED data; it's mainly a matter of getting it into the proper format and verifying compatibility with the rest of the package.

As far as the geography stuff goes, there are three major systems (PRIO Grid, GADM, and GeoBoundaries), and it's worth considering which would be the most useful. I'll talk with some of the faculty in my department who do this sort of thing and see if there are any preferences for one system over another. The GED supports two of the three, but I need to double-check which ones those are. This is definitely a longer-term thing, but I'll hopefully have something ready and working by January or so.

svmiller commented:

Dumping these links here so I don't lose track of them.

https://twitter.com/adamjnafa/status/1472678781307564044
https://ucdp.uu.se/apidocs/

ajnafa commented Dec 19, 2021

Here's the code for the function in the tweet.

###------------------------------------------------------------------------------------###
###----------------- Functions for Pulling UCDP Data Via the REST API ------------------###
###------------------------------------------------------------------------------------###
#
# This is a function designed to import data from the Uppsala Conflict Data Program's 
# API directly into an R session. The function takes one required argument (.resource),
# plus several optional ones, and uses them to construct an API call.
#
# Arguments:
#  .resource    The UCDP dataset to be imported. Currently supported options include 
#               - `gedevents` for the UCDP Georeferenced Event Dataset (UCDP GED)
#               - `ucdpprioconflict` for the UCDP/PRIO Armed Conflict Dataset
#               - `dyadic` for the UCDP Dyadic Dataset
#               - `nonstate` for the UCDP Non-State Conflict Dataset
#               - `onesided` for the UCDP One-Sided Violence Dataset
#               - `battledeaths` for the UCDP Battle-Related Deaths Dataset
#
#  .version     The version of the UCDP resource to be downloaded. If not specified,
#               the argument defaults to the most recent release, which at this time
#               is version 21.1
#
#  .date_range  An optional vector of length two containing the start and end dates
#               of the time range to pull observations for. For example, 
#               `.date_range = c("1991-01-01", "2000-12-31")` would retrieve all events 
#               that occurred between January 1, 1991 and December 31, 2000. This argument
#               is only evaluated if `.resource = "gedevents"`
#
#  .filters     An optional string of additional conditions to use for filtering in the
#               API call. For more details see https://ucdp.uu.se/apidocs/
#
#  ...          Additional arguments. Currently supported options include `.pagesize`
#               which defaults to the maximum of 1000
#
# Usage:
#
#   # Make an API call for the UCDP GED
#   ged_data <- ucdp_api_data(
#     .resource = "gedevents", 
#     .version = "21.1", 
#     .date_range = c("1999-01-01", "2000-12-31")
#     )
#
#   # Print the first few rows of the retrieved data
#   head(ged_data[[3]])
#

# API call constructor function; uses {stringr}, {lubridate}, {jsonlite},
# {tibble}, and {purrr} via explicit namespacing
ucdp_api_data <- function(.resource, 
                          .version = NULL, 
                          .date_range = NULL, 
                          .filters = NULL, 
                          ...
                          ){
  # Capture the optional arguments passed via ...; currently only .pagesize is used
  .dots <- list(...)
  .pagesize <- .dots[[".pagesize"]]
  # Base URL for the UCDP API
  .base_url <- "https://ucdpapi.pcr.uu.se/api"
  # Check for a user-specified argument of .version;
  # if .version is not specified, use the most recent version
  if (!is.null(.version)) {
    .base_api_string <- stringr::str_c(.base_url, .resource, .version, sep = "/")
  } else {
    .base_api_string <- stringr::str_c(.base_url, .resource, "21.1", sep = "/")
  }
  # Check for a user-specified argument of .pagesize;
  # if .pagesize is not specified, set it to 1000
  if (!is.null(.pagesize)) {
    .base_api_string <- stringr::str_c(.base_api_string, "?pagesize=", .pagesize)
  } else {
    .base_api_string <- stringr::str_c(.base_api_string, "?pagesize=1000")
  }
  # Check for a user-specified date range of length 2
  if (length(.date_range) == 2 && .resource == "gedevents") {
    # Construct a string with the specified time range
    .time_string <- stringr::str_c(
      "&StartDate=", 
      as.character(lubridate::ymd(.date_range[1])),
      "&EndDate=",
      as.character(lubridate::ymd(.date_range[2]))
      )
    # Update the base API call
    .base_api_string <- stringr::str_c(.base_api_string, .time_string)
  }
  # Append any additional user-specified filtering conditions
  if (!is.null(.filters)) {
    .base_api_string <- stringr::str_c(.base_api_string, .filters, sep = "&")
  }
  # Make the initial API call
  .result <- jsonlite::fromJSON(.base_api_string)
  
  # If the number of pages is > 1, recover each page
  if (.result$TotalPages > 1) {
    .df <- tibble::tibble(
      # Generate a sequence of pages (the API's pages are zero-indexed)
      .pages = seq(0, (.result$TotalPages - 1), 1),
      # Construct the API call for each page
      .api_calls = stringr::str_c(.base_api_string, "&page=", .pages)
      )
    
    # Recover each page and bind the results together
    .ucdp_data <- purrr::map_dfr(
      .x = .df$.api_calls,
      ~ jsonlite::fromJSON(.x)$Result
    )
    
    # Return the initial response, the page-level calls, and the combined data
    return(list(.result, .df, .ucdp_data))
  } else {
    # Otherwise, just return the initial response object
    return(.result)
  }
}
