
Question about "all connections are in use" #5

Open
djkpf opened this issue Jul 1, 2019 · 8 comments

Comments

@djkpf

djkpf commented Jul 1, 2019

Absolutely loving this package! Thank you.

With the permission of Cifra Club, I am using it to scrape the chords of the top 10 hits in the US going back many decades. When I run a large batch of urls, the package works well, but when I try to save my result, I get

Error in file(con, "r") : all connections are in use

I wonder whether this is a known issue, and if there is something I should be doing differently.

Thanks!

@brunaw
Member

brunaw commented Jul 2, 2019

Hey @djkpf, thanks for using it :) Looks like you're doing some really cool stuff.

Would it be possible for you to share the code you're using? This error is new to me, and seeing the code would make it much easier to find what's causing it. From the message, my guess is that something is going wrong while connecting to the website, which might be solved by taking it easier (giving R some time between connections, for example), but it's hard to tell without the code.
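For context, "taking it easier" could look something like the wrapper below (a sketch only; get_chords_slowly is a hypothetical name, not a chorrrds function):

```r
library(purrr)
library(chorrrds)

# Hypothetical wrapper: pause before each request so the previous
# connection has time to close and the site isn't hammered.
get_chords_slowly <- function(url, delay = 1) {
  Sys.sleep(delay)              # wait `delay` seconds before requesting
  chorrrds::get_chords(url)
}

# Usage sketch: chords <- purrr::map(urls, get_chords_slowly)
```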

@djkpf
Author

djkpf commented Jul 2, 2019

Hi there,

Thanks so much. I don't ask many questions on GitHub, but I will do my best.

Using the following data: https://github.com/djkpf/pop-chords/blob/master/sampledata_chorrrds.csv, I run the following code:

library(tidyverse)
library(chorrrds)

dt10 <- read_csv("./sampledata_chorrrds.csv")

chords59to62 <- dt10 %>%
  dplyr::pull(url) %>%
  purrr::map(chorrrds::get_chords) %>%
  purrr::map_dfr(dplyr::mutate_if, is.factor, as.character) %>%
  chorrrds::clean(message = FALSE)

save(chords59to62, file = "chords59to62.Rdata")

It is when I save that I get the error:

Error in file(con, "r") : all connections are in use

@brunaw
Member

brunaw commented Jul 2, 2019

Great, thanks! So, let's get into the issue:

It looks to me like you built the URLs from the song and artist names, but Cifra Club doesn't actually work that way, and unfortunately not all songs are available on the website. This is why the most common usage is to pass an artist to the get_songs() function, which returns the available songs for that artist.

What I've done in the following code, which uses your dt10 object, is:

  1. Created a column with the name format that the get_songs() function expects;
  2. Created a function that finds the songs of each of your artists (when they exist on Cifra Club) and compares those names to the one you're looking for, returning the correct Cifra Club URL for each song.

...but a big caveat is that I searched for some of the artists and songs manually, and some of them don't even exist on Cifra Club ): That might leave you with a smaller dataset.

(the code can be slow)

dt10 <- dt10 %>%
  mutate(
    x2 = x2 %>%
      str_remove_all(pattern = "'") %>%
      str_to_lower(),
    name = x2 %>%
      str_replace_all(pattern = " ", replacement = "-")
  )
  
correct_url <- function(artist, name_form, str_comparison){
  saf <- safely(get_songs)
  songs <- saf(artist)
  if(!is.null(songs$result)){
    if(dim(songs$result)[1] > 0){
      return(
        songs$result$url[
          songs$result$name %>%
            str_remove(pattern = paste0(name_form, " ")) %>%
            RecordLinkage::levenshteinDist(str2 = str_comparison) %>%
            which.min()
        ] %>%
          as.character()
      )
    } else { "artist not found" }
  } else { "artist not found" }
}

mps <- pmap_chr(list(
  artist = dt10$name, 
  name_form = dt10$x2, 
  str_comparison = dt10$x1), 
  correct_url)

The results of this should be what you'll use in the get_chords() function (except when the title wasn't found). Could you test it and let me know how it goes for you? :)

@djkpf
Author

djkpf commented Jul 2, 2019

Thanks! So one thing I should clarify is that what I was doing was working perfectly besides the error I received when I saved: Error in file(con, "r") : all connections are in use

All of the data collection actually worked perfectly. I was getting data from all the urls that existed. My only issue was that I couldn't save it because the connections were open. Could it be because it uses parallel processing?

I also used what you did and the data collection worked, but I could not save.

@brunaw
Member

brunaw commented Jul 2, 2019

Yes, sorry about that! You can just close the connections after scraping each URL, something like this:

library(tidyverse)
library(chorrrds)

dt10 <- read_csv("https://raw.githubusercontent.com/djkpf/pop-chords/master/sampledata_chorrrds.csv")

exit_get_chords <- function(url,...) {
  con <- chorrrds::get_chords(url)
  closeAllConnections()
  con
}

chords59to62 <- dt10 %>%
  dplyr::pull(url) %>%
  unique() %>% 
  purrr::map(exit_get_chords) %>% 
  purrr::map_dfr(dplyr::mutate_if, is.factor, as.character)   %>% 
  chorrrds::clean(message = FALSE)


save(chords59to62, file = "chords59to62.Rdata")

@brunaw
Member

brunaw commented Jul 2, 2019

Answering your question: no, it doesn't do anything in parallel, I need to look deeper to find the true source of this error. Thanks for pointing it out!

@djkpf
Author

djkpf commented Jul 3, 2019

This did the trick. Thanks so much for your help!!! Will let you know when the story is finished.

@leonawicz
Member

It sounds like connections are possibly being left open by either get_chords or get_songs after they open a connection to a URL. You would see this message if you use a procedure like the one above, mapping a function over many URLs: if a function fails to close its connection to a URL or file before exiting, you can max out the number of connections R allows open simultaneously.

But running get_songs and get_chords now and checking showConnections() doesn't reveal any open connections. I also checked the version of chorrrds prior to my PR being merged, and I didn't notice anything there either. It's possible this was fixed in an update at some point. Or, if it was failing silently on some URLs, the connection may not have closed before the function exited in just those cases. It would be interesting to know if you can still reproduce the problem.
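For anyone who wants to check this themselves, base R can inspect the connection pool directly (a sketch; the roughly-125-connection ceiling is R's default limit minus the three standard streams):

```r
# List connections opened by user code (stdin/stdout/stderr are only
# shown with all = TRUE). Any rows left after a scraping run suggest
# a leaked connection.
showConnections(all = FALSE)

# R allows roughly 125 user connections by default, so leaking one
# connection per URL exhausts the pool on a large batch. Closing
# everything frees the pool so save() can open its own connection:
closeAllConnections()
```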
