feat: add exercises
honzajavorek committed Oct 8, 2024
Ta-da! We've managed to get links leading to the product pages. In the next lesson…

<Exercises />

### Scrape links to countries in Africa

Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print links to Wikipedia pages of all the states and territories mentioned in all tables. Start with this URL:

```text
https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa
```

Your program should print the following:

```text
https://en.wikipedia.org/wiki/Algeria
https://en.wikipedia.org/wiki/Angola
https://en.wikipedia.org/wiki/Benin
https://en.wikipedia.org/wiki/Botswana
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
response = httpx.get(listing_url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

# the third column of each table row holds the cell with the country's name
for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
    link = name_cell.select_one("a")
    # hrefs on Wikipedia are relative, so resolve them against the page URL
    url = urljoin(listing_url, link["href"])
    print(url)
```
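The key step is `urljoin()`: Wikipedia's links are relative paths such as `/wiki/Algeria`, so they must be resolved against the page URL before printing. A minimal sketch of how it behaves (the second href is an illustrative example, not taken from the page):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"

# a relative href is resolved against the base page URL
print(urljoin(base, "/wiki/Algeria"))
# → https://en.wikipedia.org/wiki/Algeria

# an absolute href passes through unchanged
print(urljoin(base, "https://en.wikipedia.org/wiki/Angola"))
# → https://en.wikipedia.org/wiki/Angola
```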

</details>

### Scrape links to F1 news

Download the Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print links to all the listed articles. Start with this URL:

```text
https://www.theguardian.com/sport/formulaone
```

Your program should print something like the following:

```text
https://www.theguardian.com/world/2024/sep/13/africa-f1-formula-one-fans-lewis-hamilton-grand-prix
https://www.theguardian.com/sport/2024/sep/12/mclaren-lando-norris-oscar-piastri-team-orders-f1-title-race-max-verstappen
https://www.theguardian.com/sport/article/2024/sep/10/f1-designer-adrian-newey-signs-aston-martin-deal-after-quitting-red-bull
https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-undriveable-monster-how-bad-really-is-it-and-why
...
```

<details>
<summary>Solution</summary>

```py
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

listing_url = "https://www.theguardian.com/sport/formulaone"
response = httpx.get(listing_url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

# each list item is one article card; take only its first link
for item in soup.select("#maincontent ul li"):
    link = item.select_one("a")
    url = urljoin(listing_url, link["href"])
    print(url)
```

Note that some cards contain two links: one leads to the article and the other to the comments. If we selected all the links in the list with `#maincontent ul li a`, we would get incorrect output like this:

```text
https://www.theguardian.com/sport/article/2024/sep/02/example
https://www.theguardian.com/sport/article/2024/sep/02/example#comments
```
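If we did want to select all the links, one way to drop such duplicates would be to strip the URL fragment and skip URLs we have already seen. A sketch, assuming the duplicates differ only in the `#comments` fragment (the URLs below are the illustrative examples from above):

```python
from urllib.parse import urldefrag

urls = [
    "https://www.theguardian.com/sport/article/2024/sep/02/example",
    "https://www.theguardian.com/sport/article/2024/sep/02/example#comments",
]

seen = set()
for url in urls:
    # urldefrag() splits off the #fragment part, if any
    url_without_fragment = urldefrag(url).url
    if url_without_fragment not in seen:
        seen.add(url_without_fragment)
        print(url_without_fragment)
# prints the article URL only once
```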

</details>
