A curated list of online newspapers covering 79 languages and 7267 sources.
This list provides newspaper sources that can be useful for building corpora for NLP applications. It is particularly helpful for low-resource languages, which lack large-scale datasets.
If you like this work or found it useful in your own, please consider supporting the project with a small donation at buymeacoffee.
- Afrikaans - 4 sources
- Albanian - 47 sources
- Amharic - 10 sources
- Arabic - 441 sources
- Armenian - 30 sources
- Assamese - 44 sources
- Azerbaijani - 25 sources
- Basque - 1 source
- Belarusian - 3 sources
- Bengali - 253 sources
- Bulgarian - 57 sources
- Burmese - 11 sources
- Catalan - 22 sources
- Central Khmer - 18 sources
- Croatian - 76 sources
- Czech - 115 sources
- Danish - 54 sources
- Dutch - 88 sources
- English - 1038 sources
- Estonian - 23 sources
- Finnish - 161 sources
- French - 500 sources
- Galician - 1 source
- Georgian - 23 sources
- German - 270 sources
- Gujarati - 177 sources
- Hebrew - 16 sources
- Hindi - 154 sources
- Hungarian - 53 sources
- Icelandic - 17 sources
- Indonesian - 51 sources
- Italian - 126 sources
- Japanese - 78 sources
- Kannada - 154 sources
- Kazakh - 8 sources
- Korean - 62 sources
- Lao - 5 sources
- Latvian - 37 sources
- Lithuanian - 45 sources
- Luxembourgish - 1 source
- Macedonian - 30 sources
- Malagasy - 2 sources
- Malay (macrolanguage) - 2 sources
- Malayalam - 140 sources
- Maltese - 5 sources
- Marathi - 155 sources
- Modern Greek (1453-) - 97 sources
- Mongolian - 13 sources
- Nepali (macrolanguage) - 66 sources
- Norwegian - 152 sources
- Norwegian Bokmål - 2 sources
- Norwegian Nynorsk - 18 sources
- Oriya (macrolanguage) - 43 sources
- Panjabi - 157 sources
- Persian - 60 sources
- Polish - 57 sources
- Portuguese - 184 sources
- Pushto - 11 sources
- Romanian - 76 sources
- Russian - 174 sources
- Serbian - 54 sources
- Sindhi - 1 source
- Sinhala - 20 sources
- Slovak - 34 sources
- Slovenian - 27 sources
- Spanish - 792 sources
- Swahili (macrolanguage) - 12 sources
- Swedish - 103 sources
- Tagalog - 8 sources
- Tajik - 3 sources
- Tamil - 142 sources
- Telugu - 92 sources
- Thai - 25 sources
- Turkish - 76 sources
- Turkmen - 1 source
- Ukrainian - 21 sources
- Urdu - 49 sources
- Uzbek - 2 sources
- Vietnamese - 62 sources
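
When building a corpus, it can help to load the index above into a structure that is easy to filter, for example to pick out languages with only a few sources. The following is a minimal sketch, not part of this project: it assumes the index is available as plain text with one `- Language - N sources` entry per line, and the names `ENTRY` and `parse_index` are hypothetical.

```python
import re

# Hypothetical pattern for one index entry, e.g. "- Afrikaans - 4 sources".
# The language name is matched lazily so names containing hyphens or
# parentheses, such as "Malay (macrolanguage)", stay intact.
ENTRY = re.compile(r"^-\s+(?P<language>.+?)\s+-\s+(?P<count>\d+)\s+sources?$")


def parse_index(text: str) -> dict[str, int]:
    """Map each language name to its source count; skip non-entry lines."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            counts[match.group("language")] = int(match.group("count"))
    return counts


# A three-entry excerpt of the index, used here as sample input.
sample = """\
- Afrikaans - 4 sources
- Albanian - 47 sources
- Malay (macrolanguage) - 2 sources
"""

counts = parse_index(sample)
print(len(counts), "languages,", sum(counts.values()), "sources")

# Languages with fewer than 10 sources in the sample.
print(sorted(name for name, n in counts.items() if n < 10))
```

Running the same function over the full index gives a quick check that the per-language counts add up to the totals stated at the top of this list.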