Opening up the Tunisian Institute of Statistics' poverty report of 2020
By Iyed Ghedamsi
November 14, 2020
As is the case with most publications from Tunisian public institutions that generate data, the National Institute of Statistics (NIS) has recently released a report on poverty in Tunisia as a … PDF file. The Tunisian citizen in me cringed; the data nerd in me decided it would make a good project to showcase how to transform a PDF into a more open format, and even throw in some interactive data visualization.
Before moving further in this blog post, it is important to establish what it is, and what it is not.
What this is not
- An analysis of the poverty data in Tunisia, although I am very tempted to do it.
- A bashing of the NIS; they are probably overworked, paid too little, and lack the necessary training to operate and maintain an open data pipeline.
What this is
- A make-up routine for a PDF file using Python, Tabula, Plotly and Streamlit.
If you want to play with the data file, you can find it at this link. The interactive dashboard is viewable at this link.
Data extraction from PDF
The first step of this project is evidently transforming the data inside the PDF into structured data. For this purpose, we will use two excellent Python libraries: Tabula for PDF tables’ manipulation, and Pandas for data cleaning and reshaping.
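Here is a minimal sketch of the extraction call, assuming the report is saved locally as NIS_poverty_report.pdf:

import tabula

# Read the tables on page 40 of the report. With multiple_tables=True,
# Tabula returns a list of pandas DataFrames, so we keep the first one.
tables = tabula.read_pdf(
    "NIS_poverty_report.pdf",
    pages=40,
    multiple_tables=True,
)
df = tables[0]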
In this snippet, we are telling Tabula to read the file NIS_poverty_report.pdf, on page 40 specifically, with the multiple_tables option enabled. Note that the multiple_tables parameter means Tabula will return a list of pandas DataFrames, which is why we take the first element of the read_pdf return value. This will become relevant when extracting the data for Medenine and Tataouine, which happen to be on the same page.
A nice tutorial about table extraction from PDF using Tabula can be found at this link.
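A sketch of the same call with pandas_options spelled out; the header value below is an assumption about the table layout, not something dictated by the report:

import tabula

# Anything you would normally pass to pandas.read_csv can go in pandas_options.
df = tabula.read_pdf(
    "NIS_poverty_report.pdf",
    pages=40,
    multiple_tables=True,
    pandas_options={"header": None},  # assumed: the extracted table has no header row
)[0]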
The other important argument to pay attention to with Tabula is pandas_options. To properly understand what it does, we simply need to think of the dictionary passed to it as the Python convention for passing keyword arguments in **kwargs. Essentially, after reading the table from the PDF, Tabula parses the result as a CSV using pandas, according to the keyword arguments passed in pandas_options.
What remains now is to go through the pages specific to each governorate, extract their corresponding tables, then merge them into the final dataset. The following snippet provides an overview of the process in code (the full code is available here).
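A minimal sketch of that loop; the governorate-to-page mapping is a hypothetical excerpt, not the actual page numbers from the report:

import pandas as pd
import tabula

# Hypothetical mapping from governorate to the report page holding its table.
governorate_pages = {"Tunis": 38, "Ariana": 39, "Medenine": 40}

frames = []
for governorate, page in governorate_pages.items():
    # Extract the governorate's table and tag each row with its name.
    table = tabula.read_pdf(
        "NIS_poverty_report.pdf",
        pages=page,
        multiple_tables=True,
        pandas_options={"header": None},
    )[0]
    table["governorate"] = governorate
    frames.append(table)

# Stack all governorates into the final dataset.
dataset = pd.concat(frames, ignore_index=True)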
Augmenting the data: fuzzy string matching
The dataset we have extracted thus far can be visualized as is to get insights about regional differences in poverty rates, and how these relate to primary and secondary school dropout rates. That is all well and good, but I wanted a map (the technical term is a choropleth map, from the Greek choros for area and plethos for multitude).
As such, I embarked on the journey to augment the dataset with the administrative borders of Tunisian delegations. Getting the data was relatively simple: the kind people at data4tunisia host a platform where a variety of entities and people can share their data publicly, and that is where I got the geo data for Tunisian delegations.
The challenge that jumps out right away is that the key variables used to join the datasets, i.e. the governorate name and the delegation name, are not written the same way. To better illustrate the problem, let us look at a subset of the delegation names we are trying to match.
Delegation name from NIS | Delegation name from geo data |
---|---|
Ettahrir | El Tahrir |
Tina | Thyna |
La Medina | Médina |
Ariana Ville | Ariana Médina |
Souk Lahad | Souk El Ahed |
Elksar | Ksar |
Boumhel Bassatine | Boumhel |
Sidi Ali Ben Nasrallah | Nasrallah |
Sfax Ville | Sfax Médina |
Sidi Amor Bou Hajla | Bouhajla |
With 264 unique delegation names (there are 264 delegations, but two of them are called Ezzouhour), this task has to be done programmatically. The best way I found to deal with this is by formulating the challenge as an assignment problem. In essence, this entails building a cost matrix between the names from the NIS and the names from the geo data, where the cost is some measure of distance between the text of the two names, then running an optimization algorithm that assigns matching names so as to minimize the total distance (or cost) between the matched names.
For the implementation of this step, we resort to two excellent Python packages: TextDistance and SciPy. TextDistance gives us access to a wide variety of text distance metrics; through experimentation I settled on the Jaccard index. As for the optimization algorithm, we will use the Hungarian algorithm.
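As a quick illustration, here is the Jaccard distance on one pair we expect to match (the value agrees with the first row of the assignment table further down):

import textdistance

# Normalized distance: 0 means identical strings, 1 means nothing in common.
textdistance.jaccard.normalized_distance("Ettahrir", "El Tahrir")  # ≈ 0.4545

We start from the two lists of names to be matched: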
from random import shuffle

NSI_names = [
"Ettahrir",
"Tina",
"La Medina",
"Ariana Ville",
"Souk Lahad",
"Elksar",
"Boumhel Bassatine",
"Sidi Ali Ben Nasrallah",
"Sfax Ville",
"Sidi Amor Bou Hajla",
]
geodata_names = [
"El Tahrir",
"Thyna",
"Médina",
"Ariana Médina",
"Souk El Ahed",
"Ksar",
"Boumhel",
"Nasrallah",
"Sfax Médina",
"Bouhajla",
]
# We shuffle them randomly to avoid having the existing order affecting the algorithm
# even though theoretically it does not matter
shuffle(NSI_names)
shuffle(geodata_names)
Now we compute the cost matrix as follows:
import numpy as np
import textdistance
from itertools import product

pairs = product(NSI_names, geodata_names)
scores = np.array(
[textdistance.jaccard.normalized_distance(q, c) for q, c in pairs]
).reshape((len(NSI_names), len(geodata_names)))
NSI name | Boumhel | Bouhajla | Souk El Ahed | Thyna | Médina | Nasrallah | Sfax Médina | Ksar | El Tahrir | Ariana Médina |
---|---|---|---|---|---|---|---|---|---|---|
Souk Lahad | 0.785714 | 0.615385 | 0.533333 | 0.846154 | 0.857143 | 0.8125 | 0.6875 | 0.923077 | 0.8125 | 0.789474 |
Ariana Ville | 0.882353 | 0.823529 | 0.8 | 0.866667 | 0.8 | 0.6875 | 0.722222 | 0.857143 | 0.6875 | 0.529412 |
Tina | 1 | 0.909091 | 1 | 0.5 | 0.571429 | 0.916667 | 0.75 | 0.857143 | 0.7 | 0.785714 |
Sidi Ali Ben Nasrallah | 0.84 | 0.8 | 0.692308 | 0.875 | 0.833333 | 0.590909 | 0.730769 | 0.869565 | 0.76 | 0.6 |
Elksar | 0.916667 | 0.833333 | 0.8 | 0.9 | 0.909091 | 0.636364 | 0.9375 | 0.571429 | 0.636364 | 0.882353 |
Boumhel Bassatine | 0.588235 | 0.611111 | 0.73913 | 0.842105 | 0.85 | 0.761905 | 0.782609 | 0.894737 | 0.761905 | 0.8 |
La Medina | 0.933333 | 0.866667 | 0.833333 | 0.833333 | 0.5 | 0.875 | 0.461538 | 0.916667 | 0.8 | 0.533333 |
Sidi Amor Bou Hajla | 0.761905 | 0.65 | 0.652174 | 0.956522 | 0.863636 | 0.833333 | 0.75 | 0.904762 | 0.782609 | 0.666667 |
Sfax Ville | 0.866667 | 0.875 | 0.777778 | 0.928571 | 0.857143 | 0.8125 | 0.6 | 0.923077 | 0.733333 | 0.85 |
Ettahrir | 0.928571 | 0.857143 | 0.888889 | 0.818182 | 0.833333 | 0.785714 | 0.882353 | 0.8 | 0.454545 | 0.833333 |
For the actual optimization, we call linear_sum_assignment from the scipy.optimize module, which applies the Hungarian algorithm to our cost matrix.
import pandas as pd
from scipy.optimize import linear_sum_assignment

row_ind, col_ind = linear_sum_assignment(scores)
scores_df = pd.DataFrame(
[
{"source": NSI_names[i], "target": geodata_names[j], "distance": scores[i, j]}
for i, j in zip(row_ind, col_ind)
]
).sort_values("distance")
This gives the correct assignments:
source | target | distance |
---|---|---|
Ettahrir | El Tahrir | 0.454545 |
Tina | Thyna | 0.5 |
La Medina | Médina | 0.5 |
Ariana Ville | Ariana Médina | 0.529412 |
Souk Lahad | Souk El Ahed | 0.533333 |
Elksar | Ksar | 0.571429 |
Boumhel Bassatine | Boumhel | 0.588235 |
Sidi Ali Ben Nasrallah | Nasrallah | 0.590909 |
Sfax Ville | Sfax Médina | 0.6 |
Sidi Amor Bou Hajla | Bouhajla | 0.65 |
Parting words
This project was truly fun because the actual data collection was not hard, yet it had enough challenges to keep things interesting. For such a basic project, I dealt with extracting data from PDFs, fuzzy string matching, optimization algorithms, and the final step of turning the results into an interactive dashboard, which you can find here.
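For reference, the dashboard boils down to something like the sketch below, assuming a GeoJSON of delegation borders and a poverty_rate column in the dataset (the file names, column names, and featureidkey are placeholders, not the exact ones used in the project):

import json

import pandas as pd
import plotly.express as px
import streamlit as st

# Hypothetical file names; the real ones live in the project repository.
dataset = pd.read_csv("poverty_by_delegation.csv")
with open("delegations.geojson") as f:
    delegations = json.load(f)

# Choropleth of poverty rates at the delegation level.
fig = px.choropleth(
    dataset,
    geojson=delegations,
    locations="delegation",  # column matched against the GeoJSON features
    featureidkey="properties.deleg_name",  # assumed property name in the GeoJSON
    color="poverty_rate",
)
fig.update_geos(fitbounds="locations", visible=False)

# Render the figure inside the Streamlit app.
st.plotly_chart(fig)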
For the full code used in this project, please visit my GitHub repo.