Create 2019 Indian Vidhan Sabha OCDIDs. #174
@jamesturk @jpmckinney @jdmgoogle this is necessary for the imminent Vidhan Sabha elections. I am unable to assign reviewers so please assign yourself.
Just want to bump this to make sure it's been seen.
Format looks good to me – I haven't checked against source of truth.
Thanks for putting this together. I just have some questions around the script, the naming structure, and the original source of truth.
scripts/create_ocd_ids.py
Outdated
for parent in sorted(parent_set, key=lambda x: x.split(",")[-1]):
    print(parent)

for state_abbr, state in contests.items():
Please sort the output so it's easier to read.
Should we sort by state name and then district name, or by state name and then constituency?
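One ordering that covers both options is state, then district, then constituency. A minimal sketch, assuming each row is a `(state, district, constituency)` tuple; the tuple layout, the `sort_rows` helper, and the sample rows are illustrative and not taken from the script:

```python
# Hypothetical rows: (state, district, constituency). The layout is assumed
# for illustration; the script in this PR may store its data differently.
rows = [
    ("Maharashtra", "Sangli", "Khanapur"),
    ("Haryana", "Gurugram", "Gurgaon"),
    ("Maharashtra", "Pune", "Kasba Peth"),
]


def sort_rows(rows):
    """Sort by state, then district, then constituency, ignoring case."""
    return sorted(rows, key=lambda row: tuple(part.casefold() for part in row))


for state, district, constituency in sort_rows(rows):
    print(state, district, constituency, sep=", ")
```

Because the key is the whole tuple, constituencies stay grouped under their district and districts under their state, so the output reads the same way regardless of input order.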
scripts/create_ocd_ids.py
Outdated
# format hardcoded OCD ID
global new_file
ocd_id = "ocd-division/country:{}/state:{}/district:{}/cd:{}"
rest = "state {} district {} {} constituency {}"
The naming here is a bit awkward. Maybe
${constituency} constituency, ${district} district, ${state}
E.g.,
Khanapur constituency, Sangli district, Maharashtra
Will do. I think I'll use semicolons in lieu of the commas here, unless we should add new columns for constituency, district, and state. (Can we do that? It could potentially serve a future purpose.)
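For reference, commas inside the name do not have to clash with the CSV delimiter if the file is written with a CSV library. A minimal sketch, assuming Python's standard `csv` module and the `id`/`name` columns from the script; the `format_row` and `slug` helpers and the slug rule (lowercase, spaces to underscores) are hypothetical stand-ins, not the script's actual logic:

```python
import csv
import io

OCD_ID = "ocd-division/country:{}/state:{}/district:{}/cd:{}"
NAME = "{} constituency, {} district, {}"


def slug(text):
    # Assumed simplification: lowercase and replace spaces with underscores.
    return text.lower().replace(" ", "_")


def format_row(country, state_abbr, state, district, constituency):
    """Build one [id, name] row using the suggested naming structure."""
    ocd_id = OCD_ID.format(country, state_abbr, slug(district), slug(constituency))
    name = NAME.format(constituency, district, state)
    return [ocd_id, name]


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "name"])
writer.writerow(format_row("in", "mh", "Maharashtra", "Sangli", "Khanapur"))
output = buffer.getvalue()
print(output)
```

With the default quoting mode, `csv.writer` wraps any field containing the delimiter in double quotes, so the name survives a round trip through `csv.reader` with its commas intact.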
scripts/create_ocd_ids.py
Outdated
    "Kasba Peth": "Kasbapeth"
}

const_replacements = {
Why are these being replaced?
The electoral districts in India have changed frequently in the last decade, so there's a lot of erroneous information out there. Some of it has made its way to Wikipedia, which is unfortunately the only place I'd found abbreviations. To reconcile the differences between the ultimate source of truth (https://affidavit.eci.gov.in) and the spreadsheet of abbreviations to districts, I use this dictionary. (Actually, this particular set of replacements is going to be taken out of this PR, as a more concrete source of constituencies has been found with all corresponding states and districts: https://electoralsearch.in)
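To illustrate the idea, here is a minimal sketch of how such a replacement dictionary can reconcile spellings. The `canonical_name` helper and the dictionary name `replacements` are hypothetical; only the `"Kasba Peth": "Kasbapeth"` entry is visible in the diff above, and the rest of the script's dictionaries are not reproduced here:

```python
# Maps a spelling found in one source to the preferred spelling.
# Only this entry is visible in the diff; the variable name is assumed.
replacements = {
    "Kasba Peth": "Kasbapeth",
}


def canonical_name(name, replacements):
    """Return the preferred spelling, or the cleaned input if none is listed."""
    cleaned = " ".join(name.split())  # trim and collapse stray whitespace
    return replacements.get(cleaned, cleaned)


print(canonical_name("Kasba  Peth", replacements))
print(canonical_name("Khanapur", replacements))
```

Names without an entry pass through unchanged, so the dictionary only needs to list the spellings that actually differ between sources.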
scripts/create_ocd_ids.py
Outdated
contests = {"hr": "Haryana", "mh": "Maharashtra"}
columns = ["id", "name"]
country = "in"
election = "Vidhan Sabha"
OCD-IDs should be independent of any one election.
Looking at the OCD-IDs for the Lok Sabha elections, it looks like the name of the election was included in the file containing the election (I was pattern matching). I'll take this out and generalize this script better.
scripts/create_ocd_ids.py
Outdated
for c_row in consts:
    # source of truth on district names:
    # https://affidavit.eci.gov.in/
What, exactly, is being pulled from there? What's the input CSV that this script is munging?
I will include another script I made that fetches data and creates constituency CSVs for each state from the new source. District and constituency information is pulled from that website. I'll detail the expected format of the district abbreviations in a comment, but that information must be retrieved from elsewhere; in this case, the abbreviations were taken from Wikipedia and manually entered into a spreadsheet without the use of any provided script.
While that's useful to have, I'd prefer to split that out into a separate PR and have this one focus on only the OCD-IDs.
Great. I split the PRs with this one containing the OCD IDs and another that adds the scripts that generate them.
Once the names of the constituencies are updated we should be good to go. Thanks.
Awesome, sounds good @jpmckinney. Let me know if there are any changes that need to be made to the additional OCD-IDs.
In this PR I've included a script that I created to generate OCD IDs specifically for the Indian Vidhan Sabha elections of Maharashtra and Haryana. The OCD IDs generated are for constituencies of these states, which include their districts.
There were some decisions made regarding district names due to discrepancies between the districts in Wikipedia pages for [Maharashtra districts](https://en.wikipedia.org/wiki/List_of_districts_of_Maharashtra#Districts) and [constituencies](https://en.wikipedia.org/wiki/List_of_constituencies_of_the_Maharashtra_Legislative_Assembly) and Haryana districts and [constituencies](https://en.wikipedia.org/wiki/List_of_constituencies_of_the_Haryana_Legislative_Assembly). The district pages were deferred to over the analogous columns in the constituency pages after some research. They are as follows:

- `Yamunanagar` is used over `Yamuna Nagar`
- `Gondia` is used over `Gondiya`
- `Gurugram` is used over `Gurgaon`
- `Nuh` is used over `Mewat`

where applicable.

UPDATE: It was determined that https://affidavit.eci.gov.in is the source of truth regarding constituency and district names.

Additionally, no changes were made to the aliases file located in `identifiers/countries-in` as it's unclear if that was necessary.