Create 2019 Indian Vidhan Sabha OCDIDs. #174

Merged on Oct 23, 2019 (5 commits)

Conversation

Contributor

@rahul-nath rahul-nath commented Oct 3, 2019

In this PR I've included a script that generates OCD IDs specifically for the Indian Vidhan Sabha elections in Maharashtra and Haryana. The generated OCD IDs are for the constituencies of these states and include their districts. Some decisions about district names were needed because of discrepancies between the Wikipedia pages for [Maharashtra districts](https://en.wikipedia.org/wiki/List_of_districts_of_Maharashtra#Districts) and [constituencies](https://en.wikipedia.org/wiki/List_of_constituencies_of_the_Maharashtra_Legislative_Assembly), and for Haryana districts and [constituencies](https://en.wikipedia.org/wiki/List_of_constituencies_of_the_Haryana_Legislative_Assembly). After some research, I deferred to the district pages over the analogous columns in the constituency pages. The choices are as follows:

- Yamunanagar is used over Yamuna Nagar
- Gondia is used over Gondiya
- Gurugram is used over Gurgaon
- Nuh is used over Mewat

where applicable.
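
The spelling choices above could be applied with a small lookup table. This is a minimal sketch, not the script in this PR: `DISTRICT_ALIASES` and `normalize_district` are hypothetical names, and the table contains only the four pairs listed above.

```python
# Hypothetical mapping: spelling seen on the constituency pages -> spelling
# preferred in this PR. Only the four pairs listed above are included.
DISTRICT_ALIASES = {
    "Yamuna Nagar": "Yamunanagar",
    "Gondiya": "Gondia",
    "Gurgaon": "Gurugram",
    "Mewat": "Nuh",
}


def normalize_district(name):
    """Return the preferred spelling of a district, or the name unchanged."""
    cleaned = name.strip()
    return DISTRICT_ALIASES.get(cleaned, cleaned)
```

Names without an entry pass through untouched, so the table only needs to list the known discrepancies.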

UPDATE: https://affidavit.eci.gov.in was determined to be the source of truth for constituency and district names.

Additionally, no changes were made to the aliases file located in identifiers/countries-in as it's unclear if that was necessary.

@rahul-nath
Contributor Author

@jamesturk @jpmckinney @jdmgoogle this is necessary for the imminent Vidhan Sabha elections. I am unable to assign reviewers so please assign yourself.

@rahul-nath
Contributor Author

Just want to bump this to make sure it's been seen.

@jpmckinney
Member

Format looks good to me – I haven't checked against source of truth.

@jpmckinney jpmckinney closed this Oct 9, 2019
@jpmckinney jpmckinney reopened this Oct 9, 2019
Contributor

@jdmgoogle jdmgoogle left a comment

Thanks for putting this together. I just have some questions around the script, the naming structure, and the original source of truth.

for parent in sorted(parent_set, key=lambda x: x.split(",")[-1]):
    print(parent)

for state_abbr, state in contests.items():
Contributor

Please sort the output so it's easier to read.

Contributor Author

@rahul-nath rahul-nath Oct 17, 2019

Should we sort by state name and then district name, or by state name and then constituency?
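
For reference, the two orderings in question could look roughly like this. It is a sketch under assumptions: the `(state, district, constituency)` tuple layout, the sample rows, and both helper names are hypothetical and not taken from the script.

```python
from operator import itemgetter

# Hypothetical sample rows: (state, district, constituency).
rows = [
    ("Maharashtra", "Sangli", "Khanapur"),
    ("Maharashtra", "Pune", "Shivajinagar"),
    ("Haryana", "Nuh", "Nuh"),
    ("Maharashtra", "Pune", "Kasba Peth"),
    ("Haryana", "Gurugram", "Gurgaon"),
]


def sort_by_district(rows):
    """Order by state, then district, then constituency."""
    return sorted(rows, key=itemgetter(0, 1, 2))


def sort_by_constituency(rows):
    """Order by state, then constituency, ignoring the district."""
    return sorted(rows, key=itemgetter(0, 2))
```

The first ordering keeps all constituencies of a district together; the second makes a single constituency easier to find within a state.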

# format hardcoded OCD ID
global new_file
ocd_id = "ocd-division/country:{}/state:{}/district:{}/cd:{}"
rest = "state {} district {} {} constituency {}"
Contributor

The naming here is a bit awkward. Maybe

${constituency} constituency, ${district} district, ${state}

E.g.,

Khanapur constituency, Sangli district, Maharashtra

Contributor Author

Will do. I think I'll use semicolons in lieu of the commas here, unless we should add new column names for constituency, district, and state (can we do that? It could potentially serve a future purpose)
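
On commas versus semicolons: if the file is written with Python's `csv` module, a name containing commas is quoted automatically, so the comma-separated form should be safe. A minimal sketch, assuming a hypothetical `slug` helper for the ID segments (the real script may build them differently, for example from district abbreviations):

```python
import csv
import io

# Template taken from the script under review.
OCD_ID = "ocd-division/country:{}/state:{}/district:{}/cd:{}"
# Name format suggested in the review comment above.
NAME = "{} constituency, {} district, {}"


def slug(text):
    # Assumption: lowercase with underscores; the real script may differ.
    return text.strip().lower().replace(" ", "_")


def make_row(country, state_abbr, state, district, constituency):
    """Build one [id, name] row for the output CSV."""
    ocd_id = OCD_ID.format(country, state_abbr, slug(district), slug(constituency))
    return [ocd_id, NAME.format(constituency, district, state)]


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "name"])
writer.writerow(make_row("in", "mh", "Maharashtra", "Sangli", "Khanapur"))
csv_text = buffer.getvalue()
```

Reading `csv_text` back with `csv.reader` returns the name as a single field, commas included.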

"Kasba Peth": "Kasbapeth"
}

const_replacements = {
Contributor

Why are these being replaced?

Contributor Author

The electoral districts in India have changed frequently in the last decade, so there's a lot of erroneous information out there. Some of it has made its way to Wikipedia, which is unfortunately the only place I found abbreviations. To reconcile the differences between the ultimate source of truth (https://affidavit.eci.gov.in) and the spreadsheet mapping abbreviations to districts, I use this dictionary. (Actually, this particular set of replacements is going to be taken out of this PR, as a more concrete source of constituencies has been found with all corresponding states and districts: https://electoralsearch.in)
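
As an illustration of how entries for such a dictionary might be found, the spreadsheet names can be compared against the trusted list and the leftovers reviewed by hand. This is only a sketch: `find_mismatches` is a hypothetical helper, and the sample lists are made up to mirror the "Kasba Peth" to "Kasbapeth" replacement shown above.

```python
def find_mismatches(trusted_names, spreadsheet_names):
    """Return spreadsheet names that do not appear in the trusted list.

    Comparison ignores case and surrounding whitespace; anything returned
    here is a candidate for a manual replacement entry.
    """
    trusted = {name.strip().casefold() for name in trusted_names}
    return sorted(
        name for name in spreadsheet_names
        if name.strip().casefold() not in trusted
    )


# Hypothetical sample data, not pulled from either source.
trusted = ["Kasbapeth", "Khanapur"]
spreadsheet = ["Kasba Peth", "Khanapur"]
mismatches = find_mismatches(trusted, spreadsheet)
```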

contests = {"hr": "Haryana", "mh": "Maharashtra"}
columns = ["id", "name"]
country = "in"
election = "Vidhan Sabha"
Contributor

OCD-IDs should be independent of any one election.

Contributor Author

Looking at the OCD-IDs for the Lok Sabha elections, it looks like the name of the election was included in the file containing the election (I was pattern matching). I'll take this out and generalize this script better.


for c_row in consts:
    # source of truth on district names:
    # https://affidavit.eci.gov.in/
Contributor

What, exactly, is being pulled from there? What's the input CSV that this script is munging?

Contributor Author

I will include another script I made that fetches data and creates constituency CSVs for each state from the new source. District and constituency information is pulled from that website. I'll detail the expected format of the district abbreviation in a comment, but that information must be retrieved from elsewhere; in this case, the abbreviations were taken from Wikipedia and manually put into a spreadsheet, without the use of any provided script.

Contributor

While that's useful to have, I'd prefer to split that out into a separate PR and have this one focus on only the OCD-IDs.

Contributor Author

Great. I split the PRs with this one containing the OCD IDs and another that adds the scripts that generate them.

Contributor

@jdmgoogle jdmgoogle left a comment

Once the names of the constituencies are updated we should be good to go. Thanks.

Contributor

@jdmgoogle jdmgoogle left a comment

@rahul-nath
Contributor Author

> Format looks good to me – I haven't checked against source of truth.

Awesome, sounds good @jpmckinney. Let me know if there are any changes that need to be made to the additional OCD-IDs.

@jpmckinney jpmckinney merged commit 6587ba6 into opencivicdata:master Oct 23, 2019