Commit fe58297

Start with the openaddresses code
0 parents  commit fe58297

41 files changed, with 8758 additions and 0 deletions

CONTRIBUTING.md

+53
@@ -0,0 +1,53 @@
Pelias can't succeed without contributions from community members like you! Contributions come in many different shapes and sizes. In this file we provide guidance around two of the most common types of contributions: opening issues and opening pull requests.

# Community Values

We ask that you are respectful when contributing to Pelias or engaging with our community. As a community, we appreciate the fact that contributors might be approaching the project from a different perspective and background. We hope that beginners as well as advanced users will be able to use and contribute back to Pelias. We want to encourage contributions and feedback from all over the world, which means that English might not be a contributor's native language, and sometimes we may encounter cultural differences. Constructive disagreements can be essential to moving a project forward, but disrespectful language or behavior will not be tolerated.

Above all, be patient, be respectful, and be kind!

# Submitting Issues

All issues for Pelias are housed in the [pelias/pelias](https://github.com/pelias/pelias) repo. Before opening an issue, be sure to search the repository to see if someone else has asked your question before. If not, go ahead and [open a new issue](https://github.com/pelias/pelias/issues/new).

## Submitting technical bugs

When submitting bug reports, please be sure to give us as much context as possible so that we can reproduce the error you encountered. Be sure to include:
- System conditions (OS, browser, etc.)
- Steps to reproduce
- Expected outcome
- Actual outcome
- Screenshots, if applicable
- Code that exposes the bug, if you have it (such as a failing test or a barebones script)

## Submitting issues around search result quality

It's important to get feedback about the quality of local search results. When it comes to things like address structure, capitalization, and spelling errors, your local knowledge will make it easier for us to understand the problem. When submitting issues be sure to include:
- Where in the world you were searching
- Your search query
- Your expected result
- Your actual result


# Pull Requests Welcome!

## Project standards overview

Pelias has several miscellaneous standards:

- we use [JSHint](http://jshint.com/docs/) for linting
- we use [TravisCI](https://travis-ci.org/) for continuous integration
- we use [Winston](https://www.npmjs.com/package/winston) for logging
- we *love* tests, especially when written with [tape](https://github.com/substack/tape)
- we use [semver](http://semver.org/) for package versioning
- we *loosely* use [JSDoc](http://usejsdoc.org/index.html) for documenting code, as described [here](in_code_documentation_guidelines.md)

`jshint` and any unit tests in a project will be automatically invoked when you commit to an existing project; make
sure they exit successfully!

## Active contributors

We'll gladly invite active contributors to become members of the [Pelias organization](https://github.com/pelias). New
members will gain direct write permissions, *and with great power comes great responsibility*. To ensure that any new
repositories that you create conform to Pelias standards, we developed [pelias-init](https://github.com/pelias/init), a
simple project generator that will initialize all of the boilerplate needed to get started on something new.

Dockerfile

+19
@@ -0,0 +1,19 @@
# base image
FROM pelias/baseimage

# downloader apt dependencies
# note: this is done in one command in order to keep down the size of intermediate containers
RUN apt-get update && apt-get install -y unzip && rm -rf /var/lib/apt/lists/*

# change working dir
ENV WORKDIR /code/pelias/openaddresses
WORKDIR ${WORKDIR}

# copy code into image
ADD . ${WORKDIR}

# install npm dependencies
RUN npm install

# run tests
RUN npm test

LICENSE

+21
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2014 Mapzen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

+99
@@ -0,0 +1,99 @@
>This repository is part of the [Pelias](https://github.com/pelias/pelias) project. Pelias is an
>open-source, open-data geocoder built by [Mapzen](https://www.mapzen.com/) that also powers
>[Mapzen Search](https://mapzen.com/projects/search). Our official user documentation is
>[here](https://mapzen.com/documentation/search/).

# Pelias OpenAddresses importer

[![Greenkeeper badge](https://badges.greenkeeper.io/pelias/openaddresses.svg)](https://greenkeeper.io/)

[![Build Status](https://travis-ci.org/pelias/openaddresses.svg?branch=master)](https://travis-ci.org/pelias/openaddresses)

## Overview

The OpenAddresses importer is used to process data from [OpenAddresses](http://openaddresses.io/)
for import into the Pelias geocoder.

## Requirements

Node.js 4 or higher is required.

## Installation
```bash
git clone https://github.com/pelias/openaddresses
cd openaddresses
npm install
```

## Data Download
Use the `imports.openaddresses.files` configuration option to limit the download to just the OpenAddresses files of interest.
Refer to the [OpenAddresses data listing](http://results.openaddresses.io/?runs=all#runs) for file names.

> See the 'Configuration' section below for a more detailed example of how to use `imports.openaddresses.files`.

```bash
npm run download
```

## Usage
```bash
# show full command line options
node import.js --help

# run an import
npm start
```

## Admin Lookup
OpenAddresses records do not contain information about which city, state (or
other region like province), or country they belong to. Pelias has the
ability to compute these values from [Who's on First](http://whosonfirst.mapzen.com/) data.
For more info on how admin lookup works, see the documentation for
[pelias/wof-admin-lookup](https://github.com/pelias/wof-admin-lookup). By default,
adminLookup is enabled. To disable it, set `imports.adminLookup.enabled` to `false` in Pelias config.

**Note:** Admin lookup requires loading around 5GB of data into memory.

## Configuration
This importer is configured using the [`pelias-config`](https://github.com/pelias/config) module, in the `imports.openaddresses`
hash. A sample configuration file might look like:

```javascript
{
  "esclient": {
    "hosts": [
      {
        "env": "development",
        "protocol": "http",
        "host": "localhost",
        "port": 9200
      }
    ]
  },
  "logger": {
    "level": "debug"
  },
  "imports": {
    "whosonfirst": {
      "datapath": "/mnt/data/whosonfirst/",
      "importPostalcodes": false,
      "importVenues": false
    },
    "openaddresses": {
      "datapath": "/mnt/data/openaddresses/",
      "files": [ "us/ny/city_of_new_york.csv" ]
    }
  }
}
```

The following configuration options are supported by this importer:

| key | required | default | description |
| --- | --- | --- | --- |
| `datapath` | yes | | The absolute path of the directory containing OpenAddresses files. Must be specified if no directory is given as a command-line argument. |
| `files` | no | | An array of the names of the files to download/import. If specified, *only* these files will be downloaded and imported, rather than *all* `.csv` files in the given directory. **If the array is empty, all files will be downloaded and imported.** Refer to the [OpenAddresses data listing](http://results.openaddresses.io/?runs=all#runs) for file names. |
| `deduplicate` | no | `false` | Boolean flag to enable deduplication. Deprecated; see [pelias/address-deduplicator](https://github.com/pelias/address-deduplicator) for more info. |
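
The `files` semantics described in the table can be illustrated with a small sketch. The `resolveFileList` helper and the `csvFilesInDir` argument below are hypothetical names chosen for this example; the importer's real logic lives in `lib/parameters`, which is not shown in this part of the diff, and may differ.

```javascript
var path = require('path');

// Hypothetical helper illustrating the documented `files` semantics.
// `csvFilesInDir` stands for whatever .csv names were found under `datapath`.
function resolveFileList(importConfig, csvFilesInDir) {
  var wanted = importConfig.files || [];
  // A non-empty `files` array restricts the import; an empty one means "all".
  var names = wanted.length > 0 ? wanted : csvFilesInDir;
  return names.map(function (name) {
    return path.posix.join(importConfig.datapath, name);
  });
}

var found = ['us/ny/city_of_new_york.csv', 'us/ca/san_francisco.csv'];

var restricted = resolveFileList(
  { datapath: '/mnt/data/openaddresses/', files: ['us/ny/city_of_new_york.csv'] },
  found
);
var everything = resolveFileList(
  { datapath: '/mnt/data/openaddresses/', files: [] },
  found
);

console.log(restricted);        // only the New York file
console.log(everything.length); // both files
```

`path.posix.join` is used so the example prints the same paths on every platform.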

bin/parallel

+21
@@ -0,0 +1,21 @@
#!/bin/bash

# grab the worker count
count=$1

# remove the first argument from the arguments array ($@)
shift

# only do anything if count is a valid integer >= 1
if [[ $count -ge 1 ]]; then
  echo "starting $count parallel builds"

  # spawn $count parallel builds, passing correct params and all arguments
  # ("$@" is quoted so that arguments containing spaces are passed through intact)
  for i in $(seq 0 $((count - 1))); do
    npm start -- --parallel-count "$count" --parallel-id "$i" "$@" &
  done

  # don't let this script finish until all parallel builds have finished
  wait
fi
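
The script only launches the workers; how each worker interprets `--parallel-count` and `--parallel-id` is decided in `lib/parameters`, which is not shown in this part of the diff. Purely as an illustration, one common way to split a file list among N workers is round-robin by index. The `selectFilesForWorker` helper below is hypothetical and is not taken from the importer.

```javascript
// Hypothetical helper: keep every file whose index modulo `count` equals `id`.
function selectFilesForWorker(files, count, id) {
  return files.filter(function (file, index) {
    return index % count === id;
  });
}

var files = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv'];

// Mirrors `bin/parallel 2`: worker ids run from 0 to count - 1.
var worker0 = selectFilesForWorker(files, 2, 0);
var worker1 = selectFilesForWorker(files, 2, 1);

console.log(worker0); // [ 'a.csv', 'c.csv', 'e.csv' ]
console.log(worker1); // [ 'b.csv', 'd.csv' ]
```

With this scheme every file is handled by exactly one worker, and no coordination between processes is needed.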

import.js

+53
@@ -0,0 +1,53 @@
/**
 * @file Entry-point script for the OpenAddresses import pipeline.
 */

'use strict';

var peliasConfig = require( 'pelias-config' ).generate(require('./schema'));

var logger = require( 'pelias-logger' ).get( 'openaddresses' );

var parameters = require( './lib/parameters' );
var importPipeline = require( './lib/importPipeline' );

const adminLookupStream = require('pelias-wof-admin-lookup');
var deduplicatorStream = require('./lib/streams/deduplicatorStream');

var addressDeduplicator = require('pelias-address-deduplicator');


// Pretty-print the total time the import took.
function startTiming() {
  var startTime = new Date().getTime();
  process.on( 'exit', function (){
    // Divide rather than slice the digit string, so that runs shorter than
    // 100ms are still reported correctly (45ms prints as 0.045s, not 4.5s).
    var totalSeconds = ( ( new Date().getTime() - startTime ) / 1000 ).toFixed( 3 );
    logger.info( 'Total time taken: %ss', totalSeconds );
  });
}

var args = parameters.interpretUserArgs( process.argv.slice( 2 ) );

const adminLayers = ['neighbourhood', 'borough', 'locality', 'localadmin',
  'county', 'macrocounty', 'region', 'macroregion', 'dependency', 'country',
  'empire', 'continent'];

if( 'exitCode' in args ){
  ((args.exitCode > 0) ? console.error : console.info)( args.errMessage );
  process.exit( args.exitCode );
} else {
  startTiming();

  if (peliasConfig.imports.openaddresses.hasOwnProperty('adminLookup')) {
    logger.info('imports.openaddresses.adminLookup has been deprecated, ' +
                'enable adminLookup using imports.adminLookup.enabled = true');
  }

  var files = parameters.getFileList(peliasConfig, args);

  var deduplicator = deduplicatorStream.create(peliasConfig, addressDeduplicator);

  importPipeline.create( files, args.dirPath, deduplicator, adminLookupStream.create(adminLayers) );
}

lib/cleanup.js

+25
@@ -0,0 +1,25 @@
var _ = require('lodash');

// Strip leading zeros from ordinal tokens, e.g. '01st' -> '1st'.
// The match is case-sensitive, so only lowercase suffixes are handled.
function removeLeadingZerosFromStreet(token) {
  return token.replace(/^(?:0*)([1-9]\d*(st|nd|rd|th))/,'$1');
}

// Re-capitalize tokens that are entirely uppercase or entirely lowercase;
// mixed-case tokens are left untouched.
function capitalizeProperly(streetname){
  if (streetname.toUpperCase() === streetname || streetname.toLowerCase() === streetname){
    streetname = _.capitalize(streetname);
  }
  return streetname;
}

// Split on whitespace, clean each token, drop empty tokens, and re-join
// with single spaces.
function cleanupStreetName(input) {
  return input.split(/\s/)
    .map(removeLeadingZerosFromStreet)
    .filter(function(part){
      return part.length > 0;
    }).map(capitalizeProperly)
    .join(' ');
}

module.exports = {
  streetName: cleanupStreetName
};
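
To see what `cleanupStreetName` does to typical street values, the following dependency-free sketch restates the steps from `lib/cleanup.js`. It swaps lodash's `_.capitalize` for a small local `capitalize` helper that assumes lodash 4 behaviour (first character uppercased, the rest lowercased), so that the snippet runs without installing anything. The regular expression and the order of operations are copied from the file above; the sample inputs are made up.

```javascript
// Stand-in for lodash 4's _.capitalize: first character upper, rest lower.
function capitalize(word) {
  return word.charAt(0).toUpperCase() + word.slice(1).toLowerCase();
}

function removeLeadingZerosFromStreet(token) {
  return token.replace(/^(?:0*)([1-9]\d*(st|nd|rd|th))/, '$1');
}

function capitalizeProperly(token) {
  var isSingleCase = token.toUpperCase() === token || token.toLowerCase() === token;
  return isSingleCase ? capitalize(token) : token;
}

function cleanupStreetName(input) {
  return input.split(/\s/)
    .map(removeLeadingZerosFromStreet)
    .filter(function (part) { return part.length > 0; })
    .map(capitalizeProperly)
    .join(' ');
}

console.log(cleanupStreetName('W 026th   ST'));  // W 26th St
console.log(cleanupStreetName('main street'));   // Main Street
console.log(cleanupStreetName('McDonald AVE'));  // McDonald Ave
```

Note that repeated whitespace collapses to a single space, and that a mixed-case token such as `McDonald` keeps its original capitalization.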

lib/importPipeline.js

+37
@@ -0,0 +1,37 @@
var logger = require( 'pelias-logger' ).get( 'openaddresses' );
var recordStream = require('./streams/recordStream');
var model = require( 'pelias-model' );
var peliasDbclient = require( 'pelias-dbclient' );
var isUSorCAHouseNumberZero = require( './streams/isUSorCAHouseNumberZero' );

/**
 * Import all OpenAddresses CSV files in a directory into Pelias elasticsearch.
 *
 * @param {array of string} files An array of the absolute file-paths to import.
 * @param {string} dirPath The directory path, passed through to
 *    `recordStream.create`.
 * @param {stream} deduplicatorStream Stream that each address object passes
 *    through for deduplication. See the documentation:
 *    https://github.com/pelias/address-deduplicator-stream
 * @param {stream} adminLookupStream Stream that adds admin values to each
 *    address object (since OpenAddresses doesn't contain any). See the
 *    documentation: https://github.com/pelias/wof-admin-lookup
 * @param {stream} [finalStream] Destination stream; defaults to a new
 *    `pelias-dbclient` stream.
 */
function createFullImportPipeline( files, dirPath, deduplicatorStream, adminLookupStream, finalStream ){ // jshint ignore:line
  logger.info( 'Importing %s files.', files.length );

  finalStream = finalStream || peliasDbclient();

  recordStream.create(files, dirPath)
    .pipe(deduplicatorStream)
    .pipe(adminLookupStream)
    .pipe(isUSorCAHouseNumberZero.create())
    .pipe(model.createDocumentMapperStream())
    .pipe(finalStream);
}

module.exports = {
  create: createFullImportPipeline
};

lib/isValidCsvRecord.js

+46
@@ -0,0 +1,46 @@
var _ = require('lodash');

/*
 * Return true if a record has all of LON, LAT, NUMBER and STREET defined,
 * neither NUMBER nor STREET is a placeholder word, and the coordinates are
 * not on Null Island (0, 0)
 */
function isValidCsvRecord( record ){
  return hasAllProperties(record) &&
          !houseNumberIsExclusionaryWord(record) &&
          !streetContainsExclusionaryWord(record) &&
          !latLonAreOnNullIsland(record);
}

/*
 * Return true if record.NUMBER is literal word 'NULL', 'UNDEFINED',
 * or 'UNAVAILABLE' (case-insensitive)
 */
function houseNumberIsExclusionaryWord(record) {
  return ['NULL', 'UNDEFINED', 'UNAVAILABLE'].indexOf(_.toUpper(record.NUMBER)) !== -1;
}

/*
 * Return true if record.STREET contains literal word 'NULL', 'UNDEFINED',
 * or 'UNAVAILABLE' (case-insensitive)
 */
function streetContainsExclusionaryWord(record) {
  return /\b(NULL|UNDEFINED|UNAVAILABLE)\b/i.test(record.STREET);
}

// returns true when all of LON, LAT, NUMBER and STREET are non-empty
function hasAllProperties(record) {
  return [ 'LON', 'LAT', 'NUMBER', 'STREET' ].every(function(prop) {
    return record[ prop ] && record[ prop ].length > 0;
  });
}

// returns true when LON and LAT are both parseable as 0
// > parseFloat('0');
// 0
// > parseFloat('0.000000');
// 0
// > parseFloat('0.000001');
// 0.000001
function latLonAreOnNullIsland(record) {
  return ['LON', 'LAT'].every(prop => parseFloat(record[prop]) === 0);
}

module.exports = isValidCsvRecord;
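
The validity rules are easier to follow against concrete rows. The sketch below restates the four checks from `lib/isValidCsvRecord.js` without the lodash dependency (`String(...).toUpperCase()` stands in for `_.toUpper`, which is an assumption made for portability) and runs them on hypothetical sample records shaped like parsed OpenAddresses CSV rows.

```javascript
function hasAllProperties(record) {
  return ['LON', 'LAT', 'NUMBER', 'STREET'].every(function (prop) {
    return Boolean(record[prop]) && record[prop].length > 0;
  });
}

function houseNumberIsExclusionaryWord(record) {
  var number = String(record.NUMBER).toUpperCase();
  return ['NULL', 'UNDEFINED', 'UNAVAILABLE'].indexOf(number) !== -1;
}

function streetContainsExclusionaryWord(record) {
  return /\b(NULL|UNDEFINED|UNAVAILABLE)\b/i.test(record.STREET);
}

function latLonAreOnNullIsland(record) {
  return ['LON', 'LAT'].every(function (prop) {
    return parseFloat(record[prop]) === 0;
  });
}

function isValidCsvRecord(record) {
  return hasAllProperties(record) &&
    !houseNumberIsExclusionaryWord(record) &&
    !streetContainsExclusionaryWord(record) &&
    !latLonAreOnNullIsland(record);
}

// Hypothetical sample rows (all values are strings, as read from a CSV).
var samples = [
  { LON: '-73.99', LAT: '40.73', NUMBER: '10', STREET: 'Main St' },        // valid
  { LON: '-73.99', LAT: '40.73', NUMBER: '', STREET: 'Main St' },          // empty NUMBER
  { LON: '-73.99', LAT: '40.73', NUMBER: 'null', STREET: 'Main St' },      // placeholder NUMBER
  { LON: '-73.99', LAT: '40.73', NUMBER: '10', STREET: 'Unavailable Rd' }, // placeholder in STREET
  { LON: '0.000000', LAT: '0', NUMBER: '10', STREET: 'Main St' }           // Null Island
];

var results = samples.map(isValidCsvRecord);
console.log(results); // [ true, false, false, false, false ]
```

A record is rejected for Null Island only when both coordinates parse to zero; a row with a longitude of `0` and a non-zero latitude still passes that check.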
