Technical Assessment

Introduction

Welcome to the Data technical assessment.

The main goal of the exercise is to assess your approach to problem solving and your knowledge of Spark Core.

Brief

The objective of the exercise is to rank a set of words based on their appearance in a set of medical articles.

The articles are stored in a CSV file, one article per line, and look like this:

120,"Occlusion of Middle Esophagus with Extraluminal Device, Open Approach"
121,"Transfer Vagus Nerve to Acoustic Nerve, Open Approach"
122,"Repair Cervicothoracic Vertebral Disc, External Approach"

When the job starts, it loads the data from the file and creates:

val articlesRDD: RDD[MedicalArticle] = MedicalData.readData(sc).persist()

Where the definition of MedicalArticle is as follows:

case class MedicalArticle(id: String, text: String)
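
As a rough illustration of how one CSV line maps onto MedicalArticle, here is a minimal sketch in plain Scala. The helper parseLine is hypothetical and is not part of the provided project; the actual parsing lives in MedicalData.readData, which is not shown here and may behave differently. The sketch assumes every line has the form id,"text" and splits on the first comma only, because the quoted text itself contains commas.

```scala
// Mirrors the case class from the brief.
final case class MedicalArticle(id: String, text: String)

// Hypothetical helper (an assumption, not the project's parser).
// Splits on the FIRST comma only, then strips the surrounding double quotes.
def parseLine(line: String): Option[MedicalArticle] =
  line.split(",", 2) match {
    case Array(id, quoted) =>
      Some(MedicalArticle(id.trim, quoted.trim.stripPrefix("\"").stripSuffix("\"")))
    case _ =>
      None // no comma at all: not a valid article line
  }

val parsed = parseLine("121,\"Transfer Vagus Nerve to Acoustic Nerve, Open Approach\"")
val rejected = parseLine("not-a-valid-line")
```

A production parser would also need to handle escaped quotes inside the text field, which this sketch ignores.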

For the assignments, we're considering the following list of terms to be ranked:

val terms = List("Auditory", "Wrist", "Endoscopic", "Pulmonary", "Drainage")

1st Assignment

Your task is to write a function that returns the list of ids of articles containing any of the terms:

def articleIdsContaining(terms: Seq[String], rdd: RDD[MedicalArticle]): Seq[String] = ???

2nd Assignment

For this assignment, we provide the following functions. You may use them if you find them useful:

/**
  * @return Whether the word is contained in the text
  */
private def textContains(word: String, text: String): Boolean =
  text.split(" ").contains(word)

/**
  * @return Which of the terms provided are contained in a given article
  */
private def findTerms(terms: List[String], article: MedicalArticle): Seq[String] =
  terms.filter(textContains(_, article.text))
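
To make the behaviour of the two helpers concrete, the sketch below copies them (without the private modifier, so they can be called at the top level) and applies them to an invented article. The sample text is an assumption made up for illustration and is not taken from the dataset. Note that the match is an exact, case-sensitive comparison against space-separated tokens, so a token with trailing punctuation such as "Joint," does not match "Joint".

```scala
final case class MedicalArticle(id: String, text: String)

// Copies of the provided helpers, minus `private`.
def textContains(word: String, text: String): Boolean =
  text.split(" ").contains(word)

def findTerms(terms: List[String], article: MedicalArticle): Seq[String] =
  terms.filter(textContains(_, article.text))

val terms = List("Auditory", "Wrist", "Endoscopic", "Pulmonary", "Drainage")

// Invented sample article, for illustration only.
val article =
  MedicalArticle("1", "Drainage of Right Wrist Joint, Percutaneous Endoscopic Approach")

// Results follow the order of `terms`, not the order of words in the text.
val found = findTerms(terms, article)

// The token is "Joint," (with the comma), so the bare word is not found.
val punctuated = textContains("Joint", article.text)

// Matching is case-sensitive.
val lowerCase = textContains("wrist", article.text)
```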

Your task is to write:

A function that creates a reverse index whose key is the term and whose value is the set of articles containing that term.

def createIndex(terms: List[String], rdd: RDD[MedicalArticle]): RDD[(String, Iterable[MedicalArticle])] = ???

The following illustrates what the reverse index should look like:

"Auditory" -> [article_1, article_5, ...]
...
"Drainage" -> [article_6, article_30, article_180, ...]
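
The shape of the reverse index can be sketched on plain Scala collections, without Spark. This is only an analogue of the data structure, under the assumption that pairing each article with its matching terms and then grouping by term is an acceptable reading of the brief; it is not the expected RDD-based solution. The name createLocalIndex and the sample articles are invented for illustration.

```scala
final case class MedicalArticle(id: String, text: String)

// Same logic as the provided findTerms helper, inlined for self-containment.
def findTerms(terms: List[String], article: MedicalArticle): Seq[String] =
  terms.filter(term => article.text.split(" ").contains(term))

// Hypothetical local analogue of the reverse index: term -> articles containing it.
def createLocalIndex(
    terms: List[String],
    articles: Seq[MedicalArticle]
): Map[String, Seq[MedicalArticle]] =
  articles
    .flatMap(article => findTerms(terms, article).map(term => term -> article))
    .groupBy { case (term, _) => term }
    .map { case (term, pairs) => term -> pairs.map { case (_, article) => article } }

// Invented sample data.
val articles = Seq(
  MedicalArticle("a1", "Drainage of Left Wrist"),
  MedicalArticle("a2", "Endoscopic Drainage"),
  MedicalArticle("a3", "Repair of Knee")
)

val index = createLocalIndex(List("Wrist", "Drainage", "Pulmonary"), articles)
```

In this sketch a term that appears in no article ("Pulmonary" here) is simply absent from the index rather than mapped to an empty collection.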

A function that ranks the terms using the previous reverse index.

def rankTermsUsingIndex(index: RDD[(String, Iterable[MedicalArticle])]): List[(String, Int)] = ???

The following illustrates what the rank should look like:

List((Endoscopic,298), (Drainage,113), (Pulmonary,14), (Wrist,8), (Auditory,6))
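
The example output suggests that each term is paired with the number of articles containing it, sorted in descending order. Under that assumption, the ranking step can be sketched on a local Map standing in for the index. The name rankLocalIndex and the sample index are hypothetical; the real function operates on an RDD.

```scala
final case class MedicalArticle(id: String, text: String)

// Hypothetical local analogue: count the articles per term, highest count first.
def rankLocalIndex(index: Map[String, Iterable[MedicalArticle]]): List[(String, Int)] =
  index.toList
    .map { case (term, articles) => term -> articles.size }
    .sortBy { case (_, count) => -count }

// Invented sample index.
val sampleIndex: Map[String, Seq[MedicalArticle]] = Map(
  "Wrist" -> Seq(MedicalArticle("a1", "Drainage of Left Wrist")),
  "Drainage" -> Seq(
    MedicalArticle("a1", "Drainage of Left Wrist"),
    MedicalArticle("a2", "Endoscopic Drainage")
  )
)

val ranked = rankLocalIndex(sampleIndex)
```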

3rd Assignment

Your last task is to provide an optimized implementation for ranking the terms, especially when it comes to large datasets. You'll also provide a justification for your solution.

def rankTermsOptimized(terms: List[String], rdd: RDD[MedicalArticle]): List[(String, Int)] = ???

addisonglobal/data-technical-test

Folders and files

Latest commit

History

Repository files navigation

Technical Assesement

Introduction

Brief

1st Assignment

2nd Assignment

3rd Assignment

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages