diff --git a/Code/01_Data_Acquisition_and_Understanding/1_Download_and_Parse_Medline_Abstracts.ipynb b/Code/01_Data_Acquisition_and_Understanding/1_Download_and_Parse_Medline_Abstracts.ipynb index edd391f..748f31a 100644 --- a/Code/01_Data_Acquisition_and_Understanding/1_Download_and_Parse_Medline_Abstracts.ipynb +++ b/Code/01_Data_Acquisition_and_Understanding/1_Download_and_Parse_Medline_Abstracts.ipynb @@ -5,12 +5,12 @@ "metadata": {}, "source": [ "## Download and Parse MEDLINE Abstracts\n", - "This Notebook describes the way you can download and parse the publically available Medline Abstracts. There are about 812 XML files that are available on the ftp server. Each XML file conatins about 30,000 Document Abstracts.\n", + "This notebook shows how you can download and parse the publicly available Medline abstracts. There are about 812 XML files that are available on the ftp server. Each XML file contains about 30,000 Document Abstracts.\n", "\n", - "
Note: This Notebook is meant to be run on a Spark Cluster. If you are running it through a jupyter notebbok, make sure to use the PySpark Kernel." + "
Note: This notebook is meant to be run on a Spark Cluster. If you are running it through a Jupyter notebook, make sure to use the PySpark Kernel." ] }, { @@ -18,12 +18,13 @@ "metadata": {}, "source": [ "#### Using the Parser \n", - "Download and install the pubmed_parser library into the spark cluster nodes. You can us the egg file available in the repo or produce the .egg file by running
\n", + "Download and install the pubmed_parser library into the spark cluster nodes. You can use the egg file available in the repo or produce the .egg file by running
\n", "python setup.py bdist_egg
\n", - "in repository and add import for it. The egg file file can be read from the blob storage. Once you have the egg file ready you can put it in the container associated with your spark cluster.\n", + "in repository and add import for it. The egg file can be read from the blob storage. Once you have the egg file ready you can put it in the container associated with your spark cluster.\n", "
\n", + "**AT** I DO NOT SEE THIS .EGG FILE IN THE REPO\n", "\n", - "#### Installing a additional packages on Spark Nodes\n", + "#### Installing additional packages on Spark Nodes\n", "To install additional packages you need to use script action from the azure portal. see this
\n", "Here's an example:\n", "
To install unidecode, you can use script action (on your Spark Cluster)\n", @@ -132,7 +133,7 @@ "metadata": {}, "source": [ " Parse the XMLs and save them as a Tab separated File
\n", - "There are a total of 812 XML files. It would take time for downloading that much data. Its advisable to do it in batches of 50.\n", + "There are a total of 812 XML files. It will take some time to download this much data. It's advisable to do it in batches of 50.\n", "Downloading and parsing 1 file takes approximately 25-30 seconds. " ] }, @@ -192,9 +193,9 @@ "metadata": { "anaconda-cloud": {}, "kernelspec": { - "display_name": "NLP_DL_EntityRecognition local", + "display_name": "Python 3", "language": "python", - "name": "nlp_dl_entityrecognition_local" + "name": "python3" }, "language_info": { "codemirror_mode": { diff --git a/Code/01_Data_Acquisition_and_Understanding/ReadMe.md b/Code/01_Data_Acquisition_and_Understanding/ReadMe.md index a3218f0..9c11ca0 100644 --- a/Code/01_Data_Acquisition_and_Understanding/ReadMe.md +++ b/Code/01_Data_Acquisition_and_Understanding/ReadMe.md @@ -1,13 +1,10 @@ ## **Data Preparation** -This section describes how to download the [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html) abstracts from the Website using **Spark**. We are using HDInsight Spark 2.1 on Linux (HDI 3.6). -The FTP server for Medline has about 812 XML files where each file contains about 30000 abstracts. Below you can see the fields present in the XML files. We are currently using the Abstracts extracted from -the XML files to train the word embedding model +This section describes how to download the [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html) abstracts from the website using **Spark**. We are using HDInsight Spark 2.1 on Linux (HDI 3.6). +The FTP server for Medline has about 812 XML files where each file contains about 30,000 abstracts. Below you can see the fields present in the XML files. We use the Abstracts extracted from the XML files to train the word embedding model. 
### [Downloading and Parsing Medline Abstracts](1_Download_and_Parse_Medline_Abstracts.ipynb) -The [Notebook]((1_Download_and_Parse_Medline_Abstracts.ipynb) describes how to download the local drive of the head node of the Spark cluster. Since the data is big (about 30 Gb), it might take a while to download. We parse the XML files as we download them. We are using a publicly available -[Pubmed Parser](https://github.com/titipata/pubmed_parser) to parse the downloaded XMLs and saving them in a tab separated file (TSV). The parsed XMLs are stored in a local folder on the head node (you can change this by specifying a different location in -the Notebook). The parse XML returns the following fields: +The [Notebook]((1_Download_and_Parse_Medline_Abstracts.ipynb) **AT** LINK DOESN'T WORK describes how to download the local drive of the head node of the Spark cluster. Since the data is big (about 30 Gb), it might take a while to download. We parse the XML files as we download them. We use a publicly available [Pubmed Parser](https://github.com/titipata/pubmed_parser) to parse the downloaded XMLs and save them in a tab separated file (TSV). The parsed XMLs are stored in a local folder on the head node (you can change this by specifying a different location in the notebook). The parse XML returns the following fields: abstract affiliation @@ -28,11 +25,13 @@ the Notebook). The parse XML returns the following fields: title **Notes**: -- There are more that 800 XML files that are present on the Medline ftp server. The code in the Notebook downloads them all. But you can change that in the last cell of the notebook (e.g. download only a subset by reducing the counter). -- With using Tab separated files: The Pubmed parser add a new line for every affiliation. This may cause the TSV files to become unstructured. To **avoid** this we explicitly remove the new line from the affiliation field. +- There are more that 800 XML files that are present on the Medline ftp server. 
The code in the notebook downloads them all. But you can change that in the last cell of the notebook (e.g. download only a subset by reducing the counter). +- With using Tab separated files: The Pubmed parser adds a new line for every affiliation. This may cause the TSV files to become unstructured. To **avoid** this we explicitly remove the new line from the affiliation field. - To install unidecode, you can use [script action](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux) on your Spark Cluster. Add the following lines to your script file (.sh). You can install other dependencies in a similar way + #!/usr/bin/env bash /usr/bin/anaconda/bin/conda install unidecode -- The egg file needed to run the Pubmed Parser is also included in the repository. + +- The egg file needed to run the Pubmed Parser is also included in the repository. **AT** I DO NOT SEE THIS .EGG FILE IN THE REPO diff --git a/Code/02_Modeling/01_FeatureEngineering/2_Train_Word2Vec.ipynb b/Code/02_Modeling/01_FeatureEngineering/2_Train_Word2Vec.ipynb index a6497e2..69382b3 100644 --- a/Code/02_Modeling/01_FeatureEngineering/2_Train_Word2Vec.ipynb +++ b/Code/02_Modeling/01_FeatureEngineering/2_Train_Word2Vec.ipynb @@ -5,17 +5,17 @@ "metadata": {}, "source": [ "## Train, Evaluate and Visualize the Word Embeddings\n", - "In this Notebook we detail the process of how to train a word2vec model on the Medline Abstracts and obtaining the word embeddings for the biomedical terms. This is the first step towards building an Entity Extractor. We are using the Spark's MLLib package to train the word embedding model. We also show how you can test the quality of the embeddings by an intrinsic evaluation task along with visualization.\n", + "In this notebook we detail the process of how to train a word2vec model on the Medline abstracts and obtaining the word embeddings for the biomedical terms. This is the first step towards building an entity extractor. 
We are using Spark's MLlib package to train the word embedding model. We also show how you can test the quality of the embeddings by an intrinsic evaluation task along with visualization.\n", "
\n", - "The Word Embeddings obtained from spark are stored in parquet files with gzip compression. In the next notebook we show how you can easily extract the word embeddings form these parquet files and visualize them in any tool of your choice.\n", + "The word embeddings obtained from spark are stored in parquet files with gzip compression. In the next notebook we show how you can easily extract the word embeddings from these parquet files and visualize them in any tool of your choice.\n", "\n", "
This notebook is divided into several sections, details of which are presented below.\n", "
    \n", "
  1. Load the data into the dataframe and preprocess it.
  2. \n", - "
  3. Tokenize the data into words and train Word2Vec Model
  4. \n", - "
  5. Evaluate the Quality of the Word Embeddings by comparing the Spearman Correlation on a Human Annoted Dataset
  6. \n", - "
  7. Use PCA to reduce the dimension of the Embeddings to 2 for Visualization
  8. \n", - "
  9. Use t-SNE incombination with PCA to improve the Quality of the Visualizations
  10. \n", + "
  11. Tokenize the data into words and train Word2Vec model
  12. \n", + "
  13. Evaluate the quality of the word embeddings by comparing the Spearman Correlation on a human annotated dataset
  14. \n", + "
  15. Use PCA to reduce the dimension of the embeddings to 2 for visualization
  16. \n", + "
  17. Use t-SNE in combination with PCA to improve the quality of the visualizations
  18. \n", "
" ] }, @@ -262,7 +262,7 @@ "metadata": {}, "source": [ " Filter the Abstracts based on the dictionary loaded above\n", - "
This will help to filter out abstracts that donot contain words that you care about\n", + "
This will help to filter out abstracts that do not contain words that you care about\n", "
Its an optional preprocessing step" ] }, @@ -525,7 +525,7 @@ "\n", "
For the cell below we are using a cluster size of 4 worker nodes each with 4 cores. " ] @@ -582,7 +582,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " Manually Evaluate Similar words by getting nearest neighbours " + " Manually evaluate similar words by getting nearest neighbours " ] }, { @@ -1207,7 +1207,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " For a better Visualization we use PCA + t-SNE\n", + " For a better visualization we use PCA + t-SNE\n", "
We first use PCA to reduce the dimensions to 45, then pick 15000 word vectors and apply t-SNE on them. We use the same word list for visualization as used in PCA (above) to see the differences between the 2 visualizations" ] }, @@ -1410,9 +1410,9 @@ "metadata": { "anaconda-cloud": {}, "kernelspec": { - "display_name": "NLP_DL_EntityRecognition local", + "display_name": "Python 3", "language": "python", - "name": "nlp_dl_entityrecognition_local" + "name": "python3" }, "language_info": { "codemirror_mode": { diff --git a/Code/02_Modeling/01_FeatureEngineering/ReadMe.md b/Code/02_Modeling/01_FeatureEngineering/ReadMe.md index 22b5508..6b61577 100644 --- a/Code/02_Modeling/01_FeatureEngineering/ReadMe.md +++ b/Code/02_Modeling/01_FeatureEngineering/ReadMe.md @@ -1,61 +1,54 @@ ## [Train Word2Vec Word Embedding Model](2_Train_Word2Vec.ipynb) -This [Notebook](2_Train_Word2Vec.ipynb) covers how you can train, evaluate and visualize word embeddings using **[Word2Vec](https://arxiv.org/pdf/1301.3781.pdf)** implementaion from **[MLLib](https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec)** -in **Spark**. The MLLib function for word2vec is based on a continuos skip-gram model that tries to predict the context words given a word. To optimse the performance this implementation uses hierarchical softmax. H-SoftMax -essentially replaces the flat SoftMax layer with a hierarchical layer that has the words as leaves. This allows us to decompose calculating the probability of one word into a sequence of probability calculations, -which saves us from having to calculate the expensive normalization over all words. The algorithm has several hyper-parameters which can be tuned to obtain better performance. These are windowSize, vectorSize etc. (We -define the meaning of each parameter in step 5). results for hyper parameter tuning are present in the end. Let's begin extracting word embeddings for Bio-medical terms. 
+This [notebook](2_Train_Word2Vec.ipynb) covers how you can train, evaluate and visualize word embeddings using **[Word2Vec](https://arxiv.org/pdf/1301.3781.pdf)** implementation from **[MLLib](https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec)** +in **Spark**. The MLLib function for word2vec is based on a continuous skip-gram model that tries to predict the context words given a word. To optimize the performance this implementation uses hierarchical softmax. H-SoftMax essentially replaces the flat SoftMax layer with a hierarchical layer that has the words as leaves. This allows us to decompose calculating the probability of one word into a sequence of probability calculations, which saves us from having to calculate the expensive normalization over all words. The algorithm has several hyper-parameters which can be tuned to obtain better performance. These are windowSize, vectorSize etc. (We define the meaning of each parameter in step 5). Results for hyper parameter tuning are presented at the end. Let's begin extracting word embeddings for Bio-medical terms. -We are going to use the XML files we downloaded in [section 1](../../01_DataPreparation/1_Download_and_Parse_Medline_Abstracts.ipynb) +We are going to use the XML files we downloaded in [section 1](../../01_DataPreparation/1_Download_and_Parse_Medline_Abstracts.ipynb). - Step 1: Import the required libraries and point the path to the directory where you downloaded the XML files. - Step 2: Combine all the data from these XMLs into a single dataframe and store this new dataframe in the Parquet format so that we can read it quickly in the next runs. This may take about 15-30 mins depending on the size of your spark cluster. -- Step 3: (Optional) If you have a dictionary of Words that you care about and want to exclude any abstracts that do not contain words from this dictionary you can uncomment the filtering code. 
+- Step 3: (Optional) If you have a dictionary of words that you care about and want to exclude any abstracts that do not contain words from this dictionary you can uncomment the filtering code. - Step 4: Do some basic preprocessing on the abstracts and load test sets for evaluation. - Step 5: Train Word2Vec model using the Continuous Skip-gram Model (We use the MLlib Word2Vec function for this). Set the parameters of the model based on your requirements - windowSize (number of words of the left and right eg. window size of 2 means 2 words to the left and 2 to the right of the current word.) - vectorSize (size of the embeddings), - minCount (minium number of occurences of a word to be included in the output) - numPartitions (number of partitions used for training, keep a small number for accuracy) + - windowSize (number of words of the left and right eg. window size of 2 means 2 words to the left and 2 to the right of the current word.) + - vectorSize (size of the embeddings) + - minCount (minium number of occurences of a word to be included in the output) + - numPartitions (number of partitions used for training, keep a small number for accuracy) Note: The time it takes to run the Word2Vec depends on the parameters you specify as well as the size of your cluster. -- Step 6: Intrinsic Evaluation of the Word Vectors. We evaluate the cosine similarity between the vectors of the pair of terms. Each pair is annotated for similarity by humans. The score given is on an arbitary scale. Hence, to compare we +- Step 6: Intrinsic evaluation of the word vectors. We evaluate the cosine similarity between the vectors of the pair of terms. Each pair is annotated for similarity by humans. The score given is on an arbitary scale. Hence, to compare we use the [Spearman Correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) between the scores predicted by the model and those assigned by humans. 
-- Step 7: Save the Word Embeddings in Parquet format +- Step 7: Save the word embeddings in Parquet format -- Step 8: Visualization with PCA. We split the word embeddings into columns of a dataframe, save the embeddings and apply PCA on it. We then plot some words (We plot a few terms related to Pnemonia and a few terms related to Colon Cancer) - and see the clusters formed (Ideally similar Words should be closer to each other) - We can also save these embeddings in a TSV format and plot it using tensorboard or [Projector for Tensorflow](http://projector.tensorflow.org/). Refer to the first section of the [next notebook](../../02_Modeling/02_ModelCreation/3_Training_Neural_Entity_Extractor_Pubmed.ipynb) to see how you can extract the word embeddings from parquet files. +- Step 8: Visualization with PCA. We split the word embeddings into columns of a dataframe, save the embeddings and apply PCA on it. We then plot some words (we plot a few terms related to Pnemonia and a few terms related to Colon Cancer) and see the clusters formed (ideally similar words should be closer to each other). We can also save these embeddings in a TSV format and plot it using tensorboard or [Projector for Tensorflow](http://projector.tensorflow.org/). Refer to the first section of the [next notebook](../../02_Modeling/02_ModelCreation/3_Training_Neural_Entity_Extractor_Pubmed.ipynb) to see how you can extract the word embeddings from parquet files. -- Step 9: Visualization using t-SNE and PCA. t-SNE helps to get better visualizations for the terms, but since it is computationally expensive we first use PCA to scale the vector size down to 45 and then select 15000 words and apply t-SNE on +- Step 9: Visualization using t-SNE and PCA. t-SNE helps to get better visualizations for the terms, but since it is computationally expensive we first use PCA to scale the vector size down to 45 and then select 15,000 words and apply t-SNE on this reduced word set. 
We notice that this results in better visualization when compared to just PCA. #### Notes: -- Hyper-Parameter Tuning: Showing the effect of hyper-parameters on the performance of intrinsic task. +- Hyper-Parameter Tuning: Showing the effect of hyper-parameters on the performance of an intrinsic task. - From the below table we see that having a higher window size is better for the intrinsic evaluation since it helps to get more context about the word but in reality it may - actually be very noisy to keep a larger context. We also see having a larger mincount is giving better results but it is also decreasing the coverage over the test set, hence its advisable to keep micount as a low number. - The numbers here are reported on a spark cluster having 11 worker nodes with each worker node with 4 cores. The runtime for most of the evaluations is under 30 mins because of the large number of partitions. If speed - is the main concern then number of partitions should be as high as possible (but less than the total number of cores), however if the concer is accuracy the number of partitions should be a lower number (it will take more time). +From the below table we see that having a higher window size is better for the intrinsic evaluation since it helps to get more context about the word but in reality it may + actually be very noisy to keep a larger context. We also see having a larger mincount gives better results but it also decreases the coverage over the test set, hence its advisable to keep micount as a low number. + The numbers here are reported on a spark cluster having 11 worker nodes with each worker node having 4 cores. The runtime for most of the evaluations is under 30 mins because of the large number of partitions. If speed is the main concern then the number of partitions should be as high as possible (but less than the total number of cores), however if the concern is accuracy the number of partitions should be a lower number (it will take more time).       
      ![Sample Evaluation](../../../Images/Evaluations.png) - Intrinsic Task Vs Extrinsic Task - We noticed that parameters which are good for Intrinsic Evaluation are not the best for Extrinsic Evaluation (Neural Entity Extraction). In particular, keeping a large window size helped intrinsic evaluation but not - extrinsic. + We noticed that parameters which are good for intrinsic evaluation are not the best for extrinsic evaluation (Neural Entity Extraction). In particular, keeping a large window size helped intrinsic evaluation but not extrinsic. - PCA + t-SNE - Visualization through PCA was not very effective, primarily because of loss of information when we scal down the vectors to 2 dimensions. We found t-SNE to be much better in terms of visualization. [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) at its core + Visualization through PCA was not very effective, primarily because of loss of information when we scale down the vectors to 2 dimensions. We found t-SNE to be much better in terms of visualization. [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) at its core works in two steps. First it creates a probability distribution over the pairs in the higher dimensional space in a way that similar objects have a high probability of being picked and dissimilar points have very low probability of getting picked. Next, it defines a similar probability distribution over the points in a low dimensional map and minimizes the KL Divergence between the two distributions with respect to location of points on the map. 
diff --git a/Code/02_Modeling/02_ModelCreation/3_Training_Neural_Entity_Extractor_Pubmed.ipynb b/Code/02_Modeling/02_ModelCreation/3_Training_Neural_Entity_Extractor_Pubmed.ipynb index c9604ba..5732f67 100644 --- a/Code/02_Modeling/02_ModelCreation/3_Training_Neural_Entity_Extractor_Pubmed.ipynb +++ b/Code/02_Modeling/02_ModelCreation/3_Training_Neural_Entity_Extractor_Pubmed.ipynb @@ -711,13 +711,7 @@ "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00486-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00487-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00488-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", - "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00489-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00489-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00490-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00491-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", "reading C:\\dl4nlp\\Models/word2vec_pubmed_model_vs_50_ws_5_mc_400_parquet_files\\part-00492-858f31b5-52a6-49cf-b541-9fc814fb1662.gz.parquet\n", @@ -1395,7 +1389,9 @@ { "cell_type": "code", "execution_count": 28, - "metadata": {}, + "metadata": { + "collapsed": true + }, "outputs": [], "source": [ "# %%writefile Data_Preparation2.py\n", @@ -1784,7 +1780,9 @@ { "cell_type": "code", "execution_count": 29, - "metadata": {}, + "metadata": { + "collapsed": true + }, "outputs": [], "source": [ "# 
%%writefile Entity_Extractor.py\n", @@ -2293,9 +2291,9 @@ ], "metadata": { "kernelspec": { - "display_name": "NLP_DL_EntityRecognition local", + "display_name": "Python 3", "language": "python", - "name": "nlp_dl_entityrecognition_local" + "name": "python3" }, "language_info": { "codemirror_mode": { diff --git a/Code/02_Modeling/02_ModelCreation/ReadMe.md b/Code/02_Modeling/02_ModelCreation/ReadMe.md index 4d9d525..133b000 100644 --- a/Code/02_Modeling/02_ModelCreation/ReadMe.md +++ b/Code/02_Modeling/02_ModelCreation/ReadMe.md @@ -1,32 +1,32 @@ ## [Training a Neural Entity Detector using Pubmed Word Embeddings](3_Training_Neural_Entity_Extractor_Pubmed.ipynb) -This [Notebook](3_Training_Neural_Entity_Extractor_Pubmed.ipynb) describes how you can use [Keras](https://keras.io/) with [Tensorflow](https://www.tensorflow.org/) backend to train a Deep Neural Network for Entity Recognition. We demonstrate how we can use the Word Embeddings generated previously to initialize the Embedding layer -of the Deep Neural Network. The task at hand is to identity Drugs and Diseases from a given text. We are using an auto labeled dataset which is the combination of Semeval 2013 - Task 9.1 (Drug Recognition) and BioCreative V CDR task corpus. +This [notebook](3_Training_Neural_Entity_Extractor_Pubmed.ipynb) describes how you can use [Keras](https://keras.io/) with [Tensorflow](https://www.tensorflow.org/) backend to train a deep neural network for entity recognition. We demonstrate how we can use the word embeddings generated previously to initialize the embedding layer +of the deep neural network. The task at hand is to identity Drugs and Diseases from a given text. We are using an auto labeled dataset which is the combination of Semeval 2013 - Task 9.1 (Drug Recognition) and BioCreative V CDR task corpus. 
-This Notebook uses a [Linux Data Science VM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-linux-dsvm-intro) by Azure which has a single GPU, its [NC6](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series). -Before, proceeding forward make sure you have the word embeddings model trained. You can refer to this [notebook](../01_FeatureEngineering/2_Train_Word2Vec.ipynb) +This notebook uses a [Linux Data Science VM](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-linux-dsvm-intro) on Azure which has a single GPU. The [NC6](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series) DSVM is used. +Before, proceeding make sure you have the word embeddings model trained. You can refer to this [notebook](../01_FeatureEngineering/2_Train_Word2Vec.ipynb) to see how to train your own word embedding model for the Bio-Medical domain using Word2Vec on Spark. -Once you have the Embedding model ready you can start working on training the Neural Network for Entity Recognition. +Once you have the Embedding model ready you can start working on training the neural network for entity recognition. -**Step 1**: Copy the dataset and evaluation scripts to correct locations and Read Word Embeddings from Parquet Files: +**Step 1**: Copy the dataset and evaluation scripts to correct locations and read word embeddings from Parquet files: We are using the [fastparquet package](https://pypi.python.org/pypi/fastparquet) as a way to read the embeddings from the parquet files and load them to a Pandas dataframe. You can then store this embedding matrix -in any format and use it either as a TSV to visualize using the [Projector for Tensorflow](http://projector.tensorflow.org/) or use as a lookup table for downstream Deep Learning Tasks. The code provides a way to download the files from the Blob connected to your Spark Cluster. 
+in any format and use it either as a TSV to visualize using the [Projector for Tensorflow](http://projector.tensorflow.org/) or use as a lookup table for downstream deep learning tasks. The code provides a way to download the files from the blob connected to your Spark Cluster. for more information about blob storage see [this](https://docs.microsoft.com/en-us/azure/storage/storage-dotnet-how-to-use-blobs) **Step 2**: Prepare the data for training and testing in a format that is suitable for Keras. The read_and_parse_data function is the one that does that. - - It first reads the word embeddings and a word_to_index_map mapping each word in the embeddings to an index. It also creates a list where each item id refers to the the word vector corresponding to that index and hence to its word. - - Next it reads the training and testing data and line by line and appends a sentence to a list. It also creates one-hot vectors for each of the classes of Tags (like B-Disease, I-Disease, B-Drug etc.) + - It first reads the word embeddings and a word_to_index_map which maps each word in the embeddings to an index. It also creates a list where each item ID refers to the the word vector corresponding to that index and hence to its word. + - Next it reads the training and testing data line by line and appends a sentence to a list. It also creates one-hot vectors for each of the classes of tags (like B-Disease, I-Disease, B-Drug etc.) - Once the list of sentences is ready, its time now to replace each word with its index from the above map. If we find a word which is not present in our vocabulary we replace the word by the token "UNK". - To generate the vector for "UNK" we sample a random vector, which has the same dimension as our embeddings, from a Normal Distribution. Since the number of words in each sentence might differ, we pad each sequence - to make sure that they have the same length. We add an additional tag "NONE" for each of the padded term. 
We also associate a zero vector with the paddings. The final shape of the train and test data should be + To generate the vector for "UNK" we sample a random vector, which has the same dimension as our embeddings, from a normal distribution. Since the number of words in each sentence might differ, we pad each sequence + to make sure that they have the same length. We add an additional tag "NONE" for each of the padded terms. We also associate a zero vector with the paddings. The final shape of the train and test data should be (number of samples, max_sequence_length). This is the shape that can be fed to the [Embedding Layer](https://keras.io/layers/embeddings/) in Keras. Once we have this shape for our dataset we are ready for training - our Neural Network (but first lets create one). + our neural network (but first lets create one). - **Step 3**: This step decribes how to create a Deep Neural Network in Keras. We first create a [sequential](https://keras.io/getting-started/sequential-model-guide/) model for our Neural Network. - We start by adding an [Embedding Layer](https://keras.io/layers/embeddings/) to our model and specify the input shape as created above. We load our pre-trained Embeddings for the weights of this layer and set the *trainable* flag as False since we do not want - to update the Embeddings (but this can change). Next we add a [Bi-Directional LSTM layer](https://keras.io/layers/wrappers/#bidirectional). We add a [dropout layer](https://keras.io/layers/core/#dropout). + **Step 3**: This step decribes how to create a deep neural network in Keras. We first create a [sequential](https://keras.io/getting-started/sequential-model-guide/) model for our neural network. + We start by adding an [embedding layer](https://keras.io/layers/embeddings/) to our model and specify the input shape as created above. 
We load our pre-trained embeddings for the weights of this layer and set the *trainable* flag as False since we do not want + to update the embeddings (but this can change). Next we add a [Bi-Directional LSTM layer](https://keras.io/layers/wrappers/#bidirectional). We add a [dropout layer](https://keras.io/layers/core/#dropout). We repeat the previous step once again. Finaly, we add a [TimeDistributed Dense Layer](https://keras.io/layers/wrappers/#timedistributed). This layer is responsible for generating predictions for each word in the sentence. Our model looks like this @@ -41,19 +41,17 @@ for more information about blob storage see [this](https://docs.microsoft.com/en We optimize the [categorical_crossentropy](https://keras.io/losses/#categorical_crossentropy) loss and are using the [Adam](https://keras.io/optimizers/#adam) optimizer. -**Step 4**: Now we have the data ready and the Neural Network Built, so lets put them together and start the training. This step shows how to call the previously -defined functions. We specify the paths of the training and the test files along with some parameters like +**Step 4**: Now we have the data ready and have created the neural network. We can now put them together and start the training. This step shows how to call the previously defined functions. We specify the paths of the training and the test files along with some parameters like - vector size: this is the length of a word vector (50 for us). - classes: this is the number of unique classes in the training and test sets (eg of a class would be B-Chemical, I-Disease, O etc.) - seq_length: this is the max_sequence_length found above - layers: number of LSTM layers you want to have - epochs: number of epochs you would like to do for Neural Network Training +- vector size: this is the length of a word vector (50 for us). +- classes: this is the number of unique classes in the training and test sets (eg of a class would be B-Chemical, I-Disease, O etc.) 
+- seq_length: this is the max_sequence_length found above +- layers: number of LSTM layers you want to have +- epochs: number of epochs you would like to do for neural network training -Once these are set, the model we start to train. Next step would be to obtain model predictions on the test set and evaluate the performance of the model. +Once these are set, we start to train the model. The next step would be to obtain model predictions on the test set and evaluate the performance of the model. -**Step 5**: We store the predictions obtained from the previous step into a text file. The first step here will to combine that output and obtain a file in the -following format +**Step 5**: We store the predictions obtained from the previous step into a text file. The first step here will be to combine that output and obtain a file in the following format Word1 Tag1 Word2 Tag2 diff --git a/Code/02_Modeling/03_ModelEvaluation/ReadMe.md b/Code/02_Modeling/03_ModelEvaluation/ReadMe.md index 2700922..03701be 100644 --- a/Code/02_Modeling/03_ModelEvaluation/ReadMe.md +++ b/Code/02_Modeling/03_ModelEvaluation/ReadMe.md @@ -2,15 +2,14 @@ - **Compare the performance of the model against an entity extractor trained on Google News Vectors** -Google News Vectors are trained on Google News Data. For each word they have a 300-Dimensional vector. These are available [online](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). We wanted to compare the embeddings trained on a specific domain(Bio-Medical) -against a model trained with embeddings from a general domain like News, to evaluate if a domain specific model achieves higher performance. Our results show that a domain specific model indeed achives -higher performance. The [Notebook](4_b_Test_Model_trained_on_Google_News_Embeddings.ipynb) shows the process of replicating that result. +Google News Vectors are trained on Google News Data. Each word has a 300-Dimensional vector. 
These are available [online](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). We wanted to compare the embeddings trained on a specific domain (Bio-Medical) +against a model trained with embeddings from a general domain like news, to evaluate if a domain-specific model achieves higher performance. Our results show that a domain-specific model indeed achieves +higher performance. The [notebook](4_b_Test_Model_trained_on_Google_News_Embeddings.ipynb) shows the process of replicating that result. -- **Comparison between Uni-Directional LSTM trained with Pubmed Embeddings and Uni-Directional LSTM trained with Google Embeddings usige Keras with CNTK backend** +- **Comparison between Uni-Directional LSTM trained with Pubmed Embeddings and Uni-Directional LSTM trained with Google Embeddings using Keras with CNTK backend** -The comparison between the performance of using a model with Pubmed Embedding against Google News embeddings with CNTK helped us to benchmark the performance of CNTK. The [Notebook 1](4_c_UniDirectional_LSTM_using_Pubmed_Embedding_with_CNTK_Backend.ipynb) -and [Notebook 2](4_d_UniDirectional_LSTM_using_Google_Embedding_with_CNTK_Backend.ipynb) demonstrates the procedure followed to implement a Uni-Directional LSTM Layer in Keras. The performance is reported in the notebooks and we see that Pubmed Embeddings out perform -the general embeddings. We are using the Bio-Creative 2 [Gene Mention Identification task](http://www.biocreative.org/tasks/biocreative-i/first-task-gm/) dataset here. +The comparison between the performance of using a model with Pubmed embedding against Google News embeddings with CNTK helped us to benchmark the performance of CNTK. The [notebook 1](4_c_UniDirectional_LSTM_using_Pubmed_Embedding_with_CNTK_Backend.ipynb) +and [notebook 2](4_d_UniDirectional_LSTM_using_Google_Embedding_with_CNTK_Backend.ipynb) demonstrates the procedure followed to implement a Uni-Directional LSTM Layer in Keras. 
The performance is reported in the notebooks and we see that Pubmed embeddings outperform the general embeddings. We are using the Bio-Creative 2 [Gene Mention Identification task](http://www.biocreative.org/tasks/biocreative-i/first-task-gm/) dataset here. Note: We are using Uni-directional LSTM layers since Keras with CNTK backend did not support "reverse" at the time this work was done. Install CNTK 2.0 for Keras from [here](https://docs.microsoft.com/en-us/cognitive-toolkit/using-cntk-with-keras) @@ -18,16 +17,13 @@ Install CNTK 2.0 for Keras from [here](https://docs.microsoft.com/en-us/cognitiv - **Comparison between Uni-Directional LSTM using CNTK and Tensorflow backends** -Here we are comparing the perfomance of training a model with Pubmed Embeddings using Keras with CNTK and Tensorflow as the backends. The [Notebook 1](4_e_Pubmed_BC5_UniDirectional_LSTM_with_CNTK_Backend.ipynb) -and [Notebook 2](4_f_Pubmed_BC5_UniDirectional_LSTM_with_Tensorflow_Backend.ipynb) show the implementations and the results. We find that CNTK is faster than Tensorflow in terms of training time and both achieve -similar over all F1 Scores 61.66 on 5343 correctly identified entities for CNTK and 60.5 on 4973 correctly identified entities for Tensorflow. We are using the Bio Creative 5 +Here we are comparing the performance of training a model with Pubmed embeddings using Keras with CNTK and Tensorflow as the backends. The [notebook 1](4_e_Pubmed_BC5_UniDirectional_LSTM_with_CNTK_Backend.ipynb) +and [notebook 2](4_f_Pubmed_BC5_UniDirectional_LSTM_with_Tensorflow_Backend.ipynb) show the implementations and the results. We find that CNTK is faster than Tensorflow in terms of training time and both achieve similar overall F1 scores: 61.66 on 5343 correctly identified entities for CNTK and 60.5 on 4973 correctly identified entities for Tensorflow. 
We are using the Bio Creative 5 [Disease and Chemical Identification task]( http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/) dataset here. Note: **Loading a pre-trained model for predictions** -The [Notebook](4_a_Test_the_Trained_Neural_Entity_Extractor_Model.ipynb) shows how we can load a pre-trained neural entity extractor model (like the one trained in previous section). This will be useful if you want to reuse the model -for scoring at a later stage. The above notebook uses the call to load_model method by Keras. [These](https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model) are some useful functionalities that Keras provides -off the shelf for saving and loading your deep learning models. +The [notebook](4_a_Test_the_Trained_Neural_Entity_Extractor_Model.ipynb) shows how we can load a pre-trained neural entity extractor model (like the one trained in previous section). This will be useful if you want to reuse the model for scoring at a later stage. The above notebook uses the call to load_model method by Keras. [These](https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model) are some useful functionalities that Keras provides off the shelf for saving and loading your deep learning models. diff --git a/Code/02_Modeling/ReadMe.md b/Code/02_Modeling/ReadMe.md index e1ef12a..f63165d 100644 --- a/Code/02_Modeling/ReadMe.md +++ b/Code/02_Modeling/ReadMe.md @@ -1,17 +1,16 @@ ## **Code Structure for Model Creation and Evaluation** ### [01_FeatureEngineering](01_FeatureEngineering/ReadMe.md) -The Notebook in the feature engineering section details how the Word2Vec Model can be used to extract Word Embeddings for Bio-Medical Data. We are using the Medline Abstracts downloaded and parsed in the Data Preparation -section. These word embeddings then act as features for our Neural Entity Extraction Model. 
The Notebook also shows how to evaluate the quality of the embeddings using intrinsic evaluation and goes on to -demonstrate methods for visualization of the embeddings using PCA and t-SNE. The computing resource used here is Spark. +The notebook in the feature engineering section details how the Word2Vec model can be used to extract word embeddings for Bio-Medical data. We are using the Medline abstracts downloaded and parsed in the Data Preparation +section. These word embeddings then act as features for our Neural Entity Extraction model. The notebook also shows how to evaluate the quality of the embeddings using intrinsic evaluation and goes on to demonstrate methods for visualization of the embeddings using PCA and t-SNE. The computing resource used here is Spark. ### [02_ModelCreation](02_ModelCreation/ReadMe.md) -The Notebook in this section covers the details of how we can use the word embeddings obtained in the previous section to initialize the Embedding Layer of our Neural Network. It provides a mechanism of how you can +The notebook in this section covers the details of how we can use the word embeddings obtained in the previous section to initialize the embedding layer of our neural network. It provides a mechanism of how you can convert the input data (which is in BIO format) to an input shape which Keras understands. Once you convert the data in this format the notebook details the steps of how you can create a deep neural network by -adding different Layers. Towards the end, the notebook describes how to evaluate the performance of the model on the test set using an Evaluation script. +adding different layers. Towards the end, the notebook describes how to evaluate the performance of the model on the test set using an evaluation script. 
### [03_ModelEvaluation](03_ModelEvaluation/ReadMe.md) -This section is about comparing the model to different baseline modls and evaluate how the domain specific word embeddings perform when compared with Generic embeddings. We are using the Embeddings trained on -Google News to perform the comparisons. We find that training domain specific entities gives a significant improvement over the Generic Word Embeddings. In this section we also compare the performance of the +This section is about comparing the model to different baseline models and evaluating how the domain-specific word embeddings perform when compared with generic embeddings. We are using the embeddings trained on +Google News to perform the comparisons. We find that training domain-specific entities gives a significant improvement over the generic word embeddings. In this section we also compare the performance of the CNTK backend and the Tensorflow Backend. We find that CNTK is faster than Tensorflow but since the CNTK backend [does not](https://docs.microsoft.com/en-us/cognitive-toolkit/Using-CNTK-with-Keras#known-issues) have a complete implementation of BiDirectional LSTM Layer the performance is not as good. diff --git a/Code/03_Deployment/ReadMe.md b/Code/03_Deployment/ReadMe.md index ffb380d..967cc9a 100644 --- a/Code/03_Deployment/ReadMe.md +++ b/Code/03_Deployment/ReadMe.md @@ -1,10 +1,9 @@ ## **Operationalize the Neural Entity Extractor Model** -In order ot operationalize a deep learning model if need to take care about installing a lot of its dependencies. For example the machine should have Python, Keras tensorflow etc. which is extremely time consuming -and error prone at the same time. In order to ensure that we do not run into any configurations related issues we can take the advantage of [Docker containers](https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/09/23/getting-started-with-docker-and-container-services/). 
-Docker Containers essentially *Wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server. This guarantees that the software will always run the same, regardless of its environment. * +To operationalize a deep learning model, care must be taken to install all of its dependencies. For example the machine should have Python, Keras tensorflow etc., which is extremely time consuming and error prone at the same time. In order to ensure that we do not run into any configuration related issues we can take the advantage of [Docker containers](https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/09/23/getting-started-with-docker-and-container-services/). +Docker Containers essentially wrap a piece of software in a complete filesystem that contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server. This guarantees that the software will always run the same, regardless of its environment. In order to make the scoring performant and real time we are using [Azure Container Service](https://docs.microsoft.com/en-us/azure/container-service/dcos-swarm/). -This set of Notebooks details the steps that will be required to Operationalize the previously trained models. This part is mostly using the tutorial for [Deploying ML Models](https://gallery.cortanaintelligence.com/Tutorial/Deploy-CNTK-model-to-ACS). +This set of notebooks details the steps that will be required to operationalize the previously trained models. This part is mostly using the tutorial for [Deploying ML Models](https://gallery.cortanaintelligence.com/Tutorial/Deploy-CNTK-model-to-ACS). We will be using Docker Images and Azure Container Service to deploy a scoring web service that can be consumed directly. 
We also show the steps of deploying a basic website using [Flask Web app](https://docs.microsoft.com/en-us/azure/app-service-web/web-sites-python-create-deploy-flask-app) @@ -12,11 +11,10 @@ and create a Docker conatiner for it that can be deployed on Azure using [Web Ap ### Section 1: [Building a Docker Image for the Web Service](5_Build_Image.ipynb) -This [Notebook](5_Build_Image.ipynb) shows how you can make a web service in Python using Flask and then make a Docker Container for it. This web service will have the trained model in its backend and will provide the model predictions for the input given. This web app will be +This [notebook](5_Build_Image.ipynb) shows how you can make a web service in Python using Flask and then make a Docker Container for it. This web service will have the trained model in its backend and will provide the model predictions for the input given. This web app will be accessible to the outside world but will be running on your local machine. In the next sections we will illustrate how to use Azure Container Services to host the same web service without making any code changes (that's the power of Docker and Azure :) ). -Step 1: Set up the paths to where you have the trained model and other content necessary for making predictions. It is recommended to follow the previous Notebooks to obtain all the necessary files which will be -required to set up the scoring. If you have already followed them you would have the model saved in a .model file and the content such as embeddings and data required for preparing the test set in the correct format in a +Step 1: Set up the paths to where you have the trained model and other content necessary for making predictions. It is recommended to follow the previous notebooks to obtain all the necessary files which will be required to set up the scoring. 
If you have already followed them you would have the model saved in a .model file and the content such as embeddings and data required for preparing the test set in the correct format in a .p Pickle file. Step 2: Create a file that will help you transform the input received via the web service in a format suitable for Keras. This is the place where we format the input data in the format that our pre trained model @@ -29,11 +27,11 @@ Step 4: Configure routing for the Web Service app.run(host='0.0.0.0') # This is the code which makes the web app accessible to the outside world. -Note: By Default the Flask Web App runs on port 5000. Make Sure to have this port open on your machine to allow incoming traffic. If you are using an Azure VM, create a rule in the +Note: By Default the Flask Web App runs on port 5000. Make sure to have this port open on your machine to allow incoming traffic. If you are using an Azure VM, create a rule in the [Network Security Group](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nsg-quickstart-portal) Step 5: The following steps will help you setup a Custom Docker Image for the web app. This way you do not have to worry about the environment you are deploying to. -The Image should conatin all the necessary items required for your web app. +The Image should contain all the necessary items required for your web app. - Specify the requirements in the requirements file. All these will be pip installed when you create the Docker Image. - Create a proxy between the ports 88 and 5000. @@ -42,17 +40,17 @@ The Image should conatin all the necessary items required for your web app. ### Section 2: [Testing the Deployed Web Service](6_Test_Image.ipynb) -This [Notbook](6_Test_Image.ipynb) demonstrates how you can test the docker image deployed locally as well as on ACS. +This [notebook](6_Test_Image.ipynb) demonstrates how you can test the docker image deployed locally as well as on ACS. It has 2 sections in it. 
The first section shows how to test the web service running locally on your machine. The second sections shows how to test the service once deployed on ACS (you can come back to this after finishing section 3). -The Notebook runs the Docker image created in section 1. It uses some sample text data to send to the service and outputs its response. We provide the command to run the generated docker container (via the command line). Once the docker container is up and running you will be able to send the request to the web service. If for some reason the -service does not work you can debug it by seeing the output on the command line. Once you are satisfied that the conatiner deployed locally is exactly what you want then you can proceed to the next notebook which shows +The notebook runs the Docker image created in section 1. It uses some sample text data to send to the service and outputs its response. We provide the command to run the generated docker container (via the command line). Once the docker container is up and running you will be able to send the request to the web service. If for some reason the +service does not work you can debug it by seeing the output on the command line. Once you are satisfied that the container deployed locally is exactly what you want then you can proceed to the next notebook which shows how to deploy the same docker conatiner on ACS. The second section in this notebook will help you test the response from the web service deployed on ACS. You should get the same output as you got for section 1, if the service is deployed correctly. ### Section 3: [Taking this Web Service to Azure using Azure Container Service](7_Deploy_on_ACS.ipynb) -This [Notebook](7_Deploy_on_ACS.ipynb) walks through how you can deploy Azure Container Service from your notebook. +This [notebook](7_Deploy_on_ACS.ipynb) walks through how you can deploy Azure Container Service from your notebook. 
You must have Azure CLI (for [Ubuntu](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli#apt-get-for-debianubuntu)) installed on your machine for this. **Caution**: Make sure that your model loads in less than 30 seconds. If not you may see Error 502 while accessing the service. In order to speed up the model loading, make sure you pickle the content as a dictionary @@ -67,11 +65,11 @@ Finally, we are ready to deploy the container we created in Section 1 onto the a Step 4: You can now complete the second part of Section 2 i.e. to test your web service on ACS. -Appendix: The appendix shows how you can Tear down this created cluster if you donot need it anymore. +Appendix: The appendix shows how you can tear down this created cluster if you do not need it anymore. ### Section 4: [Deploy a Website using Web App for Linux on Azure](8_Build_Website.ipynb) -This [Notebook](8_Build_Website.ipynb) shows how you can create a flask web app, and make a docker image for it and deploy it on Azure. +This [notebook](8_Build_Website.ipynb) shows how you can create a Flask web app, make a Docker image for it, and deploy it on Azure. Step 1: Create the UI for the web app. You can use the files in this step to modify the UI of your application. We are creating the UI files on the fly and you can edit the files in the template folder as per your needs and have a different UI. Flask also supports accessing variables in the html documents. You can see some tutorials on its use [here](http://flask.pocoo.org/docs/0.12/tutorial/) diff --git a/Code/ReadMe.md b/Code/ReadMe.md index 664ded6..2bb8e96 100644 --- a/Code/ReadMe.md +++ b/Code/ReadMe.md @@ -1,20 +1,16 @@ # **Code Structure** -The files in this folder contain the source code for training a word embedding model using Word2Vec on Spark and then to use these embeddings for Neural Entity Recognition. 
The aim of the project is to be able to -train a domain specific (Bio Medical Domain) word embedding model and evaluate it against Generic word Embedding Model. The evaluations we perform justify that having a Domain Specific Word Embedding model has superior -performance than the Genric one. We are using Medline abstracts to train our Word Embeddings and then using several Entity Recognition Tasks to evaluate its performance. +The files in this folder contain the source code for training a word embedding model using Word2Vec on Spark and then using these embeddings for neural entity recognition. The aim of the project is to be able to +train a domain-specific (Bio Medical) word embedding model and evaluate it against a generic word embedding model. The results we obtain show that a domain-specific word embedding model has superior +performance to the generic one. We use Medline abstracts to train our word embeddings and then use several entity recognition tasks to evaluate their performance. -Training and Evaluating a Deep Neural Network is only the first part of the problem. Operationalizing the trained model in a real-world data processing pipeline, be it real-time or batch scoring is another -challenge. The Operationalization section walks you through the code that will be required to expose the deep learning model as a web service for external customers of yours. This would also provide a way how -deep learning can be used for real time scoring. To make life of data scientists easy we show an end to end walk through of how to deploy this web service using Docker conatiners on Azure Container Service as well -as consume this web service using a website hosted through a docker image on Web App for Linux on Azure. +Training and evaluating a deep neural network is only the first part of the problem. Operationalizing the trained model in a real-world data processing pipeline, be it real-time or batch scoring, is another challenge. 
The Operationalization section walks you through the code that will be required to expose the deep learning model as a web service for your external customers. This would also provide a way for deep learning to be used for real-time scoring. To make the life of data scientists easy, we show an end-to-end walkthrough of how to deploy this web service using Docker containers on Azure Container Service. We also show how to consume this web service from a website hosted through a docker image on Web App for Linux on Azure. ### [Data Preparation](01_DataPreparation/ReadMe.md) -The goal of this section is to help setup the data for training the word embedding model. +The goal of this section is to help set up the data for training the word embedding model. ### [Model creation and evaluation](02_Modeling/ReadMe.md) -The goal of this section is to train the word embedding model on Medline Abstracts. Then to use these embeddings as features and train a deep neural network for extracting entities like -Drugs, Diseases from medical data. We perform several evaluations of the word embedding model as well as the neural entity extractor. +The goal of this section is to train the word embedding model on Medline Abstracts. We then show how to use these embeddings as features and train a deep neural network for extracting entities like +"Drugs" and "Diseases" from medical data. We perform several evaluations of the word embedding model as well as the neural entity extractor. ### [Operationalization](03_Deployment/ReadMe.md) -The goal of this section is to operationalize the model created in the previous section and publish a scoring web service. We also demonstrate a basic UI that can used to see how to -consume the web service deployed on ACS. +The goal of this section is to operationalize the model created in the previous section and publish a scoring web service. We also demonstrate a basic UI that can be used to see how to consume the web service deployed on ACS. 
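As a rough illustration of what consuming the scoring web service described in the Operationalization section might look like from Python, here is a minimal client sketch using only the standard library. The `/score` route, the `{"input": ...}` payload shape, and the helper names `build_scoring_request` and `score` are hypothetical and are not taken from the repository; only the default Flask port 5000 comes from the text. Adapt them to whatever your deployed service actually exposes.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the address of your own deployment.
# Port 5000 is the Flask default; the "/score" route is an assumption.
SCORING_URL = "http://localhost:5000/score"


def build_scoring_request(text, url=SCORING_URL):
    """Build (but do not send) a POST request carrying the text to be tagged."""
    # The {"input": ...} payload shape is an assumption for illustration only.
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(text, url=SCORING_URL, timeout=30):
    """Send the text to the web service and return the decoded JSON response."""
    request = build_scoring_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to inspect without a running service; calling `score("Aspirin reduces fever.")` would only succeed once a service is actually listening at the chosen URL.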
diff --git a/README.md b/README.md index d5a1cb6..1401926 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,6 @@ # Biomedical Entity Recognition from MEDLINE Abstracts using Deep Learning - -## Link of the Gallery GitHub repository -Following is the link to the public GitHub repository: +Link to the public GitHub repository: [https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction](https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction) @@ -11,59 +9,55 @@ Following is the link to the public GitHub repository: We provide summary documentation here about the sample. More extensive documentation can be found on the GitHub site in the file: [https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction/blob/master/ProjectReport.md](https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction/blob/master/ProjectReport.md). -[quick-start-installation.md] +[quick-start-installation.md] **AT** MISSING LINK ## Prerequisites ### Azure subscription and Hardware 1. An Azure [subscription](https://azure.microsoft.com/en-us/free/) -2. [HDInsight Spark Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql) version Spark 2.1 on Linux (HDI 3.6). To process the full amount of MEDLINE abstracts discussed below, you need the minimum configuration of: -* Head node: [D13_V2](https://azure.microsoft.com/en-us/pricing/details/hdinsight/) size -* Worker nodes: At least 4 of [D12_V2](https://azure.microsoft.com/en-us/pricing/details/hdinsight/). In our work, we used 11 worker nodes of D12_V2 size. +2. [HDInsight Spark Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql) version Spark 2.1 on Linux (HDI 3.6). 
To process the full amount of MEDLINE abstracts discussed below, you need the minimum configuration of: + - Head node: [D13_V2](https://azure.microsoft.com/en-us/pricing/details/hdinsight/) size + - Worker nodes: At least 4 of [D12_V2](https://azure.microsoft.com/en-us/pricing/details/hdinsight/). In our work, we used 11 worker nodes of D12_V2 size. + 3. [NC6 Data Science Virtual Machine (DSVM)](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-linux-dsvm-intro) on Azure. ### Software -1. Azure Machine Learning Workbench. See [installation guide](quick-start-installation.md). +1. Azure Machine Learning Workbench. See [installation guide](quick-start-installation.md). LINK DIDN'T WORK FOR ME 2. [TensorFlow](https://www.tensorflow.org/install/) 3. [CNTK 2.0](https://docs.microsoft.com/en-us/cognitive-toolkit/using-cntk-with-keras) 4. [Keras](https://keras.io/#installation) -## INTRODUCTION: Business Case Understanding & Project Summary +## Introduction: Business Case Understanding & Project Summary ### Use Case Overview -Medical Named Entity Recognition (NER) is a critical step for complex biomedical NLP tasks such as: -* Extraction of diseases, symptoms from electronic medical or health records. -* Understanding the interactions between different entity types like Drugs, Diseases for purpose of pharmacovigilence. +Medical Named Entity Recognition (NER) is a critical step for complex biomedical Natural Language Processing (NLP) tasks such as: +* Extraction of diseases and symptoms from electronic medical or health records. +* Understanding the interactions between different entity types like drugs and diseases for the purpose of pharmacovigilance. + +Our study focuses on how a large amount of unstructured biomedical data available from MEDLINE abstracts can be utilized for training a Neural Entity Extractor for biomedical NER. 
-Our study focuses on how a large amount of unstructured biomedical data available form MEDLINE abstracts can be utilized for training a Neural Entity Extractor for biomedical NER. +**AT** PERHAPS HERE YOU COULD GIVE AN EXAMPLE OF A SENTENCE FROM THE BIOMEDICAL DOMAIN AND "TAG" IT FOR NAMED-ENTITIES The project highlights several features of Azure Machine Learning Workbench, such as: - 1. Instantiation of [Team Data Science Process (TDSP) structure and templates](how-to-use-tdsp-in-azure-ml.md ) + 1. Instantiation of the [Team Data Science Process (TDSP) structure and templates](how-to-use-tdsp-in-azure-ml.md ) 2. Execution of code in Jupyter notebooks as well as Python files - 3. Run history tracking for Python files - 4. Execution of jobs on remote Spark computes context using HDInsight Spark 2.1 clusters + 3. Run history tracking for Python files + 4. Execution of jobs on remote Spark compute context using HDInsight Spark 2.1 clusters 5. Execution of jobs in remote GPU VMs on Azure - 6. Easy operationalization of Deep Learning models as web-services on Azure Container Services + 6. Easy operationalization of Deep Learning models as web services on Azure Container Services ### Purpose of the Project -Objectives of the sample are: -1. Show how to systematically train a Word Embeddings Model using nearly 15 million MEDLINE abstracts using [Word2Vec on Spark](https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec) and then use them to build an LSTM-based deep neural network for Entity Extraction on a GPU enabled VM on Azure. +The objectives of the sample are: +1. Show how to systematically train a word embeddings model using nearly 15 million MEDLINE abstracts using [Word2Vec on Spark](https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec). These word embeddings are then used to build an LSTM-based deep neural network for Entity Extraction on a GPU enabled VM on Azure. 2. 
Demonstrate that domain-specific data can enhance accuracy of NER compared to generic data, such as Google News, which are often used for such tasks. -3. Demonstrate an end-to-end work-flow of how to train and operationalize deep learning models on large amounts of text data using Azure Machine Learning Workbench and multiple compute contexts (Spark, GPU VMs). -4. Demonstrate the following capabilities within Azure Machine Learning Workbench: - Instantiation of [Team Data Science Process (TDSP) structure and templates](how-to-use-tdsp-in-azure-ml.md) - - Execution of code in Jupyter notebooks as well as Python files - Run history tracking for Python files - Execution of jobs on remote Spark compute context using HDInsight Spark 2.1 clusters - Execution of jobs in remote GPU VMs on Azure - Easy operationalization of Deep Learning models as web-services on Azure Container Services +3. Demonstrate an end-to-end workflow of how to train and operationalize deep learning models on large amounts of text data using Azure Machine Learning Workbench and multiple compute contexts (Spark, GPU VMs). +4. Demonstrate the capabilities of Azure Machine Learning Workbench **AT** REMOVED REPETITION ### Summary: Results and Deployment Our results show that training a domain-specific word embedding model boosts the accuracy of biomedical NER when compared to embeddings trained on generic data such as Google News. The in-domain word embedding model can detect 7012 entities correctly (out of 9475) with a F1 score of 0.73 compared to 5274 entities with F1 score of 0.61 for generic word embedding model. -We also demonstrate how we can publish the trained Neural Network as a service for real time scoring using Docker and Azure Container Service. Finally, we develop a basic website using Flask to consume the created web service and host it on Azure using Web App for Linux. 
Currently the model operational on the website (http://medicalentitydetector.azurewebsites.net/) supports seven entity types namely, Diseases, Drug, or Chemicals, Proteins, DNA, RNA, Cell Line, Cell Type. +We also demonstrate how we can publish the trained neural network as a web service for real-time scoring using Docker and Azure Container Service. Finally, we develop a basic website using Flask to consume the created web service and host it on Azure using Web App for Linux. Currently the model operational on the website (http://medicalentitydetector.azurewebsites.net/) supports seven entity types, namely, "Diseases", "Drug or Chemicals", "Proteins", "DNA", "RNA", "Cell Line" and "Cell Type". ## Architecture @@ -72,7 +66,7 @@ The figure shows the architecture that was used to process data and train models ![Architecture](./Images/architecture.png) ## Data -We first obtained the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html). The data is available publically and is in form of XML files available on their [FTP server](ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/). There are 812 XML files available on the server and each of the XML files contain around 30,000,000 abstracts. More detail about data acquisition and understanding is provided in the Project Structure. The fields present in each file are +We first obtained the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.nih.gov/pubs/factsheets/medline.html). The data is available publicly and is in the form of XML files available on their [FTP server](ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz/). There are 812 XML files available on the server and each of the XML files contains around 30,000 abstracts. More detail about data acquisition and understanding is provided in the Project Structure. 
The fields present in each file are abstract affiliation @@ -92,10 +86,10 @@ We first obtained the raw MEDLINE abstract data from [MEDLINE](https://www.nlm.n pubdate: Publication date title -This amount to a total of 24 million abstracts but nearly 10 million documents do not have a field for abstracts. Since the amount of data processed is large and cannot be loaded into memory at a single machine, we rely on HDInsight Spark for processing. Once the data is available in Spark as a data frame, we can apply other pre-processing techniques on it like training the Word Embedding Model. Refer to [./Code/01_DataPreparation/ReadMe.md](./Code/01_DataPreparation/ReadMe.md) to get started. +This amounts to a total of 24 million abstracts, but nearly 10 million documents do not have a field for abstracts. Since the amount of data processed is large and cannot be loaded into the memory of a single machine, we rely on HDInsight Spark for processing. Once the data is available in Spark as a data frame, we can apply other pre-processing techniques on it like training the Word Embedding Model. Refer to [./Code/01_DataPreparation/ReadMe.md](./Code/01_DataPreparation/ReadMe.md) to get started. -Data after parsing XMLs +Data after parsing XMLs: ![Data Sample](./Images/datasample.png) @@ -106,66 +100,38 @@ Other datasets, which are being used for training and evaluation of the Neural E ## Project Structure and Reporting According TDSP LifeCycle Stages -For the project, we use the TDSP folder structure and documentation templates (Figure 1), which follows the [TDSP lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md). Project is created based on instructions provided [here](https://github.com/amlsamples/tdsp/blob/master/Docs/how-to-use-tdsp-in-azure-ml.md). 
+For the project, we use the TDSP folder structure and documentation templates (Figure 1), which follows the [TDSP lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md). The project is created based on instructions provided [here](https://github.com/amlsamples/tdsp/blob/master/Docs/how-to-use-tdsp-in-azure-ml.md). ![Fill in project information](./Images/instantiation-step3.jpg) -The step-by-step data science workflow was as follows: - -### 1. Data Acquisition and Understanding -The MEDLINE abstract fields present in each file are +The step-by-step data science workflow was as follows: - abstract - affiliation - authors - country - delete: boolean if False means paper got updated so you might have two XMLs for the same paper. - file_name - issn_linking - journal - keywords - medline_ta: this is abbreviation of the journal nam - mesh_terms: list of MeSH terms - nlm_unique_id - other_id: Other IDs - pmc: Pubmed Central ID - pmid: Pubmed ID - pubdate: Publication date - title - -This amount to a total of 24 million abstracts but nearly 10 million documents do not have a field for abstracts. Since the amount of data to be processed is huge and cannot be loaded into memory at a single instance, we rely on Sparks Distributed Computing capabilities for processing. Once the data is available in Spark as a data frame, we can apply other pre-processing techniques on it like training the Word Embedding Model. Refer to [./Code/01_DataPreparation/ReadMe.md](./Code/01_DataPreparation/ReadMe.md) to get started. - - -Data after parsing XMLs - -![Data Sample](./Images/datasample.png) - -Other datasets, which are being used for training and evaluation of the Neural Entity Extractor have been include in the corresponding folder. 
To obtain more information about them, you could refer to the following corpora: - * [Bio-Entity Recognition Task at Bio NLP/NLPBA 2004](http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html) - * [BioCreative V CDR task corpus](http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/) - * [Semeval 2013 - Task 9.1 (Drug Recognition)](https://www.cs.york.ac.uk/semeval-2013/task9/) +**AT** REMOVED REPEATED SECTION ON DATA - -### 2. Modeling (Including Word2Vec word featurization/embedding) -Modeling is the stage where we show how you can use the data downloaded in the previous section for training your own word embedding model and use it for other downstream tasks. Although we are using the Medline data, however the pipeline to generate the embeddings is generic and can be reused to train word embeddings for any other domain. For embeddings to be an accurate representation of the data, it is essential that the word2vec is trained on a large amount of data. -Once we have the word embeddings ready, we can make a deep neural network that uses the learned embeddings to initialize the Embedding layer. We mark the embedding layer as non-trainable but that is not mandatory. The training of the word embedding model is unsupervised and hence we are able to take advantage of unstructured texts. However, to train an entity recognition model we need labeled data. The more the better. +### Modeling (Including Word2Vec word featurization/embedding) REMOVED NUMBERING +Modeling is the stage where we show how you can use the data downloaded in the previous section for training your own word embedding model and use it for other downstream tasks. Although we are using the Medline data, the pipeline to generate the embeddings is generic and can be reused to train word embeddings for any other domain. For embeddings to be an accurate representation of the data, it is essential that the word2vec is trained on a large amount of data. 
+Once we have the word embeddings ready, we can train a deep neural network that uses the learned embeddings to initialize the embedding layer. We mark the embedding layer as non-trainable but that is not mandatory. The training of the word embedding model is unsupervised and hence we are able to take advantage of unstructured texts. However, to train the supervised entity recognition model we need labeled data. The more the better. #### [Featurizing/Embedding Words with Word2Vec](./Code/02_Modeling/01_FeatureEngineering/ReadMe.md) -Word2Vec is the name given to a class of neural network models that, given an unlabeled training corpus, produce a vector for each word in the corpus that encodes its semantic information. These models are simple neural networks with one hidden layer. The word vectors/embeddings are learned by backpropagation and stochastic gradient descent. There are two types of word2vec models, namely, the Skip-Gram and the continuous-bag-of-words. Since we are using the MLlib's implementation of the word2vec, which supports the Skip-gram model, we briefly describe the model here. [For details](https://arxiv.org/pdf/1301.3781.pdf). +Word2Vec is the name given to a class of neural network models that, given an unlabeled training corpus, produce a vector for each word in the corpus that encodes its semantic information. These models are simple neural networks with one hidden layer. The word vectors/embeddings are learned by backpropagation and stochastic gradient descent. There are two types of word2vec models, namely, the Skip-Gram and the continuous-bag-of-words models. Since we are using MLlib's implementation of word2vec, which supports the Skip-gram model, we briefly describe that model here. For more details see [this paper](https://arxiv.org/pdf/1301.3781.pdf). ![Skip Gram Model](./Images/skip-gram.png) -The model uses Hierarchical Softmax and Negative sampling to optimize the performance. 
Hierarchical SoftMax (H-SoftMax) is an approximation inspired by binary trees. H-SoftMax essentially replaces the flat SoftMax layer with a hierarchical layer that has the words as leaves. This allows us to decompose calculating the probability of one word into a sequence of probability calculations, which saves us from -having to calculate the expensive normalization over all words. Since a balanced binary tree has a depth of log2(|V|)log2(|V|) (V is the Vocabulary), we only need to evaluate at most log2(|V|)log2(|V|) nodes to obtain the final probability of a word. The probability of a word w given its context c is then simply the product of the probabilities of taking right and left turns respectively that lead to its leaf node. We can build a Huffman Tree based on the frequency of the words in the dataset to ensure that more frequent words get shorter representations. Refer [link](http://sebastianruder.com/word-embeddings-softmax/) for further information. -Image taken from [here](https://ahmedhanibrahim.wordpress.com/2017/04/25/thesis-tutorials-i-understanding-word2vec-for-word-embedding-i/) +The Skip-Gram model uses Hierarchical Softmax and Negative sampling to optimize the performance. Hierarchical SoftMax (H-SoftMax) is an approximation inspired by binary trees. H-SoftMax essentially replaces the flat SoftMax layer with a hierarchical layer that has the words as leaves. This allows us to decompose calculating the probability of one word into a sequence of probability calculations, which saves us from +having to calculate the expensive normalization over all words. Since a balanced binary tree has a depth of log2(|V|) (V is the Vocabulary), we only need to evaluate at most log2(|V|) nodes to obtain the final probability of a word. The probability of a word w given its context c is then simply the product of the probabilities of taking right and left turns respectively that lead to its leaf node. 
We can build a Huffman Tree based on the frequency of the words in the dataset to ensure that more frequent words get shorter representations. Refer to this [blog](http://sebastianruder.com/word-embeddings-softmax/) for further information. +The image above was taken from [here](https://ahmedhanibrahim.wordpress.com/2017/04/25/thesis-tutorials-i-understanding-word2vec-for-word-embedding-i/) + +**AT** PERHAPS THE ABOVE PARAGRAPH IS TOO MUCH DETAIL? TO ME, IT WOULD BE MORE INTERESTING TO READ HOW SKIM-GRAM TRAINS THE WORD EMBEDDINGS. PERHAPS JUST PROVIDE THE LINKS TO ARTICLES ON HOW H-SOFTMAX OPTIMIZES TRAINING. Once we have the embeddings, we would like to visualize them and see the relationship between semantically similar words. ![W2V similarity](./Images/w2v-sim.png) -We have shown two different ways of visualizing the embeddings. The first, uses a PCA to project the high dimensional vector to a 2-D vector space. This leads to a significant loss of information and the visualization is not as accurate. The second is to use PCA with t-SNE. t-SNE is a nonlinear dimensionality reduction technique that is well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points. It works in two parts. First, it creates a probability distribution over the pairs in the higher dimensional space in a way that similar objects have a high probability of being picked and dissimilar points have low probability of getting picked. Second, it defines a similar probability distribution over the points in a low dimensional map and minimizes the KL Divergence between the two distributions with respect to location of points on the map. 
The location of the points in the low dimension is obtained by minimizing the KL Divergence using Gradient Descent. But t-SNE might not be always reliable. Refer [link](https://distill.pub/2016/misread-tsne/) Refer [./Code/02_Modeling/01_FeatureEngineering/ReadMe.md](./Code/02_Modeling/01_FeatureEngineering/ReadMe.md) for details about the implementation. +There are two different ways of visualizing the embeddings. The first uses Principal Component Analysis (PCA) to project the high-dimensional vectors to a 2-D vector space. This leads to a significant loss of information and the visualization is not as accurate. The second is to use PCA with t-SNE. t-SNE is a nonlinear dimensionality reduction technique that is well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Refer to this [link](https://distill.pub/2016/misread-tsne/) for more details on t-SNE. Refer to [./Code/02_Modeling/01_FeatureEngineering/ReadMe.md](./Code/02_Modeling/01_FeatureEngineering/ReadMe.md) for details about the implementation. + +**AT** I SUGGEST SIMPLIFYING THIS PARAGRAPH AS ABOVE BECAUSE I THINK IT'S TOO MUCH DETAIL. IF THE READER IS INTERESTED IN EXACTLY HOW T-SNE WORKS THEY CAN FOLLOW THE LINK PROVIDED As you see below, t-SNE visualization provides more separation and potential clustering patterns. @@ -183,28 +149,30 @@ As you see below, t-SNE visualization provides more separation and potential clu ![Points closest to Cancer](./Images/nearesttocancer.png) #### [Training the Neural Entity Extractor](Code/02_Modeling/02_ModelCreation/ReadMe.md) -Traditional Neural Network Models suffer from a problem that they treat each input and output as independent of the other inputs and outputs. This may not be a good idea for tasks such as Machine translation, Entity Extraction, or any other sequence to sequence labeling tasks. 
Recurrent Neural Network models overcome this problem as they can pass information computed until now to the next node. This property is called having memory in the network since it is able to use the previously computed information. The below picture represents this. +Traditional neural network models suffer from a problem in that they treat each input and output as independent of the other inputs and outputs. This may not be a good idea for tasks such as machine translation, entity extraction, or any other sequence-to-sequence labeling tasks. Recurrent neural network models (RNNs) overcome this problem as they can pass information computed in previous nodes to the next node. This property means the network has memory since it is able to use the previously computed information. The below picture represents this. ![RNN](./Images/rnn-expanded.png) -Vanilla RNNs actually suffer from the [Vanishing Gradient Problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) due to which they are not able to utilize all the information they have seen before. The problem becomes evident only when a large amount of context is required to make a prediction. But models like LSTM do not suffer from such a problem, in fact they are designed to remember long-term dependencies. Unlike vanilla RNNs that have a single neural network, the LSTMs have the interactions between four neural networks for each cell. Refer to the [excellent post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for a detailed explanation of how LSTM work. +Vanilla RNNs suffer from the [Vanishing Gradient Problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) in which they are not able to utilize all the information they have seen before. The problem becomes evident only when a large amount of context is required to make a prediction. But models like LSTM do not suffer from such a problem, in fact they are designed to remember long-term dependencies. 
Unlike vanilla RNNs that have a single neural network, the LSTMs have the interactions between four neural networks for each cell. Refer to this [excellent post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for a detailed explanation of how LSTMs work. + +![LSTM Cell](./Images/lstm-cell.png) **AT** IMAGE NOT SHOWING -![LSTM Cell](./Images/lstm-cell.png) +In this sample, we show how to put together our own LSTM-based recurrent neural network and extract entities like drugs, diseases, etc. from medical data. The first step is to obtain a large amount of labeled data. Most medical data contains a lot of sensitive personal information and hence is not publicly available. We rely on a combination of two different datasets that are publicly available. The first dataset is from Semeval 2013 - Task 9.1 (Drug Recognition) and the other is from BioCreative V CDR task. We combine and auto-label these two datasets so that we can detect both drugs and diseases from medical texts and evaluate our word embeddings. See [here](./Code/02_Modeling/02_ModelCreation/ReadMe.md) for implementation details. -Let’s try to put together our own LSTM-based Recurrent Neural Network and try to extract Entities like Drugs, Diseases etc. from Medical Data. The first step is to obtain a large amount of labeled data and as you would have guessed, that's not easy! Most of the medical data contains lot of sensitive information about the person and hence are not publicly available. We rely on a combination of two different datasets that are publicly available. The first dataset is from Semeval 2013 - Task 9.1 (Drug Recognition) and the other is from BioCreative V CDR task. We are combining and auto labeling these two datasets so that we can detect both drugs and diseases from medical texts and evaluate our word embeddings. +**AT** I MADE THE ABOVE PARAGRAPH LESS CONVERSATIONAL. 
-Refer [link](./Code/02_Modeling/02_ModelCreation/ReadMe.md) for implementation details +**AT** PERHAPS YOU COULD EXPLAIN A LITTLE ABOUT HOW THE AUTO-LABELING WORKS? -The model architecture that we have used across all the codes and for comparison is presented below. The parameter that changes for different datasets is the maximum sequence length (613 here). +The model architecture that we have used in all experiments is presented below. The maximum sequence length parameter (613 here) changes for different datasets. ![LSTM model](./Images/d-a-d-model.png) #### Model evaluation -We use the evaluation script from the shared task [Bio-Entity Recognition Task at Bio NLP/NLPBA 2004](http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html) to evaluate the precision, recall, and F1 score of the model. Below is the comparison of the results we get with the embeddings trained on Medline Abstracts and that on Google News embeddings. We clearly see that the in-domain model is out performing the generic model. Hence having a specific word embedding model rather than using a generic one is much more helpful. +We use the evaluation script from the shared task [Bio-Entity Recognition Task at Bio NLP/NLPBA 2004](http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html) to evaluate the precision, recall, and F1 score of the model. Below is the comparison of the results we get with the embeddings trained on Medline Abstracts and with embeddings trained on Google News. We clearly see that the in-domain model outperforms the generic model. Hence, a domain-specific word embedding model is much more helpful than a generic one. ![Model Comparison 1](./Images/mc1.png) -We perform the evaluation of the word embeddings on other datasets in the similar fashion and see that in-domain model is always better. +We perform the evaluation of the word embeddings on other datasets in a similar fashion and see that the in-domain model is always better. 
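The precision, recall, and F1 scores used throughout this evaluation can be derived from three entity-level counts. The snippet below is a minimal sketch of that arithmetic only, not the shared-task evaluation script itself. The function name `entity_prf` and the toy counts are hypothetical, and it assumes an entity is counted as correct only when it matches the gold annotation exactly.

```python
def entity_prf(num_correct, num_predicted, num_gold):
    """Return (precision, recall, F1) from entity-level counts.

    num_correct:   predicted entities that match a gold entity
    num_predicted: all entities the model predicted
    num_gold:      all entities in the gold annotation
    """
    # Guard against division by zero when nothing was predicted or annotated.
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy counts for illustration only (not results from this project).
p, r, f1 = entity_prf(num_correct=8, num_predicted=10, num_gold=16)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints "precision=0.80 recall=0.50 f1=0.62"
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which is why it is a convenient single number for comparing the in-domain and generic embeddings.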
![Model Comparison 2](./Images/mc2.png) @@ -212,23 +180,23 @@ We perform the evaluation of the word embeddings on other datasets in the simila ![Model Comparison 4](./Images/mc4.png) -All the training and evaluations reported here are done using Keras and TensorFlow. Keras also supports CNTK backend but since it does not have all the functionalities for the bidirectional model yet we have used unidirectional model with CNTK backend to benchmark the results of CNTK model with that of TensorFlow. These are the results we get +All the training and evaluations reported here are done using Keras and TensorFlow. Keras also supports the CNTK backend but it does not yet have all the functionalities for the bidirectional model. We therefore used a unidirectional model with the CNTK backend to benchmark the results of the CNTK model against those of TensorFlow. We obtain the results below: ![Model Comparison 5](./Images/mc5.png) -We also compare the performance of Tensorflow vs CNTK and see that CNTK performs as good as Tensorflow both in terms of time taken per epoch (60 secs for CNTK and 75 secs for Tensorflow) and the number of entities detected. We are using the Unidirectional layers for evaluation. +We also compare the performance of TensorFlow vs CNTK and see that CNTK performs as well as TensorFlow, both in terms of time taken per epoch (60 secs for CNTK and 75 secs for TensorFlow) and the number of entities detected. We are using the unidirectional layers for evaluation. ![Model Comparison 6](./Images/mc6.png) -### 3. [Deployment](./Code/03_Deployment/ReadMe.md) -We deployed a web-service on a cluster in the [Azure Container Service (ACS)](https://azure.microsoft.com/en-us/services/container-service/). The operationalization environment provisions Docker and Kubernetes in the cluster to manage the web-service deployment. You can find further information on the operationalization process [here](model-management-service-deploy.md ).
+### [Deployment](./Code/03_Deployment/ReadMe.md) **AT** REMOVED NUMBERING +We show how to deploy a web service on a cluster in the [Azure Container Service (ACS)](https://azure.microsoft.com/en-us/services/container-service/). The operationalization environment provisions Docker and Kubernetes in the cluster to manage the web service deployment. You can find further information on the operationalization process [here](model-management-service-deploy.md ). ## Conclusion & Next Steps -In this report, we went over the details of how you could train a Word Embedding Model using Word2Vec on Spark and then use the Embeddings obtained for training a Neural Network for Entity Extraction. We have shown the pipeline for Bio-Medical domain but the pipeline is generic. You just need enough data and you can easily adapt the workflow presented here for a different domain. +In this sample, we demonstrated how to train a word embedding model using Word2Vec on Spark and then use the embeddings obtained for training a neural network for entity extraction. We have shown the pipeline for the Bio-Medical domain but the pipeline is generic. With enough data, you can easily adapt the workflow presented here to a different domain. ## Contact -Feel free to contact Mohamed Abdel-Hady (mohamed.abdel-hady@microsof.com) Debraj GuhaThakurta (debraj.guhathakurta@microsoft.com) with any question or comments. +Feel free to contact Mohamed Abdel-Hady (mohamed.abdel-hady@microsoft.com) or Debraj GuhaThakurta (debraj.guhathakurta@microsoft.com) with any questions or comments.