From 375550a549db91810f395b56094975c023dbd1fd Mon Sep 17 00:00:00 2001
From: Conglong Li
Date: Tue, 16 Aug 2022 11:25:14 -0700
Subject: [PATCH] add doc for new bert example

---
 docs/_tutorials/bert-pretraining.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/_tutorials/bert-pretraining.md b/docs/_tutorials/bert-pretraining.md
index e3771b7fdad2..a0943949f9bc 100755
--- a/docs/_tutorials/bert-pretraining.md
+++ b/docs/_tutorials/bert-pretraining.md
@@ -4,6 +4,10 @@ excerpt: ""
 tags: training pre-training
 ---
 
+**Note:**
+On 08/15/2022 we added another BERT pre-training/fine-tuning example at [github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/bert_with_pile](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/bert_with_pile), which includes a README.md that describes how to use it. Compared to the example described below, the new example in Megatron-DeepSpeed adds support for ZeRO and tensor-slicing model parallelism (and thus supports larger model scales), uses the public and richer [Pile dataset](https://github.com/EleutherAI/the-pile) (users can also use their own data), and makes some changes to the model architecture and training hyperparameters as described in [this paper](https://arxiv.org/abs/1909.08053). As a result, the BERT models trained by the new example achieve better MNLI results than the original BERT, but with a slightly different model architecture and larger computation requirements. If you want to train a larger-scale or higher-quality BERT-style model, we recommend following the new example in Megatron-DeepSpeed. If your goal is to strictly reproduce the original BERT model, we recommend following the example under DeepSpeedExamples/bing_bert as described below. On the other hand, the tutorial below helps explain how to integrate DeepSpeed into a pre-training codebase, regardless of which BERT example you use.
+{: .notice--info}
+
 In this tutorial we will apply DeepSpeed to pre-train the BERT
 (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers),
 which is widely used for many Natural Language Processing (NLP) tasks. The