Download all of your questions and answers as Markdown files from all Stack Exchange (SE) sites using a Python script and the SE API.
To get started, make sure that Python version 3.11.7 or newer is installed. Then, run the following commands (these commands were tested in PowerShell, so only the virtual environment activation command will need to be translated for use in bash, zsh, etc.):
git clone https://github.com/mhdadk/stack-exchange-backup.git
cd stack-exchange-backup
# create a virtual environment
python -m venv .venv
# activate the virtual environment (PowerShell). In bash or zsh, use: source .venv/bin/activate
.venv\Scripts\activate
python -m pip install "requests==2.31.0"
Finally, determine your SE network user ID by doing the following:
- Go to https://stackexchange.com/.
- Log in to your SE account.
- Click on your profile picture on the top-right, as indicated by the red arrow in the screenshot below.
- Go to the address bar in your browser. The address should be in the form https://stackexchange.com/users/<user id>/<user name>. For example, the screenshot below shows my user id as 9073934 and my user name as mhdadk. Note or save the user id from the address bar that is shown for your account, as this will be needed later.
Go to the "Usage" section below for the final step.
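If you would rather not read the user id off the address bar by eye, it can be pulled out of the copied address programmatically. The following is a minimal sketch that assumes the address has exactly the form shown above; the function name extract_user_id is hypothetical and is not part of main.py.

```python
from urllib.parse import urlparse


def extract_user_id(profile_url: str) -> int:
    """Return the numeric user id from an address of the form
    https://stackexchange.com/users/<user id>/<user name>."""
    # Split the path into its non-empty segments, e.g. ["users", "9073934", "mhdadk"].
    parts = [p for p in urlparse(profile_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "users" or not parts[1].isdigit():
        raise ValueError(f"not a Stack Exchange profile address: {profile_url!r}")
    return int(parts[1])


print(extract_user_id("https://stackexchange.com/users/9073934/mhdadk"))
```

With the example address above, this prints 9073934.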
Once the steps under the section "Getting started" above are done, you can then download all of your questions and answers from all SE sites by running the following command:
python main.py --user_id <user id>
where <user id> should be replaced with your own user id that you obtained by following the instructions under the "Getting started" section above. See the "Format" section below for details on the format of the files that are downloaded.
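For readers curious how a flag such as --user_id is typically read in Python, here is a small sketch using the standard argparse module. It only illustrates the command-line interface shown above; how main.py actually parses its arguments is an assumption, and the function name parse_args is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Download your Stack Exchange questions and answers."
    )
    # The flag name matches the command shown above; the integer type is an assumption.
    parser.add_argument(
        "--user_id", type=int, required=True, help="SE network user id"
    )
    return parser.parse_args(argv)


args = parse_args(["--user_id", "9073934"])
print(args.user_id)
```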
NOTE: you may notice that, for some SE sites, fewer answers are downloaded under the answers directory than the number of answers shown on the SE site itself. This is because more than one of your answers may be associated with the same question. In this case, the question and all of your answers to it are downloaded only once, into a single file.
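The effect described in the note can be illustrated with a few lines of Python. This is a sketch, not the code in main.py: it assumes each answer is a dictionary with answer_id and question_id keys (the field names used by SE API answer objects), and the function name unique_question_ids is hypothetical.

```python
def unique_question_ids(answers):
    """Collapse a list of answers to the distinct question ids they belong to,
    keeping the order in which the questions were first seen."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(a["question_id"] for a in answers))


answers = [
    {"answer_id": 11, "question_id": 100},
    {"answer_id": 12, "question_id": 100},  # second answer to the same question
    {"answer_id": 13, "question_id": 200},
]
print(len(answers), "answers ->", len(unique_question_ids(answers)), "files")
```

Here three answers produce only two files, because two of the answers share question 100.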
Once the command under the "Usage" section is run, a q_and_a directory will be created inside the directory from which the command was run. This directory will have the following structure:
<stack exchange site 1>.com
|--- questions
|---|--- <question 1 id>.md
|---|--- <question 2 id>.md
|---|--- ...
|--- answers
|---|--- <question id associated with answer 1 id>.md
|---|--- <question id associated with answer 2 id>.md
|---|--- ...
<stack exchange site 2>.com
|--- questions
|---|--- <question 1 id>.md
|---|--- <question 2 id>.md
|---|--- ...
|--- answers
|---|--- <question id associated with answer 1 id>.md
|---|--- <question id associated with answer 2 id>.md
|---|--- ...
...
where
- <stack exchange site n> is the name of the n-th SE site associated with a user.
- <question n id> is the question ID associated with the n-th question for the parent SE site.
- <question id associated with answer n id> is the question ID associated with the n-th answer for the parent SE site.
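As an illustration of the layout above, the path of any downloaded file can be built from three pieces: the site name, the kind of directory, and the question ID. The sketch below uses pathlib; the function name markdown_path and its parameters are hypothetical and only mirror the structure shown.

```python
from pathlib import Path


def markdown_path(root, site, kind, question_id):
    """Build <root>/<site>/<kind>/<question id>.md for the layout above."""
    if kind not in ("questions", "answers"):
        raise ValueError("kind must be 'questions' or 'answers'")
    return Path(root) / site / kind / f"{question_id}.md"


path = markdown_path("q_and_a", "stackoverflow.com", "answers", 12345)
print(path.as_posix())
# Creating the parent directories before writing would look like:
# path.parent.mkdir(parents=True, exist_ok=True)
```

Note that files under an answers directory are still named after the question ID, not the answer ID.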
The q_and_a directory will contain Markdown files with the extension .md. Each Markdown file will represent either a question or an answer, depending on whether it is under a questions directory or an answers directory. If the Markdown file represents a question, then the question creator will be you. Otherwise, if the Markdown file represents an answer, the question creator will not be you, but the creator of one of the answers included in the Markdown file will be you. More specifically, each Markdown file will have the following format (text that is inside angle brackets, such as <this>, represents text that will vary for each Markdown file):
Question downloaded from <question link>
Question asked by <user name for question creator> on <question date> at <question time>.
Number of up votes: <number of up votes for question>
Number of down votes: <number of down votes for question>
Score: <overall score associated with the question (number of up votes - number of down votes)>
# <question title>
<question body>
### Comment 1
Comment made by <user name for creator of comment 1 for the question> on <comment 1 date> at <comment 1 time>.
Comment score: <number of up votes for comment 1 for the question>
<comment 1 body>
...
### Comment n
Comment made by <user name for creator of comment n for the question> on <comment n date> at <comment n time>.
Comment score: <number of up votes for comment n for the question>
<comment n for the question body>
## Answer 1
Answer by <user name for creator of answer 1> on <answer 1 date> at <answer 1 time>.
This <is/is not> the accepted answer. <indicates whether this is the accepted answer or not>
Number of up votes: <number of up votes for answer 1>
Number of down votes: <number of down votes for answer 1>
Score: <overall score associated with answer 1 (number of up votes - number of down votes)>
<answer 1 body>
### Comment 1
Comment made by <user name for creator of comment 1 for answer 1> on <comment 1 date> at <comment 1 time>.
Comment score: <number of up votes for comment 1 for answer 1>
<comment 1 for answer 1 body>
...
### Comment n
Comment made by <user name for creator of comment n for answer 1> on <comment n date> at <comment n time>.
Comment score: <number of up votes for comment n for answer 1>
<comment n for answer 1 body>
...
## Answer m
Answer by <user name for creator of answer m> on <answer m date> at <answer m time>.
This <is/is not> the accepted answer. <indicates whether this is the accepted answer or not>
Number of up votes: <number of up votes for answer m>
Number of down votes: <number of down votes for answer m>
Score: <overall score associated with answer m (number of up votes - number of down votes)>
<answer m body>
### Comment 1
Comment made by <user name for creator of comment 1 for answer m> on <comment 1 date> at <comment 1 time>.
Comment score: <number of up votes for comment 1 for answer m>
<comment 1 for answer m body>
...
### Comment n
Comment made by <user name for creator of comment n for answer m> on <comment n date> at <comment n time>.
Comment score: <number of up votes for comment n for answer m>
<comment n for answer m body>
See the "Logic" section below for an overview of how the main.py file works.
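To make the format concrete, the sketch below renders only the question part of a file (everything up to and including the question body). It is not the code in main.py: the dictionary keys, the date and time formats, the use of UTC, and the blank lines between entries are all assumptions made for illustration, and the function name render_question_header is hypothetical.

```python
from datetime import datetime, timezone


def render_question_header(q):
    """Render the question part of a Markdown file from a plain dictionary.

    Assumed (hypothetical) keys: link, owner, creation_date (Unix seconds),
    up_votes, down_votes, title, body.
    """
    created = datetime.fromtimestamp(q["creation_date"], tz=timezone.utc)
    lines = [
        f"Question downloaded from {q['link']}",
        f"Question asked by {q['owner']} on {created:%Y-%m-%d} at {created:%H:%M:%S}.",
        f"Number of up votes: {q['up_votes']}",
        f"Number of down votes: {q['down_votes']}",
        # The score is the number of up votes minus the number of down votes.
        f"Score: {q['up_votes'] - q['down_votes']}",
        f"# {q['title']}",
        q["body"],
    ]
    # Blank lines between entries are an assumption about the spacing.
    return "\n\n".join(lines)


example = {
    "link": "https://example.com/questions/1",
    "owner": "alice",
    "creation_date": 0,
    "up_votes": 5,
    "down_votes": 2,
    "title": "An example title",
    "body": "An example body.",
}
print(render_question_header(example))
```

The comments and answers would follow the same pattern, using the ### Comment and ## Answer headings shown above.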
This section is intended for anyone interested in how the main.py file works, and is optional reading. The following steps are added as comments (as #%% step X) inside the main.py file to indicate which part of the file corresponds to which step below.
The main.py script proceeds as follows:
1. Given a network user ID, obtain the names of all the SE sites associated with this user ID and the corresponding site ID associated with each site.
2. Create the top-level directory q_and_a.
3. For each SE site obtained in step 1:
   (a) Create the questions directory for this SE site.
   (b) Get all questions associated with this user on this SE site.
   (c) For each question associated with this user on this SE site, write the contents of the question, its comments, the answers, and their comments into a Markdown file using the format mentioned in the "Format" section above.
   (d) Create the answers directory for this SE site.
   (e) Get all answers associated with this user on this SE site.
   (f) For each answer associated with this user on this SE site, get the ID of the question associated with the answer.
   (g) For each question ID obtained in step 3(f), get the corresponding question, and then write the contents of the question, its comments, the answers, including yours, and their comments into a Markdown file using the format mentioned in the "Format" section above.
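Steps 3(b) and 3(e) each require reading every page of a listing from the SE API. The sketch below shows one way to do that, relying on the items, has_more, and backoff fields of the response wrapper that the SE API returns. It is an illustration under stated assumptions rather than the code in main.py: the function name iter_items and the fetch_page callable are hypothetical, and the network call is passed in so that the loop can be run without network access.

```python
import time


def iter_items(fetch_page, sleep=time.sleep):
    """Yield every item from a paginated SE API listing.

    fetch_page(page) is a caller-supplied function (hypothetical) that returns
    the decoded JSON wrapper for the given 1-based page number.
    """
    page = 1
    while True:
        wrapper = fetch_page(page)
        yield from wrapper.get("items", [])
        # The API may ask clients to wait this many seconds before the next request.
        if "backoff" in wrapper:
            sleep(wrapper["backoff"])
        if not wrapper.get("has_more", False):
            return
        page += 1


# Hypothetical wiring with the requests package (not executed here):
#
# def fetch_page(page):
#     response = requests.get(
#         f"https://api.stackexchange.com/2.3/users/{user_id}/questions",
#         params={"site": "stackoverflow", "page": page, "pagesize": 100},
#     )
#     response.raise_for_status()
#     return response.json()

# Offline demonstration with two fake pages.
fake_pages = {
    1: {"items": ["q1", "q2"], "has_more": True, "backoff": 3},
    2: {"items": ["q3"], "has_more": False},
}
waits = []
print(list(iter_items(fake_pages.__getitem__, sleep=waits.append)), waits)
```

Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP details, so the same loop could serve the questions listing in step 3(b) and the answers listing in step 3(e).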
There are alternative ways of downloading all your questions and answers from each SE site:
- There exists a Python wrapper for the SE API called stackapi that is built on top of the requests package. Although this wrapper provides a nice interface to the SE API, my goal here was to use as few dependencies as possible to lower the risk of obsolescence later on.
- The SE data explorer provides another way of obtaining a copy of all your questions and answers across all SE sites, via this query for example. However, this query only returns a CSV file, from which the relevant content will need to be parsed and then written to Markdown files. Additionally, I am personally not familiar with SQL, so I preferred the approach used in the main.py file.