Feat/135772 135785 bulk export page to local fs in pdf #8646

Closed
Changes from 24 commits
Commits
27 commits
7c68da6
add format parameter
arafubeatbox Feb 4, 2024
586a651
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Feb 7, 2024
da8cb4b
batch export to local pdf zip file
arafubeatbox Feb 12, 2024
79df682
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Feb 16, 2024
ca25e9c
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Feb 21, 2024
39a296b
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Mar 31, 2024
d236083
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Apr 11, 2024
8a6e352
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Apr 11, 2024
434231f
handle puppeteer browser errors
arafubeatbox Apr 11, 2024
cbc3030
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Apr 21, 2024
1d09fb1
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Jun 30, 2024
d9caa80
fix wrong merge
arafubeatbox Jun 30, 2024
57b256c
Merge feat/149507-149508-export-files-tomprarily-to-fs-before-upload …
arafubeatbox Jun 30, 2024
153414d
wait for page close before next pdf convert
arafubeatbox Jul 3, 2024
616bef0
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Jul 4, 2024
7935b37
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Jul 4, 2024
f79fe98
resolve unresolved conflict
arafubeatbox Jul 4, 2024
e7a980e
fix gcs multipart upload path
arafubeatbox Jul 4, 2024
c26c25c
use puppeteer cluster
arafubeatbox Jul 10, 2024
03d2354
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Jul 10, 2024
ff8c523
retry pdf conversion on fail in bulk export
arafubeatbox Jul 11, 2024
5aab975
trivial refactor
arafubeatbox Jul 11, 2024
39d4e36
add comments
arafubeatbox Jul 11, 2024
2196065
fix comments
arafubeatbox Jul 11, 2024
bc23f4d
install chromium in production
arafubeatbox Jul 11, 2024
87f208b
remove gc
arafubeatbox Jul 11, 2024
594fe5a
Merge branch 'feat/page-bulk-export' into feat/135772-135785-bulk-exp…
arafubeatbox Jul 14, 2024
1 change: 1 addition & 0 deletions apps/app/.env.development
@@ -30,3 +30,4 @@ QUESTIONNAIRE_SERVER_ORIGIN="http://host.docker.internal:3003"
# AUDIT_LOG_ACTION_GROUP_SIZE=SMALL
# AUDIT_LOG_ADDITIONAL_ACTIONS=
# AUDIT_LOG_EXCLUDE_ACTIONS=
PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium
4 changes: 4 additions & 0 deletions apps/app/package.json
@@ -161,6 +161,8 @@
"passport-ldapauth": "^3.0.1",
"passport-local": "^1.0.0",
"passport-saml": "^3.2.0",
"puppeteer": "^22.0.0",
"puppeteer-cluster": "^0.24.0",
"qs": "^6.11.1",
"rate-limiter-flexible": "^2.3.7",
"react": "^18.2.0",
@@ -186,10 +188,12 @@
"rehype-sanitize": "^5.0.1",
"rehype-slug": "^5.0.1",
"rehype-toc": "^3.0.2",
"remark": "^13.0.0",
"remark-breaks": "^3.0.2",
"remark-emoji": "^3.0.2",
"remark-frontmatter": "^4.0.1",
"remark-gfm": "^3.0.1",
"remark-html": "^11.0.0",
"remark-math": "^5.1.1",
"remark-toc": "^8.0.1",
"remark-wiki-link": "^1.0.4",
Original file line number Diff line number Diff line change
@@ -38,7 +38,7 @@ const PageBulkExportSelectModal = (): JSX.Element => {
</small>
</div>
<div className="d-flex justify-content-center mt-2">
<button className="btn btn-primary" type="button" onClick={() => startBulkExport(PageBulkExportFormat.markdown)}>
<button className="btn btn-primary" type="button" onClick={() => startBulkExport(PageBulkExportFormat.md)}>
{t('page_export.markdown')}
</button>
<button className="btn btn-primary ms-2" type="button" onClick={() => startBulkExport(PageBulkExportFormat.pdf)}>PDF</button>
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@ import type {
} from '@growi/core';

export const PageBulkExportFormat = {
markdown: 'markdown',
md: 'md',
pdf: 'pdf',
} as const;

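The renamed key doubles as the exported file's extension in getPageWritable later in this diff, which is presumably why 'markdown' was shortened to 'md'. A minimal TypeScript sketch with an illustrative page path (the path and the resulting string are assumptions, not values from this PR):

import { normalizePath } from '@growi/core/dist/utils/path-utils';

// PageBulkExportFormat.md === 'md'; the format value is appended directly as the file extension
const format = 'md';
const fileName = `${normalizePath('/Sandbox/Math')}.${format}`; // e.g. '/Sandbox/Math.md'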
Original file line number Diff line number Diff line change
@@ -40,7 +40,7 @@ module.exports = (crowi: Crowi): Router => {
};

try {
await pageBulkExportService?.createAndStartPageBulkExportJob(path, req.user, activityParameters);
await pageBulkExportService?.createAndStartPageBulkExportJob(path, format, req.user, activityParameters);
return res.apiv3({}, 204);
}
catch (err) {
Original file line number Diff line number Diff line change
@@ -4,14 +4,16 @@ import type { Readable } from 'stream';
import { Writable, pipeline } from 'stream';
import { pipeline as pipelinePromise } from 'stream/promises';


import type { HasObjectId } from '@growi/core';
import { type IPage, isPopulated, SubscriptionStatusType } from '@growi/core';
import { getParentPath, normalizePath } from '@growi/core/dist/utils/path-utils';
import type { Archiver } from 'archiver';
import archiver from 'archiver';
import gc from 'expose-gc/function';
import mongoose from 'mongoose';
import { Cluster } from 'puppeteer-cluster';
import remark from 'remark';
import html from 'remark-html';

import type { SupportedActionType } from '~/interfaces/activity';
import { SupportedAction, SupportedTargetModel } from '~/interfaces/activity';
@@ -21,6 +23,7 @@ import { Attachment } from '~/server/models';
import type { ActivityDocument } from '~/server/models/activity';
import type { PageModel, PageDocument } from '~/server/models/page';
import Subscription from '~/server/models/subscription';
import { configManager } from '~/server/service/config-manager';
import type { FileUploader } from '~/server/service/file-uploader';
import type { IMultipartUploader } from '~/server/service/file-uploader/multipart-uploader';
import { preNotifyService } from '~/server/service/pre-notify';
@@ -56,12 +59,16 @@ class PageBulkExportService {
// TODO: If necessary, change to a proper path in https://redmine.weseek.co.jp/issues/149512
tmpOutputRootDir = '/tmp';

puppeteerCluster: Cluster | undefined;

constructor(crowi) {
this.crowi = crowi;
this.activityEvent = crowi.event('activity');
}

async createAndStartPageBulkExportJob(basePagePath: string, currentUser, activityParameters: ActivityParameters): Promise<void> {
async createAndStartPageBulkExportJob(
basePagePath: string, format: PageBulkExportFormat, currentUser, activityParameters: ActivityParameters,
): Promise<void> {
const Page = mongoose.model<IPage, PageModel>('Page');
const basePage = await Page.findByPathAndViewer(basePagePath, currentUser, null, true);

@@ -72,24 +79,28 @@
const pageBulkExportJob: PageBulkExportJobDocument & HasObjectId = await PageBulkExportJob.create({
user: currentUser,
page: basePage,
format: PageBulkExportFormat.markdown,
format,
});

await Subscription.upsertSubscription(currentUser, SupportedTargetModel.MODEL_PAGE_BULK_EXPORT_JOB, pageBulkExportJob, SubscriptionStatusType.SUBSCRIBE);

this.bulkExportWithBasePagePath(basePagePath, currentUser, activityParameters, pageBulkExportJob);
this.bulkExportWithBasePagePath(basePagePath, format, currentUser, activityParameters, pageBulkExportJob);
}

async bulkExportWithBasePagePath(
basePagePath: string, currentUser, activityParameters: ActivityParameters, pageBulkExportJob: PageBulkExportJobDocument & HasObjectId,
private async bulkExportWithBasePagePath(
basePagePath: string,
format: PageBulkExportFormat,
currentUser,
activityParameters: ActivityParameters,
pageBulkExportJob: PageBulkExportJobDocument & HasObjectId,
): Promise<void> {
const timeStamp = (new Date()).getTime();
const exportName = `page-bulk-export-${timeStamp}`;

// export pages to fs temporarily
const tmpOutputDir = `${this.tmpOutputRootDir}/${exportName}`;
try {
await this.exportPagesToFS(basePagePath, tmpOutputDir, currentUser);
await this.exportPagesToFS(basePagePath, tmpOutputDir, currentUser, format);
}
catch (err) {
await this.handleExportError(err, activityParameters, pageBulkExportJob, tmpOutputDir);
@@ -152,9 +163,9 @@
}
}

private async exportPagesToFS(basePagePath: string, outputDir: string, currentUser): Promise<void> {
private async exportPagesToFS(basePagePath: string, outputDir: string, currentUser, format: PageBulkExportFormat): Promise<void> {
const pagesReadable = await this.getPageReadable(basePagePath, currentUser);
const pagesWritable = this.getPageWritable(outputDir);
const pagesWritable = await this.getPageWritable(outputDir, format);

return pipelinePromise(pagesReadable, pagesWritable);
}
@@ -180,21 +191,22 @@
/**
* Get a Writable that writes the page body temporarily to fs
*/
private getPageWritable(outputDir: string): Writable {
private async getPageWritable(outputDir: string, format: PageBulkExportFormat): Promise<Writable> {
return new Writable({
objectMode: true,
write: async(page: PageDocument, encoding, callback) => {
try {
const revision = page.revision;

if (revision != null && isPopulated(revision)) {
const markdownBody = revision.body;
const pathNormalized = `${normalizePath(page.path)}.md`;
const pageBody = format === PageBulkExportFormat.pdf ? (await this.convertMdToPdf(revision.body)) : revision.body;
gc();
const pathNormalized = `${normalizePath(page.path)}.${format}`;
const fileOutputPath = path.join(outputDir, pathNormalized);
const fileOutputParentPath = getParentPath(fileOutputPath);

await fs.promises.mkdir(fileOutputParentPath, { recursive: true });
await fs.promises.writeFile(fileOutputPath, markdownBody);
await fs.promises.writeFile(fileOutputPath, pageBody);
}
}
catch (err) {
@@ -268,6 +280,73 @@
});
}

/**
* Initialize puppeteer cluster for converting markdown to pdf
*/
async initPuppeteerCluster(): Promise<void> {
this.puppeteerCluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: configManager.getConfig('crowi', 'app:bulkExportPuppeteerClusterMaxConcurrency'),
workerCreationDelay: 10000,
monitor: true,
});

await this.puppeteerCluster.task(async({ page, data: htmlString }) => {
await page.setContent(htmlString, { waitUntil: 'domcontentloaded' });
await page.emulateMediaType('screen');
const pdfResult = await page.pdf({
margin: {
top: '100px', right: '50px', bottom: '100px', left: '50px',
},
printBackground: true,
format: 'A4',
});
return pdfResult;
});

// close cluster on app termination
const handleClose = async() => {
logger.info('Closing puppeteer cluster...');
await this.puppeteerCluster?.idle();
await this.puppeteerCluster?.close();
process.exit();
};
process.on('SIGINT', handleClose);
process.on('SIGTERM', handleClose);
}

/**
* Convert markdown string to html, then to PDF
* When PDF conversion is unstable and error occurs, it will retry up to the specified limit

Contributor Author (review comment): When there are many conversion requests, puppeteer errors can occur, so the conversion may be retried up to the specified number of times.

*/
private async convertMdToPdf(md: string): Promise<Buffer> {
const executeConvert = async(htmlString: string, retries: number) => {
if (this.puppeteerCluster == null) {
throw new Error('Puppeteer cluster is not initialized');
}

try {
return await this.puppeteerCluster.execute(htmlString);
}
catch (err) {
if (retries > 0) {
logger.error('Failed to convert markdown to pdf. Retrying...', err);
return executeConvert(htmlString, retries - 1);
}
throw err;
}
};

const htmlString = (await remark()
.use(html)
.process(md))
.toString();

const result = await executeConvert(htmlString, configManager.getConfig('crowi', 'app:bulkExportPuppeteerRetryLimit'));

return result;
}

private async notifyExportResult(
activityParameters: ActivityParameters, pageBulkExportJob: PageBulkExportJobDocument, action: SupportedActionType,
) {
@@ -290,6 +369,7 @@

// eslint-disable-next-line import/no-mutable-exports
export let pageBulkExportService: PageBulkExportService | undefined; // singleton instance
export default function instanciate(crowi): void {
export default async function instanciate(crowi): Promise<void> {
pageBulkExportService = new PageBulkExportService(crowi);
await pageBulkExportService.initPuppeteerCluster();
}
2 changes: 1 addition & 1 deletion apps/app/src/server/crowi/index.js
@@ -694,7 +694,7 @@ Crowi.prototype.setupExport = async function() {
};

Crowi.prototype.setupPageBulkExportService = async function() {
instanciatePageBulkExportService(this);
await instanciatePageBulkExportService(this);
};

Crowi.prototype.setupImport = async function() {
12 changes: 12 additions & 0 deletions apps/app/src/server/service/config-loader.ts
@@ -735,6 +735,18 @@ const ENV_VAR_NAME_TO_CONFIG_INFO = {
type: ValueType.NUMBER,
default: 172800, // 2 days
},
BULK_EXPORT_PUPPETEER_CLUSTER_MAX_CONCURRENCY: {
ns: 'crowi',
key: 'app:bulkExportPuppeteerClusterMaxConcurrency',
type: ValueType.NUMBER,
default: 10,
},
BULK_EXPORT_PUPPETEER_RETRY_LIMIT: {
ns: 'crowi',
key: 'app:bulkExportPuppeteerRetryLimit',
type: ValueType.NUMBER,
default: 5,
},
};


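A minimal sketch of overriding the two settings added above via environment variables, following the same .env convention used elsewhere in this PR; the values are illustrative assumptions, not recommendations (the built-in defaults are 10 workers and 5 retries):

# example overrides for bulk-export PDF conversion (illustrative values)
BULK_EXPORT_PUPPETEER_CLUSTER_MAX_CONCURRENCY=4
BULK_EXPORT_PUPPETEER_RETRY_LIMIT=3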
8 changes: 4 additions & 4 deletions apps/app/src/server/service/export.ts
@@ -7,13 +7,13 @@ import archiver from 'archiver';
import { toArrayIfNot } from '~/utils/array-utils';
import loggerFactory from '~/utils/logger';

import CollectionProgress from '../models/vo/collection-progress';
import type CollectionProgress from '../models/vo/collection-progress';
import CollectionProgressingStatus from '../models/vo/collection-progressing-status';

import AppService from './app';
import type AppService from './app';
import ConfigLoader from './config-loader';
import GrowiBridgeService from './growi-bridge';
import { ZipFileStat } from './interfaces/export';
import type GrowiBridgeService from './growi-bridge';
import type { ZipFileStat } from './interfaces/export';


const logger = loggerFactory('growi:services:ExportService');
1 change: 0 additions & 1 deletion apps/app/src/server/service/file-uploader/gcs/index.ts
@@ -14,7 +14,6 @@ import {
} from '../file-uploader';
import { ContentHeaders } from '../utils';

import type { IGcsMultipartUploader } from './multipart-uploader';
import { GcsMultipartUploader } from './multipart-uploader';

const logger = loggerFactory('growi:service:fileUploaderGcs');
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
import type { Bucket, File } from '@google-cloud/storage';
// eslint-disable-next-line no-restricted-imports
import axios from 'axios';
import urljoin from 'url-join';

import loggerFactory from '~/utils/logger';

import { configManager } from '../../config-manager';
import { MultipartUploader, UploadStatus, type IMultipartUploader } from '../multipart-uploader';

const logger = loggerFactory('growi:services:fileUploaderGcs:multipartUploader');
@@ -26,7 +28,8 @@ export class GcsMultipartUploader extends MultipartUploader implements IGcsMulti
constructor(bucket: Bucket, uploadKey: string, maxPartSize: number) {
super(uploadKey, maxPartSize);

this.file = bucket.file(this.uploadKey);
const namespace = configManager.getConfig('crowi', 'gcs:uploadNamespace');
this.file = bucket.file(urljoin(namespace || '', uploadKey));

Contributor Author (review comment): The namespace for the upload destination was not being applied, so the object path did not match the path used when retrieving the file; this has been fixed. (A short sketch of the resulting object path follows this file's diff.)

}

async initUpload(): Promise<void> {
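A minimal TypeScript sketch of how the fix above changes the resulting GCS object path; the namespace and upload key below are hypothetical values, not values from this PR:

import urljoin from 'url-join';

// assumed values for illustration
const namespace = 'my-growi-namespace';            // configManager.getConfig('crowi', 'gcs:uploadNamespace')
const uploadKey = 'page-bulk-export/archive.zip';  // hypothetical upload key

// Before the fix: bucket.file(uploadKey)                           -> 'page-bulk-export/archive.zip'
// After the fix:  bucket.file(urljoin(namespace || '', uploadKey)) -> 'my-growi-namespace/page-bulk-export/archive.zip'
console.log(urljoin(namespace || '', uploadKey));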
2 changes: 1 addition & 1 deletion apps/app/src/server/service/interfaces/export.ts
@@ -1,4 +1,4 @@
import { Stats } from 'fs';
import type { Stats } from 'fs';

export type ZipFileStat = {
meta: object;