Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid storing empty entries to optimise the column size in BigQuery #142

Open
max-ostapenko opened this issue Sep 8, 2024 · 0 comments
Open
Labels
enhancement New feature or request

Comments

@max-ostapenko
Copy link
Contributor

On the custom metrics objects just storing the whole schema itself is a big volume of data.
We could prune empty values that don't hold any meaningful value.

For example, the entries that equal to

  • undefined
  • null

and potentially

  • []
  • {}

Example: https://www.diffchecker.com/YcalWAbO/

This can be set up as a test on the custom_metrics repo PRs, to point to required optimisations in custom metrics code.
Or alternatively trimmed automatically on the crawler side.

Estimated saving - 3%.

CREATE TEMP FUNCTION TRIM_EMPTY_ENTRIES(json_str STRING, remove_empty BOOL)
RETURNS STRING
LANGUAGE js AS """
function isEmpty(value) {
    if (value === null || value === undefined) {
        return true;
    }
    
    if (typeof value === 'string' && value.trim() === '') {
        return true;
    }
    
    if (Array.isArray(value) && value.length === 0) {
        return true;
    }
    
    if (typeof value === 'object' && !Array.isArray(value)) {
        return Object.keys(value).length === 0;
    }
    
    return false;
}

function findEmptyEntries(obj, currentPath = '', removeEmpty = false) {
    let emptyPaths = [];

    if (typeof obj === 'object' && obj !== null) {
        for (let key in obj) {
            if (obj.hasOwnProperty(key)) {
                const newPath = currentPath ? `${currentPath}.${key}` : key;

                if (isEmpty(obj[key])) {
                    emptyPaths.push(newPath);
                    if (removeEmpty) {
                        delete obj[key];
                    }
                } else if (typeof obj[key] === 'object') {
                    emptyPaths = emptyPaths.concat(findEmptyEntries(obj[key], newPath, removeEmpty));
                    
                    // Remove empty objects after recursion if needed
                    if (removeEmpty && isEmpty(obj[key])) {
                        delete obj[key];
                    }
                }
            }
        }
    }

    return emptyPaths;
}

try {
  let obj = JSON.parse(json_str);
  findEmptyEntries(obj, '', remove_empty);
  return JSON.stringify(obj);
} catch {
  return json_str;
}
""";

SELECT
  page,
  custom_metrics,
  TRIM_EMPTY_ENTRIES(custom_metrics, true) AS cleaned_custom_metrics
FROM `all.pages` TABLESAMPLE SYSTEM (0.1 PERCENT)
WHERE date = '2024-08-01'
@max-ostapenko max-ostapenko added the enhancement New feature or request label Sep 8, 2024
@max-ostapenko max-ostapenko changed the title Avoid storing meaningless entries to optimise the column size in BigQuery Avoid storing empty entries to optimise the column size in BigQuery Sep 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant