bug - duplicates merged cell text following issue #2106 #3250

Open
veredmm opened this issue Jun 19, 2024 · 2 comments
Labels
awaiting-response docx Related to Microsoft Word (.docx) file format enhancement New feature or request

Comments

@veredmm

veredmm commented Jun 19, 2024

I'm still seeing the duplicated-text problem with this kind of table structure:

merged_table2.docx

Table in the document:

[image]

After partition_docx:

[image]

python-docx 1.1.2
unstructured 0.14.3

@veredmm veredmm added the bug Something isn't working label Jun 19, 2024
@christinestraub christinestraub added the docx Related to Microsoft Word (.docx) file format label Jun 19, 2024
@scanny
Collaborator

scanny commented Jun 20, 2024

@veredmm I'm getting "HEADER 5 4 3 2 1 AAA BBB CCC" as elements[0].text for that document, which is the expected behavior and does not repeat the text in that merged cell.

The .metadata.text_as_html for that Table element is this uniform 3 row x 8 col table:

  <table>
    <thead>
      <tr>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>5</td>
        <td>4</td>
        <td>4</td>
        <td>3</td>
        <td>2</td>
        <td>2</td>
        <td>1</td>
        <td>1</td>
      </tr>
      <tr>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
      </tr>
    </tbody>
  </table>

The HTML table in .text_as_html is purposely made "uniform" (same number of cells in each row), which is why the same content appears in each "grid" cell of a merged cell.
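To illustrate what "uniform" means here, a minimal sketch follows. It is not the library's implementation: `to_uniform_grid` is a hypothetical helper, and the column spans are only inferred from the HTML output above (for example, "4" occupying two grid columns). It shows how repeating a merged cell's text once per spanned column gives every row the same number of cells.

```python
def to_uniform_grid(rows):
    """Expand merged cells so every row has one entry per grid column.

    `rows` is a list of rows; each row is a list of (text, colspan) tuples.
    A cell spanning N columns contributes its text N times.
    """
    return [[text for text, span in row for _ in range(span)] for row in rows]


# Spans below are an assumption, inferred from the HTML output in this thread.
rows = [
    [("HEADER", 8)],
    [("5", 1), ("4", 2), ("3", 1), ("2", 2), ("1", 2)],
    [("AAA\nBBB\nCCC", 8)],
]

grid = to_uniform_grid(rows)
for row in grid:
    print(len(row), row)
```

Each row of `grid` ends up with 8 entries, which matches the 3 row x 8 col shape of the table above.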

If you think that should look differently, please suggest (in HTML) what you think it should look like instead and we'll consider a change.

@scanny scanny added enhancement New feature or request awaiting-response and removed bug Something isn't working labels Jun 20, 2024
@veredmm
Author

veredmm commented Jun 24, 2024

Thanks, @scanny.
I would suggest that the content of a merged cell appear only in the first cell (td) of the table row, with the other cells left empty.
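Until something like that is decided, a possible workaround is to post-process the `.metadata.text_as_html` string. The sketch below is only an approximation under stated assumptions: `blank_repeated_cells` is a hypothetical helper, not part of unstructured, and it assumes the table HTML is well-formed enough for `xml.etree.ElementTree` to parse. It keeps the first cell of each run of identical adjacent cells in a row and empties the rest. Because it only compares text, it cannot tell a merged cell from two genuinely separate cells that happen to hold the same value.

```python
import xml.etree.ElementTree as ET


def blank_repeated_cells(table_html: str) -> str:
    """Empty every cell whose text repeats the preceding cell in the same row."""
    table = ET.fromstring(table_html)
    for row in table.iter("tr"):
        previous = None
        for cell in row:
            text = cell.text or ""
            if previous is not None and text != "" and text == previous:
                # Same text as the cell that started this run: treat as merged.
                cell.text = ""
            else:
                previous = text
    # short_empty_elements=False emits <td></td> rather than <td />.
    return ET.tostring(table, encoding="unicode", short_empty_elements=False)


sample = (
    "<table><tr><td>5</td><td>4</td><td>4</td><td>3</td></tr>"
    "<tr><td>A</td><td>A</td><td>A</td><td>A</td></tr></table>"
)
cleaned = blank_repeated_cells(sample)
print(cleaned)
```

The table stays uniform (same number of cells per row), so code that relies on a regular grid keeps working; only the repeated text is dropped.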

3 participants