Onlin'in
Automation | 2 min read

Extracting Clean Text from a .docx in n8n (Without any API)

Published on November 15, 2024

"I'm handed Word CVs, I have zero API access, zero root privileges, yet I still need the raw content -- yesterday." Here is a solution using n8n's built-in capabilities and JavaScript snippets.

Why even bother?

  • A .docx file is simply a renamed ZIP archive
  • Human-readable content resides in word/document.xml
  • The n8n instance has restricted permissions preventing external installations or shell commands
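
The first bullet is easy to verify: ZIP archives normally begin with the four bytes "PK\x03\x04", and so does a .docx. The sketch below is a minimal check under that assumption; looksLikeZip and the sample buffers are hypothetical names used only for illustration, not part of the workflow.

```javascript
// Hypothetical helper: true when the buffer starts with the ZIP
// local-file-header signature "PK\x03\x04".
function looksLikeZip(buffer) {
  const signature = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(signature);
}

// Stand-in bytes for demonstration; a real .docx would come from disk
// or from the binary property of an n8n item.
const fakeDocx = Buffer.concat([
  Buffer.from([0x50, 0x4b, 0x03, 0x04]),
  Buffer.from('rest of the archive'),
]);
const plainText = Buffer.from('Hello, world');

console.log(looksLikeZip(fakeDocx));  // true
console.log(looksLikeZip(plainText)); // false
```

A check like this can serve as a guard before the rename step, so that a mislabeled file fails early instead of inside the Compression node.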

The workflow at a glance

Download -> Rename_to_zip -> Decompress -> Pick_document_xml -> Extract XML -> Scrape_Text

1 - Grab the file

Use a standard SFTP Download or HTTP GET node; the binary data appears under binary.data.

2 - Masquerade the .docx as a .zip

n8n's Compression node picks its decompression routine from the file extension, and .docx is not one it recognizes. The binary metadata must therefore be relabeled as ZIP first.

Rename_to_zip code:

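    // Relabel every incoming .docx so the Compression node treats it as a ZIP archive
    for (const item of items) {
      const bin = item.binary.data;
      bin.fileName = bin.fileName.replace(/\.docx$/i, '.zip');
      bin.fileExtension = 'zip';
      bin.mimeType = 'application/zip';
    }
    return items;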

3 - Unzip the archive

Use Compression -> Decompress with Input Binary Field(s) set to data.

4 - Pick document.xml and toss the rest

Pick_document_xml code:

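    // Keep only document.xml; each unzipped file is its own binary property on the item
    const file = Object.values(items[0].binary).find((f) => f.fileName === 'document.xml');
    if (!file) {
      // Fail loudly instead of silently returning no items
      throw new Error('document.xml not found in the archive');
    }
    return [{ json: {}, binary: { data: file } }];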

5 - Turn XML into JSON

Use Extract From File with File Format set to XML and Binary Property set to data. The node places the file's XML, as a string, in json.data.

6 - Harvest the paragraphs, clean them up

Scrape_Text code:

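    // json.data holds the XML string produced by the previous node
    const xml = String(items[0].json.data);
    // Require whitespace or '>' right after the tag name so that <w:pPr>, <w:tab/>,
    // <w:tbl> and similar tags are not mistaken for <w:p> or <w:t>
    const paraRegex = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
    const wTRegex = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;
    const paragraphs = [];
    let pMatch;
    while ((pMatch = paraRegex.exec(xml))) {
      const inner = pMatch[1];
      const parts = [];
      let tMatch;
      // Collect every text run inside the current paragraph
      while ((tMatch = wTRegex.exec(inner))) {
        parts.push(tMatch[1]);
      }
      // Collapse whitespace and skip empty paragraphs
      const txt = parts.join('').replace(/\s+/g, ' ').trim();
      if (txt) paragraphs.push(txt);
    }
    return [{ json: { paragraphs, text: paragraphs.join('\n\n') } }];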
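
One caveat: the scraper returns text exactly as it is stored in the XML, so an ampersand in a CV still reads "&amp;" in the output. The helper below is a minimal sketch of one way to decode such references, assuming only the five predefined XML entities and numeric character references need handling. decodeXmlEntities is a hypothetical name, not something n8n provides.

```javascript
// Hypothetical helper: decode the five predefined XML entities plus
// decimal (&#233;) and hexadecimal (&#xE9;) character references.
function decodeXmlEntities(str) {
  const named = { lt: '<', gt: '>', quot: '"', apos: "'", amp: '&' };
  // A single pass avoids double decoding: "&amp;lt;" becomes "&lt;", not "<".
  return str.replace(/&(#x[0-9a-fA-F]+|#\d+|lt|gt|quot|apos|amp);/g, (match, ref) => {
    if (ref[0] !== '#') return named[ref];
    const code = ref[1] === 'x'
      ? parseInt(ref.slice(2), 16)
      : parseInt(ref.slice(1), 10);
    // Leave out-of-range references untouched rather than throwing.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });
}

console.log(decodeXmlEntities('R&amp;D &lt;lead&gt;')); // R&D <lead>
```

Inside Scrape_Text the helper would wrap each run, for example parts.push(decodeXmlEntities(tMatch[1])), so decoding happens before the whitespace cleanup.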

Typical output

The output is a JSON object with a paragraphs array and a text field joining all paragraphs with double newlines.
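
To make that shape concrete, here is a small sketch that runs the same paragraph and run matching outside n8n on a hand-written fragment. scrapeText and sampleXml are illustrative names, and the fragment is far simpler than a real word/document.xml.

```javascript
// Same paragraph/run logic as Scrape_Text, wrapped in a function so it
// can be tried outside n8n.
function scrapeText(xml) {
  const paraRegex = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
  const textRegex = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;
  const paragraphs = [];
  for (const pMatch of xml.matchAll(paraRegex)) {
    // Join all text runs of the paragraph, then normalize whitespace.
    const parts = [...pMatch[1].matchAll(textRegex)].map((m) => m[1]);
    const txt = parts.join('').replace(/\s+/g, ' ').trim();
    if (txt) paragraphs.push(txt);
  }
  return { paragraphs, text: paragraphs.join('\n\n') };
}

// Hand-written sample: a styled title, a paragraph split over two runs,
// and an empty paragraph that should be dropped.
const sampleXml =
  '<w:body>' +
  '<w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr><w:r><w:t>Jane Doe</w:t></w:r></w:p>' +
  '<w:p><w:r><w:t xml:space="preserve">Senior </w:t></w:r><w:r><w:t>Engineer</w:t></w:r></w:p>' +
  '<w:p/>' +
  '</w:body>';

const output = scrapeText(sampleXml);
console.log(JSON.stringify(output, null, 2));
// paragraphs: ["Jane Doe", "Senior Engineer"]
// text:       "Jane Doe\n\nSenior Engineer"
```

Keeping a fragment like this around gives a quick regression check whenever the regular expressions are adjusted.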

Why this works and keeps running

  • Zero external dependencies -- ideal for locked-down environments
  • Copy-paste friendly code for any workflow
  • Clean, compact text suitable for GPT, Elasticsearch, SQL, or other systems

Final thoughts

No external libraries are needed to extract the text of a Word file: six vanilla nodes, three of them short Code scripts, cover the whole pipeline from download to clean paragraphs.
