Extracting Clean Text from a .docx in n8n (Without any API)

"I'm handed Word CVs, I have zero API access, zero root privileges, yet I still need the raw content -- yesterday." Here is a solution using n8n's built-in capabilities and JavaScript snippets.

Why even bother?

A .docx file is simply a renamed ZIP archive
Human-readable content resides in word/document.xml
The n8n instance is locked down: no external packages can be installed and no shell commands can be run
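
A quick way to convince yourself of the first point: every non-empty ZIP archive starts with the bytes "PK\x03\x04", and so does a .docx. The sketch below is plain Node.js, not part of the workflow; the helper name looksLikeZip and the file name cv.docx are made up for illustration.

```javascript
// Returns true when the buffer starts with the ZIP local-file-header
// signature "PK\x03\x04" (bytes 0x50 0x4B 0x03 0x04).
function looksLikeZip(buffer) {
  const signature = [0x50, 0x4b, 0x03, 0x04];
  return buffer.length >= 4 && signature.every((byte, i) => buffer[i] === byte);
}

// Hypothetical usage in a local Node.js script (file name is an example):
// const fs = require('fs');
// console.log(looksLikeZip(fs.readFileSync('cv.docx')));
```

If the check fails on a file that claims to be a .docx, it is probably an old binary .doc that was renamed, and the rest of this workflow will not help.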

The workflow at a glance

Download -> Rename_to_zip -> Decompress -> Pick_document_xml -> Extract From File (XML) -> Scrape_Text

1 - Grab the file

Use a standard SFTP Download or HTTP GET node; the binary data appears under binary.data.

2 - Masquerade the .docx as a .zip

n8n's Compression node only unpacks files it recognizes as archives, and a .docx extension is not one of them, so the binary metadata must be changed first.

Rename_to_zip code:

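// Relabel the downloaded file so the Compression node treats it as a ZIP.
// Note: only the first input item is touched.
const bin = items[0].binary.data;
bin.fileName = bin.fileName.replace(/\.docx$/i, '.zip');
bin.fileExtension = 'zip';
bin.mimeType = 'application/zip';
return items;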
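
If several CVs arrive in one run, the snippet above only relabels the first. A possible variant that handles every item is sketched here; the helper names asZip and renameAll are my own, and the fallback name document.docx is an assumption for the case where no file name was set.

```javascript
// Returns a copy of an n8n binary entry relabelled as a ZIP archive.
function asZip(bin) {
  return {
    ...bin,
    // 'document.docx' is an assumed fallback when the download set no name.
    fileName: (bin.fileName || 'document.docx').replace(/\.docx$/i, '.zip'),
    fileExtension: 'zip',
    mimeType: 'application/zip',
  };
}

// Relabels every item that carries a binary "data" property;
// items without one are passed through unchanged.
function renameAll(items) {
  return items.map((item) =>
    item.binary && item.binary.data
      ? { ...item, binary: { ...item.binary, data: asZip(item.binary.data) } }
      : item
  );
}

// In a Code node set to run once for all items:
// return renameAll(items);
```

It copies the objects instead of mutating them, which makes the helper easy to try outside n8n.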

3 - Unzip the archive

Use Compression -> Decompress with Input Binary Field(s) set to data.

4 - Pick document.xml and toss the rest

Pick_document_xml code:

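// Keep only document.xml from the unzipped files and expose it as binary.data.
// Returns no items if the archive does not contain it.
const result = [];
for (const file of Object.values(items[0].binary)) {
  if (file.fileName === 'document.xml') {
    result.push({ binary: { data: file } });
    break;
  }
}
return result;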
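
Matching on the file name alone is usually enough. For a slightly stricter pick, the sketch below prefers the entry that sits in the word/ folder. It assumes each unzipped entry carries a directory field next to fileName; check the Decompress output on your instance before relying on that. The helper names are hypothetical.

```javascript
// Picks the document.xml entry from one item's binary map, or null if absent.
function pickDocumentXml(binary) {
  const files = Object.values(binary || {});
  const named = files.filter((f) => f.fileName === 'document.xml');
  // Assumption: entries expose the archive folder as `directory`.
  // Prefer word/document.xml, otherwise fall back to the first name match.
  return named.find((f) => f.directory === 'word') || named[0] || null;
}

// Runs the pick for every item and drops items without a document.xml.
function pickAll(items) {
  const out = [];
  for (const item of items) {
    const file = pickDocumentXml(item.binary);
    if (file) out.push({ json: {}, binary: { data: file } });
  }
  return out;
}

// In a Code node set to run once for all items:
// return pickAll(items);
```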

5 - Turn XML into JSON

Use Extract From File with File Format set to XML and Binary Property set to data.

6 - Harvest the paragraphs, clean them up

Scrape_Text code:
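
A minimal sketch of what this node can look like, not a definitive implementation. It assumes the previous step yields JSON keyed by the original tag names (w:document, w:body, w:p, w:r, w:t), that a repeated tag may show up as either a single value or an array, and that a w:t carrying attributes keeps its text under "_". Inspect the real output of your Extract From File node and adjust the keys if they differ. All function names are my own.

```javascript
// Normalises "one value or an array of values" to an array.
const asArray = (v) => (v === undefined || v === null ? [] : Array.isArray(v) ? v : [v]);

// Text of one w:t value: a plain string, or an object holding the text under "_".
function textOf(t) {
  if (typeof t === 'string') return t;
  if (t && typeof t === 'object' && typeof t._ === 'string') return t._;
  return '';
}

// Concatenates the text of every w:t found anywhere under `node`.
function collectRuns(node) {
  if (Array.isArray(node)) return node.map(collectRuns).join('');
  if (!node || typeof node !== 'object') return '';
  let out = '';
  for (const [key, value] of Object.entries(node)) {
    out += key === 'w:t' ? asArray(value).map(textOf).join('') : collectRuns(value);
  }
  return out;
}

// Gathers every w:p node under `node`; it does not look for paragraphs
// nested inside a paragraph it has already found.
function findParagraphs(node, found = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => findParagraphs(child, found));
  } else if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key === 'w:p') found.push(...asArray(value));
      else findParagraphs(value, found);
    }
  }
  return found;
}

// Builds the final payload: cleaned paragraphs plus one joined text field.
function scrapeText(json) {
  const paragraphs = findParagraphs(json)
    .map((p) => collectRuns(p).replace(/\s+/g, ' ').trim()) // collapse whitespace
    .filter((p) => p.length > 0); // drop empty paragraphs
  return { paragraphs, text: paragraphs.join('\n\n') };
}

// In a Code node set to run once for all items:
// return [{ json: scrapeText(items[0].json) }];
```

One caveat of walking parsed JSON instead of raw XML: elements are grouped by tag name, so the relative order of, say, table paragraphs and body paragraphs may not match the document. For CVs that are mostly plain paragraphs this rarely matters.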

Typical output

The output is a JSON object with a paragraphs array and a text field joining all paragraphs with double newlines.

Why this works and keeps running

Zero external dependencies -- ideal for locked-down environments
Copy-paste friendly code for any workflow
Clean, compact text suitable for GPT, Elasticsearch, SQL, or other systems

Final thoughts

Extracting text from a Word file without external libraries takes a six-node workflow, three of whose nodes are short Code scripts. Once it is in place, it runs unattended as part of any automated text-extraction pipeline.