How to format a vocabulary list into a table using Python for Google Sheets

Question

I am a teacher and I want to use Python to create a worksheet for my students. I have a vocabulary PDF with content like this:

do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
do/work overtime duː/wɜːk ˈəʊvətaɪm
36
do/write an essay duː/raɪt æn ˈeseɪ
33
document ˈdɒkjəmənt
54
documentary ˌdɒkjəˈmentəri
52
dollar ˈdɒlə
19
dolphin ˈdɒlfɪn
8
don’t worry dəʊnt ˈwʌri
65

I want to convert it into a table with three columns: vocab, API, and number that I can paste directly into Google Sheets, like this:

vocab                   API                    number
do your best            duː jɔː best           33, 81
do your hair/make-up    duː jɔː heə/ˈmeɪkʌp    81
do/work overtime        duː/wɜːk ˈəʊvətaɪm     36
...                     ...                    ...

I tried using the following Python code to extract text from the PDF and save it to a CSV:

import os
import csv
from pdfminer.high_level import extract_text

base_path = r"C:\Users\PC\OneDrive\Desktop\New folder"

pdf_file = os.path.join(base_path, "vocab.pdf")
csv_file = os.path.join(base_path, "vocab.csv")

text = extract_text(pdf_file)

lines = text.splitlines()

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow([line.strip()])

print(f"Done! CSV file created at: {csv_file}")

However, this only produces a blank CSV file.
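
One possible cause, offered here only as a guess, is that the extraction returned no usable text, which can happen when a PDF holds scanned images rather than a text layer. The minimal sketch below checks this; `summarize_extraction` is a hypothetical helper, not part of pdfminer, and in the script above its argument would be the `text` returned by `extract_text(pdf_file)`.

```python
def summarize_extraction(text):
    """Report how much usable text an extraction returned."""
    non_blank = [line for line in text.splitlines() if line.strip()]
    return {
        "characters": len(text),
        "non_blank_lines": len(non_blank),
        "first_lines": non_blank[:3],  # a quick look at what was extracted
    }


# Hypothetical input: an extraction that contains only whitespace.
print(summarize_extraction("\n\x0c"))
```

If `non_blank_lines` is zero, the problem lies in the extraction step rather than in the CSV writing.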

What I was expecting: a CSV with three separate columns (vocab, API, and number), with each entry properly aligned, so I can paste it directly into Google Sheets.
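
For the paste-into-Sheets goal, tab-separated text is usually the more convenient target, because spreadsheet applications generally split pasted text into columns at tabs. The sketch below assumes the rows have already been split into three fields; `to_tsv` and the sample `rows` are illustrative names and data, and the column names follow the question.

```python
import csv
import io

# Hypothetical rows, already split into the three columns from the question.
rows = [
    ("do your best", "duː jɔː best", "33, 81"),
    ("document", "ˈdɒkjəmənt", "54"),
]


def to_tsv(rows, header=("vocab", "API", "number")):
    """Return the rows as tab-separated text, one entry per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


print(to_tsv(rows))
```

Because the delimiter is a tab, a value such as `33, 81` stays in a single cell without quoting.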

How do you expect to split a line such as "do/write an essay duː/raɪt æn ˈeseɪ" into the phrase and pronunciation parts? Also, you're not accounting for the fact that the numbers are on separate lines.
– jackal, Commented Sep 23 at 9:54
You need to convert the PDF file into a plain text file first.
– Hai Vu, Commented Sep 23 at 16:33

jackal · Accepted Answer · 2025-09-23 10:47:55Z

You need to account for two things that are peculiar to your data:

The vocabulary and pronunciation parts are on separate lines from the numbers.
The vocabulary and the pronunciation share a line, so you need to isolate one from the other.

import csv
from pathlib import Path
from pdfminer.high_level import extract_text

# pylint: disable=invalid-name

BASE = Path("~").expanduser()  # use the HOME directory
pdf_in = BASE / "SO.pdf"
csv_out = BASE / "SO.csv"

with csv_out.open("w", encoding="utf-8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    expect_numbers = False  # True when the next non-blank line holds the numbers
    v, p = "", ""  # vocabulary and pronunciation parts
    for line in map(str.strip, extract_text(pdf_in).splitlines()):
        if line:
            if expect_numbers:
                # replace multiple contiguous spaces with one space
                line = " ".join(line.split())
                writer.writerow([v, p, line])
            else:
                # assumes the vocabulary and the pronunciation have the same
                # number of tokens, so the line can be split in the middle
                m = len(tokens := line.split()) // 2
                v = " ".join(tokens[:m])
                p = " ".join(tokens[m:])
            expect_numbers = not expect_numbers
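
The even-token assumption does not hold for every line in the sample: "do your hair/make-up duː jɔː heə/ ˈmeɪkʌp" has three vocabulary tokens and four pronunciation tokens. A possible variation, sketched below, splits each entry at the first token that contains an IPA-specific character and recognises number lines by pattern instead of by strict alternation. `IPA_CHARS`, `split_entry` and `parse_entries` are hypothetical names, and the character set is an assumption based only on the sample in the question. An entry whose transcription starts with a plain-ASCII token (for example "bed time bed taɪm") would still be split in the wrong place.

```python
import re

# Characters seen in phonetic transcriptions but not in ordinary English
# spelling. Assumption: extend this set to match your own word list.
IPA_CHARS = set("ːˈˌəɪʊɒɔæʌɜɑɛθðʃʒŋ")

# A reference line holds only numbers separated by commas, e.g. "33, 81".
NUMBER_LINE = re.compile(r"\d+(\s*,\s*\d+)*")


def split_entry(line):
    """Split 'vocabulary pronunciation' at the first token with an IPA character."""
    tokens = line.split()
    for i, token in enumerate(tokens):
        # i > 0 keeps at least one token on the vocabulary side
        if i > 0 and IPA_CHARS.intersection(token):
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    # No IPA character found: fall back to splitting in the middle.
    m = len(tokens) // 2
    return " ".join(tokens[:m]), " ".join(tokens[m:])


def parse_entries(text):
    """Turn extracted text into (vocabulary, pronunciation, numbers) rows."""
    rows = []
    pending = None  # entry line still waiting for its number line
    for line in map(str.strip, text.splitlines()):
        if not line:
            continue
        if NUMBER_LINE.fullmatch(line):
            if pending is not None:
                rows.append((*split_entry(pending), " ".join(line.split())))
                pending = None
        else:
            pending = line
    return rows


sample = "do your hair/make-up duː jɔː heə/ ˈmeɪkʌp\n81\ndo your best duː jɔː best\n33, 81\n"
for row in parse_entries(sample):
    print(row)
```

The rows returned by `parse_entries` can be passed to `csv.writer(...).writerows` in place of the row-by-row writes above.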
