What are the details of your problem? I am a teacher and I want to use Python to create a worksheet for my students. I have a vocabulary PDF with content like this:

do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
do/work overtime duː/wɜːk ˈəʊvətaɪm
36
do/write an essay duː/raɪt æn ˈeseɪ
33
document ˈdɒkjəmənt
54
documentary ˌdɒkjəˈmentəri
52
dollar ˈdɒlə
19
dolphin ˈdɒlfɪn
8
don’t worry dəʊnt ˈwʌri
65

I want to convert it into a table with three columns: vocab, API, and number that I can paste directly into Google Sheets, like this:

vocab                   API                     number
do your best            duː jɔː best            33, 81
do your hair/make-up    duː jɔː heə/ˈmeɪkʌp     81
do/work overtime        duː/wɜːk ˈəʊvətaɪm      36
...                     ...                     ...

I tried using the following Python code to extract text from the PDF and save it to a CSV:

import os
import csv
from pdfminer.high_level import extract_text

base_path = r"C:\Users\PC\OneDrive\Desktop\New folder"

pdf_file = os.path.join(base_path, "vocab.pdf")
csv_file = os.path.join(base_path, "vocab.csv")

text = extract_text(pdf_file)

lines = text.splitlines()

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow([line.strip()])

print(f"Done! CSV file created at: {csv_file}")

However, this only produces a blank CSV file.

What I was expecting: I want the CSV to have three separate columns: vocab, API, and number with each entry properly aligned, so I can paste it directly into Google Sheets.

2 Comments
  • How do you expect to split a line such as do/write an essay duː/raɪt æn ˈeseɪ into the phrase and pronunciation parts? Also, you're not accounting for the fact that the numbers are on separate lines Commented Sep 23 at 9:54
  • You need to convert the PDF file into plain text file first. Commented Sep 23 at 16:33

1 Answer

You need to account for two things that are peculiar to your data:

  1. The vocabulary / pronunciation parts are on separate lines from the numbers.
  2. You need to separate the vocabulary from the pronunciation.

One way to handle both:

import csv
from pathlib import Path
from pdfminer.high_level import extract_text

# pylint: disable=invalid-name

BASE = Path("~").expanduser() # use HOME directory
pdf_in = BASE / "SO.pdf"
csv_out = BASE / "SO.csv"

with csv_out.open("w", encoding="utf-8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    flag = False # True when the next non-blank line is a number line
    v, p = "", "" # vocabulary and pronunciation parts
    for line in map(str.strip, extract_text(pdf_in).splitlines()):
        if line:
            if flag:
                # replace multiple contiguous spaces with one space
                line = " ".join(line.split())
                writer.writerow([v, p, line])
            else:
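                # assumes the vocabulary and pronunciation have the same number
                # of tokens; with an odd count the extra token goes to the pronunciation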
                m = len(tokens := line.split()) // 2
                v = " ".join(tokens[:m])
                p = " ".join(tokens[m:])
            flag = not flag
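
The half-and-half split above depends on the vocabulary and pronunciation having matching token counts, and the flag depends on entry and number lines strictly alternating. A possible variation, offered only as a sketch, is to recognise number lines with a regular expression and to split each entry at the first token that contains a phonetic (non-ASCII) character. All names here (NUMBER_LINE, looks_phonetic, split_entry, parse_rows, to_tsv) are hypothetical, and the heuristic assumes that the vocabulary itself is plain ASCII apart from typographic apostrophes and that each entry fits on one line. The text passed to parse_rows would be whatever extract_text returns; the output is tab-separated because tab-separated text usually pastes into separate columns in a spreadsheet.

```python
import csv
import io
import re

# Assumption: a number line holds only page numbers such as "81" or "33, 81".
NUMBER_LINE = re.compile(r"\d+(?:\s*,\s*\d+)*")

# Assumption: typographic apostrophes occur in ordinary spelling ("don’t"),
# so they are not treated as phonetic symbols.
ORDINARY_NON_ASCII = {"\u2018", "\u2019"}


def looks_phonetic(token):
    """Return True if the token has a non-ASCII character other than an apostrophe."""
    return any(ord(ch) > 127 and ch not in ORDINARY_NON_ASCII for ch in token)


def split_entry(line):
    """Split 'vocab pronunciation' at the first token that looks phonetic."""
    tokens = line.split()
    for i, token in enumerate(tokens):
        if i > 0 and looks_phonetic(token):
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    # Fallback when no phonetic symbol is found: split the tokens in half.
    m = max(1, len(tokens) // 2)
    return " ".join(tokens[:m]), " ".join(tokens[m:])


def parse_rows(text):
    """Pair each entry line with the number line that follows it."""
    rows = []
    pending = None  # entry still waiting for its number line
    for line in map(str.strip, text.splitlines()):
        if not line:
            continue
        if NUMBER_LINE.fullmatch(line):
            if pending is not None:
                rows.append((*pending, " ".join(line.split())))
                pending = None
        else:
            pending = split_entry(line)
    return rows


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["vocab", "API", "number"])
    writer.writerows(rows)
    return buffer.getvalue()


sample = """do your best duː jɔː best
33, 81
do/write an essay duː/raɪt æn ˈeseɪ
33
don’t worry dəʊnt ˈwʌri
65
"""
rows = parse_rows(sample)
print(to_tsv(rows))
```

If a pronunciation happens to start with a token spelled entirely in ASCII, the split will land one token too late, so it is worth spot-checking the output against the PDF.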