I am using the OpenAI completions API to generate text in a ChatGPT-style user/system message exchange. The text is Arabic, and the explanations sometimes come back as a mixture of English and Arabic. Is there a standard approach for deciding which mix of rtl and ltr to apply?

Here I have 4 variants:

  • auto-everywhere: auto on the container and on each nested span.
  • none-container-mixed-inside: nothing on the container, rtl or ltr on spans (Arabic vs. Latin chunks).
  • rtl-container-mixed-inside: rtl on the container, rtl/ltr on spans again.
  • rtl-container-auto-inside: rtl on the container, auto on all spans.

[Screenshots of the four variants were here, in order: auto-everywhere, none-container-mixed-inside, rtl-container-mixed-inside, rtl-container-auto-inside.]

General code

<div dir="rtl">
  <p>
    <span dir="auto">بالطبع.</span>
  </p>
  <ul>
    <li>
      <span dir="ltr">"</span>
      <span dir="rtl">الأشجار</span>
      <span dir="ltr">" (al-ashjar) </span>
      <span dir="rtl">تعني </span>
      <span dir="ltr">"trees" </span>
      <span dir="rtl">وهي الفاعل.</span>
    </li>
    <li>
      <span dir="auto">"</span>
      <span dir="auto">جميلة</span>
      <span dir="auto">" (jameela) </span>
      <span dir="auto">تعني </span>
      <span dir="auto">"beautiful" </span>
      <span dir="auto">وهي صفة.</span>
    </li>
    <li>
      <span dir="auto">"</span>
      <span dir="auto">تعطي</span>
      <span dir="auto">" (tu'ti) </span>
      <span dir="auto">تعني </span>
      <span dir="auto">"give" </span>
      <span dir="auto">وهي الفعل.</span>
    </li>
    <li>
      <span dir="auto">"</span>
      <span dir="auto">الحياة</span>
      <span dir="auto">" (al-hayat) </span>
      <span dir="auto">تعني </span>
      <span dir="auto">"life" </span>
      <span dir="auto">وهي المفعول الأول.</span>
    </li>
    <li>
      <span dir="auto">"</span>
      <span dir="auto">الظل</span>
      <span dir="auto">" (al-zill) </span>
      <span dir="auto">تعني </span>
      <span dir="auto">"shade" </span>
      <span dir="auto">وهي المفعول الثاني.</span>
    </li>
  </ul>
  <p>
    <span dir="auto">الجملة تصف جمال الأشجار وقدرتها على إعطاء الحياة والظل.</span>
  </p>
  <p>
    <span dir="auto">هل هذا يوضح ما كنت تتساءل عنه؟</span>
  </p>
</div>

Copied Content

For readability's sake, here is the plain text:

بالطبع.

"الأشجار" (al-ashjar) تعني "trees" وهي الفاعل.
"جميلة" (jameela) تعني "beautiful" وهي صفة.
"تعطي" (tu'ti) تعني "give" وهي الفعل.
"الحياة" (al-hayat) تعني "life" وهي المفعول الأول.
"الظل" (al-zill) تعني "shade" وهي المفعول الثاني.
الجملة تصف جمال الأشجار وقدرتها على إعطاء الحياة والظل.

هل هذا يوضح ما كنت تتساءل عنه؟

Question

This seems like a difficult, potentially ambiguous or unsolvable problem, but I thought I'd ask anyway.

  • How can you tell which script is the main text, so the overall direction can be set from it?
  • What would you recommend?

I feel like it could get quite ambiguous. For example, take these strings:

# should be rtl
هل هذا يوضح ما كنت تتساءل عنه
# ltr
it could get quite ambiguous
# ltr
it could get تتساءل quite ambiguous
# rtl
هل هذا يوضح ما hello كنت تتساءل عنه
# rtl
hello هل هذا يوضح ما كنت تتساءل عنه

So it seems you would have to know the meaning of the sentence to figure this out properly.

Is there any automatic way to do this nicely?

If not, do you think it's possible to get ChatGPT to send me each line as JSON, with the direction included? I guess that could work.

Or is there a JavaScript algorithm I could implement to solve this somehow?

I am already parsing each chunk of text (Arabic vs. Latin) into spans, roughly like this:

export type Span = {
  fontSize: number
  script: string
  text: string
  count: number
  heading: number
  dir?: 'ltr' | 'rtl' | 'auto'
}

export function parseSpans(text, size) {
  const spans = []
  const span = []
  const list = [...text]
  let script

  function add(span, script) {
    const heading = FONT_SIZE_MULTIPLIERS[script ?? 'code']?.body ?? 1
    const fontSize = Math.round(size * heading)
    const dir = getScriptDirection(script ?? 'latin')

    spans.push({
      fontSize,
      dir: 'auto',
      count: span.length,
      script:
        !script || script === 'latin'
          ? 'code'
          : script === 'other'
            ? 'code'
            : script,
      heading,
      text: span.join(''),
    })
  }

  for (const char of list) {
    const type = detectSymbol(char)

    if (!type || type === script) {
      span.push(char)

      script = type
    } else if (char.match(/[\s\.,:;\{\}\[\]\(\)\-\?\!]/)) {
      span.push(char)
    } else {
      add(span, script)

      span.length = 0

      span.push(char)

      script = type
    }
  }

  if (span.length) {
    add(span, script)
  }

  return spans
}

So maybe I can expand on that somehow. I'm looking for advice: either a specific algorithm or a "there's nothing you can do in this situation" type of answer.

1
  • Counting the number of Arabic vs. Latin words seems to pass all your tests. I think word count is the best measure, better than characters. Commented May 30 at 16:25
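A minimal sketch of the word-counting idea from the comment above. It assumes approximate Unicode ranges for right-to-left and Latin letters, and the names RTL_CHAR, LATIN_CHAR and directionByWordCount are illustrative, not taken from the question:

```typescript
type Direction = 'rtl' | 'ltr'

// Approximate ranges (an assumption): Hebrew through Arabic Extended-A,
// plus the Hebrew/Arabic presentation forms.
const RTL_CHAR = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/
// Basic Latin letters plus the Latin-1 and Latin Extended ranges.
const LATIN_CHAR = /[A-Za-z\u00C0-\u024F]/

// Counts whitespace-separated words per direction and returns the majority.
// A word containing any RTL letter counts as RTL; ties use the fallback.
function directionByWordCount(text: string, fallback: Direction = 'ltr'): Direction {
  let rtl = 0
  let ltr = 0
  for (const word of text.split(/\s+/)) {
    if (RTL_CHAR.test(word)) rtl++
    else if (LATIN_CHAR.test(word)) ltr++
    // Words made only of digits or punctuation are neutral and not counted.
  }
  if (rtl === ltr) return fallback
  return rtl > ltr ? 'rtl' : 'ltr'
}
```

On the five sample strings from the question this returns rtl, ltr, ltr, rtl, rtl, which matches the expected labels. It is still only a heuristic: a line with an even split falls back to whatever default you pass in.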

1 Answer

If I understand the question correctly, you want to automatically render mixed Arabic-English content with the correct direction (ltr or rtl) at the span or block level, with high accuracy.

There's no easy way to do this, and browsers often struggle with it. But since you already split the content into spans and classify them by script, you can infer the directionality of mixed content yourself.

  • You are assigning dir: 'auto', which is safe but not always correct when context and semantics matter.
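For context, dir="auto" roughly picks the direction of the first strongly directional character and ignores everything after it. Here is a simplified sketch of that heuristic, assuming approximate Unicode ranges (the real bidi algorithm uses full character classes, and firstStrongDirection is an illustrative name):

```typescript
type Direction = 'rtl' | 'ltr'

// Approximate ranges (an assumption), not the full Unicode bidi classes.
const STRONG_RTL = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/
const STRONG_LTR = /[A-Za-z\u00C0-\u024F]/

// Returns the direction of the first strong character, or null when the
// text contains only neutral characters (digits, punctuation, spaces).
function firstStrongDirection(text: string): Direction | null {
  for (const char of text) {
    if (STRONG_RTL.test(char)) return 'rtl'
    if (STRONG_LTR.test(char)) return 'ltr'
  }
  return null
}
```

Under this rule a line such as "hello هل هذا يوضح ما كنت تتساءل عنه" resolves to ltr, even though most of it is Arabic. That is the kind of surprise a resolved direction avoids.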

Update your add() function to set dir to a resolved value instead of auto, so that each span has a hard direction. This prevents surprises from browser heuristics.

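function add(span, script) {
  const heading = FONT_SIZE_MULTIPLIERS[script ?? 'code']?.body ?? 1
  const fontSize = Math.round(size * heading)
  // Unknown, Latin, and "other" characters are all treated as 'code'.
  const resolvedScript =
    !script || script === 'latin' || script === 'other' ? 'code' : script

  // getDirectionFromScript is not defined here; it is assumed to map a
  // script name to 'rtl' or 'ltr' (with 'code' mapping to 'ltr').
  const dir = getDirectionFromScript(resolvedScript)

  spans.push({
    fontSize,
    dir, // a resolved direction instead of 'auto'
    count: span.length,
    script: resolvedScript,
    heading,
    text: span.join(''),
  })
}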
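The getDirectionFromScript helper is not shown above. One possible shape for it is sketched below, assuming script names such as 'arabic' and 'hebrew'. Those names are hypothetical, so adjust them to whatever your detectSymbol() actually returns:

```typescript
type Direction = 'rtl' | 'ltr'

// Hypothetical script names; the question only confirms 'latin', 'other'
// and 'code', so the right-to-left names below are assumptions.
function getDirectionFromScript(script: string): Direction {
  switch (script) {
    case 'arabic':
    case 'hebrew':
    case 'syriac':
    case 'thaana':
      return 'rtl'
    default:
      // 'code', 'latin', and anything unrecognised fall back to left-to-right.
      return 'ltr'
  }
}
```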

You can then wrap groups of spans with the same direction. If needed, determine the overall direction of a sentence using something like:

function getDominantDirection(spans: Span[]): 'rtl' | 'ltr' {
  const score = spans.reduce((acc, span) => {
    if (span.dir === 'rtl') acc.rtl += span.text.length
    else acc.ltr += span.text.length
    return acc
  }, { rtl: 0, ltr: 0 })

  return score.rtl > score.ltr ? 'rtl' : 'ltr'
}
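The grouping step mentioned earlier (wrapping neighbouring spans that share a direction) could be sketched like this. DirectedSpan and Run are illustrative types, reduced to the fields the grouping needs:

```typescript
type Direction = 'rtl' | 'ltr'
type DirectedSpan = { text: string; dir: Direction }
type Run = { dir: Direction; spans: DirectedSpan[] }

// Collapses adjacent spans with the same direction into runs, so each run
// can be rendered inside a single wrapper element with one dir attribute.
function groupByDirection(spans: DirectedSpan[]): Run[] {
  const runs: Run[] = []
  for (const span of spans) {
    const last = runs[runs.length - 1]
    if (last && last.dir === span.dir) {
      last.spans.push(span)
    } else {
      runs.push({ dir: span.dir, spans: [span] })
    }
  }
  return runs
}
```

Each run then maps to one wrapper element, and the container's own direction can come from getDominantDirection. Note that getDominantDirection resolves ties to ltr, so for content you know is mostly Arabic you may prefer rtl as the tie-breaker.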

If you are generating these spans from LLM output then yes, you can ask ChatGPT to label the direction for you. That lets you decide span direction by meaning, not just by script.
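If you go the JSON route, it is worth validating what comes back, since the model may not always follow the requested format. Here is a minimal sketch that assumes you prompt for an array of { text, dir } objects. That shape, and the names DirectedLine and parseDirectedLines, are hypothetical, not part of any API:

```typescript
type Direction = 'rtl' | 'ltr'
type DirectedLine = { text: string; dir: Direction }

function isDirection(value: unknown): value is Direction {
  return value === 'rtl' || value === 'ltr'
}

// Parses the model's reply and rejects anything that does not match the
// assumed shape: a JSON array of { text: string, dir: 'rtl' | 'ltr' }.
function parseDirectedLines(json: string): DirectedLine[] {
  const data: unknown = JSON.parse(json)
  if (!Array.isArray(data)) throw new Error('expected a JSON array')

  return data.map((item, index) => {
    if (typeof item !== 'object' || item === null) {
      throw new Error('line ' + index + ' is not an object')
    }
    const record = item as Record<string, unknown>
    const text = record.text
    const dir = record.dir
    if (typeof text !== 'string' || !isDirection(dir)) {
      throw new Error('line ' + index + ' has an invalid text or dir')
    }
    return { text, dir }
  })
}
```

When validation fails you can fall back to a script-based heuristic for that reply rather than rendering an unlabelled line.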
