I am using the OpenAI completions API to generate text in a ChatGPT-style back-and-forth of user/system messages. The content is Arabic, and the explanations sometimes come back as a mixture of English and Arabic. Is there a standard approach for deciding which mix of rtl and ltr direction attributes to apply?
I have 4 variants:
- auto-everywhere: auto on the container and on each nested span.
- none-container-mixed-inside: Nothing on container, rtl or ltr on spans (Arabic vs. Latin chunks)
- rtl-container-mixed-inside: rtl on container, rtl/ltr on spans again
- rtl-container-auto-inside: rtl on container, auto on all spans
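For context on what the auto variants do: as far as I understand it, `dir="auto"` takes the direction from the first strongly directional character in the element's text. Below is a minimal TypeScript sketch of that rule. The name `firstStrongDirection` and the two regexes are made up for illustration, and the ranges only approximate "strong" letters (Hebrew, Arabic and Latin letters); they are not the full Unicode bidi classes.

```typescript
type Dir = 'ltr' | 'rtl'

// Assumption: these ranges stand in for "strong" letters. Digits, marks and
// punctuation are treated as neutral and skipped.
const RTL_LETTER = /[\u05D0-\u05EA\u0620-\u064A\u0671-\u06D3\u0750-\u077F]/
const LTR_LETTER = /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]/

// Returns the direction of the first strong letter, or `fallback` when the
// text has none (for example a span holding only a quote mark).
function firstStrongDirection(text: string, fallback: Dir = 'ltr'): Dir {
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i)
    if (RTL_LETTER.test(ch)) return 'rtl'
    if (LTR_LETTER.test(ch)) return 'ltr'
  }
  return fallback
}

console.log(firstStrongDirection('"الأشجار"')) // rtl
console.log(firstStrongDirection('" (al-ashjar) ')) // ltr
console.log(firstStrongDirection('"')) // ltr (fallback, no strong letter)
```

Under this rule a span that holds only punctuation has no direction of its own, so what it ends up with depends entirely on the fallback.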
General code
<div dir="rtl">
<p>
<span dir="auto">بالطبع.</span>
</p>
<ul>
<li>
<span dir="ltr">"</span>
<span dir="rtl">الأشجار</span>
<span dir="ltr">" (al-ashjar) </span>
<span dir="rtl">تعني </span>
<span dir="ltr">"trees" </span>
<span dir="rtl">وهي الفاعل.</span>
</li>
<li>
<span dir="auto">"</span>
<span dir="auto">جميلة</span>
<span dir="auto">" (jameela) </span>
<span dir="auto">تعني </span>
<span dir="auto">"beautiful" </span>
<span dir="auto">وهي صفة.</span>
</li>
<li>
<span dir="auto">"</span>
<span dir="auto">تعطي</span>
<span dir="auto">" (tu'ti) </span>
<span dir="auto">تعني </span>
<span dir="auto">"give" </span>
<span dir="auto">وهي الفعل.</span>
</li>
<li>
<span dir="auto">"</span>
<span dir="auto">الحياة</span>
<span dir="auto">" (al-hayat) </span>
<span dir="auto">تعني </span>
<span dir="auto">"life" </span>
<span dir="auto">وهي المفعول الأول.</span>
</li>
<li>
<span dir="auto">"</span>
<span dir="auto">الظل</span>
<span dir="auto">" (al-zill) </span>
<span dir="auto">تعني </span>
<span dir="auto">"shade" </span>
<span dir="auto">وهي المفعول الثاني.</span>
</li>
</ul>
<p>
<span dir="auto">الجملة تصف جمال الأشجار وقدرتها على إعطاء الحياة والظل.</span>
</p>
<p>
<span dir="auto">هل هذا يوضح ما كنت تتساءل عنه؟</span>
</p>
</div>
Copied Content
For readability's sake, here is the plain text:
بالطبع.
"الأشجار" (al-ashjar) تعني "trees" وهي الفاعل.
"جميلة" (jameela) تعني "beautiful" وهي صفة.
"تعطي" (tu'ti) تعني "give" وهي الفعل.
"الحياة" (al-hayat) تعني "life" وهي المفعول الأول.
"الظل" (al-zill) تعني "shade" وهي المفعول الثاني.
الجملة تصف جمال الأشجار وقدرتها على إعطاء الحياة والظل.
هل هذا يوضح ما كنت تتساءل عنه؟
Question
This seems like a difficult, potentially ambiguous or even unsolvable problem, but I thought I'd ask anyway.
- How can you tell which script is the main text, so that the overall direction can be set from it?
- What would you recommend doing?
I feel like it could get quite ambiguous. For example, take these strings:
# should be rtl
هل هذا يوضح ما كنت تتساءل عنه
# ltr
it could get quite ambiguous
# ltr
it could get تتساءل quite ambiguous
# rtl
هل هذا يوضح ما hello كنت تتساءل عنه
# rtl
hello هل هذا يوضح ما كنت تتساءل عنه
So it seems you would have to know the meaning of the sentence to figure this out properly.
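One heuristic that does not need the meaning of the sentence is to count the strong letters per direction and take the majority. The sketch below is a minimal TypeScript version under stated assumptions: `majorityDirection` is a hypothetical name, the regex ranges only approximate Hebrew, Arabic and Latin letters, and ties fall back to rtl on the assumption that the content is mainly Arabic.

```typescript
type Dir = 'ltr' | 'rtl'

// Assumption: these ranges approximate "strong" letters; everything else
// (digits, punctuation, whitespace) is ignored when counting.
const RTL_LETTER = /[\u05D0-\u05EA\u0620-\u064A\u0671-\u06D3\u0750-\u077F]/
const LTR_LETTER = /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]/

// Counts strong letters per direction and returns the majority.
// Ties and letter-free strings return `fallback`.
function majorityDirection(text: string, fallback: Dir = 'rtl'): Dir {
  let rtl = 0
  let ltr = 0
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i)
    if (RTL_LETTER.test(ch)) rtl++
    else if (LTR_LETTER.test(ch)) ltr++
  }
  if (rtl === ltr) return fallback
  return rtl > ltr ? 'rtl' : 'ltr'
}

const samples = [
  'هل هذا يوضح ما كنت تتساءل عنه',
  'it could get quite ambiguous',
  'it could get تتساءل quite ambiguous',
  'هل هذا يوضح ما hello كنت تتساءل عنه',
  'hello هل هذا يوضح ما كنت تتساءل عنه',
]
// Prints rtl, ltr, ltr, rtl, rtl for the five samples, in order.
samples.forEach((s) => console.log(majorityDirection(s), s))
```

This matches the labels on the five strings above, including the last one, which a rule that only looks at the first letter would call ltr. It is still only a heuristic: a short Arabic sentence quoting a long English phrase would come out as ltr.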
Is there any automatic way to do this nicely?
If not, do you think it's possible to get ChatGPT to send me each line as JSON, with the direction of that line included? I guess that could work.
Or is there a JavaScript algorithm I could implement to solve this somehow?
I am already parsing each chunk of text (Arabic vs. Latin) into spans, roughly like this:
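// FONT_SIZE_MULTIPLIERS, getScriptDirection and detectSymbol are defined elsewhere.
export type Span = {
  fontSize: number
  script: string
  text: string
  count: number
  heading: number
  // 'auto' is what add() currently emits, so it has to be part of the type
  dir?: 'ltr' | 'rtl' | 'auto'
}

export function parseSpans(text: string, size: number): Span[] {
  const spans: Span[] = []
  const span: string[] = []
  const list = [...text] // split by code point rather than UTF-16 unit
  let script

  // Flush the characters collected so far into a single Span.
  function add(span: string[], script?: string) {
    const heading = FONT_SIZE_MULTIPLIERS[script ?? 'code']?.body ?? 1
    const fontSize = Math.round(size * heading)
    // Computed but currently unused: every span is emitted with dir 'auto'.
    const dir = getScriptDirection(script ?? 'latin')
    spans.push({
      fontSize,
      dir: 'auto',
      count: span.length,
      // Unknown, 'latin' and 'other' chunks are all labelled 'code'
      script:
        !script || script === 'latin' || script === 'other' ? 'code' : script,
      heading,
      text: span.join(''),
    })
  }

  for (const char of list) {
    const type = detectSymbol(char)
    if (!type || type === script) {
      span.push(char)
      script = type
    } else if (char.match(/[\s\.,:;\{\}\[\]\(\)\-\?\!]/)) {
      // Whitespace and punctuation stay attached to the current span
      span.push(char)
    } else {
      // Script changed: close the current span and start a new one
      add(span, script)
      span.length = 0
      span.push(char)
      script = type
    }
  }

  if (span.length) {
    add(span, script)
  }

  return spans
}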
So maybe I can expand that or something. Looking for advice on either a specific algorithm or a "there's nothing you can do in this situation" type answer.
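One way the span data could be extended is to derive the container direction from the spans themselves, by summing `count` per span direction. The sketch below is only an illustration under stated assumptions: `DirectedSpan` is a trimmed-down, hypothetical version of the Span type with an explicit `dir` (for example the value `getScriptDirection` would return), `makeSpan`, `containerDirection`, `escapeHtml` and `renderLine` are made-up names, and ties fall back to rtl.

```typescript
type Dir = 'ltr' | 'rtl'

// Assumption: a reduced span carrying an explicit direction instead of 'auto'.
type DirectedSpan = { text: string; count: number; dir: Dir }

// Hypothetical helper for building sample spans; count mirrors text length.
function makeSpan(text: string, dir: Dir): DirectedSpan {
  return { text, count: text.length, dir }
}

// Sums `count` per direction and returns the majority; ties use `fallback`.
function containerDirection(spans: DirectedSpan[], fallback: Dir = 'rtl'): Dir {
  let rtl = 0
  let ltr = 0
  spans.forEach((s) => {
    if (s.dir === 'rtl') rtl += s.count
    else ltr += s.count
  })
  if (rtl === ltr) return fallback
  return rtl > ltr ? 'rtl' : 'ltr'
}

// Escapes the characters that are unsafe in HTML text and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Renders one line: majority direction on the container, explicit dir per span.
function renderLine(spans: DirectedSpan[]): string {
  const inner = spans
    .map((s) => `<span dir="${s.dir}">${escapeHtml(s.text)}</span>`)
    .join('')
  return `<p dir="${containerDirection(spans)}">${inner}</p>`
}

const line = [
  makeSpan('"الأشجار" ', 'rtl'),
  makeSpan('(al-ashjar) ', 'ltr'),
  makeSpan('تعني ', 'rtl'),
  makeSpan('"trees" ', 'ltr'),
  makeSpan('وهي الفاعل.', 'rtl'),
]
console.log(containerDirection(line)) // rtl
console.log(renderLine(line))
```

Because punctuation and whitespace are attached to whichever span they follow, `count` is only a rough weight here; counting letters alone would be stricter.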