
CONTEXT: My goal is to use http.Get(), then use the golang.org/x/net/html package to parse resp.Body and extract some data from several <div>s whose id attributes follow a similar naming scheme, which I will match with a regex. The webpage is https://mandarintemple.com/learning-materials/radicals/

PROBLEM: I seem to get only a portion of the total HTML body. In the network tab of dev tools there are a lot of GET requests, but only the first is of type html; all the others are css or js. In the inspector tab of dev tools I can see the <div>s I want inside the <body>, but when I read the response with io.ReadAll(resp.Body) and printed it to my editor's console, those <div>s were clearly not there.

I'm guessing that one or more of the JS scripts create and add the <div>s I want, rather than them being present in the original HTML document the server responds with (they are popups you get when hovering over a Hanzi). Is there an easy way to verify this? As far as I can tell from the inspector, the <div>s are part of the HTML body, but this is the only way I can explain not getting them in the response body from my http.Get(), since I'm not getting any errors.

Whatever the cause is, I am looking for a way to get those popup <div>s in my response from http.Get(). If someone can help me understand or point me to some resources to check out, that would be greatly appreciated.

Here is the applicable code from func main() just to clarify what I have said above:

    resp, err := http.Get("https://mandarintemple.com/learning-materials/radicals/")
    if err != nil {
        log.Println("failed to get resource via url with error:", err)
        return // resp is nil on error, so the deferred Close below would panic
    }
    defer resp.Body.Close()

    readBody, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Println("failed to read body with error:", err)
        return // don't print a partial body
    }
    bodyString := string(readBody)
    log.Println(bodyString)
    // I can clearly see that the <div>'s aren't present in this output
    // even though much of the rest of the <body> appears the same as in the inspector of dev tools.

1 Answer


As you correctly assume, the page uses JavaScript to generate this content. What you get with http.Get() is the static portion of the page as served by the web server. In the Developer Tools window, that matches the content delivered as the response to GET https://mandarintemple.com/learning-materials/radicals/ of type text/html.

To get any <div> or other content generated dynamically by JavaScript (which runs in the user's browser, not on the server), you would need to fetch and run that JavaScript on top of the static content. Given the complexity of that task, "headless" browsers (browsers running without a visible window) are often used: the Go code drives them programmatically, telling them which page to load and which content to return.


1 Comment

readBody, err := io.ReadAll(resp.Body) if err != nil { log.Println("failed to read body with error: ", err) }
