read_html(url) double-encodes UTF-8 on Windows (codepage 65001) #475

@rumswiddel

  • Reproducer: Non-ASCII characters come back double-encoded as UTF-8 when a URL is passed to read_html(). Fetching the same HTML separately and passing it to read_html() as a string works correctly.

  • Regression: Works in xml2 1.3.6, broken in 1.5.2. Likely introduced in 1.3.7 (switch to Rtools libxml2) or 1.3.8 (libxml2 update to 2.11.5).

  • Environment: R 4.5.2, Windows, l10n_info()$`UTF-8` == TRUE, codepage 65001.
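
For readers unfamiliar with the term, "double-encoded" here means the UTF-8 bytes were at some point read as a single-byte encoding and then converted to UTF-8 a second time. The sketch below reproduces the byte sequence from this report using only base R. It assumes the intermediate misreading was ISO-8859-1, which is consistent with the observed bytes but is not confirmed to be what xml2 or libxml2 does internally.

```r
# Correct UTF-8 bytes for the expected word (c3 84 is U+00C4, then "pfel").
utf8_bytes <- as.raw(c(0xc3, 0x84, 0x70, 0x66, 0x65, 0x6c))

# Misread those bytes as ISO-8859-1 and convert to UTF-8 again.
# Each original byte becomes its own code point, so c3 84 turns into c3 83 c2 84.
double_encoded <- iconv(rawToChar(utf8_bytes), from = "ISO-8859-1", to = "UTF-8")
charToRaw(double_encoded)

# Reversing the extra step recovers the original bytes.
repaired <- iconv(double_encoded, from = "UTF-8", to = "ISO-8859-1")
Encoding(repaired) <- "UTF-8"
charToRaw(repaired)
```

The first `charToRaw()` call yields the "Actual" bytes from the reproducer and the second yields the "Expected" bytes.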

# Minimal reproducible example: xml2::read_html(url) double-encodes UTF-8 on Windows
# Environment: R 4.5.2, Windows, l10n_info()$`UTF-8` == TRUE, codepage 65001
# xml2 version: 1.5.2 (works correctly in 1.3.6)
 
library(xml2)
 
url <- "https://translate.google.com/m?tl=de&sl=en&q=apples"
 
# --- Method 1: read_html(url) - BROKEN ---
page1 <- read_html(url)
node1 <- xml_find_first(page1, "//div[@class='result-container']")
result1 <- xml_text(node1)
 
Encoding(result1)
#> [1] "UTF-8"
charToRaw(result1)
#> [1] c3 83 c2 84 70 66 65 6c
#> Expected: c3 84 70 66 65 6c ("Äpfel")
#> Actual:   c3 83 c2 84 70 66 65 6c (double-encoded UTF-8)
result1
#> [1] "Ã\u0084pfel"
 
# --- Method 2: read_html(string) - WORKS ---
resp <- curl::curl_fetch_memory(url)
html <- rawToChar(resp$content)
Encoding(html) <- "UTF-8"
page2 <- read_html(html)
node2 <- xml_find_first(page2, "//div[@class='result-container']")
result2 <- xml_text(node2)
 
Encoding(result2)
#> [1] "UTF-8"
charToRaw(result2)
#> [1] c3 84 70 66 65 6c
result2
#> [1] "Äpfel"
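
A possible interim workaround, building on Method 2: download the bytes yourself and hand them to read_html() as a raw vector with the encoding stated explicitly, so the URL code path is never used. This is only a sketch. The helper names `parse_utf8_bytes()` and `read_html_utf8()` are hypothetical, it assumes the page really is UTF-8, and it has not been verified against xml2 1.5.2 on Windows.

```r
library(xml2)

# Hypothetical helper: parse raw HTML bytes, telling the parser they are UTF-8.
parse_utf8_bytes <- function(bytes) {
  stopifnot(is.raw(bytes))
  read_html(bytes, encoding = "UTF-8")
}

# Hypothetical helper: fetch a URL with curl and parse the response body,
# bypassing read_html(url).
read_html_utf8 <- function(url) {
  resp <- curl::curl_fetch_memory(url)
  parse_utf8_bytes(resp$content)
}

# Usage (requires network access):
# page3 <- read_html_utf8("https://translate.google.com/m?tl=de&sl=en&q=apples")
# charToRaw(xml_text(xml_find_first(page3, "//div[@class='result-container']")))
```

If the usage lines print the "Expected" bytes while Method 1 still prints the double-encoded ones, that would further narrow the problem to the URL code path.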
 
