wsl2_and_unicode

2024-07-16

Author: fmh (1522009317@qq.com)

# On locale, Unicode, json.dump(), and the wrongly accused WSL2

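The locale on my WSL2 has always been the default value: "C.UTF-8". Today, while using Python to process file paths containing Chinese names, what ended up saved to disk was a run of Unicode escapes ("\u5c0f\u6728\u4ed4-\u7761\u7740\u4e862"), which is a first for me :') To save myself the trouble, I had to go and consult His Excellency GPT.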

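Update: everything below turned out to be one long detour. The actual cause is almost certainly that json.dump() defaults to ensure_ascii=True, which escapes every non-ASCII character as \uXXXX so that nothing written to the file falls outside the ASCII character set.

So the right fix is not to go to great lengths rewriting the system locale, but simply to pass ensure_ascii=False.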
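A minimal sketch of the difference, using the file name from this post as sample data. The `record` dict and the `dump_to_str` helper are hypothetical names made up for illustration; only `json.dump` and its `ensure_ascii` parameter come from the standard library.

```python
import io
import json

# Sample data: the Chinese name that triggered this post.
record = {"name": "小木仔-睡着了2"}


def dump_to_str(obj, **kwargs):
    """Run json.dump() against an in-memory text buffer and return the result."""
    buf = io.StringIO()
    json.dump(obj, buf, **kwargs)
    return buf.getvalue()


# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = dump_to_str(record)

# With ensure_ascii=False the characters are written as they are.
readable = dump_to_str(record, ensure_ascii=False)

print(escaped)   # {"name": "\u5c0f\u6728\u4ed4-\u7761\u7740\u4e862"}
print(readable)  # {"name": "小木仔-睡着了2"}

# Both forms are valid JSON and parse back to the same data.
assert json.loads(escaped) == json.loads(readable) == record
```

Either output parses back to the same data, so the escaped file was never corrupted, only harder to read. When writing to a real file with ensure_ascii=False, it is probably safest to open it with an explicit encoding="utf-8" rather than rely on the locale.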

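In fact, the json.dump() part of the code was also written by His Excellency GPT, which naturally assumed it was working in an English context, where sticking with the default parameters is hardly a crime. 😂

But the first thing GPT4o asked me to do was check whether this was a terminal encoding problem: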

```shell
# This is the Unicode-escaped string, copied exactly as it was saved
echo -e "\u5c0f\u6728\u4ed4-\u7761\u7740\u4e862"

# output: 小木仔-睡着了2
```

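Next, use the file -bi command to check the encoding of the file:

```shell
file -bi path/to/the/suspicious_file.jsonl

# output: application/json; charset=us-ascii
```

This means that when the program wrote the file to disk, it used only ASCII characters.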
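The same check can be sketched in Python, assuming a throwaway temporary directory instead of the real file. The helpers `write_jsonl` and `is_pure_ascii` are hypothetical names for illustration; the idea is just that a file written with the default ensure_ascii=True contains nothing but ASCII bytes, which is consistent with file reporting charset=us-ascii.

```python
import json
import os
import tempfile

record = {"name": "小木仔-睡着了2"}


def write_jsonl(path, rows, ensure_ascii):
    """Write one JSON object per line, always opening the file as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            json.dump(row, fh, ensure_ascii=ensure_ascii)
            fh.write("\n")


def is_pure_ascii(path):
    """Return True if every byte in the file is in the ASCII range."""
    with open(path, "rb") as fh:
        return fh.read().isascii()


with tempfile.TemporaryDirectory() as tmp:
    escaped_path = os.path.join(tmp, "escaped.jsonl")
    utf8_path = os.path.join(tmp, "utf8.jsonl")

    write_jsonl(escaped_path, [record], ensure_ascii=True)
    write_jsonl(utf8_path, [record], ensure_ascii=False)

    escaped_is_ascii = is_pure_ascii(escaped_path)
    utf8_is_ascii = is_pure_ascii(utf8_path)

print(escaped_is_ascii)  # True
print(utf8_is_ascii)     # False
```

Note that the file is opened as UTF-8 in both cases, so the ASCII-only result comes from json.dump() itself and not from the locale or the file encoding.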

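# On escape sequences and how they are parsed

GPT4o then explained in great detail how to generate and set "zh_CN.UTF-8" from the command line, and how to re-encode and decode the escape sequences in Python so that the Chinese string is displayed, but my attempts did not succeed. It turned out I had misunderstood.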

GPT4o: It appears there may be a misunderstanding regarding the purpose and usage of the unicode-escape encoding. The unicode-escape codec is used to interpret escape sequences in strings, but it does not process normal Unicode escape sequences (\uXXXX) that are already present in a Unicode string.

If you have a string that looks like a Unicode escape sequence and you need to convert it into actual Unicode characters, you can use the unicode-escape codec. However, your string needs to be encoded in a way that the codec understands.

Here’s the corrected approach to convert a string with Unicode escape sequences into actual Chinese characters:

Given String: You have a string with Unicode escape sequences.
Decode: Use unicode-escape to decode the string.


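```python
# Given string with Unicode escape sequences (literal backslashes, hence "\\u")
unicode_str = "\\u5c0f\\u6728\\u4ed4-\\u7761\\u7740\\u4e862"

# Decode the Unicode escape sequences.
# Note: this round trip is only safe when the string is pure ASCII, as it is here.
decoded_str = unicode_str.encode('utf-8').decode('unicode-escape')
print(decoded_str)
```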

Explanation

The string unicode_str contains Unicode escape sequences.
The encode('utf-8') part ensures the string is in a byte format.
The decode('unicode-escape') part interprets the byte format as Unicode escape sequences and converts them into actual characters.

This should correctly output the Chinese characters:

小木仔-睡着了2

Note

Ensure that your string uses double backslashes (\\uXXXX) if you’re defining it directly in the script. This is necessary because a single backslash (\uXXXX) would be interpreted as an actual Unicode character during the string definition phase, not as an escape sequence.
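A small sketch of the single versus double backslash distinction, followed by an alternative way to decode the escapes. The variable names are hypothetical, and wrapping the text in double quotes for json.loads() assumes it contains no unescaped quote characters of its own.

```python
import json

# Single backslash: Python resolves the escape while parsing the literal,
# so the string holds one actual character.
as_character = "\u5c0f"

# Double backslash: the string holds six characters of escape text,
# namely a backslash, "u", and four hex digits.
as_escape_text = "\\u5c0f"

print(len(as_character))    # 1
print(len(as_escape_text))  # 6

# The escapes in this post came from a JSON file, so the JSON parser itself
# can decode them: wrap the text in quotes to form a JSON string literal.
escaped_text = "\\u5c0f\\u6728\\u4ed4-\\u7761\\u7740\\u4e862"
decoded = json.loads('"' + escaped_text + '"')
print(decoded)  # 小木仔-睡着了2
```

In practice, simply reading the original file back with json.load() should already return the Chinese text, since \uXXXX escapes are part of the JSON format and are decoded by the parser.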

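A small gripe: I have picked up yet more useless knowledge. After all, why are you using this piece of junk called WSL in the first place???

~~Because this computer was originally bought for someone who is not a terminal enthusiast~~

But that someone never used it anyway, and you are the one hogging it. Not long ago I discovered that Microsoft was once again brazenly snooping inside the system (Settings > Privacy & security > Activity history), and I swore to get rid of this disgusting thing immediately, right away, without a moment's delay.

~~I've been rather busy lately, blah; the newly installed 2 TB drive is a pain to clean out, blah blah; I keep feeling that someone will end up using it after all, mumble~~

The real reason is that I can still just about use it while holding my nose, but I doubt that will last much longer.

After all, Linux is the real root, the one place that feels like home/ :)

Update: although the cause that gave rise to this post has changed fundamentally (see the "Update" at the top of the article), that does not affect the gripes between "A small gripe" and this update.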