inspect_substrings_in_file_using_sed_awk_jq

2024-07-17

/posts/thecli/awk_sed_cat_head_etc/awk_use_cases/ map[email:1522009317@qq.com name:fmh]

Table of Contents

在命令行工作远非完美，但具有极高的自由度。比如，你想要查看某个文件内(这里绝对不是指 doc/docx 这类怪胎，而是 text/csv/tsv/json 等等正常的文件) 的某一行的内容，如果那一行有很多列，而其中要是某一列的内容特别长（就是sed出来占据了整个屏幕这种长度），那么，你想要对这个文件的内容有所了解，可能需要额外的软件来打开此文件，再好好欣赏之。或者，其实在命令行就可以呢？– awk 了解一下。

is_value_missing

像我就是碰上这么个情况，将图片转换成 base64 编码的 byte-data 后，在 Jupyterlab 打开发现有些行怎么是空行？（首先我不是去质疑代码，而是质疑“眼见为实”）

所以，必须要查看某一行的内容，并且要截断过长的字串，老伙计 GPT4o 很快给出答案：

# TSV

# To print the 10th row and truncate the 3rd column to 50 characters
awk 'NR==10 { $3=substr($3, 1, 50) "..."; print }' FS="\t" OFS="\t" file.tsv

这还没完，因为 base64 编码的缘故（以及输入图片比较相似）导致 awk 到的结果都是一模一样的起始字串，自然而然就怀疑这是巧合还是错误，所以，必须查看字串尾部内容是否也一样, 这当然难不倒一点脾气也不会有的 GPT4o 老哥：

# print the last 50 characters of the string in the 3rd column of the 10th row
awk 'NR==10 { len=length($3); print substr($3, len-50, 50) }' FS="\t" file.tsv

# JSONL

# To print the 10th row and the field names of json object
sed -n '10p' file.jsonl | jq 'keys'

# To print the 10th row and the first 10 chars of strings or 10 elements of array
sed -n '10p' file.jsonl | jq '.field_name[:10]'

话痨多两句：

至于参数代表啥意思，我通常不让 GPT4o 多费唇舌。

（毕竟就算解释了我还是会动手验证一番，再者，最关键是我还没开通 plus 会员，所以得听柯景腾他老妈的那句劝：要省着点用哦 :）