tid: 4025914848: the FragAt data that identifies the replied-to user in sub-post (楼中楼) content is wrong. #214

Closed
Sorceresssis opened this issue Jul 21, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Sorceresssis

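Brief description of the bug

tid: 4025914848: in the sub-posts of floor 2, the FragAt data that identifies the replied-to user is wrong.

How to reproduce

In what scenario and with what operations it reproduces

Location of the sub-post:
tid: 4025914848
floor: 2, pid:75363365497
Sub-post #2, pid:75497714390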

image

The FragAt data is wrong: the text field is garbled and the user_id field is 0.
The correct values should be 卡艾王 and 1097650720.

"objs": [
    {
        "text": "回复 "
    },
    {
        "text": "������",
        "user_id": 0
    },
    {
        "text": ":谁啊"
    },
    {
        "id": "image_emoticon21",
        "desc": "呼~"
    }
],
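
A minimal sketch of how such a fragment could be flagged automatically, assuming the fragments are available as plain dicts shaped like the dump above. The helper name `find_broken_at_frags` is hypothetical and not part of aiotieba:

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the character shown as "�" in the dump


def find_broken_at_frags(objs):
    """Return the indices of fragments that look like a corrupted FragAt.

    Assumption: an at-fragment is any dict carrying a "user_id" key,
    which matches the simplified dump in this report.
    """
    broken = []
    for i, obj in enumerate(objs):
        if "user_id" not in obj:
            continue  # plain text or emoticon fragment
        if obj["user_id"] == 0 or REPLACEMENT_CHAR in obj.get("text", ""):
            broken.append(i)
    return broken


objs = [
    {"text": "回复 "},
    {"text": "\ufffd" * 6, "user_id": 0},
    {"text": " :谁啊"},
    {"id": "image_emoticon21", "desc": "呼~"},
]
print(find_broken_at_frags(objs))  # [1]
```

This only detects the problem; it cannot recover the missing user_id.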
@lumina37
Owner

It is already garbled in the client itself, so there is nothing I can do about it.

@lumina37 lumina37 added the bug Something isn't working label Jul 21, 2024
@n0099

n0099 commented Jul 21, 2024

Sub-post #2, pid:75497714390

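That is called an spid. Sub-posts (楼中楼) and reply posts are essentially exactly the same kind of thing, and Tieba's own naming calls them (sub)posts: #114 (comment)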

$ curl -s https://n0099.net/tbm/v1/client_tester.php\?type\=lzls\&tid\=4025914848\&pid\=75363365497\&pn\=1\&client_version\=8.8.8.8 \
| jq '.subpost_list[] | select(.id == 75497714390) | .content'
[
  {
    "text": "回复 ",
    "type": 0
  },
  {
    "type": 4,
    "text": "������",
    "uid": 0
  },
  {
    "type": 0,
    "text": " :谁啊"
  },
  {
    "type": 2,
    "text": "image_emoticon21",
    "c": "呼~"
  }
]

This is the classic Unicode replacement character: https://codepoints.net/U+FFFD https://en.wikipedia.org/wiki/Specials_(Unicode_block). My blind guess is that a bug is hit when converting GBK/GB2312 to UTF-8.

$ echo -n 卡艾王 | tee >(hexdump) | wc -c
9
0000000 8de5 e8a1 be89 8ee7 008b
0000009
$ echo -n 卡艾王 | iconv -t GB2312 | tee >(hexdump) | wc -c
0000000 a8bf acb0 f5cd
0000006
6

卡艾王 is 6 bytes in GB2312, which exactly matches the count of these 6 replacement characters.
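
The byte counts can be cross-checked with a short sketch. It only shows that the guess is consistent with the observed output: decoding the GB2312 bytes as UTF-8 with replacement is an assumption about what might happen on the server side, not something confirmed here.

```python
name = "卡艾王"

utf8_bytes = name.encode("utf-8")
gb_bytes = name.encode("gb2312")

# Assumed failure mode: GB2312 bytes are fed to a UTF-8 decoder
# that substitutes U+FFFD for every invalid byte.
misdecoded = gb_bytes.decode("utf-8", errors="replace")

print(len(utf8_bytes), len(gb_bytes))  # 9 6
print(misdecoded.count("\ufffd"))      # 6
```

None of the six GB2312 bytes forms a valid UTF-8 sequence here, so each one is replaced individually.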

@Sorceresssis
Author

Sorceresssis commented Jul 21, 2024

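The garbled text doesn't matter. What hurts is that the user_id can't be obtained; with the user_id I could simply substitute the nickname.
The Tieba web version is surprisingly well compatible. I think the data could be completed by combining this with a traditional web crawler. But keeping crawls of the official Tieba web pages and aiotieba in sync is very troublesome. The web page has no fixed rn, so the number of posts per page varies. You would have to rely on binary search to locate things bit by bit.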

@n0099

n0099 commented Jul 22, 2024

The garbled text doesn't matter

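Please don't call any human-unreadable blob of mysterious characters (such as base64 encoding https://z.n0099.net/#narrow/near/92764 or a uuid n0099/open-tbm#24 ) "garbled text" unless you really don't know what it is.
You already know these are six U+FFFD Replacement Characters.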

keeping crawls of the official Tieba web pages and aiotieba in sync

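That is, the web interface and the client interface.
There is also a WAP web interface ( https://tieba.baidu.com/mo , which apparently has been forcibly redirecting users to the latter since #124 (comment) #199 (comment) ) and a mobile web interface (accessing the web URLs with a mobile device UA, aka so-called https://en.wikipedia.org/wiki/Content_negotiation ).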

The web page has no fixed rn, so the number of posts per page varies.

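https://tieba.baidu.com/p/4025914848 : when logged in, the reply posts of a thread use rn=30 #158 (comment); when not logged in it is 15.
https://tieba.baidu.com/p/comment?tid=4025914848&pid=75363365497&pn=1 : the sub-posts of a reply post are fixed at 10.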

You would have to rely on binary search to locate things bit by bit.

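As long as the client interface and the web interface use the same rn, querying with the same pn will line up.
This ignores some threads and reply posts, such as the two kinds of poll threads #83 (comment) #124 (comment), that are exclusive to the web or the client and visible only through that interface.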
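
The page alignment can be sketched as plain arithmetic. This assumes both interfaces return the same items in the same order and that pn is 1-based. The helper names `page_of` and `pages_covering` are hypothetical, and interface-exclusive posts would break the assumption:

```python
def page_of(index, rn):
    """Return the 1-based page number holding the 0-based item index."""
    return index // rn + 1


def pages_covering(pn, rn_src, rn_dst):
    """Return the pages of size rn_dst that cover page pn of size rn_src."""
    first = (pn - 1) * rn_src  # index of the first item on the source page
    last = pn * rn_src - 1     # index of the last item on the source page
    return list(range(page_of(first, rn_dst), page_of(last, rn_dst) + 1))


print(pages_covering(2, 30, 30))  # [2]
print(pages_covering(2, 30, 15))  # [3, 4]
print(pages_covering(4, 15, 30))  # [2]
```

With equal rn the mapping is the identity, which is the case described above. With rn=30 versus rn=15 one logged-in page corresponds to two logged-out pages.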


3 participants