Bybot handles the everyday path well — guided flows, security escalations, 中文 support. The role isn't fixing a broken bot; it's owning the part that's never finished: the free-text long tail, EN/中文 parity as the product changes, and the QA discipline to improve a live bot without breaking what already works. Everything below is what I'd do, grounded in how Bybot behaves today.
A strong bot stays strong only if someone owns the loop that keeps it that way: read the fallbacks → find the weak intent → fix the KB or routing → prove the metric moved → don't regress it. That's what I'd do at Bybit, and I don't just describe it: I built a working bilingual support bot, then ran this loop on it live and shipped five regression-gated fixes (✦ Proof). On the human side I've run the same loop at a Bitcoin L2 (BOB) and an Ethereum app (Aztec), authoring the docs that cut repeat queries.
At 80M+ users, even a 1-point shift in chatbot containment can mean tens of thousands of tickets that never reach a human; that's the business case for this role. Crypto support is also uniquely high-stakes: questions are about money in motion (stuck deposits, wrong-network transfers, withdrawal holds, P2P disputes, liquidations), they're time-pressured and emotional, and a wrong automated answer isn't just unhelpful, it can be financially harmful. That raises the bar on when Bybot should answer vs. escalate, which is the thread running through this whole audit. And it isn't abstract for Bybit: the Feb 2025 ~$1.5B cold-wallet exploit (the largest crypto theft to date) is exactly why money-movement and account-security intents must reach a human fast. Getting that routing right is the heart of this role.
Six duties in the JD. Here's what each one actually means in week-to-week work, and where I cover it on this page.
| JD duty | What it means in practice | Covered in |
|---|---|---|
| Own Bybot's lifecycle | Plan → train (intents/answers) → integrate (web/app/CRM) → monitor → continuously improve. You are the product owner of the bot. | §05 |
| Analyse fallback & resolution rates | Instrument the funnel, find where users drop or escalate, raise self-service without hurting CSAT. | §06 |
| Maintain bilingual KB (EN + 中文) | Author and keep both languages at parity; the KB is the bot's brain and the help-center's content. | §07 |
| QA via scenario testing | Before and after every change, run the bot through real conversations; catch regressions. | §08 · §09 |
| Cross-team + train internal staff | Translate between Product / Eng / CS Ops; teach agents what the bot can and can't do. | §05 · §11 |
| Lead NLP / AI optimisation | Improve intent recognition, retrieval, answer quality and auto-routing (right queue by intent × confidence × language × risk) — increasingly LLM/RAG-based. | §10 |
Requirements call for 5–7 yrs in chatbot/CS-AI ops, knowledge of the CS ticket lifecycle and escalation/routing logic, EN + 中文 fluency, CRM familiarity, basic NLP/AI, and strong analytical + PM skills. Fit mapped in §12.
Traced from the public help center and support docs (Jun 2026). Bybit runs a tiered self-service → assisted funnel. Each stage is a place Bybot either deflects a ticket or leaks one to a human.
Bybot is guided-flow-first (tapped intents → clean, deep-linked, authored answers), with solid free-text/typo NLU and 👍/👎 + "以上皆不相关" (none-of-these) fallback capture — the miss-signal stream an analyst mines (§06). Two analyst-relevant tells: it answers in page locale, not message language (English typed on the 中文 page → 中文 reply, reproduced ×2 — a parity gap, §07), and security intents route to a human with a 5–7-day SLA. Full architecture read + the fact-checked verify-table in ⚙ Bybot dissected.
When Bybot can't match an intent, the practical exit is Submit Case → a 3–7 working-day ticket, and at peak volume live chat is sometimes turned off — leaving only the bot or email (help-center docs, public reviews, Jun 2026). So the headline containment number can look healthy while a user's money-movement question waits days. That gap is the whole reason §06 separates containment from true-resolution, and why ★ biases money intents toward a fast human hand-off.
Everything tagged confirmed is observable from public pages cited in §13; internals are deliberately framed as questions, not assumptions.
Bybot handles the common path well — these aren't claims that it's broken. They're the failure classes that never fully go away on any exchange CS bot, because the product keeps changing and the long tail is infinite. This is the work that's never "done," and what I'd own: measure each, watch the trend, attack the worst.
| Failure mode | What the user experiences | Cost | The fix lives in |
|---|---|---|---|
| Fallback dead-end | Bot can't match intent, loops "I didn't get that," no clean hand-off → user rage-types or leaves. | high | §06 + routing |
| Confident-but-wrong | Bot answers a money-movement question incorrectly (worse than not answering, in crypto). | high | §09 QA gate |
| EN/中文 parity drift | English KB updated; Chinese lags → bot gives 中文 users a stale or inconsistent answer. | med | §07 |
| Deflection ≠ resolution | "Containment" counted even though the user gave up unresolved — vanity metric hides pain. | med | §06 |
| KB rot | Product ships a change; KB/intents not updated → bot answers about a flow that no longer exists. | med | §05 loop |
Black-box, no code access — and a rule: don't take the JD's word for it. A posting describes intended scope, not the deployed system, so every claim is a hypothesis with a confidence and a probe to confirm it. The JD narrows the search; live behaviour is the arbiter.
| Hypothesis | Independent evidence | Verdict |
|---|---|---|
| Ticketing on Salesforce Service Cloud | Submit-Case URL is bybit.com/…/s/webform — the /s/ path is Salesforce Experience Cloud's signature | likely · med-high |
| Intent classifier + curated flows (not pure LLM) | Guided buttons + a fixed clarifier menu; a misspelled "withdraw" query resolved fine | unverified; also fits an LLM |
| Language-specific flows keyed to page locale | English typed on the 中文 page → 中文 reply, reproduced ×2 (explains the parity gap) | plausible, not proven |
| Auto-routing bot→human; hard safe-flow on security intents | "connects to a live agent on request"; "i was hacked" → account-disable + ticket | likely |
Confirm probes: paraphrase one intent 5 ways (rigid template reuse ⇒ intent bot; fluent variety ⇒ LLM); type 中文 on the EN page; inspect the chat widget's network origin for *.salesforce.com. Internals are framed as questions, never asserted.
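The paraphrase probe can be scored rather than eyeballed. The following is a minimal sketch, assuming the replies to the five paraphrases have been collected by hand into a list of strings. The function names and the 0.9 cut-off are hypothetical and would need calibrating on replies of known origin.

```python
from difflib import SequenceMatcher
from itertools import combinations


def reply_similarity(replies):
    """Mean pairwise similarity (0..1) across replies to paraphrases of one intent."""
    pairs = list(combinations(replies, 2))
    if not pairs:
        raise ValueError("need at least two replies to compare")
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def likely_templated(replies, cutoff=0.9):
    """True when replies are near-identical, which hints at a curated intent answer.

    The cutoff is an illustrative assumption, not a validated threshold.
    """
    return reply_similarity(replies) >= cutoff
```

A high score only suggests template reuse; an LLM with a cached or tightly constrained answer could score the same, so the result stays a hypothesis alongside the other probes.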
Where the bar is, and what Bybot can borrow — researched from public sources (Jun 2026; cited in §13).
Coinbase: agentic Claude-based support (per Anthropic's published case study), grounded in KB + real-time account data, with compliance guardrails + a documented eval process. Coinbase reports AI cut account-restriction resolution times ~90% and now handles ~55% of US fraud cases (Armstrong, May 2026). This is the north star, and the shape of my ✦ demo.
(From Binance AI Pro — a trading agent, not its support bot.) Wires multiple LLMs (ChatGPT/Claude/Qwen/MiniMax/Kimi) in an isolated sub-account with a no-withdrawal API key. Borrow the pattern: no single-vendor lock-in; least-privilege for any account action.
The exchange (kraken.com) emphasises true 24/7 human support. Borrow: on money-at-risk, a fast human hand-off is a feature, not a bot failure. (Not to be confused with the unrelated "Kraken" energy-CS platform at kraken.tech.)
Native per-language support + a priority lane for token holders (per Bitget's own materials; no official public SLA). Borrow: parity means native, not "multilingual", and SLA targets would make "resolution" concrete.
Self-service → bot triage → live chat → ticket, multilingual. The same funnel as everyone — so differentiation is quality (resolution, parity, escalation judgment), not the funnel.
Coinbase's eval-gated agentic pattern + Binance's multi-model/isolation + Kraken's hand-off discipline + Bitget's native parity & SLAs. If Bybot is still intent-tree, that's the leapfrog — and I've prototyped it.
Pull the last 90 days of un-matched / escalated turns, cluster by topic. The top 20 clusters are the entire near-term backlog — ranked by volume × escalation rate.
Stop reporting "containment" alone. Add a true-resolution signal (no re-contact in 24h + no escalation + thumbs-up) so we optimise help, not avoidance.
Auto-flag any KB article where EN was edited after its 中文 counterpart. One report kills a whole class of bilingual drift.
For money-movement intents (withdrawals, stuck deposits, liquidations) bias toward fast hand-off over a risky auto-answer. Safety beats deflection.
Freeze the top 100 real conversations as a test suite so no KB/intent edit silently breaks an answer that used to work (§08).
A 30-min standing review with CS Ops: worst 10 transcripts, what shipped, what moved. The improvement loop becomes a habit, not a project.
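The money-movement bias among the quick wins above can be expressed as a stricter confidence bar for risky intents. This is a sketch under stated assumptions: the intent names, both thresholds and the `route` function are hypothetical, and Bybot's real routing config is not public.

```python
from dataclasses import dataclass

# Hypothetical intent names; a real list would come from the bot's intent catalogue.
MONEY_INTENTS = {"withdrawal", "stuck_deposit", "liquidation"}


@dataclass(frozen=True)
class Decision:
    action: str  # "answer" or "handoff"
    reason: str


def route(intent, confidence, base_threshold=0.6, money_threshold=0.9):
    """Answer only when confidence clears the bar; money intents get a stricter bar.

    Both thresholds are illustrative assumptions to be tuned against real transcripts.
    """
    threshold = money_threshold if intent in MONEY_INTENTS else base_threshold
    if confidence >= threshold:
        return Decision("answer", f"confidence {confidence:.2f} >= {threshold:.2f}")
    return Decision("handoff", f"confidence {confidence:.2f} < {threshold:.2f}")
```

With these placeholder values, the same 0.8 confidence answers a fee question but hands a withdrawal question to a human, which is the intended asymmetry.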
The JD's first duty. I treat the bot as a product with a tight build-measure-learn loop — and the analyst as its owner across Product, Eng and CS Ops.
Prioritise intents by ticket volume × cost × automatability. Not everything should be automated — money-movement edge cases stay human by design.
Author intents/utterances + KB answers (EN+中文), set confidence thresholds, define escalation triggers per intent.
Web + app + CRM. Make sure the bot transcript + detected intent land on the ticket so agents start with context, not a blank slate.
Dashboards on fallback / containment / escalation / CSAT, sliced by language and intent. Alerts when a metric regresses after a release.
Weekly: read the worst transcripts, fix the KB or routing, ship, verify the metric moved, add to regression set. Repeat.
Train CS agents on what Bybot now handles and where to trust/override it — so the bot and humans reinforce each other.
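The planning rule in the lifecycle above (volume × cost × automatability, with some intents kept human by design) reduces to a small scoring function. A minimal sketch, assuming each intent is a dict with the hypothetical keys `monthly_tickets`, `cost_per_ticket`, `automatability` (0 to 1) and an optional `human_only` flag:

```python
def priority_score(intent):
    """Volume x cost x automatability for one intent."""
    return intent["monthly_tickets"] * intent["cost_per_ticket"] * intent["automatability"]


def prioritise(intents):
    """Return automatable intents, highest score first.

    Intents flagged human_only (for example money-movement edge cases)
    are excluded rather than ranked low, so they never enter the backlog.
    """
    candidates = [i for i in intents if not i.get("human_only", False)]
    return sorted(candidates, key=priority_score, reverse=True)
```

Excluding human-only intents outright, instead of giving them a low score, keeps a high-volume dispute intent from climbing the list just because it is expensive.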
"Analyse fallback and resolution rates" is duty #2. The trap is optimising deflection (user didn't reach a human) instead of resolution (user's problem was actually solved). Here's the metric tree I'd run.
Illustrative shape, not Bybit data — the point is that the muted bars (dead-ends + false containment) are the real backlog, and they're invisible if you only track one headline number.
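The gap between containment and true resolution can be computed side by side. This is a minimal sketch, assuming each bot session is exported as a dict with the hypothetical fields `escalated`, `feedback`, `ended_at` and `next_contact_at`; none of these names come from Bybit's systems. It applies the true-resolution rule stated earlier: no re-contact in 24h, no escalation, and a thumbs-up.

```python
from datetime import datetime, timedelta


def is_truly_resolved(session, window=timedelta(hours=24)):
    """No escalation, a thumbs-up, and no re-contact inside the window."""
    if session["escalated"] or session.get("feedback") != "up":
        return False
    next_contact = session.get("next_contact_at")
    return next_contact is None or next_contact - session["ended_at"] > window


def rates(sessions):
    """Return containment and true resolution as fractions of all sessions."""
    if not sessions:
        raise ValueError("no sessions to score")
    n = len(sessions)
    contained = sum(not s["escalated"] for s in sessions)
    resolved = sum(is_truly_resolved(s) for s in sessions)
    return {"containment": contained / n, "true_resolution": resolved / n}


if __name__ == "__main__":
    t0 = datetime(2026, 6, 1, 12, 0)  # made-up timestamps, not Bybit data
    demo = [
        {"escalated": False, "feedback": "up", "ended_at": t0, "next_contact_at": None},
        {"escalated": False, "feedback": None, "ended_at": t0,
         "next_contact_at": t0 + timedelta(hours=2)},
        {"escalated": True, "feedback": None, "ended_at": t0, "next_contact_at": None},
    ]
    print(rates(demo))
```

Requiring a thumbs-up makes the signal conservative, since most users never rate a chat; treating missing feedback as neutral is a reasonable variant to test against re-contact data.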
Duty #3, and the reason fluent Mandarin is a hard requirement. The KB is built from three sources — Help Center content, internal SOPs, and patterns in CRM ticket data — and in a RAG-style bot it is the model's knowledge, so KB quality is bot quality. I'm a native-level EN + 中文 (+ BM) writer, so I can own both languages directly rather than route every Chinese edit through translation.
Treat EN and 中文 as one article in two languages with a shared "last-reviewed" stamp. CI-style report flags any pair where one language is stale.
Front-load the answer, one intent per article, explicit synonyms ("tag" = "memo"), so both humans and the bot's retriever find the right chunk.
New articles come from the fallback clusters and CRM ticket trends (★) — the KB grows where users actually get stuck.
This is the exact loop I ran at BOB — I authored documentation and FAQs that measurably cut repeat support queries. Same mechanism here, now feeding a bot as well as a reader.
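The parity report described in this section is a date comparison per article pair. A minimal sketch, assuming the KB can be exported as dicts with hypothetical fields `id`, `en_updated` and `zh_updated`; treating a missing 中文 version as its own finding is an added assumption.

```python
from datetime import date


def parity_report(articles):
    """List (article_id, finding) for every EN/中文 pair that is out of step."""
    report = []
    for article in articles:
        en = article["en_updated"]
        zh = article.get("zh_updated")
        if zh is None:
            report.append((article["id"], "zh missing"))
        elif en > zh:
            report.append((article["id"], "zh stale"))  # EN edited after 中文
        elif zh > en:
            report.append((article["id"], "en stale"))  # 中文 edited after EN
    return report


if __name__ == "__main__":
    demo = [  # made-up articles for illustration
        {"id": "withdraw-memo", "en_updated": date(2026, 6, 2), "zh_updated": date(2026, 5, 1)},
        {"id": "kyc-levels", "en_updated": date(2026, 4, 1), "zh_updated": date(2026, 4, 1)},
    ]
    print(parity_report(demo))
```

An edit date catches drift in timing only; a pair edited on the same day can still disagree in content, which is where scenario tests in both languages take over.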
Duty #4. "QA through scenario testing" only scales if scenarios are a persistent suite, not ad-hoc clicking. Every change runs the gauntlet before it ships, in both languages.
I built a version of this failure scorecard as an analyst reviewing automated AI outputs at KIP Protocol. It turns "this answer feels off" into a consistent, trainable label, so QA is comparable across reviewers, languages and weeks, and the failure mix tells you what to fix next.
| Failure class | Definition | Severity | Typical fix |
|---|---|---|---|
| Wrong answer | Factually incorrect for the user's case — esp. on funds/fees/limits. | critical | KB correction + regression case |
| Hallucinated policy | Invents a rule/step that doesn't exist. | critical | Tighten retrieval / grounding |
| Missed escalation | Should have handed to a human; didn't. | critical | Escalation trigger on intent |
| No-match / fallback | Valid question, intent not recognised. | major | Add intent/utterances or article |
| Partial / incomplete | Technically right but misses a key step. | major | Rewrite KB answer |
| Language/tone drift | EN↔中文 mismatch, robotic or off-brand tone. | minor | KB parity + tone guide |
| Over-escalation | Bounced to a human something it could solve. | minor | Raise confidence/coverage |
Each reviewed transcript gets one primary label + severity; the weekly distribution is the roadmap. Critical classes also feed the safety regression set (§08).
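Turning a week of labels into that distribution is a counting exercise. A minimal sketch, assuming reviewers record one primary label per transcript using the hypothetical snake_case keys below; the severities mirror the table above.

```python
from collections import Counter

# Label keys are hypothetical; severities follow the scorecard table.
SEVERITY = {
    "wrong_answer": "critical",
    "hallucinated_policy": "critical",
    "missed_escalation": "critical",
    "no_match": "major",
    "partial": "major",
    "language_tone_drift": "minor",
    "over_escalation": "minor",
}


def weekly_distribution(labels):
    """Return (labels ranked by count, totals per severity) for one review week."""
    counts = Counter(labels)
    unknown = set(counts) - set(SEVERITY)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    by_severity = Counter()
    for label, n in counts.items():
        by_severity[SEVERITY[label]] += n
    return counts.most_common(), dict(by_severity)
```

Rejecting unknown labels keeps reviewers on the shared taxonomy, which is what makes the weekly numbers comparable across people and languages.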
I built a working bilingual RAG support bot — bot.web3wagmi.com — then stress-tested it as an adversarial QA analyst and shipped five fixes in one session, each gated by an eval that must hold safety 10/10 · 0 regressions or it doesn't ship. The same loop I'd run on Bybot: read the failure → find the weak intent → fix KB/routing → prove the metric moved → don't regress it.
A 中文 question answered in English. Fix: translated all 107 guides to 中文, ingested a Chinese KB, added CJK-aware retrieval. Now: 中文 in → 中文 out, with 中文 sources. (§07)
Gibberish (“u eat?”) got a low-confidence essay by matching a stray word. Fix: a semantic floor — no named entity + weak meaning → deflect, don’t guess. (§06)
“bybit vs binance” was turned away — filler + a second entity dragged the match below threshold. Fix: recognise named-guide entities so comparisons answer. (§06)
Chat displayed “confidence low · 0.394 — recommend escalation”. Fix: telemetry moved to the analyst scorecard; customer sees a clean answer + a human hand-off. (§09)
“my wallet ahcked” missed escalation. Fix: fuzzy-match high-severity terms — accept minor over-escalation because a missed one is the only critical failure. (★)
All five shipped only because the eval held safety 10/10 · 0 regressions. Change without a regression gate is how a “fix” silently breaks a working answer.
Each fix maps to a JD lever: precision/recall on the deflect boundary (§06), EN/中文 parity (§07), regression-gated QA (§08), escalation routing (★). The point isn’t the bot — it’s that I find where even a solid bot leaks, fix it, and prove it stayed fixed.
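The typo-tolerant escalation fix can be illustrated with the standard library alone. This is a sketch of the idea, not the code running on the demo bot: the term list and the 0.8 cut-off are hypothetical, and whitespace tokenisation only suits English (中文 input would need a word segmenter first).

```python
from difflib import get_close_matches

# Hypothetical high-severity vocabulary; a real list would be curated with CS Ops.
HIGH_SEVERITY_TERMS = ["hacked", "stolen", "phishing", "compromised"]


def must_escalate(message, cutoff=0.8):
    """True if any word is a close match for a high-severity term.

    A lower cutoff catches more typos at the price of more over-escalation,
    which is the accepted trade-off for this failure class.
    """
    for token in message.lower().split():
        word = token.strip(".,!?\"'")
        if get_close_matches(word, HIGH_SEVERITY_TERMS, n=1, cutoff=cutoff):
            return True
    return False
```

With these placeholder values, "my wallet ahcked" escalates while an ordinary account question does not; each such pair belongs in the safety regression set so a later threshold change cannot silently undo it.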
Duty #6 is "lead NLP/AI optimisation." Frontier LLMs commoditised language understanding, so "NLP optimisation" now means the layer around the model: retrieval, routing, safety guardrails and evals. Most CS-analyst applicants stop at dashboards; I also build the tooling. This is the shape of a regression harness that runs the golden + safety sets against the bot and blocks a release on any safety regression. It is illustrative of approach, not production code.
```python
# bybot_eval.py: run the golden/safety suite and gate the release
import json
import statistics as st
import sys

from bybot_client import ask  # thin wrapper over the bot API

# Labels that must never ship.
CRITICAL = ("missed_escalation", "hallucinated_policy")


def grade(case, resp):
    """Map one response to a scorecard label from §09 (machine-checkable subset)."""
    if case["must_escalate"] and not resp["escalated"]:
        return "missed_escalation"  # CRITICAL: never ship
    if case["intent"] != resp["intent"]:
        return "no_match"
    if not resp["grounded"]:  # does the answer cite a real KB chunk?
        return "hallucinated_policy"
    return "ok"


def run(suite, lang):
    """Run every case in one language and summarise the results."""
    rows = [(c, ask(c["utterance"], lang=lang)) for c in suite]
    labels = [grade(c, r) for c, r in rows]
    return {
        "lang": lang,
        "containment": round(st.mean(r["contained"] for _, r in rows), 3),
        "match_rate": round(labels.count("ok") / len(labels), 3),
        "critical": sum(label in CRITICAL for label in labels),
    }


if __name__ == "__main__":
    with open("golden_safety.json", encoding="utf-8") as f:
        suite = json.load(f)
    report = [run(suite, lang) for lang in ("en", "zh")]  # both languages, every run
    for r in report:
        print(r)
    # Release gate: any critical failure in any language fails the build.
    # sys.exit is used instead of assert because assert is stripped under `python -O`.
    if any(r["critical"] for r in report):
        sys.exit("safety regression: blocking release")
```
The point isn't this exact script — it's that I think about chatbot quality as something you can measure, gate and regress like software, in EN and 中文 together. I also build onchain AI agents and read SDKs/contracts, so I'm comfortable wherever the bot meets engineering.
I don't know the internal stack, so this is vendor-agnostic: the frontier tools I'd evaluate per JD responsibility. ✓ shipped = I've already run a working version (✦ Proof). The posture: instrument first, measure the real gaps, then pick buy / build / tune against the numbers — the data picks the dish, not the menu.
| JD area | Frontier tools I'd evaluate (2026) | Already shipped? |
|---|---|---|
| QA / scenario testing / regression | promptfoo (red-team), DeepEval (CI), Botium (UI regression), RAGAS, garak | ✓ eval gate + 3-model judge |
| Fallback / resolution analytics | Langfuse / Arize Phoenix (tracing), BERTopic clustering, multilingual embeddings + reranker | ✓ failure scorecard |
| Bilingual EN/中文 KB + parity | frontier + China-strong models (Qwen/DeepSeek/GLM); multi-model consensus QA; parity-drift detection | ✓ DeepSeek→Gemini+Qwen pipeline |
| Auto-routing / safety | intent+threshold routers, semantic deflection gate, NeMo Guardrails / Llama Guard | ✓ entity gate + safety escalation |
| Platform (if modernising) | benchmark Fin / Sierra / Decagon / Agentforce vs in-house; Rasa, LangGraph, MCP for control | evaluate |
The rest of this page is the argument; this is just the résumé line. 5+ yrs on the CS ↔ content ↔ AI seam — front-line support at a Bitcoin L2 (BOB), customer success on an Ethereum app (Aztec), docs that cut repeat queries, and an AI-QA failure scorecard at KIP. Native EN / 中文 (+ BM), and I build AI agents — credible to CS Ops and Engineering alike. The proof is the loop I ran live and the product read in §03.
Everything tagged confirmed is observable from Bybit's public help center and support docs (Jun 2026): the live-chat flow that opens with a virtual assistant and connects to a live agent on request, the self-service recovery functions, the "Submit Case" webform, and 24/7 multilingual support. The bot's internal platform, thresholds and metrics are not public — those are framed as questions in §03, not assumptions. No private systems were accessed; metric values in §06 are explicitly illustrative.
I worked the live surface end to end:
Public reviews surface recurring patterns ("generic, scripted responses", "~20 chats, no solution", a "queue of 108", "live chat is just a bot", multi-day freezes on money-movement). Self-selected reviews aren't data — I'd treat them as hypotheses to validate against internal metrics — but they point at containment that isn't resolution (§06), money intents needing a faster human route (★), and EN/中文 consistency gaps (§07).
Role: Client Service Analyst — (Senior) AI Chatbot · Greenhouse
Help Center: bybit.com/en/help-center
Submit Case: help-center/s/webform
Support flow: Bybit — 24/7 multilingual live chat
Sentiment: Trustpilot — Bybit reviews
Support detail: TradersUnion — Bybit support
Peer — Coinbase AI: Anthropic case study · Armstrong, ~90% / 55% (May 2026)
Peer — Binance AI Pro / Kraken support: Binance · Kraken 24/7
Claims are split into confirmed (verified from public pages) and illustrative/inferred (modelled or from the JD) throughout. Code is illustrative of approach, not production config. This is an unsolicited audit prepared as interview homework — happy to walk through any section live.
Prepared by Edward Tay · for the Bybit Client Service Analyst — (Senior) AI Chatbot role · Jun 2026 · edwardtay.com · Edwardtay7@gmail.com