1000ms

Difficulty: 8

所属公司 : 华为

时间 :2025年9月10日-国内-AI

算法标签>

模拟

解题思路

问题抽象

文档集合 corpus[0…N-1] 按时间递增（下标既是时间点/编号）。
查询 (t, q) 的检索窗口只看 L = max(0, t-K+1) 到 R = min(t, N-1) 的 M=R-L+1 篇文档。
动态权重：窗口内从新到旧编号第 j 篇（最新 j=1）的权重 weight = (K - j + 1) / K（最新 1，最旧 1/K）。
向量定义（关键）
- 查询向量：q_i = TF(w_i,q) * IDF(w_i)
- 文档向量（方案一按题面）：d_i = TF(w_i,doc) * IDF(w_i) * weight_n
相似度：余弦相似度
$\cos(A,B_n)=\frac{A\cdot B_n}{\|A\|\|B_n\|}$
由于 weight_n 乘在文档向量每一维，在分子与分母中会抵消，因而最终分数与不加权相同；但我们仍按题面实现把权重乘进文档分量与其范数的计算中。
筛选：仅返回窗口内相似度≥0.6 的文档中最高者；若并列，取窗口中最早（编号最小）。若无则返回 -1。

细节实现

TF(w,doc) = count(w)/len(doc)；查询同理。
IDF(w) = ln((M+1)/(df(w)+1)) + 1（仅在窗口内，带平滑）。
仅对查询出现的词计算点积与范数，提升效率。
计算时把 weight 乘到文档分量与文档范数里，不再额外乘相似度分子。

复杂度

时间：O(K·L + K·Q)（通常记作 O(K·L)）
空间：O(Q + K)

代码实现

Python

import sys, math
from collections import Counter, defaultdict
input = sys.stdin.readline

def main():
    N = int(input().strip())
    docs_raw = [input().strip() for _ in range(N)]
    K = int(input().strip())
    P = int(input().strip())
    queries = []
    for _ in range(P):
        q = input().strip().split(' ')
        queries.append((int(q[0]), q[1:]))

    # 预处理每篇文档的分词与词频
    doc_tokens = [s.split() for s in docs_raw]
    doc_lens = [len(tks) for tks in doc_tokens]
    doc_cnt = [Counter(tks) for tks in doc_tokens]

    ans = []
    eps = 1e-12

    for t, q_words in queries:
        # 窗口 [L, R]
        R = min(max(0, t), N - 1)
        L = max(0, R - K + 1)
        M = R - L + 1

        if not q_words or M <= 0:
            ans.append(-1)
            continue

        q_cnt = Counter(q_words)
        q_len = len(q_words)

        # df & idf（仅对查询词；平滑）
        df = defaultdict(int)
        q_set = set(q_cnt.keys())
        for i in range(L, R + 1):
            for w in q_set:
                if doc_cnt[i].get(w, 0) > 0:
                    df[w] += 1
        idf = {w: math.log((M + 1.0) / (df[w] + 1.0)) + 1.0 for w in q_set}

        # 查询范数
        q_norm_sq = 0.0
        for w, c in q_cnt.items():
            tfq = c / q_len
            x = tfq * idf[w]
            q_norm_sq += x * x
        q_norm = math.sqrt(q_norm_sq) if q_norm_sq > 0 else 1.0

        best_score, best_id = -1.0, -1

        # 新->旧遍历
        for idx in range(R, L - 1, -1):
            dl = doc_lens[idx] if doc_lens[idx] > 0 else 1
            # 位置权重：j=1 最新
            j = (R - idx) + 1
            weight = (K - j + 1) / K

            dot = 0.0
            d_norm_sq = 0.0
            for w, qc in q_cnt.items():
                tfd = doc_cnt[idx].get(w, 0) / dl
                tfq = qc / q_len
                dq = tfq * idf[w]
                dd = tfd * idf[w] * weight   # 方案一：把weight乘进文档维度
                dot += dq * dd
                d_norm_sq += dd * dd

            sim = 0.0
            if d_norm_sq > 0:
                d_norm = math.sqrt(d_norm_sq)
                sim = dot / (q_norm * d_norm)  # 不再额外乘weight

            if sim >= 0.6 - eps:
                if sim > best_score + eps:
                    best_score, best_id = sim, idx
                elif abs(sim - best_score) <= eps:
                    best_id = idx if (best_id == -1 or idx < best_id) else best_id

        ans.append(best_id if best_id != -1 else -1)

    print(' '.join(str(x) for x in ans))

if __name__ == "__main__":
    main()

Java

import java.io.*;
import java.util.*;

public class Main {

    static List<String> splitWords(String s) {
        s = s.trim();
        if (s.isEmpty()) return Collections.emptyList();
        return Arrays.asList(s.split("\\s+"));
    }

    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        int N = Integer.parseInt(br.readLine().trim());

        String[] raw = new String[N];
        for (int i = 0; i < N; i++) raw[i] = br.readLine().replace("\r", "");

        int K = Integer.parseInt(br.readLine().trim());
        int P = Integer.parseInt(br.readLine().trim());

        // 预处理：每篇文档词频与长度
        ArrayList<HashMap<String, Integer>> dcnt = new ArrayList<>(N);
        int[] dlen = new int[N];
        for (int i = 0; i < N; i++) {
            List<String> tks = splitWords(raw[i]);
            dlen[i] = tks.size();
            HashMap<String, Integer> mp = new HashMap<>();
            for (String w : tks) mp.put(w, mp.getOrDefault(w, 0) + 1);
            dcnt.add(mp);
        }

        StringBuilder out = new StringBuilder();
        final double EPS = 1e-12;

        for (int qi = 0; qi < P; qi++) {
            String line = br.readLine();
            if (line == null) line = "";
            line = line.replace("\r", "").trim();

            int sp = line.indexOf(' ');
            int t = Integer.parseInt(sp == -1 ? line : line.substring(0, sp).trim());
            List<String> qWords = (sp == -1) ? Collections.emptyList()
                    : splitWords(line.substring(sp + 1));

            int R = Math.min(Math.max(0, t), N - 1);
            int L = Math.max(0, R - K + 1);
            int M = R - L + 1;

            if (qWords.isEmpty() || M <= 0) {
                out.append(-1);
                if (qi + 1 < P) out.append(' ');
                continue;
            }

            // 查询词频
            HashMap<String, Integer> qcnt = new HashMap<>();
            for (String w : qWords) qcnt.put(w, qcnt.getOrDefault(w, 0) + 1);
            int qlen = qWords.size();

            // df & idf（仅查询词）
            HashMap<String, Integer> df = new HashMap<>();
            for (String w : qcnt.keySet()) df.put(w, 0);
            for (int i = L; i <= R; i++) {
                HashMap<String, Integer> dm = dcnt.get(i);
                for (String w : qcnt.keySet()) {
                    if (dm.containsKey(w)) df.put(w, df.get(w) + 1);
                }
            }
            HashMap<String, Double> idf = new HashMap<>();
            for (String w : qcnt.keySet()) {
                double v = Math.log((M + 1.0) / (df.get(w) + 1.0)) + 1.0;
                idf.put(w, v);
            }

            // 查询范数
            double qnorm2 = 0.0;
            for (Map.Entry<String, Integer> e : qcnt.entrySet()) {
                String w = e.getKey();
                double tfq = e.getValue() * 1.0 / qlen;
                double x = tfq * idf.get(w);
                qnorm2 += x * x;
            }
            double qnorm = qnorm2 > 0 ? Math.sqrt(qnorm2) : 1.0;

            double bestScore = -1.0;
            int bestId = -1;

            // 新->旧遍历
            for (int idx = R; idx >= L; idx--) {
                int dl = dlen[idx] > 0 ? dlen[idx] : 1;
                HashMap<String, Integer> dm = dcnt.get(idx);

                // 位置权重（乘进文档分量）
                int j = (R - idx) + 1;
                double weight = (K - j + 1.0) / K;

                double dot = 0.0, dnorm2 = 0.0;
                for (Map.Entry<String, Integer> e : qcnt.entrySet()) {
                    String w = e.getKey();
                    int cdoc = dm.getOrDefault(w, 0);
                    double tfd = cdoc * 1.0 / dl;
                    double tfq = e.getValue() * 1.0 / qlen;
                    double dq = tfq * idf.get(w);
                    double dd = tfd * idf.get(w) * weight; // 方案一：这里乘weight
                    dot += dq * dd;
                    dnorm2 += dd * dd;
                }

                double sim = 0.0;
                if (dnorm2 > 0) {
                    double dnorm = Math.sqrt(dnorm2);
                    sim = dot / (qnorm * dnorm); // 不再额外乘weight
                }

                if (sim >= 0.6 - EPS) {
                    if (sim > bestScore + EPS) {
                        bestScore = sim; bestId = idx;
                    } else if (Math.abs(sim - bestScore) <= EPS) {
                        if (bestId == -1 || idx < bestId) bestId = idx;
                    }
                }
            }

            out.append(bestId == -1 ? -1 : bestId);
            if (qi + 1 < P) out.append(' ');
        }

        System.out.print(out.toString());
    }
}

C++

#include <bits/stdc++.h>
using namespace std;

// 按空白分词
static vector<string> splitWords(const string& s0) {
    vector<string> res;
    string s = s0;
    int n = (int)s.size(), i = 0;
    while (i < n) {
        while (i < n && isspace((unsigned char)s[i])) i++;
        if (i >= n) break;
        int j = i;
        while (j < n && !isspace((unsigned char)s[j])) j++;
        res.emplace_back(s.substr(i, j - i));
        i = j;
    }
    return res;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    string line;
    getline(cin, line);
    int N = stoi(line);

    vector<string> raws(N);
    for (int i = 0; i < N; ++i) {
        getline(cin, raws[i]);
        if (!raws[i].empty() && raws[i].back() == '\r') raws[i].pop_back();
    }

    getline(cin, line);
    int K = stoi(line);
    getline(cin, line);
    int P = stoi(line);

    // 预处理：每篇文档的词频与长度
    vector<unordered_map<string,int>> dcnt(N);
    vector<int> dlen(N);
    for (int i = 0; i < N; ++i) {
        auto tks = splitWords(raws[i]);
        dlen[i] = (int)tks.size();
        auto &mp = dcnt[i];
        for (auto &w : tks) mp[w]++;
    }

    vector<int> answers;
    answers.reserve(P);
    const double EPS = 1e-12;

    for (int qi = 0; qi < P; ++qi) {
        getline(cin, line);
        if (!line.empty() && line.back() == '\r') line.pop_back();

        auto sp = line.find(' ');
        int t = stoi(sp == string::npos ? line : line.substr(0, sp));
        vector<string> qWords = (sp == string::npos) ? vector<string>() : splitWords(line.substr(sp + 1));

        int R = min(max(0, t), N - 1);
        int L = max(0, R - K + 1);
        int M = R - L + 1;

        if (qWords.empty() || M <= 0) {
            answers.push_back(-1);
            continue;
        }

        // 查询词频
        unordered_map<string,int> qcnt;
        for (auto &w : qWords) qcnt[w]++;
        int qlen = (int)qWords.size();

        // df & idf（仅查询词；平滑）
        unordered_map<string,int> df;
        for (auto &kv : qcnt) df[kv.first] = 0;

        for (int i = L; i <= R; ++i) {
            auto &dm = dcnt[i];
            for (auto &kv : qcnt) {
                const string &w = kv.first;
                if (dm.find(w) != dm.end()) df[w]++;
            }
        }

        unordered_map<string,double> idf;
        for (auto &kv : qcnt) {
            const string &w = kv.first;
            double v = log((M + 1.0) / (df[w] + 1.0)) + 1.0;
            idf[w] = v;
        }

        // 查询范数
        double qnorm2 = 0.0;
        for (auto &kv : qcnt) {
            const string &w = kv.first;
            double tfq = kv.second * 1.0 / qlen;
            double x = tfq * idf[w];
            qnorm2 += x * x;
        }
        double qnorm = qnorm2 > 0 ? sqrt(qnorm2) : 1.0;

        double bestScore = -1.0;
        int bestId = -1;

        // 新->旧遍历
        for (int idx = R; idx >= L; --idx) {
            int dl = dlen[idx] > 0 ? dlen[idx] : 1;
            auto &dm = dcnt[idx];

            int j = (R - idx) + 1;
            double weight = (K - j + 1.0) / K; // 方案一：乘进文档分量

            double dot = 0.0, dnorm2 = 0.0;

            for (auto &kv : qcnt) {
                const string &w = kv.first;
                int cdoc = 0;
                auto it = dm.find(w);
                if (it != dm.end()) cdoc = it->second;

                double tfd = cdoc * 1.0 / dl;
                double tfq = kv.second * 1.0 / qlen;
                double dq = tfq * idf[w];
                double dd = tfd * idf[w] * weight; // 这里乘weight
                dot += dq * dd;
                dnorm2 += dd * dd;
            }

            double sim = 0.0;
            if (dnorm2 > 0) {
                double dnorm = sqrt(dnorm2);
                sim = dot / (qnorm * dnorm); // 不再额外乘weight
            }

            if (sim >= 0.6 - EPS) {
                if (sim > bestScore + EPS) {
                    bestScore = sim; bestId = idx;
                } else if (fabs(sim - bestScore) <= EPS) {
                    if (bestId == -1 || idx < bestId) bestId = idx;
                }
            }
        }

        answers.push_back(bestId == -1 ? -1 : bestId);
    }

    for (int i = 0; i < (int)answers.size(); ++i) {
        if (i) cout << ' ';
        cout << answers[i];
    }
    return 0;
}

题目内容

传统 $TF-IDF$ 方法对突发新闻的敏感度不足，为提高热点词识别效果，某新闻热点追踪平台提出基于历史窗口的TF-IDF方法，具体地，在某个时间点 $t$ 提出一个查询 $q$ 时，系统不应该在整个历史文档库中进行大海捞针式的搜索。

相反，它需要智能地聚焦于查询发生时间点 $t$ 前的一段”历史时间窗口“内的文档，并且只需考虑这个窗口内最新的信息。

您的任务就是实现这个“历史窗口”检索引擎的核心逻辑。

历史窗口：仅计算从查询时间点 $t$ 开始之前的 $K$ 篇文档的词频，而非全部历史文档。

动态权重：为了使该搜索模型更关注短期趋势，窗口内越新(文档编号 $m$ 越大)的文档权重越高，窗口内第 $j$ 篇文档的权重为 $(K-j+1)/K$ (最新文档权重 $=1$ ，最旧文档权重 $=1/K$ )。

筛选与输出：计算查询内容 $q$ 与窗口内每一篇文档向量之间的余弦相似度 $(Cosine Similarity)$ ,返回本次查询中余弦相似度 $>=0.6$ 且余弦相似度最高的文档编号 $m$ ，若未找到满足条件的文档编号(余弦相似度 $<0.6$ )，则本次查询返回 $-1$ ；若存在多个相同最高相似度的文档，返回时间窗口中最早的一篇文档编号。

相似度计算方法：查询 $q$ 向量(向量 $A$ )的第 $i$ 维计算公式为， $q_i = TF(w_i,q)×IDF(w_i)$ ，其中 $TF(w_i,q)$ 表示词 $w_i$ 在所有查询内容 $query$ 中的词频， $IDF(w_i)$ 表示词 $w_i$ 在窗口文档集合中的中逆文档频率。窗口文档集合中第 $n$ 篇文档(向量 $B$ )的第 $i$ 维向量计算公式为: $d_i = TF(w_i,doc) × IDF(w_i) × weight_n$ ,其中 $TF(w_i,doc)$ 表示词 $w_i$ 在查询窗口文档集合中的词频， $IDF(w_i)$ 表示词 $w_i$ 在窗口文档中的中逆文档频率， $weight_n$ 表示第 $n$ 篇文档的动态权重。

提示:向量 $A.B$ 的余弦相似度:$\cos (A, B)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}$

输入描述

输入第一行表示文档集合 $corpus$ 总数 $N$

第二行开始每一行为从时间点 $0$ 开始的文档，时间点和文档编号按行递增 (注:文档编号、查询时间点均可理解为数组下标，下标从 $0$ 开始)

之后的一行为查询窗口的大小 $K$

接下来的下一行表示总查询次数 $P$

然后紧跟每个查询，格式为搜索时间点 $t$ 具体查询内容 $q$ ， $t$ 和 $q$ 中间用空格隔开

参数限制

$1.1<=K<=$ 文档总数 $N$

$2.0<$ 文档总数 $N<=100$

$3.0<$ 总查询次数 $P<=100$

在处理完基本题目的 $I/O$ 操作后，你可能需要实现的函数原型

是 $history$ _ $search$ $(corpus,K，query)$ ，参数说明如下:

1. $corpus$ ：从时间点0开始的文档集合， $corpus[i]$ 表示第i篇文档，文档编号 $m$ 和时间点 $t$ 均对应数组下标，下标从 $0$ 开始。

2. $K$ :窗口大小。

3. $query$ ：查询列表。每个查询是一个二元组(查询时间点 $t$ ，查询内容 $q$ )。

输出描述

输出空格分隔的最匹配的文档编号 $m$ ，文档编号和查询顺序一一对应，若两篇文档最高相似度相同，则返回窗口中最早的文档编号

若无匹配则对应位置返回 $-1$

样例1

输入

5
long short term memory
data science
natural language processing
python for data science
a tutorial on python
3
3
2 natural language
4 python data science
4 long short term memory

输出

2 3 -1

说明

第一行 $5$ 表示新闻文档集合 $corpus$ 的总数 $N$

然后紧跟从时间点 $0$ 开始的具体的每篇文档，时间点和文档编号按行递增:

时间点 $0$ 文档编号 $0$ ：" $long$ $short$ $term$ $memory$ "

时间点 $1$ 文档编号 $1$ ：“ $data$ $science$ ”

时间点 $2$ 文档编号 $2$ ：“ $natural$ $language processing$ "

时间点 $3$ 文档编号 $3$ ：“ $python$ $for$ $data$ $science$ "

时间点 $4$ 文档编号 $4$ ：“ $a$ $tutorial$ $on$ $python$ "

下一行的“ $3$ ”为窗口的大小 $K$

接下来的“ $3$ ”表示查询 $query$ 的个数

然后紧跟每个查询，格式为搜索时间点 $t$ 具体查询内容 $q$

第一次查询" $natural$ $language$ "在时间点 $2$ ，我们正好可以看到 $0-2$ 号文档，因为只有2号文档" $natural$ $language$ $processing$ "出现了关键词，因此第一个查询返回 $2$

第二次查询" $python$ $data$ $science$ "在时间点 $4$ ，由于窗口大小是 $3$ ，我们可以看到 $2,3,4$ 号文档我们无法看到 $0,1$ 号文档，本次查询最匹配的是文档编号 $3$

第三次查询“ $long$ $short$ $term$ $memory$ "在时间点 $4$ ，同样的我们只能看到 $2,3,4$ 号文档，无法看到 $0,1$ 号文档，且 $2,3,4$ 号文档中没有与查询相似文档，

所以本次查询返回 $-1$

提示

1. $q$ 可能存在不在窗口文档集合中的查询词，因此在计算逆文档频率的时候你需要进行平滑操作，公式如下 $I D F(x)=\log \left(\frac{N+1}{N(x)+1}\right)+1$ 其中 $N$ 代表窗口文档的总数，而 $N(x)$ 代表包含词 $x$ 的文档总数

2.新闻文档集合 $corpus$ 所包含的文档内容是空格分隔的小写英文文档字符串

#P3639. 第2题-历史的窗口搜索

第2题-历史的窗口搜索

解题思路

问题抽象

细节实现

复杂度

代码实现

Python

Java

C++

题目内容

输入描述

输出描述

样例1

提示

Status

Development

Support

About