Cascading - How to read a delimiter-separated file and get a specific field value

Question

I am trying to use Cascading to read a delimiter-separated file and extract a specific field from it.

Code sample:

FileTap inTap = new FileTap(new TextDelimited( true, "," ), "C://Users//user//Desktop//test//file.txt");

File contents:

name,age,email

How can I get only the name field from all the records?

Update: I am trying to do this with the Cascading API classes.

Answer 1

You should use the TextLine scheme instead of the TextDelimited scheme:

new Hfs(new cascading.scheme.hadoop.TextLine(asSourceFields), filePath, SinkMode.REPLACE);

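After a line has been read from this source tap, you have to use a cascading.operation.Function to split the line and create a tuple that contains only the 'name' field.

Here is an example: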

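import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class SplitLine extends BaseOperation implements Function {

    public SplitLine() {
        // Expects one argument (the raw line) and declares one output field, "name".
        super(1, new Fields("name"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();

        String line = arguments.getString(0);
        // The file in the question is comma-delimited, so split on "," rather than on a tab.
        String[] tokens = line.split(",");

        // Check that the split worked as assumed (name, age, email).
        if (tokens.length == 3) {
            // Emit a one-element tuple holding the value of the name field.
            Tuple output = new Tuple(tokens[0]);

            functionCall.getOutputCollector().add(output);
        }
    }
}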
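
The splitting step inside operate can be tried out without Cascading on the classpath. The following is only a minimal plain-Java sketch of that logic, not part of the Cascading API: the class name NameFieldExtractor, its two helper methods, and the sample rows are hypothetical and exist purely for illustration. It assumes comma-separated lines with exactly three fields, and it skips the first line on the assumption that it is the name,age,email header.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only to illustrate the split-and-select step.
class NameFieldExtractor {

    // Returns the first field of a comma-separated line, or null when the
    // line does not split into exactly three fields (name, age, email).
    static String extractName(String line) {
        String[] tokens = line.split(",");
        return tokens.length == 3 ? tokens[0] : null;
    }

    // Skips the first line (assumed to be the header) and collects the
    // name field of every remaining line that passes the three-field check.
    static List<String> extractNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String name = extractName(lines.get(i));
            if (name != null) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; only the header matches the question.
        List<String> lines = List.of(
                "name,age,email",
                "alice,30,alice@example.com",
                "malformed line without commas",
                "bob,25,bob@example.com");

        System.out.println(extractNames(lines)); // prints [alice, bob]
    }
}
```

The malformed row is dropped silently, which mirrors the length check in SplitLine; whether such rows should be dropped or reported instead is a design choice that depends on the data.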


java

cascading