1 Download the data:
wget -npd -r ftp://ftp.ncdc.noaa.gov/pub/data/noaa/1901/
For how to use the wget command, see:
This article is based on Hadoop: The Definitive Guide and is my own study notes.
1 Get the data
carl:weather carl$ ls | head
029070-99999-1901.gz
029500-99999-1901.gz
029600-99999-1901.gz
029720-99999-1901.gz
029810-99999-1901.gz
227070-99999-1901.gz
2 Process the data
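#!/usr/bin/env bash
# Print the maximum valid air temperature found in each .gz file under $1/<dir>/
start=$(date +%s)
for dir in "$(pwd)/$1"/*
do
  for year in "$dir"/*
  do
    # File name without the .gz suffix, followed by a tab
    printf '%s\t' "$(basename "$year" .gz)"
    gunzip -c "$year" | \
      awk '{ temp = substr($0, 88, 5) + 0;   # air temperature (degrees Celsius x 10)
             q = substr($0, 93, 1);          # quality code
             if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
           END { print max }'
  done
done
now=$(date +%s)
echo "time used:$((now-start)) seconds"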
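Explanation: the loops store the path of each file under the data directory in year, and basename strips the .gz suffix to get the name that is printed. gunzip then writes the decompressed contents of year to standard output. awk uses substr to extract the air temperature temp and the quality code q. A reading is considered trustworthy if its quality code is one of 0, 1, 4, 5, or 9. If the reading is also not 9999 (which marks a missing value), it is compared with max, and if it is larger, max is updated to temp. After the last record, awk prints max.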
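To see the filter at work without downloading anything, here is a minimal sketch that feeds four synthetic records to the same awk program. The helper names make_record and max_temp are hypothetical, and the records contain only zero padding plus the temperature and quality fields; they are not real NCDC data.

```shell
#!/usr/bin/env bash
# make_record TEMP QUALITY: build a fake record with 87 padding characters,
# so the 5-character temperature lands at columns 88-92 and the quality code at 93.
make_record() {
  printf '%087d%s%s\n' 0 "$1" "$2"
}

# max_temp: read records on stdin and print the largest valid temperature,
# using the same awk program as maxTemp.sh.
max_temp() {
  awk '{ temp = substr($0, 88, 5) + 0;
         q = substr($0, 93, 1);
         if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
       END { print max }'
}

{
  make_record +0111 1   # valid reading: 11.1 degrees Celsius
  make_record +9999 1   # missing value, skipped
  make_record +0300 2   # quality code 2 is not in [01459], skipped
  make_record +0078 1   # valid reading, but smaller than 111
} | max_temp            # prints 111
```

Note that max starts out uninitialized, which awk treats as 0 in a numeric comparison, so a file whose valid readings are all negative would print an empty line.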
./maxTemp.sh noaa
The output is as follows:
...
129300-99999-1937 311
129410-99999-1937 300
129420-99999-1937 300
129820-99999-1937 350
129920-99999-1937 278
time used:22 seconds
Total number of temperature data files at this point:
@localhost noaa]$ count=0; for name in $(pwd)/* ; do for gz in $name/* ; do let count++ ; done ; done
@localhost noaa]$ echo $count
4721
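The same count can also be taken with find instead of a nested loop. This is only a sketch: the function name count_data_files is hypothetical, and unlike the loop above it counts regular files only, exactly two levels below the starting directory.

```shell
#!/usr/bin/env bash
# count_data_files [DIR]: count regular files located exactly two levels
# below DIR (DIR/<subdir>/<file>); DIR defaults to the current directory.
count_data_files() {
  find "${1:-.}" -mindepth 2 -maxdepth 2 -type f | wc -l | tr -d ' '
}
```

Run from inside the noaa directory, count_data_files should report the same number as the loop, provided every year directory contains only data files.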
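We could also parallelize the program: split the data into chunks, have a separate process handle each chunk so that several CPUs compute at once, and combine the results at the end. Even so, performance would still be limited by factors such as the I/O of a single machine. To process large volumes of data we scale out across multiple computers, and because communication and coordination among cooperating machines is complex, we use Hadoop, a distributed computing platform, to process the data.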
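The single-machine parallel version described above could look roughly like the sketch below. It assumes each subdirectory of the data directory is one chunk, and the function names max_of_file and parallel_max are hypothetical. It starts one background job per chunk without limiting the number of jobs, which a real script would probably want to cap.

```shell
#!/usr/bin/env bash
# max_of_file FILE: print the largest valid temperature in one gzipped file,
# using the same awk program as maxTemp.sh.
max_of_file() {
  gunzip -c "$1" |
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
}

# parallel_max DIR: start one background job per subdirectory of DIR,
# wait for all of them, then reduce the per-file maxima to a single value.
parallel_max() {
  local dir=$1 out i=0 chunk f
  out=$(mktemp -d)
  for chunk in "$dir"/*/; do
    (
      for f in "$chunk"*.gz; do
        [ -e "$f" ] && max_of_file "$f"
      done > "$out/part-$i"
    ) &
    i=$((i + 1))
  done
  wait   # block until every chunk has been processed
  # Combine step: skip empty lines (files with no valid reading), keep the largest value.
  cat "$out"/part-* |
    awk 'NF && (!seen || $1 > max) { max = $1; seen = 1 } END { print max }'
  rm -r "$out"
}
```

Calling parallel_max noaa would print one overall maximum rather than one line per file. The chunks still share the same disk, which is the I/O limit mentioned above.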
Shell reference:
Note 1: Climate data format
Example 2-1. Format of a National Climate Data Center record
0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
320      # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code
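In the data files these fields are packed into a single line with no delimiters, which is why the script reads them by character position. As a rough check, the sketch below joins the sample fields in order and extracts a few of them. The offsets 88 and 93 come from the script in section 2; the offsets for the station identifier and the date are derived here by adding up the field widths in the listing, and the function name parse_record is hypothetical.

```shell
#!/usr/bin/env bash
# Rebuild the sample record by concatenating the fields from Example 2-1 in order.
record=$(printf '%s' 0057 332130 99999 19500101 0300 4 +51317 +028783 FM-12 \
  +0171 99999 V020 320 1 N 0072 1 00450 1 C N 010000 1 N 9 -0128 1 -0139 1 10268 1)

# parse_record: read one record on stdin and print station, date, temperature, quality.
parse_record() {
  awk '{ usaf = substr($0, 5, 6);        # USAF weather station identifier
         date = substr($0, 16, 8);       # observation date
         temp = substr($0, 88, 5) + 0;   # air temperature (degrees Celsius x 10)
         q    = substr($0, 93, 1);       # quality code of the temperature
         print usaf, date, temp, q }'
}

echo "$record" | parse_record   # prints: 332130 19500101 -128 1
```

Adding 0 turns the string -0128 into the number -128, that is, -12.8 degrees Celsius.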
Note 2: awk tutorial
Substring function
substr(string, start, length)
This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one.
For example, substr("washington", 5, 3) returns "ing". If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". This is also the case if length is greater than the number of characters remaining in the string, counting from character number start.
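The three cases described above can be tried directly from the shell. The wrapper name awk_substr is hypothetical; it only passes its arguments through to awk's substr.

```shell
#!/usr/bin/env bash
# awk_substr STRING START [LENGTH]: call awk's substr() and print the result.
awk_substr() {
  if [ $# -ge 3 ]; then
    awk -v s="$1" -v a="$2" -v n="$3" 'BEGIN { print substr(s, a, n) }'
  else
    awk -v s="$1" -v a="$2" 'BEGIN { print substr(s, a) }'
  fi
}

awk_substr washington 5 3     # prints "ing"
awk_substr washington 5       # prints "ington"
awk_substr washington 5 100   # length runs past the end, prints "ington"
```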