Basically, I want to implement something on one of the websites I manage. Before anyone asks: no, I can't easily use a database due to the nature of the project, even though I'm aware a database would be the optimal solution, so moving on.
I will have a lot of data to deal with; it's on the order of many hundreds to a few thousand lines, with a dozen or two fields. Since the data is very repetitive by nature, I'm thinking of using one or more arrays as LUTs and storing all the data as numbers to save space, and also because the data will need to be filterable by most (if not all) of the fields. Now, here's my question.
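To make the LUT idea concrete, here's a rough sketch (in Python just for illustration; the field names and values are placeholders I made up, not my actual data):

```python
# Hypothetical sketch: repetitive string fields are replaced by indices
# into lookup tables (LUTs), so each stored row is just small integers.

CITY_LUT = ["Rome", "Milan", "Naples"]      # placeholder values
CATEGORY_LUT = ["news", "event", "notice"]  # placeholder values

def encode_row(city, category):
    """Return the compact numeric form of a row."""
    return [CITY_LUT.index(city), CATEGORY_LUT.index(category)]

def decode_row(row):
    """Expand a numeric row back into its readable values."""
    city_id, category_id = row
    return CITY_LUT[city_id], CATEGORY_LUT[category_id]

encoded = encode_row("Milan", "event")   # -> [1, 1]
print(decode_row(encoded))               # -> ('Milan', 'event')
```

So instead of repeating "Milan" and "event" thousands of times, each row stores 1,1 and the strings live once in the LUTs.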
Amongst the fields there will be date fields, e.g. year, month, day, (hour, minute). Which is why I'm wondering: should I keep the thousands of lines in a single file and perform a lot of IFs over them, or should I split the data by hand into smaller files? Let me explain.
Case 1:
all_data.csv
-> lots of IFs
Case 2:
2014.csv
2015.csv
-> the year filter can be applied "outside" with a FOR, while IFs for month and day will still apply
Case 3:
201501.csv
201502.csv
-> both year and month filters can be applied "outside"
Case 4:
20150101.csv
20150102.csv
-> the whole date is filtered outside (with nested FORs or DOs or something); however, this would lead to 365 files per year, most of which would have zero or few lines.
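To show what I mean by "IFs" versus "outside", here's a rough sketch of Case 1 versus Case 3 (Python just for illustration, and assuming the first two CSV columns are year and month; the file names follow the schemes above):

```python
# Hypothetical sketch: filtering by year/month under Case 1 vs Case 3.
import csv
import os

def filter_single_file(path, year, month):
    """Case 1: one big file, one fopen, lots of IFs (one per row)."""
    rows = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if int(row[0]) == year and int(row[1]) == month:
                rows.append(row)
    return rows

def filter_split_files(folder, year, month):
    """Case 3: the year/month filter lives in the file name,
    so only one small file is opened and no per-row IFs run."""
    path = os.path.join(folder, f"{year}{month:02d}.csv")
    if not os.path.exists(path):
        return []
    with open(path, newline="") as f:
        return list(csv.reader(f))
```

Same result either way; the trade-off is one fopen plus N comparisons against one (or more) fopens with no comparisons.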
So I'm wondering: which is easier on the server, lots of IFs or lots of FOPENs?
Since I'm generating the data from another program I wrote, it wouldn't be a problem to split it differently. What I can't do is feed that data into a database with an automatic procedure, so yeah. Which is the best plan B?
This post has been edited by nineko: 13 March 2015 - 08:14 PM

