Questions about PHP

Discussion in 'Technical Discussion' started by nineko, Mar 7, 2015.

  1. nineko

    I am the Holy Cat · Tech Member · Italy
    Basically, I want to implement something on one of the websites I manage. Before anyone asks, no, I can't easily use a database due to the nature of the project, even though I'm aware that a database would be the optimal solution, so moving on.

    I will have a lot of data to deal with: on the order of several hundred to a few thousand lines, with a dozen or two fields. Since the data is going to be very repetitive by nature, I'm thinking of using one or more arrays as LUTs and storing all the data as numbers, both to save space and because the data will need to be filterable by most (if not all) of the fields. Now, here's my question.

    Amongst the fields there will be date fields, e.g. year, month, day, (hour, minute). Which is why I'm wondering: should I keep the thousands of lines in a single file and perform a lot of IFs over them, or should I split the data by hand into smaller files? Let me explain.

    Case 1:
    all_data.csv
    -> lots of IFs

    Case 2:
    2014.csv
    2015.csv
    -> the year filter can be applied "outside" with a FOR, while IFs for month and day will still apply

    Case 3:
    201501.csv
    201502.csv
    -> both year and month filters can be applied "outside"

    Case 4:
    20150101.csv
    20150102.csv
    -> the entire date is filtered outside (with nested FORs or DOs or something); however, this would lead to 365 files per year, most of which would contain zero or only a few lines.

    So I'm wondering: which is easier on the server, lots of IFs or lots of FOPENs?

    Since I'm generating the data with another program I wrote, it wouldn't be a problem to split it differently. What I can't do is feed that data into a database with an automated procedure, so yeah: which is the best plan B?
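
    To make the two extremes concrete, here is a minimal sketch, assuming a made-up layout where the first three CSV fields are year, month, day (the field positions and file names are illustrative, not from the actual project):

        <?php
        // Case 1: one big file, filtering every row with IFs.
        // Assumed layout: year,month,day,...rest of the fields.
        function filter_big_file($path, $year, $month)
        {
            $rows = [];
            $fh = fopen($path, 'r');
            while (($row = fgetcsv($fh)) !== false) {
                if ((int)$row[0] === $year && (int)$row[1] === $month) {
                    $rows[] = $row;
                }
            }
            fclose($fh);
            return $rows;
        }

        // Case 3: one file per year+month, so the date filter becomes
        // picking the right file name, with no per-row IFs for the date.
        function filter_monthly_file($year, $month)
        {
            $path = sprintf('%04d%02d.csv', $year, $month);
            if (!is_file($path)) {
                return []; // no data for that month
            }
            $rows = [];
            $fh = fopen($path, 'r');
            while (($row = fgetcsv($fh)) !== false) {
                $rows[] = $row;
            }
            fclose($fh);
            return $rows;
        }

    At a couple of thousand rows either approach is cheap; the per-month files mainly trade per-row comparisons for extra fopen calls when a query spans several months.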
     
  2. Glitch

    Tech Member
    How you store the data will depend on what you want to do with it once it's on disk.

    Both fread and fopen require syscalls, so they'll have context-switching overhead. If you're going for read speed you'll want to avoid that like the plague, so big files with sizeable read buffers would be ideal. From what you've said it looks like you'll be querying based on date ranges, so I'd suggest:

    Use big files with a fixed size limit. Keep appending rows until you hit that limit then start a new file. Maintain a separate index file containing your date values with pointers to your data in the fixed size files.

    The main problems with lots of small files are: (a) most filesystems don't cope very well with directories containing many small files (you'd need something like ReiserFS), and (b) if you're on a shared VPS, chances are you've got an inode limit.

    So, yes, I'm suggesting you build your own basic database.
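
    Roughly along these lines, say (the file names, size limit and index layout are all illustrative, not a prescription):

        <?php
        // Append one CSV row to the current fixed-size data file,
        // starting a new file once the size limit is hit, and record
        // where the row went in a small date-keyed index.
        // All names and limits here are hypothetical.
        const MAX_FILE_BYTES = 1048576; // 1 MiB per data file (assumption)

        function append_row($date, array $fields)
        {
            // Find the newest data file, or start with data_0000.dat.
            $files = glob('data_*.dat') ?: [];
            sort($files);
            $current = $files ? end($files) : 'data_0000.dat';
            clearstatcache();
            if (is_file($current) && filesize($current) >= MAX_FILE_BYTES) {
                $n = (int)substr($current, 5, 4) + 1;
                $current = sprintf('data_%04d.dat', $n);
            }

            // The row starts at the current end of the file.
            $offset = is_file($current) ? filesize($current) : 0;
            $fh = fopen($current, 'a');
            fputcsv($fh, $fields);
            fclose($fh);

            // Index line: date, data file, byte offset of the row.
            file_put_contents('index.csv', "$date,$current,$offset\n", FILE_APPEND);
        }

        // A query scans only the small index, then seeks straight to
        // the matching rows in the big files.
        function lookup($date)
        {
            $rows = [];
            foreach (file('index.csv', FILE_IGNORE_NEW_LINES) as $line) {
                list($d, $file, $offset) = explode(',', $line);
                if ($d === $date) {
                    $fh = fopen($file, 'r');
                    fseek($fh, (int)$offset);
                    $rows[] = fgetcsv($fh);
                    fclose($fh);
                }
            }
            return $rows;
        }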
     
  3. nineko

    I am the Holy Cat · Tech Member · Italy
    Thanks. I assumed as well that having many small files was a bad idea, but since I hate using IFs I wanted a second opinion from someone who knows more than me. I was already leaning towards a hybrid approach, and you've confirmed it. For now I'll just start with one big file to see how it goes; I might split by year eventually. I might also store redundant data for year / month / day combinations and check against those if both filters are enabled, e.g. filter for "201501" at once instead of filtering for "2015" and "01" separately; I'm quite sure a few wasted bytes per line are well worth the removal of one IF. I could also do that at run time, I guess, by comparing "201501" to (year * 100 + month), or something.
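
    i.e. something along these lines, assuming year and month are the first two fields of a row (the layout is just for illustration):

        <?php
        // One IF instead of two: combine year and month into a single
        // comparable integer. Field positions are assumed, not actual.
        $wanted = 201501; // January 2015
        if ((int)$row[0] * 100 + (int)$row[1] === $wanted) {
            // row matches the combined year+month filter
        }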
     
  4. nineko

    I am the Holy Cat · Tech Member · Italy
    Sorry to post again, but I have another question. Is there a way, in PHP, to detect whether a page is being loaded into a frame and return its name? I'm considering allowing other people to embed a page from my website inside an iframe, and I would like the presentation to change a little in those cases. I looked on Google with no success; it seems it's not one of the $_SERVER variables (even 'REQUEST_URI' returns the URI of the framed page, not that of the container).
     
  5. GerbilSoft

    RickRotate'd. · Administrator · USA
    That's entirely client-side. The only way to handle it would be to include client-side JavaScript that issues a different page request depending on whether or not the page is in an iframe.
     
  6. nineko

    I am the Holy Cat · Tech Member · Italy
    Thank you. My venture on Google indeed gave me the impression that I'd need to use JavaScript, but I hoped there could be a workaround. Too bad there isn't; still, I won't pollute my project with something I consider on par with a sin (JavaScript). I added a GET variable for the purpose, and the average user won't even know how to mess with it.
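
    The GET-variable trick is as small as it sounds; a sketch, with the parameter name embed as a placeholder for whatever the variable is actually called:

        <?php
        // Hypothetical: embedders load page.php?embed=1 and the page
        // switches to a slimmer presentation when the flag is set.
        $embedded = isset($_GET['embed']) && $_GET['embed'] === '1';

        if ($embedded) {
            // render the stripped-down layout meant for iframes
        } else {
            // render the full page
        }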
     
  7. Vangar

    Member
    I'm interested in why you can't use a database for what seems like database data.
     
  8. nineko

    I am the Holy Cat · Tech Member · Italy
    It's data I gather with another program I wrote, and that program runs on my own computer. It reads the data from two other websites and cleans it up a lot. Once the data I extracted from those two websites is cleaned and sorted, I have to save it somewhere. Since I'm doing it with a program on my own computer, the most practical solution is to save it as a file, and CSV is quite handy. Storing it in a database would require me to either save it as a file made up of SQL instructions and somehow process it (from a MySQL console or from a custom PHP file), or I can still output a CSV as I'm doing now and call a PHP page which takes that CSV just once and puts it into a database. I have to admit those are two valid options, but they would add more steps to my procedure, which I finalised in the past days and which works quite well.

    I know it would be *MUCH* better if I somehow managed to gather the data directly from a PHP script or something, so I wouldn't need to pass through my computer, but that's definitely too hard for me. It was hard enough to do it in a language and an environment I'm familiar with (VBA code in Microsoft Excel, which I also use to sort / process the data, which spans 58 sheets, by the way); no way I'm doing all that importing / cleaning / sorting in PHP, and I don't know if that would even be possible.

    I predict a maximum size of ~2100 rows for the CSV file, and so far (~1900 rows) the server is doing well with it.
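
    The second of those two options (a PHP page that ingests the CSV just once) could stay very small; a sketch, with the credentials, table and column names invented for illustration:

        <?php
        // One-off import page: read the CSV and bulk-insert it into MySQL.
        // Credentials, table and column names below are placeholders.
        $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass',
            [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

        $stmt = $pdo->prepare(
            'INSERT INTO entries (year, month, day, field1, field2)
             VALUES (?, ?, ?, ?, ?)');

        $fh = fopen('all_data.csv', 'r');
        $pdo->beginTransaction(); // one transaction makes the bulk load fast
        while (($row = fgetcsv($fh)) !== false) {
            $stmt->execute($row);
        }
        $pdo->commit();
        fclose($fh);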
     
  9. Billy

    RIP Oderus Urungus · Member · Colorado, USA
    Have you looked into SQLite or anything like that? It's not client-server, so it's great for embedding into desktop apps (the db is saved as a file), and there are bindings for PHP.
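
    A minimal sketch of what that looks like through PHP's PDO driver (the file name and schema are invented):

        <?php
        // SQLite via PDO: the whole database lives in one ordinary file,
        // so it can be generated on a desktop machine and simply uploaded.
        // File name and schema below are placeholders.
        $db = new PDO('sqlite:data.db');
        $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

        $db->exec('CREATE TABLE IF NOT EXISTS entries
                   (year INT, month INT, day INT, value TEXT)');

        // Filtering by date becomes a query instead of IFs or FOPENs.
        $stmt = $db->prepare('SELECT * FROM entries WHERE year = ? AND month = ?');
        $stmt->execute([2015, 1]);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);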
     
  10. nineko

    I am the Holy Cat · Tech Member · Italy
    That's another option I admit I overlooked.
     
  11. Skeledroid

    Member
    I would just convert the CSV to JSON so you can use json_encode/decode in PHP to mess around with all the data as arrays in RAM.
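
    That is, something along these lines (file names and field positions assumed for illustration):

        <?php
        // One-time conversion: parse the CSV into a nested array and
        // save it as JSON. File names are placeholders.
        $rows = array_map('str_getcsv', file('all_data.csv', FILE_IGNORE_NEW_LINES));
        file_put_contents('all_data.json', json_encode($rows));

        // At request time the whole dataset decodes into arrays in RAM,
        // and filtering becomes a plain array_filter.
        $data = json_decode(file_get_contents('all_data.json'), true);
        $jan2015 = array_filter($data, function ($row) {
            return (int)$row[0] === 2015 && (int)$row[1] === 1; // assumed layout
        });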